Statistical Methods for Ambiguity Resolution

Statistical methods have been developed (e.g., Charniak, 1993) that help select the part of speech, word sense, and grammatical structure for syntactically ambiguous sentences.  Here, we explore statistical methods for part of speech tagging and the use of probabilistic grammars in parsing.

Part of speech tagging

Statistical methods for reducing ambiguity in sentences rely on large corpora of word usage (e.g., the Brown Corpus or the COBUILD Corpus).  The words and sentences in these corpora, drawn from a broad sample of written language, are pre-tagged with their parts of speech, grammatical structure, and frequencies of usage.  Part of speech tagging is the process of selecting the most likely part of speech from among the alternatives for each word in a sentence.  In practice, only about 10% of the distinct words in a million-word corpus have two or more parts of speech, as illustrated by the following summary for the complete Brown Corpus (DeRose 1988):
 
Number of tags    Number of distinct words    Percent of total
1                 35,340                      89.6%
2                  3,760                       9.5%
3                    264                       0.6%
4                     61                       0.1%
5                     12                       0.03%
6                      2                       0.005%
7                      1                       0.0025%
Total 2-7          4,100                      10.4%

The one word in the Brown Corpus that has 7 different part of speech tags is "still."  Presumably, those tags include at least n, v, adj, adv, and conj.  In practice, a small number of different sets of tags (or "tagsets") are in use today.  For example, the tagset used by the Brown Corpus has 87 different tags.  In the following discussion, we shall use a very small 4-tag set, {n, v, det, prep}, in order to keep the ideas behind the statistical approach to part of speech tagging clear.

Suppose the word W = flies can carry either of the two part of speech tags n or v.  The choice of tag (T) can be guided by the probability of each alternative, estimated from the frequency with which the word occurs with each tag in a large corpus of representative text.

Compare prob(T = v | W = flies), read "the probability that the tag is a verb (v), given that the word is flies," with prob(T = n | W = flies).

In statistics, the following holds for any two events X and Y with prob(Y) > 0:

prob(X | Y) = prob(X & Y) / prob(Y).

That is, the probability of event X occurring, after knowing that Y has occurred, is the quotient of the probability of events X and Y both occurring and the probability of Y occurring alone.  Thus,

prob(T = n | W = flies) = prob(T = n & W = flies) / prob(W = flies)

How are these probabilities estimated?  The collected experience of using these words in a large corpus of known (and pre-tagged) text is used to estimate these probabilities.

Suppose, for example, that we have a corpus of 1,273,000 words (approximately the size of the Brown Corpus) which has 1000 occurrences of the word flies, 400 pre-tagged as nouns (n) and 600 pre-tagged as verbs (v).  Then prob(W = flies), the probability that a randomly-selected word in a new text is flies, is 1000/1,273,000 ≈ 0.0008.  Similarly, the probability that the word is flies and it is a noun (n) or a verb (v) can be estimated by:
 

prob(T = n & W = flies) = 400/1,273,000 ≈ 0.0003
prob(T = v & W = flies) = 600/1,273,000 ≈ 0.0005

So prob(T = v | W = flies) = prob(T = v & W = flies) / prob(W = flies) = 600/1000 = 0.6 (the rounded figures above give 0.0005/0.0008 ≈ 0.625).  In other words, the prediction that an arbitrary occurrence of the word flies is a verb will be correct about 60% of the time.
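
These relative-frequency estimates are easy to compute mechanically.  The short Python sketch below, with the counts of this example hard-coded, is one illustrative way to do so; it is not part of any tagger discussed here.

    # Estimate prob(T | W = flies) from corpus counts:
    # 1,000 occurrences of "flies" in a 1,273,000-word corpus,
    # 400 pre-tagged as nouns and 600 pre-tagged as verbs.
    CORPUS_SIZE = 1273000
    flies_counts = {"n": 400, "v": 600}

    prob_flies = sum(flies_counts.values()) / CORPUS_SIZE     # prob(W = flies)
    for tag, count in flies_counts.items():
        joint = count / CORPUS_SIZE                           # prob(T = tag & W = flies)
        print(tag, joint / prob_flies)                        # prob(T = tag | W = flies)
    # Prints n 0.4 and v 0.6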

The process of part of speech tagging for an entire sentence relies not only on the relative frequencies of the sentence's words in a pre-tagged corpus, but also on the part(s) of speech of all the other words in the sentence.  Let T1...Tn be a series of parts of speech for an n-word sentence with words W1...Wn.  To assign the most likely parts of speech for these words using statistical techniques, we want to find values for T1...Tn that maximize the following:
 

prob(T1...Tn | W1...Wn) =
   prob(T1...Tn) * prob(W1...Wn | T1...Tn) / prob(W1...Wn)                (1)
 

For example, for the sentence "Fruit flies like a banana," we want to find part of speech tags T1...T5 for this sequence of n = 5 words that maximize:

        prob(T1...T5 | fruit flies like a banana) =
            prob(T1...T5) * prob(fruit flies like a banana | T1...T5) / prob(fruit flies like a banana)

where each part of speech tag Ti comes from the set {n, v, prep, det}.

Since this calculation would be far too complex in practice (estimating prob(W1...Wn | T1...Tn) directly would require far more data than any corpus provides), researchers use a (reasonably good) estimate for prob(T1...Tn | W1...Wn) based on the following approximations:

prob(T1...Tn) ≈ Πi prob(Ti | Ti-1)                       for i = 1,...,n
prob(W1...Wn | T1...Tn) ≈ Πi prob(Wi | Ti)               for i = 1,...,n


(Read "≈ Πi" as "is approximated by the product over i of.")  This particular method of estimation is called a bigram model, since each part of speech tag Ti depends only on the previous word's part of speech choice, Ti-1.  In practice, both bigram and trigram models are used.  Now, the problem is simplified to finding a series of choices T1...Tn that maximizes the following:
 
prob(T1...Tn | W1...Wn) ≈ Πi prob(Ti | Ti-1) * Πi prob(Wi | Ti)          for i = 1,...,n        (2)
 

which is a much less complex calculation than equation (1) above.  Notice also that the denominator in (1) has been dropped, since its value is constant over all candidate sequences of tags T1...Tn and therefore does not affect which sequence gives the maximum.

For our example, we want to maximize the following:

         prob(T1...T5 | fruit flies like a banana) ≈ Πi prob(Ti | Ti-1) * Πi prob(Wi | Ti)     for i = 1,...,5
            = prob(T1 | Ø) * prob(T2 | T1) * prob(T3 | T2) * prob(T4 | T3) * prob(T5 | T4) *
                  prob(fruit | T1) * prob(flies | T2) * prob(like | T3) * prob(a | T4) * prob(banana | T5)
Here, the symbol Ø denotes the beginning of the sentence.  These individual probabilities are readily computed from the frequencies found for the pre-tagged sentences in the corpus.  That is, the probabilities prob(Ti | Ti-1) for each adjacent pair of part of speech choices from {n, v, det, prep} can be estimated by counting frequencies in the corpus and writing the result in the form of a "Markov chain," such as the one shown below (adapted from Allen, p 200):

[Figure: Markov chain of tag-transition probabilities for the tagset {n, v, det, prep}, adapted from Allen, p 200]
This expresses, for instance, that the probability of having a verb (v) immediately following a noun (n) in a sentence is prob(v | n) = 0.43.  In other words, 43% of the occurrences of a noun in the corpus are immediately followed by a verb.

With this information, we can estimate prob(n n v det n), for instance, as:
 

prob(n n v det n) =  prob(n | Ø) * prob(n | n) * prob(v | n) * prob(det | v) * prob(n | det)
     =  .29 * .13 * .43 * .6 * 1
     = 0.00973
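
The same chain-rule product can be scripted directly.  The Python sketch below is illustrative only; its transition table contains just the entries quoted in this section (the full chain in Allen has additional entries), and the name "START" stands for the beginning-of-sentence symbol Ø.

    # Prior probability of a tag sequence under the bigram (Markov chain) model.
    # Only the transition probabilities quoted in the text are entered here.
    trans = {("START", "n"): 0.29, ("n", "n"): 0.13, ("n", "v"): 0.43,
             ("v", "det"): 0.60, ("det", "n"): 1.00}

    def sequence_prior(tags):
        prob, prev = 1.0, "START"
        for tag in tags:
            prob *= trans[(prev, tag)]     # prob(tag | prev)
            prev = tag
        return prob

    print(sequence_prior(["n", "n", "v", "det", "n"]))    # about 0.00973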
 

The corpus also provides estimates for the individual probabilities of each word's possible parts of speech.  Here is some sample data derived from a small corpus (Allen, 1995) for a few words of interest to us:

prob(a | det) = .036         prob(a | n) = .001          prob(man | n) = .035
prob(banana | n) = .076      prob(like | v) = .1         prob(man | v) = .007
prob(flies | n) = .025       prob(like | prep) = .065    prob(hill | n) = .083
prob(flies | v) = .07        prob(like | n) = .013       prob(telescope | n) = .075
prob(fruit | n) = .061       prob(the | det) = .045      prob(on | prep) = .09
prob(fruit | v) = .005       prob(woman | n) = .076      prob(with | prep) = .09

Now we can use equation (2) to calculate the likelihood of the sentence "fruit flies like a banana" being tagged with the parts of speech n n v det n as follows:

prob(n n v det n | fruit flies like a banana)
     = prob(n n v det n) * prob(fruit | n) * prob(flies | n) * prob(like | v) * prob(a | det) *
         prob(banana | n)
     =  0.00973 * 0.061 * .025 * .1 * .036 * .076
     =  0.00973 * 4.17*10⁻⁷
     =  4.06*10⁻⁹
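
Continuing the sketch, equation (2) simply multiplies this prior by the lexical probabilities prob(Wi | Ti) taken from the table above; the values below are hard-coded for this one tagging, purely for illustration.

    # Equation (2) for the tagging n n v det n of "fruit flies like a banana":
    # multiply the tag-sequence prior by prob(word | tag) for each word.
    lex = {("fruit", "n"): 0.061, ("flies", "n"): 0.025, ("like", "v"): 0.1,
           ("a", "det"): 0.036, ("banana", "n"): 0.076}

    prior = 0.00973                       # prob(n n v det n), computed above
    words = ["fruit", "flies", "like", "a", "banana"]
    tags  = ["n", "n", "v", "det", "n"]

    score = prior
    for word, tag in zip(words, tags):
        score *= lex[(word, tag)]
    print(score)                          # about 4.06e-09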

To find the most likely sequence of tags for a sentence, we need to maximize this calculation over all possible tag sequences for the sentence.  This could be a complex computation if a "brute force" approach is used: for an n-word sentence and p parts of speech, there are pⁿ different taggings.  Even for our small sentence of n = 5 words and p = 4 parts of speech {n, v, det, prep}, the number of calculations would be 4⁵ = 1,024.  However, for our particular sentence, many taggings have probability 0 because each word admits only a few of the four tags (as shown below), so there are at most 2 * 2 * 3 * 2 * 1 = 24 candidate taggings with nonzero lexical probability:

fruit:   n, v
flies:   n, v
like:    n, v, prep
a:       det, n
banana:  n


Moreover, this number is reduced further by the fact that some of the transitions in the Markov chain are zero, such as prob(v | v) = 0.  In practice, only the following 8 taggings of our sentence have nonzero probability:

For the words fruit flies like a banana (in order), these taggings and their probabilities are:

Tagging                  Probability
[n, n, n, n, n]          1.24795e-13
[n, n, v, n, n]          7.32753e-12
[n, n, v, det, n]        4.05833e-09
[n, n, prep, n, n]       4.22384e-12
[n, n, prep, det, n]     3.32909e-09
[n, v, n, n, n]          2.66722e-12
[n, v, prep, n, n]       8.89074e-12
[n, v, prep, det, n]     7.00738e-09

The probabilities listed here were computed by the Prolog program grammar11.pl, whose implicit nondeterminism assists in the tedious search through the alternative taggings and the accompanying calculations outlined above.  Among these, the best alternatives suggest that fruit flies is either an n n sequence or an n v sequence, leading to the following "most probable" taggings for the entire sentence:

prob(n n v det n | fruit flies like a banana)
prob(n v prep det n | fruit flies like a banana)
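
The search that grammar11.pl carries out through Prolog backtracking can be sketched in Python as a brute-force loop over the 24 candidate taggings.  The tables below again contain only the probabilities quoted in this section; missing entries default to 0, so with this partial data the loop finds [n, n, v, det, n] as the winner, whereas the full tables used by grammar11.pl yield [n, v, prep, det, n], as reported above.

    from itertools import product

    # Candidate tags per word, from the table above.
    candidates = {"fruit": ["n", "v"], "flies": ["n", "v"],
                  "like": ["n", "v", "prep"], "a": ["det", "n"], "banana": ["n"]}

    # Only the probabilities quoted in the text; unquoted entries default to 0.
    trans = {("START", "n"): 0.29, ("n", "n"): 0.13, ("n", "v"): 0.43,
             ("v", "det"): 0.60, ("det", "n"): 1.00}
    lex = {("fruit", "n"): 0.061, ("fruit", "v"): 0.005,
           ("flies", "n"): 0.025, ("flies", "v"): 0.07,
           ("like", "n"): 0.013, ("like", "v"): 0.1, ("like", "prep"): 0.065,
           ("a", "det"): 0.036, ("a", "n"): 0.001, ("banana", "n"): 0.076}

    words = ["fruit", "flies", "like", "a", "banana"]
    best_tags, best_score = None, 0.0
    for tags in product(*(candidates[w] for w in words)):    # 24 combinations
        score, prev = 1.0, "START"
        for word, tag in zip(words, tags):
            score *= trans.get((prev, tag), 0) * lex.get((word, tag), 0)
            prev = tag
        if score > best_score:
            best_tags, best_score = tags, score
    print(best_tags, best_score)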

Furthermore, there is an algorithm called the Viterbi algorithm that avoids evaluating most of these alternatives altogether.  That is, the tagging with the maximum probability can be developed directly in a single left-to-right pass over the words of the sentence, without carrying out the above calculation for all 24 combinations of part of speech tags.  (For the details of the Viterbi algorithm, see Allen, p 203.)
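
For reference, here is a compact, illustrative Viterbi sketch in the same style (the function name and table format are ours, not Allen's); it keeps, for each tag, only the best-scoring path ending in that tag, so the maximum-probability tagging emerges from one left-to-right pass.

    def viterbi(words, tagset, trans, lex):
        # best[tag] = (probability, tag sequence) of the best path ending in tag
        best = {t: (trans.get(("START", t), 0) * lex.get((words[0], t), 0), [t])
                for t in tagset}
        for word in words[1:]:
            new_best = {}
            for t in tagset:
                # choose the predecessor tag that maximizes the path probability
                p, path = max(((best[prev][0] * trans.get((prev, t), 0), best[prev][1])
                               for prev in tagset), key=lambda x: x[0])
                new_best[t] = (p * lex.get((word, t), 0), path + [t])
            best = new_best
        return max(best.values(), key=lambda x: x[0])

Called with the same (partial) tables as the brute-force sketch above, e.g. viterbi(words, ["n", "v", "det", "prep"], trans, lex), it returns the same winning tagging and probability, but its work grows only linearly with sentence length.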

The general results of this tagging method are quite positive (Allen p 203): "Researchers consistently report part of speech sentence tagging with 95% or better accuracy using trigram models."

Probabilistic grammars

Markov models can be extended to grammar rules to help govern choices among alternatives when a sentence is syntactically ambiguous (that is, when it has two or more different parse trees).  A grammar extended in this way is called a probabilistic grammar.

A probabilistic grammar has a probability associated with each rule, based on the rule's frequency of use in the parsed version of a corpus.  For example, the following probabilistic grammar has rule probabilities derived from a small corpus (Fig 7.17 in Allen):
 
Grammar rule           prob
1.  s  --> np vp       1
2.  vp --> v np pp     0.386
3.      --> v np       0.393
4.      --> v pp       0.22
5.  np --> np pp       0.24
6.      --> n n        0.09
7.      --> n          0.14
8.      --> det n      0.55
9.  pp --> p np        1

For a given phrase, such as a banana, and a grammatical category, such as np, we look at all the possible subtrees that can be generated for that phrase, assigning appropriate parts of speech to the words.  For example,

prob(a banana | np)
     =  prob(rule 8 | np) * prob(a | det) * prob(banana | n)
        +  prob(rule 6 | np) * prob(a | n) * prob(banana | n)
     = .55 * .036 * .076 + .09 * .001 * .076
     = 1.51*10⁻³

(Note that only rules 6 and 8 come into play here, since they are the only ones that generate two-word noun phrases.)  Here, we are using the same conditional probabilities for individual words, like prob(banana | n), that we used in the previous section for part of speech tagging.
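
The two-rule computation above can be checked with another small illustrative sketch, with the rule probabilities and lexical probabilities hard-coded from the tables in this section.

    # prob(a banana | np): sum, over the np rules that can span exactly two
    # words, of the rule probability times prob(word | tag) for each word.
    rule8 = 0.55                                   # np --> det n
    rule6 = 0.09                                   # np --> n n
    lex = {("a", "det"): 0.036, ("a", "n"): 0.001, ("banana", "n"): 0.076}

    p_np = (rule8 * lex[("a", "det")] * lex[("banana", "n")]
            + rule6 * lex[("a", "n")] * lex[("banana", "n")])
    print(p_np)                                    # about 1.51e-03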

For an entire sentence, we generate all of its possible parse trees, and then compute the probability that each of those trees will occur, using the above strategy.  The preferred parse tree is the one whose probability is the maximum.  Consider, for example, the two parse trees for the sentence fruit flies like a banana that are generated by the above grammar, written here in bracketed form and annotated with the numbers of the grammar rules used at each level:

    I.   [s [np [n fruit] [n flies]] [vp [v like] [np [det a] [n banana]]]]              (rules 1, 6, 3, 8)
    II.  [s [np [n fruit]] [vp [v flies] [pp [p like] [np [det a] [n banana]]]]]         (rules 1, 7, 4, 9, 8)

Thus, to determine which is the more probable parse for this sentence, we compute the following for each tree:

prob(fruit flies like a banana | s)

and choose the maximum value from these two computations.  Here is a summary of the necessary calculations:

I.  prob(fruit flies like a banana | s)
        = prob(rule 1 | s) * prob(fruit flies | np) * prob(like a banana | vp)
        = 1 * prob(rule 6 | np) * prob(fruit | n) * prob(flies | n) *
            prob(rule 3 | vp) * prob(like | v) * prob(a banana | np)
        = .09 * .061 * .025 * .393 * .1 * 1.51*10⁻³
        = 8.14*10⁻⁹

II. prob(fruit flies like a banana | s)
        = prob(rule 1 | s) * prob(fruit | np) * prob(flies like a banana | vp)
        = 1 * prob(rule 7 | np) * prob(fruit | n) *
            prob(rule 4 | vp) * prob(flies | v) * prob(like a banana | pp)
        = prob(rule 7 | np) * prob(fruit | n) *
            prob(rule 4 | vp) * prob(flies | v) * prob(rule 9 | pp) * prob(like | prep) *
            prob(a banana | np)
        = .14 * .061 * .22 * .07 * 1 * .065 * 1.51*10⁻³
        = 12.9*10⁻⁹

Here, we do not repeat the details of the calculation of 1.51*10⁻³ for prob(a banana | np), which was developed above.  By these calculations, parse tree II is slightly favored over parse tree I.  Under parse II, the sentence means that fruit flies (as a verb) in the same way that a banana flies.  Adding semantic considerations, we might find that the more plausible parse is the one with "fruit flies" as a noun phrase and "like" as a verb.
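
The two tree probabilities can likewise be reproduced by a small illustrative sketch that multiplies, for each tree, the probabilities of the rules it uses and the lexical probabilities of its leaves; as in the text, the probability of the a banana noun phrase sums its two possible derivations (rules 8 and 6).

    # Probability of a parse tree = product of its rule probabilities and the
    # prob(word | tag) values at its leaves (tables hard-coded from this section).
    lex = {("fruit", "n"): 0.061, ("flies", "n"): 0.025, ("flies", "v"): 0.07,
           ("like", "v"): 0.1, ("like", "prep"): 0.065,
           ("a", "det"): 0.036, ("a", "n"): 0.001, ("banana", "n"): 0.076}

    # prob(a banana | np), summed over rule 8 (det n) and rule 6 (n n)
    a_banana = (0.55 * lex[("a", "det")] * lex[("banana", "n")]
                + 0.09 * lex[("a", "n")] * lex[("banana", "n")])

    # Tree I:  s -> np vp (rule 1, prob 1)
    tree_I = (1.0
              * 0.09 * lex[("fruit", "n")] * lex[("flies", "n")]     # rule 6: np -> n n
              * 0.393 * lex[("like", "v")] * a_banana)               # rule 3: vp -> v np

    # Tree II: s -> np vp (rule 1, prob 1)
    tree_II = (1.0
               * 0.14 * lex[("fruit", "n")]                          # rule 7: np -> n
               * 0.22 * lex[("flies", "v")]                          # rule 4: vp -> v pp
               * 1.0 * lex[("like", "prep")] * a_banana)             # rule 9: pp -> p np

    print(tree_I, tree_II)        # roughly 8.1e-09 and 1.29e-08, matching the calculations above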

According to Allen and others (p 213), probabilistic parsing gives the correct result (parse tree) only about half the time, which is better than pure guesswork but not great in general.  For instance, the result we obtained by parsing "fruit flies like a banana" with the sample grammar and probabilities is the less intuitive one.  Thus, we should consider more robust approaches to semantics if we want to be able to parse sentences like this one correctly.  Statistical methods that do not take semantic features into account have limited value in complex NLP situations.

Exercises

  1. Complete the calculation of  prob(n v prep det n | fruit flies like a banana), given the probabilities used in the discussion above.  What is the resulting probability?  Is this greater than that computed for prob(n n v det n | fruit flies like a banana)?  Does this result agree with your intuition about which is the more reasonable interpretation for this ambiguous sentence?
  2. Revise the grammar given in the discussion above by changing the probabilities for rules 2, 3, and 4 to .32, .33, and .35.  What effect does this change have on the selection of the best parse tree for the sentence Fruit flies like a banana?
  3. Change the probabilities in the Markov chain to the following:
     prob(n | v) = .53
     prob(det | v) = .32
     prob(prep | v) = .15
     Now recalculate the part of speech taggings for the sentence Fruit flies like a banana.  What now is the best part of speech tagging for the words in this sentence?
  4. Consider the sentence The woman saw the man with the telescope.  Compute the statistically best parse for this sentence, given the two alternative attachments of the prepositional phrase with the telescope.
  5. Consider the sentence The woman saw the man on the hill with the telescope.  Using the above grammar, how many different parse trees does this sentence have?  Compute the statistically best parse for this sentence.
  6. (Optional extra credit)  Consider the problems of part of speech tagging and probabilistic parsing using a Prolog program that implements the grammar and lexicon discussed here.  What would be needed in the Prolog program to automatically calculate the part of speech taggings and select the best one?  What would be needed in the Prolog program to automatically calculate the parse tree probabilities and select the best one?  Design such a program.

References

  1. Charniak, E., Statistical Language Learning, MIT Press (1993).
  2. Allen, J., Natural Language Understanding, 2nd ed., Benjamin/Cummings (1995), Chapter 7.
  3. DeRose, S.J., "Grammatical category disambiguation by statistical optimization," Computational Linguistics 14, 31-39 (1988).
  4. Jurafsky, D. and J. Martin, Speech and Language Processing, Prentice-Hall (2000).