
Word Features and Agreement Issues

Feature Systems for English

Beyond the basic structural integrity imposed by the grammar, a sentence must satisfy several further structural requirements. These include person and number agreement between and within noun and verb phrases, proper use of verb forms, appropriate phrase structures following verbs, and proper use of prepositional phrases with various verbs.

Many of these requirements can be ensured by adding so-called "features" to the lexicon, and then augmenting the grammar rules so that the features (like number and person) play a role in discriminating between correct and incorrect sentences.

Number and Person Agreement

The number feature of a noun or noun phrase is either singular or plural. Correspondingly, verbs and verb phrases must agree in number with the noun phrase that is the subject of the sentence. For instance,

The man sees the fish.

is correct, but not

The man see the fish.

Alongside number is the person feature of noun and verb phrases. Each of the above sentences is in the third person, since its subject is neither the speaker nor the hearer. When the subject of the sentence is the speaker, it is expressed in the first person, as in

I see the fish.

When the subject is the hearer, it is identified as the second person, as in

You see the fish.

Noun phrase-verb phrase agreement takes person into account as well as number.

A Simple Grammar with Number Agreement

Here is a simple grammar that forces number agreement between the noun and verb phrases of a sentence. A copy is available as grammar4 in the prolog directory. In this program, singular and plural number are denoted by the atoms s and p, respectively.

s(s(NP,VP)) --> np(NP, Number), vp(VP, Number), terminator.

np(np(D,N), Number) --> det(D, Number), n(N, Number).
np(np(P), Number) --> pro(P, Number).

vp(vp(IV), Number) --> iv(IV, Number).
vp(vp(TV,NP), Number) --> tv(TV, Number), np(NP, _).

det(det(W), Number) --> [W], {det(W, Number)}.
det(the, _).
det(a, s).

n(n(W), Number) --> [W], {n(W, Number)}.
n(dog, s).
n(fish, _).
n(man, s). n(men, p).
n(saw, s).

pro(pro(W), Number) --> [W], {pro(W, Number)}.
pro(he, s).

iv(iv(W), Number) --> [W], {iv(W, Number)}.
iv(cries, s). iv(cry, p).

tv(tv(W), Number) --> [W], {tv(W, Number)}.
tv(sees, s).    tv(see, p).
tv(wants, s).   tv(want, p).
tv(was, s).     tv(were, p).

terminator --> ['.'] ; ['?'] ; ['!'].

This grammar has some interesting characteristics. First, the number feature is prominently carried in rules where it is needed to enforce number agreement among nouns, verbs, determiners, noun phrases, verb phrases, and sentences. Recall the Prolog convention that several occurrences of a variable, like Number, within a single rule must all instantiate to the same value (s or p in this case) whenever that rule is used in a parse.

Second, some words in the lexicon (e.g., the and fish) can be associated with both singular and plural number, and this is indicated by _ (don't care) number entries for those words in the lexicon.

Third, the grammar doesn't take person into account, as it should in reality. This extension is left as an exercise.
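The first two points can be checked at the Prolog prompt. The following is a trimmed-down copy of the grammar above (intransitive sentences only, with the terminator dropped to keep the queries short), so it is a sketch for experimenting rather than a replacement for grammar4.

```prolog
% Trimmed copy of the number-agreement grammar: intransitive sentences
% only, and no terminator, so that the sample queries stay short.
s(s(NP,VP)) --> np(NP, Number), vp(VP, Number).

np(np(D,N), Number) --> det(D, Number), n(N, Number).
vp(vp(IV), Number) --> iv(IV, Number).

det(det(W), Number) --> [W], {det(W, Number)}.
det(the, _).
det(a, s).

n(n(W), Number) --> [W], {n(W, Number)}.
n(fish, _).
n(man, s).
n(men, p).

iv(iv(W), Number) --> [W], {iv(W, Number)}.
iv(cries, s).
iv(cry, p).

% Sample queries:
% ?- phrase(s(T), [the, man, cries]).   % succeeds: subject and verb are singular
% ?- phrase(s(T), [the, men, cries]).   % fails: plural subject, singular verb
% ?- phrase(s(T), [the, fish, cries]).  % succeeds: fish is read as singular
% ?- phrase(s(T), [the, fish, cry]).    % succeeds: fish is read as plural
% ?- phrase(s(T), [a, men, cry]).       % fails: a is singular, men is plural
```

In the second query, Number is bound to p by men, so the lookup iv(cries, p) fails. In the third and fourth queries, neither the nor fish binds Number, so the verb alone decides it.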

Verb Forms and Verb Subcategorization

In addition to person and number agreement, sentences may be expressed in different tenses and may have different constructions following the verb to express various kinds of events. Tense is expressed by various codings for verb forms, including the following (Allen, p 87):

base form - e.g., see

pres (simple present) - e.g., The dog sees the fish.

past - e.g., The dog saw the fish.

ing (present participle) - e.g., The dog is seeing the fish.

pastprt (past participle) - e.g., The dog has seen the fish.

inf (infinitive) - e.g., The dog wants to see the fish.
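One way these codes could be carried in a Prolog lexicon is as an extra argument on each verb entry, in the same way that Number is carried in grammar4. The sketch below is only an illustration: the predicates v/3, aux/2, tensed/1 and vgroup//1 are hypothetical names, not part of grammar4 or of Allen's parser, and only the verb see is covered.

```prolog
% v(Word, Root, VForm): hypothetical lexicon entries for the verb see.
v(see,    see, base).
v(sees,   see, pres).
v(saw,    see, past).
v(seeing, see, ing).
v(seen,   see, pastprt).

% aux(Word, VForm): the verb form that each auxiliary requires after it.
aux(is,  ing).
aux(has, pastprt).

% Only the pres and past forms can stand alone as the verb of a sentence.
tensed(pres).
tensed(past).

% A verb group is a tensed verb on its own, an auxiliary followed by the
% verb form it selects, or "to" followed by the base form (the infinitive).
vgroup(vg(Root, VForm)) --> [W], {v(W, Root, VForm), tensed(VForm)}.
vgroup(vg(Root, VForm)) --> [A], {aux(A, VForm)}, [W], {v(W, Root, VForm)}.
vgroup(vg(Root, inf))   --> [to], [W], {v(W, Root, base)}.

% Sample queries:
% ?- phrase(vgroup(G), [sees]).          % G = vg(see, pres)
% ?- phrase(vgroup(G), [has, seen]).     % G = vg(see, pastprt)
% ?- phrase(vgroup(G), [to, see]).       % G = vg(see, inf)
% ?- phrase(vgroup(G), [has, seeing]).   % fails: has requires pastprt
```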

Different constructions that follow the main verb in a sentence are sometimes defined using "verb subcategorization," as in the following (Allen, p 88):

_none (intransitive verbs) - The dog smiles.

_np (simple transitive) - The dog sees the fish.

_np_np - The dog gave the man the fish.

_vp:inf - The dog wants to cry.

_np_vp:inf - The dog wants the fish to go.

_vp:ing - The dog keeps hoping for the fish.

_np_vp:ing - The dog saw the fish swimming in the water.

_np_vp:base - The dog saw the fish swim.

Prepositional phrases can likewise be grouped into three general classes:

to - e.g., to

loc (location) - e.g., in, on, under, beside, ...

mot (motion) - e.g., to, toward, from, along, ...

This gives the following additional verb subcategorizations, along with some that take adjective phrases or embedded sentences:

_np_pp:to - The dog gave the fish to the man.

_pp:loc - The dog is in the water.

_np_pp:loc - The dog put the fish in the water.

_pp:mot - The dog went to the water.

_np_pp:mot - The dog took the fish to the man.

_adjp - The dog is happy.

_np_adjp - The man kept the dog happy.

_s:that - Jack believed that the dog took the fish.

_s:for - Jack hoped for the dog to take the fish.
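Subcategorization can be sketched in the Prolog style used earlier by storing a subcat value with each verb and letting that value select the complements that must follow. The code below is a minimal illustration under several assumptions: the predicate names (v/2, comps//2, prep/2) are hypothetical, the tiny lexicon is invented for the example, and only five of the frames listed above are covered. Allen's names begin with an underscore, which Prolog would read as a variable, so the sketch writes np_pp(to) where the notes write _np_pp:to.

```prolog
% v(Word, Subcat): hypothetical entries; a verb may have several frames.
v(smiles, none).
v(sees,   np).
v(gave,   np_np).
v(gave,   np_pp(to)).
v(put,    np_pp(loc)).
v(went,   pp(mot)).

% The verb's subcat value selects which complements must follow it.
vp(vp(V, Comps)) --> [V], {v(V, Subcat)}, comps(Subcat, Comps).

comps(none,     [])         --> [].
comps(np,       [NP])       --> np(NP).
comps(np_np,    [NP1, NP2]) --> np(NP1), np(NP2).
comps(np_pp(C), [NP, PP])   --> np(NP), pp(C, PP).
comps(pp(C),    [PP])       --> pp(C, PP).

np(np(D, N)) --> [D], {det(D)}, [N], {n(N)}.
pp(Class, pp(P, NP)) --> [P], {prep(P, Class)}, np(NP).

det(the).
n(dog). n(fish). n(man). n(water).

% prep(Word, Class): "to" belongs to both the to class and the mot class.
prep(to, to).
prep(to, mot).
prep(in, loc).
prep(toward, mot).

% Sample queries:
% ?- phrase(vp(T), [gave, the, man, the, fish]).
% T = vp(gave, [np(the, man), np(the, fish)])
% ?- phrase(vp(T), [gave, the, fish, to, the, man]).
% T = vp(gave, [np(the, fish), pp(to, np(the, man))])
% ?- phrase(vp(T), [put, the, fish]).      % fails: put also needs a loc phrase
% ?- phrase(vp(T), [smiles, the, fish]).   % fails: smiles takes no complement
```

Listing gave twice, once per frame, lets backtracking choose between "gave the man the fish" and "gave the fish to the man" without any change to the vp rule.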

Using a Lisp Interpreter for Grammar and Lexicon

To take advantage of the NLP ideas in the Allen text, we switch from a Prolog-based to a Lisp-based style of expressing and exercising grammars and lexicons. Allen provides an interpreter for the grammars and lexicons discussed in that text, all of which are provided in electronic form in the directory jallen/Parser1.1. (You should copy the directory jallen to your own directory, for later use throughout this course.) The interpreter is loaded with the following Lisp command (assuming you have started Allegro Common Lisp from your subdirectory jallen/Parser1.1):

> (load "LOADP")

Now to load and run the grammar and lexicon in Chapter 4, type the command:

> (loadChapter4)

and you are ready to parse and examine the structure of sentences that are discussed there.

The lexicon design is discussed on pages 90-93 of Allen. Each entry in the lexicon is coded using a Lisp-like notation according to its category, root, person and number agreement, and (in the case of verbs) form and subcategorization features. For instance, the word dog appears in the lexicon as follows:

(dog (n (root DOG1) (agr 3s)))

which says that dog is a noun (n), has root form DOG1 (which identifies a particular use, or sense, of the word dog), and has third person singular (3s) agreement features. The word saw has two different entries in the lexicon, one of which is:

(saw (v (root SEE1) (VFORM past) (subcat _np) (agr ?a)))

This identifies the word as a verb (v) with root SEE1 and verb form (tense) past, with subcategorization _np (taking a noun phrase as an object), and with an agreement feature (agr ?a) whose variable ?a places no restriction on person or number.

The grammar also takes a Lisp-like form, as discussed on pp 94-98 of Allen. As in our Prolog grammars, these grammars carry agreement and subcategorization features, along with the definition of the grammatical classes and syntactic structure. For instance, the rule

(NP AGR ?a) --> (ART AGR ?a) (N AGR ?a)

describes a noun phrase with a particular person-number agreement feature as an article and a noun with the same agreement feature. This is encoded into the following interpretable Lisp expression:

((np) -2> (art (agr ?a)) (head (n (agr ?a))))

Here, the special convention -n> encodes the arrow of the nth grammar rule in a way that uniquely identifies that rule for the purpose of tracing.

A Simple Grammar and Lexicon with Verb Subcategorization Features

The lexicon and grammar shown in Figures 4.6 and 4.7 of Allen use the person, number, verb form, and verb subcategorization features summarized above. They are encoded in Lisp notation in the file jallen/Parser1.1/Grams/chapt4. Here is a complete listing of the lexicon encoding.

(setq *lexicon4-6*
'((a (art (agr 3s) (root A1)))
    (be (v (root BE1) (vform bare) (subcat (? s _adjp _np)) (irreg-pres +)
           (irreg-past +)))
    (cry (v (root CRY1) (vform bare) (subcat _none)))
    (dog (n (root DOG1) (agr 3s)))
    (fish (n (root FISH1) (agr (? a 3s 3p)) (IRREG-PL +)))
    (happy (adj (subcat _vp-inf) (root HAPPY1)))
    (he (pro (root HE1) (AGR 3s)))
    (is (v (root BE1) (VFORM pres) (SUBCAT (? s _adjp _np)) (AGR 3s)))
    (Jack (name (agr 3s) (root JACK1)))
    (man (n (root MAN1) (agr 3s)))
    (men (n (root MAN1) (agr 3p)))
    (saw (n (root SAW1) (agr 3s)))
    (saw (v (root SAW2) (vform bare) (subcat _np)))
    (saw (v (root SEE1) (VFORM past) (subcat _np) (agr ?a)))
    (see (v (root SEE1) (VFORM bare) (subcat _np) (irreg-past +)
         (en-pastprt +)))
    (seed (n (root SEED1) (AGR 3s)))
    (the (art (root THE1) (agr (? a 3s 3p))))
    (to (to (vform inf)))
    (want (v (root WANT1) (VFORM bare)
             (subcat (? s _np _vp-inf _np_vp-inf))))
    (was (v (root BE1) (VFORM past) (AGR (? a 1s 3s))
            (SUBCAT (? s _adjp _np))))
    (were (v (root BE1) (VFORM past) (AGR (? a 2s 1p 2p 3p))
             (SUBCAT (? s _adjp _np))))
    (+s (+S))
    (+ed (+ED))
    (+en (+EN))
    (+ing (+ING))))

Here is a complete listing of the encoded grammar.

(setq *grammar4-7*
      '((headfeatures (s agr) (vp vform agr) (np agr))
        ((s (inv -))
         -1>
         (np (agr ?a)) (head (vp (vform (? v past pres)) (agr ?a))))
        ((np)
         -2>
         (art (agr ?a)) (head (n (agr ?a))))
        ((np)
         -3>
         (head (pro)))
        ((vp)
         -4>
         (head (v (subcat _none))))
        ((vp)
         -5>
         (head (v (subcat _np))) (np))
        ((vp)
         -6>
         (head (v (subcat _vp-inf))) (vp (vform inf)))
        ((vp)
         -7>
         (head (v (subcat _np_vp-inf))) (np) (vp (vform inf)))
        ((vp)
         -8>
         (head (v (subcat _adjp))) (adjp))
        ((vp (vform inf))
         -9>
         (head (to)) (vp (vform bare)))
        ((adjp)
         -10>
         (head (adj)))
        ((adjp)
         -11>
         (head (adj (subcat _vp-inf))) (vp (vform inf)))))

Below is a trace of the parse of the sentence "He wants to be happy." shown in Figure 4.9, using this grammar and lexicon. It is obtained by the Lisp function call (BU-parse '(he want +s to be happy)) followed by the function call (show-answers).

S57:<S ((INV -) (AGR 3S) (1 NP50) (2 VP56))> from 0 to 6 from rule -1>
NP50:<NP ((AGR 3S) (1 PRO44))> from 0 to 1 from rule -3>
    PRO44:<PRO ((LEX HE) (ROOT HE1) (AGR 3S))> from 0 to 1 from rule NIL
VP56:<VP ((VFORM PRES) (AGR 3S) (1 V52)
            (2 VP55))> from 1 to 6 from rule -6>
    V52:<V ((AGR 3S) (VFORM PRES) (ROOT WANT1) (SUBCAT _VP-INF) (1 V45)
            (2 +S46))> from 1 to 3 from rule -LEX1>
      V45:<V ((LEX WANT) (ROOT WANT1) (VFORM BARE)
              (SUBCAT
               (? S6 _NP_VP-INF _VP-INF
                _NP)))> from 1 to 2 from rule NIL
      +S46:<+S ((LEX +S))> from 2 to 3 from rule NIL
    VP55:<VP ((VFORM INF) (AGR -) (1 TO47)
              (2 VP54))> from 3 to 6 from rule -9>
      TO47:<TO ((LEX TO) (VFORM INF))> from 3 to 4 from rule NIL
      VP54:<VP ((VFORM BARE) (AGR -) (1 V48)
                (2 ADJP53))> from 4 to 6 from rule -8>
        V48:<V ((LEX BE) (ROOT BE1) (VFORM BARE) (SUBCAT _ADJP)
                (IRREG-PRES +)
                (IRREG-PAST +))> from 4 to 5 from rule NIL
        ADJP53:<ADJP ((1 ADJ49))> from 5 to 6 from rule -10>
          ADJ49:<ADJ ((LEX HAPPY) (SUBCAT _VP-INF)
                      (ROOT HAPPY1))> from 5 to 6 from rule NIL

This trace should be compared with the parse tree shown in Figure 4.9 on page 97 of Allen. Note that different levels of indentation here correspond with different levels of subtree in that figure. It's a bit tedious to unravel, but a close examination of the indentation structure reveals the structure of the parse tree itself. All derived feature values are attached to each different node of the tree, along with the identifying number of the grammar rule that generated that node.
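Note that the input to BU-parse above is already segmented into roots and suffixes (want +s rather than wants). That step is done by hand in these notes. As a rough illustration of what it involves, here is a naive Prolog sketch that splits off a final s when the remainder is a known root. The predicates root/1, segment_word/2 and segment/2 are hypothetical and not part of Allen's parser, and the rule does not handle spelling changes such as cries.

```prolog
:- use_module(library(lists)).   % for append/3

% root(Word): hypothetical list of verb roots that may take the +s suffix.
root(want).
root(see).

% segment_word(+Word, -Tokens): split off a final "s" when what is left
% is a known root; otherwise keep the word unchanged.
segment_word(Word, [Root, '+s']) :-
    atom_concat(Root, s, Word),
    root(Root),
    !.
segment_word(Word, [Word]).

% segment(+Words, -Tokens): apply segment_word/2 to each word of a sentence.
segment([], []).
segment([W|Ws], Tokens) :-
    segment_word(W, Front),
    segment(Ws, Rest),
    append(Front, Rest, Tokens).

% Sample query:
% ?- segment([he, wants, to, be, happy], Ts).
% Ts = [he, want, '+s', to, be, happy]
```

Restricting the split to known roots keeps words such as is and was intact, since neither i nor wa is listed as a root.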

Additional Requirements and Features

Exercises

Augment grammar4 so that it takes person into account as well as number in determining the grammaticality of its sentences. Consider extending the pronoun category, pro, so that it admits the words i, you, he, she, it, we, and they. Also add a new feature to the lexicon, person, which identifies the person of a noun or verb as 1, 2, or 3 according to its form. For instance, the pronoun she would be defined as pro(she, s, 3), the pronoun you would be defined as pro(you, _, 2), and so forth.
Exercise the lexicon and grammar given in the file chapt4, using all the sentences given as examples in the above notes. Be sure to perform morphological analysis manually, so that sentences like "The dog wants to cry" can be correctly parsed. I.e., type the parsing command as (BU-parse '(the dog want +s to cry)) rather than (BU-parse '(the dog wants to cry)). Which of these sentences do not parse correctly and why?
Augment the grammar and lexicon in the file chapt4 so that various forms of give are included in the lexicon, allowing sentences like "The dog gave the man the fish" to be successfully parsed.
(Optional) Augment the grammar and lexicon in the file chapt4 so that all the sentences given as examples in the above discussion can be successfully parsed.

References

Allen, Chapter 4
Matthews, Chapter 11