Prolog and NLP

Review of Prolog Fundamentals

Prolog programs are made from terms, which can be either constants, variables, or structures.
  • A constant is either an atom (like the, zebra, 'John',  and '.') or a nonnegative integer (like 24)
  • A variable is a series of letters (A-Z, a-z, _) that begins with a capital letter (like John).
  • Note: constants cannot begin with a capital letter unless they are enclosed in quotes.
  • A structure is a predicate with zero or more arguments, written in functional notation.  E.g.,
  • n(zebra)
    speaks(boris, russian)
    np(X, Y)
    The number of arguments is called the predicate's "arity" (1, 2, and 2 in these examples).
    A fact is a term followed by a period (.).  A rule is a term followed by :- and a series of terms (term1, term2, ..., termN) separated by commas and ended by a period (.).  That is, rules have the following form:
    term :- term1, term2, ..., termN .
    A Prolog program is a series of facts and rules:
    speaks(boris, russian).
    speaks(john, english).
    speaks(mary, russian).
    speaks(mary, english).
    understands(Person1, Person2) :- speaks(Person1, L), speaks(Person2, L).



    This program asserts the facts that Boris and Mary speak Russian and John and Mary speak English.  It also defines the relation "understands" between two persons, which is true exactlyl when they both speak the same language, denoted by the variable L.

    A Prolog rule succeeds only when there are assignments (called "instantiations") of its variables for which each of the terms on its right-hand side (of the :- operator) is simultaneously satisfied for those assignments.  Otherwise, the rule is said to fail.

    To exercise a Prolog program, one writes "queries" in reply to the Prolog prompt ?-.  Here is a simple query that asks the question "Who speaks Russian?"

    ?- speaks(Who, russian).
    In reply, Prolog tries to satisfy a query by finding a fact or series of fact/rule applications that will resolve the query; that is, one that will assign values to the variables in the query that cause the fact or rule to succeed, in the pattern-matching sense.

    In this particular example, Prolog examines all the facts and rules that have "speaks" as a predicate (there are four), in the order they are written in the program.  Since "russian" is provided as a constant, the only variable to be resolved is the variable "Who".  Using the first fact in the program, Prolog replies with:

    Who = boris
    since that assignment to the variable Who causes the first fact to succeed.  At this point, the user may want to know if there are other ways of satisfying the same query, in which a semi-colon (;) is typed.  Prolog continues its search through the program, reporting the next success, if there is one.  When there are no more successful instantiations for the variable "Who," Prolog finally replies "No" and the process stops.  Here is the complete interaction:
    ?- speaks(Who, russian).
    Who = boris ;
    Who = mary ;
    Another sort of query that this program can handle is one that asks questions like "Does John understand Boris?"  or "Who understands Boris?" or "Find all people that understand each other."  These can be written as  the following Prolog queries, with the replies to the first two shown as well:
    ?- understands(john, boris).
    ?- understands(Who, boris).
    Who = boris ;
    Who = mary ;
    ?- understands(P1, P2).
    To see how these queries are satisfied, we need to see how the rule for "understands" is evaluated.  Any query of the form "understands(X, Y)" appeals to that rule, which can be satisfied only if there is a common instantiantion for the variables X, Y, and L for which "speaks(X, L)" and "speaks(Y, L)" are both satisfied.  These two terms are often called "subgoals" of the main goal "understands(X, Y)".

    Prolog takes the subgoals in a rule from left to right, so that a search is first made for values of X and L for which "speaks(X, L)" is satisfied.  Once such values are found, these same values are used wherever their variables appear in the search to satisfy additional subgoals for that rule, like "speaks(Y, L)".

    Let's follow the process of trying to satisfy the first query above:

    understands(john, boris)
                |     5 | fail
                |        |
            1  |        |
        speaks(john, L)         speaks(boris, L)
                |                        3  /       |
                |       L = english  /        |
            2  |                          /      4 |
        speaks(john, english)        fail



    This query fails because the only instantiation of L that satisfies "speaks(john, L)" does not simultaneously satisfy "speaks(boris, L)" for this program.  The numbers assigned to the links in the search diagram indicate the order in which subgoals are tried.

    The second query above has a somewhat more complex search process, which begins in the following way.

    understands(Who, boris)
                |     5 |
                |        |  Who = boris
            1  |        |
        speaks(Who, L)            speaks(boris, L)
                |                              3 /      |
                |        Who = boris    /       |
                |         L = russian    /        |
            2  |                              /      4 |
        speaks(boris, russian)        speaks(boris, russian)



    Once this step is completed, with the reply "Who = boris," the process continues to search for other instantiations of Who and L that will satisfy goals 1 and 3 simultaneously.  Thus, the process eventually uncovers the answer "Who = mary", and no more.

    Sentences as Lists of Atoms

    The basic data structure in Prolog programming is the list, which is written as a series of terms separated by commas and enclosed in brackets [].  Sentences are usually represented as lists of atoms, as in the following example:
    [the, giraffe, sleeps, '.']
    The empty list is denoted by [], while a "don't care" entry in a list is represented by an underscore (_).  The following denote lists of one, two, and three elements, respectively.
    [X, Y]
    [_, _, Z]



    The head (first) element of a list is distinguished from its remaining elements by a vertical bar.  Thus,

    [X | Y]



    denotes a list whose head is the element X and whose remaining elements form the list [Y].

    Here is a simple Prolog function that defines the concatenation of two lists together to form one (Matthews, 74).  The first two arguments for this function represent the two lists, and the third represents the result.

    concat([], X, X).
    concat([H | T], Y, [H | Z]) :- concat(T, Y, Z).



    This rather odd-looking recursive definition has two parts.  The "base case" is defined by the first line, which simply affirms that the empty list concatenated with any other list returns that other list as the result.  The recursive case, defined by the second line, says that if Z is the result of concatenating lists T and Y, then concatenating any new list [H | T] with Y gives the result [H | Z].

    The dynamics of execution in this case is important to understand, since this form of recursive definition occurs persistently in Prolog.  Consider the query

    ?- concat([english, russian], [spanish], L).
    which presumably will yield the list L = [english, russian, spanish].  Here is a partial trace of the various instantiations of variables H, T, X, Y, and Z as this result is developed.
    concat([english, russian], [spanish], L).
                |    H = english, T = [russian], Y = [spanish], L = [english | Z]
            1  |
    concat([russian], [spanish], [Z]).
                |    H = russian, T = [], Y = [spanish], [Z] = [russian | Z']
            2  |
    concat([], [spanish], [Z']).
                |    X = [spanish], Z' = spanish
            3  |
    concat([], [spanish], [spanish]).



    The first two calls use the recursive rule, which just strips the head H from the first argument and recalls with a shorter list as first argument.  The final call uses the base case, which forces instantiantion of the variable X = [spanish] and Z' = spanish.  Thus, the result [H | Z'] in the second call is resolved as [russian, spanish] and identified with the list Z in the first call.  Finally, the list L is resolved to [english, russian, spanish], using this newly-discovered value for Z as its tail.

    The use of primes (') in this analysis helps to distinguish one use of a variable in a rule at one level of recursion from another use of the same variable at another level.

    Pre- and Postprocessing Sentences

    In order to have a user-friendly interface, a natural language processing system must provide support for users to type sentences in the ordinary way, and to view them on the screen in their usual form.  Since the normal style of sentence representation for Prolog processing is as a list of atoms, auxiliary functions must be provided to convert the user's sequence of keystrokes that comprise a sentence into an appropriate list, and then to go the other way and display the list representation of a sentence as a series of words.

    Below is a collection of functions that accomplish these tasks.  Those which preprocess the user's typing are headed by the function read_sent.  The function which displays a sentence is called display_sent.

    % Prolog functions to read a sentence (adapted from Clocksin & Mellish 1994).
    read_sent([W | Ws]) :- get0(C), read_word(C, W, C1), rest(W, C1, Ws).
    % Given a word and the next character, read the rest of the sentence
    rest(W, _, []) :- terminator(W), !.
    rest(W, C, [W1 | Ws]) :- read_word(C, W1, C1), rest(W1, C1, Ws).
    % Read a single word, given an initial character, and remember next character.
    read_word(C, W, C1) :- punctuation(C), !, name(W, [C]), get0(C1).
    read_word(C, W, C2) :- in_word(C, NewC), !, get0(C1), rest_word(C1, Cs, C2),
                           name(W, [NewC | Cs]).
    read_word(C, W, C2) :- get0(C1), read_word(C1, W, C2).
    rest_word(C, [NewC | Cs], C2) :- in_word(C, NewC), !, get0(C1), 
                                     rest_word(C1, Cs, C2). 
    rest_word(C, [], C).
    % These are the ASCII codes for single-character words (punctuation marks).
    punctuation(44).   % ,
    punctuation(59).   % ;
    punctuation(58).   % :
    punctuation(63).   % ?
    punctuation(33).   % !
    punctuation(46).   % .
    % These can be inside a word; capitals are converted to lower case.
    in_word(C, C) :- C>96, C<123.            % ASCII a, b, ..., z
    in_word(C, L) :- C>64, C<91, L is C+32.  %       A, B, ..., Z
    in_word(C, C) :- C>47, C<58.             %       0, 1, ..., 9
    in_word(39, 39).                         %       '
    in_word(45, 45).                         %       -
    % These terminate a sentence.
    % Function to display a sentence that is stored as a list.
    display_sent([]) :- nl.
    display_sent([W | Rest]) :- name(W, [W1 | Rest]), punctuation(W1), !,
                                write(W), display_sent(Rest).
    display_sent([W | Rest]) :- write(' '), write(W), display_sent(Rest).
    To help clarify these definitions, it is important to understand the use of the Prolog functions get0 and name.  The call get0(C) simply retrieves the next character typed by the user and associates it with the variable C.  The call name(W, L) associates any word W, which is an atom, with the list L of ASCII codes which represent its characters.  Thus, the call name(giraffe, X) returns the list X = [103, 105, 114, 97, 102, 102, 101].  Similarly, the call name(X, [103, 105, 114, 97, 102, 102, 101]) returns the atom X = giraffe.

    The auxiliary facts punctuation, in_word, and terminator help to classify characters as punctuation marks, normal letters within a word, and sentence-terminators.

    Finally, the function display_sent distinguishes two cases: one in which a word is a punctuation mark (in which case it is displayed without a leading space) and the other in which a word is not a punctuation mark.  The Prolog function write displays an individual atom without any leading or trailing spaces.  the function nl denotes "go to a new line on the screen."

    Programming for NLP: a First Look

    Suppose we want to define a Prolog program that is effectively a grammar that will parse sentences, like the grammar illustrated in Figure 1.  Assuming the list representation for sentences, we can write Prolog rules that partition the sentence into its grammatical categories, following the structure of the grammar rules themselves.  For example, consider the grammar rule:
    s --> np vp



    A corresponding Prolog rule would be:

    s(X, Y) :- np(X, U), vp(U, Y).



    Here, we have added variables that stand for lists.  In particular, X denotes the list representation of the sentence being parsed, or "recognized," and Y represents the resulting tail of the list that will remain if this rule succeeds.  The interpretation here mirrors that of the original grammar rule: "X is a sentence, leaving Y, if the beginning of X can be identified as a noun  phrase, leaving U, and the beginning of U can be identified as a verb phrase, leaving Y."

    The entire Prolog program for the grammar in Figure 1 is shown below (and is called "grammar1" in the course directory):

    s(X, Y) :- np(X, U), vp(U, V), terminator(V, Y).
    np(X, Y) :- det(X, U), n(U, Y).
    vp(X, Y) :- iv(X, Y).
    vp(X, Y) :- tv(X, U), np(U, Y).
    det([the | Y], Y).
    det([a | Y], Y).
    n([giraffe | Y], Y).
    n([apple | Y], Y).
    iv([dreams | Y], Y).
    tv([dreams | Y], Y).
    tv([eats | Y], Y).
    terminator(['.'], []).
    terminator(['?'], []).
    terminator(['!'], []).
    The first rule in this program is slightly expanded to accommodate the terminator that occurs at the end of a sentence.  The facts that describe the terminators are self-explanatory.  The remaining rules in this program correspond identically with those of the grammar in Figure 1.  Note that the facts that identify lexical items (like giraffe, eats, etc.) effectively strip those items from the head of the list being passed through the grammar.

    To see how this works, consider the following Prolog query, which asks whether or not "the giraffe dreams." is a sentence.

    ?- s([the, giraffe, dreams, '.'], [])
    Here,  X and Y are identified with the two lists given as arguments, and the task is to find lists U and V that will satisfy, in the order given, each of the following three goals (using the right-hand side of the first rule in the program).
    np([the, giraffe, dreams, '.'], U)      vp(U, V)      terminator(V, [])
    One way to see the dynamics of this process is to trace the processing of the query itself:

    ï 1 | 1 call s([the,giraffe,dreams,.],[])
    ï 2 | 2 call np([the,giraffe,dreams,.],_7091)
    ï 3 | 3 call det([the,giraffe,dreams,.],_7401)
    ï 3 | 3 exit det([the,giraffe,dreams,.],[giraffe,dreams,.])
    ï 4 | 3 call n([giraffe,dreams,.],_7091)
    ï 4 | 3 exit n([giraffe,dreams,.],[dreams,.])
    ï 2 | 2 exit np([the,giraffe,dreams,.],[dreams,.])
    ï 5 | 2 call vp([dreams,.],_7089)
    ï 6 | 3 call iv([dreams,.],_7089)
    ï 6 | 3 exit iv([dreams,.],[.])
    ï 5 | 2 exit vp([dreams,.],[.])
    ï 7 | 2 call terminator([.],[])
    ï 7 | 2 exit terminator([.],[])
    ï 1 | 1 exit s([the,giraffe,dreams,.],[])

    From this trace, we can see that the variables U and V are instantiated to [dreams, '.'] and ['.'] respectively, in order to satisfy the right-hand side of the first grammar rule.

    Notice that, upon exit from each level of the trace, one or more words are removed from the head of the list.  The call to terminator removes the final atom from the list, completing the successful parse of the sentence.

    A more careful analysis of this tace reveals a direct correspondence between the various successful calls (rule applications) and the nodes of the parse tree shown in Figure 1 for this sentence (excluding the extra nodes added for the terminator).  So reading a trace can be helpful when developing a complex parse tree for a sentence.

    Definite Clause Grammars: Simplifying the Notation

    Using Prolog to encode complex grammars in this way is often more cumbersome than helpful.  For that reason, Prolog provides a very compact notation that directly mimics the notation of context-free grammar rules themselves.

    This notation is called the Definite Clause Grammar (DCG), and is very simple to asimilate.  The new operator --> is substituted in place of the operator :-, and the intermediate list variables are dropped.  So the Prolog rule

    s(X, Y) :- np(X, U), vp(U, V), terminator(V, Y).
    can be replaced by its equivalent simplified version:
    s --> np, vp, terminator.
    In making this transformation, it is important that we are not changing the arity of the function s (still 2) or the meaning of the rule itself.  These notations are introduced as sort of "macros" which allow Prolog rules to be written almost identically to the grammar rules which they emulate.  A complete rewriting of this grammar is shown below (and is available as "grammar2" in the course directory):
    s --> np, vp, terminator.

    np --> det, n.

    vp --> iv.
    vp --> tv, np.

    det --> [the].
    det --> [a].
    n --> [giraffe].
    n --> [apple].
    iv --> [dreams].
    tv --> [dreams].
    tv --> [eats].

    terminator --> ['.'] ; ['?'] ; ['!'].

    The semicolon that separates the different kinds of terminators in the last rule means "or".  Thus, this rule is equivalent to three separate rules for terminator that have three different right-hand sides.  This grammar is fully equivalent to the previous grammar; queries are formed and program execution occurs in exactly the same fashion as before.

    Further Refinements

    To simplify and clarify our later work, we introduce two additional refinements to the grammar-writing rules, the capability to generate a parse tree directly from the grammar and a basis for designing a lexicon with lexical features.

    The above grammar can be modified so that a query gives not just a "Yes" or "No" answer, but a complete parse tree in functional form as a response.  For instance, the functional form of the parse tree in Figure 1 is:

    s(np(det(the), n(giraffe)), vp(iv(dreams)), terminator('.'))
    This modification is accomplished by adding an additional argument to the left-hand side of each rule, and appropriate variables to hold the intermediate values that are derived in the intermediate stages of execution.  For instance, the first rule in the grammar above would be augmented as follows:
    s(s(NP,VP)) --> np(NP), vp(VP), terminator.
    This means that the query needs an extra argument, alongside the sentence to be parsed and the empty list.  That argument, appearing first in the query is a variable that will hold the resulting parse tree, as shown below:
    ?- s(Tree, [the, giraffe, dreams, '.'], []).
    The second modification recognizes that the rules for grammars are usually divided into two classes; those which describe the syntactic structure of a sentence (s, np, vp) and those which define individual lexical items (n, iv, tv, det, terminator). The latter group defines, in effect, the lexicon of individual words and their properties that form the vocabulary of a language.  The following provides a useful convention for making individual entries in the lexicon:

    For every word category (like n), and set of rules of the form (n --> [word]), define a new rule of the form

    n(n(N) --> [N], {n(N)}.

    and a set of facts of the form:


    For instance, the above grammar has two nouns, giraffe and apple, defined by the set of rules:

    n --> [giraffe].
    n --> [apple].

    This set can be replaced by the following new rule and two facts:

    n(n(N) --> [N], {n(N)}.

    Now if we want to add more nouns to this grammar, we simply add more facts to the two already there.

    Acomplete revision of the above grammar that accommodates these two refinements is given below (this one's called "grammar3" in the course directory).

    s(s(NP,VP)) --> np(NP), vp(VP), terminator.

    np(np(D,N)) --> det(D), n(N).

    vp(vp(IV)) --> iv(IV).
    vp(vp(TV,NP)) --> tv(TV), np(NP).

    det(det(D)) --> [D], {det(D)}.

    n(n(N)) --> [N], {n(N)}.

    iv(iv(IV)) --> [IV], {iv(IV)}.

    tv(tv(TV)) --> [TV], {tv(TV)}.

    terminator --> ['.'] ; ['?'] ; ['!'].


    1. Using one of the programs grammar1 or grammar2, along with the read_sent program as an input aide, determine whether or not each of the following is a valid sentence:
    2. a.  The giraffe eats the apple.
      b.  The apple eats the giraffe.
      c.  The giraffe eats.
    3. Suggest a small change in the grammar that would make all three valid sentences.
    4. Exercise grammar3a along with read_sent to obtain parse trees for each of the valid sentences in question 1.
    5. In the style of grammar3a, define a new Prolog grammar that will implement the language defined in Figure 2.  That is, ambiguous sentences like fruit flies like an apple should yield two different parse trees.  Exercise your grammar to be sure that it generates both parse trees for this sentence.
    6. Consider the sentence "the woman saw the man on the hill with the telescope."
    7. a.  Briefly discuss the different ways in which this sentence can be interpreted (i.e., the sentence is syntactically ambiguous).
      b.  Define an ambiguous context free grammar which allows these different interpretations to be realized.
      c.  Design a Prolog grammar that is equivalent to your grammar in part b, and exercise it to be sure that these interpretations occur as different parse trees for the given sentence.
    8. (optional) If you know Prolog well, consider writing functions that will display the output of grammar3 in a more tree-like, readable fashion.  That is, the result

    9. s(np(det(the), n(giraffe)), vp(iv(dreams)), terminator('.'))
      would be displayed as follows:


    1. Clocksin and Mellish, Programming in Prolog 4e, Springer, 1997.
    2. Matthews, Clive, An Introduction to Natural Language Processing through Prolog, Longman, 1998.
    Modified: 19-Jul-1999