Overview of NLP: Issues and Strategies

Natural Language Processing (NLP) is the capacity of a computer to "understand" natural language text at a level that allows meaningful interaction between the computer and a person working in a particular application domain.

Application Domains of NLP:

Tools for NLP:

Linguistic Organization of NLP

E.g., eat + s = eats

Grammars and parsing

Syntactic categories (common denotations) in NLP

A context-free grammar (CFG) is a list of rules that define the set of all well-formed sentences in a language. Each rule has a left-hand side, which identifies a syntactic category, and a right-hand side, which defines its alternative component parts, reading from left to right.

E.g., the rule s --> np vp means that "a sentence is defined as a noun phrase followed by a verb phrase." Figure 1 shows a simple CFG that describes the sentences from a small subset of English.
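One way to make the rule format concrete is to store a grammar as a mapping from each left-hand side to its list of alternative right-hand sides. The Python sketch below is illustrative only: its six rules are inferred from the giraffe example used later in these notes, not copied from Figure 1, and the names GRAMMAR, is_category, and describe are hypothetical.

```python
# Toy grammar (an assumption, not the grammar of Figure 1).
# Each left-hand side maps to a list of alternative right-hand sides,
# and each right-hand side is read from left to right.
GRAMMAR = {
    "s":   [["np", "vp"]],   # a sentence is a noun phrase followed by a verb phrase
    "np":  [["det", "n"]],
    "vp":  [["iv"]],
    "det": [["the"]],
    "n":   [["giraffe"]],
    "iv":  [["dreams"]],
}


def is_category(symbol):
    """A symbol is a syntactic category if it appears on some left-hand side."""
    return symbol in GRAMMAR


def describe(lhs):
    """Render the rule(s) for one category in the notation used in these notes."""
    alternatives = [" ".join(rhs) for rhs in GRAMMAR[lhs]]
    return f"{lhs} --> {' | '.join(alternatives)}"


if __name__ == "__main__":
    for category in GRAMMAR:
        print(describe(category))
```

Anything that never appears on a left-hand side (here: the, giraffe, dreams) is treated as a word of the language.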


A sentence in the language defined by a CFG is a series of words that can be derived by
systematically applying the rules, beginning with a rule that has s on its left-hand side. A
parse of the sentence is a series of rule applications in which a syntactic category is replaced
by the right-hand side of a rule that has that category on its left-hand side, and the final
rule application yields the sentence itself. E.g., a parse of the sentence "the giraffe dreams" is:

s => np vp => det n vp => the n vp => the giraffe vp => the giraffe iv => the giraffe dreams
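The derivation above always rewrites the leftmost remaining category. A minimal Python sketch of that procedure follows. It assumes a toy grammar inferred from this example (not necessarily the grammar of Figure 1) in which every category has exactly one alternative, so no choice between rules is ever needed. The names GRAMMAR and leftmost_derivation are hypothetical.

```python
# Toy grammar inferred from the example derivation (an assumption).
GRAMMAR = {
    "s":   [["np", "vp"]],
    "np":  [["det", "n"]],
    "vp":  [["iv"]],
    "det": [["the"]],
    "n":   [["giraffe"]],
    "iv":  [["dreams"]],
}


def leftmost_derivation(start="s"):
    """Return every sentential form, from the start symbol to the final sentence.

    At each step the leftmost category is replaced by the right-hand side of
    its first rule. This only terminates for grammars, like the toy one above,
    whose first alternatives are not recursive.
    """
    form = [start]
    steps = [list(form)]
    while True:
        for i, symbol in enumerate(form):
            if symbol in GRAMMAR:                      # leftmost category found
                form = form[:i] + GRAMMAR[symbol][0] + form[i + 1:]
                steps.append(list(form))
                break
        else:                                          # only words remain
            return steps


if __name__ == "__main__":
    print(" => ".join(" ".join(form) for form in leftmost_derivation()))
```

Run as a script, this prints the same chain of sentential forms as the derivation shown above.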

A convenient way to describe a parse is to show its parse tree, which is simply a graphical
display of the parse. Figure 1 shows a parse tree for the sentence "the giraffe dreams". Note
that the root of every subtree has a grammatical category that appears on the left-hand side of
a rule, and the children of that root are identical to the elements on the right-hand side of that rule.
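The relationship between rules and subtrees can be sketched with a small top-down parser that returns a parse tree as nested (category, children) pairs. This is one possible approach, not the method prescribed by these notes. It assumes a toy grammar inferred from the giraffe example with no left-recursive rules, and the names parse, parse_sequence, parse_sentence, and show are hypothetical.

```python
# Toy grammar inferred from the giraffe example (an assumption).
GRAMMAR = {
    "s":   [["np", "vp"]],
    "np":  [["det", "n"]],
    "vp":  [["iv"]],
    "det": [["the"]],
    "n":   [["giraffe"]],
    "iv":  [["dreams"]],
}


def parse(symbol, words, pos=0):
    """Yield (tree, next_pos) for each way symbol can match words starting at pos."""
    if symbol not in GRAMMAR:                          # a word: must match the input
        if pos < len(words) and words[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:                        # a category: try each rule
        for children, end in parse_sequence(rhs, words, pos):
            yield (symbol, children), end              # root = left-hand side


def parse_sequence(symbols, words, pos):
    """Yield (subtrees, next_pos) for matching the symbols one after another."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in parse(symbols[0], words, pos):
        for rest, end in parse_sequence(symbols[1:], words, mid):
            yield [tree] + rest, end


def parse_sentence(sentence):
    """Return the first parse tree that covers the whole sentence, or None."""
    words = sentence.split()
    for tree, end in parse("s", words):
        if end == len(words):
            return tree
    return None


def show(tree, depth=0):
    """Render a tree as indented text, one node per line."""
    if isinstance(tree, str):
        return "  " * depth + tree
    category, children = tree
    lines = ["  " * depth + category]
    lines += [show(child, depth + 1) for child in children]
    return "\n".join(lines)


if __name__ == "__main__":
    print(show(parse_sentence("the giraffe dreams")))
```

In the tree this returns, each node's category is the left-hand side of some rule and its children are that rule's right-hand side in order, which is the property noted above. A word sequence the grammar cannot derive, such as "giraffe the dreams", yields None.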

If this looks like familiar territory from your study of programming languages, that's a good observation. CFGs are, in fact, the origin of the device called BNF (Backus-Naur Form) for describing the syntax of programming languages. CFGs were invented by the linguist Noam Chomsky in 1957. BNF originated with the design of the Algol programming language in 1960.

Goals of Linguistic Grammars

NLP vs PLP (Programming Language Processing):

There are some parallels, and some fundamental distinctions, between the goals and methods of programming language processing (language design and compiler strategies) and those of natural language processing. Here is a brief summary:


                           NLP                              PLP
domain of discourse        broad: what can be expressed     narrow: what can be computed
lexicon                    large/complex                    small/simple
grammatical constructs     many and varied                  few
                           - declarative                    - declarative
                           - interrogative                  - imperative
                           - fragments
                           etc.
meanings of an expression  many                             one
tools and techniques       morphological analysis           lexical analysis
                           syntactic analysis               context-free parsing
                           semantic analysis                code generation/compiling
                           integration of world knowledge   interpreting
