Innovations in Text Interpretation

Jacobs, Rau 1993

Jacobs and Rau describe their information extraction system (SCISOR), which uses a large lexicon, shallow probabilistic preprocessing, and robust parsing. The first third of the article is a well-written explanation of the shift in NLP research focus toward robust, wide-coverage systems that occurred in the early 1990s. They give some of the motivations behind this shift and describe how it relates to larger issues in AI. They also describe some of the tasks that the shift brought into focus: information retrieval, information extraction, and text categorization.

Jacobs and Rau assert that three basic issues are essential in building robust wide-coverage systems:

  • Lexicon design and scope.
  • Shallow ("weak") and probabilistic preprocessing.
  • Control structures for robust parsing.

Most of the rest of the paper describes how their system handles these three issues:

  • They use a large (for the time) lexicon, with an organization similar to WordNet. Each lexical entry represents a single word sense, and contains information about what syntactic frames it can occur in, what arguments it can take, etc.
  • They employ four shallow preprocessing stages:
    • Tagging (part of speech and stem/affixes)
    • "Template activation," which identifies verbs that are candidates for creating MUC events/relations and nouns that might fill event/relation slots.
    • Shallow parsing/chunking.
    • Topic detection.
  • The control structure for robust parsing is based on "relations," which are head-relation-argument triples (like acquire-object-company). The parser makes use of constituents from preprocessing, and tries to construct parses that yield more likely relations. The likelihood of a relation is based on the likelihood of each pair in the relation (i.e., the likelihood of the head-relation pair, the relation-argument pair, and the head-argument pair).
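
The pairwise scoring of relations can be illustrated with a minimal Python sketch. This is not the authors' implementation: the summary does not say how the three pair likelihoods are combined, so the sketch simply assumes they are multiplied (summed in log space). The function name relation_score, the floor value for unseen pairs, and all of the probabilities are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, Tuple

Pair = Tuple[str, str]


def relation_score(
    head: str,
    relation: str,
    argument: str,
    head_rel: Dict[Pair, float],
    rel_arg: Dict[Pair, float],
    head_arg: Dict[Pair, float],
    floor: float = 1e-6,
) -> float:
    """Log-score of a head-relation-argument triple.

    Assumption: the three pair likelihoods are combined by multiplication,
    i.e. their logs are summed. Unseen pairs get a small floor value
    (also an assumption) so the logarithm is always defined.
    """
    lookups = [
        (head_rel, (head, relation)),      # head-relation pair
        (rel_arg, (relation, argument)),   # relation-argument pair
        (head_arg, (head, argument)),      # head-argument pair
    ]
    return sum(math.log(table.get(key, floor)) for table, key in lookups)


# Toy likelihood tables; the numbers are invented for illustration.
head_rel = {("acquire", "object"): 0.6}
rel_arg = {("object", "company"): 0.3, ("object", "week"): 0.01}
head_arg = {("acquire", "company"): 0.4}

good = relation_score("acquire", "object", "company", head_rel, rel_arg, head_arg)
bad = relation_score("acquire", "object", "week", head_rel, rel_arg, head_arg)

# A parser following this idea would prefer the attachment with the higher score.
preferred = "company" if good > bad else "week"
```

Under these toy numbers the triple acquire-object-company outscores acquire-object-week, so a parser choosing between the two attachments would prefer the former.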

The last section of the paper describes the MUC-3 evaluation, and reports their system's performance (57% precision, 52% recall). Their system was the highest-scoring system at MUC-3.


  author =       {Paul S. Jacobs and Lisa F. Rau},
  title =        {Innovations in Text Interpretation},
  journal =      {Artificial Intelligence},
  year =         1993,
  volume =       63,
  pages =        {143-191}