
Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping

Riloff, Jones 1999

Riloff and Jones present a largely unsupervised learning algorithm for named entity (and nominal) detection and classification; it needs only unannotated text and a handful of seed words. It generates two resources:

  • The "semantic lexicon" is basically a wordlist for each category. It contains words and phrases that are usually classified with a given category.
  • The "pattern dictionary" is a list of patterns specifying contexts that usually appear with a given category. They are simple surface-string patterns, such as "operates in x." The patterns are generated with AutoSlog.
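
To make the idea of a surface-string pattern concrete, here is a minimal sketch. It is not AutoSlog: the `<x>` slot marker, the `extract` helper, and the "run of capitalized words" rule for filling the slot are all assumptions made for illustration.

```python
import re

def extract(pattern, text):
    """Return the phrases that fill the <x> slot of a surface pattern.

    Crude stand-in for a real extraction pattern: the slot is filled by a
    run of capitalized words (an assumption of this sketch).
    """
    prefix = re.escape(pattern.replace("<x>", "").strip())
    slot = re.compile(prefix + r"\s+((?:[A-Z]\w*)(?:\s[A-Z]\w*)*)")
    return [m.group(1) for m in slot.finditer(text)]

sample = ("The company operates in Costa Rica. "
          "Its rival operates in Peru and has offices in Chile.")
locations = extract("operates in <x>", sample)
```

Here `locations` holds the phrases that follow "operates in" in the sample text, which is the kind of candidate list the pattern dictionary feeds back into the semantic lexicon.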

The algorithm starts with a "seed list" of words that are known to be in-category. They used around 5 words for their examples. The algorithm then uses a feedback loop, where the semantic lexicon is used to improve the pattern dictionary, and the pattern dictionary is used to improve the semantic lexicon. In particular, they repeat the following loop:

  • Find the pattern that best matches the current wordlist. "Best match" is determined by a combination of precision and coverage. Add it to the pattern list.
  • Add all words & phrases generated by the new pattern to the wordlist.
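
The loop above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: patterns are represented as a precomputed dict from pattern string to the set of phrases it extracts, and the score (precision times log2 of the number of in-lexicon hits) is just one way to combine precision and coverage. The names `score_pattern` and `mutual_bootstrap` are hypothetical.

```python
import math

def score_pattern(extractions, lexicon):
    """Assumed score: precision times log2 of the in-lexicon hit count."""
    hits = len(extractions & lexicon)
    if hits == 0:
        return 0.0
    return (hits / len(extractions)) * math.log2(hits)

def mutual_bootstrap(pattern_extractions, seeds, max_iters=10):
    """pattern_extractions maps a pattern string to the set of phrases it extracts."""
    lexicon = set(seeds)
    chosen = []
    for _ in range(max_iters):
        candidates = [p for p in pattern_extractions if p not in chosen]
        if not candidates:
            break
        # Step 1: pick the pattern that best matches the current wordlist.
        best = max(candidates,
                   key=lambda p: score_pattern(pattern_extractions[p], lexicon))
        if score_pattern(pattern_extractions[best], lexicon) <= 0.0:
            break  # no remaining pattern has enough support in the lexicon
        chosen.append(best)
        # Step 2: add everything the new pattern extracts to the wordlist.
        lexicon |= pattern_extractions[best]
    return lexicon, chosen

toy = {
    "operates in <x>": {"bolivia", "peru", "colombia", "the city"},
    "offices in <x>": {"peru", "colombia", "chile"},
    "sold to <x>": {"investors", "peru"},
}
lexicon, chosen = mutual_bootstrap(toy, {"peru", "colombia"})
```

On this toy data the loop picks "offices in <x>" first, then "operates in <x>", and stops before "sold to <x>". Note that "the city" ends up in the lexicon along the way, which is exactly the kind of out-of-category word discussed next.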

One problem with this approach is that it tends to generate out-of-category words. These set up a vicious feedback loop: out-of-category words produce out-of-category patterns, which produce more out-of-category words, and so on. To avoid this, they use "multi-level bootstrapping": they run the basic algorithm repeatedly, and after each run they keep only the 5 best new words, throwing away the rest of the new words and all of the patterns. To find the 5 best words, they check which words are generated by the most patterns. That way, they can be more confident that they're only adding in-category words.
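
The word-selection step of the outer loop might look like the sketch below. It ranks candidate words purely by how many of the chosen patterns extracted them, as described above; the function name `best_new_words`, the dict-of-sets input, and the alphabetical tie-break are assumptions of this sketch.

```python
from collections import Counter

def best_new_words(pattern_extractions, chosen_patterns, permanent_lexicon, k=5):
    """Return the k new words extracted by the most chosen patterns."""
    counts = Counter()
    for pattern in chosen_patterns:
        for word in pattern_extractions[pattern]:
            if word not in permanent_lexicon:
                counts[word] += 1
    # Highest count first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

toy = {
    "operates in <x>": {"bolivia", "peru", "colombia", "the city"},
    "offices in <x>": {"peru", "colombia", "chile", "bolivia"},
    "traveled to <x>": {"peru", "chile", "bolivia"},
}
kept = best_new_words(toy, list(toy), {"peru", "colombia"}, k=2)
```

In a full outer loop, the words in `kept` would be added to the permanent lexicon, everything else from the run would be discarded, and the basic algorithm would restart from the enlarged lexicon. With `k=2` on this toy data, "the city" is dropped because only one pattern extracts it.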

The results are not spectacular, and I'm not convinced that the approach would scale well to large wordlists (the wordlists they generate contain only ~150 words). But it's possible that the approach could be improved, e.g., by adding weights to words, or by starting with a larger seed list.

BibTeX

@InProceedings{riloff1999,
  author =       {Ellen Riloff and Rosie Jones},
  title =        {Learning Dictionaries for Information Extraction
                  by Multi-Level Bootstrapping},
  booktitle =    {Proceedings of the Sixteenth National Conference
                  on Artificial Intelligence},
  year =         1999
}
