An Algorithm that Learns What's in a Name

Bikel, Schwartz, Weischedel 1999

Bikel et al. describe IdentiFinder, an HMM-based named-entity detector and classifier. Words are processed sequentially, and each word is assigned a tag that is either a name category or "not-a-name." The system uses a generative model with three steps for each chunk/chink (that is, each name or each stretch of non-name words):

  • Generate a name-class (either a category or not-a-name), conditioned on the previous name-class and the previous word.
  • Generate the first word, conditioned on the current and previous name-classes.
  • Generate each subsequent word, conditioned on the current name-class and the previous word.
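The three steps above can be read as a factorization of the joint probability of a tagged sentence. The following is a minimal sketch of that reading, not the authors' implementation: the probability tables, the `<s>` start marker, and all names are hypothetical, and it assumes that a run of identical tags forms a single chunk/chink.

```python
import math

NOT_A_NAME = "NOT-A-NAME"
START = "<s>"  # hypothetical start-of-sentence marker

def sequence_log_prob(words, tags, p_class, p_first, p_next):
    """Log-probability of a tagged sentence under the three-step factorization.

    p_class[(tag, prev_tag, prev_word)] : step 1, generate a name-class
    p_first[(word, tag, prev_tag)]      : step 2, generate the first word
    p_next[(word, tag, prev_word)]      : step 3, generate a subsequent word
    """
    logp = 0.0
    prev_tag, prev_word = START, START
    for word, tag in zip(words, tags):
        if tag != prev_tag:
            # A new chunk/chink starts: steps 1 and 2.
            logp += math.log(p_class[(tag, prev_tag, prev_word)])
            logp += math.log(p_first[(word, tag, prev_tag)])
        else:
            # Same chunk/chink continues: step 3.
            logp += math.log(p_next[(word, tag, prev_word)])
        prev_tag, prev_word = tag, word
    return logp

# Toy probabilities, invented purely for illustration.
p_class = {(NOT_A_NAME, START, START): 0.5, ("PERSON", NOT_A_NAME, "Mr."): 0.8}
p_first = {("Mr.", NOT_A_NAME, START): 0.1, ("John", "PERSON", NOT_A_NAME): 0.2}
p_next = {("Jones", "PERSON", "John"): 0.25}

words = ["Mr.", "John", "Jones"]
tags = [NOT_A_NAME, "PERSON", "PERSON"]
logp = sequence_log_prob(words, tags, p_class, p_first, p_next)
# exp(logp) is the product of the five factors: 0.5 * 0.1 * 0.8 * 0.2 * 0.25
```

Because the sketch starts a new chunk only when the tag changes, it cannot represent two adjacent names of the same category; a real system would need some way to mark where a name ends.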

Backoff and smoothing are used, together with word features. Viterbi decoding is used to tag new text. (Note: although they call it an HMM, it seems like a normal (non-hidden) Markov model to me.)
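To make the decoding step concrete, here is a hedged sketch of Viterbi search over name-class tags, where each step is scored using the previous tag and previous word as in the three generation steps. The table layout, the `<s>` marker, and the constant `FLOOR` are assumptions made for illustration; in particular, `FLOOR` is only a crude stand-in for the paper's backoff and smoothing, which are not reproduced here.

```python
import math

NOT_A_NAME = "NOT-A-NAME"
START = "<s>"  # hypothetical start-of-sentence marker
FLOOR = 1e-6   # crude stand-in for backoff/smoothing of unseen events

def step_log_prob(tag, word, prev_tag, prev_word, p_class, p_first, p_next):
    """Log-score for generating `word` with `tag` after (prev_tag, prev_word)."""
    if tag != prev_tag:  # a new chunk/chink starts here
        p = (p_class.get((tag, prev_tag, prev_word), FLOOR)
             * p_first.get((word, tag, prev_tag), FLOOR))
    else:                # the current chunk/chink continues
        p = p_next.get((word, tag, prev_word), FLOOR)
    return math.log(p)

def viterbi(words, tagset, p_class, p_first, p_next):
    """Return the highest-scoring tag sequence for `words`."""
    if not words:
        return []
    tables = (p_class, p_first, p_next)
    # best[t] = (log-score of the best path ending in tag t, that path)
    best = {t: (step_log_prob(t, words[0], START, START, *tables), [t])
            for t in tagset}
    for i in range(1, len(words)):
        prev_word, word = words[i - 1], words[i]
        new_best = {}
        for t in tagset:
            # Pick the best predecessor tag for ending in t at position i.
            score, path = max(
                ((best[pt][0] + step_log_prob(t, word, pt, prev_word, *tables),
                  best[pt][1]) for pt in tagset),
                key=lambda sp: sp[0])
            new_best[t] = (score, path + [t])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]

# Toy probabilities, invented purely for illustration.
p_class = {(NOT_A_NAME, START, START): 0.5, ("PERSON", NOT_A_NAME, "Mr."): 0.8}
p_first = {("Mr.", NOT_A_NAME, START): 0.1, ("John", "PERSON", NOT_A_NAME): 0.2}
p_next = {("Jones", "PERSON", "John"): 0.25}

decoded = viterbi(["Mr.", "John", "Jones"], [NOT_A_NAME, "PERSON"],
                  p_class, p_first, p_next)
print(decoded)  # → ['NOT-A-NAME', 'PERSON', 'PERSON']
```

Since the words are observed and only the tags are searched over, the dynamic program needs one cell per tag per position, which keeps decoding linear in sentence length.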

The results for NE tagging are pretty good, and IdentiFinder does well on upper-case and foreign-language data. In the last few sections of the article, Bikel et al. discuss the effects of case information, word features, amount of training data, etc., on the system. They also note that treating punctuation as separate words can be both good and bad: it helps delimit names, but it can push useful context words out of the model's reach. (Note that a maxent model could get around this by using bag-of-words context features.)
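As a rough illustration of the maxent idea in the note above, the sketch below collects bag-of-words context features while skipping punctuation tokens, so that a comma next to a name does not use up a context slot. The function name, the `ctx=` feature prefix, and the window size are all hypothetical choices, not something taken from the paper.

```python
import string

def context_bag(tokens, i, window=3):
    """Bag-of-words features from up to `window` non-punctuation tokens
    on each side of position i."""
    def is_punct(tok):
        return all(ch in string.punctuation for ch in tok)

    left = [t.lower() for t in tokens[:i] if not is_punct(t)][-window:]
    right = [t.lower() for t in tokens[i + 1:] if not is_punct(t)][:window]
    # A set discards order and position, which is what makes it a "bag".
    return {f"ctx={w}" for w in left + right}

tokens = ["Smith", ",", "president", "of", "Acme", ",", "said"]
features = context_bag(tokens, 0, window=2)
# features contains "ctx=president" and "ctx=of"; the comma is skipped
```

With punctuation kept as ordinary tokens and a fixed two-position context, the comma would occupy one of the two slots after "Smith"; here both slots go to real words.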

See Also: Borthwick, Sterling, Agichtein, Grishman 1998

Bibtex

@article{bikel1999,
   author = "Daniel M. Bikel and Richard L. Schwartz and Ralph M. Weischedel",
   title = "An Algorithm that Learns What's in a Name",
   journal = "Machine Learning",
   volume = "34",
   number = "1-3",
   pages = "211-231",
   year = "1999",
   url = "citeseer.nj.nec.com/bikel99algorithm.html"
}
