Learning with Scope, with Application to Information Extraction and Classification

Blei, Bagnell, McCallum 2002

Blei et al. present a framework for exploiting local regularities in data. For example, when performing information extraction (IE) on web pages, the pages on a single site tend to share formatting features, and we should be able to take advantage of those features. The basic idea is to use a set of hidden collection-specific parameters that account for the similarities within each collection.

To keep things concrete, they present the framework in terms of IE for web pages. In this case, collections are web sites, and the hidden collection-specific parameters control document formatting. They present two models: a generative model, where the category is generated first and then generates the words and the formatting (using the site-specific parameters); and a discriminative model, where the category is generated based on the formatting features. Both models are expressed as Bayesian networks in the paper.
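
As a rough illustration of the generative model, the sketch below scores a document's category by combining a global word model with a site-specific formatting model, in the style of naive Bayes. This is only a minimal sketch under stated assumptions, not the paper's implementation: the function name, the category names, the single formatting feature per document, and all probability values are hypothetical.

```python
import math

def class_posterior(words, fmt, prior, word_probs, site_fmt_probs):
    """Posterior over categories for one document.

    Assumed factorization (illustrative only):
        p(c) * prod_w p(w | c) * p(fmt | c, site)
    where word_probs is shared across all sites and
    site_fmt_probs holds the parameters for one site.
    """
    log_scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in words:
            score += math.log(word_probs[c][w])      # global word model
        score += math.log(site_fmt_probs[c][fmt])    # site-specific formatting
        log_scores[c] = score
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    z = sum(math.exp(v - top) for v in log_scores.values())
    return {c: math.exp(v - top) / z for c, v in log_scores.items()}

# Hypothetical toy parameters: on this site, titles are usually bold.
prior = {"title": 0.5, "other": 0.5}
word_probs = {
    "title": {"learning": 0.6, "click": 0.4},
    "other": {"learning": 0.4, "click": 0.6},
}
site_fmt_probs = {
    "title": {"bold": 0.9, "plain": 0.1},
    "other": {"bold": 0.2, "plain": 0.8},
}

posterior = class_posterior(["learning"], "bold", prior, word_probs, site_fmt_probs)
print(posterior)
```

In this toy setup the words alone only weakly favor "title", and the site-specific formatting term is what sharpens the decision. That is the intuition behind scoping parameters to a collection.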

They describe two methods for estimating the hidden parameters: MAP estimation (using EM to find the maximum a posteriori values) and a variational approach.
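
The MAP-via-EM idea can be sketched for a new, unlabeled site as follows. The E-step computes category posteriors for each document using the current site-specific formatting parameters, and the M-step re-estimates those parameters from the expected counts plus pseudo-counts. This is a hedged sketch, not the paper's algorithm: the function name, the toy data, and the choice of prior (pseudo-counts, which correspond to a symmetric Dirichlet prior with parameter `pseudo + 1`) are assumptions made for illustration.

```python
import math

def em_site_format_params(docs, prior, word_probs, fmt_values, pseudo=1.0, iters=20):
    """Estimate site-specific formatting parameters with EM (MAP flavor).

    docs: list of (words, fmt) pairs from one site, with unknown categories.
    Returns phi[c][f], an estimate of p(fmt = f | category = c) for this site.
    """
    cats = list(prior)
    # Start from uniform formatting distributions for every category.
    phi = {c: {f: 1.0 / len(fmt_values) for f in fmt_values} for c in cats}
    for _ in range(iters):
        # E-step: accumulate expected counts, starting from the pseudo-counts.
        counts = {c: {f: pseudo for f in fmt_values} for c in cats}
        for words, fmt in docs:
            log_s = {
                c: math.log(prior[c])
                + sum(math.log(word_probs[c][w]) for w in words)
                + math.log(phi[c][fmt])
                for c in cats
            }
            top = max(log_s.values())
            z = sum(math.exp(v - top) for v in log_s.values())
            for c in cats:
                counts[c][fmt] += math.exp(log_s[c] - top) / z
        # M-step: normalize the expected counts into distributions.
        phi = {
            c: {f: counts[c][f] / sum(counts[c].values()) for f in fmt_values}
            for c in cats
        }
    return phi

# Hypothetical toy site: title-like documents are bold, the rest are plain.
prior = {"title": 0.5, "other": 0.5}
word_probs = {
    "title": {"learning": 0.9, "click": 0.1},
    "other": {"learning": 0.1, "click": 0.9},
}
docs = [(["learning"], "bold")] * 3 + [(["click"], "plain")] * 3

phi = em_site_format_params(docs, prior, word_probs, ["bold", "plain"])
print(phi)
```

Here the global word model supplies the initial evidence about categories, and EM uses it to discover that bold text signals a title on this particular site, without any labeled documents from that site.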

The results are fairly encouraging, with better performance than a global naive Bayes model, especially when the variational approach is used to estimate the hidden parameters.


  author =       {David M. Blei and J. Andrew Bagnell
                  and Andrew K. McCallum},
  title =        {Learning with Scope, with Application to
                  Information Extraction and Classification},
  booktitle =    {Proceedings for Uncertainty in Artificial
                  Intelligence (UAI)},
  year =         2002