Learning Accurate, Compact, and Interpretable Tree Annotation

Petrov, Barrett, Thibaux, Klein 2006

Petrov, Barrett, Thibaux, & Klein use an automatic approach to tree annotation that is similar to the approach taken by Matsuzaki, Miyao, Tsujii 2005. But their approach differs from the approach taken by Matsuzaki et al in that they split various nonterminals to different degrees, as appropriate to the actual complexity in the data. For example, their system finds that the preposition phrase tag (PP) should be split into 28 distinct categories, while just 2 categories are sufficient to model conjunction phrases (CONJP).

Another important difference between their system and the PCFG-LA system described by Matsuzaki et al is that node decompositions are performed incrementally, via binary splits. This incremental approach gives rise to a tree of node labels, which are much more amenable to linguistic interpretation than the categories generated by the PCFG-LA system.

The learning algorithm for this system begins with the original set of PCFG labels. It then iteratively performs three steps: split, EM, and merge. The split step divides each node label into two new labels; and divides the probability mass of the associated PCFG productions between these new labels. In order to break the symmetry between the new labels, a small amount of randomness is added to the PCFG production probabilities. The EM step uses Expectation Maximization to learn probabilities for all rules by optimizing the likelihood of the training data. The merge step then examines each split that was made, and estimates what the effect would be of removing the split. If the effect is small enough, then the two split nodes merged back together. This merge operation can be thought of as analogous to the pruning step in the construction of decision trees, where decision structures that do not significantly improve performance are pruned away to reduce the number of parameters that the model must learn, thereby avoiding over-fitting. This split-merge procedure is used because it is much easier to estimate what the effect of a merge will be than it is to estimate what the effect of a split will be.

Like the PCFG-LA system, this system does not define a one-to-one mapping between canonical values and transformed values: a single canonical tree will correspond to a relatively large set of annotated trees. As a result, calculating the best unannotated tree for a given sentence is NP-hard. Petrov et al therefore perform parsing using an algorithm that maximizes the total number of correct productions, rather than the probability of the unannotated parse.

Detailed Notes


Klein et al describe a method for improving parser performance by splitting nonterminal symbols into new symbols. The usage patterns of these new symbols are learned using EM. This is related to lexicalization, in that it adds info to symbols, but different in a few ways:

  1. Instead of building a new symbol for each head, we only add symbols where we need them.
  2. The new symbols can be used in a wider variety of ways (e.g., to propogate info around the tree.)


  1. Binarize every tree in the corpus (using left-branching binarization).
  2. For each iteration:

    1. Splitting:

      • Split all symbols in two (e.g., NP -> NP-1 nad NP-2).
      • When we split a symbol, we duplicate all rules that use it.
      • Add a small amount of randomness to break the symmetry between the original symbols. (i.e., multiply every rule prob by a random number from .99 to 1.01 or so).
    2. Use EM to learn the probabilities of all rules.

      • E-step: treat the probabilities of rules (e.g., NP->DT NN) as constant, and find the posterior probability of using each rule at a given position.
      • M-step: use those posterior probabilities as weighted observations to update the rule probabilities.
    3. Merging:

      • For each split that we made, estimate how much our score would decrease if we merged them back together. If the result is small enough, then perform the merge.
  1. Smoothing:

    • minimize overfitting by taking a weighted average between the original class and the split classes. Almost all of the weight (99%) is put on the split class.
  2. Parsing:

    • So now we have a set of weighted rules that generate an annotated output.
    • But we don't want the most likely annotated output; we want the most likely overall output.
    • But finding this is intractible.
    • So use an algorithm that maximizes the total number of correct rules we generate. (This actually more directly maximizes LP and LR, even though they may not be what we really care about.)
    • They actually use a "corse-to-fine pruning" scheme, where they start out parsing with the original grammar, and then parse with the augmented grammar, pruning out any chart edges that look too unlikely.

Grammar Analysis

Klein et al analyzed the grammar generated by their system, and found that many of the split-categories that were generated were interpretable. For example, determiners were split between demonstratives, definites, indefinites, etc. The non-terminal symbols are harder to interpret, but they found that they were often used to propagate information from one place to another. E.g., to allow the distribution of determiners to be different, depending on whether the NP was in subject or object position.

Ideas for My Thesis

  • Klein et al used a split-merge procedure because it's hard to decide what effect a split will have, but much easier to approximate the difference that a merge will have. I could make use of a similar procedure, perhaps -- do a bunch of splits at once, and then see what effect re-merging them would have?
  • Klein et al use an EM algorithm to decide how exactly the new symbols should be used. This is very interesting to me, because it certainly differs from the way I was envisioning new symbols, where we would decide how they should be used somewhat a priori. Would it be better to use something like EM to try to figure out what the best set of symbols is?
  • Klein et al talk about the problem of finding the best overall parse, as opposed to the best annotated parse. This may be a similar question for me -- if there are multiple annotations that map to the same output value, then do I want to find the output value whose sum is the highest? If so, how do I do that? Hrm. Well, one possibility is to make use of the fact that some of the splits I'm considering are dependant on features of the input.. So constrain the system to only generate sequences that are consistant with the input?