Can Subcategorization Help a Statistical Dependency Parser?

Zeman 2002

Zeman presents a statistical dependency parser for Czech, and shows that adding features based on verb subcategorization improves performance. Czech has a rich morphological case system, and relatively free word order.

The baseline system is a statistical dependency parser that predicts dependencies between words based on their morphological category, distance, and direction. In addition, several constraints, such as projectivity, are imposed to eliminate incorrect parses.
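
To make the projectivity constraint concrete, here is a minimal sketch of a projectivity check. It is not Zeman's implementation. The representation (a list `heads` where `heads[i]` is the index of word i's head and -1 marks the root) and the function name `is_projective` are assumptions made for illustration.

```python
def is_projective(heads):
    """Return True if the dependency tree has no non-projective arcs.

    heads[i] is the index of the head of word i, or -1 for the root
    (a hypothetical encoding, not taken from the paper).
    """
    n = len(heads)

    def dominated_by(node, ancestor):
        # Walk up the head chain from node; the step counter guards
        # against malformed input that contains a cycle.
        steps = 0
        while node != -1 and steps <= n:
            if node == ancestor:
                return True
            node = heads[node]
            steps += 1
        return False

    for dep, head in enumerate(heads):
        if head == -1:
            continue
        lo, hi = min(dep, head), max(dep, head)
        # Every word strictly between head and dependent must be a
        # descendant of the head; otherwise the arc is non-projective.
        for k in range(lo + 1, hi):
            if not dominated_by(k, head):
                return False
    return True
```

A parser that enforces this constraint would discard (or never build) any candidate structure for which such a check fails.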

The first extension lexicalizes prepositions, conjunctions, the verb "to be", and (non-adjectival) adverbs. It also adjusts the probabilities of dependencies involving verbs by summing their nonlexicalized and lexicalized probabilities. (Note: the result is no longer a true probability distribution, but it increases performance.) Together, these changes barely affect performance (77.0 → 77.1).
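
The summing step can be sketched as follows. This is only an illustration under stated assumptions: the table names `nonlex_prob` and `lex_prob`, the choice of keys (head tag or head lemma paired with the dependent's tag), and the numbers are all hypothetical; the summary above does not specify how the tables are indexed.

```python
def combined_score(nonlex_prob, lex_prob, head, dep):
    """Sum the tag-based and lemma-based estimates for one dependency.

    The result is a score, not a probability: summing two
    distributions does not yield a normalized distribution.
    """
    tag_key = (head["tag"], dep["tag"])      # nonlexicalized: tags only
    lex_key = (head["lemma"], dep["tag"])    # lexicalized: head lemma (assumed)
    return nonlex_prob.get(tag_key, 0.0) + lex_prob.get(lex_key, 0.0)


# Hypothetical toy tables, for illustration only.
nonlex_prob = {("VERB", "NOUN-acc"): 0.25}
lex_prob = {("dát", "NOUN-acc"): 0.5}

head = {"lemma": "dát", "tag": "VERB"}
dep = {"lemma": "kniha", "tag": "NOUN-acc"}
score = combined_score(nonlex_prob, lex_prob, head, dep)
```

Because the parser only needs to rank competing dependencies, an unnormalized score like this can still be used to choose among them.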

A subcategorization dictionary is then used to give preference to "subcategorized dependencies," i.e., dependencies that are listed in the dictionary. (The dictionary specifies case marking for the subcategorized dependencies.) This increases performance by about 1.6 points (77.1 → 78.7). All performance scores are measured as accuracy over dependencies.
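
As a rough sketch of the two ideas in this paragraph, the code below boosts the score of a dependency whose case appears in the verb's dictionary entry, and computes accuracy over dependencies as the fraction of words whose predicted head matches the gold head. The multiplicative `boost`, the dictionary layout, and the function names are assumptions; the summary does not say how the preference is actually applied.

```python
def prefer_subcategorized(score, verb_lemma, dep_case, subcat, boost=2.0):
    """Scale up the score if the verb subcategorizes for this case.

    subcat maps a verb lemma to the set of cases it licenses
    (hypothetical layout); boost is an arbitrary illustrative factor.
    """
    if dep_case in subcat.get(verb_lemma, set()):
        return score * boost
    return score


def dependency_accuracy(predicted_heads, gold_heads):
    """Fraction of words that were attached to the correct head."""
    if len(predicted_heads) != len(gold_heads):
        raise ValueError("sequences must have the same length")
    if not gold_heads:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted_heads, gold_heads))
    return correct / len(gold_heads)


# Hypothetical entry: "dát" (to give) takes accusative and dative objects.
subcat = {"dát": {"acc", "dat"}}
boosted = prefer_subcategorized(0.2, "dát", "acc", subcat)
unchanged = prefer_subcategorized(0.2, "dát", "ins", subcat)
accuracy = dependency_accuracy([1, -1, 1, 2], [1, -1, 1, 1])
```

In this toy run, three of the four predicted heads match the gold heads, so the accuracy is 0.75.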


  author =       {Daniel Zeman},
  title =        {Can Subcategorization Help a Statistical Dependency Parser?},
  booktitle =    {Proceedings of the 19th International Conference on Computational Linguistics},
  year =         2002,
  address =      {Taipei, Taiwan}