A Fast, Accurate Deterministic Parser for Chinese

Wang, Sagae, & Mitamura 2006

Mengqui Wang, Kenji Sagae, & Teruko Mitamura
ACL 2006
pp. 425-432


Wang et al define a chinese parser that uses SVM (or other classifiers) to do shift-reduce parsing in a single pass. They also look at combining classifiers. The main goal is to do fast parsing -- since parsing is done in a single pass, it is much faster than PCFG-type algorithms.

Parsing Model

Wang et al use a classic shift-reduce parser. At each step, the classifier must decide whether to shift a new token onto the stack, or to perform a reduction. Reduction can be unary or binary, and binary reductions are marked with a head-direction. (This head-selection is used as a feature for later decisions). If the classifier selects an invalid action given the state, then the parse fails.


Wang et all tried several classifiers: SVMs with degree 2 polynomial kernel; maxent; decision trees; and memory-based learning. SVMs give the best performance, decision trees are fastest, and maxent is in the middle. (No paired features for maxent, which might explain its lower performance.)


I'm still a little unclear on the features used. E.g., some of the features below may actually be combined?? See the paper for more details (what little there is), but here is a basic list:

  • boolean: are we expecting a close punctuation mark?
  • boolean: is queue empty?
  • boolean: is there a comman between stack[-1] and stack[-2]
  • last action by the classifier
  • number of words in stack[-1]
  • number of words in stack[-2]
  • headword of stack[i], i=[-1, -2, -3, -4]
  • part of speech of stack[i], i=[-1, -2, -3, -4]
  • headword of queue[i], i=[1, 2, 3, 4]
  • part of speech of queue[i], i=[1, 2, 3, 4]
  • nonterminal label of stack[-1]
  • nonterminal label of stack[-2]
  • number of punctuation chars in stack[-1]
  • number of punctuation chars in stack[-2]
  • rhythmic features of stack[-1] and of stack[-2]
  • linear distance between headwords of stack[-1] and stack[-2]
  • nonterminal label, pos, and hedadword of left & right child of stack[-1] and stack[-2]
  • etc.