Tags: |
---|
A Fast, Accurate Deterministic Parser for Chinese
Summary
Wang et al define a chinese parser that uses SVM (or other classifiers) to do shift-reduce parsing in a single pass. They also look at combining classifiers. The main goal is to do fast parsing -- since parsing is done in a single pass, it is much faster than PCFG-type algorithms.
Parsing Model
Wang et al use a classic shift-reduce parser. At each step, the classifier must decide whether to shift a new token onto the stack, or to perform a reduction. Reduction can be unary or binary, and binary reductions are marked with a head-direction. (This head-selection is used as a feature for later decisions). If the classifier selects an invalid action given the state, then the parse fails.
Classifiers
Wang et all tried several classifiers: SVMs with degree 2 polynomial kernel; maxent; decision trees; and memory-based learning. SVMs give the best performance, decision trees are fastest, and maxent is in the middle. (No paired features for maxent, which might explain its lower performance.)
Features
I'm still a little unclear on the features used. E.g., some of the features below may actually be combined?? See the paper for more details (what little there is), but here is a basic list:
- boolean: are we expecting a close punctuation mark?
- boolean: is queue empty?
- boolean: is there a comman between stack[-1] and stack[-2]
- last action by the classifier
- number of words in stack[-1]
- number of words in stack[-2]
- headword of stack[i], i=[-1, -2, -3, -4]
- part of speech of stack[i], i=[-1, -2, -3, -4]
- headword of queue[i], i=[1, 2, 3, 4]
- part of speech of queue[i], i=[1, 2, 3, 4]
- nonterminal label of stack[-1]
- nonterminal label of stack[-2]
- number of punctuation chars in stack[-1]
- number of punctuation chars in stack[-2]
- rhythmic features of stack[-1] and of stack[-2]
- linear distance between headwords of stack[-1] and stack[-2]
- nonterminal label, pos, and hedadword of left & right child of stack[-1] and stack[-2]
- etc.