Voting between Multiple Data Representations for Text Chunking
Hong Shen & Anoop Sarkar, 2005. Advances in Artificial Intelligence: 18th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI 2005, Victoria, Canada, May 9-11, 2005. Proceedings
Summary
Shen & Sarkar get improved chunking performance by using 5 different 2nd order HMMs, each trained with a different output representation, and combining their results by majority voting.
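As a rough illustration of the combination step, the sketch below does per-token majority voting, assuming each model's output has already been converted back to a single common tag set (the function name and example are mine, not the paper's):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-token tag sequences from several chunkers by majority vote.

    predictions: a list of tag sequences (one per model), all over the same tokens
    and already converted to a common representation such as IOB2.
    Ties are broken arbitrarily by Counter.most_common.
    """
    combined = []
    for token_tags in zip(*predictions):          # tags proposed for one token
        tag, _count = Counter(token_tags).most_common(1)[0]
        combined.append(tag)
    return combined

# Example: three models voting over a four-token sentence
print(majority_vote([["B", "I", "O", "B"],
                     ["B", "I", "O", "I"],
                     ["B", "B", "O", "B"]]))      # ['B', 'I', 'O', 'B']
```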
Task
Shen & Sarkar looked at two data sets: the "Base NP" data set and the CoNLL-2000 chunking data set. In the Base NP data set, the task is to identify NP chunks. In the CoNLL-2000 data set, the task is to identify chunks of a variety of types (noun phrase, verb phrase, etc.) and to correctly label each chunk with its type.
In both cases, performance appears to be evaluated on a per-token basis, using each corpus's native tagging format, and only for tokens contained in chunks. That is, precision is the number of correct non-O tags divided by the number of non-O tags in the output, and recall is the number of correct non-O tags divided by the number of non-O tags in the gold-standard data. (It's worth noting that the article claims at one point that evaluation is done per-chunk, but this seems directly contradicted by several other parts of the article.)
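Under that reading, per-token precision, recall, and F-score can be sketched as follows (this is my reconstruction of the metric, not code from the paper):

```python
def chunk_tag_scores(predicted, gold):
    """Per-token precision/recall/F over non-O tags, as described above.

    predicted, gold: equal-length lists of chunk tags ('O', 'B', 'I', ...).
    """
    correct    = sum(1 for p, g in zip(predicted, gold) if p != "O" and p == g)
    pred_non_o = sum(1 for p in predicted if p != "O")
    gold_non_o = sum(1 for g in gold if g != "O")
    precision = correct / pred_non_o if pred_non_o else 0.0
    recall    = correct / gold_non_o if gold_non_o else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```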
Output Representations
Shen & Sarkar considered the following output representations:
- IOB1: O=a token that is "outside" a chunk; B=a token that begins a chunk that immediately follows another chunk; I=a token that is "inside" a chunk, but is not marked B.
- IOB2: O=a token that is "outside" a chunk; B=a token that begins a chunk; I=a token that is "inside" a chunk, but is not marked B.
- IOE1: O=a token that is "outside" a chunk; E=a token that ends a chunk that is immediately followed by another chunk; I=a token that is "inside" a chunk, but is not marked E.
- IOE2: O=a token that is "outside" a chunk; E=a token that ends a chunk; I=a token that is "inside" a chunk, but is not marked E.
- O+C: O=a token that is "outside" a chunk; B=a token that begins a multiword chunk; E=a token that ends a multiword chunk; S=a token that forms a single-word chunk; I=a token that is "inside" a chunk, but is not marked B, E, or S.
Examples:
word | IOB1 | IOB2 | IOE1 | IOE2 | O+C |
---|---|---|---|---|---|
In | O | O | O | O | O |
early | I | B | I | I | B |
trading | I | I | I | E | E |
in | O | O | O | O | O |
Hong | I | B | I | I | B |
Kong | I | I | E | E | E |
Monday | B | B | I | E | S |
, | O | O | O | O | O |
gold | I | B | I | E | S |
was | O | O | O | O | O |
quoted | O | O | O | O | O |
at | O | O | O | O | O |
$ | I | B | I | I | B |
366.50 | I | I | E | E | E |
an | B | B | I | I | B |
ounce | I | I | I | E | E |
. | O | O | O | O | O |
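To make the encodings concrete, here is a small sketch (my own, not from the paper) that generates each tag set from a sentence's chunk spans:

```python
def encode(n_tokens, chunks, scheme):
    """Tag a sentence of n_tokens tokens whose chunks are given as (start, end) spans.

    chunks: non-overlapping, end-exclusive spans in order, e.g. [(1, 3), (4, 6), (6, 7)].
    scheme: one of "IOB1", "IOB2", "IOE1", "IOE2", "O+C".
    """
    tags = ["O"] * n_tokens
    starts = {s for s, _ in chunks}
    ends = {e for _, e in chunks}
    for start, end in chunks:
        for i in range(start, end):
            tags[i] = "I"
        if scheme == "O+C":
            if end - start == 1:
                tags[start] = "S"                    # single-token chunk
            else:
                tags[start], tags[end - 1] = "B", "E"
        elif scheme == "IOB2":
            tags[start] = "B"
        elif scheme == "IOE2":
            tags[end - 1] = "E"
        elif scheme == "IOB1" and start in ends:     # chunk starts right after another chunk
            tags[start] = "B"
        elif scheme == "IOE1" and end in starts:     # chunk ends right before another chunk
            tags[end - 1] = "E"
    return tags

# "In early trading in Hong Kong Monday ," with chunks [early trading], [Hong Kong], [Monday]
print(encode(8, [(1, 3), (4, 6), (6, 7)], "IOE1"))
# ['O', 'I', 'I', 'O', 'I', 'E', 'I', 'O']
```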
Model
The machine learning model used by Shen & Sarkar was based on a 2nd order HMM:
      ____     ____     ____     ____
     /    V   /    V   /    V   /    V
    y1 ---> y2 ---> y3 ---> y4 --->
     |       |       |       |
     V       V       V       V
    x1      x2      x3      x4
Here, x[i] is the i'th "input" and y[i] is the i'th "output." Shen & Sarkar examine several variations on this model, which basically come down to redefining what counts as the "input" and what counts as the "output." The two basic models are (where the chunk tag is one of I, O, B, E, or S):
Model | Input | Output |
---|---|---|
basic trigram | POS | chunk tag |
SP | POS | (POS, chunk tag) |
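As a rough sketch of how a model of this shape scores a tagged sentence (the table names and the smoothing floor below are my own, not the paper's):

```python
from math import log

def sequence_log_prob(inputs, outputs, trans, emit, floor=1e-12):
    """Log-probability of an (input, output) sequence under a 2nd-order (trigram) HMM.

    trans[(y_prev2, y_prev1, y)] -- P(y | y_prev2, y_prev1), a trigram model over outputs
    emit[(y, x)]                 -- P(x | y), the input generated by the current output
    The `floor` stands in for whatever smoothing the real model would use.
    """
    score = 0.0
    prev2 = prev1 = "<s>"                                    # start-of-sentence padding states
    for x, y in zip(inputs, outputs):
        score += log(trans.get((prev2, prev1, y), floor))    # transition: two states of history
        score += log(emit.get((y, x), floor))                # emission: depends only on current output
        prev2, prev1 = prev1, y
    return score
```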
What SP is doing is basically multiplying the number of states in the Markov model, which lets its decisions rely on more history information. The remaining models are formed by lexicalizing the input and output for specific words. For example, SP+Lex-WHF uses (POS, word) as input and (POS, word, chunk tag) as output for words that occur at least 100 times in the training data. Again, these lexicalized models are essentially just giving the Markov model more history to use when making decisions.
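A rough sketch of how a single training token might be re-encoded under these variants (the dictionary layout and the fall-back to the SP encoding for rare words are my assumptions, not details from the paper):

```python
def encode_token(word, pos, chunk_tag, frequent_words):
    """Produce the (input, output) pair for one token under each model variant.

    frequent_words: the set of words occurring at least 100 times in the training data.
    """
    variants = {
        "basic trigram": (pos, chunk_tag),                  # input = POS, output = chunk tag
        "SP":            (pos, (pos, chunk_tag)),           # output also carries the POS
    }
    if word in frequent_words:                              # lexicalize frequent words only
        variants["SP+Lex-WHF"] = ((pos, word), (pos, word, chunk_tag))
    else:
        variants["SP+Lex-WHF"] = variants["SP"]             # assumed fall-back for rare words
    return variants

# Example: "trading" tagged NN inside an NP chunk
print(encode_token("trading", "NN", "I", frequent_words={"trading"}))
```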
Evaluation
Shen & Sarkar evaluated their system on the Base NP and CoNLL-2000 data sets. They did achieve an increase in F-score, but it was modest (95.23 for Base NP, vs. the previous best of 94.22). They also point out that their model trains much faster than the previous best system, which was an SVM.