Representing Text Chunks

Sang & Veenstra 1999

  1. EACL

Summary

Sang & Veenstra compare the effect of different data representations on the performance of a memory-based learning NP chunker. Using the best encoding yields a minor performance improvement.

Data Representations

Sang & Veenstra compared the following data representations:

  • IOB1: O=a token that is "outside" a chunk; B=a token that begins a chunk that immediately follows another chunk; I=a token that is "inside" a chunk, but is not marked B.
  • IOB2: O=a token that is "outside" a chunk; B=a token that begins a chunk; I=a token that is "inside" a chunk, but is not marked B.
  • IOE1: O=a token that is "outside" a chunk; E=a token that ends a chunk that immediately follows another chunk; I=a token that is "inside" a chunk, but is not marked E.
  • IOE2: O=a token that is "outside" a chunk; E=a token that ends a chunk; I=a token that is "inside" a chunk, but is not marked E.
  • [+]: The combination of two taggers, one of which assigns '[' or '.' (for beginning of an NP or not); and the other of which assigns ']' or '.'. Chunks are formed by finding all sequences where the first word is tagged '[', the last is tagged ']', and all intervening words are tagged '.' by both taggers.
  • [+IO: The combination of two taggers, one of which assigns '[' or '.'; and the other of which assigns 'I' or 'O'. If every word tagged both '[' and 'I' is changed to 'B', the result can be interpreted as IOB2.
  • IO+]: The combination of two taggers, one of which assigns ']' or '.'; and the other of which assigns 'I' or 'O'. If every word tagged both ']' and 'I' is changed to 'E', the result can be interpreted as IOE2.
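The difference between the IOB-style encodings above is mechanical; as a minimal sketch (not the authors' code), here is how IOB2 tags can be converted to IOB1 and back:

```python
def iob2_to_iob1(tags):
    """IOB2 -> IOB1: keep 'B' only when the chunk follows another chunk."""
    out, prev_in_chunk = [], False
    for t in tags:
        if t == "B":
            out.append("B" if prev_in_chunk else "I")
        else:
            out.append(t)
        prev_in_chunk = t in ("B", "I")
    return out

def iob1_to_iob2(tags):
    """IOB1 -> IOB2: mark every chunk-initial token 'B'."""
    out, prev_in_chunk = [], False
    for t in tags:
        out.append("B" if t == "I" and not prev_in_chunk else t)
        prev_in_chunk = t in ("B", "I")
    return out

# Two adjacent chunks followed by a separated one (toy example)
iob2 = ["O", "B", "I", "B", "O", "B"]
print(iob2_to_iob1(iob2))  # ['O', 'I', 'I', 'B', 'O', 'I']
print(iob1_to_iob2(iob2_to_iob1(iob2)) == iob2)  # True
```

The round trip is lossless, which is why the encodings carry the same information and differ only in what the classifier must learn to predict.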

Learning Model

Sang & Veenstra used a memory-based learning model (IB1-IG). For each test example, it finds the k nearest neighbors in the training data, where distance is computed as the weighted sum of distances between individual features; feature distance is zero for equal features and the feature's information gain for unequal features.
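A minimal sketch of the IB1-IG distance and classification rule, with toy feature vectors and made-up information-gain weights (the real system computes the weights from the training data):

```python
from collections import Counter

def overlap_distance(x, y, gain):
    """IB1-IG style distance: each mismatching feature costs its info-gain weight."""
    return sum(g for xi, yi, g in zip(x, y, gain) if xi != yi)

def classify(x, train, gain, k=3):
    """Majority label among the k training examples nearest to x."""
    nearest = sorted(train, key=lambda ex: overlap_distance(x, ex[0], gain))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy training set: ((word, POS), chunk tag); weights are hypothetical.
train = [
    (("the", "DT"), "I"),
    (("dog", "NN"), "I"),
    (("ran", "VB"), "O"),
    (("fast", "RB"), "O"),
]
gain = (0.3, 0.7)  # hypothetical per-feature information-gain weights
print(classify(("cat", "NN"), train, gain, k=1))  # 'I' (nearest: ('dog', 'NN'))
```

Because mismatches are weighted by information gain, a disagreement on a highly informative feature (here, the POS tag) pushes an example further away than a disagreement on a weakly informative one.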

Features

The basic features are the words & POS tags in the context of the current word. They used held-out data to decide how large the contexts should be for these features. The optimal context size varied with the data representation: e.g., representations with explicit open-bracket information preferred larger left contexts.
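How such a windowed feature vector might be assembled (the function name, padding symbol, and window sizes here are illustrative, not from the paper):

```python
def window_features(words, pos_tags, i, left, right, pad="_"):
    """Words and POS tags in a [-left, +right] window around position i."""
    feats = []
    for off in range(-left, right + 1):
        j = i + off
        if 0 <= j < len(words):
            feats += [words[j], pos_tags[j]]
        else:
            feats += [pad, pad]  # pad positions outside the sentence
    return tuple(feats)

words = ["the", "big", "dog", "barked"]
pos = ["DT", "JJ", "NN", "VBD"]
print(window_features(words, pos, 0, 2, 1))
# ('_', '_', '_', '_', 'the', 'DT', 'big', 'JJ')
```

Varying `left` and `right` per representation, as Sang & Veenstra did on held-out data, only changes the window bounds passed in here.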

Cascaded Classifier

They then built a cascaded classifier that used the output of the basic classifier as an input feature. In particular, when classifying a given word, it looked at the basic classifier's output for the context words, but not for the word itself. (Otherwise, the feature for that word would get too much weight and drown out all other features.)
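One way the cascaded feature vector could be assembled, assuming a first-stage tag sequence is already available (the names and window sizes are hypothetical):

```python
def cascade_features(base_feats, basic_tags, i, left=1, right=1, pad="_"):
    """Append the first-stage classifier's tags for context words only;
    the tag predicted for word i itself is deliberately excluded."""
    extra = []
    for off in list(range(-left, 0)) + list(range(1, right + 1)):
        j = i + off
        extra.append(basic_tags[j] if 0 <= j < len(basic_tags) else pad)
    return base_feats + tuple(extra)

# Hypothetical first-stage predictions for a 4-word sentence
basic_tags = ["O", "B", "I", "O"]
print(cascade_features(("dog", "NN"), basic_tags, 2))  # ('dog', 'NN', 'B', 'O')
```

Skipping offset 0 is the point of the sketch: the second-stage learner sees the neighbors' predicted tags but never its own word's first-stage tag.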

Next, they tried building cascaded classifiers that used the output of several basic classifiers, each of which had different context windows.

Finally, they tried varying the parameter k in 'k nearest neighbors.'

Results

Sang & Veenstra found that IOB1 consistently performed best, but the differences in F-score were fairly minor. There were, however, some differences in the precision/recall tradeoff.