Distributed Training Stragegies for the Structured Perceptron

Ryan McDonald, Keith Hall, and Gideon Mann 2010

Multiclass/structured perceptron

Repeat until convergence:

• Consider training instances in order
• Compute update to weight vector w
• For perceptron: Move weighted vector towards the vector that would give us the right answer for this instance

Error

Error processing latex u'k<\\frac{R^2}{\\gamma^2}'
/bin/sh: /sw/bin/latex: No such file or directory


k = training errors R = radius gamma = margin of best unit separating weight vector

So percetron will converge if the data is seperable

Why do we care about perceptron?

• Fast: no partition function & quick convergence
• Robust to approximations in inference. still works with approximate inference.
• Sparsity with respect to unobserved features
• easy to implement
• close to state-of-the-art performance

Training is slow; we'd like to distribute it.

Batch learning: * Exact learning via mapreduce [chu et al 07]: gradient independence * Parameter mixtures * Trade-off: resource consumption vs exactness

### Prior Work

For distributed learning:

• Break the training set into pieces
• Run learning on those subsets of the data
• Average the weights that the individual models learned
• For Maxent:
• To work, requires "stability" with respect to the parameters
• Good empirical results

For perceptrons, this approach doesn't work well

### Iterative Parameter Mixtures

• Run a single epoch over each shard of the training set.
• Take the average
• Update the individual models
• Repeat

Will this converge/be bounded in training errors? Yes. Sort-of. See the paper for details.

### Experiments

NER: iterative parameter mix gets very good score (same as single training model) very quickly.

Same for Depenency Parsing.

### Asynchronous Perceptron

Weight vector at a shared location; lots of network traffic. You can get some minor differences w/ the single-system training, depending on what the potential delay is.