Ryan McDonald, Keith Hall, and Gideon Mann 2010

Tags:	[NAACL2010]

Distributed Training Stragegies for the Structured Perceptron

Multiclass/structured perceptron

Repeat until convergence:

Consider training instances in order
- Compute update to weight vector w
  - For perceptron: Move weighted vector towards the vector that would give us the right answer for this instance

Error

Error processing latex u'k<\\frac{R^2}{\\gamma^2}'

/bin/sh: /sw/bin/latex: No such file or directory

k = training errors R = radius gamma = margin of best unit separating weight vector

So percetron will converge if the data is seperable

Why do we care about perceptron?

Fast: no partition function & quick convergence
Robust to approximations in inference. still works with approximate inference.
Sparsity with respect to unobserved features
easy to implement
close to state-of-the-art performance

Training is slow; we'd like to distribute it.

Batch learning: * Exact learning via mapreduce [chu et al 07]: gradient independence * Parameter mixtures * Trade-off: resource consumption vs exactness

Prior Work

For distributed learning:

Break the training set into pieces
Run learning on those subsets of the data
Average the weights that the individual models learned
For Maxent:
- To work, requires "stability" with respect to the parameters
- Good empirical results

For perceptrons, this approach doesn't work well

Iterative Parameter Mixtures

Run a single epoch over each shard of the training set.
Take the average
Update the individual models
Repeat

Will this converge/be bounded in training errors? Yes. Sort-of. See the paper for details.

Experiments

NER: iterative parameter mix gets very good score (same as single training model) very quickly.

Same for Depenency Parsing.

Asynchronous Perceptron

Weight vector at a shared location; lots of network traffic. You can get some minor differences w/ the single-system training, depending on what the potential delay is.