Distributed Training Stragegies for the Structured Perceptron

Ryan McDonald, Keith Hall, and Gideon Mann 2010

Multiclass/structured perceptron

Repeat until convergence:

  • Consider training instances in order
    • Compute update to weight vector w
      • For perceptron: Move weighted vector towards the vector that would give us the right answer for this instance


Error processing latex u'k<\\frac{R^2}{\\gamma^2}'
/bin/sh: /sw/bin/latex: No such file or directory

k = training errors R = radius gamma = margin of best unit separating weight vector

So percetron will converge if the data is seperable

Why do we care about perceptron?

  • Fast: no partition function & quick convergence
  • Robust to approximations in inference. still works with approximate inference.
  • Sparsity with respect to unobserved features
  • easy to implement
  • close to state-of-the-art performance

Training is slow; we'd like to distribute it.

Batch learning: * Exact learning via mapreduce [chu et al 07]: gradient independence * Parameter mixtures * Trade-off: resource consumption vs exactness

Prior Work

For distributed learning:

  • Break the training set into pieces
  • Run learning on those subsets of the data
  • Average the weights that the individual models learned
  • For Maxent:
    • To work, requires "stability" with respect to the parameters
    • Good empirical results

For perceptrons, this approach doesn't work well

Iterative Parameter Mixtures

  • Run a single epoch over each shard of the training set.
  • Take the average
  • Update the individual models
  • Repeat

Will this converge/be bounded in training errors? Yes. Sort-of. See the paper for details.


NER: iterative parameter mix gets very good score (same as single training model) very quickly.

Same for Depenency Parsing.

Asynchronous Perceptron

Weight vector at a shared location; lots of network traffic. You can get some minor differences w/ the single-system training, depending on what the potential delay is.