Structured prediction
Machine learning and data mining 

Structured prediction

Machine learning venues

Structured prediction or structured (output) learning is an umbrella term for supervised machine learning techniques that involve predicting structured objects, rather than scalar discrete or real values.^{[1]}
For example, the problem of translating a natural language sentence into a syntactic representation such as a parse tree can be seen as a structured prediction problem in which the structured output domain is the set of all possible parse trees.
Probabilistic graphical models form a large class of structured prediction models. In particular, Bayesian networks and random fields are popularly used to solve structured prediction problems in a wide variety of application domains including bioinformatics, natural language processing, speech recognition, and computer vision. Other algorithms and models for structured prediction include inductive logic programming, structured SVMs, Markov logic networks and constrained conditional models.
Similar to commonly used supervised learning techniques, structured prediction models are typically trained by means of observed data in which the true prediction value is used to adjust model parameters. Due to the complexity of the model and the interrelations of predicted variables the process of prediction using a trained model and of training itself is often computationally infeasible and approximate inference and learning methods are used.
Contents
Example: sequence tagging
Sequence tagging is a class of problems prevalent in natural language processing, where input data are often sequences (e.g. sentences of text). The sequence tagging problem appears in several guises, e.g. partofspeech tagging and named entity recognition. In POS tagging, each word in a sequence must receive a "tag" (class label) that expresses its "type" of word:
The main challenge in this problem is to resolve ambiguity: the word "sentence" can also be a verb in English, and so can "tagged".
While this problem can be solved by simply performing classification of individual tokens, that approach does not take into account the empirical fact that tags do not occur independently; instead, each tag displays a strong conditional dependence on the tag of the previous word. This fact can be exploited in a sequence model such as a hidden Markov model or conditional random field^{[2]} that predicts the entire tag sequence for a sentence, rather than just individual tags, by means of the Viterbi algorithm.
Structured perceptron
One of the easiest ways to understand algorithms for general structured prediction is the structured perceptron of Collins.^{[3]} This algorithm combines the venerable perceptron algorithm for learning linear classifiers with an inference algorithm (classically the Viterbi algorithm when used on sequence data) and can be described abstractly as follows. First define a "joint feature function" Φ(x, y) that maps a training sample x and a candidate prediction y to a vector of length n (x and y may have any structure; n is problemdependent, but must be fixed for each model). Let GEN be a function that generates candidate predictions. Then:
 Let w be a weight vector of length n
 For a predetermined number of iterations:
 For each sample x in the training set with true output t:
 Make a prediction ŷ = arg max {y ∈ GEN(x)} (w^{⊤} Φ(x, y))
 Update w , from ŷ to t: w=w+c(Φ(x, ŷ)+ Φ(x, t)), c is learning rate
 For each sample x in the training set with true output t:
In practice, finding the argmax over GEN(x) will be done using an algorithm such as Viterbi or maxsum, rather than an exhaustive search through an exponentially large set of candidates.
The idea of learning is similar to multiclass perceptron.
See also
 Conditional random field
 Structured support vector machine
 Recurrent neural network, in particular Elman networks (SRNs)
References
 ↑ Gökhan BakIr, Ben Taskar, Thomas Hofmann, Bernhard Schölkopf, Alex Smola and SVN Vishwanathan (2007), Predicting Structured Data, MIT Press.
 ↑ Lafferty, J., McCallum, A., Pereira, F. (2001). "Conditional random fields: Probabilistic models for segmenting and labeling sequence data" (PDF). Proc. 18th International Conf. on Machine Learning. pp. 282–289.CS1 maint: uses authors parameter (link)<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Collins, Michael (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms (PDF). Proc. EMNLP. 10.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 Noah Smith, Linguistic Structure Prediction, 2011.
 Michael Collins, Discriminative Training Methods for Hidden Markov Models, 2002.