== Generative models vs. discriminative models ==
There are two broad classes of methods for determining the parameters of a linear classifier \vec w: generative and discriminative models. Methods of the former model the joint probability distribution P(\vec x, {\rm class}), for example via class-conditional densities P(\vec x|{\rm class}) and class priors, whereas methods of the latter model the conditional probability P({\rm class}|\vec x) directly (a short sketch contrasting the two approaches follows the examples below). Examples of generative algorithms include:
• Linear Discriminant Analysis (LDA)—assumes Gaussian conditional density models
• Naive Bayes classifier with multinomial or multivariate Bernoulli event models.
The second set of methods comprises discriminative models, which attempt to maximize the quality of the output on a training set. Additional terms in the training cost function can easily perform regularization of the final model. Examples of discriminative training of linear classifiers include:
• Logistic regression—maximum likelihood estimation of \vec w assuming that the observed training set was generated by a binomial model that depends on the output of the classifier.
• Perceptron—an algorithm that attempts to fix all errors encountered in the training set.
• Fisher's Linear Discriminant Analysis—an algorithm (distinct from "LDA" above) that maximizes the ratio of between-class scatter to within-class scatter, without any other assumptions. It is in essence a method of dimensionality reduction for binary classification.
• Support vector machine—an algorithm that maximizes the margin between the decision hyperplane and the examples in the training set.
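To make the distinction concrete, the following is a minimal sketch (using NumPy; the toy data and all names are illustrative, not a reference implementation). It obtains a weight vector in two ways: generatively, by estimating class means and a pooled per-feature variance for Gaussian class-conditional densities and reading off the linear boundary they imply, and discriminatively, by fitting P({\rm class}|\vec x) directly through gradient descent on the logistic (log) loss.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: rows are examples x_i with labels y_i in {0, 1}.
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Generative: fit Gaussian class-conditional densities P(x | class) with
# class means and a pooled (shared) per-feature variance; the log-posterior
# ratio log P(1|x)/P(0|x) is then linear in x, which yields a weight vector.
mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
var = 0.5 * (X[y == 0].var(axis=0) + X[y == 1].var(axis=0))  # pooled variance
w_gen = (mu1 - mu0) / var
b_gen = -0.5 * (mu1 @ (mu1 / var) - mu0 @ (mu0 / var))       # equal class priors

# Discriminative: fit P(class | x) directly by minimizing the log loss
# (logistic regression) with plain gradient descent.
w_dis, b_dis = np.zeros(2), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w_dis + b_dis)))   # predicted P(y = 1 | x)
    w_dis -= 0.1 * (X.T @ (p - y)) / len(y)
    b_dis -= 0.1 * np.mean(p - y)

print("generative     w, b:", w_gen, b_gen)
print("discriminative w, b:", w_dis, b_dis)
</syntaxhighlight>

The two weight vectors generally point in similar directions but are not identical: the generative fit maximizes the joint likelihood under its Gaussian assumptions, while the discriminative fit maximizes the conditional likelihood of the labels.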
Note: Despite its name, LDA does not belong to the class of discriminative models in this taxonomy. However, its name makes sense when we compare LDA to the other main linear dimensionality reduction algorithm: principal components analysis (PCA). LDA is a supervised learning algorithm that utilizes the labels of the data, while PCA is an unsupervised learning algorithm that ignores the labels. To summarize, the name is a historical artifact.

Discriminative training often yields higher accuracy than modeling the class-conditional density functions. However, handling missing data is often easier with class-conditional density models.

All of the linear classifier algorithms listed above can be converted into non-linear algorithms operating on a different input space \varphi(\vec x), using the kernel trick.
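For instance, the following hypothetical sketch illustrates the explicit feature-map view (the kernel trick proper achieves the same effect without ever computing \varphi(\vec x), by expressing the algorithm in terms of inner products): a perceptron trained on quadratic features of the input produces a decision boundary that is non-linear in the original \vec x.

<syntaxhighlight lang="python">
import numpy as np

def phi(X):
    """Explicit quadratic feature map: (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2).

    A classifier that is linear in phi(x) is quadratic, hence non-linear, in x.
    """
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# Any of the linear training algorithms above can now be run on phi(X)
# instead of X; a perceptron-style update is used here purely as an example.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1.0, 1, -1)   # circle: not linearly separable in x

Z = phi(X)
w, b = np.zeros(Z.shape[1]), 0.0
for _ in range(100):                                  # perceptron epochs
    for z_i, y_i in zip(Z, y):
        if y_i * (w @ z_i + b) <= 0:                  # misclassified -> update
            w += y_i * z_i
            b += y_i

print("training accuracy in feature space:", np.mean(np.sign(Z @ w + b) == y))
</syntaxhighlight>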
== Discriminative training ==
Discriminative training of linear classifiers usually proceeds in a supervised way, by means of an optimization algorithm that is given a training set with desired outputs and a loss function that measures the discrepancy between the classifier's outputs and the desired outputs. Thus, the learning algorithm solves an optimization problem of the form
:\underset{\mathbf{w}}{\arg\min} \;R(\mathbf{w}) + C \sum_{i=1}^N L(y_i, \mathbf{w}^\mathsf{T} \mathbf{x}_i)
where
• \mathbf{w} is a vector of classifier parameters,
• L(y_i, \mathbf{w}^\mathsf{T} \mathbf{x}_i) is a loss function that measures the discrepancy between the classifier's prediction and the true output y_i for the i-th training example,
• R(\mathbf{w}) is a regularization function that prevents the parameters from getting too large (causing overfitting), and
• C is a scalar constant (set by the user of the learning algorithm) that controls the balance between the regularization function and the loss function.
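As a concrete instance of this objective (a minimal sketch; the choice of an L2 regularizer R(\mathbf{w}) = \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 together with the hinge loss, and the function name, are illustrative assumptions rather than the only option):

<syntaxhighlight lang="python">
import numpy as np

def objective(w, X, y, C=1.0):
    """R(w) + C * sum_i L(y_i, w^T x_i) with R(w) = 0.5 * ||w||^2 and
    L the hinge loss L(y, s) = max(0, 1 - y * s), for labels y in {-1, +1}."""
    scores = X @ w                              # w^T x_i for every training example
    hinge = np.maximum(0.0, 1.0 - y * scores)   # per-example hinge loss
    return 0.5 * (w @ w) + C * hinge.sum()
</syntaxhighlight>

Replacing the hinge term with np.log1p(np.exp(-y * scores)) would give the log loss used by logistic regression instead.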
Popular loss functions include the hinge loss (for linear SVMs) and the log loss (for linear logistic regression). If the regularization function R is convex and the loss function is convex in \mathbf{w} (as both the hinge and log losses are), then the above is a convex optimization problem. Many algorithms exist for solving such problems; popular ones for linear classification include (stochastic) gradient descent, L-BFGS, coordinate descent and Newton methods.
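A minimal sketch of one such solver follows, assuming stochastic (sub)gradient descent on the L2-regularized hinge-loss objective above; the step size, epoch count and all names are illustrative rather than a tuned implementation. Each update uses the subgradient of a single example's term, with the regularizer spread evenly over the N examples so that one full pass approximates a gradient step on the whole objective.

<syntaxhighlight lang="python">
import numpy as np

def sgd_linear_svm(X, y, C=1.0, lr=0.01, epochs=20, seed=0):
    """Stochastic subgradient descent for 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * w^T x_i).

    X is an (N, d) array of examples; y is an (N,) array of labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(N):          # visit examples in random order
            grad = w / N                      # regularizer spread over the N examples
            if y[i] * (w @ X[i]) < 1:         # hinge loss is active for this example
                grad = grad - C * y[i] * X[i]
            w -= lr * grad
    return w

# Toy usage on linearly separable data (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(+2.0, 1.0, size=(50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
w = sgd_linear_svm(X, y)
print("learned w:", w, "training accuracy:", np.mean(np.sign(X @ w) == y))
</syntaxhighlight>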
== See also ==