Classifying data is a common task in machine learning. Suppose some data points, each belonging to one of two sets, are given, and we wish to create a model that will decide which set a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p − 1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify (separate) the data. One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two sets. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the linear classifier it defines is known as a maximum-margin classifier. More formally, given some training data \mathcal{D}, a set of n points of the form

:\mathcal{D} = \left\{ (\mathbf{x}_i, y_i)\mid\mathbf{x}_i \in \mathbb{R}^p,\, y_i \in \{-1,1\}\right\}_{i=1}^n

where each y_i is either 1 or −1, indicating the set to which the point \mathbf{x}_i belongs. Each \mathbf{x}_i is a
p-dimensional
real vector. We want to find the maximum-margin hyperplane that divides the points having y_i=1 from those having y_i=-1. Any hyperplane can be written as the set of points \mathbf{x} satisfying

: \mathbf{w}\cdot\mathbf{x} - b=0,

where \cdot denotes the dot product and {\mathbf{w}} the (not necessarily normalized) normal vector to the hyperplane. The parameter \tfrac{b}{\|\mathbf{w}\|} determines the offset of the hyperplane from the origin along the normal vector {\mathbf{w}}. If the training data are linearly separable, we can select two parallel hyperplanes that separate the data such that no points lie between them, and then try to maximize the distance between them.
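The two parallel hyperplanes just mentioned can be written concretely. Under the standard rescaling convention (an assumption not stated above, but consistent with the symbols \mathbf{w} and b already defined), they are:

```latex
% Margin hyperplanes, with w and b as defined in the text above.
% The rescaling of w, b so that the nearest points satisfy these
% equations with equality is a standard convention.
\mathbf{w}\cdot\mathbf{x} - b = 1
\qquad\text{and}\qquad
\mathbf{w}\cdot\mathbf{x} - b = -1
% The distance between these two hyperplanes is 2 / \|w\|, so
% maximizing the margin is equivalent to minimizing \|w\| subject to
% y_i (\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1 for every i.
```

In this form, maximizing the separation becomes a constrained minimization of \|\mathbf{w}\|, which is how the problem is usually handed to an optimizer.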
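As a concrete illustration of the linear classifier and the geometric margin described above, here is a minimal NumPy sketch. The data set, \mathbf{w}, and b are made-up values chosen only to be linearly separable; this is not the maximum-margin solution, just a candidate hyperplane being evaluated.

```python
import numpy as np

# Hypothetical 2-D training set: two linearly separable classes.
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.0, 1.0],     # class +1
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A candidate hyperplane w.x - b = 0 (illustrative, not optimal).
w = np.array([1.0, 1.0])
b = 1.0

# Linear classifier: predict the side of the hyperplane each point is on.
pred = np.sign(X @ w - b)

# Geometric margin of each point: |w.x_i - b| / ||w||.
# The hyperplane's margin is the smallest of these values.
margins = np.abs(X @ w - b) / np.linalg.norm(w)
print(pred)           # class assignments
print(margins.min())  # distance to the nearest data point
```

A maximum-margin classifier would instead choose w and b to make `margins.min()` as large as possible over all correctly separating hyperplanes, which in practice is done with a quadratic-programming solver rather than by hand.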