Given a training set \{ x_i ,y_i \}_{i = 1}^N with input data x_i \in \mathbb{R}^n and corresponding binary class labels y_i \in \{ -1, +1 \}, the SVM classifier, according to Vapnik's original formulation, satisfies the following conditions:
: \begin{cases} w^T \phi (x_i ) + b \ge 1, & \text{if } \quad y_i = +1, \\ w^T \phi (x_i ) + b \le - 1, & \text{if } \quad y_i = -1, \end{cases}
which is equivalent to
: y_i \left[ {w^T \phi (x_i ) + b} \right] \ge 1,\quad i = 1, \ldots, N,
where \phi(x) is the nonlinear map from the original space to a high- or infinite-dimensional feature space.
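As an illustration (not part of the original formulation), the equivalence of the two-case condition and the single inequality can be checked on a made-up, linearly separable toy set, taking \phi to be the identity map:

<syntaxhighlight lang="python">
# A minimal illustration (made-up data, phi = identity): the per-class margin
# conditions and y_i * (w^T x_i + b) >= 1 accept exactly the same (w, b).
import numpy as np

X = np.array([[2.0, 1.0], [3.0, -1.0], [-2.0, 0.5], [-3.0, -0.5]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 0.0]), 0.0

f = X @ w + b
case_form = np.where(y == +1, f >= 1, f <= -1)   # per-class conditions
single_form = y * f >= 1                         # combined inequality
print(np.array_equal(case_form, single_form))    # True
</syntaxhighlight>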
==Inseparable data==
In case such a separating hyperplane does not exist, we introduce so-called slack variables \xi_i such that
: \begin{cases} y_i \left[ {w^T \phi (x_i ) + b} \right] \ge 1 - \xi _i , & i = 1, \ldots, N, \\ \xi _i \ge 0, & i = 1, \ldots, N. \end{cases}
According to the structural risk minimization principle, the risk bound is minimized by the following minimization problem:
: \min J_1 (w,\xi )=\frac{1}{2}w^T w + c\sum\limits_{i = 1}^N \xi_i ,
: \text{subject to } \begin{cases} y_i \left[ {w^T \phi (x_i ) + b} \right] \ge 1 - \xi _i , & i = 1, \ldots, N, \\ \xi _i \ge 0, & i = 1, \ldots ,N . \end{cases}
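This soft-margin problem is what standard SVM software solves. A minimal sketch (assuming scikit-learn is available; its parameter C plays the role of c, and the chosen kernel defines \phi implicitly, with made-up data):

<syntaxhighlight lang="python">
# Minimal sketch with an off-the-shelf solver (assumes scikit-learn);
# SVC's parameter C corresponds to c in the objective above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(C=10.0, kernel="rbf", gamma=0.5)   # c = 10, RBF feature map
clf.fit(X, y)
print(clf.score(X, y))                       # training accuracy
</syntaxhighlight>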
To solve this problem, we construct the Lagrangian function:
: L_1(w,b,\xi,\alpha,\beta)=\frac{1}{2}w^T w + c\sum\limits_{i = 1}^N {\xi _i } - \sum\limits_{i=1}^N \alpha_i \left\{ y_i \left[ {w^T \phi (x_i ) + b} \right] - 1 + \xi _i \right\} - \sum\limits_{i=1}^N \beta_i \xi_i,
where \alpha_i \ge 0,\ \beta _i \ge 0\ (i = 1, \ldots, N) are the Lagrange multipliers. The optimal point is the saddle point of the Lagrangian function, from which we obtain
: \begin{cases} \frac{ \partial L_1 }{\partial w} = 0\quad \to \quad w = \sum\limits_{i = 1}^N \alpha _i y_i \phi (x_i ) ,\\ \frac{\partial L_1 }{\partial b} = 0\quad \to \quad \sum\limits_{i = 1}^N \alpha _i y_i = 0 ,\\ \frac{\partial L_1 }{\partial \xi _i } = 0\quad \to \quad \alpha _i +\beta_i = c,\quad i = 1, \ldots ,N \quad \to \quad 0 \le \alpha _i \le c,\quad i = 1, \ldots ,N. \end{cases}
By substituting the expression for w back into the Lagrangian, we obtain the following dual quadratic programming problem:
: \max Q_1(\alpha) = -\frac{1}{2}\sum\limits_{i,j = 1}^N {\alpha _i \alpha _j y_i y_j K(x_i ,x_j )} + \sum\limits_{i = 1}^N \alpha_i,
where K(x_i ,x_j ) = \left\langle \phi (x_i ), \phi (x_j) \right\rangle is called the kernel function.
Solving this quadratic programming problem subject to the constraints above (in particular \sum\limits_{i = 1}^N \alpha_i y_i = 0 and 0 \le \alpha_i \le c), we obtain the hyperplane in the high-dimensional space and hence the classifier in the original space.
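A small self-contained sketch of this step (pure NumPy/SciPy, suitable only for small N; the RBF kernel, the made-up data and the parameter values are illustrative assumptions, not part of the formulation): the dual is solved with a general-purpose solver, b is recovered from a margin support vector, and the classifier \operatorname{sign}\left(\sum_i \alpha_i y_i K(x_i, x) + b\right) follows from w = \sum_i \alpha_i y_i \phi(x_i).

<syntaxhighlight lang="python">
# Sketch of the dual QP (illustrative only, not an efficient SVM solver).
import numpy as np
from scipy.optimize import minimize

def K(a, b, sigma=1.0):                          # RBF kernel (assumed choice)
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

# made-up training data
X = np.array([[2.0, 1.0], [1.5, -0.5], [-2.0, 0.5], [-1.0, -1.5]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
N, c = len(y), 10.0

Kmat = np.array([[K(xi, xj) for xj in X] for xi in X])
Q = (y[:, None] * y[None, :]) * Kmat             # Q_ij = y_i y_j K(x_i, x_j)

# maximize Q_1(alpha)  <=>  minimize 0.5 a^T Q a - sum(a)
res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               np.zeros(N), method="SLSQP",
               bounds=[(0.0, c)] * N,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x

# recover b from a support vector with 0 < alpha_k < c (assumed to exist here)
k = int(np.argmax((alpha > 1e-6) & (alpha < c - 1e-6)))
b = y[k] - sum(alpha[i] * y[i] * Kmat[i, k] for i in range(N))

def classify(x):
    return np.sign(sum(alpha[i] * y[i] * K(X[i], x) for i in range(N)) + b)

print(classify(np.array([1.8, 0.0])), classify(np.array([-1.8, 0.0])))
</syntaxhighlight>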
==Least-squares SVM formulation==
The least-squares version of the SVM classifier is obtained by reformulating the minimization problem as
: \min J_2(w,b,e) = \frac{\mu}{2} w^T w + \frac{\zeta}{2}\sum\limits_{i = 1}^N e_i^2,
subject to the equality constraints
: y_i \left[ {w^T \phi (x_i ) + b} \right] = 1 - e_{i} ,\quad i = 1, \ldots ,N .
The least-squares SVM (LS-SVM) classifier formulation above implicitly corresponds to a regression interpretation with binary targets y_i = \pm 1. Using y_i^2 = 1, we have
: \sum\limits_{i = 1}^N e_i^2 = \sum\limits_{i = 1}^N (y_i e_i)^2 = \sum\limits_{i = 1}^N \left( y_i - (w^T \phi(x_i) + b) \right)^2,
with y_i e_i = y_i - (w^T \phi(x_i) + b). Notice that this error would also make sense for least-squares data fitting, so that the same end result holds for the regression case. Hence the LS-SVM classifier formulation is equivalent to
: J_2(w,b,e) = \mu E_W + \zeta E_D
with E_W = \frac{1}{2} w^T w and E_D = \frac{1}{2} \sum\limits_{i = 1}^N e_i^2 = \frac{1}{2} \sum\limits_{i = 1}^N \left(y_i - (w^T \phi(x_i) + b) \right)^2. Both \mu and \zeta should be considered as hyperparameters that tune the amount of regularization versus the sum squared error. The solution depends only on the ratio \gamma = \zeta / \mu; therefore the original formulation uses only \gamma as a tuning parameter. We use both \mu and \zeta as parameters in order to provide a Bayesian interpretation of LS-SVM.
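A quick numerical check of the identity above (the values standing in for w^T \phi(x_i) + b are arbitrary and purely illustrative):

<syntaxhighlight lang="python">
# Numeric check (made-up values): with y_i in {-1, +1} and
# e_i = 1 - y_i (w^T phi(x_i) + b), one has sum e_i^2 = sum (y_i - f_i)^2.
import numpy as np

y = np.array([+1.0, -1.0, +1.0, -1.0])
f = np.array([0.7, -1.3, 1.1, 0.2])            # stand-in for w^T phi(x_i) + b
e = 1 - y * f
print(np.isclose(np.sum(e ** 2), np.sum((y - f) ** 2)))   # True
</syntaxhighlight>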
The solution of the LS-SVM regressor is obtained by constructing the Lagrangian function of the equivalent problem in which J_2 is divided by \mu and \gamma = \zeta/\mu:
: L_2 (w,b,e,\alpha ) = \frac{1}{\mu} J_2 (w,b,e) - \sum\limits_{i = 1}^N \alpha _i \left\{ { \left[ {w^T \phi (x_i ) + b} \right] + e_i - y_i } \right\} = \frac{1}{2}w^T w + \frac{\gamma }{2} \sum\limits_{i = 1}^N e_i^2 - \sum\limits_{i = 1}^N \alpha _i \left\{ \left[ w^T \phi (x_i ) + b \right] + e_i -y_i \right\} ,
where \alpha_i \in \mathbb{R} are the Lagrange multipliers. The conditions for optimality are
: \begin{cases} \frac{\partial L_2 }{\partial w} = 0\quad \to \quad w = \sum\limits_{i = 1}^N \alpha _i \phi (x_i ) , \\ \frac{\partial L_2 }{\partial b} = 0\quad \to \quad \sum\limits_{i = 1}^N \alpha _i = 0 ,\\ \frac{\partial L_2 }{\partial e_i } = 0\quad \to \quad \alpha _i = \gamma e_i ,\quad i = 1, \ldots ,N ,\\ \frac{\partial L_2 }{\partial \alpha _i } = 0\quad \to \quad y_i = w^T \phi (x_i ) + b + e_i ,\quad i = 1, \ldots ,N . \end{cases}
Elimination of w and e yields a
linear system instead of a quadratic programming problem:
: \left[ \begin{matrix} 0 & 1_N^T \\ 1_N & \Omega + \gamma ^{ - 1} I_N \end{matrix} \right] \left[ \begin{matrix} b \\ \alpha \end{matrix} \right] = \left[ \begin{matrix} 0 \\ Y \end{matrix} \right] ,
with Y = [y_1 , \ldots ,y_N ]^T, 1_N = [1, \ldots ,1]^T and \alpha = [\alpha _1 , \ldots ,\alpha _N ]^T. Here, I_N is an N \times N identity matrix, and \Omega \in \mathbb{R}^{N \times N} is the kernel matrix defined by \Omega _{ij} = \phi (x_i )^T \phi (x_j ) = K(x_i ,x_j ).
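A minimal sketch of this step (pure NumPy; the RBF kernel, the made-up data and the parameter values are illustrative assumptions, and no numerical safeguards are included): the (N+1) \times (N+1) system above is assembled and solved directly, and new points are classified with \operatorname{sign}\left(\sum_i \alpha_i K(x_i, x) + b\right), which follows from w = \sum_i \alpha_i \phi(x_i).

<syntaxhighlight lang="python">
# Build and solve the (N+1)x(N+1) LS-SVM linear system, then classify.
import numpy as np

def K(a, b, sigma=1.0):                           # RBF kernel (assumed choice)
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

def lssvm_fit(X, y, gamma=10.0):
    N = len(y)
    Omega = np.array([[K(xi, xj) for xj in X] for xi in X])
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                                # first row:    [0, 1_N^T]
    A[1:, 0] = 1.0                                # first column: [0; 1_N]
    A[1:, 1:] = Omega + np.eye(N) / gamma         # Omega + gamma^-1 I_N
    rhs = np.concatenate(([0.0], y.astype(float)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                        # b, alpha

def lssvm_predict(x, X, b, alpha):
    return np.sign(sum(a * K(xi, x) for a, xi in zip(alpha, X)) + b)

X = np.array([[2.0, 1.0], [1.5, -0.5], [-2.0, 0.5], [-1.0, -1.5]])
y = np.array([+1, +1, -1, -1])
b, alpha = lssvm_fit(X, y)
print(lssvm_predict(np.array([1.8, 0.0]), X, b, alpha))
</syntaxhighlight>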
==Kernel function K==
For the kernel function K(•, •) one typically has the following choices:
• Linear kernel: K(x,x_i ) = x_i^T x,
• Polynomial kernel of degree d: K(x,x_i ) = \left( {1 + x_i^T x/c} \right)^d ,
• Radial basis function (RBF) kernel: K(x,x_i ) = \exp \left( { - \left\| {x - x_i } \right\|^2 /\sigma ^2 } \right),
• MLP kernel: K(x,x_i ) = \tanh \left( {k\,x_i^T x + \theta } \right),
where d, c, \sigma, k and \theta are constants.
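As a small sketch (parameter names following the text; the implementations are illustrative, not a library API), these choices can be written as:

<syntaxhighlight lang="python">
# Illustrative kernel implementations corresponding to the list above.
import numpy as np

def linear_kernel(x, xi):
    return xi @ x

def polynomial_kernel(x, xi, c=1.0, d=3):
    return (1.0 + (xi @ x) / c) ** d

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / sigma ** 2)

def mlp_kernel(x, xi, k=1.0, theta=0.0):
    return np.tanh(k * (xi @ x) + theta)
</syntaxhighlight>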
Notice that the Mercer condition holds for all c, \sigma \in \mathbb{R}^+ and d \in \mathbb{N} in the polynomial and RBF case, but not for all possible choices of k and \theta in the MLP case. The scale parameters c, \sigma and k determine the scaling of the inputs in the polynomial, RBF and MLP kernel functions. This scaling is related to the bandwidth of the kernel in statistics, where it is shown that the bandwidth is an important parameter for the generalization behavior of a kernel method.

==Bayesian interpretation for LS-SVM==