More generally, let \mathcal{H}=\{h(\cdot;\omega) | \omega \in \Omega\} be the possibly
infinite set of weak classifiers, also termed
hypotheses. One way to write down the problem LPBoost solves is as a
linear program with infinitely many variables. The primal linear program of LPBoost, optimizing over the non-negative weight vector \boldsymbol{\alpha}, the non-negative vector \boldsymbol{\xi} of slack variables and the
margin \rho is the following. :\begin{array}{cl} \underset{\boldsymbol{\alpha},\boldsymbol{\xi},\rho}{\min} & -\rho + D \sum_{n=1}^{\ell} \xi_n\\ \textrm{sb.t.} & \sum_{\omega \in \Omega} y_n \alpha_{\omega} h(\boldsymbol{x}_n ; \omega) + \xi_n \geq \rho,\qquad n=1,\dots,\ell,\\ & \sum_{\omega \in \Omega} \alpha_{\omega} = 1,\\ & \xi_n \geq 0,\qquad n=1,\dots,\ell,\\ & \alpha_{\omega} \geq 0,\qquad \omega \in \Omega,\\ & \rho \in {\mathbb R}. \end{array} Note the effects of slack variables \boldsymbol{\xi} \geq 0: their one-norm is penalized in the objective function by a constant factor D, which—if small enough—always leads to a primal feasible linear program. Here we adopted the notation of a parameter space \Omega, such that for a choice \omega \in \Omega the weak classifier h(\cdot ; \omega): \mathcal{X} \to \{-1,1\} is uniquely defined. When the above linear program was first written down in early publications about boosting methods it was disregarded as intractable due to the large number of variables \boldsymbol{\alpha}. Only later it was discovered that such linear programs can indeed be solved efficiently using the classic technique of
column generation.
Column generation for LPBoost In a
linear program a
column corresponds to a primal variable.
Column generation is a technique to solve large linear programs. It typically works in a restricted problem, dealing only with a subset of variables. By generating primal variables iteratively and on-demand, eventually the original unrestricted problem with all variables is recovered. By cleverly choosing the columns to generate the problem can be solved such that while still guaranteeing the obtained solution to be optimal for the original full problem, only a small fraction of columns has to be created.
LPBoost dual problem Columns in the primal linear program corresponds to rows in the
dual linear program. The equivalent dual linear program of LPBoost is the following linear program. :\begin{array}{cl} \underset{\boldsymbol{\lambda},\gamma}{\max} & \gamma\\ \textrm{sb.t.} & \sum_{n=1}^{\ell} y_n h(\boldsymbol{x}_n ; \omega) \lambda_n + \gamma \leq 0,\qquad \omega \in \Omega,\\ & 0 \leq \lambda_n \leq D,\qquad n=1,\dots,\ell,\\ & \sum_{n=1}^{\ell} \lambda_n = 1,\\ & \gamma \in \mathbb{R}. \end{array} For
linear programs the optimal value of the primal and
dual problem are equal. For the above primal and dual problems, the optimal value is equal to the negative 'soft margin'. The soft margin is the size of the margin separating positive from negative training instances minus positive slack variables that carry penalties for margin-violating samples. Thus, the soft margin may be positive although not all samples are linearly separated by the classification function. The latter is called the 'hard margin' or 'realized margin'.
Convergence criterion Consider a subset of the satisfied constraints in the dual problem. For any finite subset we can solve the linear program and thus satisfy all constraints. If we could prove that of all the constraints which we did not add to the dual problem no single constraint is violated, we would have proven that solving our restricted problem is equivalent to solving the original problem. More formally, let \gamma^* be the optimal objective function value for any restricted instance. Then, we can formulate a search problem for the 'most violated constraint' in the original problem space, namely finding \omega^* \in \Omega as :\omega^* = \underset{\omega \in \Omega}{\textrm{argmax}} \sum_{n=1}^{\ell} y_n h(\boldsymbol{x}_n;\omega) \lambda_n. That is, we search the space \mathcal{H} for a single
decision stump h(\cdot;\omega^*) maximizing the left hand side of the dual constraint. If the constraint cannot be violated by any choice of decision stump, none of the corresponding constraint can be active in the original problem and the restricted problem is equivalent.
Penalization constant D The positive value of penalization constant D has to be found using
model selection techniques. However, if we choose D=\frac{1}{\ell \nu}, where \ell is the number of training samples and 0 , then the new parameter \nu has the following properties. • \nu is an upper bound on the fraction of training errors; that is, if k denotes the number of misclassified training samples, then \frac{k}{\ell} \leq \nu. • \nu is a lower bound on the fraction of training samples outside or on the margin. == Algorithm ==