Many supervised learning problems involve an output variable y and a vector of input variables x, related to each other with some probabilistic distribution. The goal is to find some function \hat{F}(x) that best approximates the output variable from the values of the input variables. This is formalized by introducing some
loss function L(y, F(x)) and minimizing it in expectation:
: \hat{F} = \underset{F}{\operatorname{arg\,min}} \, \mathbb{E}_{x,y}[L(y, F(x))].
The gradient boosting method assumes a real-valued y. It seeks an approximation \hat{F}(x) in the form of a weighted sum of functions h_m(x) from some class \mathcal{H}, called base (or weak) learners:
: \hat{F}(x) = \sum_{m=1}^M \gamma_m h_m(x) + \text{const},
where \gamma_m is the weight at stage m. We are usually given a training set \{(x_1,y_1), \dots, (x_n,y_n)\} of known values of x and corresponding values of y. In accordance with the
empirical risk minimization principle, the method tries to find an approximation \hat{F}(x) that minimizes the average value of the loss function on the training set, i.e., minimizes the empirical risk. It does so by starting with a model consisting of a constant function F_0(x),
: F_0(x) = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^n L(y_i, \gamma),
and incrementally expands it in a greedy fashion:
: F_m(x) = F_{m-1}(x) + \left(\underset{h_m \in \mathcal{H}}{\operatorname{arg\,min}} \left[\sum_{i=1}^n L(y_i, F_{m-1}(x_i) + h_m(x_i))\right]\right)(x),
for m \geq 1, where h_m \in \mathcal{H} is a base learner function. Unfortunately, choosing the best function h_m at each step for an arbitrary loss function L is a computationally infeasible optimization problem in general. Therefore, we restrict our approach to a simplified version of the problem. The idea is to apply a
steepest descent step to this minimization problem (functional gradient descent). The basic idea is to find a local minimum of the loss function by iterating on F_{m-1}(x). In fact, the locally steepest-descent direction of the loss function is the negative gradient. Hence, moving a small amount \gamma in that direction, so that the linear approximation remains valid:
: F_m(x) = F_{m-1}(x) - \gamma \sum_{i=1}^n \nabla_{F_{m-1}} L(y_i, F_{m-1}(x_i)),
where \gamma > 0. For small \gamma, this implies that L(y_i, F_m(x_i)) \le L(y_i, F_{m-1}(x_i)).
To see why, consider the objective
: O = \sum_{i=1}^n L(y_i, F_{m-1}(x_i) + h_m(x_i)).
Doing a Taylor expansion around the fixed point F_{m-1}(x_i) up to first order,
: O = \sum_{i=1}^n L(y_i, F_{m-1}(x_i) + h_m(x_i)) \approx \sum_{i=1}^n L(y_i, F_{m-1}(x_i)) + h_m(x_i) \, \nabla_{F_{m-1}} L(y_i, F_{m-1}(x_i)) + \cdots.
Now differentiating w.r.t. h_m(x_i), only the derivative of the second term remains: \nabla_{F_{m-1}} L(y_i, F_{m-1}(x_i)). This is the direction of steepest ascent, so we must move in the opposite (i.e., negative) direction to move in the direction of steepest descent. Furthermore, we can optimize \gamma by finding the value for which the loss function attains its minimum:
: \gamma_m = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^n L(y_i, F_m(x_i)) = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^n L\left(y_i, F_{m-1}(x_i) - \gamma \nabla_{F_{m-1}} L(y_i, F_{m-1}(x_i))\right).
If we considered the continuous case, i.e., where \mathcal{H} is the set of arbitrary differentiable functions on \R, we would update the model in accordance with the following equations
: F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^n \nabla_{F_{m-1}} L(y_i, F_{m-1}(x_i)),
where \gamma_m is the step length, defined as
: \gamma_m = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^n L\left(y_i, F_{m-1}(x_i) - \gamma \nabla_{F_{m-1}} L(y_i, F_{m-1}(x_i))\right).
In the discrete case, however, i.e.
when the set \mathcal{H} is finite, we choose the candidate function closest to the gradient of L, for which the coefficient \gamma may then be calculated with the aid of line search on the above equations. Note that this approach is a heuristic and therefore doesn't yield an exact solution to the given problem, but rather an approximation.

In pseudocode, the generic gradient boosting method is:

Input: training set \{(x_i, y_i)\}_{i=1}^n, a differentiable loss function L(y, F(x)), number of iterations M.

Algorithm:
• Initialize model with a constant value:
: F_0(x) = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^n L(y_i, \gamma).
• For m = 1 to M:
• Compute so-called
pseudo-residuals:
: r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)} \quad \text{for } i=1,\ldots,n.
• Fit a base learner (or weak learner, e.g. tree) closed under scaling h_m(x) to the pseudo-residuals, i.e. train it using the training set \{(x_i, r_{im})\}_{i=1}^n.
• Compute multiplier \gamma_m by solving the following one-dimensional optimization problem:
: \gamma_m = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^n L\left(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\right).
• Update the model:
: F_m(x) = F_{m-1}(x) + \gamma_m h_m(x).
• Output F_M(x).

== Gradient tree boosting ==
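The generic algorithm above can be sketched with small regression trees as base learners. The following is a minimal illustration, not a canonical implementation: it uses depth-1 trees (stumps), the squared-error loss L(y, F) = (y - F)^2/2 (for which the pseudo-residuals are the ordinary residuals y_i - F(x_i)), and a simple grid-based line search for \gamma_m; all function names are illustrative.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (stump) h_m to the pseudo-residuals r."""
    best = None
    for t in np.unique(x)[:-1]:          # candidate split thresholds
        left, right = r[x <= t].mean(), r[x > t].mean()
        pred = np.where(x <= t, left, right)
        sse = ((r - pred) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left, right)
    _, t, lv, rv = best
    return lambda z: np.where(z <= t, lv, rv)

def gradient_boost(x, y, M=50):
    """Generic gradient boosting with stumps, squared-error loss L = (y-F)^2/2."""
    F0 = y.mean()                        # F_0 = argmin_gamma sum_i L(y_i, gamma)
    F = np.full_like(y, F0)
    ensemble = []                        # pairs (gamma_m, h_m)
    for m in range(M):
        r = y - F                        # pseudo-residuals: -dL/dF(x_i) = y_i - F(x_i)
        h = fit_stump(x, r)
        # One-dimensional line search for gamma_m (a grid suffices here;
        # for squared error the exact minimiser happens to be gamma = 1).
        grid = np.linspace(0.0, 2.0, 201)
        losses = [((y - (F + g * h(x))) ** 2).sum() for g in grid]
        gamma = grid[int(np.argmin(losses))]
        ensemble.append((gamma, h))
        F = F + gamma * h(x)             # F_m = F_{m-1} + gamma_m h_m
    return lambda z: F0 + sum(g * h(z) for g, h in ensemble)

# Usage on a toy 1-D regression problem.
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x)
model = gradient_boost(x, y)
print(np.mean((model(x) - y) ** 2))      # small training MSE
```

In practice the per-stage multiplier is usually damped by a shrinkage factor (learning rate), and deeper trees replace the stumps; the structure of the loop, however, follows the pseudocode step by step.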