
Limited-memory BFGS

Limited-memory BFGS is an optimization algorithm in the collection of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning. The algorithm's target problem is to minimize f(\mathbf{x}) over unconstrained values of the real-vector \mathbf{x}, where f is a differentiable scalar function.

Algorithm
The algorithm starts with an initial estimate of the optimal value, \mathbf{x}_0, and proceeds iteratively to refine that estimate with a sequence of better estimates \mathbf{x}_1,\mathbf{x}_2,\ldots. The derivatives of the function g_k:=\nabla f(\mathbf{x}_k) are used as a key driver of the algorithm to identify the direction of steepest descent, and also to form an estimate of the Hessian matrix (second derivative) of f(\mathbf{x}).

L-BFGS shares many features with other quasi-Newton algorithms, but differs in how the matrix-vector multiplication d_k=-H_k g_k is carried out, where d_k is the approximate Newton's direction, g_k is the current gradient, and H_k is the inverse of the Hessian matrix. There are multiple published approaches using a history of updates to form this direction vector. Here, we give a common approach, the so-called "two-loop recursion."

We take as given x_k, the position at the k-th iteration, and g_k\equiv\nabla f(x_k), where f is the function being minimized, and all vectors are column vectors. We also assume that we have stored the last m updates of the form

:s_k = x_{k+1} - x_k
:y_k = g_{k+1} - g_k.

We define \rho_k = \frac{1}{y^{\top}_k s_k}, and H^0_k will be the 'initial' approximation of the inverse Hessian that our estimate at iteration k begins with.

The algorithm is based on the BFGS recursion for the inverse Hessian,

:H_{k+1} = (I-\rho_k s_k y_k^\top)H_k(I-\rho_k y_k s_k^\top) + \rho_k s_k s_k^\top.

For a fixed k we define a sequence of vectors q_{k-m},\ldots,q_k as q_k:=g_k and q_i:=(I-\rho_i y_i s_i^\top)q_{i+1}. Then a recursive algorithm for calculating q_i from q_{i+1} is to define \alpha_i := \rho_i s_i^\top q_{i+1} and q_i=q_{i+1}-\alpha_i y_i. We also define another sequence of vectors z_{k-m},\ldots,z_k as z_i:=H_iq_i. These vectors can likewise be computed recursively: define z_{k-m}=H_k^0 q_{k-m}, and then recursively define \beta_i:=\rho_i y_i^\top z_i and z_{i+1}=z_i + (\alpha_i - \beta_i)s_i. The value of z_k is then our ascent direction. Thus we can compute the descent direction as follows:

:\begin{array}{l}
q = g_k\\
\mathtt{For}\ i=k-1, k-2, \ldots, k-m\\
\qquad \alpha_i = \rho_i s^\top_i q\\
\qquad q = q - \alpha_i y_i\\
\gamma_k = \frac{s_{k - m}^{\top} y_{k - m}}{y_{k - m}^{\top} y_{k - m}}\\
H^0_k = \gamma_k I\\
z = H^0_k q\\
\mathtt{For}\ i=k-m, k-m+1, \ldots, k-1\\
\qquad \beta_i = \rho_i y^\top_i z\\
\qquad z = z + s_i (\alpha_i - \beta_i)\\
z = -z
\end{array}

This formulation gives the search direction for the minimization problem, i.e., z = -H_k g_k. For maximization problems, one should thus take -z instead.

Note that the initial approximate inverse Hessian H^0_k is chosen as a diagonal matrix or even a multiple of the identity matrix, since this is numerically efficient. The scaling of the initial matrix \gamma_k ensures that the search direction is well scaled, and therefore the unit step length is accepted in most iterations. A Wolfe line search is used to ensure that the curvature condition is satisfied and the BFGS updating is stable. Note that some software implementations use an Armijo backtracking line search, but this cannot guarantee that the curvature condition y_k^{\top} s_k > 0 will be satisfied by the chosen step, since a step length greater than 1 may be needed to satisfy this condition.
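The two-loop recursion translates almost line for line into code. The following is a minimal NumPy sketch; the function name, the list-based oldest-first storage of the history, and the scaling by the oldest stored pair (matching the pseudocode above; scaling by the newest pair is also common) are illustrative choices, not part of any standard API.

```python
import numpy as np

def two_loop_recursion(g, s_list, y_list, rho_list):
    """L-BFGS search direction z = -H_k g via the two-loop recursion.

    g        : current gradient, shape (n,)
    s_list   : last m steps s_i = x_{i+1} - x_i, oldest first
    y_list   : last m gradient differences y_i = g_{i+1} - g_i, oldest first
    rho_list : precomputed 1 / (y_i^T s_i), oldest first
    Assumes at least one stored (s, y) pair.
    """
    m = len(s_list)
    q = g.astype(float).copy()
    alpha = [0.0] * m

    # First loop: newest to oldest stored update.
    for i in reversed(range(m)):
        alpha[i] = rho_list[i] * np.dot(s_list[i], q)
        q -= alpha[i] * y_list[i]

    # Initial scaling H_k^0 = gamma_k * I, using the oldest stored pair.
    gamma = np.dot(s_list[0], y_list[0]) / np.dot(y_list[0], y_list[0])
    z = gamma * q

    # Second loop: oldest to newest stored update.
    for i in range(m):
        beta = rho_list[i] * np.dot(y_list[i], z)
        z += (alpha[i] - beta) * s_list[i]

    return -z  # descent direction
```

The cost is O(mn) per iteration: only dot products and vector updates against the m stored pairs, with no n-by-n matrix ever formed.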
Some implementations address this by skipping the BFGS update when y_k^{\top} s_k is negative or too close to zero, but this approach is not generally recommended, since the updates may be skipped so often that the Hessian approximation H_k fails to capture important curvature information. Some solvers instead employ a so-called damped (L-)BFGS update, which modifies the quantities s_k and y_k in order to satisfy the curvature condition.

The two-loop recursion formula is widely used by unconstrained optimizers due to its efficiency in multiplying by the inverse Hessian. However, it does not allow for the explicit formation of either the direct or inverse Hessian and is incompatible with non-box constraints. An alternative approach is the compact representation, which involves a low-rank representation for the direct and/or inverse Hessian; it represents the Hessian as a sum of a diagonal matrix and a low-rank update. Such a representation enables the use of L-BFGS in constrained settings, for example as part of the SQP method.
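To make the damped update mentioned above concrete, here is a minimal sketch of Powell-style damping. The function name is hypothetical, mu = 0.2 is the conventional threshold, and Bs stands for the product of the current direct Hessian approximation with s, which a limited-memory implementation would typically obtain from the compact representation or approximate via the initial scaling B^0_k = \gamma_k^{-1} I.

```python
import numpy as np

def damped_pair(s, y, Bs, mu=0.2):
    """Powell-style damped (L-)BFGS pair (a sketch; names are illustrative).

    s  : step x_{k+1} - x_k
    y  : gradient difference g_{k+1} - g_k
    Bs : product of the current direct Hessian approximation with s
    Returns (s, y_bar) with s^T y_bar >= mu * s^T B s > 0, so the
    update can proceed instead of being skipped.
    """
    sy = np.dot(s, y)
    sBs = np.dot(s, Bs)
    if sy >= mu * sBs:
        theta = 1.0  # curvature condition already holds; keep y unchanged
    else:
        # Interpolate y toward Bs just enough to restore positive curvature.
        theta = (1.0 - mu) * sBs / (sBs - sy)
    y_bar = theta * y + (1.0 - theta) * Bs
    return s, y_bar
```

When theta = 1 the usual pair is recovered, so damping only activates on steps where the raw pair would violate the curvature condition.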
Applications
L-BFGS has been called "the algorithm of choice" for fitting log-linear (MaxEnt) models and conditional random fields with \ell_2-regularization.
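As a concrete illustration of this use case, the following sketch fits an \ell_2-regularized logistic regression (a simple log-linear model) on synthetic data; SciPy exposes the algorithm through its 'L-BFGS-B' method, and the objective and variable names here are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def regularized_nll(w, X, y, lam):
    """l2-regularized logistic loss and gradient (labels y in {0, 1})."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))          # predicted probabilities
    eps = 1e-12                                  # guard against log(0)
    nll = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    loss = nll + 0.5 * lam * np.dot(w, w)
    grad = X.T @ (p - y) + lam * w
    return loss, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = (X @ true_w + rng.normal(scale=0.5, size=200) > 0).astype(float)

# jac=True tells SciPy the objective returns (loss, gradient) together.
res = minimize(regularized_nll, np.zeros(5), args=(X, y, 1.0),
               jac=True, method='L-BFGS-B')
print(res.x)  # fitted weights
```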
Variants
Since BFGS (and hence L-BFGS) is designed to minimize smooth functions without constraints, the L-BFGS algorithm must be modified to handle functions that include non-differentiable components or constraints. A popular class of modifications are called active-set methods, based on the concept of the active set. The idea is that when restricted to a small neighborhood of the current iterate, the function and constraints can be simplified.

L-BFGS-B

The L-BFGS-B algorithm extends L-BFGS to handle simple box constraints (aka bound constraints) on variables; that is, constraints of the form l_i \leq x_i \leq u_i, where l_i and u_i are per-variable constant lower and upper bounds, respectively (for each x_i, either or both bounds may be omitted). The method works by identifying fixed and free variables at every step (using a simple gradient method), then applying the L-BFGS method to the free variables only to get higher accuracy, and then repeating the process. (A usage sketch appears at the end of this section.)

OWL-QN

Orthant-wise limited-memory quasi-Newton (OWL-QN) is an L-BFGS variant for fitting \ell_1-regularized models, exploiting the inherent sparsity of such models.

O-LBFGS

O-LBFGS is an online approximation of L-BFGS. Similar to stochastic gradient descent, it can be used to reduce the computational complexity by evaluating the error function and gradient on a randomly drawn subset of the overall dataset in each iteration. It has been shown that O-LBFGS has a global almost sure convergence, while the online approximation of BFGS (O-BFGS) is not necessarily convergent.
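As promised above, here is a minimal sketch of box-constrained minimization in the L-BFGS-B style, using SciPy's implementation on the Rosenbrock function; the bounds are chosen (illustratively) so that one constraint is active at the solution.

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function and gradient; unconstrained minimum at (1, 1).
def f(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

# Box constraints l_i <= x_i <= u_i. The bound x_1 <= 0.8 excludes the
# unconstrained minimum, so it becomes active at the solution.
bounds = [(-2.0, 2.0), (-2.0, 0.8)]
res = minimize(f, x0=np.array([-1.2, 1.0]), jac=grad,
               method='L-BFGS-B', bounds=bounds)
print(res.x)  # ends on the boundary of the box
```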
Implementation of variants
Notable open source implementations include:
• ALGLIB implements L-BFGS in C++ and C# as well as a separate box/linearly constrained version, BLEIC.
• R's optim general-purpose optimizer routine uses the L-BFGS-B method.
• SciPy's optimization module's minimize method also includes an option to use L-BFGS-B.
• Julia's Optim.jl also implements the L-BFGS and L-BFGS-B algorithms.
• Stan implements L-BFGS together with automatic differentiation as an option for solving maximum likelihood estimation and maximum a posteriori estimation problems.

Notable non open source implementations include:
• The L-BFGS-B variant also exists as ACM TOMS algorithm 778. In February 2011, some of the authors of the original L-BFGS-B code posted a major update (version 3.0).
• A reference implementation in Fortran 77 (with a Fortran 90 interface). This version, as well as older versions, has been converted to many other languages.
• An OWL-QN C++ implementation by its designers.