Many improvements on the basic stochastic gradient descent algorithm have been proposed and used. In particular, in machine learning, the need to set a
learning rate (step size) has been recognized as problematic. Setting this parameter too high can cause the algorithm to diverge; setting it too low makes it slow to converge. A conceptually simple extension of stochastic gradient descent makes the learning rate a decreasing function of the iteration number t, giving a
learning rate schedule, so that the first iterations cause large changes in the parameters, while the later ones do only fine-tuning. Such schedules have been known since the work of MacQueen on
k-means clustering. Practical guidance on choosing the step size in several variants of SGD is given by Spall.
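As a concrete illustration, one common schedule makes the step size decay roughly like 1/t. The sketch below is illustrative only; the function name and constants are assumptions, not from the original text.

<syntaxhighlight lang="python">
def learning_rate(t, eta0=0.1, decay=0.01):
    # A common 1/t-style schedule: eta_t = eta0 / (1 + decay * t).
    # Large steps for small t (coarse progress), smaller steps later
    # (fine-tuning), as described above.
    return eta0 / (1.0 + decay * t)
</syntaxhighlight>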
=== Implicit updates (ISGD) ===
As mentioned earlier, classical stochastic gradient descent is generally sensitive to the
learning rate \eta. Fast convergence requires large learning rates, but this may induce numerical instability. The problem can be largely solved by considering
implicit updates whereby the stochastic gradient is evaluated at the next iterate rather than the current one:

w^\text{new} := w^\text{old} - \eta\, \nabla Q_i(w^\text{new}).

This equation is implicit since w^\text{new} appears on both sides. It is a stochastic form of the
proximal gradient method since the update can also be written as:

w^\text{new} := \arg\min_w \left\{ Q_i(w) + \frac{1}{2\eta} \left\|w - w^\text{old}\right\|^2 \right\}.

As an example, consider least squares with features x_1, \ldots, x_n \in \mathbb{R}^p and observations y_1, \ldots, y_n \in \mathbb{R}. We wish to solve:

\min_w \sum_{j=1}^n \left(y_j - x_j'w\right)^2,

where x_j'w = x_{j1} w_1 + x_{j2} w_2 + \cdots + x_{jp} w_p indicates the inner product. Note that x could have "1" as the first element to include an intercept. Classical stochastic gradient descent proceeds as follows:

w^\text{new} = w^\text{old} + \eta \left(y_i - x_i'w^\text{old}\right) x_i,

where i is uniformly sampled between 1 and n. Although theoretical convergence of this procedure happens under relatively mild assumptions, in practice the procedure can be quite unstable. In particular, when \eta is misspecified so that I - \eta x_i x_i' has large absolute eigenvalues with high probability, the procedure may diverge numerically within a few iterations. In contrast,
implicit stochastic gradient descent (shortened as ISGD) can be solved in closed form as:

w^\text{new} = w^\text{old} + \frac{\eta}{1 + \eta \left\|x_i\right\|^2} \left(y_i - x_i'w^\text{old}\right) x_i.

This procedure remains numerically stable for virtually all \eta, as the
learning rate is now normalized. The comparison between classical and implicit stochastic gradient descent in the least-squares problem mirrors the comparison between
least mean squares (LMS) and
normalized least mean squares filter (NLMS). Even though a closed-form solution for ISGD is only possible in least squares, the procedure can be efficiently implemented in a wide range of models. Specifically, suppose that Q_i(w) depends on w only through a linear combination with features x_i, so that we can write \nabla_w Q_i(w) = -q(x_i'w) x_i, where q() \in\mathbb{R} may depend on x_i, y_i as well but not on w except through x_i'w. Least squares obeys this rule, and so does
logistic regression, and most
generalized linear models. For instance, in least squares, q(x_i'w) = y_i - x_i'w, and in logistic regression q(x_i'w) = y_i - S(x_i'w), where S(u) = e^u/(1+e^u) is the
logistic function. In
Poisson regression, q(x_i'w) = y_i - e^{x_i'w}, and so on. In such settings, ISGD is simply implemented as follows. Let f(\xi) = \eta\, q(x_i'w^\text{old} + \xi \|x_i\|^2), where \xi is a scalar. Then ISGD is equivalent to:

w^\text{new} = w^\text{old} + \xi^\ast x_i,~\text{where}~\xi^\ast = f(\xi^\ast).

The scaling factor \xi^\ast \in \mathbb{R} can be found through the
bisection method, since in most regular models, such as the aforementioned generalized linear models, the function q() is decreasing, and thus the search bounds for \xi^\ast are [\min(0, f(0)), \max(0, f(0))].
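As a sketch of this recipe, the following example implements one ISGD step for logistic regression, solving \xi^\ast = f(\xi^\ast) by bisection on g(\xi) = \xi - f(\xi), which is increasing because q is decreasing. The function name, tolerance, and iteration cap are illustrative assumptions, not part of the original description.

<syntaxhighlight lang="python">
import numpy as np

def isgd_step(w, x, y, eta, tol=1e-10, max_iter=100):
    # One implicit SGD step for logistic regression (illustrative sketch).
    # Solves the fixed point xi* = f(xi*), where
    #   f(xi) = eta * q(x'w + xi * ||x||^2)  and  q(u) = y - S(u),
    # then returns w + xi* * x.
    norm_sq = x @ x

    def f(xi):
        u = x @ w + xi * norm_sq
        return eta * (y - 1.0 / (1.0 + np.exp(-u)))  # q(u) = y - S(u)

    # Since q is decreasing, the fixed point lies in [min(0, f(0)), max(0, f(0))].
    lo, hi = min(0.0, f(0.0)), max(0.0, f(0.0))
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if mid - f(mid) > 0:   # g(xi) = xi - f(xi) is increasing in xi
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return w + 0.5 * (lo + hi) * x
</syntaxhighlight>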
=== Momentum ===
Further proposals include the
momentum method or the
heavy ball method, which in the ML context appeared in
Rumelhart,
Hinton and
Williams' paper on backpropagation learning and borrowed the idea from Soviet mathematician Boris Polyak's 1964 article on solving functional equations. Stochastic gradient descent with momentum remembers the update at each iteration, and determines the next update as a
linear combination of the gradient and the previous update:

\Delta w := \alpha \Delta w - \eta\, \nabla Q_i(w)
w := w + \Delta w

that leads to:

w := w - \eta\, \nabla Q_i(w) + \alpha \Delta w

where the
parameter w which minimizes Q(w) is to be
estimated, \eta is a step size (sometimes called the
learning rate in machine learning) and \alpha is an exponential
decay factor between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change. The name momentum stems from an analogy to
momentum in physics: the weight vector w, thought of as a particle traveling through parameter space, incurs acceleration from the gradient of the loss ("
force"). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, preventing oscillations. Momentum has been used successfully by computer scientists in the training of
artificial neural networks for several decades. The
momentum method is closely related to
underdamped Langevin dynamics, and may be combined with
simulated annealing. In the mid-1980s the method was modified by
Yurii Nesterov to use the gradient predicted at the next point, and the resulting so-called
Nesterov Accelerated Gradient was sometimes used in ML in the 2010s.
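A minimal sketch of the momentum update, with the Nesterov look-ahead as an option, might look as follows; the function name and hyperparameter values are assumptions for illustration.

<syntaxhighlight lang="python">
import numpy as np

def sgd_momentum(grad, w, n, eta=0.01, alpha=0.9, steps=1000, nesterov=False):
    # grad(w, i) should return the stochastic gradient of Q_i at w;
    # n is the number of summands Q_i in the objective.
    rng = np.random.default_rng(0)
    dw = np.zeros_like(w)
    for _ in range(steps):
        i = rng.integers(n)
        # Nesterov's variant evaluates the gradient at the looked-ahead point.
        g = grad(w + alpha * dw, i) if nesterov else grad(w, i)
        dw = alpha * dw - eta * g   # Delta w := alpha * Delta w - eta * grad
        w = w + dw                  # w := w + Delta w
    return w
</syntaxhighlight>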
=== Averaging ===
Averaged stochastic gradient descent, invented independently by Ruppert and Polyak in the late 1980s, is ordinary stochastic gradient descent that records an average of its parameter vector over time. That is, the update is the same as for ordinary stochastic gradient descent, but the algorithm also keeps track of

\bar{w} = \frac{1}{t} \sum_{i=0}^{t-1} w_i.

When optimization is done, this averaged parameter vector takes the place of w.
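Because the average can be maintained incrementally, no history of iterates needs to be stored. A minimal sketch (names are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def averaged_sgd(grad, w, n, eta=0.01, steps=1000):
    # Ordinary SGD that also maintains the running mean of its iterates
    # (Polyak-Ruppert averaging); the averaged vector is returned at the end.
    rng = np.random.default_rng(0)
    w_bar = w.copy()
    for t in range(1, steps + 1):
        w = w - eta * grad(w, rng.integers(n))
        w_bar += (w - w_bar) / (t + 1)  # incremental mean of w_0, ..., w_t
    return w_bar
</syntaxhighlight>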
=== AdaGrad ===
AdaGrad (for adaptive
gradient algorithm) is a modified stochastic gradient descent algorithm with per-parameter
learning rate, first published in 2011. Informally, this increases the learning rate for sparser parameters and decreases the learning rate for ones that are less sparse. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition.
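The text above does not spell out the update rule; the standard diagonal form of AdaGrad accumulates squared gradients per coordinate and scales each step by their inverse square root. A sketch (the small constant eps is a common numerical safeguard, not from the original text):

<syntaxhighlight lang="python">
import numpy as np

def adagrad(grad, w, n, eta=0.1, eps=1e-8, steps=1000):
    # Diagonal AdaGrad: coordinates that have seen large (dense) gradients
    # get small steps; rarely-updated (sparse) coordinates get larger ones.
    rng = np.random.default_rng(0)
    G = np.zeros_like(w)               # running sum of squared gradients
    for _ in range(steps):
        g = grad(w, rng.integers(n))
        G += g * g
        w = w - eta * g / (np.sqrt(G) + eps)
    return w
</syntaxhighlight>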
=== RMSProp ===
RMSProp (for Root Mean Square Propagation) is a method invented in 2012 by James Martens and
Ilya Sutskever, at the time both PhD students in Geoffrey Hinton's group, in which the
learning rate is, like in Adagrad, adapted for each of the parameters. The idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. Unusually, it was not published in an article but merely described in a
Coursera lecture. First, the running average is calculated in terms of the mean square:

v(w,t) := \gamma v(w,t-1) + \left(1-\gamma\right) \left(\nabla Q_i(w)\right)^2

where \gamma is the forgetting factor. The concept of storing the historical gradient as a sum of squares is borrowed from Adagrad, but "forgetting" is introduced to solve Adagrad's diminishing learning rates in non-convex problems by gradually decreasing the influence of old data. The parameters are then updated as:

w := w - \frac{\eta}{\sqrt{v(w,t)}} \nabla Q_i(w)

RMSProp has shown good adaptation of learning rate in different applications. RMSProp can be seen as a generalization of Rprop and can also work with mini-batches, as opposed to only full batches.
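The two formulas above transcribe almost directly into code; a small eps is added in the denominator as a standard numerical safeguard, though it is not part of the formulas as written.

<syntaxhighlight lang="python">
import numpy as np

def rmsprop(grad, w, n, eta=0.001, gamma=0.9, eps=1e-8, steps=1000):
    rng = np.random.default_rng(0)
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w, rng.integers(n))
        v = gamma * v + (1 - gamma) * g * g   # v(w,t) := gamma*v + (1-gamma)*g^2
        w = w - eta * g / (np.sqrt(v) + eps)  # w := w - eta/sqrt(v) * g
    return w
</syntaxhighlight>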
=== Adam ===
Adam (short for Adaptive Moment Estimation) is a 2014 update to the
RMSProp optimizer combining it with the main feature of the
Momentum method. In this optimization algorithm, running averages with exponential forgetting of both the gradients and the second moments of the gradients are used. Given parameters w^{(t)} and a loss function L^{(t)}, where t indexes the current training iteration (indexed at 1), Adam's parameter update is given by:

m_w^{(t)} := \beta_1 m_w^{(t-1)} + \left(1 - \beta_1\right) \nabla_w L^{(t-1)}
v_w^{(t)} := \beta_2 v_w^{(t-1)} + \left(1 - \beta_2\right) \left(\nabla_w L^{(t-1)}\right)^2
\hat{m}_w^{(t)} = \frac{m_w^{(t)}}{1 - \beta_1^t}
\hat{v}_w^{(t)} = \frac{v_w^{(t)}}{1 - \beta_2^t}
w^{(t)} := w^{(t-1)} - \eta \frac{\hat{m}_w^{(t)}}{\sqrt{\hat{v}_w^{(t)}} + \varepsilon}

where \varepsilon is a small scalar (e.g. 10^{-8}) used to prevent division by 0, and \beta_1 (e.g. 0.9) and \beta_2 (e.g. 0.999) are the forgetting factors for gradients and second moments of gradients, respectively. Squaring and square-rooting are done element-wise. Since the exponential moving averages of the gradient m_w^{(t)} and the squared gradient v_w^{(t)} are initialized with a vector of 0's, they are biased towards zero in the first training iterations; the factors \tfrac{1}{1 - \beta_1^t} and \tfrac{1}{1 - \beta_2^t} compensate for this bias and give the better estimates \hat{m}_w^{(t)} and \hat{v}_w^{(t)}. The initial proof establishing the convergence of Adam was incomplete, and subsequent analysis has revealed that Adam does not converge for all convex objectives. Despite this,
Adam continues to be used due to its strong performance in practice.
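The update above translates nearly line-for-line into code; the following sketch uses the default hyperparameters quoted in the text (function and variable names are illustrative).

<syntaxhighlight lang="python">
import numpy as np

def adam(grad, w, n, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    rng = np.random.default_rng(0)
    m = np.zeros_like(w)   # first-moment (gradient) moving average
    v = np.zeros_like(w)   # second-moment (squared-gradient) moving average
    for t in range(1, steps + 1):
        g = grad(w, rng.integers(n))
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for zero initialization
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w
</syntaxhighlight>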
==== Variants ====
The popularity of Adam inspired many variants and enhancements. Some examples include:
• Nesterov-enhanced gradients: NAdam, FASFA
• varying interpretations of second-order information: Powerpropagation and AdaSqrt
• using the infinity norm: AdaMax, which improves convergence over Adam by using the maximum of past squared gradients instead of the exponential average; AdamX further improves convergence over AMSGrad
• AdamW, which improves the weight decay
=== Sign-based stochastic gradient descent ===
Even though sign-based optimization goes back to the aforementioned Rprop, in 2018 researchers tried to simplify Adam by removing the magnitude of the stochastic gradient from being taken into account and considering only its sign, i.e. updating with w := w - \eta\, \operatorname{sign}(\nabla Q_i(w)). This results in a significantly lower communication cost of transferring gradients from workers to the parameter server. In this sense, it serves to better compress the gradient information, while having convergence comparable to standard SGD.

=== Second-order methods ===
A stochastic analogue of the standard (deterministic) Newton–Raphson algorithm (a "second-order" method) provides an asymptotically optimal or near-optimal form of iterative optimization in the setting of stochastic approximation. However, directly determining the required Hessian matrices for optimization may not be possible in practice. Practical and theoretically sound methods for second-order versions of SGD that do not require direct Hessian information are given by Spall and others. (A less efficient method based on finite differences, instead of simultaneous perturbations, is given by Ruppert.) Another approach to approximating the Hessian matrix is to replace it with the Fisher information matrix, which transforms the usual gradient into the natural gradient. These methods, which do not require direct Hessian information, are based on either values of the summands in the above empirical risk function or values of the gradients of the summands (i.e., the SGD inputs). In particular, second-order optimality is asymptotically achievable without direct calculation of the Hessian matrices of the summands in the empirical risk function. When the objective is a
nonlinear least-squares loss Q(w) = \frac{1}{n} \sum_{i=1}^n Q_i(w) = \frac{1}{n} \sum_{i=1}^n (m(w;x_i)-y_i)^2, where m(w;x_i) is the predictive model (e.g., a
deep neural network), the objective's structure can be exploited to estimate second-order information using gradients only: in the Gauss–Newton spirit, the Hessian of each summand is approximated by 2\, \nabla_w m(w;x_i)\, \nabla_w m(w;x_i)', which involves only first derivatives of m. The resulting methods are simple and often effective.

== Approximations in continuous time ==