The minimum of the sum of squares is found by setting the gradient to zero. Since the model contains m parameters, there are m gradient equations:

\frac{\partial S}{\partial \beta_j} = 2\sum_i r_i\frac{\partial r_i}{\partial \beta_j} = 0,\ j=1,\ldots,m,

and since r_i = y_i - f(x_i,\boldsymbol \beta), the gradient equations become

-2\sum_i r_i\frac{\partial f(x_i,\boldsymbol \beta)}{\partial \beta_j} = 0,\ j=1,\ldots,m.

The gradient equations apply to all least squares problems. Each particular problem requires particular expressions for the model and its partial derivatives.
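For instance, for the straight-line model f(x, \boldsymbol\beta) = \beta_1 + \beta_2 x the residuals are r_i = y_i - \beta_1 - \beta_2 x_i, and the two gradient equations reduce to \sum_i r_i = 0 and \sum_i r_i x_i = 0.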
==Linear least squares==

A regression model is a linear one when the model comprises a linear combination of the parameters, i.e.,

f(x, \boldsymbol \beta) = \sum_{j=1}^m \beta_j \phi_j(x),

where the functions \phi_j depend only on x.

Letting X_{ij} = \phi_j(x_i) and collecting the observations y_i in the vector Y, the sum of squared residuals can be written as the loss

L(D, \boldsymbol{\beta}) = \left\|Y - X\boldsymbol{\beta} \right\|^2 = (Y - X\boldsymbol{\beta})^\mathsf{T} (Y - X\boldsymbol{\beta}) = Y^\mathsf{T}Y - 2Y^\mathsf{T}X\boldsymbol{\beta} + \boldsymbol{\beta}^\mathsf{T}X^\mathsf{T}X\boldsymbol{\beta}.

The gradient of the loss is

\frac{\partial L(D, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \frac{\partial \left(Y^\mathsf{T}Y - 2Y^\mathsf{T}X\boldsymbol{\beta} + \boldsymbol{\beta}^\mathsf{T}X^\mathsf{T}X\boldsymbol{\beta}\right)}{\partial \boldsymbol{\beta}} = -2X^\mathsf{T}Y + 2X^\mathsf{T}X\boldsymbol{\beta}.

Setting the gradient of the loss to zero and solving for \boldsymbol{\beta}, we get

-2X^\mathsf{T}Y + 2X^\mathsf{T}X\boldsymbol{\beta} = 0 \Rightarrow X^\mathsf{T}Y = X^\mathsf{T}X\boldsymbol{\beta},

\boldsymbol{\hat{\beta}} = \left(X^\mathsf{T}X\right)^{-1} X^\mathsf{T}Y.
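A minimal numerical sketch of this closed-form solution, assuming the basis functions \phi_1(x) = 1 and \phi_2(x) = x and made-up data chosen only for illustration:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative data; fit f(x, beta) = beta_1 + beta_2 * x by ordinary least squares.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.9])

# Design matrix X with X[i, j] = phi_j(x_i), here phi_1(x) = 1 and phi_2(x) = x.
X = np.column_stack([np.ones_like(x), x])

# Normal equations (X^T X) beta = X^T Y solved directly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent but numerically more stable route via an orthogonal factorization.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)
</syntaxhighlight>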
==Non-linear least squares==

There is, in some cases, a closed-form solution to a non-linear least squares problem, but in general there is not. In the case of no closed-form solution, numerical algorithms are used to find the value of the parameters \beta that minimizes the objective. Most algorithms involve choosing initial values for the parameters. Then, the parameters are refined iteratively, that is, the values are obtained by successive approximation:

{\beta_j}^{k+1} = {\beta_j}^k + \Delta \beta_j,

where the superscript k is an iteration number, and the vector of increments \Delta \beta_j is called the shift vector. In some commonly used algorithms, at each iteration the model may be linearized by approximation to a first-order Taylor series expansion about \boldsymbol \beta^k:

\begin{align} f(x_i,\boldsymbol \beta) &\approx f^k(x_i,\boldsymbol \beta) +\sum_j \frac{\partial f(x_i,\boldsymbol \beta)}{\partial \beta_j} \left(\beta_j-{\beta_j}^k \right) \\[1ex] &= f^k(x_i,\boldsymbol \beta) +\sum_j J_{ij} \,\Delta\beta_j. \end{align}

The Jacobian J is a function of constants, the independent variable and the parameters, so it changes from one iteration to the next. The residuals are given by

r_i = y_i - f^k(x_i, \boldsymbol \beta) - \sum_{j=1}^{m} J_{ij}\,\Delta\beta_j = \Delta y_i - \sum_{j=1}^m J_{ij}\,\Delta\beta_j.

To minimize the sum of squares of r_i, the gradient equation is set to zero and solved for \Delta \beta_j:

-2\sum_{i=1}^n J_{ij} \left( \Delta y_i-\sum_{k=1}^m J_{ik} \, \Delta \beta_k \right) = 0,

which, on rearrangement, become m simultaneous linear equations, the normal equations:

\sum_{i=1}^{n}\sum_{k=1}^m J_{ij} J_{ik} \, \Delta \beta_k = \sum_{i=1}^n J_{ij} \, \Delta y_i \qquad (j=1,\ldots,m).

The normal equations are written in matrix notation as

\left(\mathbf{J}^\mathsf{T} \mathbf{J}\right) \Delta \boldsymbol \beta = \mathbf{J}^\mathsf{T}\Delta \mathbf{y},

or, if weights \mathbf{W} are used,

\left(\mathbf{J}^\mathsf{T}\mathbf{W}\mathbf{J}\right) \Delta \boldsymbol \beta = \mathbf{J}^\mathsf{T}\mathbf{W}\,\Delta \mathbf{y}.

These are the defining equations of the Gauss–Newton algorithm.
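The following is a minimal sketch of such an iteration, assuming an illustrative model f(x,\boldsymbol\beta) = \beta_1 e^{\beta_2 x}, made-up data, made-up starting values, and a simple convergence criterion on the size of the shift vector:

<syntaxhighlight lang="python">
import numpy as np

# Gauss-Newton sketch for the illustrative model f(x, beta) = b1 * exp(b2 * x).
def model(x, beta):
    b1, b2 = beta
    return b1 * np.exp(b2 * x)

def jacobian(x, beta):
    b1, b2 = beta
    # J[i, j] = partial derivative of f(x_i, beta) with respect to beta_j
    return np.column_stack([np.exp(b2 * x), b1 * x * np.exp(b2 * x)])

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.2, 0.9, 0.5, 0.3])

beta = np.array([1.0, -0.5])            # initial values are required for NLLSQ
for _ in range(50):
    J = jacobian(x, beta)
    dy = y - model(x, beta)             # residual vector, Delta y in the text
    # Normal equations (J^T J) Delta beta = J^T Delta y give the shift vector.
    delta = np.linalg.solve(J.T @ J, J.T @ dy)
    beta = beta + delta
    if np.linalg.norm(delta) < 1e-10:   # convergence criterion
        break
</syntaxhighlight>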
==Differences between linear and nonlinear least squares==

• The model function, f, in LLSQ (linear least squares) is a linear combination of parameters of the form f = X_{i1}\beta_1 + X_{i2}\beta_2 +\cdots The model may represent a straight line, a parabola or any other linear combination of functions. In NLLSQ (nonlinear least squares) the parameters appear as functions, such as \beta^2, e^{\beta x} and so forth. If the derivatives \partial f / \partial \beta_j are either constant or depend only on the values of the independent variable, the model is linear in the parameters. Otherwise, the model is nonlinear.
• Initial values for the parameters are needed to find the solution to a NLLSQ problem; LLSQ does not require them.
• Solution algorithms for NLLSQ often require that the Jacobian can be calculated, similarly to LLSQ. Analytical expressions for the partial derivatives can be complicated. If analytical expressions are impossible to obtain, either the partial derivatives must be calculated by numerical approximation or an estimate must be made of the Jacobian, often via finite differences.
• Non-convergence (failure of the algorithm to find a minimum) is a common phenomenon in NLLSQ.
• LLSQ is globally convex, so non-convergence is not an issue.
• Solving NLLSQ is usually an iterative process which has to be terminated when a convergence criterion is satisfied. LLSQ solutions can be computed using direct methods, although problems with large numbers of parameters are typically solved with iterative methods, such as the Gauss–Seidel method.
• In LLSQ the solution is unique, but in NLLSQ there may be multiple minima in the sum of squares.
• Under the condition that the errors are uncorrelated with the predictor variables, LLSQ yields unbiased estimates, but even under that condition NLLSQ estimates are generally biased.

These differences must be considered whenever the solution to a nonlinear least squares problem is being sought.

==Example==