Suppose that for a known
real matrix A and vector \mathbf{b}, we wish to find a vector \mathbf{x} such that A\mathbf{x} = \mathbf{b}, where \mathbf{x} and \mathbf{b} may be of different sizes and A may even be non-square. The standard approach is
ordinary least squares linear regression. However, if no \mathbf{x} satisfies the equation or more than one \mathbf{x} does—that is, the solution is not unique—the problem is said to be
ill posed. In such cases, ordinary least squares estimation leads to an
overdetermined, or more often an
underdetermined system of equations. Most real-world phenomena have the effect of
low-pass filters in the forward direction where A maps \mathbf{x} to \mathbf{b}. Therefore, in solving the inverse problem, the inverse mapping operates as a
high-pass filter that has the undesirable tendency of amplifying noise (
eigenvalues / singular values are largest in the reverse mapping where they were smallest in the forward mapping). In addition, ordinary least squares implicitly nullifies every element of the reconstructed version of \mathbf{x} that is in the null-space of A, rather than allowing for a model to be used as a prior for \mathbf{x}. Ordinary least squares seeks to minimize the sum of squared
residuals, which can be compactly written as \left\|A\mathbf{x} - \mathbf{b}\right\|_2^2, where \|\cdot\|_2 is the
Euclidean norm. In order to give preference to a particular solution with desirable properties, a regularization term can be included in this minimization: \left\|A\mathbf{x} - \mathbf{b}\right\|_2^2 + \left\|\Gamma \mathbf{x}\right\|_2^2=\left\|\mathcal{A}\mathbf{x} - \mathcal{b}\right\|_2^2, where \mathcal{A}=\begin{pmatrix}A\\\Gamma\end{pmatrix} and \mathcal{b}=\begin{pmatrix}\mathbf{b}\\\boldsymbol0\end{pmatrix}, for some suitably chosen
Tikhonov matrix \Gamma . In many cases, this matrix is chosen as a scalar multiple of the
identity matrix (\Gamma = \alpha I), giving preference to solutions with smaller
norms; this is known as L_2 regularization. In other cases, high-pass operators (e.g., a
difference operator or a weighted
Fourier operator) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous. This regularization improves the conditioning of the problem, thus enabling a direct numerical solution. Treating it as an ordinary least squares problem with
augmented matrices \mathcal{A} and \mathcal{b}, the solution is \hat\mathbf{x} = (\mathcal{A}^\mathsf{T} \mathcal{A})^{-1} \mathcal{A}^\mathsf{T} \mathcal{b} = (A^\mathsf{T} A + \Gamma^\mathsf{T} \Gamma)^{-1} A^\mathsf{T} \mathbf{b}. The effect of regularization may be varied by the scale of the matrix \Gamma. For \Gamma = 0 this reduces to the unregularized least-squares solution, provided that (A^\mathsf{T} A)^{-1} exists. Note that in case of a
complex matrix A, as usual the transpose A^\mathsf{T} has to be replaced by the
Hermitian transpose A^\mathsf{H}. L_2 regularization is used in many contexts aside from linear regression, such as
classification with
logistic regression or
support vector machines, and matrix factorization.
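As a concrete illustration of the solution formula above, the following minimal sketch (assuming NumPy; the matrix A, the vector \mathbf{b}, and the choice \Gamma = \alpha I are synthetic, for illustration only) computes \hat\mathbf{x} both from the regularized normal equations and from the augmented least-squares system, and checks that the two agree.

```python
import numpy as np

# Synthetic ill-conditioned problem (illustrative only): the columns of A are
# scaled by rapidly decaying factors, mimicking a low-pass forward operator.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 10)) @ np.diag(10.0 ** -np.arange(10))
x_true = rng.normal(size=10)
b = A @ x_true + 0.01 * rng.normal(size=20)

alpha = 0.1
Gamma = alpha * np.eye(10)  # Tikhonov matrix Gamma = alpha * I (L_2 / ridge regularization)

# Regularized normal equations: x_hat = (A^T A + Gamma^T Gamma)^{-1} A^T b
x_hat = np.linalg.solve(A.T @ A + Gamma.T @ Gamma, A.T @ b)

# Equivalent augmented least-squares problem with stacked matrices [A; Gamma] and [b; 0]
A_aug = np.vstack([A, Gamma])
b_aug = np.concatenate([b, np.zeros(10)])
x_aug, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)

assert np.allclose(x_hat, x_aug)  # both formulations give the same regularized estimate
```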
==Application to existing fit results==
Since Tikhonov regularization simply adds a quadratic term to the objective function in optimization problems, this can be done after the unregularized optimization has taken place. For example, if the above problem with \Gamma = 0 yields the solution \hat\mathbf{x}_0, the solution in the presence of \Gamma \ne 0 can be expressed as \hat\mathbf{x} = B \hat\mathbf{x}_0, with the "regularization matrix" B = \left(A^\mathsf{T} A + \Gamma^\mathsf{T} \Gamma\right)^{-1} A^\mathsf{T} A. If the parameter fit comes with a covariance matrix of the estimated parameter uncertainties V_0, then the regularization matrix will be B = (V_0^{-1} + \Gamma^\mathsf{T}\Gamma)^{-1} V_0^{-1}, and the regularized result will have a new covariance V = B V_0 B^\mathsf{T}. In the context of arbitrary likelihood fits, this is valid as long as the quadratic approximation of the likelihood function is valid. This means that, as long as the perturbation from the unregularized result is small, one can regularize any result that is presented as a best-fit point with a covariance matrix. No detailed knowledge of the underlying likelihood function is needed.
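As a minimal sketch of this post-fit regularization (the best-fit point \hat\mathbf{x}_0, its covariance V_0, and the choice of \Gamma below are hypothetical, assuming NumPy):

```python
import numpy as np

# Hypothetical result of an earlier, unregularized fit: best-fit point and covariance.
x0_hat = np.array([1.2, -0.7, 0.3])
V0 = np.array([[0.50, 0.30, 0.10],
               [0.30, 0.40, 0.05],
               [0.10, 0.05, 0.20]])

Gamma = 1.0 * np.eye(3)  # Tikhonov matrix (illustrative choice)

# Regularization matrix B = (V0^{-1} + Gamma^T Gamma)^{-1} V0^{-1}
V0_inv = np.linalg.inv(V0)
B = np.linalg.solve(V0_inv + Gamma.T @ Gamma, V0_inv)

x_hat = B @ x0_hat   # regularized best-fit point
V = B @ V0 @ B.T     # covariance of the regularized result
```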
==Generalized Tikhonov regularization==
For general multivariate normal distributions for \mathbf x and the data error, one can apply a transformation of the variables to reduce to the case above. Equivalently, one can seek an \mathbf x to minimize \left\|A \mathbf x - \mathbf b\right\|_P^2 + \left\|\mathbf x - \mathbf x_0\right\|_Q^2, where we have used \left\|\mathbf{x}\right\|_Q^2 to stand for the weighted norm squared \mathbf{x}^\mathsf{T} Q \mathbf{x} (compare with the
Mahalanobis distance). In the Bayesian interpretation P is the inverse
covariance matrix of \mathbf b, \mathbf x_0 is the
expected value of \mathbf x, and Q is the inverse covariance matrix of \mathbf x. The Tikhonov matrix is not explicitly included because the corresponding regularization term \left\|\Gamma \mathbf x - \mathbf x_0'\right\|_{Q'}^2 reduces to the form above with \Gamma \mathbf x_0=\mathbf x_0' and Q=\Gamma^\mathsf{T} Q' \Gamma. For standard regularization, where Q'=I, the Tikhonov matrix then appears in the
Cholesky factorization Q = \Gamma^\mathsf{T} \Gamma and is considered a
whitening filter. This generalized problem has an optimal solution \hat\mathbf{x} which can be written explicitly using the formula \hat\mathbf{x} = \left(A^\mathsf{T} P A + Q\right)^{-1} \left(A^\mathsf{T} P \mathbf{b} + Q \mathbf{x}_0\right) = \mathbf x_0 + \left(A^\mathsf{T} P A + Q \right)^{-1} \left(A^\mathsf{T} P \left(\mathbf b - A \mathbf x_0\right)\right).
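As a minimal sketch of this generalized solution formula (the forward matrix A, the weight matrices P and Q, and the prior mean \mathbf{x}_0 below are illustrative assumptions, using NumPy):

```python
import numpy as np

# Illustrative generalized Tikhonov problem (all quantities assumed for demonstration).
rng = np.random.default_rng(1)
A = rng.normal(size=(15, 5))
x0 = np.zeros(5)                       # expected value of x (prior mean)
b = A @ rng.normal(size=5) + 0.1 * rng.normal(size=15)

P = np.eye(15) / 0.1**2                # inverse covariance matrix of the data error
Q = np.eye(5)                          # inverse covariance matrix of x

# x_hat = x0 + (A^T P A + Q)^{-1} A^T P (b - A x0)
x_hat = x0 + np.linalg.solve(A.T @ P @ A + Q, A.T @ P @ (b - A @ x0))
```

==Lavrentyev regularization==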