If we choose the conjugate vectors \mathbf{p}_k carefully, then we may not need all of them to obtain a good approximation to the solution \mathbf{x}_*. So, we want to regard the conjugate gradient method as an iterative method. This also allows us to approximately solve systems where n is so large that the direct method would take too much time. We denote the initial guess for \mathbf{x}_* by \mathbf{x}_0 (we can assume without loss of generality that \mathbf{x}_0 = \mathbf{0}; otherwise, consider the system \mathbf{Az} = \mathbf{b} - \mathbf{Ax}_0 instead). Starting with \mathbf{x}_0, we search for the solution, and in each iteration we need a metric to tell us whether we have moved closer to the (still unknown) solution \mathbf{x}_*. This metric comes from the fact that the solution \mathbf{x}_* is also the unique minimizer of the following
quadratic function

: f(\mathbf{x}) = \tfrac12 \mathbf{x}^\mathsf{T} \mathbf{A}\mathbf{x} - \mathbf{x}^\mathsf{T} \mathbf{b}, \qquad \mathbf{x}\in\mathbb{R}^n \,.

The existence of a unique minimizer is apparent since its Hessian matrix of second derivatives is symmetric positive-definite,

: \mathbf{H}(f(\mathbf{x})) = \mathbf{A} \,,

and the fact that the minimizer solves the initial problem follows from its first derivative,

: \nabla f(\mathbf{x}) = \mathbf{A} \mathbf{x} - \mathbf{b} \,,

which vanishes exactly when \mathbf{A}\mathbf{x} = \mathbf{b}. This suggests taking the first basis vector \mathbf{p}_0 to be the negative of the gradient of f at \mathbf{x} = \mathbf{x}_0. Since the gradient of f equals \mathbf{Ax} - \mathbf{b} and we start with the initial guess \mathbf{x}_0, this means we take \mathbf{p}_0 = \mathbf{b} - \mathbf{Ax}_0. The other vectors in the basis will be conjugate to the gradient, hence the name
conjugate gradient method. Note that \mathbf{p}_0 is also the
residual provided by this initial step of the algorithm. Let \mathbf{r}_k be the
residual at the kth step:

: \mathbf{r}_k = \mathbf{b} - \mathbf{Ax}_k.

As observed above, \mathbf{r}_k is the negative gradient of f at \mathbf{x}_k, so the
gradient descent method would require moving in the direction \mathbf{r}_k. Here, however, we insist that the directions \mathbf{p}_k must be conjugate to each other. A practical way to enforce this is to require that the next search direction be built out of the current residual and all previous search directions. The conjugation constraint is an orthonormality-type constraint, and hence the algorithm can be viewed as an example of
Gram–Schmidt orthonormalization. This gives the following expression:

:\mathbf{p}_{k} = \mathbf{r}_{k} - \sum_{i < k} \frac{\mathbf{p}_i^\mathsf{T} \mathbf{A} \mathbf{r}_k}{\mathbf{p}_i^\mathsf{T} \mathbf{A} \mathbf{p}_i} \mathbf{p}_i

(see the picture at the top of the article for the effect of the conjugacy constraint on convergence). Following this direction, the next optimal location is given by

: \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k

with

: \alpha_{k} = \frac{\mathbf{p}_k^\mathsf{T} (\mathbf{b} - \mathbf{Ax}_k)}{\mathbf{p}_k^\mathsf{T} \mathbf{A} \mathbf{p}_k} = \frac{\mathbf{p}_{k}^\mathsf{T} \mathbf{r}_{k}}{\mathbf{p}_{k}^\mathsf{T} \mathbf{A} \mathbf{p}_{k}},

where the last equality follows from the definition of \mathbf{r}_k. The expression for \alpha_k can be derived by substituting the expression for \mathbf{x}_{k+1} into f and minimizing it with respect to \alpha_k:

:\begin{align} f(\mathbf{x}_{k+1}) &= f(\mathbf{x}_k + \alpha_k \mathbf{p}_k) =: g(\alpha_k) \\ g'(\alpha_k) &\overset{!}{=} 0 \quad \Rightarrow \quad \alpha_{k} = \frac{\mathbf{p}_k^\mathsf{T} (\mathbf{b} - \mathbf{Ax}_k)}{\mathbf{p}_k^\mathsf{T} \mathbf{A p}_k} \,. \end{align}
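To make the construction concrete, here is a minimal Julia sketch of this direct formulation, complementing the iterative implementations given later. The function name cg_direct, the zero initial guess, and the convergence guard are illustrative assumptions; the conjugation sum and the step size follow the formulas above, and all previous directions are stored explicitly, which is exactly the cost the next section shows how to avoid.

<syntaxhighlight lang="julia">
using LinearAlgebra

# "Direct" conjugate gradient: build each new direction by explicitly
# conjugating the current residual against all previous directions via
# the Gram–Schmidt-type sum above. O(n^2) storage; for exposition only.
function cg_direct(A, b)
    n = length(b)
    x = zeros(n)                          # assume x0 = 0
    dirs = Vector{Vector{Float64}}()      # stores p_0, ..., p_{k-1}
    for _ in 1:n
        r = b - A * x                     # residual = negative gradient of f
        norm(r) <= eps(norm(b)) && break  # already converged
        # conjugate the residual against all previous directions:
        p = r - sum((dot(q, A * r) / dot(q, A * q)) * q for q in dirs;
                    init = zeros(n))
        α = dot(p, r) / dot(p, A * p)     # exact line search along p
        x += α * p
        push!(dirs, p)
    end
    return x
end
</syntaxhighlight>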
===The resulting algorithm===

The above algorithm gives the most straightforward explanation of the conjugate gradient method. Seemingly, the algorithm as stated requires storage of all previous search directions and residual vectors, as well as many matrix–vector multiplications, and thus can be computationally expensive. However, a closer analysis of the algorithm shows that \mathbf{r}_i is orthogonal to \mathbf{r}_j, i.e. \mathbf{r}_i^\mathsf{T} \mathbf{r}_j = 0, for i \neq j, and \mathbf{p}_i is \mathbf{A}-orthogonal to \mathbf{p}_j, i.e. \mathbf{p}_i^\mathsf{T} \mathbf{A} \mathbf{p}_j = 0, for i \neq j. This can be understood as follows: as the algorithm progresses, \mathbf{p}_i and \mathbf{r}_i span the same
Krylov subspace, where the \mathbf{r}_i form the orthogonal basis with respect to the standard inner product, and the \mathbf{p}_i form the orthogonal basis with respect to the inner product induced by \mathbf{A}. Therefore, \mathbf{x}_k can be regarded as the projection of \mathbf{x} on the Krylov subspace. That is, if the CG method starts with \mathbf{x}_0 = \mathbf{0}, then

:\mathbf{x}_k = \mathrm{argmin}_{\mathbf{y}\in\mathbb{R}^n} \left\{ (\mathbf{x}_*-\mathbf{y})^\mathsf{T} \mathbf{A} (\mathbf{x}_*-\mathbf{y}) : \mathbf{y} \in \operatorname{span}\left\{\mathbf{b}, \mathbf{Ab}, \ldots, \mathbf{A}^{k-1}\mathbf{b}\right\} \right\},

where \mathbf{x}_* is the solution to \mathbf{A}\mathbf{x} = \mathbf{b}. The algorithm is detailed below for solving \mathbf{A}\mathbf{x} = \mathbf{b} where \mathbf{A} is a real, symmetric, positive-definite matrix. The input vector \mathbf{x}_0 can be an approximate initial solution or \mathbf{0}. It is a different formulation of the exact procedure described above.

:\begin{align} & \mathbf{r}_0 := \mathbf{b} - \mathbf{A x}_0 \\ & \hbox{if } \mathbf{r}_{0} \text{ is sufficiently small, then return } \mathbf{x}_{0} \text{ as the result}\\ & \mathbf{p}_0 := \mathbf{r}_0 \\ & k := 0 \\ & \text{repeat} \\ & \qquad \alpha_k := \frac{\mathbf{r}_k^\mathsf{T} \mathbf{r}_k}{\mathbf{p}_k^\mathsf{T} \mathbf{A p}_k} \\ & \qquad \mathbf{x}_{k+1} := \mathbf{x}_k + \alpha_k \mathbf{p}_k \\ & \qquad \mathbf{r}_{k+1} := \mathbf{r}_k - \alpha_k \mathbf{A p}_k \\ & \qquad \hbox{if } \mathbf{r}_{k+1} \text{ is sufficiently small, then exit loop} \\ & \qquad \beta_k := \frac{\mathbf{r}_{k+1}^\mathsf{T} \mathbf{r}_{k+1}}{\mathbf{r}_k^\mathsf{T} \mathbf{r}_k} \\ & \qquad \mathbf{p}_{k+1} := \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k \\ & \qquad k := k + 1 \\ & \text{end repeat} \\ & \text{return } \mathbf{x}_{k+1} \text{ as the result} \end{align}

This is the most commonly used algorithm. The same formula for \beta_k is also used in the Fletcher–Reeves
nonlinear conjugate gradient method.
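The stated orthogonality relations can be observed numerically. The following Julia sketch runs the loop above on a randomly generated symmetric positive-definite test problem (the helper name cg_history and the test sizes are illustrative assumptions) and prints the largest off-diagonal inner products, which come out at roundoff level.

<syntaxhighlight lang="julia">
using LinearAlgebra

# Run m steps of the loop above, saving every residual and search direction.
function cg_history(A, b, m)
    x = zeros(length(b)); r = b - A * x; p = copy(r)
    rs, ps = [copy(r)], [copy(p)]
    for _ in 1:m-1
        Ap = A * p
        α = dot(r, r) / dot(p, Ap)
        x += α * p
        rnew = r - α * Ap
        p = rnew + (dot(rnew, rnew) / dot(r, r)) * p
        r = rnew
        push!(rs, copy(r)); push!(ps, copy(p))
    end
    return rs, ps
end

# Illustrative SPD test problem (names and sizes are arbitrary choices).
n = 5
M = randn(n, n); A = M' * M + n * I; b = randn(n)
rs, ps = cg_history(A, b, n)

# Both maxima come out at roundoff level (~1e-15):
@show maximum(abs(dot(rs[i], rs[j])) for i in 1:n, j in 1:n if i != j)
@show maximum(abs(dot(ps[i], A * ps[j])) for i in 1:n, j in 1:n if i != j)
</syntaxhighlight>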
====Restarts====

We note that \mathbf{x}_{1} is computed by the
gradient descent method applied to \mathbf{x}_{0}. Setting \beta_{k}=0 would similarly make \mathbf{x}_{k+1} computed by the
gradient descent method from \mathbf{x}_{k}; i.e., it can be used as a simple implementation of a restart of the conjugate gradient iterations.

====Explicit residual calculation====

A norm of the residual is typically used as the stopping criterion. The norm of the explicit residual \mathbf{r}_{k+1} := \mathbf{b} - \mathbf{A x}_{k+1} provides a guaranteed level of accuracy both in exact arithmetic and in the presence of the
rounding errors, where convergence naturally stagnates. In contrast, the implicit residual \mathbf{r}_{k+1} := \mathbf{r}_k - \alpha_k \mathbf{A p}_k is known to keep getting smaller in amplitude well below the level of
rounding errors and thus cannot be used to determine the stagnation of convergence.
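A minimal sketch of how a restart can be wired into the main loop is given below, in a Julia setting like the example code later in this section; the function name cg_restarted and the restart length m are illustrative assumptions, not part of the method. Every m-th iteration it sets \beta_k = 0 (a plain gradient descent step) and recomputes the explicit residual, while all other iterations use the cheap implicit update.

<syntaxhighlight lang="julia">
using LinearAlgebra

# Conjugate gradient with periodic restarts. A sketch under the assumptions
# stated above; `m` and the function name are illustrative choices.
function cg_restarted(A, b; m = 50, atol = length(b) * eps(norm(b)))
    x = zero(b)
    r = b - A * x
    p = copy(r)
    rsold = dot(r, r)
    k = 0
    while sqrt(rsold) > atol
        Ap = A * p
        α = rsold / dot(p, Ap)
        x += α * p
        k += 1
        if k % m == 0
            r = b - A * x   # explicit residual: reliable near stagnation
            p = copy(r)     # β = 0: restart with a gradient descent step
        else
            r -= α * Ap     # implicit residual: saves one product with A
            p = r + (dot(r, r) / rsold) * p
        end
        rsold = dot(r, r)
    end
    return x
end
</syntaxhighlight>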
====Computation of alpha and beta====

In the algorithm, \alpha_k is chosen such that \mathbf{r}_{k+1} is orthogonal to \mathbf{r}_{k}. The denominator is simplified from

:\alpha_k = \frac{\mathbf{r}_{k}^\mathsf{T} \mathbf{r}_{k}}{\mathbf{r}_{k}^\mathsf{T} \mathbf{A} \mathbf{p}_k} = \frac{\mathbf{r}_k^\mathsf{T} \mathbf{r}_k}{\mathbf{p}_k^\mathsf{T} \mathbf{A p}_k}

since \mathbf{r}_{k} = \mathbf{p}_{k} - \beta_{k-1}\mathbf{p}_{k-1} and \mathbf{p}_{k-1} is conjugate to \mathbf{p}_k. The \beta_k is chosen such that \mathbf{p}_{k+1} is conjugate to \mathbf{p}_{k}. Initially, \beta_k is

:\beta_k = - \frac{\mathbf{r}_{k+1}^\mathsf{T} \mathbf{A} \mathbf{p}_k}{\mathbf{p}_k^\mathsf{T} \mathbf{A} \mathbf{p}_k}.

Using

:\mathbf{r}_{k+1} = \mathbf{r}_{k} - \alpha_{k} \mathbf{A} \mathbf{p}_{k},

and equivalently \mathbf{A} \mathbf{p}_{k} = \frac{1}{\alpha_{k}} (\mathbf{r}_{k} - \mathbf{r}_{k+1}), the numerator of \beta_k is rewritten as

: \mathbf{r}_{k+1}^\mathsf{T} \mathbf{A} \mathbf{p}_k = \frac{1}{\alpha_k} \mathbf{r}_{k+1}^\mathsf{T} (\mathbf{r}_k - \mathbf{r}_{k+1}) = - \frac{1}{\alpha_k} \mathbf{r}_{k+1}^\mathsf{T} \mathbf{r}_{k+1}

because \mathbf{r}_{k+1} and \mathbf{r}_{k} are orthogonal by design. The denominator is rewritten as

: \mathbf{p}_k^\mathsf{T} \mathbf{A} \mathbf{p}_k = (\mathbf{r}_k + \beta_{k-1} \mathbf{p}_{k-1})^\mathsf{T} \mathbf{A} \mathbf{p}_k = \frac{1}{\alpha_k} \mathbf{r}_k^\mathsf{T} (\mathbf{r}_k - \mathbf{r}_{k+1}) = \frac{1}{\alpha_k} \mathbf{r}_k^\mathsf{T} \mathbf{r}_k

using that the search directions \mathbf{p}_k are conjugate and again that the residuals are orthogonal. This gives the \beta_k in the algorithm after cancelling \alpha_k.
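As a sanity check on this algebra, the two expressions for \beta_k can be compared along an actual run; they agree to roundoff. A short Julia sketch (the helper compare_betas and the random test problem are illustrative assumptions):

<syntaxhighlight lang="julia">
using LinearAlgebra

# Compare the two β formulas along a CG run; they agree to roundoff.
function compare_betas(A, b, m)
    x = zero(b); r = b - A * x; p = copy(r)
    for _ in 1:m
        Ap = A * p
        α = dot(r, r) / dot(p, Ap)
        x += α * p
        rnew = r - α * Ap
        β_cheap = dot(rnew, rnew) / dot(r, r)  # form used in the algorithm
        β_gs    = -dot(rnew, Ap) / dot(p, Ap)  # original Gram–Schmidt form
        println(abs(β_cheap - β_gs))           # ~1e-16 in practice
        p = rnew + β_cheap * p
        r = rnew
    end
end

M = randn(6, 6); compare_betas(M' * M + 6I, randn(6), 4)
</syntaxhighlight>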
====Example code in Julia====

<syntaxhighlight lang="julia">
using LinearAlgebra

"""
    conjugate_gradient(A, b, x0 = zero(b); atol = length(b) * eps(norm(b)))

Return the solution to `A * x = b` using the conjugate gradient method.
`A` must be a symmetric positive-definite matrix or other linear operator.
`x0` is the initial guess for the solution (default is the zero vector).
`atol` is the absolute tolerance on the magnitude of the residual `b - A * x`
for convergence (the default scales with machine epsilon).
Returns the approximate solution vector `x`.
"""
function conjugate_gradient(
    A, b::AbstractVector, x0::AbstractVector = zero(b);
    atol = length(b) * eps(norm(b))
)
    x = copy(x0)          # initialize the solution
    r = b - A * x0        # initial residual
    p = copy(r)           # initial search direction
    r²old = r' * r        # squared norm of residual
    k = 0
    while r²old > atol^2  # iterate until convergence
        Ap = A * p                  # matrix-vector product
        α = r²old / (p' * Ap)       # step size
        @. x += α * p               # update solution
        # Update residual:
        if (k + 1) % 16 == 0
            # every 16 iterations, recompute the residual from scratch
            # to avoid accumulation of numerical errors
            r .= b .- A * x
        else
            # use the updating formula that saves one matrix-vector product
            @. r -= α * Ap
        end
        r²new = r' * r
        @. p = r + (r²new / r²old) * p  # update search direction
        r²old = r²new                   # update squared residual norm
        k += 1
    end
    return x
end
</syntaxhighlight>
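As a usage example, the function above can be applied to the 2×2 system of the numerical example below, whose exact solution is (1/11, 7/11):

<syntaxhighlight lang="julia">
A = [4.0 1.0; 1.0 3.0]         # symmetric positive-definite
b = [1.0, 2.0]
x = conjugate_gradient(A, b)   # converges to ≈ [0.0909, 0.6364]
</syntaxhighlight>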
====Example code in MATLAB====

<syntaxhighlight lang="matlab">
function x = conjugate_gradient(A, b, x0, tol)
% Return the solution to `A * x = b` using the conjugate gradient method.
% Reminder: A should be symmetric and positive definite.
    if nargin < 4
        tol = eps;                    % default tolerance (assumed)
    end
    x = x0;
    r = b - A * x;                    % initial residual
    p = r;                            % initial search direction
    rsold = r' * r;                   % squared norm of the residual
    while sqrt(rsold) > tol
        Ap = A * p;
        alpha = rsold / (p' * Ap);    % step size
        x = x + alpha * p;            % update solution
        r = r - alpha * Ap;           % update residual
        rsnew = r' * r;
        p = r + (rsnew / rsold) * p;  % update search direction
        rsold = rsnew;
    end
end
</syntaxhighlight>
===Numerical example===

Consider the linear system \mathbf{Ax} = \mathbf{b} given by

:\mathbf{A} \mathbf{x}= \begin{bmatrix} 4 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}.

We will perform two steps of the conjugate gradient method beginning with the initial guess

:\mathbf{x}_0 = \begin{bmatrix} 2 \\ 1 \end{bmatrix}

in order to find an approximate solution to the system.
====Solution====

For reference, the exact solution is

: \mathbf{x} = \begin{bmatrix} \frac{1}{11} \\ \frac{7}{11} \end{bmatrix} \approx \begin{bmatrix} 0.0909 \\ 0.6364 \end{bmatrix}.

Our first step is to calculate the residual vector \mathbf{r}_0 associated with \mathbf{x}_0. This residual is computed from the formula \mathbf{r}_0 = \mathbf{b} - \mathbf{Ax}_0, and in our case is equal to

:\mathbf{r}_0 = \begin{bmatrix} 1 \\ 2 \end{bmatrix} - \begin{bmatrix} 4 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} -8 \\ -3 \end{bmatrix} = \mathbf{p}_0.

Since this is the first iteration, we will use the residual vector \mathbf{r}_0 as our initial search direction \mathbf{p}_0; the method of selecting \mathbf{p}_k will change in further iterations. We now compute the scalar \alpha_0 using the relationship

: \alpha_0 = \frac{\mathbf{r}_0^\mathsf{T} \mathbf{r}_0}{\mathbf{p}_0^\mathsf{T} \mathbf{A p}_0} = \frac{\begin{bmatrix} -8 & -3 \end{bmatrix} \begin{bmatrix} -8 \\ -3 \end{bmatrix}}{ \begin{bmatrix} -8 & -3 \end{bmatrix} \begin{bmatrix} 4 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} -8 \\ -3 \end{bmatrix} } = \frac{73}{331} \approx 0.2205.

We can now compute \mathbf{x}_1 using the formula

:\mathbf{x}_1 = \mathbf{x}_0 + \alpha_0\mathbf{p}_0 = \begin{bmatrix} 2 \\ 1 \end{bmatrix} + \frac{73}{331} \begin{bmatrix} -8 \\ -3 \end{bmatrix} \approx \begin{bmatrix} 0.2356 \\ 0.3384 \end{bmatrix}.

This result completes the first iteration, the result being an "improved" approximate solution to the system, \mathbf{x}_1. We may now move on and compute the next residual vector \mathbf{r}_1 using the formula

:\mathbf{r}_1 = \mathbf{r}_0 - \alpha_0 \mathbf{A} \mathbf{p}_0 = \begin{bmatrix} -8 \\ -3 \end{bmatrix} - \frac{73}{331} \begin{bmatrix} 4 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} -8 \\ -3 \end{bmatrix} \approx \begin{bmatrix} -0.2810 \\ 0.7492 \end{bmatrix}.

Our next step in the process is to compute the scalar \beta_0 that will eventually be used to determine the next search direction \mathbf{p}_1:

:\beta_0 = \frac{\mathbf{r}_1^\mathsf{T} \mathbf{r}_1}{\mathbf{r}_0^\mathsf{T} \mathbf{r}_0} \approx \frac{\begin{bmatrix} -0.2810 & 0.7492 \end{bmatrix} \begin{bmatrix} -0.2810 \\ 0.7492 \end{bmatrix}}{\begin{bmatrix} -8 & -3 \end{bmatrix} \begin{bmatrix} -8 \\ -3 \end{bmatrix}} = 0.0088.

Now, using this scalar \beta_0, we can compute the next search direction \mathbf{p}_1 using the relationship

:\mathbf{p}_1 = \mathbf{r}_1 + \beta_0 \mathbf{p}_0 \approx \begin{bmatrix} -0.2810 \\ 0.7492 \end{bmatrix} + 0.0088 \begin{bmatrix} -8 \\ -3 \end{bmatrix} = \begin{bmatrix} -0.3511 \\ 0.7229 \end{bmatrix}.

We now compute the scalar \alpha_1 using our newly acquired \mathbf{p}_1, by the same method as that used for \alpha_0:

: \alpha_1 = \frac{\mathbf{r}_1^\mathsf{T} \mathbf{r}_1}{\mathbf{p}_1^\mathsf{T} \mathbf{A p}_1} \approx \frac{\begin{bmatrix} -0.2810 & 0.7492 \end{bmatrix} \begin{bmatrix} -0.2810 \\ 0.7492 \end{bmatrix}}{ \begin{bmatrix} -0.3511 & 0.7229 \end{bmatrix} \begin{bmatrix} 4 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} -0.3511 \\ 0.7229 \end{bmatrix} } = 0.4122.

Finally, we find \mathbf{x}_2 using the same method as that used to find \mathbf{x}_1:

:\mathbf{x}_2 = \mathbf{x}_1 + \alpha_1 \mathbf{p}_1 \approx \begin{bmatrix} 0.2356 \\ 0.3384 \end{bmatrix} + 0.4122 \begin{bmatrix} -0.3511 \\ 0.7229 \end{bmatrix} = \begin{bmatrix} 0.0909 \\ 0.6364 \end{bmatrix}.

The result,
\mathbf{x}_2, is a "better" approximation to the system's solution than \mathbf{x}_1 and \mathbf{x}_0. If exact arithmetic were used in this example instead of limited-precision arithmetic, the exact solution would theoretically have been reached after n = 2 iterations (n being the order of the system).
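The two iterations above can be reproduced mechanically. The following Julia sketch (a standalone script; names are chosen for readability) prints the same \mathbf{x}_1 and \mathbf{x}_2 up to rounding.

<syntaxhighlight lang="julia">
using LinearAlgebra

# Reproduce the two hand-computed CG iterations for this example.
function two_cg_steps()
    A = [4.0 1.0; 1.0 3.0]
    b = [1.0, 2.0]
    x = [2.0, 1.0]                       # initial guess x0
    r = b - A * x                        # r0 = [-8, -3]
    p = copy(r)                          # p0 = r0
    for k in 0:1
        Ap = A * p
        α = dot(r, r) / dot(p, Ap)       # α0 = 73/331 ≈ 0.2205, α1 ≈ 0.4122
        x += α * p
        rnew = r - α * Ap
        β = dot(rnew, rnew) / dot(r, r)  # β0 ≈ 0.0088
        p = rnew + β * p
        r = rnew
        println("x", k + 1, " = ", x)    # x1 ≈ [0.2356, 0.3384], x2 ≈ [0.0909, 0.6364]
    end
end

two_cg_steps()
</syntaxhighlight>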
==Finite Termination Property==