In 2015, Robert M. Gower and Peter Richtarik developed a versatile randomized iterative method for solving a consistent system of linear equations Ax = b which includes the randomized Kaczmarz algorithm as a special case. Other special cases include randomized coordinate descent, randomized Gaussian descent and the randomized Newton method. Block versions, and versions with importance sampling, of all these methods also arise as special cases. The method is shown to enjoy exponential error decay (in expectation), also known as linear convergence, under very mild conditions on the way randomness enters the algorithm. The Gower-Richtarik method was the first to uncover a "sibling" relationship between these methods, some of which had been proposed independently before, while many of which were new.
=== Insights about Randomized Kaczmarz ===
Interesting new insights about the randomized Kaczmarz method that can be gained from the analysis of the Gower-Richtarik method include:
• The general rate of the Gower-Richtarik algorithm precisely recovers the rate of the randomized Kaczmarz method in the special case when it reduces to it.
• The choice of probabilities for which the randomized Kaczmarz algorithm was originally formulated and analyzed (probabilities proportional to the squares of the row norms) is not optimal. Optimal probabilities are the solution of a certain semidefinite program. The theoretical complexity of randomized Kaczmarz with the optimal probabilities can be arbitrarily better than the complexity for the standard probabilities; however, the amount by which it is better depends on the matrix A. There are problems for which the standard probabilities are optimal.
• When applied to a system with a positive definite matrix A, the randomized Kaczmarz method is equivalent to the stochastic gradient descent (SGD) method (with a very special stepsize) for minimizing the strongly convex quadratic function f(x) = \tfrac{1}{2}x^T A x - b^T x. Note that since f is convex, the minimizers of f must satisfy \nabla f(x) = 0, which is equivalent to Ax = b. The "special stepsize" is the stepsize which, along the one-dimensional line spanned by the stochastic gradient, leads to the point minimizing the Euclidean distance from the unknown(!) minimizer of f, namely, from x^* = A^{-1}b. This insight is gained from a dual view of the iterative process (described below as "Optimization Viewpoint: Constrain and Approximate"); a minimal numerical sketch of this stepsize property follows below.
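To make the stepsize remark concrete, here is a minimal sketch in Python with NumPy (the test matrix, sampling probabilities, and starting point are illustrative assumptions, not part of the original analysis). It checks that the randomized Kaczmarz step coincides with the step along the sampled row direction whose length minimizes the Euclidean distance to x^* = A^{-1}b:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative positive definite system (an assumption for this demo).
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)              # symmetric positive definite
x_star = rng.standard_normal(n)          # the (pretend-unknown) minimizer of f
b = A @ x_star                           # so grad f(x) = Ax - b vanishes at x_star

x = rng.standard_normal(n)               # current iterate
p = np.sum(A**2, axis=1) / np.sum(A**2)  # standard probabilities ||a_i||^2 / ||A||_F^2
i = rng.choice(n, p=p)
a_i = A[i]                               # sampled row; the stochastic gradient
                                         # direction here is proportional to a_i

# Special stepsize: minimize ||x - t*a_i - x_star|| over t. Although x_star is
# unknown, t = a_i^T(x - x_star)/||a_i||^2 = (a_i^T x - b_i)/||a_i||^2 is
# computable, because a_i^T x_star = b_i.
t = (a_i @ x - b[i]) / (a_i @ a_i)
x_sgd = x - t * a_i

# One randomized Kaczmarz step: project x onto the hyperplane a_i^T x = b_i.
x_rk = x - (a_i @ x - b[i]) / (a_i @ a_i) * a_i

print(np.allclose(x_sgd, x_rk))          # True: the two updates coincide
```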
=== Six Equivalent Formulations ===
The Gower-Richtarik method enjoys six seemingly different but equivalent formulations, shedding additional light on how to interpret it (and, as a consequence, how to interpret its many variants, including randomized Kaczmarz):
• 1. Sketching viewpoint: Sketch & Project
• 2. Optimization viewpoint: Constrain and Approximate
• 3. Geometric viewpoint: Random Intersect
• 4. Algebraic viewpoint 1: Random Linear Solve
• 5. Algebraic viewpoint 2: Random Update
• 6. Analytic viewpoint: Random Fixed Point
We now describe some of these viewpoints. The method depends on two parameters:
• a positive definite matrix B giving rise to a weighted Euclidean inner product \langle x,y \rangle_B := x^T B y and the induced norm
: \|x\|_B = \left (\langle x,x \rangle_B \right )^{\frac{1}{2}},
• and a random matrix S with as many rows as A (and a possibly random number of columns).
=== 1. Sketch and Project ===
Given the previous iterate x^k, the new point x^{k+1} is computed by drawing a random matrix S (in an iid fashion from some fixed distribution), and setting
: x^{k+1} = \underset{x}{\operatorname{arg\,min}} \left \| x - x^k \right \|_B \quad \text{subject to} \quad S^T A x = S^T b.
That is, x^{k+1} is obtained as the projection of x^k onto the randomly sketched system S^T A x = S^T b. The idea behind this method is to pick S in such a way that a projection onto the sketched system is substantially simpler than the solution of the original system Ax = b. The randomized Kaczmarz method is obtained by picking B to be the identity matrix, and S to be the i^{th} unit coordinate vector with probability p_i = \|a_i\|^2_2/\|A\|_F^2. Different choices of B and S lead to different variants of the method. A minimal implementation sketch is given below.
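For illustration, here is a minimal sketch in Python with NumPy of the sketch-and-project step for B = I (the Gaussian sketching matrix, problem sizes, and iteration count are assumptions chosen for the demo). In the Euclidean case the projection onto the sketched system reduces to a minimum-norm least-squares correction:

```python
import numpy as np

def sketch_and_project(x, A, b, S):
    """One sketch-and-project step with B = I: project x onto
    the sketched system S^T A x = S^T b in the Euclidean norm."""
    C = S.T @ A                       # sketched matrix
    r = S.T @ b - C @ x               # sketched residual
    # Minimum-norm solution of the consistent system C h = r
    # gives exactly the projection step h.
    h = np.linalg.lstsq(C, r, rcond=None)[0]
    return x + h

# Illustrative demo on a small consistent system (all data are assumptions).
rng = np.random.default_rng(1)
m, n = 30, 10
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star                        # consistent right-hand side

x = np.zeros(n)
for _ in range(300):
    S = rng.standard_normal((m, 3))   # a Gaussian sketch with 3 columns
    x = sketch_and_project(x, A, b, S)
print(np.linalg.norm(x - x_star))     # error should be near zero
```

With S instead drawn as a random unit coordinate vector, the same step reduces to the classical randomized Kaczmarz projection onto a single row.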
=== 2. Constrain and Approximate ===
A seemingly different but entirely equivalent formulation of the method (obtained via Lagrangian duality) is
: x^{k+1} = \underset{x}{\operatorname{arg\,min}} \left \|x - x^* \right \|_B \quad \text{subject to} \quad x = x^k + B^{-1}A^T S y,
where y is also allowed to vary, and where x^* is any solution of the system Ax = b. Hence, x^{k+1} is obtained by first constraining the update to the linear subspace spanned by the columns of the random matrix B^{-1}A^T S, i.e., to
: \left \{ h : h = B^{-1} A^T S y, \quad y \text{ can vary} \right \},
and then choosing the point x from this subspace which best approximates x^*. This formulation may look surprising, as it seems impossible to perform the approximation step: x^* is not known (after all, this is what we are trying to compute!). However, it is still possible to do this, simply because x^{k+1} computed this way is the same as x^{k+1} computed via the sketch-and-project formulation, in which x^* does not appear. This equivalence is verified numerically in the sketch below.
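To illustrate that the two formulations coincide, the following sketch in Python with NumPy (B = I; the random data are illustrative assumptions) computes one step both ways: once by projecting onto the sketched system, and once by best-approximating a known solution x^* within the subspace spanned by the columns of A^T S:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 30, 10
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star                        # consistent system, so x_star solves it
x_k = rng.standard_normal(n)          # current iterate
S = rng.standard_normal((m, 4))       # a random sketch

# Sketch and project: project x_k onto S^T A x = S^T b (B = I).
C = S.T @ A
h = np.linalg.lstsq(C, S.T @ b - C @ x_k, rcond=None)[0]
x_sp = x_k + h

# Constrain and approximate: within {x_k + A^T S y}, best-approximate x_star.
W = A.T @ S
y = np.linalg.lstsq(W, x_star - x_k, rcond=None)[0]
x_ca = x_k + W @ y

print(np.allclose(x_sp, x_ca))        # True: the formulations agree
```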
=== 5. Random Update ===
The update can also be written explicitly as
: x^{k+1} = x^k - B^{-1}A^T S \left (S^T A B^{-1}A^T S \right )^{\dagger} S^T \left (Ax^k - b \right ),
where M^\dagger denotes the Moore-Penrose pseudo-inverse of the matrix M. Hence, the method can be written in the form x^{k+1} = x^k + h^k, where h^k is a random update vector; a sketch of this update appears below. Letting M = S^T A B^{-1}A^T S, it can be shown that the system M y = S^T (Ax^k - b) always has a solution y^k, and that for all such solutions the vector x^k - B^{-1} A^T S y^k is the same. Hence, it does not matter which of these solutions is chosen, and the method can also be written as x^{k+1} = x^k - B^{-1}A^T S y^k. The pseudo-inverse leads to just one particular solution. The role of the pseudo-inverse is twofold:
• It allows the method to be written in the explicit "random update" form above,
• It makes the analysis simple through the final, sixth, formulation.
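A direct transcription of the explicit formula, as a sketch in Python with NumPy (the choice of B, the sampled S, and the test data are illustrative assumptions). As a sanity check, the updated point satisfies the sketched system:

```python
import numpy as np

def random_update_step(x, A, b, S, B_inv):
    """Explicit random update:
    x+ = x - B^{-1} A^T S (S^T A B^{-1} A^T S)^+ S^T (A x - b)."""
    AtS = A.T @ S
    M_pinv = np.linalg.pinv(S.T @ A @ B_inv @ AtS)   # (S^T A B^{-1} A^T S)^+
    y = M_pinv @ (S.T @ (A @ x - b))                 # one solution of M y = S^T(Ax - b)
    return x - B_inv @ (AtS @ y)

# Illustrative demo with a non-trivial positive definite B (an assumption).
rng = np.random.default_rng(3)
m, n = 30, 10
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)                       # consistent system
G = rng.standard_normal((n, n))
B_inv = np.linalg.inv(G @ G.T + n * np.eye(n))       # B positive definite

x = rng.standard_normal(n)
S = rng.standard_normal((m, 4))
x_new = random_update_step(x, A, b, S, B_inv)
print(np.allclose(S.T @ A @ x_new, S.T @ b))         # True: x_new solves the sketched system
```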
=== 6. Random Fixed Point ===
If we subtract x^* from both sides of the random update formula, denote
: Z := A^T S \left (S^T A B^{-1} A^T S \right )^\dagger S^T A,
and use the fact that Ax^* = b, we arrive at the last formulation:
: x^{k+1} - x^* = \left (I - B^{-1}Z \right ) \left (x^k - x^* \right ),
where I is the identity matrix. The iteration matrix, I - B^{-1}Z, is random, whence the name of this formulation. This recursion can be verified numerically, as in the sketch below.
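A self-contained check of the fixed-point recursion in Python with NumPy (B = I and the random data are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 20, 8
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star                                        # so A x_star = b
B_inv = np.eye(n)                                     # B = I for simplicity
S = rng.standard_normal((m, 3))

AtS = A.T @ S
M_pinv = np.linalg.pinv(S.T @ A @ B_inv @ AtS)
Z = AtS @ M_pinv @ AtS.T                              # Z = A^T S (S^T A B^{-1} A^T S)^+ S^T A

x = rng.standard_normal(n)                            # current iterate x^k
x_next = x - B_inv @ (AtS @ (M_pinv @ (S.T @ (A @ x - b))))

# The error propagates through the random iteration matrix I - B^{-1} Z:
print(np.allclose(x_next - x_star, (np.eye(n) - B_inv @ Z) @ (x - x_star)))  # True
```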
=== Convergence ===
By taking conditional expectations in the 6th formulation (conditional on x^k), we obtain
: \mathbb{E} \left [x^{k+1}-x^* \mid x^k \right ] = \left (I - B^{-1}\mathbb{E}[Z] \right ) \left (x^k - x^* \right ).
By taking expectations again, and using the tower property of expectations, we obtain
: \mathbb{E} \left [x^{k+1}-x^* \right ] = \left (I - B^{-1}\mathbb{E}[Z] \right ) \mathbb{E}\left [x^k - x^* \right ].
Unrolling this recursion yields \left \|\mathbb{E}[x^k - x^*] \right \|_B \leq \rho^k \left \|x^0 - x^* \right \|_B, where \rho := \left \|I - B^{-1}\mathbb{E}[Z] \right \|_B. It is also possible to show a stronger result:
Theorem [Gower & Richtarik 2015]. The expected squared norms (rather than the norms of expectations) converge at the same rate:
: \mathbb{E} \left [ \left \|x^{k}-x^* \right \|^2_B \right ] \leq \rho^k \left \|x^0 - x^* \right \|^2_B.
Remark. This second type of convergence is stronger, due to the following identity, which holds for any random vector x and any fixed vector x^*:
: \left\|\mathbb{E} \left [x - x^* \right ] \right \|^2 = \mathbb{E}\left [ \left \|x-x^* \right \|^2 \right ] - \mathbb{E} \left [\|x-\mathbb{E}[x]\|^2 \right ].
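Indeed, writing x - x^* = (x - \mathbb{E}[x]) + (\mathbb{E}[x] - x^*) and taking expectations of the squared norm, the cross term vanishes, giving the bias-variance decomposition
: \mathbb{E}\left [ \left \|x-x^* \right \|^2 \right ] = \mathbb{E} \left [\|x-\mathbb{E}[x]\|^2 \right ] + \left \|\mathbb{E}\left [x - x^* \right ] \right \|^2,
which rearranges to the identity above. Since the variance term is nonnegative, convergence of the expected squared norm implies convergence of the squared norm of the expectation.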
=== Convergence of Randomized Kaczmarz ===
We have seen that the randomized Kaczmarz method appears as a special case of the Gower-Richtarik method for B = I and S being the i^{th} unit coordinate vector with probability p_i = \|a_i\|_2^2/\|A\|_F^2, where a_i is the i^{th} row of A. In this case \mathbb{E}[Z] = \sum_i p_i \frac{a_i a_i^T}{\|a_i\|_2^2} = \frac{A^T A}{\|A\|_F^2}, and it can be checked by direct calculation that
: \rho = \|I-B^{-1}\mathbb{E}[Z]\|_B = 1 - \frac{\lambda_{\min}(A^T A)}{\|A\|_F^2}.
This rate can also be compared with the observed behaviour of the method, as in the sketch below.
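Below is a minimal sketch in Python with NumPy (the Gaussian test matrix and iteration count are assumptions for the demo) comparing the theoretical \rho with the observed per-step decay of the squared error under randomized Kaczmarz:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 30, 10
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star

fro2 = np.sum(A**2)                               # ||A||_F^2
p = np.sum(A**2, axis=1) / fro2                   # p_i = ||a_i||^2 / ||A||_F^2

# E[Z] = A^T A / ||A||_F^2, so rho = 1 - lambda_min(A^T A) / ||A||_F^2.
rho = 1 - np.linalg.eigvalsh(A.T @ A)[0] / fro2

# Run randomized Kaczmarz and track the squared error.
x = np.zeros(n)
errs = [np.linalg.norm(x - x_star)**2]
for _ in range(2000):
    i = rng.choice(m, p=p)
    a = A[i]
    x = x - (a @ x - b[i]) / (a @ a) * a          # project onto a_i^T x = b_i
    errs.append(np.linalg.norm(x - x_star)**2)

print("theoretical rho:", rho)
# The observed geometric-mean decay should not exceed rho (up to randomness).
print("observed per-step decay:", (errs[-1] / errs[0]) ** (1 / 2000))
```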
=== Further Special Cases ===

== Algorithm 4: PLSS-Kaczmarz ==