As the optimization problem described above is convex with respect to either the dictionary or the sparse coding while the other of the two is fixed, most algorithms are based on iteratively updating one and then the other. The problem of finding an optimal sparse coding R with a given dictionary \mathbf{D} is known as
sparse approximation (or sometimes just the sparse coding problem). A number of algorithms have been developed to solve it (such as
matching pursuit and
LASSO) and are incorporated in the algorithms described below.
=== Method of optimal directions (MOD) ===
The method of optimal directions (or MOD) was one of the first methods introduced to tackle the sparse dictionary learning problem. Its core idea is to solve the minimization problem subject to a limited number of non-zero components of the representation vector: \min_{\mathbf{D}, R}\{\|X-\mathbf{D}R\|^2_F\} \,\, \text{s.t.}\,\, \forall i \,\,\|r_i\|_0 \leq T Here, F denotes the
Frobenius norm. MOD alternates between getting the
sparse coding using a method such as
matching pursuit and updating the dictionary by computing the analytical solution of the problem given by \mathbf{D} = XR^+ where R^+ is a
Moore-Penrose pseudoinverse. After this update \mathbf{D} is renormalized to fit the constraints, and the new sparse coding is obtained again. The process is repeated until convergence (or until the residual is sufficiently small). MOD has proved to be a very efficient method for low-dimensional input data X, requiring just a few iterations to converge. However, due to the high complexity of the matrix inversion, computing the pseudoinverse is often intractable in high-dimensional cases. This shortcoming has inspired the development of other dictionary learning methods.
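The following is a minimal sketch of one MOD iteration in Python, assuming unit-norm atoms as the constraint; the orthogonal matching pursuit solver from scikit-learn stands in for the sparse coding step, and all names and dimensions are illustrative rather than taken from any reference implementation.
<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import orthogonal_mp

def mod_iteration(X, D, sparsity):
    """One MOD iteration: sparse coding followed by the analytical update."""
    # Sparse coding step: min ||X - DR||_F^2  s.t.  ||r_i||_0 <= sparsity
    R = orthogonal_mp(D, X, n_nonzero_coefs=sparsity)
    # Dictionary update: D = X R^+ (Moore-Penrose pseudoinverse of R)
    D = X @ np.linalg.pinv(R)
    # Renormalize the atoms (columns) to satisfy the unit-norm constraint
    D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)
    return D, R

# Illustrative usage: 16-dimensional signals, 64 atoms, 3 non-zeros per code
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 500))          # columns are signals
D = rng.standard_normal((16, 64))
D /= np.linalg.norm(D, axis=0, keepdims=True)
for _ in range(10):                         # repeat until convergence
    D, R = mod_iteration(X, D, sparsity=3)
</syntaxhighlight>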
=== K-SVD ===
K-SVD is an algorithm that performs
SVD at its core to update the atoms of the dictionary one by one, and is essentially a generalization of
K-means. It enforces that each element of the input data x_i is encoded by a linear combination of not more than T_0 elements in a way identical to the MOD approach: \min_{\mathbf{D}, R}\{\|X-\mathbf{D}R\|^2_F\} \,\, \text{s.t.}\,\, \forall i \,\,\|r_i\|_0 \leq T_0 This algorithm's essence is to first fix the dictionary, find the best possible R under the above constraint (using
Orthogonal Matching Pursuit) and then iteratively update the atoms of dictionary \mathbf{D} in the following manner: \|X - \mathbf{D}R\|^2_F = \left\| X - \sum_{i = 1}^K d_i x^i_T\right\|^2_F = \left\| \left(X - \sum_{i \neq k} d_i x^i_T\right) - d_k x^k_T\right\|^2_F = \| E_k - d_k x^k_T\|^2_F, where x^i_T denotes the i-th row of R and E_k is the approximation error with the k-th atom removed. The next steps of the algorithm include
rank-1 approximation of the residual matrix E_k, updating d_k, and enforcing the sparsity of x^k_T after the update. This algorithm is considered a standard for dictionary learning and is used in a variety of applications. However, it shares with MOD the weaknesses of being efficient only for signals with relatively low dimensionality and of possibly getting stuck at local minima.
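A sketch of the K-SVD atom update under the same assumptions; restricting the SVD to the columns whose codes actually use atom k is what preserves the sparsity of x^k_T, and the helper name update_atom is illustrative.
<syntaxhighlight lang="python">
import numpy as np

def update_atom(X, D, R, k):
    """K-SVD update of atom k via a rank-1 SVD of the restricted residual."""
    # Columns of X whose codes actually use atom k (preserves sparsity of R)
    omega = np.nonzero(R[k, :])[0]
    if omega.size == 0:
        return D, R  # atom unused; leave it (or reinitialize it) unchanged
    # Residual without atom k's contribution, restricted to those columns
    E_k = X[:, omega] - D @ R[:, omega] + np.outer(D[:, k], R[k, omega])
    # Best rank-1 approximation E_k ~ s * u v^T
    U, s, Vt = np.linalg.svd(E_k, full_matrices=False)
    D[:, k] = U[:, 0]              # new unit-norm atom
    R[k, omega] = s[0] * Vt[0, :]  # new coefficients on the same support
    return D, R
</syntaxhighlight>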
=== Stochastic gradient descent ===
One can also apply the widely used stochastic gradient descent method with iterative projection to solve this problem. The idea is to update the dictionary using a first-order stochastic gradient and project it onto the constraint set \mathcal{C}. The step at the i-th iteration is described by the expression \mathbf{D}_i = \text{proj}_{\mathcal{C}} \left\{\mathbf{D}_{i-1}-\delta_i\nabla_{\mathbf{D}}\sum_{s \in S}\left(\|x_s-\mathbf{D}r_s\|_2^2+\lambda\|r_s\|_1\right) \right\}, where S is a random subset of \{1 \dots K\} and \delta_i is the gradient step size.
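A minimal sketch of this projected gradient step, assuming the constraint set \mathcal{C} consists of dictionaries with unit-norm atoms and that the codes r_s for the mini-batch were precomputed by some sparse approximation routine. Note that the \lambda\|r_s\|_1 term does not depend on \mathbf{D} and therefore does not contribute to the gradient.
<syntaxhighlight lang="python">
import numpy as np

def sgd_dict_step(D, X_batch, R_batch, step):
    """One projected stochastic gradient step on the dictionary."""
    # Gradient of sum_s ||x_s - D r_s||_2^2 with respect to D
    residual = X_batch - D @ R_batch
    grad = -2.0 * residual @ R_batch.T
    D = D - step * grad
    # Project onto C: rescale each atom to unit Euclidean norm
    norms = np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)
    return D / norms
</syntaxhighlight>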
=== Lagrange dual method ===
An algorithm based on solving a
dual Lagrangian problem provides an efficient way to solve for the dictionary without the complications induced by the sparsity function. Consider the following Lagrangian: \mathcal{L}(\mathbf{D}, \Lambda) = \text{tr}\left((X-\mathbf{D}R)^T(X-\mathbf{D}R)\right) + \sum_{j=1}^n\lambda_j \left({\sum_{i=1}^d\mathbf{D}_{ij}^2-c} \right), where c is a constraint on the norm of the atoms and \lambda_j are the so-called dual variables forming the diagonal matrix \Lambda. We can then provide an analytical expression for the Lagrange dual after minimization over \mathbf{D}: \mathcal{D}(\Lambda) = \min_{\mathbf{D}}\mathcal{L}(\mathbf{D}, \Lambda) = \text{tr}\left(X^TX-XR^T(RR^T+\Lambda)^{-1}(XR^T)^T-c\Lambda\right). After applying one of the optimization methods to the value of the dual (such as
Newton's method or
conjugate gradient) we get the value of \mathbf{D}: \mathbf{D}^T=(RR^T+\Lambda)^{-1}(XR^T)^T Solving this problem is computationally cheaper because the number of dual variables n is often much smaller than the number of variables in the primal problem.
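For concreteness, a sketch of the final dictionary recovery once the dual variables have been optimized (the Newton or conjugate gradient step itself is not shown); lam is assumed to hold the optimal diagonal of \Lambda.
<syntaxhighlight lang="python">
import numpy as np

def dictionary_from_dual(X, R, lam):
    """Recover D^T = (R R^T + Lambda)^{-1} (X R^T)^T from the dual solution."""
    Lambda = np.diag(lam)                  # lam: optimal dual variables (length n)
    A = R @ R.T + Lambda                   # n x n system matrix
    Dt = np.linalg.solve(A, (X @ R.T).T)   # solve instead of an explicit inverse
    return Dt.T
</syntaxhighlight>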
=== LASSO ===
See also: Online dictionary learning for Sparse coding
In this approach, the optimization problem is formulated as: \min_{r \in \mathbb{R}^n}\{\|r\|_1\} \,\, \text{subject to}\,\, \|X-\mathbf{D}R\|^2_F \leq \epsilon, where \epsilon is the permitted error in the reconstruction. LASSO finds an estimate of r_i by minimizing the least-squares error subject to an
L_1-norm constraint on the solution vector, formulated as: \min_{r_i \in \mathbb{R}^n} \,\, \dfrac{1}{2}\,\|x_i-\mathbf{D}r_i\|^2_2 + \lambda \,\|r_i\|_1, where \lambda > 0 controls the trade-off between sparsity and the reconstruction error. Since this problem is convex, the solution is globally optimal.
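A sketch of this sparse coding step using the Lasso solver from scikit-learn; because scikit-learn scales the squared error term by 1/(2d) for a signal of dimension d, alpha = \lambda/d reproduces the objective above.
<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import Lasso

def sparse_code(x, D, lam):
    """Solve min_r 0.5*||x - D r||_2^2 + lam*||r||_1 with scikit-learn's Lasso."""
    d = D.shape[0]
    # scikit-learn minimizes 1/(2*d)*||x - D r||^2 + alpha*||r||_1,
    # so alpha = lam / d matches the objective stated above
    lasso = Lasso(alpha=lam / d, fit_intercept=False, max_iter=10_000)
    lasso.fit(D, x)
    return lasso.coef_
</syntaxhighlight>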
=== Parametric training methods ===
Parametric training methods aim to incorporate the best of both worlds: the realm of analytically constructed dictionaries and the learned ones. This makes it possible to construct more powerful generalized dictionaries that can potentially be applied to signals of arbitrary size. Notable approaches include:
• Translation-invariant dictionaries. These dictionaries are composed of translations of the atoms originating from the dictionary constructed for a finite-size signal patch. This allows the resulting dictionary to provide a representation for a signal of arbitrary size.
• Multiscale dictionaries. This method focuses on constructing a dictionary composed of differently scaled dictionaries to improve sparsity.
• Sparse dictionaries. This method focuses not only on providing a sparse representation but also on constructing a sparse dictionary, enforced by the expression \mathbf{D} = \mathbf{B}\mathbf{A}, where \mathbf{B} is some pre-defined analytical dictionary with desirable properties such as fast computation and \mathbf{A} is a sparse matrix. This formulation directly combines the fast implementation of analytical dictionaries with the flexibility of sparse approaches.
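As a small illustration of the sparse dictionary idea, assuming an orthonormal DCT basis as the analytical dictionary \mathbf{B}; the point is that \mathbf{D}r can be evaluated as \mathbf{B}(\mathbf{A}r) with a fast transform, so the dense product \mathbf{B}\mathbf{A} never needs to be formed. The dimensions and density are illustrative.
<syntaxhighlight lang="python">
import numpy as np
from scipy.fft import idct
from scipy.sparse import random as sparse_random

d, n = 64, 128
# A: sparse matrix mixing DCT atoms into dictionary atoms (5% non-zeros here)
A = sparse_random(d, n, density=0.05, random_state=0, format="csc")

def apply_dictionary(r):
    """Compute D r = B (A r), where B is the orthonormal inverse-DCT basis."""
    return idct(A @ r, norm="ortho")
</syntaxhighlight>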
=== Online dictionary learning (LASSO approach) ===
Many common approaches to sparse dictionary learning rely on the fact that the whole input data X (or at least a large enough training dataset) is available to the algorithm. However, this might not be the case in real-world scenarios, where the input data may be too big to fit into memory. The other case where this assumption cannot be made is when the input data arrives in the form of a
stream. Such cases lie in the field of study of
online learning, which essentially suggests iteratively updating the model as new data points x become available. A dictionary can be learned in an online manner as follows (a code sketch follows the list):
• For t = 1 \dots T:
• Draw a new sample x_t
• Find a sparse coding using
LARS: r_t = \underset{r \in \mathbb{R}^n}{\text{argmin}}\left(\frac{1}{2}\|x_t-\mathbf{D}_{t-1}r\|_2^2+\lambda\|r\|_1\right)
• Update the dictionary using a
block-coordinate approach: \mathbf{D}_t = \underset{\mathbf{D} \in \mathcal{C}}{\text{argmin}}\frac{1}{t}\sum_{i=1}^t\left(\frac{1}{2}\|x_i-\mathbf{D}r_i\|^2_2+\lambda\|r_i\|_1\right)
This method allows us to gradually update the dictionary as new data becomes available for sparse representation learning, and helps drastically reduce the amount of memory needed to store the dataset (which is often huge).
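A compact sketch of this online scheme under the common accumulated-statistics formulation, where A_t = \sum_i r_ir_i^T and B_t = \sum_i x_ir_i^T summarize all past samples so the block-coordinate update never revisits old data; scikit-learn's Lasso solver stands in for LARS here, and sample_stream is an assumed iterator over incoming signals.
<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import Lasso

def online_dictionary_learning(sample_stream, D, lam, n_steps):
    """Online dictionary learning: sparse-code one sample, then update atoms."""
    n = D.shape[1]
    A = np.zeros((n, n))   # accumulates r_t r_t^T
    B = np.zeros_like(D)   # accumulates x_t r_t^T
    for _ in range(n_steps):
        x = next(sample_stream)
        # Sparse coding of the new sample (Lasso used in place of LARS)
        lasso = Lasso(alpha=lam / D.shape[0], fit_intercept=False)
        r = lasso.fit(D, x).coef_
        A += np.outer(r, r)
        B += np.outer(x, r)
        # Block-coordinate dictionary update, one atom at a time
        for k in range(n):
            if A[k, k] > 1e-12:
                u = D[:, k] + (B[:, k] - D @ A[:, k]) / A[k, k]
                D[:, k] = u / max(np.linalg.norm(u), 1.0)
    return D
</syntaxhighlight>
== Applications ==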