
Matrix completion

Matrix completion is the task of filling in the missing entries of a partially observed matrix, which is equivalent to performing data imputation in statistics. A wide range of datasets are naturally organized in matrix form. One example is the movie-ratings matrix, as appears in the Netflix problem: given a ratings matrix in which the entry (i, j) represents the rating of movie j by customer i if customer i has watched movie j, and is otherwise missing, we would like to predict the remaining entries in order to make good recommendations to customers on what to watch next. Another example is the document-term matrix: the frequencies of words used in a collection of documents can be represented as a matrix, where each entry corresponds to the number of times the associated term appears in the indicated document.
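For concreteness, a toy ratings matrix of the kind described above can be set up in NumPy, with `np.nan` marking the missing entries to be completed (all rating values here are hypothetical):

```python
import numpy as np

# Hypothetical 4-customer x 5-movie ratings matrix; np.nan marks
# entries the customer has not rated (the values to be completed).
ratings = np.array([
    [5.0, np.nan, 3.0, np.nan, 1.0],
    [4.0, np.nan, np.nan, 1.0, np.nan],
    [np.nan, 2.0, np.nan, 5.0, np.nan],
    [1.0, np.nan, 4.0, np.nan, 5.0],
])

observed = ~np.isnan(ratings)        # Boolean mask of the observed set
fraction_observed = observed.mean()  # fraction of known entries
print(f"{observed.sum()} of {ratings.size} entries observed "
      f"({fraction_observed:.0%})")
```

The Boolean mask `observed` plays the role of the set of observed entries used throughout the formulations below.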

Low rank matrix completion
One of the variants of the matrix completion problem is to find the lowest rank matrix X which matches the matrix M, which we wish to recover, for all entries in the set E of observed entries. The mathematical formulation of this problem is as follows:

:\begin{align} & \underset{X}{\text{min}} & \text{rank} (X) \\ & \text{subject to} & X_{ij} = M_{ij} & \;\; \forall i,j \in E \end{align}

Candès and Recht proved that, with assumptions on the sampling of the observed entries and sufficiently many sampled entries, this problem has a unique solution with high probability. An equivalent formulation, given that the matrix M to be recovered is known to be of rank r, is to find any matrix X of rank r satisfying X_{ij} = M_{ij} \;\; \forall i,j \in E.

Assumptions

A number of assumptions on the sampling of the observed entries and the number of sampled entries are frequently made to simplify the analysis and to ensure the problem is not underdetermined.

Uniform sampling of observed entries

To make the analysis tractable, it is often assumed that the set E of observed entries, of fixed cardinality |E|, is sampled uniformly at random from the collection of all subsets of entries of cardinality |E|. To further simplify the analysis, it is instead assumed that E is constructed by Bernoulli sampling, i.e. that each entry is observed independently with probability p. If p is set to \frac{N}{mn}, where N is the desired expected cardinality of E and m, \; n are the dimensions of the matrix (with m \leq n without loss of generality), then |E| is within O(n \log n) of N with high probability, so Bernoulli sampling is a good approximation for uniform sampling. Another simplification is to assume that entries are sampled independently and with replacement.

Lower bound on number of observed entries

Suppose the m by n matrix M (with m \leq n) we are trying to recover has rank r. There is an information-theoretic lower bound on how many entries must be observed before M can be uniquely reconstructed.
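The Bernoulli-sampling approximation is easy to check numerically. A minimal sketch (the dimensions and target N are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 200, 300          # matrix dimensions (m <= n)
N = 5_000                # desired expected number of observed entries
p = N / (m * n)          # Bernoulli observation probability

# Each entry is observed independently with probability p.
observed = rng.random((m, n)) < p
E_size = int(observed.sum())

# |E| concentrates around N, comfortably inside the O(n log n)
# window quoted in the text.
print(f"|E| = {E_size}, target N = {N}")
```

Running this repeatedly shows |E| fluctuating around N on the scale of \sqrt{N}, which is why Bernoulli sampling stands in for uniform sampling in the analysis.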
The set of m by n matrices with rank less than or equal to r is an algebraic variety in \mathbb{C}^{m \times n} with dimension (n+m)r - r^2. Using this result, one can show that at least 4nr - 4r^2 entries must be observed for matrix completion in \mathbb{C}^{n \times n} to have a unique solution when r \leq n/2.

Secondly, there must be at least one observed entry per row and column of M. The singular value decomposition of M is given by U \Sigma V^\dagger. If column i is unobserved, it is easy to see that the i^{\text{th}} right singular vector of M, v_i, can be changed to some arbitrary value and still yield a matrix matching M over the set of observed entries. Similarly, if row j is unobserved, the j^{\text{th}} left singular vector of M, u_j, can be arbitrary. If we assume Bernoulli sampling of the set of observed entries, the coupon collector effect implies that on the order of O(n \log n) entries must be observed to ensure that there is an observation from each row and column with high probability.

Combining these necessary conditions, and assuming that r \ll m, n (a valid assumption for many practical applications), the lower bound on the number of observed entries required to prevent the problem of matrix completion from being underdetermined is on the order of nr \log n.

Incoherence

The concept of incoherence arose in compressed sensing. It is introduced in the context of matrix completion to ensure the singular vectors of M are not too "sparse", in the sense that all coordinates of each singular vector are of comparable magnitude instead of just a few coordinates having significantly larger magnitudes. The standard basis vectors are then undesirable as singular vectors, whereas the vector \frac{1}{\sqrt{n}} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} in \mathbb{R}^n is desirable.
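The counting arguments above are plain arithmetic; plugging in example sizes (arbitrary illustrative choices) makes the gap between the degrees of freedom and the log-factor bound concrete:

```python
import math

# Illustrative counts for an n x n matrix of rank r, matching the
# bounds quoted in the text (numbers are arbitrary example sizes).
n = m = 1000
r = 10

variety_dim = (n + m) * r - r ** 2       # dimension of the rank-<=r variety
unique_bound = 4 * n * r - 4 * r ** 2    # entries needed for a unique solution
info_bound = n * r * math.log(n)         # order-nr*log(n) combined bound

print(variety_dim, unique_bound, round(info_bound))
```

For these sizes the variety dimension is 19,900 and the uniqueness bound 39,600, both far below the million entries of the full matrix, which is what makes completion from few samples plausible at all.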
As an example of what can go wrong if the singular vectors are sufficiently "sparse", consider the m by n matrix \begin{bmatrix} 1 & 0 & \cdots & 0 \\ \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}, with singular value decomposition I_m \begin{bmatrix} 1 & 0 & \cdots & 0 \\ \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} I_n. Almost all of its entries must be sampled before it can be reconstructed.

Candès and Recht define the coherence of a matrix U whose column space is an r-dimensional subspace of \mathbb{R}^n as \mu(U) = \frac{n}{r} \max_{1 \leq i \leq n} \|P_U e_i\|^2, where P_U is the orthogonal projection onto the column space of U and e_i is the i^{\text{th}} standard basis vector. Incoherence then asserts that, given the singular value decomposition U \Sigma V^\dagger of the m by n matrix M,

• \mu(U), \; \mu(V) \leq \mu_0
• The entries of \sum_k u_k v_k^\dagger have magnitudes upper bounded by \mu_1 \sqrt{\frac{r}{mn}}

for some \mu_0, \; \mu_1.
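The coherence \mu(U) is straightforward to compute when U has orthonormal columns, since then \|P_U e_i\|^2 is just the squared norm of the i-th row of U. A small sketch contrasting the two extreme cases mentioned above:

```python
import numpy as np

def coherence(U):
    """mu(U) = (n / r) * max_i ||P_U e_i||^2 for an n x r matrix U
    whose columns form an orthonormal basis of the subspace.
    With orthonormal columns, P_U = U U^T, so ||P_U e_i||^2 is the
    squared norm of the i-th row of U."""
    n, r = U.shape
    row_norms_sq = np.sum(U ** 2, axis=1)
    return (n / r) * row_norms_sq.max()

n = 8
# Maximally incoherent: the flat vector (1/sqrt(n)) * (1, ..., 1).
flat = np.full((n, 1), 1.0 / np.sqrt(n))
# Maximally coherent: a standard basis vector.
spiky = np.zeros((n, 1)); spiky[0, 0] = 1.0

print(coherence(flat), coherence(spiky))
```

The flat vector attains the minimum possible coherence 1, while the standard basis vector attains the maximum n, matching the text's description of desirable and undesirable singular vectors.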
Low rank matrix completion with noise
In real-world applications, one often observes only a few entries, corrupted at least by a small amount of noise. For example, in the Netflix problem, the ratings are uncertain. Candès and Plan showed that it is possible to fill in the many missing entries of large low-rank matrices from just a few noisy samples by nuclear norm minimization. The noisy model assumes that we observe

: Y_{ij} = M_{ij} + Z_{ij}, \;\; (i,j) \in \Omega,

where \{Z_{ij} : (i,j) \in \Omega\} is a noise term. Note that the noise can be either stochastic or deterministic. Alternatively the model can be expressed as

: P_\Omega(Y) = P_\Omega(M) + P_\Omega(Z),

where Z is an n \times n matrix with entries Z_{ij} for (i,j) \in \Omega, assuming that \|P_\Omega(Z)\|_F \leq \delta for some \delta > 0. To recover the incomplete matrix, we try to solve the following optimization problem:

: \begin{align} & \underset{X}{\text{min}} & \|X\|_* \\ & \text{subject to} & \|P_\Omega(X-Y)\|_F \leq \delta \end{align}

That is, among all matrices consistent with the data, find the one with minimum nuclear norm. Candès and Plan have shown that this reconstruction is accurate: they proved that when perfect noiseless recovery occurs, matrix completion is stable with respect to perturbations, with error proportional to the noise level \delta. Therefore, when the noise level is small, the error is small. Here the matrix completion problem does not obey the restricted isometry property (RIP). For matrices, the RIP would assume that the sampling operator obeys

: (1-\delta)\|X\|^2_F \leq \frac{1}{p}\|P_\Omega(X)\|^2_F \leq (1+\delta)\|X\|^2_F

for all matrices X with sufficiently small rank and \delta sufficiently small. The methods are also applicable to sparse signal recovery problems in which the RIP does not hold.
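A simple way to approximate this kind of nuclear-norm-driven recovery in practice is the soft-impute / singular value thresholding heuristic: alternate between filling the missing entries with the current estimate and soft-thresholding the singular values. The sketch below is an illustration of that heuristic, not Candès and Plan's exact constrained program, and all sizes and parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

def complete_noisy(Y, mask, tau=0.5, n_iters=300):
    """Soft-impute sketch: each iteration imputes the unobserved
    entries from the current estimate, then applies singular value
    thresholding to pull the estimate toward low rank."""
    X = np.zeros_like(Y)
    for _ in range(n_iters):
        filled = np.where(mask, Y, X)          # keep observed, impute rest
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - tau, 0.0)           # soft-threshold singular values
        X = (U * s) @ Vt
    return X

m, n, r = 40, 40, 2
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))    # rank-2 truth
mask = rng.random((m, n)) < 0.6                # observe ~60% of entries
Y = np.where(mask, M + 0.01 * rng.standard_normal((m, n)), 0.0)  # noisy samples

X_hat = complete_noisy(Y, mask)
rel_err = np.linalg.norm(X_hat - M) / np.linalg.norm(M)
print(f"relative error: {rel_err:.3f}")
```

Consistent with the stability result above, the reconstruction error here tracks the (small) noise level rather than blowing up.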
High-rank matrix completion
The high-rank matrix completion problem is NP-hard in general. However, with certain assumptions, some incomplete high-rank or even full-rank matrices can be completed. Eriksson, Balzano and Nowak have considered the problem of completing a matrix with the assumption that the columns of the matrix belong to a union of multiple low-rank subspaces. Since the columns belong to a union of subspaces, the problem may be viewed as a missing-data version of the subspace clustering problem. Let X be an n \times N matrix whose (complete) columns lie in a union of at most k subspaces, each of \operatorname{rank} \leq r, and assume N \gg kn. Eriksson, Balzano and Nowak showed that under mild assumptions each column of X can be perfectly recovered with high probability from an incomplete version so long as at least CrN\log^2(n) entries of X are observed uniformly at random, with C > 1 a constant depending on the usual incoherence conditions, the geometrical arrangement of subspaces, and the distribution of columns over the subspaces. The algorithm involves several steps: (1) local neighborhoods; (2) local subspaces; (3) subspace refinement; (4) full matrix completion. This method can be applied to Internet distance matrix completion and topology identification.
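The union-of-subspaces model is easy to simulate: the full matrix can have rank as high as kr (so plain low-rank completion would fail), yet each column cluster is individually low rank. A sketch with arbitrary illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(2)

n, k, r, cols_per = 30, 4, 2, 50   # ambient dim, subspaces, rank, columns each

# Draw columns from a union of k independent rank-r subspaces.
blocks = []
for _ in range(k):
    basis = rng.standard_normal((n, r))      # basis of one subspace
    coeffs = rng.standard_normal((r, cols_per))
    blocks.append(basis @ coeffs)            # columns inside that subspace
X = np.hstack(blocks)                        # n x (k * cols_per) matrix

# The whole matrix is "high rank" (generically k*r), while each
# cluster of columns is only rank r.
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(blocks[0]))
```

This is exactly the structure that makes the problem a missing-data version of subspace clustering: recovery hinges on first identifying which columns share a subspace.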
Algorithms for low-rank matrix completion
Various matrix completion algorithms have been proposed, including convex relaxation-based and discrete-aware-based algorithms.

Convex relaxation

The rank minimization problem is NP-hard. One approach, proposed by Candès and Recht, is to form a convex relaxation of the problem and minimize the nuclear norm \|M\|_* (which gives the sum of the singular values of M) instead of \text{rank}(M) (which counts the number of non-zero singular values of M).

Discrete-aware matrix completion

When the entries are known to take values in a finite alphabet (as with star ratings), a discrete-aware approach replaces the \ell_1-norm with a continuous and differentiable approximation of the \ell_0-norm, making the problem more tractable and improving performance. The discrete-aware matrix completion problem can be formulated as:

:\underset{\boldsymbol{X} \in \mathbb{R}^{m \times n}}{\arg \min} \, f(\boldsymbol{X}) + \lambda g(\boldsymbol{X}) + \zeta r(\boldsymbol{X} \mid 0),

where:
• f(\boldsymbol{X}) = \frac{1}{2} \left\| P_{\Omega}(\boldsymbol{X} - \boldsymbol{O}) \right\|_F^2 ensures fidelity to the observed entries, with P_{\Omega} the projection onto the observed set \Omega and \boldsymbol{O} the observed matrix.
• g(\boldsymbol{X}) = \|\boldsymbol{X}\|_* is the nuclear norm, which enforces low-rank structure.
• r(\boldsymbol{X} \mid 0) = \sum_{k=1}^{|\mathcal{A}|} \left\| \operatorname{vec}_{\overline{\Omega}}(\boldsymbol{X}) - a_k \mathbf{1} \right\|_0 is the discrete-space regularizer, with \mathcal{A} the discrete alphabet (e.g., \{1, 2, 3, 4, 5\}) and \overline{\Omega} the set of unobserved entries.

To solve this non-convex problem, the \ell_0-norm is approximated by a continuous function. This approximation is convexified using fractional programming, transforming the problem into a series of convex subproblems. The algorithm iteratively updates the matrix estimate by applying proximal operations to the discrete-space regularizer and singular value thresholding to enforce the low-rank constraint. Initializing the process with the solution from the \ell_1-norm-based method can accelerate convergence.
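To give a feel for what the discrete-space regularizer is rewarding, here is a much-simplified stand-in: after a low-rank fit, snap each unobserved entry to the nearest element of the alphabet, leaving observed entries untouched. The real algorithm applies a proximal operator of a smoothed \ell_0 penalty inside its iterations rather than a single final rounding; this sketch only illustrates the alphabet-snapping idea.

```python
import numpy as np

alphabet = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # discrete alphabet A

def snap_unobserved(X, mask):
    """Replace each unobserved entry (mask == False) of X with the
    nearest element of the alphabet; observed entries are kept."""
    # Shape (m, n, |A|): distance from every entry to every symbol.
    nearest = alphabet[np.argmin(np.abs(X[..., None] - alphabet), axis=-1)]
    return np.where(mask, X, nearest)

X = np.array([[4.7, 1.2],
              [3.4, 0.2]])
mask = np.array([[True, False],
                 [False, True]])
result = snap_unobserved(X, mask)
print(result)
```

Here the two unobserved entries (1.2 and 3.4) round to 1.0 and 3.0, while the observed entries 4.7 and 0.2 pass through unchanged, mirroring how the regularizer only acts on \overline{\Omega}.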
Simulation results, tested on datasets like MovieLens-100k, demonstrate that this method outperforms both its \ell_1-norm-based predecessor and other state-of-the-art techniques, particularly when the ratio of observed entries is low (e.g., 20% to 60%).
Applications
Several applications of matrix completion are summarized by Candès and Plan as follows:

Collaborative filtering

Collaborative filtering is the task of making automatic predictions about the interests of a user by collecting taste information from many users. Companies like Apple, Amazon, Barnes and Noble, and Netflix try to predict their users' preferences from partial knowledge. In this kind of matrix completion problem, the unknown full matrix is often considered low rank because only a few factors typically contribute to an individual's tastes or preferences.

System identification

In control, one would like to fit a discrete-time linear time-invariant state-space model

: \begin{align} x(t+1)&=Ax(t)+Bu(t)\\ y(t)&=Cx(t)+Du(t) \end{align}

to a sequence of inputs u(t) \in \mathbb{R}^m and outputs y(t) \in \mathbb{R}^p, t = 0, \ldots, N. The vector x(t) \in \mathbb{R}^n is the state of the system at time t and n is the order of the system model. From the input/output pairs, one would like to recover the matrices A, B, C, D and the initial state x(0). This problem can also be viewed as a low-rank matrix completion problem.

Internet of things (IoT) localization

The localization (or global positioning) problem emerges naturally in IoT sensor networks. The problem is to recover the sensor map in Euclidean space from a local or partial set of pairwise distances. Thus it is a matrix completion problem with rank two if the sensors are located in a 2-D plane and three if they are in a 3-D space.

Social networks recovery

Most real-world social networks have low-rank distance matrices. When we are not able to measure the complete network, which can be due to reasons such as private nodes or limited storage or compute resources, only a fraction of the distance entries is known. Criminal networks are a good example of such networks. Low-rank matrix completion can be used to recover these unobserved distances.
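For the collaborative-filtering setting, a widely used practical heuristic (an alternative to the convex programs above, sketched here rather than any specific published algorithm) is alternating least squares: model the matrix as U V^T with rank-r factors and alternately solve ridge-regularized least-squares problems for U and V over the observed entries only. All sizes and the regularization weight below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

m, n, r = 50, 60, 3
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # low-rank truth
mask = rng.random((m, n)) < 0.5          # observe ~50% of entries

U = rng.standard_normal((m, r))          # user factors
V = rng.standard_normal((n, r))          # item factors
lam = 1e-3                               # small ridge term for stability
for _ in range(50):
    for i in range(m):                   # update each row factor in turn
        cols = mask[i]
        A = V[cols].T @ V[cols] + lam * np.eye(r)
        U[i] = np.linalg.solve(A, V[cols].T @ M[i, cols])
    for j in range(n):                   # then each column factor
        rows = mask[:, j]
        A = U[rows].T @ U[rows] + lam * np.eye(r)
        V[j] = np.linalg.solve(A, U[rows].T @ M[rows, j])

rel_err = np.linalg.norm(U @ V.T - M) / np.linalg.norm(M)
print(f"relative error: {rel_err:.4f}")
```

Each inner update is a small r x r linear solve, which is why this factorization approach scales to the very large ratings matrices that motivate the collaborative-filtering application.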