The core idea behind random projection is given in the Johnson-Lindenstrauss lemma, which states that if points in a vector space are of sufficiently high dimension, then they may be projected into a suitable lower-dimensional space in a way which approximately preserves pairwise distances between the points with high probability. In random projection, the original d-dimensional data is projected to a k-dimensional subspace by multiplying on the left by a random matrix R \in \mathbb{R}^{k \times d}. Using matrix notation: if X_{d \times N} is the original set of N d-dimensional observations, then X_{k \times N}^{RP}=R_{k \times d}X_{d \times N} is the projection of the data onto a lower k-dimensional subspace. Random projection is computationally simple: form the random matrix R and project the d \times N data matrix X onto k dimensions, which takes time of order O(dkN). If the data matrix X is sparse with about c nonzero entries per column, then the complexity of this operation is of order O(ckN).
==Orthogonal random projection==
A unit vector can be orthogonally projected onto a random subspace. Let u be the original unit vector, and let v be its projection. The squared norm \|v\|_2^2 has the same distribution as the squared norm of the first k coordinates of a point sampled uniformly from the unit sphere. Such a point can be obtained by sampling from the multivariate Gaussian distribution x \sim \mathcal N(0, I_{d \times d}) and normalizing. Therefore, \|v\|_2^2 has the same distribution as \frac{\sum_{i=1}^k x_i^2}{\sum_{i=1}^k x_i^2 + \sum_{i=k+1}^{d} x_i^2}, which, by the chi-squared construction of the Beta distribution, follows \operatorname{Beta}(k/2, (d-k)/2) and has mean k/d. The norm of the projection therefore concentrates around \sqrt{k/d}, with the concentration inequality Pr\left[\left|\|v\|_2-\sqrt{\frac{k}{d}}\right| \geq \epsilon \sqrt{\frac{k}{d}}\right] \leq 3 \exp \left(-k \epsilon^2 / 64\right) for any \epsilon \in (0, 1).
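The distributional claim can be checked numerically. The following sketch (NumPy, with arbitrary choices of d, k, and the number of trials) simulates the ratio above and compares its mean and variance to those of \operatorname{Beta}(k/2, (d-k)/2):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, trials = 100, 10, 100_000

# ||v||_2^2 is distributed as the ratio of the first k squared Gaussian
# coordinates to the total squared norm.
x = rng.normal(size=(trials, d))
sq = x ** 2
ratio = sq[:, :k].sum(axis=1) / sq.sum(axis=1)

a, b = k / 2, (d - k) / 2
beta_mean = a / (a + b)                          # equals k/d
beta_var = a * b / ((a + b) ** 2 * (a + b + 1))
print(ratio.mean(), beta_mean)                   # both close to k/d = 0.1
print(ratio.var(), beta_var)
```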
==Gaussian random projection==
The random matrix R can be generated using a Gaussian distribution. The first row is a random unit vector chosen uniformly from S^{d-1}. The second row is a random unit vector from the space orthogonal to the first row, the third row is a random unit vector from the space orthogonal to the first two rows, and so on. Choosing R in this way ensures the following properties:
• Spherical symmetry: for any orthogonal matrix A \in O(d), RA and R have the same distribution.
• Orthogonality: the rows of R are orthogonal to each other.
• Normality: the rows of R are unit-length vectors.
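One practical way to realize a matrix with these properties, used here as an assumed construction rather than the row-by-row procedure described above, is to orthonormalize a Gaussian matrix with a QR decomposition; the sign correction below is a standard trick to make the resulting frame rotation-invariant in distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 100, 10

# QR-decompose a d x k Gaussian matrix; Q then has orthonormal columns
# spanning a uniformly random k-dimensional subspace.
G = rng.normal(size=(d, k))
Q, T = np.linalg.qr(G)
Q = Q * np.sign(np.diag(T))   # sign fix so the frame is uniformly (Haar) distributed
R = Q.T                       # k x d matrix with orthogonal, unit-length rows

print(np.allclose(R @ R.T, np.eye(k)))   # True: rows are orthonormal
```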
==More computationally efficient random projections==
Achlioptas has shown that the random matrix can be sampled more efficiently. The entries of the matrix can either be sampled i.i.d. according to

R_{i,j} = \sqrt{3/k} \times \begin{cases} +1 & \text{with probability }\frac{1}{6}\\ 0 & \text{with probability }\frac{2}{3}\\ -1 & \text{with probability }\frac{1}{6} \end{cases}

or i.i.d. according to

R_{i,j} = \sqrt{1/k} \times \begin{cases} +1 & \text{with probability }\frac{1}{2}\\ -1 & \text{with probability }\frac{1}{2} \end{cases}

Both are efficient for database applications because the computations can be performed using integer arithmetic. It was later shown how to use integer arithmetic while making the distribution even sparser, with very few nonzeroes per column, in work on the Sparse JL Transform. This is advantageous because a sparse embedding matrix allows the data to be projected to a lower dimension even faster.
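Both distributions are straightforward to sample; the sketch below (NumPy, with illustrative sizes) draws the two matrices exactly as defined above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1000, 50

# Sparse variant: entries in {+1, 0, -1} with probabilities 1/6, 2/3, 1/6,
# scaled by sqrt(3/k).
R_sparse = np.sqrt(3.0 / k) * rng.choice(
    [1.0, 0.0, -1.0], size=(k, d), p=[1 / 6, 2 / 3, 1 / 6]
)

# Sign variant: entries +1 or -1 with equal probability, scaled by sqrt(1/k).
R_sign = np.sqrt(1.0 / k) * rng.choice([1.0, -1.0], size=(k, d))
```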
==Random projection with quantization==
Random projection can be further condensed by quantization (discretization), with 1-bit (sign random projection) or multi-bit quantization. It is a building block of SimHash, the RP tree, and other memory-efficient estimation and learning methods.
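As an illustration, the following sketch implements 1-bit (sign) random projection and uses the standard identity for sign random projection, P(signs differ) = angle/π, to estimate the angle between two vectors; the vectors and sizes are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 200, 4096

u = rng.normal(size=d)
w = u + 0.5 * rng.normal(size=d)     # a second vector, correlated with u

# 1-bit quantization: keep only the sign of each projected coordinate.
R = rng.normal(size=(k, d))
bits_u = (R @ u) > 0
bits_w = (R @ w) > 0

# For sign random projection, P(bit flip) = angle(u, w) / pi, so the
# normalized Hamming distance estimates the angle between the vectors.
est_angle = np.mean(bits_u != bits_w) * np.pi
true_angle = np.arccos(u @ w / (np.linalg.norm(u) * np.linalg.norm(w)))
print(est_angle, true_angle)
```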
==Large quasiorthogonal bases==