Factor analysis

==Definition==

The model attempts to explain a set of p observations in each of n individuals with a set of k common factors (f_{i,j}), where there are fewer factors per unit than observations per unit (k < p). Each individual has k of their own common factors, and these are related to the observations via the factor loading matrix (L \in \mathbb{R}^{p \times k}), for a single observation, according to
: x_{i,m} - \mu_{i} = l_{i,1} f_{1,m} + \dots + l_{i,k} f_{k,m} + \varepsilon_{i,m}
where
• x_{i,m} is the value of the ith observation of the mth individual,
• \mu_i is the observation mean for the ith observation,
• l_{i,j} is the loading for the ith observation of the jth factor,
• f_{j,m} is the value of the jth factor of the mth individual, and
• \varepsilon_{i,m} is the (i,m)th unobserved stochastic error term with mean zero and finite variance.

In matrix notation
: X - \Mu = L F + \varepsilon
where observation matrix X \in \mathbb{R}^{p \times n}, loading matrix L \in \mathbb{R}^{p \times k}, factor matrix F \in \mathbb{R}^{k \times n}, error term matrix \varepsilon \in \mathbb{R}^{p \times n} and mean matrix \Mu \in \mathbb{R}^{p \times n}, whereby the (i,m)th element is simply \Mu_{i,m}=\mu_i.

We also impose the following assumptions on F:
• F and \varepsilon are independent.
• \mathrm{E}(F) = 0, where \mathrm{E} is the expectation.
• \mathrm{Cov}(F)=I, where \mathrm{Cov} is the covariance matrix and I is the identity matrix, so that the factors are uncorrelated and have unit variance.

Suppose \mathrm{Cov}(X - \Mu)=\Sigma.
Then
: \Sigma=\mathrm{Cov}(X - \Mu)=\mathrm{Cov}(LF + \varepsilon).
Because \mathrm{E}(F)=0 (condition 2), \mathrm{E}(LF)=L\,\mathrm{E}(F)=0, and because F and \varepsilon are independent (condition 1), \mathrm{Cov}(LF+\varepsilon)=\mathrm{Cov}(LF)+\mathrm{Cov}(\varepsilon), giving
: \Sigma = L \mathrm{Cov}(F) L^T + \mathrm{Cov}(\varepsilon),
or, using \mathrm{Cov}(F)=I (condition 3) and setting \Psi:=\mathrm{Cov}(\varepsilon),
: \Sigma = LL^T + \Psi.

For any orthogonal matrix Q, if we set L^\prime=LQ and F^\prime=Q^T F, the criteria for being factors and factor loadings still hold, since L^\prime F^\prime = LQQ^T F = LF and \mathrm{Cov}(F^\prime)=Q^T Q=I. Hence a set of factors and factor loadings is unique only up to an orthogonal transformation.

==Example==

Suppose a psychologist has the hypothesis that there are two kinds of intelligence, "verbal intelligence" and "mathematical intelligence", neither of which is directly observed. Evidence for the hypothesis is sought in the examination scores of 1000 students in each of 10 different academic fields. If each student is chosen randomly from a large population, then each student's 10 scores are random variables.

The psychologist's hypothesis may say that, for each of the 10 academic fields, the score averaged over the group of all students who share some common pair of values for verbal and mathematical "intelligences" is some constant times their level of verbal intelligence plus another constant times their level of mathematical intelligence, i.e., it is a linear combination of those two "factors". The numbers for a particular subject, by which the two kinds of intelligence are multiplied to obtain the expected score, are posited by the hypothesis to be the same for all intelligence level pairs, and are called the "factor loadings" for this subject. For example, the hypothesis may hold that the predicted average student's aptitude in the field of astronomy is
: {10 × the student's verbal intelligence} + {6 × the student's mathematical intelligence}.
The numbers 10 and 6 are the factor loadings associated with astronomy. Other academic subjects may have different factor loadings.
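The linear-combination hypothesis and the identity \Sigma = LL^T + \Psi from the definition can be illustrated with a small numerical sketch. Only the astronomy loadings (10 and 6) come from the text; the second subject's loadings, the error variances, the rotation angle and all function names are hypothetical values chosen for illustration.

```python
import numpy as np

# Row 0 holds the astronomy loadings from the text (10 and 6).
# Row 1 and the error variances are made-up values for illustration only.
L = np.array([[10.0, 6.0],
              [3.0, 8.0]])
Psi = np.diag([4.0, 2.5])  # hypothetical diagonal error covariance


def expected_scores(loadings, verbal, mathematical):
    """Expected (mean-centred) scores for a student with the given factor levels."""
    return loadings @ np.array([verbal, mathematical])


def implied_covariance(loadings, psi):
    """Covariance implied by the model: Sigma = L L^T + Psi."""
    return loadings @ loadings.T + psi


def rotation(theta):
    """2x2 orthogonal (rotation) matrix Q."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


# Expected astronomy score: 10 * verbal + 6 * mathematical.
astronomy = expected_scores(L, verbal=1.0, mathematical=2.0)[0]

# Rotating the loadings by an orthogonal Q leaves the implied covariance unchanged.
Q = rotation(0.7)
sigma = implied_covariance(L, Psi)
sigma_rotated = implied_covariance(L @ Q, Psi)
same_covariance = np.allclose(sigma, sigma_rotated)
```

In this sketch the rotated loadings L @ Q differ from L entry by entry, yet both imply the same covariance, which is the non-uniqueness described in the definition.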
Two students assumed to have identical degrees of verbal and mathematical intelligence may have different measured aptitudes in astronomy because individual aptitudes differ from average aptitudes (predicted above) and because of measurement error itself. Such differences make up what is collectively called the "error", a statistical term that means the amount by which an individual, as measured, differs from what is average for or predicted by his or her levels of intelligence (see errors and residuals in statistics).

The observable data that go into factor analysis would be 10 scores of each of the 1000 students, a total of 10,000 numbers. The factor loadings and levels of the two kinds of intelligence of each student must be inferred from the data.

==Mathematical model of the same example==

In the following, matrices will be indicated by indexed variables. "Academic subject" indices will be indicated using letters a, b and c, with values running from 1 to p, which is equal to 10 in the above example. "Factor" indices will be indicated using letters p, q and r, with values running from 1 to k, which is equal to 2 in the above example. "Instance" or "sample" indices will be indicated using letters i, j and k, with values running from 1 to N. (The letters p and k thus serve both as counts and as index letters; which role is meant is clear from context.) In the example above, if a sample of N=1000 students participated in the p=10 exams, the ith student's score for the ath exam is given by x_{ai}. The purpose of factor analysis is to characterize the correlations between the variables x_a, of which the x_{ai} are a particular instance, or set of observations.
In order for the variables to be on equal footing, they are normalized into standard scores z:
: z_{ai}=\frac{x_{ai}-\hat\mu_a}{\hat\sigma_a}
where the sample mean is
: \hat\mu_a=\tfrac{1}{N}\sum_i x_{ai}
and the sample variance is given by
: \hat\sigma_a^2=\tfrac{1}{N-1}\sum_i (x_{ai}-\hat\mu_a)^2

The factor analysis model for this particular sample is then
: \begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\ \vdots & & \vdots & & \vdots & & \vdots \\ z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i} \end{matrix}
or, more succinctly,
: z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai}
where
• F_{1i} is the ith student's "verbal intelligence",
• F_{2i} is the ith student's "mathematical intelligence",
• \ell_{ap} are the factor loadings for the ath subject, for p=1,2.

In matrix notation, we have
: Z=LF+\varepsilon

Observe that doubling the scale on which "verbal intelligence" (the first component in each column of F) is measured, and simultaneously halving the factor loadings for verbal intelligence, makes no difference to the model. Thus, no generality is lost by assuming that the standard deviation of the factors for verbal intelligence is 1. Likewise for mathematical intelligence. Moreover, for similar reasons, no generality is lost by assuming the two factors are uncorrelated with each other. In other words,
: \sum_i F_{pi}F_{qi}=\delta_{pq}
where \delta_{pq} is the Kronecker delta (0 when p \ne q and 1 when p=q). The errors are assumed to be independent of the factors:
: \sum_i F_{pi}\varepsilon_{ai}=0

Since any rotation of a solution is also a solution, interpreting the factors is difficult. See disadvantages below. In this particular example, if we do not know beforehand that the two types of intelligence are uncorrelated, then we cannot interpret the two factors as the two different types of intelligence.
Even if they are uncorrelated, we cannot tell which factor corresponds to verbal intelligence and which corresponds to mathematical intelligence without an outside argument.

The values of the loadings L, the averages \mu, and the variances of the "errors" \varepsilon must be estimated given the observed data X and F (the assumption about the levels of the factors is fixed for a given F). The "fundamental theorem" may be derived from the above conditions:
: \sum_i z_{ai}z_{bi}=\sum_p \ell_{ap}\ell_{bp}+\sum_i \varepsilon_{ai}\varepsilon_{bi}

The term on the left is the (a,b)-term of the correlation matrix of the observed data (a p \times p matrix derived as the product of the p \times N matrix of standardized observations with its transpose), and its p diagonal elements will be 1s. The second term on the right will be a diagonal matrix with terms less than unity. The first term on the right is the "reduced correlation matrix" and will be equal to the correlation matrix except for its diagonal values, which will be less than unity. These diagonal elements of the reduced correlation matrix are called "communalities" (which represent the fraction of the variance in the observed variable that is accounted for by the factors):
: h_a^2=1-\psi_a=\sum_p \ell_{ap}\ell_{ap}

The sample data z_{ai} will not exactly obey the fundamental equation given above due to sampling errors, inadequacy of the model, etc. The goal of any analysis of the above model is to find the factors F_{pi} and loadings \ell_{ap} which give a "best fit" to the data. In factor analysis, the best fit is defined as the minimum of the mean square error in the off-diagonal residuals of the correlation matrix:
: \varepsilon^2 = \sum_{a\ne b} \left[\sum_i z_{ai}z_{bi}-\sum_p \ell_{ap}\ell_{bp}\right]^2
This is equivalent to minimizing the off-diagonal components of the error covariance which, in the model equations, have expected values of zero.
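The off-diagonal criterion and the communalities can be evaluated directly, as in the following minimal sketch. It does not fit anything: it builds a correlation matrix that obeys the model exactly from hypothetical loadings (the values, the rotation angle and the function names are all assumptions made for illustration) and then evaluates the criterion at a few candidate loading matrices.

```python
import numpy as np


def offdiag_sq_error(R, L):
    """Sum over a != b of (R_ab - sum_p L_ap L_bp)^2."""
    resid = R - L @ L.T
    np.fill_diagonal(resid, 0.0)  # diagonal residuals are excluded from the fit
    return float(np.sum(resid ** 2))


def communalities(L):
    """h_a^2 = sum_p L_ap^2, the diagonal of the reduced correlation matrix."""
    return np.sum(L ** 2, axis=1)


# Hypothetical loadings for 6 variables on 2 factors.
L_true = np.array([[0.8, 0.1],
                   [0.7, 0.2],
                   [0.6, 0.3],
                   [0.2, 0.7],
                   [0.1, 0.8],
                   [0.3, 0.6]])

h2 = communalities(L_true)
Psi = np.diag(1.0 - h2)      # psi_a = 1 - h_a^2
R = L_true @ L_true.T + Psi  # model-exact correlation matrix with unit diagonal

theta = 0.4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])

err_true = offdiag_sq_error(R, L_true)          # model holds exactly
err_rotated = offdiag_sq_error(R, L_true @ Q)   # rotated loadings fit equally well
err_shrunk = offdiag_sq_error(R, 0.9 * L_true)  # mis-scaled loadings fit worse
```

Because the diagonal is excluded, the criterion does not require the communalities to be known in advance; they follow from whichever loadings minimize the off-diagonal error.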
The off-diagonal criterion of factor analysis is to be contrasted with principal component analysis, which seeks to minimize the mean square error of all residuals. Before the advent of high-speed computers, considerable effort was devoted to finding approximate solutions to the problem, particularly in estimating the communalities by other means, which then simplifies the problem considerably by yielding a known reduced correlation matrix. This was then used to estimate the factors and the loadings. With the advent of high-speed computers, the minimization problem can be solved iteratively with adequate speed, and the communalities are calculated in the process, rather than being needed beforehand. The MinRes algorithm is particularly suited to this problem, but is hardly the only iterative means of finding a solution.

If the solution factors are allowed to be correlated (as in 'oblimin' rotation, for example), then the corresponding mathematical model uses skew coordinates rather than orthogonal coordinates.

==Geometric interpretation==

The parameters and variables of factor analysis can be given a geometrical interpretation. The data (z_{ai}), the factors (F_{pi}) and the errors (\varepsilon_{ai}) can be viewed as vectors in an N-dimensional Euclidean space (sample space), represented as \mathbf{z}_a, \mathbf{F}_p and \boldsymbol{\varepsilon}_a respectively. Since the data are standardized, the data vectors are of unit length (||\mathbf{z}_a||=1). The factor vectors define a k-dimensional linear subspace (i.e. a hyperplane) in this space, upon which the data vectors are projected orthogonally. This follows from the model equation
: \mathbf{z}_a=\sum_p \ell_{ap} \mathbf{F}_p+\boldsymbol{\varepsilon}_a
and the independence of the factors and the errors: \mathbf{F}_p\cdot\boldsymbol{\varepsilon}_a=0. In the above example, the hyperplane is just a 2-dimensional plane defined by the two factor vectors.
The projection of the data vectors onto the hyperplane is given by
: \hat{\mathbf{z}}_a=\sum_p \ell_{ap}\mathbf{F}_p
and the errors are vectors from that projected point to the data point and are perpendicular to the hyperplane. The goal of factor analysis is to find a hyperplane which is a "best fit" to the data in some sense, so it doesn't matter how the factor vectors which define this hyperplane are chosen, as long as they are linearly independent and lie in the hyperplane. We are free to specify them as both orthogonal and normal (\mathbf{F}_p\cdot \mathbf{F}_q=\delta_{pq}) with no loss of generality. After a suitable set of factors is found, they may also be arbitrarily rotated within the hyperplane, so that any rotation of the factor vectors will define the same hyperplane, and also be a solution. As a result, in the above example, in which the fitting hyperplane is two dimensional, if we do not know beforehand that the two types of intelligence are uncorrelated, then we cannot interpret the two factors as the two different types of intelligence. Even if they are uncorrelated, we cannot tell which factor corresponds to verbal intelligence and which corresponds to mathematical intelligence, or whether the factors are linear combinations of both, without an outside argument.

Since the data vectors \mathbf{z}_a have unit length, the entries of the correlation matrix for the data are given by r_{ab}=\mathbf{z}_a\cdot\mathbf{z}_b. Each entry of the correlation matrix can therefore be interpreted geometrically as the cosine of the angle between the two data vectors \mathbf{z}_a and \mathbf{z}_b. The diagonal elements will clearly be 1s and the off-diagonal elements will have absolute values less than or equal to unity. The "reduced correlation matrix" is defined as
: \hat{r}_{ab}=\hat{\mathbf{z}}_a\cdot\hat{\mathbf{z}}_b.
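The geometric picture can be checked numerically with a short sketch. It assumes that each centred variable is rescaled to a vector of unit length (so that dot products are correlations), uses randomly generated data, and projects onto an arbitrary plane rather than a fitted one; the sizes, the seed and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N, k = 5, 200, 2  # hypothetical sizes

# Raw data, one row per variable; a shared term makes the variables correlated.
X = rng.normal(size=(p, N)) + rng.normal(size=(1, N))

# Centre each row and rescale it to unit length, so that ||z_a|| = 1.
Zc = X - X.mean(axis=1, keepdims=True)
Z = Zc / np.linalg.norm(Zc, axis=1, keepdims=True)

# r_ab = z_a . z_b, the cosine of the angle between z_a and z_b.
R = Z @ Z.T

# An arbitrary orthonormal pair of zero-mean factor vectors (rows of F).
A = rng.normal(size=(N, k))
A -= A.mean(axis=0)
F = np.linalg.qr(A)[0].T  # shape (k, N), with F @ F.T equal to the identity

L = Z @ F.T              # loadings l_ap = z_a . F_p
Z_hat = L @ F            # orthogonal projection of each z_a onto the plane
E = Z - Z_hat            # error vectors, perpendicular to the plane
R_hat = Z_hat @ Z_hat.T  # reduced correlation matrix for this plane

# Rotating the factor vectors within the plane gives the same projections.
theta = 1.1
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
F_rot = Q.T @ F
R_hat_rot = (Z @ F_rot.T) @ (Z @ F_rot.T).T
```

Here R agrees with the usual sample correlation matrix of X, the rows of E are orthogonal to both factor vectors, and R_hat is the same for F and for the rotated F_rot, since both span the same plane.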
The goal of factor analysis is to choose the fitting hyperplane such that the reduced correlation matrix reproduces the correlation matrix as nearly as possible, except for the diagonal elements of the correlation matrix, which are known to have unit value. In other words, the goal is to reproduce as accurately as possible the cross-correlations in the data. Specifically, for the fitting hyperplane, the mean square error in the off-diagonal components
: \varepsilon^2=\sum_{a\ne b} \left(r_{ab}-\hat{r}_{ab}\right)^2
is to be minimized, and this is accomplished by minimizing it with respect to a set of orthonormal factor vectors. It can be seen that
: r_{ab}-\hat{r}_{ab}= \boldsymbol{\varepsilon}_a\cdot\boldsymbol{\varepsilon}_b
The term on the right is just the covariance of the errors. In the model, the error covariance is stated to be a diagonal matrix, and so the above minimization problem will in fact yield a "best fit" to the model: it will yield a sample estimate of the error covariance which has its off-diagonal components minimized in the mean square sense.

Since the \hat{\mathbf{z}}_a are orthogonal projections of the data vectors, their lengths will be less than or equal to the length of the data vector being projected, which is unity. The squares of these lengths are just the diagonal elements of the reduced correlation matrix. These diagonal elements of the reduced correlation matrix are known as "communalities":
: {h_a}^2=||\hat{\mathbf{z}}_a||^2= \sum_p {\ell_{ap}}^2
Large values of the communalities will indicate that the fitting hyperplane is rather accurately reproducing the correlation matrix. The mean values of the factors must also be constrained to be zero, from which it follows that the mean values of the errors will also be zero.

==Practical implementation==