The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the rank variables. For a sample of size n, the n pairs of raw scores (X_i, Y_i) are converted to ranks \operatorname{R}[X_i], \operatorname{R}[Y_i], and r_s is computed as

: r_s = \rho\bigl[\operatorname{R}[X], \operatorname{R}[Y]\bigr] = \frac{\operatorname{cov}\bigl[\operatorname{R}[X], \operatorname{R}[Y]\bigr]}{\sigma_{\operatorname{R}[X]}\, \sigma_{\operatorname{R}[Y]}},

where
: \rho denotes the conventional Pearson correlation coefficient operator, here applied to the rank variables,
: \operatorname{cov}\bigl[\operatorname{R}[X], \operatorname{R}[Y]\bigr] is the covariance of the rank variables,
: \sigma_{\operatorname{R}[X]} and \sigma_{\operatorname{R}[Y]} are the standard deviations of the rank variables.
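In code, this definition can be applied directly: rank each sample (assigning tied values the average of their positions) and take the Pearson correlation of the ranks. The following is a minimal sketch in Python with NumPy; the helper names are illustrative, and in practice a library routine such as scipy.stats.spearmanr performs the same computation:

```python
import numpy as np

def rankdata_avg(a):
    """Assign ranks 1..n, giving tied values the average of their positions."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a, kind="stable")
    sorted_a = a[order]
    ranks = np.empty(len(a))
    i = 0
    while i < len(a):
        j = i
        # extend j over the run of values tied with sorted_a[i]
        while j + 1 < len(a) and sorted_a[j + 1] == sorted_a[i]:
            j += 1
        # 0-based positions i..j share the average rank (i + j)/2 + 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's r_s: the Pearson correlation of the rank variables."""
    rx, ry = rankdata_avg(x), rankdata_avg(y)
    return np.corrcoef(rx, ry)[0, 1]
```

Because this route goes through the general Pearson formula, it remains valid in the presence of ties, unlike the simplified formula discussed next.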
Only if all n ranks are distinct integers (no ties) can it be computed using the popular formula

: r_s = 1 - \frac{6 \sum d_i^2}{n \left( n^2 - 1 \right)},

where
: d_i \equiv \operatorname{R}[X_i] - \operatorname{R}[Y_i] is the difference between the two ranks of each observation,
: n is the number of observations.

Consider a bivariate sample (X_i, Y_i), i = 1, \ldots, n, with corresponding rank pairs \left( \operatorname{R}[X_i], \operatorname{R}[Y_i] \right) = (R_i, S_i). Then the Spearman correlation coefficient of (X, Y) is

: r_s = \frac{\frac{1}{n} \sum_{i=1}^n R_i S_i - \overline{R}\, \overline{S}}{\sigma_R \sigma_S},

where, as usual,

: \begin{align} \overline{R} & = \frac{1}{n} \sum_{i=1}^n R_i, \\[6pt] \overline{S} & = \frac{1}{n} \sum_{i=1}^n S_i, \\[6pt] \sigma_R^2 & = \frac{1}{n} \sum_{i=1}^n \left( R_i - \overline{R} \right)^2, \end{align}

and

: \sigma_S^2 = \frac{1}{n} \sum_{i=1}^n \left( S_i - \overline{S} \right)^2.

We shall show that r_s can be expressed purely in terms of d_i \equiv R_i - S_i, provided there are no ties within either sample. Under this assumption, R and S can each be viewed as random variables distributed like a discrete random variable U uniformly distributed on \{ 1, 2, \ldots, n \}. Hence \overline{R} = \overline{S} = \operatorname{E}[U] and \sigma_R^2 = \sigma_S^2 = \operatorname{Var}[U] = \operatorname{E}[U^2] - \operatorname{E}[U]^2, where

: \begin{align} \operatorname{E}[U] & = \frac{1}{n} \sum_{i=1}^n i = \frac{n+1}{2}, \\[6pt] \operatorname{E}[U^2] & = \frac{1}{n} \sum_{i=1}^n i^2 = \frac{(n+1)(2n+1)}{6}, \end{align}

and thus

: \operatorname{Var}[U] = \frac{(n+1)(2n+1)}{6} - \left( \frac{n+1}{2} \right)^2 = \frac{n^2 - 1}{12}.

(These sums can be computed using the formulas for triangular numbers and square pyramidal numbers, or basic summation results from umbral calculus.)

Observe now that, since d_i = R_i - S_i implies R_i S_i = \tfrac{1}{2} \left( R_i^2 + S_i^2 - d_i^2 \right), and since \overline{S} = \overline{R} and \sum_i S_i^2 = \sum_i R_i^2 when there are no ties,

: \begin{align} \frac{1}{n} \sum_{i=1}^n R_i S_i - \overline{R}\, \overline{S} & = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \left( R_i^2 + S_i^2 - d_i^2 \right) - \overline{R}^2 \\[6pt] & = \frac{1}{2n} \sum_{i=1}^n R_i^2 + \frac{1}{2n} \sum_{i=1}^n S_i^2 - \frac{1}{2n} \sum_{i=1}^n d_i^2 - \overline{R}^2 \\[6pt] & = \left( \frac{1}{n} \sum_{i=1}^n R_i^2 - \overline{R}^2 \right) - \frac{1}{2n} \sum_{i=1}^n d_i^2 \\[6pt] & = \sigma_R^2 - \frac{1}{2n} \sum_{i=1}^n d_i^2 \\[6pt] & = \sigma_R \sigma_S - \frac{1}{2n} \sum_{i=1}^n d_i^2. \end{align}

Putting this all together thus yields

: \begin{align} r_s & = \frac{\sigma_R \sigma_S - \frac{1}{2n} \sum_{i=1}^n d_i^2}{\sigma_R \sigma_S} \\[6pt] & = 1 - \frac{\sum_{i=1}^n d_i^2}{2n \cdot \frac{n^2-1}{12}} \\[6pt] & = 1 - \frac{6 \sum_{i=1}^n d_i^2}{n(n^2 - 1)} ~. \end{align}
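As a quick numerical sanity check of this identity (an illustrative NumPy script, not part of the derivation), one can verify on tie-free data that the shortcut formula agrees with the Pearson correlation of the ranks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tie-free sample: permutations of distinct values guarantee distinct ranks.
x = rng.permutation(20).astype(float)
y = rng.permutation(20).astype(float)

# Ranks 1..n via a double argsort (valid only when there are no ties).
rx = np.argsort(np.argsort(x)) + 1
ry = np.argsort(np.argsort(y)) + 1

n = len(x)
d = rx - ry

shortcut = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))  # popular d_i^2 formula
pearson_on_ranks = np.corrcoef(rx, ry)[0, 1]        # definition of r_s

assert np.isclose(shortcut, pearson_on_ranks)
```

With ties present, the two quantities diverge, which is why the shortcut is restricted to distinct ranks, as discussed next.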
Identical values are usually each assigned fractional ranks equal to the average of their positions in the ascending order of the values, which is equivalent to averaging over all possible permutations. If ties are present in the data set, the simplified formula above yields incorrect results: only if all ranks in both variables are distinct does

: \sigma_{\operatorname{R}[X]}\, \sigma_{\operatorname{R}[Y]} = \operatorname{Var}\bigl[\operatorname{R}[X]\bigr] = \operatorname{Var}\bigl[\operatorname{R}[Y]\bigr] = \tfrac{1}{12} \left( n^2 - 1 \right)

hold (calculated according to the biased variance). The first equation, which normalizes by the standard deviations, may be used even when ranks are normalized to [0, 1] ("relative ranks"), because it is insensitive both to translation and to linear scaling. The simplified method should also not be used when the data set is truncated; that is, when the Spearman correlation coefficient is desired for the top
X records (whether by pre-change rank or post-change rank, or both), the user should use the Pearson correlation coefficient formula given above.

==Related quantities==