Statistical inference based on Pearson's correlation coefficient often focuses on one of two aims:
• to test the null hypothesis that the true correlation coefficient ρ is equal to 0, based on the value of the sample correlation coefficient r;
• to derive a confidence interval that, on repeated sampling, has a given probability of containing ρ.
Methods of achieving one or both of these aims are discussed below.
Using a permutation test
Permutation tests provide a direct approach to performing hypothesis tests and constructing confidence intervals. A permutation test for Pearson's correlation coefficient involves the following two steps:
• Using the original paired data (xi, yi), randomly redefine the pairs to create a new data set (xi, yi′), where the i′ are a permutation of the set {1, ..., n}. The permutation i′ is selected randomly, with equal probabilities placed on all n! possible permutations. This is equivalent to drawing the i′ randomly without replacement from the set {1, ..., n}. In bootstrapping, a closely related approach, the i and the i′ are equal and drawn with replacement from {1, ..., n}.
• Construct a correlation coefficient r from the randomized data.
To perform the permutation test, repeat steps (1) and (2) a large number of times. The
p-value for the permutation test is the proportion of the
r values generated in step (2) that are larger than the Pearson correlation coefficient that was calculated from the original data. Here "larger" can mean either that the value is larger in magnitude, or larger in signed value, depending on whether a
two-sided or
one-sided test is desired.
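The two steps above can be sketched in a few lines of Python. This is a minimal illustration assuming NumPy; the function name `permutation_pvalue` is invented for the example.

```python
import numpy as np

def permutation_pvalue(x, y, n_resamples=10_000, seed=0):
    """Two-sided permutation p-value for Pearson's r.

    Step 1: randomly re-pair the data by permuting y while keeping x fixed.
    Step 2: recompute r for each permutation. The p-value is the proportion
    of permuted |r| values at least as large as the observed |r|.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_obs = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_resamples):
        # rng.permutation draws a permutation uniformly over all n! possibilities
        r_perm = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r_perm) >= abs(r_obs):
            count += 1
    return count / n_resamples
```

For a one-sided test, replace the magnitude comparison with a signed comparison `r_perm >= r_obs`.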
Using a bootstrap
The bootstrap can be used to construct confidence intervals for Pearson's correlation coefficient. In the "non-parametric" bootstrap,
n pairs (
xi,
yi) are resampled "with replacement" from the observed set of
n pairs, and the correlation coefficient
r is calculated based on the resampled data. This process is repeated a large number of times, and the empirical distribution of the resampled
r values is used to approximate the
sampling distribution of the statistic. A 95%
confidence interval for
ρ can be defined as the interval spanning from the 2.5th to the 97.5th
percentile of the resampled
r values.
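A minimal sketch of this percentile bootstrap, assuming NumPy; the function name `bootstrap_ci` is invented for the example.

```python
import numpy as np

def bootstrap_ci(x, y, n_resamples=10_000, alpha=0.05, seed=0):
    """Non-parametric percentile bootstrap CI for Pearson's r.

    Pairs (x_i, y_i) are resampled with replacement; the 2.5th and 97.5th
    percentiles of the resampled r values bound the default 95% interval.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    rs = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # n pair indices, drawn with replacement
        rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    lo, hi = np.percentile(rs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

Note that resampling whole pairs, rather than x and y separately, preserves the dependence structure being estimated.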
Standard error
If x and y are random variables with a simple linear relationship between them and an additive normal noise term (i.e., y = a + bx + e), then the standard error associated with the correlation is
:\sigma_r = \sqrt{\frac{1 - r^2}{n - 2}}
where r is the correlation and n the sample size.
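As a small illustration, the standard error can be computed directly in the form \sigma_r = \sqrt{(1 - r^2)/(n - 2)}, which is the form consistent with the t statistic of the next section; the function name is invented for the example.

```python
import math

def corr_standard_error(r, n):
    """Standard error of the sample correlation: sqrt((1 - r^2) / (n - 2))."""
    return math.sqrt((1 - r ** 2) / (n - 2))
```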
Testing using Student's t-distribution
For pairs from an uncorrelated
bivariate normal distribution, the
sampling distribution of the
studentized Pearson's correlation coefficient follows
Student's t-distribution with degrees of freedom
n − 2. Specifically, if the underlying variables have a bivariate normal distribution, the variable
:t = \frac{r}{\sigma_r} = r\sqrt{\frac{n-2}{1 - r^2}}
has a Student's t-distribution in the null case (zero correlation). This holds approximately for non-normal observed values if the sample size is large enough. For determining the critical values for
r, the inverse function is needed:
:r = \frac{t}{\sqrt{n - 2 + t^2}}.
Alternatively, large-sample asymptotic approaches can be used. Another early paper provides graphs and tables for general values of
ρ, for small sample sizes, and discusses computational approaches. In the case where the underlying variables are not normal, the sampling distribution of Pearson's correlation coefficient follows a Student's
t-distribution, but the degrees of freedom are reduced.
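The test statistic and the critical-value inversion above can be sketched as follows, assuming SciPy for the t-distribution; the function names are invented for the example.

```python
import math
from scipy import stats

def corr_t_test(r, n):
    """Two-sided test of rho = 0 using t = r * sqrt((n - 2) / (1 - r^2)), df = n - 2."""
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # sf = 1 - CDF (upper tail)
    return t, p

def critical_r(alpha, n):
    """Critical |r| at level alpha, via the inverse relation r = t / sqrt(n - 2 + t^2)."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t_crit / math.sqrt(n - 2 + t_crit ** 2)
```

By construction, an observed |r| equal to `critical_r(alpha, n)` yields a two-sided p-value of exactly alpha.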
Using the exact distribution
For data that follow a
bivariate normal distribution, the exact density function
f(
r) for the sample correlation coefficient
r is
:f(r) = \frac{(n - 2)\, \mathrm{\Gamma}(n - 1) \left(1 - \rho^2\right)^{\frac{n - 1}{2}} \left(1 - r^2\right)^{\frac{n - 4}{2}}}{\sqrt{2\pi}\, \operatorname{\Gamma}\mathord\left(n - \tfrac{1}{2}\right) (1 - \rho r)^{n - \frac{3}{2}}} {}_{2}\mathrm{F}_{1}\mathord\left(\tfrac{1}{2}, \tfrac{1}{2}; \tfrac{1}{2}(2n - 1); \tfrac{1}{2}(\rho r + 1)\right)
where \Gamma is the
gamma function and {}_{2}\mathrm{F}_{1}(a,b;c;z) is the
Gaussian hypergeometric function. In the special case when \rho = 0 (zero population correlation), the exact density function
f(
r) can be written as :f(r) = \frac{\left( 1-r^2 \right)^{\frac{n - 4}{2}}}{\operatorname{\Beta}\mathord\left(\tfrac{1}{2}, \tfrac{n - 2}{2}\right)}, where \Beta is the
beta function, which is one way of writing the density of a Student's t-distribution for a
studentized sample correlation coefficient, as above.
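As a sanity check, the null density can be evaluated numerically and integrated over [−1, 1], where it should have unit mass. The sketch below assumes NumPy and SciPy; the function name is invented for the example.

```python
import numpy as np
from scipy.special import beta

def null_density(r, n):
    """Exact density of the sample correlation r when rho = 0 (bivariate normal)."""
    return (1 - r ** 2) ** ((n - 4) / 2) / beta(0.5, (n - 2) / 2)

# Trapezoid-rule integration of the density over [-1, 1] for n = 10
grid = np.linspace(-1.0, 1.0, 20001)
vals = null_density(grid, n=10)
area = float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid)))
```

The density is symmetric in r under the null, as the formula (depending on r only through r²) makes explicit.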
Using the Fisher transformation
In practice,
confidence intervals and
hypothesis tests relating to
ρ are usually carried out using the Fisher transformation, a variance-stabilizing transformation:
:F(r) \equiv \tfrac{1}{2} \, \ln \left(\frac{1 + r}{1 - r}\right) = \operatorname{artanh}(r)
F(
r) approximately follows a
normal distribution with
:\text{mean} = F(\rho) = \operatorname{artanh}(\rho)
and standard error
:\text{SE} = \frac{1}{\sqrt{n - 3}},
where
n is the sample size. The approximation error is smallest for large n and for values of r and \rho_0 near zero, and grows otherwise. Using the approximation, a
z-score is
:z = \frac{F(r) - \text{mean}}{\text{SE}} = [F(r) - F(\rho_0)]\sqrt{n - 3}
under the
null hypothesis that \rho = \rho_0, given the assumption that the sample pairs are
independent and identically distributed and follow a
bivariate normal distribution. Thus an approximate
p-value can be obtained from a normal probability table. For example, if
z = 2.2 is observed and a two-sided p-value is desired to test the null hypothesis that \rho = 0, the p-value is 2\,\Phi(-2.2) = 0.028, where Φ is the standard normal
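The z-statistic and two-sided p-value can be computed with the standard library alone; a minimal sketch, with an invented function name:

```python
import math

def fisher_z_test(r, n, rho0=0.0):
    """Approximate z statistic and two-sided p-value for H0: rho = rho0,
    using the Fisher transformation artanh(r)."""
    z = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
    # Two-sided p-value 2*Phi(-|z|), written via the complementary error function
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```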
cumulative distribution function. To obtain a confidence interval for ρ, we first compute a confidence interval for
F(
\rho): :100(1 - \alpha)\%\text{CI}: \operatorname{artanh}(\rho) \in [\operatorname{artanh}(r) \pm z_{\alpha/2}\text{SE}] The inverse Fisher transformation brings the interval back to the correlation scale. :100(1 - \alpha)\%\text{CI}: \rho \in [\tanh(\operatorname{artanh}(r) - z_{\alpha/2}\text{SE}), \tanh(\operatorname{artanh}(r) + z_{\alpha/2}\text{SE})] For example, suppose we observe
r = 0.7 with a sample size of
n=50, and we wish to obtain a 95% confidence interval for
ρ. The transformed value is \operatorname{artanh}(r) = 0.8673, so the confidence interval on the transformed scale is 0.8673 \pm \frac{1.96}{\sqrt{47}}, or (0.5814, 1.1532). Converting back to the correlation scale yields (0.5237, 0.8188).
==In least squares regression analysis==