Most simply, to understand the bias that needs correcting, think of an extreme case. Suppose the population is (0, 0, 0, 1, 2, 9), which has a population mean of 2 and a population variance of 62/6 ≈ 10.33. A sample of n = 1 is drawn, and it turns out to be x_1 = 0. The best estimate of the population mean is \bar{x} = x_1/n = 0/1 = 0. But if we use the formula (x_1-\bar{x})^2/n = (0-0)^2/1 = 0 to estimate the variance, the estimate is zero, and it would be zero for any population and any sample of n = 1. The problem is that in estimating the sample mean, the process has already made our estimate of the mean close to the value we sampled (identical, for n = 1). In the case of n = 1, the variance simply cannot be estimated, because there is no variability in the sample.

But consider n = 2. Suppose the sample were (0, 2). Then \bar{x} = 1 and

: \left[(x_1-\bar{x})^2 + (x_2-\bar{x})^2\right]/n = (1+1)/2 = 1,

but with Bessel's correction,

: \left[(x_1-\bar{x})^2 + (x_2-\bar{x})^2\right]/(n-1) = (1+1)/1 = 2,

which is an unbiased estimate (if all 15 possible samples of n = 2 are drawn without replacement from this six-element population and this method is used, the average estimate is 12.4, equal to the sum of squared deviations, 62, divided by 5, that is, the population variance computed with Bessel's correction).

To see this in more detail, consider the following example. Suppose the mean of the whole population is 2050, but the statistician does not know that, and must estimate it based on this small sample chosen randomly from the population:

: 2051,\quad 2053,\quad 2055,\quad 2050,\quad 2051

One may compute the sample average:

: \frac{1}{5}\left(2051 + 2053 + 2055 + 2050 + 2051\right) = 2052

This may serve as an observable estimate of the unobservable population average, which is 2050. Now we face the problem of estimating the population variance. That is the average of the squares of the deviations from 2050. If we knew that the population average is 2050, we could proceed as follows:

: \begin{align} {} & \frac{1}{5}\left[(2051 - 2050)^2 + (2053 - 2050)^2 + (2055 - 2050)^2 + (2050 - 2050)^2 + (2051 - 2050)^2\right] \\[6pt] = {} & \frac{36}{5} = 7.2 \end{align}

But our estimate of the population average is the sample average, 2052. The actual average, 2050, is unknown. So the sample average, 2052, must be used:

: \begin{align} {} & \frac{1}{5}\left[(2051 - 2052)^2 + (2053 - 2052)^2 + (2055 - 2052)^2 + (2050 - 2052)^2 + (2051 - 2052)^2\right] \\[6pt] = {} & \frac{16}{5} = 3.2 \end{align}

The estimate of the variance is now smaller, and it (almost) always is. The only exception occurs when the sample average and the population average are the same. To understand why, consider that variance
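The numerical claims in the examples above, the average corrected estimate of 12.4 over all two-element samples and the 7.2 versus 3.2 comparison, can be checked with a short Python sketch. The helper functions mean and var are ours, not part of the example; the ddof argument follows the convention used by NumPy.

```python
from itertools import combinations

def mean(xs):
    return sum(xs) / len(xs)

def var(xs, ddof=0):
    """Mean squared deviation from the sample mean; ddof=1 applies Bessel's correction."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

# All 15 samples of n = 2 drawn without replacement from the population.
population = [0, 0, 0, 1, 2, 9]
estimates = [var(pair, ddof=1) for pair in combinations(population, 2)]
print(mean(estimates))   # 12.4, i.e. 62 / 5

# The same average without the correction falls well short of 62/6.
estimates0 = [var(pair) for pair in combinations(population, 2)]
print(mean(estimates0))  # 6.2

# The five-observation sample whose (unknown) population mean is 2050.
sample = [2051, 2053, 2055, 2050, 2051]
print(mean(sample))                              # 2052.0
print(mean([(x - 2050) ** 2 for x in sample]))   # 7.2, using the true mean
print(var(sample))                               # 3.2, using the sample mean
```

The last two lines show the effect discussed below: the average squared deviation measured from the sample mean (3.2) is smaller than the one measured from the true mean (7.2).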
measures the average squared distance from a point, and within a given sample, the sample average is precisely the point that minimises the sum of squared distances. A variance calculation using any other value for the average must therefore produce a result at least as large. To see this algebraically, we use a
simple identity:

: (a+b)^2 = a^2 + 2ab + b^2

Here a represents the deviation of an individual sample from the sample mean, and b represents the deviation of the sample mean from the population mean. Note that we have simply decomposed the actual deviation of an individual sample from the (unknown) population mean into two components: the deviation of the single sample from the sample mean, which we can compute, and the additional deviation of the sample mean from the population mean, which we cannot. Now we apply this identity to the squares of the deviations from the population mean:

: \begin{align} {[}\,\underbrace{2053 - 2050}_{\begin{smallmatrix} \text{Deviation from} \\ \text{the population} \\ \text{mean} \end{smallmatrix}}\,]^2 & = [\,\overbrace{(\,\underbrace{2053 - 2052}_{\begin{smallmatrix} \text{Deviation from} \\ \text{the sample mean} \end{smallmatrix}}\,)}^{\text{This is }a.} + \overbrace{(2052 - 2050)}^{\text{This is }b.}\,]^2 \\ & = \overbrace{(2053 - 2052)^2}^{\text{This is }a^2.} + \overbrace{2(2053 - 2052)(2052 - 2050)}^{\text{This is }2ab.} + \overbrace{(2052 - 2050)^2}^{\text{This is }b^2.} \end{align}

Now apply this to all five observations and observe certain patterns:

: \begin{alignat}{2} \overbrace{(2051 - 2052)^2}^{\text{This is }a^2.}\ &+\ \overbrace{2(2051 - 2052)(2052 - 2050)}^{\text{This is }2ab.}\ &&+\ \overbrace{(2052 - 2050)^2}^{\text{This is }b^2.} \\ (2053 - 2052)^2\ &+\ 2(2053 - 2052)(2052 - 2050)\ &&+\ (2052 - 2050)^2 \\ (2055 - 2052)^2\ &+\ 2(2055 - 2052)(2052 - 2050)\ &&+\ (2052 - 2050)^2 \\ (2050 - 2052)^2\ &+\ 2(2050 - 2052)(2052 - 2050)\ &&+\ (2052 - 2050)^2 \\ (2051 - 2052)^2\ &+\ \underbrace{2(2051 - 2052)(2052 - 2050)}_{\begin{smallmatrix} \text{The sum of the entries in this} \\ \text{middle column must be 0.} \end{smallmatrix}}\ &&+\ (2052 - 2050)^2 \end{alignat}

The sum of the entries in the middle column must be zero because the term
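The column-by-column claims about this table can be verified numerically; the following Python sketch (variable names are ours) checks that the middle column sums to zero and that the first and last columns together recover the squared deviations from the population mean:

```python
sample = [2051, 2053, 2055, 2050, 2051]
pop_mean = 2050                            # known only in this constructed example
sample_mean = sum(sample) / len(sample)    # 2052.0

a = [x - sample_mean for x in sample]      # deviations from the sample mean
b = sample_mean - pop_mean                 # deviation of the sample mean from the population mean

# The middle (2ab) column sums to zero because the a-values sum to zero.
print(sum(2 * ai * b for ai in a))         # 0.0

# First column plus last column equals the squared deviations from the population mean.
lhs = sum((x - pop_mean) ** 2 for x in sample)         # 36
rhs = sum(ai ** 2 for ai in a) + len(sample) * b ** 2  # 16 + 5 * 4 = 36
print(lhs, rhs)                            # 36 36.0
```

The a-values sum to zero by construction of the sample mean, and that is exactly what forces the middle column to vanish.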
a, summed across all 5 rows, must itself equal zero. That is because a holds the deviations of the 5 individual samples from the sample mean: the 5 sample values, when added, have the same sum as 5 times their sample mean (5 × 2052), so subtracting the one sum from the other gives zero. The factor 2 and the term
b in the middle column are the same in every row, so they can be factored out of the column sum; the sum of the middle column is then 2b times the sum of the a-values, which, as shown above, is zero. The middle column can therefore be disregarded. The following statements explain the meaning of the remaining columns:

• The sum of the entries in the first column (a^2) is the sum of the squares of the distances from the samples to the sample mean;
• The sum of the entries in the last column (b^2) is the sum of the squared distances between the measured sample mean and the correct population mean;
• Every single row now consists of a pair: a^2 (biased, because the sample mean is used) and b^2 (the correction of the bias, because it takes the difference between the "real" population mean and the inaccurate sample mean into account). Therefore, the sum of the entries of the first and last columns represents the correct quantity: the sum of the squared distances between the samples and the population mean;
• The sum of the a^2-column and the b^2-column must be bigger than the sum of the a^2-column alone, since all the entries in the b^2-column are positive (except when the population mean equals the sample mean, in which case all of the numbers in the last column are 0).

Therefore:

• The sum of squares of the distances from the samples to the population mean will always be bigger than the sum of squares of the distances to the sample mean, except when the sample mean happens to be the same as the population mean, in which case the two are equal.

That is why the sum of squares of the deviations from the sample mean is too small to give an unbiased estimate of the population variance when the average of those squares is found. The smaller the sample size, the larger the difference between the sample variance and the population variance.

== Terminology ==