
Sample mean and covariance

The sample mean (or empirical mean) and the sample covariance (or empirical covariance) are statistics computed from a sample of data on one or more random variables.

Definition of the sample mean
The sample mean is the average of the values of a variable in a sample: the sum of those values divided by the number of values. In mathematical notation, if a sample of N observations on variable X is taken from the population, the sample mean is

\bar{X}=\frac{1}{N}\sum_{i=1}^{N}X_{i}.

Under this definition, if the sample (1, 4, 1) is taken from the population (1, 1, 3, 4, 0, 2, 1, 0), then the sample mean is \bar{x} = (1+4+1)/3 = 2, as compared to the population mean of \mu = (1+1+3+4+0+2+1+0)/8 = 12/8 = 1.5. Even if a sample is random, it is rarely perfectly representative, and other samples drawn from the same population would have other sample means. The sample (2, 1, 0), for example, has a sample mean of 1.

If the statistician is interested in K variables rather than one, each observation having a value for each of those K variables, the overall sample mean consists of K sample means for individual variables. Let x_{ij} be the ith independently drawn observation (i = 1, ..., N) on the jth random variable (j = 1, ..., K). These observations can be arranged into N column vectors, each with K entries, with the K×1 column vector giving the ith observation of all variables denoted \mathbf{x}_i (i = 1, ..., N). The sample mean vector \mathbf{\bar{x}} is a column vector whose jth element \bar{x}_{j} is the average value of the N observations of the jth variable:

\bar{x}_{j}=\frac{1}{N} \sum_{i=1}^{N} x_{ij},\quad j=1,\ldots,K.

Thus, the sample mean vector contains the average of the observations for each variable, and is written

\mathbf{\bar{x}}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i = \begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_j \\ \vdots \\ \bar{x}_K \end{bmatrix}.
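The sample mean vector above can be computed in one line with NumPy; a minimal sketch, using made-up data with N = 3 observations on K = 2 variables, arranged as columns of a K×N matrix:

```python
import numpy as np

# Hypothetical data: N = 3 observation vectors x_i as columns (K = 2 rows).
F = np.array([[1.0, 4.0, 1.0],    # observations of variable 1
              [2.0, 0.0, 4.0]])   # observations of variable 2

# Sample mean vector: average each row over the N observations.
x_bar = F.mean(axis=1)
print(x_bar)  # [2. 2.]
```

Each entry of `x_bar` is the scalar sample mean \bar{x}_j of one variable, matching the component-wise definition.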
Definition of sample covariance
The sample covariance matrix is a K-by-K matrix \mathbf{Q}=\left[ q_{jk}\right] with entries

q_{jk}=\frac{1}{N-1}\sum_{i=1}^{N}\left( x_{ij}-\bar{x}_j \right) \left( x_{ik}-\bar{x}_k \right),

where q_{jk} is an estimate of the covariance between the jth variable and the kth variable of the population underlying the data. In terms of the observation vectors, the sample covariance is

\mathbf{Q} = \frac{1}{N-1}\sum_{i=1}^N (\mathbf{x}_i-\mathbf{\bar{x}}) (\mathbf{x}_i-\mathbf{\bar{x}})^\mathrm{T}.

Alternatively, the observation vectors can be arranged as the columns of a matrix

\mathbf{F} = \begin{bmatrix}\mathbf{x}_1 & \mathbf{x}_2 & \dots & \mathbf{x}_N \end{bmatrix},

which has K rows and N columns. The sample covariance matrix can then be computed as

\mathbf{Q} = \frac{1}{N-1}( \mathbf{F} - \mathbf{\bar{x}} \,\mathbf{1}_N^\mathrm{T} ) ( \mathbf{F} - \mathbf{\bar{x}} \,\mathbf{1}_N^\mathrm{T} )^\mathrm{T},

where \mathbf{1}_N is an N×1 vector of ones. If the observations are arranged as rows instead of columns, so that \mathbf{\bar{x}} is now a 1×K row vector and \mathbf{M}=\mathbf{F}^\mathrm{T} is an N×K matrix whose column j is the vector of N observations on variable j, then applying transposes in the appropriate places yields

\mathbf{Q} = \frac{1}{N-1}( \mathbf{M} - \mathbf{1}_N \mathbf{\bar{x}} )^\mathrm{T} ( \mathbf{M} - \mathbf{1}_N \mathbf{\bar{x}} ).

Like covariance matrices for random vectors, sample covariance matrices are positive semi-definite. To see this, note that for any matrix \mathbf{A} the matrix \mathbf{A}^\mathrm{T}\mathbf{A} is positive semi-definite. Furthermore, the sample covariance matrix is positive definite if and only if the rank of the vectors \mathbf{x}_i-\mathbf{\bar{x}} is K.
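The matrix formula with observations as columns can be checked numerically against NumPy's built-in estimator; a minimal sketch with randomly generated data (the dimensions K = 3, N = 5 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 5
F = rng.normal(size=(K, N))           # K x N: observation vectors as columns

x_bar = F.mean(axis=1, keepdims=True) # K x 1 sample mean vector
D = F - x_bar                         # F - x_bar 1_N^T via broadcasting
Q = D @ D.T / (N - 1)                 # K x K sample covariance matrix

# np.cov treats rows as variables by default and also divides by N - 1,
# so it should reproduce Q exactly (up to floating-point rounding).
assert np.allclose(Q, np.cov(F))
```

Broadcasting `F - x_bar` subtracts the mean vector from every column, which is exactly the rank-one update \mathbf{F} - \mathbf{\bar{x}}\,\mathbf{1}_N^\mathrm{T} in the formula.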
Unbiasedness
The sample mean and the sample covariance matrix are unbiased estimates of the mean and the covariance matrix of the random vector \mathbf{X}, a row vector whose jth element (j = 1, ..., K) is one of the random variables. The sample covariance matrix has N-1 in the denominator rather than N due to a variant of Bessel's correction: in short, the sample covariance relies on the difference between each observation and the sample mean, but the sample mean is slightly correlated with each observation, since it is defined in terms of all observations. If the population mean \operatorname{E}(\mathbf{X}) is known, the analogous unbiased estimate

q_{jk}=\frac{1}{N}\sum_{i=1}^N \left( x_{ij}-\operatorname{E}(X_j)\right) \left( x_{ik}-\operatorname{E}(X_k)\right),

using the population mean, has N in the denominator. This is an example of why in probability and statistics it is essential to distinguish between random variables (upper-case letters) and realizations of the random variables (lower-case letters).

The maximum likelihood estimate of the covariance,

q_{jk}=\frac{1}{N}\sum_{i=1}^N \left( x_{ij}-\bar{x}_j \right) \left( x_{ik}-\bar{x}_k \right),

for the Gaussian distribution case has N in the denominator as well. The ratio of 1/N to 1/(N − 1) approaches 1 for large N, so the maximum likelihood estimate approximately equals the unbiased estimate when the sample is large.
Distribution of the sample mean
For each random variable, the sample mean is a good estimator of the population mean, where a "good" estimator is defined as being efficient and unbiased. The estimator will of course not in general equal the true value of the population mean, since different samples drawn from the same distribution give different sample means and hence different estimates of the true mean. Thus the sample mean is a random variable, not a constant, and consequently has its own distribution. Denoting the population mean by \mu and the population variance by \sigma^2, for a random sample of n independent observations drawn from the population, the expected value of the sample mean is

\operatorname E (\bar{x}) = \mu

and the variance of the sample mean is

\operatorname{var}(\bar{x}) = \frac{\sigma^2}{n}.

If the samples are not independent, but correlated, then special care has to be taken in order to avoid the problem of pseudoreplication. If the population is normally distributed, then the sample mean is normally distributed as follows:

\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right).

If the population is not normally distributed, the sample mean is nonetheless approximately normally distributed if n is large and \sigma^2/n < +\infty. This is a consequence of the central limit theorem.
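The expectation and variance formulas for \bar{x} can be illustrated by simulation; a sketch with arbitrarily chosen parameters (\mu = 1.5, \sigma = 2, n = 50) drawing many independent samples and recording each sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 1.5, 2.0, 50

# 100,000 independent samples of size n; one sample mean per row.
means = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

print(means.mean())   # close to mu = 1.5
print(means.var())    # close to sigma**2 / n = 0.08
```

The empirical distribution of `means` is also approximately normal, as the central limit theorem predicts, even though only its first two moments are checked here.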
Weighted samples
In a weighted sample, each vector \mathbf{x}_{i} (each set of single observations on each of the K random variables) is assigned a weight w_i \geq 0. Without loss of generality, assume that the weights are normalized:

\sum_{i=1}^{N}w_i = 1

(if they are not, divide the weights by their sum). Then the weighted mean vector \mathbf{\bar{x}} is given by

\mathbf{\bar{x}}=\sum_{i=1}^N w_i \mathbf{x}_i,

and the elements q_{jk} of the weighted covariance matrix \mathbf{Q} are

q_{jk}=\frac{1}{1-\sum_{i=1}^{N}w_i^2} \sum_{i=1}^N w_i \left( x_{ij}-\bar{x}_j \right) \left( x_{ik}-\bar{x}_k \right).

If all weights are the same, w_{i}=1/N, then 1-\sum_i w_i^2 = (N-1)/N, and the weighted mean and covariance reduce to the sample mean and the unbiased sample covariance mentioned above.
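These formulas can be sketched directly; `weighted_mean_cov` below is a hypothetical helper (not a library function), taking observations as rows of an N×K matrix, and its equal-weights reduction to the unbiased estimator is checked against `np.cov`:

```python
import numpy as np

def weighted_mean_cov(X, w):
    """Weighted mean vector and covariance matrix.

    X: N x K matrix of observations (rows); w: N nonnegative weights.
    """
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                    # normalize so the weights sum to 1
    xbar = w @ X                       # weighted mean vector (length K)
    D = X - xbar                       # weighted deviations from the mean
    Q = (D * w[:, None]).T @ D / (1.0 - (w ** 2).sum())
    return xbar, Q

# With equal weights, Q reduces to the unbiased sample covariance.
rng = np.random.default_rng(2)
X = rng.normal(size=(6, 2))
xbar, Q = weighted_mean_cov(X, np.ones(6))
assert np.allclose(Q, np.cov(X, rowvar=False))
```

Note the normalization step mirrors the "divide the weights by their sum" instruction above, so unnormalized weights are also accepted.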
Criticism
The sample mean and sample covariance are not robust statistics, meaning that they are sensitive to outliers. Because robustness is often a desired trait, particularly in real-world applications, robust alternatives may prove preferable, notably quantile-based statistics such as the sample median for location and the interquartile range (IQR) for dispersion. Other alternatives include trimming and Winsorising, as in the trimmed mean and the Winsorized mean.
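The sensitivity to outliers is easy to demonstrate; a small sketch with a made-up sample containing one gross outlier:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 100.0])   # one gross outlier

print(np.mean(x))     # 21.6 -- dragged far toward the outlier
print(np.median(x))   # 2.0  -- unaffected by the single outlier
```

A single corrupted value moves the mean arbitrarily far, while the median would need about half the sample to be corrupted before it breaks down.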