The MSE assesses the quality of either a predictor (i.e., a function mapping arbitrary inputs to a sample of values of some random variable) or an estimator (i.e., a mathematical function mapping a sample of data to an estimate of a parameter of the population from which the data is sampled). In the context of prediction, the prediction interval is also useful, as it provides a range within which a future observation will fall with a given probability. The definition of the MSE differs according to whether one is describing a predictor or an estimator.
===Predictor===

If a vector of n predictions is generated from a sample of n data points on all variables, and Y is the vector of observed values of the variable being predicted, with \hat{Y} being the predicted values (e.g. as from a least-squares fit), then the within-sample MSE of the predictor is computed as

\operatorname{MSE}=\frac{1}{n} \sum_{i=1}^n \left(Y_i-\hat{Y_i}\right)^2

In other words, the MSE is the mean \left(\frac{1}{n} \sum_{i=1}^n \right) of the squares of the errors \left(Y_i-\hat{Y_i}\right)^2. This is an easily computable quantity for a particular sample (and hence is sample-dependent). In matrix notation,

\operatorname{MSE} = \frac{1}{n}\sum_{i=1}^n (e_i)^2 = \frac{1}{n}\mathbf e^\mathsf T \mathbf e

where e_i is Y_i - \hat{Y_i} and \mathbf e is an n \times 1 column vector.
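As a quick illustration, the following is a minimal sketch in Python with NumPy (the array names y and y_hat and their values are illustrative, not from the text); it shows that the summation form and the matrix form \frac{1}{n}\mathbf e^\mathsf T \mathbf e give the same value:

```python
import numpy as np

# Observed values and predictions (illustrative numbers, not from the text)
y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

n = len(y)
e = y - y_hat                  # vector of errors e_i = Y_i - Yhat_i

mse_sum = np.sum(e ** 2) / n   # mean of the squared errors
mse_mat = (e @ e) / n          # matrix form: (1/n) * e^T e

print(mse_sum, mse_mat)        # both print 0.375
```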
The MSE can also be computed on q data points that were not used in estimating the model, either because they were held back for this purpose or because these data have been newly obtained. Within this process, known as cross-validation, the MSE is often called the test MSE, and is computed as

\operatorname{MSE} = \frac{1}{q} \sum_{i=n+1}^{n+q} \left(Y_i-\hat{Y_i}\right)^2
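A similar sketch for the test MSE on held-out data; the synthetic linear data, the least-squares fit via np.polyfit, and the 80/20 split are illustrative assumptions rather than anything prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a noisy linear relationship (assumed, not from the text)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)

# Hold back q = 20 points; fit on the remaining n = 80
n, q = 80, 20
slope, intercept = np.polyfit(x[:n], y[:n], deg=1)

# Test MSE: average squared error over the q held-out points only
y_hat = slope * x[n:] + intercept
test_mse = np.mean((y[n:] - y_hat) ** 2)
print(test_mse)
```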
===Estimator===

The MSE of an estimator \hat{\theta} with respect to an unknown parameter \theta is defined as

\operatorname{MSE}(\hat{\theta}) = \operatorname{E}_{\theta}\left[(\hat{\theta}-\theta)^2\right].

It can be written as the sum of the variance of the estimator and the squared bias of the estimator:

\operatorname{MSE}(\hat{\theta})=\operatorname{Var}_{\theta}(\hat{\theta})+ \operatorname{Bias}(\hat{\theta},\theta)^2.
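As a standard illustration of the decomposition (a textbook example, not drawn from the surrounding text): for the sample mean \bar{X} of n i.i.d. observations with mean \mu and variance \sigma^2, the bias is zero and \operatorname{Var}(\bar{X}) = \sigma^2/n, so

\operatorname{MSE}(\bar{X}) = \operatorname{Var}(\bar{X}) + \operatorname{Bias}(\bar{X},\mu)^2 = \frac{\sigma^2}{n} + 0^2 = \frac{\sigma^2}{n}.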
Proof of the variance and bias relationship:

\begin{align}
\operatorname{MSE}(\hat{\theta}) &= \operatorname{E}_\theta \left [(\hat{\theta}-\theta)^2 \right ] \\
&= \operatorname{E}_\theta\left[\left(\hat{\theta}-\operatorname{E}_\theta [\hat\theta]+\operatorname{E}_\theta[\hat\theta]-\theta\right)^2\right]\\
&= \operatorname{E}_\theta\left[\left(\hat{\theta}-\operatorname{E}_\theta[\hat\theta]\right)^2 +2\left (\hat{\theta}-\operatorname{E}_\theta[\hat\theta] \right ) \left (\operatorname{E}_\theta[\hat\theta]-\theta \right )+\left( \operatorname{E}_\theta[\hat\theta]-\theta \right)^2\right] \\
&= \operatorname{E}_\theta\left[\left(\hat{\theta}-\operatorname{E}_\theta[\hat\theta]\right)^2\right]+\operatorname{E}_\theta\left[2 \left (\hat{\theta}-\operatorname{E}_\theta[\hat\theta] \right ) \left (\operatorname{E}_\theta[\hat\theta]-\theta \right ) \right] + \operatorname{E}_\theta\left [ \left(\operatorname{E}_\theta[\hat\theta]-\theta\right)^2 \right] \\
&= \operatorname{E}_\theta\left[\left(\hat{\theta}-\operatorname{E}_\theta[\hat\theta]\right)^2\right]+ 2 \left(\operatorname{E}_\theta[\hat\theta]-\theta\right) \operatorname{E}_\theta\left[\hat{\theta}-\operatorname{E}_\theta[\hat\theta] \right] + \left(\operatorname{E}_\theta[\hat\theta]-\theta\right)^2 && \operatorname{E}_\theta[\hat\theta]-\theta = \text{constant} \\
&= \operatorname{E}_\theta\left[\left(\hat{\theta}-\operatorname{E}_\theta[\hat\theta]\right)^2\right]+ 2 \left(\operatorname{E}_\theta [\hat\theta]-\theta\right) \left ( \operatorname{E}_\theta[\hat{\theta}]-\operatorname{E}_\theta[\hat\theta] \right )+ \left(\operatorname{E}_\theta[\hat\theta]-\theta\right)^2 && \operatorname{E}_\theta[\hat\theta] = \text{constant} \\
&= \operatorname{E}_\theta\left[\left(\hat\theta-\operatorname{E}_\theta[\hat\theta]\right)^2\right]+\left(\operatorname{E}_\theta [\hat\theta]-\theta\right)^2\\
&= \operatorname{Var}_\theta(\hat\theta)+ \operatorname{Bias}_\theta(\hat\theta,\theta)^2
\end{align}

An even shorter proof uses the well-known formula that, for a random variable X, \mathbb{E}(X^2) = \operatorname{Var}(X) + (\mathbb{E}(X))^2. Substituting \hat\theta-\theta for X, we have

\begin{aligned}
\operatorname{MSE}(\hat{\theta}) &= \mathbb{E}[(\hat\theta-\theta)^2] \\
&= \operatorname{Var}(\hat{\theta} - \theta) + (\mathbb{E}[\hat\theta - \theta])^2 \\
&= \operatorname{Var}(\hat\theta) + \operatorname{Bias}^2(\hat\theta,\theta)
\end{aligned}

In a practical modeling setting, however, the MSE can be described as the sum of the model variance, the squared model bias, and the irreducible uncertainty (see Bias–variance tradeoff).
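To make the decomposition concrete, here is a minimal Monte Carlo sketch in Python with NumPy; the biased 1/n variance estimator and all numerical constants are illustrative assumptions, not taken from the text. It estimates the MSE, the variance, and the squared bias of an estimator by simulation and checks that \operatorname{MSE} = \operatorname{Var} + \operatorname{Bias}^2 holds up to simulation noise:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative setup: estimate the variance sigma^2 of a normal population
sigma2 = 4.0          # true parameter theta
n, trials = 20, 200_000

samples = rng.normal(scale=np.sqrt(sigma2), size=(trials, n))

# Biased estimator: divides by n (ddof=0) rather than n - 1
theta_hat = samples.var(axis=1)

mse   = np.mean((theta_hat - sigma2) ** 2)    # E[(theta_hat - theta)^2]
var   = np.var(theta_hat)                     # Var(theta_hat)
bias2 = (np.mean(theta_hat) - sigma2) ** 2    # Bias(theta_hat, theta)^2

print(mse, var + bias2)   # the two values agree up to simulation noise

# The unbiased 1/(n-1) estimator has zero bias but, for normal samples,
# a larger MSE -- an instance of the MSE criterion discussed below.
mse_unbiased = np.mean((samples.var(axis=1, ddof=1) - sigma2) ** 2)
print(mse, mse_unbiased)  # mse < mse_unbiased
```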
According to this relationship, the MSE of estimators can be used to compare their efficiency, since it combines information about both the variance and the bias of each estimator. This is called the MSE criterion.

==In regression==