Mallows's C_p addresses the issue of overfitting, in which model selection statistics such as the residual sum of squares always get smaller as more variables are added to a model. Thus, if we aim to select the model giving the smallest residual sum of squares, the model including all variables would always be selected. Instead, the C_p statistic calculated on a sample of data estimates the sum of squared prediction errors (SSPE) as its population target

: E\sum_i (\hat{Y}_i - E(Y_i\mid X_i))^2/\sigma^2,

where \hat{Y}_i is the fitted value from the regression model for the ith case, E(Y_i\mid X_i) is the expected value for the ith case, and \sigma^2 is the error variance (assumed constant across the cases). Unlike the residual sum of squares, the SSPE, and hence the mean squared prediction error (MSPE) obtained by averaging it over the N cases, will not automatically get smaller as more variables are added. The optimum model under this criterion is a compromise influenced by the sample size, the effect sizes of the different predictors, and the degree of collinearity between them.
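As a minimal illustration of this target, the following Python sketch (not part of the standard exposition; the design matrix, mean function, and Monte Carlo settings are invented for the example) approximates the scaled SSPE by simulation under a fixed design with a known mean function E(Y\mid X). For a model that contains the true mean, the scaled SSPE is close to the number of fitted parameters, so it grows as unnecessary regressors are added, even though the residual sum of squares shrinks.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 50, 1.0
X = rng.normal(size=(N, 3))      # fixed design with 3 candidate regressors
mean_y = 1.5 * X[:, 0]           # true E(Y|X): only the first regressor matters

def fitted(Z, y):
    """OLS fitted values of y regressed on Z (with an intercept)."""
    Z1 = np.column_stack([np.ones(len(y)), Z])
    beta, *_ = np.linalg.lstsq(Z1, y, rcond=None)
    return Z1 @ beta

# Approximate E sum_i (Yhat_i - E(Y_i|X_i))^2 / sigma^2 over repeated samples.
for cols in ([0], [0, 1, 2]):
    scaled_sspe = np.mean([
        np.sum((fitted(X[:, cols], mean_y + sigma * rng.normal(size=N)) - mean_y) ** 2)
        for _ in range(2000)
    ]) / sigma**2
    print(cols, round(scaled_sspe, 2))  # roughly 2 for [0], roughly 4 for [0, 1, 2]
```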
If p regressors are selected from a set of k regressors, with k > p, the C_p statistic for that particular set of regressors is defined as

: C_p={SSE_p \over S^2} - N + 2 (p+1),

where
• SSE_p = \sum_{i=1}^N(Y_i-\hat{Y}_{pi})^2 is the error sum of squares for the model with p regressors,
• \hat{Y}_{pi} is the predicted value of the ith observation of Y from the p regressors,
• S^2 is an estimate of the residual variance, obtained from the regression on the complete set of k regressors as {1 \over N-k} \sum_{i=1}^N (Y_i- \hat{Y}_i)^2,
• and N is the sample size.
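The definition above translates directly into code. The sketch below is a minimal illustration, not a library implementation; the helper names sse and mallows_cp and the simulated data are invented for the example. S^2 is computed from the full k-regressor fit exactly as defined, and C_p is then evaluated for nested candidate subsets.

```python
import numpy as np

def sse(X, y):
    """Error sum of squares from an OLS fit of y on X (with an intercept)."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)

def mallows_cp(X_full, y, subset):
    """C_p = SSE_p / S^2 - N + 2(p + 1) for the regressors in `subset`."""
    N, k = X_full.shape
    p = len(subset)
    s2 = sse(X_full, y) / (N - k)   # S^2 from the complete set of k regressors
    return sse(X_full[:, subset], y) / s2 - N + 2 * (p + 1)

# Example: y depends on only the first two of four candidate regressors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=100)
for subset in ([0], [0, 1], [0, 1, 2], [0, 1, 2, 3]):
    print(subset, round(mallows_cp(X, y, subset), 2))
```

With this form of the statistic, a subset whose model has negligible bias gives C_p close to p + 1, so in the example the subset [0, 1] should score near 3 while the underfit model [0] inflates C_p well above 2.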
== Alternative definition ==