Several alternative model selection criteria have been proposed and studied in the statistical literature. These include the Bayesian information criterion (BIC), cross-validation methods, least squares fitting, Mallows's Cp, and other information-theoretic approaches such as the Widely Applicable Information Criterion (WAIC), the Deviance information criterion (DIC), and the Hannan–Quinn information criterion (HQC). These methods differ in their assumptions, asymptotic behavior, and suitability depending on the goals of the analysis, such as prediction, inference, or model interpretation. A comprehensive overview of AIC and other model selection methods is given by Ding et al. (2018).
===Comparison with BIC===
A critical difference between AIC and BIC (and their variants) lies in their asymptotic behavior under well-specified and misspecified model classes. Their fundamental differences have been well studied in regression variable selection and autoregression order selection problems. In general, if the goal is prediction, AIC and leave-one-out cross-validation are preferred.

The formula for the Bayesian information criterion (BIC) is similar to the formula for AIC, but with a different penalty for the number of parameters. With AIC the penalty is 2k, whereas with BIC the penalty is k \ln(n). A comparison of AIC/AICc and BIC is given by Burnham & Anderson (2002), with follow-up remarks by Burnham & Anderson (2004). The authors show that AIC/AICc can be derived in the same Bayesian framework as BIC, just by using different prior probabilities. In the Bayesian derivation of BIC, though, each candidate model has a prior probability of 1/R (where R is the number of candidate models). Additionally, the authors present a few simulation studies that suggest AICc tends to have practical/performance advantages over BIC.

A point made by several researchers is that AIC and BIC are appropriate for different tasks. In particular, BIC is argued to be appropriate for selecting the "true model" (i.e. the process that generated the data) from the set of candidate models, whereas AIC is not appropriate. To be specific, if the "true model" is in the set of candidates, then BIC will select the "true model" with probability 1, as n \to \infty; in contrast, when selection is done via AIC, the probability can be less than 1. Proponents of AIC argue that this issue is negligible, because the "true model" is virtually never in the candidate set. Indeed, it is a common aphorism in statistics that "all models are wrong"; hence the "true model" (i.e. reality) cannot be in the candidate set.
Another comparison of AIC and BIC is given by Vrieze (2012). Vrieze presents a simulation study, which allows the "true model" to be in the candidate set (unlike with virtually all real data). The simulation study demonstrates, in particular, that AIC sometimes selects a much better model than BIC even when the "true model" is in the candidate set. The reason is that, for finite n, BIC can have a substantial risk of selecting a very bad model from the candidate set. This reason can arise even when n is much larger than k^2. With AIC, the risk of selecting a very bad model is minimized.

If the "true model" is not in the candidate set, then the most that we can hope to do is select the model that best approximates the "true model". AIC is appropriate for finding the best approximating model, under certain assumptions.

===Comparison with least squares model fitting===
If all the candidate models assume normally distributed errors, then least squares fitting gives the maximum likelihood estimate \hat\sigma^2 = \mathrm{RSS}/n for the error variance, and the AIC formula becomes

:\mathrm{AIC} = 2k - 2\ln(\hat L) = 2k + n\ln(\hat\sigma^2) - 2C

where C is a constant that depends only on the data (through n), not on the candidate model. Because only differences in AIC are meaningful, the constant C can be ignored, which allows us to conveniently take the following for model comparisons:

:\Delta \mathrm{AIC} = 2k + n\ln(\hat\sigma^2)

Note that if all the models have the same k, then selecting the model with minimum AIC is equivalent to selecting the model with minimum \hat\sigma^2, which is the usual objective of model selection based on least squares.
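To make the least-squares form concrete, the following minimal sketch fits polynomials of increasing degree and compares \Delta \mathrm{AIC} = 2k + n\ln(\hat\sigma^2) with the analogous BIC quantity k\ln(n) + n\ln(\hat\sigma^2). The data, the candidate set, and the convention of counting the error variance as a parameter are illustrative assumptions, not part of any standard implementation.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a quadratic signal plus Gaussian noise.
n = 100
x = np.linspace(-2.0, 2.0, n)
y = 1.0 - 0.5 * x + 0.8 * x**2 + rng.normal(scale=0.5, size=n)

def delta_aic_bic(y, X):
    """Delta-AIC and the analogous BIC quantity for one least-squares fit.

    Uses sigma_hat^2 = RSS/n; k counts the regression coefficients plus
    the error variance (a convention -- it shifts every model equally).
    """
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1] + 1
    return 2 * k + n * np.log(rss / n), k * np.log(n) + n * np.log(rss / n)

# Candidate models: polynomials of degree 0 through 5.
for degree in range(6):
    X = np.vander(x, degree + 1)  # columns x^degree, ..., x^0
    aic, bic = delta_aic_bic(y, X)
    print(f"degree {degree}: Delta_AIC = {aic:8.1f}, Delta_BIC = {bic:8.1f}")
</syntaxhighlight>

With these simulated data both criteria typically reach their minimum at the true degree 2, but BIC's heavier k\ln(n) penalty makes it quicker to reject the larger models.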
===Comparison with cross-validation===
Leave-one-out cross-validation is asymptotically equivalent to AIC, for ordinary linear regression models. Asymptotic equivalence to AIC also holds for mixed-effects models.
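The following sketch illustrates (it does not prove) this equivalence for ordinary least squares, where the leave-one-out residuals have the closed form e_i/(1 - h_{ii}) and no refitting is needed; the simulated data and the polynomial candidate set are assumptions made for the illustration.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-2.0, 2.0, n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)

def aic_and_loo(y, X):
    """AIC (up to an additive constant) and the leave-one-out CV mean
    squared error for one ordinary-least-squares model."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverages h_ii = x_i' (X'X)^{-1} x_i give the LOO residuals directly.
    h = np.einsum("ij,jk,ik->i", X, np.linalg.pinv(X.T @ X), X)
    loo_mse = np.mean((resid / (1.0 - h)) ** 2)
    k = X.shape[1] + 1
    return n * np.log(np.sum(resid**2) / n) + 2 * k, loo_mse

for degree in range(1, 7):
    X = np.vander(x, degree + 1)
    aic, loo = aic_and_loo(y, X)
    print(f"degree {degree}: AIC = {aic:9.2f}, LOO-MSE = {loo:.4f}")
</syntaxhighlight>

The two columns usually order the candidate models the same way, which is the practical content of the asymptotic equivalence.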
===Comparison with Mallows's Cp===
Akaike stated, 'It is interesting to note that the use of a statistic proposed by Mallows is essentially equivalent to our present approach'. However, the precise relation between AIC and Cp requires some nuance. Under a normal regression model with unknown error variance \sigma^2, the AIC statistic, as noted above, is

: \mbox{AIC} = n \ln(\mbox{RSS}/n) + 2k

(the notation \hat\sigma^2 is deliberately avoided here, to prevent confusion below). For large samples, if this model is correct, then \mbox{RSS}/n should be close to the true error variance \sigma^2, and using a one-term Taylor series for the logarithm,

{{NumBlk|:| \mbox{AIC} \approx n \ln \sigma^2 + n \left ( \frac{\mbox{RSS}}{n\sigma^2} - 1 \right ) + 2k = n \ln \sigma^2 - n + \frac{\mbox{RSS}}{\sigma^2} + 2k |(1)}}

This final expression, neglecting the term n \ln \sigma^2 (which is the same for every candidate model), is Mallows's Cp when \sigma^2 happens to be known. In the more usual situation where \sigma^2 is unknown, an estimate \hat\sigma^2, typically derived from a model using all possible predictors, must be substituted. This leads to an asymptotic equivalence between AIC and Cp. However, Akaike noted that 'unfortunately some subjective judgement is required for the choice of \hat\sigma^2 in the definition of Cp'. In the unusual case that \sigma^2 is known, AIC is exactly equal to (1). As a result, (1) is sometimes considered to be AIC, and AIC and Cp are claimed to be equivalent. Such statements should be considered incorrect; when AIC is correctly implemented, the equivalence is only asymptotic.
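A numerical sketch of this relation, with a simulated variable-selection problem (the data-generating process and the nested candidate models are assumptions made for the illustration): per the Taylor expansion above, \mathrm{AIC} - n\ln(\hat\sigma^2) should be close to Cp for models that fit adequately.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
n = 500
X_full = rng.normal(size=(n, 6))           # 6 candidate predictors
y = X_full[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

def rss(y, X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

# Estimate sigma^2 from the model with all candidate predictors,
# as is conventional for Mallows's Cp.
sigma2_hat = rss(y, X_full) / (n - X_full.shape[1])

for p in range(1, 7):                      # nested models with p predictors
    r = rss(y, X_full[:, :p])
    aic = n * np.log(r / n) + 2 * p
    cp = r / sigma2_hat - n + 2 * p
    # The shifted AIC should track Cp closely once the model is adequate
    # (here, p >= 3); for badly underfitting models the Taylor step fails.
    print(f"p = {p}: Cp = {cp:9.2f}, AIC - n*ln(s2) = {aic - n * np.log(sigma2_hat):9.2f}")
</syntaxhighlight>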
===Other information criteria===
Other model selection criteria include the Widely Applicable Information Criterion (WAIC) and the Deviance Information Criterion (DIC), both of which are widely used in Bayesian model selection. WAIC, in particular, is asymptotically equivalent to leave-one-out cross-validation and applies even in complex or singular models. The Hannan–Quinn criterion (HQC) offers a middle ground between AIC and BIC: its penalty, 2k \ln(\ln n), grows with the sample size (so it eventually exceeds AIC's constant 2k) but more slowly than BIC's k \ln(n); a numerical comparison of the three penalties is sketched below.
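A minimal sketch of the three per-parameter penalties as the sample size grows (the sample sizes are arbitrary):

<syntaxhighlight lang="python">
import numpy as np

# Per-parameter penalties: AIC adds 2, HQC adds 2*ln(ln n), BIC adds ln(n).
for n in (20, 100, 1_000, 100_000):
    print(f"n = {n:7d}: AIC = 2.00, "
          f"HQC = {2 * np.log(np.log(n)):.2f}, BIC = {np.log(n):.2f}")
</syntaxhighlight>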
The Minimum Description Length (MDL) principle, closely related to BIC, approaches model selection from an information-theoretic perspective, treating it as a compression problem. Each of these methods has advantages depending on model complexity, sample size, and the goal of analysis.

==See also==