Degrees of freedom and information criteria are two fundamental concepts in statistical modeling, and both are taught in introductory statistics courses. But what are their exact abstract definitions, from which the specific formulas used in different situations can be derived?

I often use fit criteria like AIC and BIC to choose between models. I know that they try to balance good fit with parsimony, but beyond that I’m not sure what exactly they mean. What are they really doing? Which is better? What does it mean if they disagree? Signed, Adrift on the IC’s

Intuitively, the degrees of freedom of a fitting procedure reflects the effective number of parameters used by the fitting procedure. Thus to most applied statisticians, a fitting procedure’s degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data. Is this really true? Regularization aims to improve prediction performance by trading an increase in training error for better agreement between training and prediction errors, which is often captured through decreased degrees of freedom. Is this always the case? When does more regularization imply fewer degrees of freedom?

Both sets of questions above come down to a more basic what-type question:

What are AIC and BIC? What is degrees of freedom?

Akaike’s Information Criterion (AIC) estimates the relative Kullback-Leibler (KL) distance of the likelihood function specified by a fitted candidate model, from the unknown true likelihood function that generated the data:

$D_{KL}(L_0(y)||L(y))=\int L_0(y)\log \frac{L_0(y)}{L(y)}dy=E_0(l_0(y))-E_0(l(y))$

where $L(y)$ is the likelihood function specified by a fitted candidate model, $L_0(y)$ is the unknown true likelihood function, $l=\log L$ and $l_0=\log L_0$ are the corresponding log-likelihoods, and the expectation $E_0$ is taken under the true model. Note that the fitted model closest to the truth in the KL sense is not necessarily the model that best fits the observed sample, since the observed sample can often be fit arbitrarily well by making the model more and more complex. Since $E_0(l_0(y))$ is the same for all models being considered, the KL distance is minimized by choosing the model with the highest $E_0(l(y))$, which can be estimated by an approximately unbiased estimator (up to a constant)

$l-tr(\hat{J}^{-1}\hat{K})$

where $l$ is the maximized log-likelihood of the fitted model, $\hat{J}$ is an estimator for the covariance matrix of the parameters based on the matrix of second derivatives of $l$ with respect to the parameters, and $\hat{K}$ is an estimator based on the cross products of the first derivatives. Akaike showed that $\hat{J}$ and $\hat{K}$ are asymptotically equal for the true model, so that $tr(\hat{J}^{-1}\hat{K})\approx tr(I)=p$, the number of parameters. Multiplying by $-2$ gives the usual definition of AIC

$AIC=-2l+2p.$
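
As a worked instance (an illustration, not part of the derivation above), take a Gaussian linear model in which $p$ counts the regression coefficients plus the error variance, and the variance is estimated by maximum likelihood, $\hat\sigma^2=RSS/n$. The maximized log-likelihood is then

$l=-\frac{n}{2}\Big\{\ln(2\pi\hat\sigma^2)+1\Big\}$

so that

$AIC=n\ln(RSS/n)+2p+n\{1+\ln(2\pi)\}.$

The last term is the same for every candidate fitted to the same data, so comparing AIC values in this setting amounts to comparing $n\ln(RSS/n)+2p$.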

Schwarz’s Bayesian Information Criterion (BIC) compares the posterior probabilities of candidate models. When the models are given the same prior probability, this reduces to comparing the likelihoods of the data under the different models:

$B=\frac{Pr(M_1|y)}{Pr(M_2|y)}=\frac{Pr(y|M_1)}{Pr(y|M_2)}$

which is just the Bayes factor. Schwarz showed that in many kinds of models $B$ can be roughly approximated by $\exp(l_1-\frac{1}{2}\ln(n)p_1-l_2+\frac{1}{2}\ln(n)p_2)$,

which leads to the definition of BIC

$BIC=-2l+\ln(n)p.$
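
To make the link explicit, take $-2\ln$ of Schwarz's approximation to the Bayes factor:

$-2\ln B\approx\{-2l_1+\ln(n)p_1\}-\{-2l_2+\ln(n)p_2\}=BIC_1-BIC_2.$

So, to the accuracy of the approximation, $B>1$ (the data favour $M_1$) exactly when $BIC_1<BIC_2$, and a BIC difference can be read as roughly twice the log Bayes factor.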

In summary, AIC and BIC are both penalized-likelihood criteria. AIC is an estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so that a lower AIC means a model is considered to be closer to the truth. BIC is an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so that a lower BIC means that a model is considered to be more likely to be the true model. Both criteria are based on various assumptions and asymptotic approximations. Despite various subtle theoretical differences, their only difference in practice is the size of the penalty: AIC charges $2$ per parameter and BIC charges $\ln(n)$, so BIC penalizes model complexity more heavily whenever $\ln(n)>2$, that is, for $n\geq 8$. Consequently, the only way they should disagree is when AIC chooses a larger model than BIC. Thus, AIC always has a chance of choosing too big a model, regardless of $n$. BIC has very little chance of choosing too big a model if $n$ is sufficient, but it has a larger chance than AIC, for any given $n$, of choosing too small a model.
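
The comparison can be tried on simulated data. The following is a minimal sketch, assuming Gaussian errors, nested polynomial candidates fitted by least squares, and a hypothetical quadratic truth; the function names and all numerical settings are illustrative choices, not taken from the references.

```python
import numpy as np

def gaussian_loglik(y, y_hat):
    """Maximized Gaussian log-likelihood, with sigma^2 set to its MLE, RSS/n."""
    n = y.size
    rss = float(np.sum((y - y_hat) ** 2))
    return -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)

def aic(loglik, p):
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    return -2.0 * loglik + np.log(n) * p

def score_polynomials(x, y, max_degree):
    """Fit polynomials of degree 0..max_degree; return (p, AIC, BIC) per fit."""
    rows = []
    for degree in range(max_degree + 1):
        X = np.vander(x, degree + 1, increasing=True)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        ll = gaussian_loglik(y, X @ beta)
        p = degree + 2  # degree + 1 coefficients, plus the error variance
        rows.append((p, aic(ll, p), bic(ll, p, y.size)))
    return rows

# Hypothetical data: quadratic mean function with Gaussian noise.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=n)

rows = score_polynomials(x, y, max_degree=8)
p_aic = min(rows, key=lambda r: r[1])[0]
p_bic = min(rows, key=lambda r: r[2])[0]
print("parameters chosen by AIC:", p_aic)
print("parameters chosen by BIC:", p_bic)
```

Because both criteria use the same log-likelihoods and differ only in the per-parameter charge, the model picked by BIC can never have more parameters than the one picked by AIC once $\ln(n)>2$; rerunning with different seeds shows how often the two actually differ.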

The effective degrees of freedom for an arbitrary modelling approach is defined based on the concept of expected optimism:

$df(\mu, \sigma^2, FIT_{\lambda})=\frac{1}{2\sigma^2}\Big\{E(\|y^*-\hat{y}^{(FIT_{\lambda})}\|_2^2)-E(\|y-\hat{y}^{(FIT_{\lambda})}\|^2_2)\Big\}$

where $\sigma^2$ is the variance of the error term, $y^*$ is an independent copy of the data vector $y$ with mean $\mu$, $FIT_{\lambda}$ is a fitting procedure with tuning parameter $\lambda$, and $\hat{y}^{(FIT_{\lambda})}$ is the vector of fitted values, also written $\hat\mu$ below. The expected optimism is defined as $w:=\frac{1}{n}\Big\{E(\|y^*-\hat{y}^{(FIT_{\lambda})}\|_2^2)-E(\|y-\hat{y}^{(FIT_{\lambda})}\|^2_2)\Big\}$, so that $df=\frac{n w}{2\sigma^2}$. By the optimism theorem, we have that

$df(\mu, \sigma^2, FIT_{\lambda})=\frac{1}{\sigma^2}\sum_{i=1}^n cov(\hat\mu_i, y_i).$
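
The covariance form can be estimated by simulation for any fitting procedure. Below is a minimal sketch, assuming Gaussian errors and using ridge regression as the example fit; the helper names `ridge_hat_matrix` and `covariance_df`, the design matrix and the penalty value are all hypothetical choices made for illustration. For a linear fit $\hat\mu=Sy$ the covariance sum equals $\sigma^2 tr(S)$, which gives an exact value to check the Monte Carlo estimate against.

```python
import numpy as np

def ridge_hat_matrix(X, lam):
    """Matrix S such that the ridge fitted values are S @ y."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

def covariance_df(fit, mu, sigma, n_rep, rng):
    """Monte Carlo estimate of (1/sigma^2) * sum_i cov(mu_hat_i, y_i)."""
    Y = mu + sigma * rng.normal(size=(n_rep, mu.size))  # one draw of y per row
    fits = np.array([fit(y) for y in Y])                # fitted values per draw
    Yc = Y - Y.mean(axis=0)
    Fc = fits - fits.mean(axis=0)
    cov_sum = np.sum(Yc * Fc) / (n_rep - 1)  # sum over i of sample covariances
    return cov_sum / sigma**2

# Hypothetical setup: 50 observations, 5 predictors, ridge penalty 5.
rng = np.random.default_rng(1)
n, p, lam, sigma = 50, 5, 5.0, 1.0
X = rng.normal(size=(n, p))
mu = X @ rng.normal(size=p)

S = ridge_hat_matrix(X, lam)
df_mc = covariance_df(lambda y: S @ y, mu, sigma, n_rep=4000, rng=rng)
df_trace = float(np.trace(S))
print("Monte Carlo df:", df_mc)
print("trace of S:    ", df_trace)
```

The same `covariance_df` function accepts any callable mapping $y$ to fitted values, so it can also be pointed at nonlinear procedures where no closed form is available.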

Why does this definition make sense? When the errors are Gaussian and the fitted values are an almost differentiable function of $y$, Stein's lemma gives

$df=E(\sum_{i=1}^n\frac{\partial\hat\mu_i}{\partial y_i})$

which can be regarded as a sensitivity measure of the fitted values to the observations.
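
As a worked case, assume a linear smoother $\hat\mu=Sy$ whose matrix $S$ does not depend on $y$. Then $\frac{\partial\hat\mu_i}{\partial y_i}=S_{ii}$ and

$df=\sum_{i=1}^n S_{ii}=tr(S).$

For ridge regression with penalty $\lambda$ and design matrix $X$ with singular values $d_1,\dots,d_p$ (notation introduced here for illustration), $S=X(X^TX+\lambda I)^{-1}X^T$ and

$df(\lambda)=\sum_{j=1}^p\frac{d_j^2}{d_j^2+\lambda}.$

In this example more regularization does mean fewer degrees of freedom: $df(\lambda)$ decreases monotonically from the rank of $X$ at $\lambda=0$ toward $0$ as $\lambda\to\infty$.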

In the linear model with $p$ parameters fitted by least squares, the expected prediction error (EPE) and the residual sum of squares (RSS) are related by (Mallows)

$EPE=E(RSS)+2\sigma^2 p,$

which leads to $df=p$.
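
This relationship can be checked numerically straight from the optimism definition of $df$. The sketch below assumes Gaussian errors and an ordinary least squares fit on a fixed design; the sample size, number of predictors, noise level and number of replications are arbitrary illustrative values.

```python
import numpy as np

# Hypothetical setup: fixed design with n observations and p predictors.
rng = np.random.default_rng(2)
n, p, sigma, n_rep = 40, 6, 1.5, 20000
X = rng.normal(size=(n, p))
mu = X @ rng.normal(size=p)

# Projection ("hat") matrix of least squares; it is symmetric.
H = X @ np.linalg.solve(X.T @ X, X.T)

Y = mu + sigma * rng.normal(size=(n_rep, n))       # training responses, one per row
Y_star = mu + sigma * rng.normal(size=(n_rep, n))  # independent copies y*
fits = Y @ H                                       # row-wise H @ y, using symmetry

rss = np.sum((Y - fits) ** 2, axis=1).mean()       # estimates E(RSS)
epe = np.sum((Y_star - fits) ** 2, axis=1).mean()  # estimates EPE
df_hat = (epe - rss) / (2.0 * sigma**2)

print("estimated df:", df_hat)
print("number of parameters p:", p)
```

The estimate should land close to $p$ up to Monte Carlo error, whatever value of $\sigma$ or true mean is used.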

Here are some references on this topic:

1. Dziak, John J., et al. “Sensitivity and specificity of information criteria.” The Methodology Center and Department of Statistics, Penn State, The Pennsylvania State University (2012).
2. Janson, Lucas, Will Fithian, and Trevor Hastie. “Effective degrees of freedom: A flawed metaphor.” arXiv preprint arXiv:1312.7851 (2013).