In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).
When fitting models, it is possible to increase the maximum likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC for sample sizes greater than 7.
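The threshold of 7 can be checked directly. The sketch below assumes the usual forms of the two penalties, $k\ln(n)$ for BIC and $2k$ for AIC, where $k$ is the number of parameters and $n$ the sample size. The function names are illustrative only and do not come from any library.

```python
import math

def aic_penalty(k):
    """AIC penalty: 2 per estimated parameter, independent of sample size."""
    return 2 * k

def bic_penalty(k, n):
    """BIC penalty: ln(n) per estimated parameter."""
    return k * math.log(n)

# Smallest integer sample size at which the BIC penalty exceeds the AIC penalty.
# The comparison does not depend on k, so k = 1 is used here.
crossover = next(n for n in range(1, 100) if bic_penalty(1, n) > aic_penalty(1))
print(crossover)  # 8, because ln(n) > 2 exactly when n > e^2 ≈ 7.39
```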
The BIC was developed by Gideon E. Schwarz and published in a 1978 paper, where he gave a Bayesian argument for adopting it.
The BIC is formally defined as
$$\mathrm{BIC} = k\ln(n) - 2\ln(\hat{L}),$$
where
$\hat{L}$ = the maximized value of the likelihood function of the model $M$, i.e. $\hat{L} = p(x \mid \hat{\theta}, M)$, where $\hat{\theta}$ are the parameter values that maximize the likelihood function;
$x$ = the observed data;
$n$ = the number of data points in $x$, the number of observations, or equivalently, the sample size;
$k$ = the number of parameters estimated by the model. For example, in multiple linear regression, the estimated parameters are the intercept, the $q$ slope parameters, and the constant variance of the errors; thus, $k = q + 2$.
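As an illustration of the definition, the following sketch computes the BIC of a multiple linear regression with $q$ slope parameters, so that $k = q + 2$. It assumes independent Gaussian errors with constant variance, for which the maximized log-likelihood is $-\tfrac{n}{2}\left(\ln(2\pi\hat{\sigma}^2) + 1\right)$ with $\hat{\sigma}^2$ equal to the residual sum of squares divided by $n$. The function name `gaussian_linear_bic` and the synthetic data are hypothetical and serve only as an example.

```python
import numpy as np

def gaussian_linear_bic(X, y):
    """BIC of a least-squares fit under an assumed Gaussian error model.

    X: (n, q) array of predictors WITHOUT an intercept column.
    y: (n,) response vector.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, q = X.shape
    design = np.column_stack([np.ones(n), X])      # prepend the intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))  # residual sum of squares
    sigma2_hat = rss / n                           # MLE of the error variance (divides by n)
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1.0)
    k = q + 2                                      # intercept + q slopes + error variance
    return k * np.log(n) - 2.0 * log_lik

# Synthetic data (an assumption for this example): y depends on x1 only.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                            # predictor unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

bic_small = gaussian_linear_bic(x1[:, None], y)
bic_large = gaussian_linear_bic(np.column_stack([x1, x2]), y)
print(bic_small, bic_large)
```

Whichever model yields the lower value would be preferred. Because the two models are nested, the larger model can never have a higher maximized likelihood deficit, so its BIC can exceed that of the smaller model by at most the extra penalty $\ln(n)$.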
Konishi and Kitagawa derive the BIC to approximate the distribution of the data, integrating out the parameters using Laplace's method, starting with the following model evidence:
$$p(x \mid M) = \int p(x \mid \theta, M)\,\pi(\theta \mid M)\,d\theta,$$
where $\pi(\theta \mid M)$ is the prior for $\theta$ under model $M$.
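The log-likelihood, $\ln(p(x \mid \theta, M))$, is then expanded to a second-order Taylor series about the MLE, $\hat{\theta}$, assuming it is twice differentiable, as follows:
$$\ln(p(x \mid \theta, M)) = \ln(\hat{L}) - \frac{n}{2}(\theta - \hat{\theta})^{\mathsf{T}}\,\mathcal{I}(\hat{\theta})\,(\theta - \hat{\theta}) + R(x, \theta),$$
where $\mathcal{I}(\hat{\theta})$ is the average observed information per observation, evaluated at the MLE, and $R(x, \theta)$ denotes the residual term. The first-order term vanishes because the gradient of the log-likelihood is zero at $\hat{\theta}$. To the extent that $R(x, \theta)$ is negligible and $\pi(\theta \mid M)$ is relatively linear near $\hat{\theta}$, we can integrate out $\theta$ to get the following:
$$p(x \mid M) \approx \hat{L}\left(\frac{2\pi}{n}\right)^{k/2}\left|\mathcal{I}(\hat{\theta})\right|^{-1/2}\pi(\hat{\theta} \mid M).$$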
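As $n$ increases, we can ignore $|\mathcal{I}(\hat{\theta})|$ and $\pi(\hat{\theta} \mid M)$ as they are $O(1)$. Thus,
$$p(x \mid M) = \exp\left(\ln(\hat{L}) - \frac{k}{2}\ln(n) + O(1)\right) = \exp\left(-\frac{\mathrm{BIC}}{2} + O(1)\right),$$
where BIC is defined as above, and $\hat{L}$ either (a) is evaluated at the Bayesian posterior mode or (b) uses the MLE and the prior $\pi(\theta \mid M)$ has nonzero slope at the MLE.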
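The claim that the log evidence differs from $-\mathrm{BIC}/2$ only by an $O(1)$ term can be checked numerically in a case where the evidence has a closed form. The sketch below is an illustrative example, not part of the derivation by Konishi and Kitagawa: it assumes a Bernoulli model with a single parameter $\theta$ ($k = 1$) and a uniform prior on $[0, 1]$, so that the evidence for a sequence with $s$ successes in $n$ trials is the Beta function $B(s + 1, n - s + 1)$. The function names are hypothetical.

```python
import math

def log_evidence_bernoulli(s, n):
    """Exact ln p(x | M) for a Bernoulli sequence under a uniform prior on theta.

    The integral of theta^s (1 - theta)^(n - s) over [0, 1] equals B(s + 1, n - s + 1).
    """
    return math.lgamma(s + 1) + math.lgamma(n - s + 1) - math.lgamma(n + 2)

def bic_bernoulli(s, n):
    """BIC = k ln(n) - 2 ln(L_hat) with k = 1; requires 0 < s < n."""
    p_hat = s / n                                   # MLE of theta
    log_lik = s * math.log(p_hat) + (n - s) * math.log(1 - p_hat)
    return 1 * math.log(n) - 2 * log_lik

# Gap between the exact log evidence and -BIC/2 for a fixed success rate of 0.3.
gaps = {}
for n in (100, 1000, 10000):
    s = 3 * n // 10
    gaps[n] = log_evidence_bernoulli(s, n) - (-0.5 * bic_bernoulli(s, n))
    print(n, round(gaps[n], 4))
```

In this example the gap stays bounded as $n$ grows and settles near $\tfrac{1}{2}\ln(2\pi \cdot 0.3 \cdot 0.7)$, which is the contribution of the $(2\pi)^{k/2}\,|\mathcal{I}(\hat{\theta})|^{-1/2}\,\pi(\hat{\theta} \mid M)$ factors that the BIC discards.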