Bayesian model comparison

This is the follow up post to my earlier one on the Markov Chain Monte Carlo (MCMC) method for fitting models to data.  I really should have covered Bayesian parameter estimation before this post but as an inadvertent  demonstration of the simplicity of the Bayesian approach I’ll present these ideas in random order.  Although it won’t be immediately apparent in this post, the MCMC is the only computational method you ever really need to learn to do all Bayesian inference.

It often comes up in biology and other fields that you have several possible models that could explain some data and you want to know which model is best.  The first thing you could do is to see which model fits the data best.  However, if a model has more parameters then certainly it will fit the model better.  You could always have a model with as many parameters as data points and that would fit the data perfectly.  So, it is necessary to balance how good you fit the data with the complexity of the model.  There have been several proposals to address this issue.  For example, the Akaike Information Criterion (AIC) or Bayes Information Criterion (BIC).  However, as I will show here, these criteria are just approximations to Bayesian model comparison.

So suppose you have a set of data, which I will denote by D.  This could be any set of measurements over any domain.  For example it could be a time series of points such as the  amount of glucose in the blood stream after a meal measured at each  minute.  We then have some proposed models that could fit the data and we want to know which is the best model.  The model could be a polynomial, a differential equation or anything.  Let’s say we have two models M1 and M2.  The premise behind Bayesian inference is that anything can be assigned a probability so what we want to assess is the probability of a model given the data D.  The model with the highest probability is then the most likely model.  So we want to evaluate P(M1 |D) and compare it to P(M2|D).  A frequentist could not do this because only random variables can be assigned probabilities and in fact there is no systematic way to evaluate models using frequentist statistics.

Now, how do you valuate P(M|D)?  This is where Bayes rule comes in

$P(M|D) = \frac{ P(D|M) P(M)}{P(D)}$   (*)

where P(D|M) is the likelihood function, P(M) is the prior probability for the model and P(D) is the evidence.

This is also generally where people’s eyes start to glaze over.  They can’t really see why simply rewriting probability can be of any use.  The whole point is that while it is not clear how to estimate the probability of a model given the data, it is just a matter of computation to obtain the probability that the data would arise given a model i.e. P(D|M).  However, there is one problem with equation (*) and that is it is not clear how to compute P(D), which is the probability of data marginalized over all models!  The finesse around this problem is to look at odds ratios.  So the odds ratio of M1 compared to M2 is given by

$\frac{P(M1|D)}{P(M2||D)} = \frac{P(D|M1)}{P(D|M2)}\frac{P(M1)}{P(M2)}$.

The ratio $\frac{P(D|M1)}{P(D|M2)}$ is called the Bayes factor and this is the quantity we want to compute.  If the prior probabilities of the models are equal then the Bayes factor is all you need to determine the best model.  The AIC and BIC are approximations of the Bayes factor although some may quibble that the AIC is something different.  Generally, we take logs of the Bayes factor and thus whichever model has the highest model log likelihood (i.e. $\ln P(M|D)$) is the best model.  This approach also generalizes to comparing many models.  You just compare them pairwise and if you use log likelihoods then the model with the highest log likelihood of all the models wins.

The next move is what fuels a massive debate between Bayesians and frequentists.  Suppose each model depends on a set of parameters denoted by $\theta$.  Whenever we compare the model to the data, we need to pick an instantiation of the parameters.  For example, if you have some differential equation model for a time series, then you pick some parameters to specify the model, simulate the differential equation and compare the results to the data.  You can then try other parameters and compare the results to the data.  The Bayesian argument is that to compute P(D|M), you must marginalize over the parameters to obtain

$P(D|M)= \int P(D|M,\theta)P(\theta|D) d\theta$   (**)

So what does this mean?  Well it says that the probability that the data came from a model, is the “average” performance of a model “weighted” by the prior probability of the parameters.   So, now you can see why this step is so controversial.  It suggests that a model is only as good as the prior probability for the parameters of the model.  The Bayesian viewpoint is that even if some model has a set of parameters that makes the model fit exactly to the data, if you have no idea where those parameters are then that model is not very good.

But there is even more to this.  How good a model is also depends on how “broad” the prior distribution is compared to the likelihood function because the “overlap” between the two functions in the integrand of the integral in (**) determines how large the probability is.  As a simple way to see this suppose there is one parameter and the  prior for the parameter  has the form

$P(\theta | M)=\frac{1}{\sigma}, 0 < \theta <\sigma$

Then the likelihood for the model is given by

$P(D|M) =\frac{1}{\sigma}\int_0^\sigma P(D|M,\theta) d\theta$

Now suppose that the likelihood is a single peaked function with a maximum likelihood value of L.  Then we can rewrite the model likelihood as $P(D|M) =L(\delta/\sigma)$, where $\delta$ is just a constant that gives a “measure” of the “width” of the distribution.   We thus see that the probability of a model is given by the maximum likelihood, weighted by the ratio of the width of the data likelihood and the prior $\delta/\sigma\le 1$.  This ratio is called the Occam’s factor. When generalized to k parameters Occam’s factor has the form $(\delta/\sigma)^k$.  Thus, we see that as the number of parameters increases, Occam’s factor decreases and thus models with more parameters are always penalized.

If we take the log of the likelihood we get

$\ln P(D|M) \simeq \ln L - k \ln\sigma/\delta$,

which can be compared to the BIC and AIC.  So to compare two models we just compute the Bayesian log likelihood of the model and the model with the highest value is more likely.  If you have more than one model you just compare all the models to each other pairwise and the model with the highest Bayesian log likelihood is the best.  The beautiful thing about Bayesian inference is that the maximum likelihood value for the parameters, the posterior distribution and the Bayes factor can all be estimated in a single MCMC computation.  I’ll cover this in the next post.

9 thoughts on “Bayesian model comparison”

1. […] is the third post on Bayesian inference.  The other two are here and here. This probably should be the first one to read if you are completely unfamiliar with the […]

Like

2. […] I need to manipulate very large matrices. It is true that we can now do Bayesian inference and model comparison on larger models.  However, the curse of dimensionality strongly works against us here.  If you […]

Like

3. Excellent article. Do you still plan to write a post about estimating posterior distribution by MCMC?

Like

4. Hi,

I probably should write a post making the connection between the posterior and the MCMC explicit. The pieces are all there in the three Bayesian posts I have written so far but perhaps a fourth is warranted. Thanks for the suggestion.

Like

5. Hi,

Thanks for reply. Noticed your blog days before and found lots of interesting posts here. Excellent job.

Like

6. […] data, what we did was to explore classes of models that could explain a given set of data and use Bayesian model comparison to decide which was better.  This approach was used in the work on quantifying insulin’s […]

Like

7. […] a previous post, I summarized the Bayesian approach to model comparison, which requires the calculation of the […]

Like

8. Clark says:

Below **, shouldn’t that be: “weighted” by the posterior probability of the parameters?

Like

9. @Clark Technically, you are correct because I conditioned the model and the parameter probabilities on the data. In principle, you could do model comparison with the priors or posteriors for the parameters. I guess I called it a prior out of habit, convenience and sloppiness. Thanks for pointing this out.

Like