This is the follow up post to my earlier one on the Markov Chain Monte Carlo (MCMC) method for fitting models to data. I really should have covered Bayesian parameter estimation before this post but as an inadvertent demonstration of the simplicity of the Bayesian approach I’ll present these ideas in random order. Although it won’t be immediately apparent in this post, the MCMC is the only computational method you ever really need to learn to do all Bayesian inference.
It often comes up in biology and other fields that you have several possible models that could explain some data and you want to know which model is best. The first thing you could do is to see which model fits the data best. However, if a model has more parameters then certainly it will fit the model better. You could always have a model with as many parameters as data points and that would fit the data perfectly. So, it is necessary to balance how good you fit the data with the complexity of the model. There have been several proposals to address this issue. For example, the Akaike Information Criterion (AIC) or Bayes Information Criterion (BIC). However, as I will show here, these criteria are just approximations to Bayesian model comparison.
So suppose you have a set of data, which I will denote by D. This could be any set of measurements over any domain. For example it could be a time series of points such as the amount of glucose in the blood stream after a meal measured at each minute. We then have some proposed models that could fit the data and we want to know which is the best model. The model could be a polynomial, a differential equation or anything. Let’s say we have two models M1 and M2. The premise behind Bayesian inference is that anything can be assigned a probability so what we want to assess is the probability of a model given the data D. The model with the highest probability is then the most likely model. So we want to evaluate P(M1 |D) and compare it to P(M2|D). A frequentist could not do this because only random variables can be assigned probabilities and in fact there is no systematic way to evaluate models using frequentist statistics.
Now, how do you valuate P(M|D)? This is where Bayes rule comes in
where P(D|M) is the likelihood function, P(M) is the prior probability for the model and P(D) is the evidence.
This is also generally where people’s eyes start to glaze over. They can’t really see why simply rewriting probability can be of any use. The whole point is that while it is not clear how to estimate the probability of a model given the data, it is just a matter of computation to obtain the probability that the data would arise given a model i.e. P(D|M). However, there is one problem with equation (*) and that is it is not clear how to compute P(D), which is the probability of data marginalized over all models! The finesse around this problem is to look at odds ratios. So the odds ratio of M1 compared to M2 is given by
The ratio is called the Bayes factor and this is the quantity we want to compute. If the prior probabilities of the models are equal then the Bayes factor is all you need to determine the best model. The AIC and BIC are approximations of the Bayes factor although some may quibble that the AIC is something different. Generally, we take logs of the Bayes factor and thus whichever model has the highest model log likelihood (i.e. ) is the best model. This approach also generalizes to comparing many models. You just compare them pairwise and if you use log likelihoods then the model with the highest log likelihood of all the models wins.
The next move is what fuels a massive debate between Bayesians and frequentists. Suppose each model depends on a set of parameters denoted by . Whenever we compare the model to the data, we need to pick an instantiation of the parameters. For example, if you have some differential equation model for a time series, then you pick some parameters to specify the model, simulate the differential equation and compare the results to the data. You can then try other parameters and compare the results to the data. The Bayesian argument is that to compute P(D|M), you must marginalize over the parameters to obtain
So what does this mean? Well it says that the probability that the data came from a model, is the “average” performance of a model “weighted” by the prior probability of the parameters. So, now you can see why this step is so controversial. It suggests that a model is only as good as the prior probability for the parameters of the model. The Bayesian viewpoint is that even if some model has a set of parameters that makes the model fit exactly to the data, if you have no idea where those parameters are then that model is not very good.
But there is even more to this. How good a model is also depends on how “broad” the prior distribution is compared to the likelihood function because the “overlap” between the two functions in the integrand of the integral in (**) determines how large the probability is. As a simple way to see this suppose there is one parameter and the prior for the parameter has the form
Then the likelihood for the model is given by
Now suppose that the likelihood is a single peaked function with a maximum likelihood value of L. Then we can rewrite the model likelihood as , where is just a constant that gives a “measure” of the “width” of the distribution. We thus see that the probability of a model is given by the maximum likelihood, weighted by the ratio of the width of the data likelihood and the prior . This ratio is called the Occam’s factor. When generalized to k parameters Occam’s factor has the form . Thus, we see that as the number of parameters increases, Occam’s factor decreases and thus models with more parameters are always penalized.
If we take the log of the likelihood we get
which can be compared to the BIC and AIC. So to compare two models we just compute the Bayesian log likelihood of the model and the model with the highest value is more likely. If you have more than one model you just compare all the models to each other pairwise and the model with the highest Bayesian log likelihood is the best. The beautiful thing about Bayesian inference is that the maximum likelihood value for the parameters, the posterior distribution and the Bayes factor can all be estimated in a single MCMC computation. I’ll cover this in the next post.
erratum: 2017-08-02. In equation (**) I replaced with