This is the third post on Bayesian inference. The other two are here and here. If you are completely unfamiliar with the topic, this is probably the post to read first. Suppose you are trying to model some system and you want the model to match some data. The model could be a function, a set of differential equations, or anything with parameters that can be adjusted. To make this concrete, consider this classic differential equation model of the response of glucose to insulin in the blood:

\frac{dG}{dt} = -(p_1 + X)\, G + p_1 G_b

\frac{dX}{dt} = -p_2 X + p_3 (I - I_b)

where G is the glucose concentration, I is the insulin level, X is the action of insulin in some remote compartment, and there are five free parameters (p_1, p_2, p_3, G_b, I_b). If you include the initial values of G and X there are seven free parameters. The data consist of measurements of glucose and insulin at discrete time points t_1, t_2, \dots, t_N. The goal is to find a set of free parameters so that the model fits the data at those time points.
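As a concrete sketch, here is a minimal simulation of such a model in Python. The function name, the forward-Euler integration, and the piecewise-constant treatment of the insulin input are all illustrative choices; a real fit would use an adaptive ODE solver.

```python
# Forward-Euler simulation of a minimal glucose-insulin model (a sketch):
#   dG/dt = -(p1 + X) * G + p1 * Gb
#   dX/dt = -p2 * X + p3 * (I(t) - Ib)
# The insulin input I(t) is treated as piecewise constant between measurements.

def simulate(params, t_grid, insulin, G0, X0):
    """Integrate the model on t_grid; insulin[i] is the measured I at t_grid[i].
    Returns the simulated glucose trajectory."""
    p1, p2, p3, Gb, Ib = params
    G, X = G0, X0
    Gs = [G]
    for i in range(len(t_grid) - 1):
        dt = t_grid[i + 1] - t_grid[i]
        dG = -(p1 + X) * G + p1 * Gb
        dX = -p2 * X + p3 * (insulin[i] - Ib)
        G, X = G + dt * dG, X + dt * dX
        Gs.append(G)
    return Gs
```

With insulin held at its baseline and X starting at zero, the simulated glucose simply relaxes toward the basal level Gb, which is a quick sanity check on the integrator.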
The first question to answer is what you mean by fitting the model at those time points. This is by no means simple to answer, and the results will change depending on how you answer it. The standard procedure is to assign some error when the model does not go through the points and to try to minimize that error. The most common error function is the mean square error, which we can write as

E = \frac{1}{N} \sum_{i=1}^{N} \left( G(t_i) - G_i \right)^2

where G_i is the glucose measurement at time point t_i and G(t_i) is the model value there. Note that X is a hidden variable that must be estimated from the glucose and insulin levels. We can then numerically simulate the model using the measured insulin levels as inputs and compute the total error for a given set of parameters. The classic way of doing this is to take derivatives of the error function with respect to the parameters and find the minimum points. For most models this must be done numerically by some iteration scheme. Generally, error functions will have multiple minima, and the one you find will depend on your initial guess. Finding the global optimum is a very difficult problem for which no general efficient algorithm exists. There are numerous ad hoc methods and books on optimization, and it can be quite daunting trying to sort through it all.
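To illustrate, here is a toy version of the error function and a random-restart local search on a one-parameter stand-in model. The `model` callable and the hill-climbing scheme are placeholders for the real simulation and for whatever optimizer you prefer; the point is only that the answer depends on the starting guess.

```python
import random

def mse(model, theta, times, data):
    """Mean square error between model output and the measurements."""
    return sum((model(theta, t) - d) ** 2 for t, d in zip(times, data)) / len(data)

def multi_start(model, times, data, n_starts=20, n_iters=500, step=0.1):
    """Random-restart hill climbing on a one-parameter model (a toy sketch).
    Each restart begins from a different random guess; we keep the best result."""
    best_theta, best_err = None, float("inf")
    for _ in range(n_starts):
        theta = random.uniform(-5.0, 5.0)
        err = mse(model, theta, times, data)
        for _ in range(n_iters):
            cand = theta + random.gauss(0.0, step)
            cand_err = mse(model, cand, times, data)
            if cand_err < err:  # accept only improvements
                theta, err = cand, cand_err
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta, best_err
```

On a quadratic error surface like this toy example, every restart finds the same minimum; on a realistic multi-modal surface, different restarts land in different basins.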
Now, if you also want some sort of estimate of how good your fit is and what sort of uncertainty is associated with it, then you'll need to use statistics. The starting point is to assume some probability distribution for your errors. Generally, a normal distribution is sufficient, and you can always check whether it was a good assumption after the fact by looking at the error distribution after you fit the data. In some cases, a simple transformation, like taking the log, will render the errors normal. So, suppose the error obeys a normal distribution with zero mean and standard deviation σ. We can then write

P(D) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(G(t_i) - G_i)^2}{2\sigma^2} \right)

for the probability of observing the data D.
Now, the model points G(t_i) depend on the free parameters, so we can rewrite this expression as a likelihood function of the parameters given the data, i.e.

L(\theta \mid D) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(G(t_i; \theta) - G_i)^2}{2\sigma^2} \right) \qquad (*)

where θ represents a vector of all the parameters. What the likelihood function does is turn the probability density function on its head. To be a probability density, the integral over its domain must be one; this is not true of the likelihood function as a function of θ. However, the likelihood function is still useful because the parameters that maximize it also minimize the error (bring it closer to zero). You can also use the likelihood function to estimate how good your guess is via the Fisher information. Note that the log of the likelihood function is, up to constants, the negative of the error function, so minimizing the error is the same as maximizing the likelihood.
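For a Gaussian error model, the relation between the log-likelihood and the error function can be written out directly. Here is a small sketch; `residuals` stands for the model-minus-data differences G(t_i) - G_i.

```python
import math

def log_likelihood(residuals, sigma):
    """Gaussian log-likelihood of the residuals:
    -N/2 * log(2*pi*sigma^2) - SSE / (2*sigma^2).
    The first term is a constant in the parameters, so maximizing this
    is the same as minimizing the sum of squared errors (SSE)."""
    n = len(residuals)
    sse = sum(r * r for r in residuals)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sse / (2 * sigma ** 2)
```

Smaller residuals give a larger log-likelihood, which is the maximum-likelihood / least-squares equivalence in one line.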
This is where the great divide between Bayesian and frequentist statistics comes in. The frequentists, led by Ronald Fisher, believe that probabilities can only be assigned to random variables, like the error, and that parameters of the model are fixed. So in their view, there is some optimal set of parameters out there that is obscured by random noise. So parameters cannot have errors attached to them. What you say instead is that given some set of parameters, this is the range of data values you would expect. You then try to find the parameters such that the expected data values bracket the measured data. It’s kind of a backwards way of thinking and is why statistics is so hard to understand, especially if you come from a dynamical systems/applied math background.
The Bayesian point of view, championed by Harold Jeffreys and Edwin Jaynes, is that everything can be assigned a probability. It simply represents some level of uncertainty you have about something. So parameters have probability distributions. The Bayesian approach is to “normalize” the likelihood function and turn it into a probability. This is done using Bayes’s rule

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
so now the probability of the parameters given the data points is given by

P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
where P(D | θ) is the likelihood function in (*), P(θ | D) is called the posterior probability, P(θ) is called the prior probability, and P(D) is called the evidence. Hence, in order to use Bayesian inference you need to have a prior and you need to compute the evidence. If you assume that the prior is flat, which is called a noninformative prior, then the mode of the posterior is the same as the maximum of the likelihood. There was a great battle between the Bayesians and the frequentists in the mid-twentieth century, but the Bayesians have basically won in math and statistics departments. However, Bayesian ideas are only now entering the lexicon of fields that use statistics, like medicine, biology, and economics.
One of the reasons it took the Bayesian approach so long to be accepted is that it was very difficult to compute the posterior. The evidence can be written as

P(D) = \int P(D \mid \theta)\, P(\theta)\, d\theta
which is an integral of the likelihood over the prior distribution of the parameters. In other words, to compute the posterior distribution you have to run the model over the entire parameter space you are interested in and sum it all up. This takes a lot of computational power, but any desktop machine can now easily handle it. In fact, the easiest way to compute the posterior is to use the Markov Chain Monte Carlo (MCMC) method, which I posted on earlier. What you do is simply keep all the parameter sets generated during a run (after some burn-in period); the stationary distribution of the MCMC is the posterior probability distribution. In other words, instead of keeping only the parameter set that minimized the error, you save all the parameters generated by the MCMC run. Their histogram is the posterior density. You can then take the mode, median, or mean of the posterior as your best estimate. For the error, you can take the standard deviation of the posterior or the interval between the 10th and 90th percentiles, and so forth.
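As a sketch of this procedure, here is a bare-bones random-walk Metropolis sampler that keeps every post-burn-in sample and summarizes the posterior. The step size, burn-in length, and helper names are illustrative choices, not a prescription; for a real model, the one-dimensional parameter would become the full parameter vector.

```python
import math
import random

def metropolis(log_post, theta0, n_steps=20000, step=0.5, burn_in=2000):
    """Random-walk Metropolis on a 1-d parameter. Every sample kept after
    burn-in is a (correlated) draw from the posterior distribution."""
    theta, lp = theta0, log_post(theta0)
    samples = []
    for i in range(n_steps):
        cand = theta + random.gauss(0.0, step)
        lp_cand = log_post(cand)
        # accept with probability min(1, exp(lp_cand - lp))
        if random.random() < math.exp(min(0.0, lp_cand - lp)):
            theta, lp = cand, lp_cand
        if i >= burn_in:
            samples.append(theta)
    return samples

def credible_interval(samples, lo=0.1, hi=0.9):
    """Empirical 10-90 percent interval of the kept samples."""
    s = sorted(samples)
    return s[int(lo * len(s))], s[int(hi * len(s))]
```

Running it on a known target, such as a standard normal log-posterior, is a useful check: the sample mean, standard deviation, and percentile interval should all match the target.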
One of the biggest open questions in this approach is that there are no good theorems about the convergence rate of the MCMC, so you really have no idea how long you have to run it. This again is where fast computers partially save you. I simply run for a long time, wait until nothing much changes, and go with that. It's no worse than using any optimization algorithm when you don't know whether you've reached a minimum. The last thing I will say is that you can use a method called parallel tempering, which is like simulated annealing, to speed up the posterior computation and also do model comparison in the same run. I'll give the details of parallel tempering in the next post.