Bayesian parameter estimation

This is the third post on Bayesian inference.  The other two are here and here. This probably should be the first one to read if you are completely unfamiliar with the topic.  Suppose you are trying to model some system and you have a model that you want to fit to some data.  The model could be a function, a set of differential equations, or anything with adjustable parameters.  To make this concrete, consider this classic differential equation model of the response of glucose to insulin in the blood:

\dot G = -S_I X G - S_G(G-G_b)

\dot X = c_X[I(t) - X -I_b]

where G is glucose concentration, I is insulin, X is the action of insulin in some remote compartment, and there are five free parameters S_I, S_G, G_b, c_X, I_b.  If you include the initial values of G and X there are seven free parameters.  The data consist of measurements of glucose and insulin at discrete time points \{t_1,t_2,\dots\}.  The goal is to find a set of free parameters so that the model fits the data at those time points.
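
As a concrete starting point, here is a minimal sketch of simulating the model in Python with SciPy.  All of the parameter values, the constant insulin input, and the measurement times below are made up purely for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

def minimal_model(t, y, S_I, S_G, G_b, c_X, I_b, I_func):
    """Right-hand side of the glucose-insulin model above."""
    G, X = y
    dG = -S_I * X * G - S_G * (G - G_b)
    dX = c_X * (I_func(t) - X - I_b)
    return [dG, dX]

# Hypothetical values, chosen only to make the sketch run.
params = (1e-4, 0.02, 90.0, 0.05, 10.0)    # S_I, S_G, G_b, c_X, I_b
I_func = lambda t: 15.0                     # a stand-in insulin input I(t)
t_data = np.linspace(0, 120, 13)            # measurement times t_1, t_2, ...

sol = solve_ivp(minimal_model, (t_data[0], t_data[-1]), y0=[250.0, 0.0],
                args=params + (I_func,), t_eval=t_data)
G_model = sol.y[0]   # model glucose G(t_i) at the measurement times
```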

The first question to answer is what you mean by fitting the model at those time points.  This is by no means simple to answer, and the results will change depending on how you answer it.  The standard procedure is to assign some error if the model does not go through the points and try to minimize that error.  The most common error function is the sum of squared errors, which we can write as

E=\sum_i (g_i-G(t_i))^2

where g_i is the glucose measurement at time point t_i.  Note that X is a hidden variable that must be estimated from the glucose and insulin levels.  We can then numerically simulate the model using I(t_i) as inputs and measure the total error for a given set of parameters.  The classic way of minimizing the error is to take derivatives of the error function with respect to the parameters and find the minimum points.  For most models this must be done numerically by some iteration scheme.  Generally, error functions have multiple minima, and what you find will depend on your initial guess.  Finding the global optimum is a very difficult problem for which no general efficient algorithm exists.  There are numerous ad hoc methods and books on optimization, and it can be quite daunting to sort through all of it.
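
Continuing the sketch above, a least-squares fit might look as follows.  Here g_data is synthetic data generated from the model, and Nelder-Mead is just one choice of derivative-free local optimizer, so this only finds a local minimum near the initial guess.

```python
from scipy.optimize import minimize

def simulate(theta, t_data, I_func):
    """Glucose predicted by the model for a full parameter vector theta."""
    S_I, S_G, G_b, c_X, I_b, G0, X0 = theta   # the seven free parameters
    sol = solve_ivp(minimal_model, (t_data[0], t_data[-1]), y0=[G0, X0],
                    args=(S_I, S_G, G_b, c_X, I_b, I_func), t_eval=t_data)
    return sol.y[0]

def error(theta, t_data, g_data, I_func):
    """Sum of squared deviations E between data and model."""
    return np.sum((g_data - simulate(theta, t_data, I_func)) ** 2)

# Synthetic "measurements": the model output plus noise, for illustration only.
rng = np.random.default_rng(1)
g_data = G_model + rng.normal(0.0, 5.0, size=t_data.size)

theta0 = np.array([2e-4, 0.03, 85.0, 0.04, 12.0, 240.0, 0.0])  # initial guess
fit = minimize(error, theta0, args=(t_data, g_data, I_func),
               method="Nelder-Mead")   # derivative-free local search
```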

Now, if you also want some estimate of how good your fit is and what sort of uncertainty is associated with it, then you'll need statistics.  The starting point is to assume some probability distribution for your errors.  Generally, a normal distribution is sufficient, and you can always check the assumption after the fact by looking at the error distribution once you have fit the data.  In some cases, a simple transformation like taking the log will make the errors normal.  So, suppose the error obeys a normal distribution with zero mean and standard deviation \sigma.  We can then write

P(E) \propto \exp\left(-\frac{\sum_i (g_i-G(t_i))^2}{2\sigma^2}\right)

Now, the model points G(t_i) depend on the free parameters so we can rewrite this expression as a likelihood function of the parameters given the data, i.e.

L(\theta) \propto \exp\left(-\frac{\sum_i  (g_i-G(t_i))^2}{2\sigma^2}\right) (*)

where \theta represents a vector of all the parameters.  What the likelihood function does is turn the probability density function on its head.  To be a probability density, the integral over its domain must be one.  This is not true of the likelihood function.  However, the likelihood function is still useful because the parameters that maximize it minimize the error (get it closer to zero).  You can also use the likelihood function to estimate how good your fit is using Fisher information.  Note that if you take the log of the likelihood function you get the negative of the error function (up to a constant factor), so minimizing the error is the same as maximizing the likelihood.
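
As a check on that last statement, the log-likelihood for the running sketch is just the negative error divided by 2\sigma^2, dropping additive constants (\sigma is assumed known here for simplicity):

```python
def log_likelihood(theta, t_data, g_data, I_func, sigma=5.0):
    """Log of the likelihood (*) up to an additive constant: -E / (2 sigma^2)."""
    return -error(theta, t_data, g_data, I_func) / (2.0 * sigma**2)
```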

This is where the great divide between Bayesian and frequentist statistics comes in.  The frequentists, led by Ronald Fisher, believe that probabilities can only be assigned to random variables, like the error, and that the parameters of the model are fixed.  In their view, there is some optimal set of parameters out there that is obscured by random noise, so parameters cannot have errors attached to them.  What you say instead is that given some set of parameters, this is the range of data values you would expect.  You then try to find the parameters such that the expected data values bracket the measured data.  It's kind of a backwards way of thinking and is why statistics is so hard to understand, especially if you come from a dynamical systems/applied math background.

The Bayesian point of view, championed by Harold Jeffreys and Edwin Jaynes, is that everything can be assigned a probability.  It simply represents some level of uncertainty you have about something.  So parameters have probability distributions.  The Bayesian approach is to “normalize” the likelihood function and turn it into a probability.  This is done using Bayes’s rule

P(A|B)=\frac{P(B|A)P(A)}{P(B)}

so now the probability of the parameters given the data points D is given by

P(\theta | D) = \frac{P(D|\theta) P(\theta)}{P(D)}

where P(D|\theta) is the likelihood function L(\theta) in (*), P(\theta|D) is called the posterior probability, P(\theta) is called the prior probability, and P(D) is called the evidence.  Hence, in order to use Bayesian inference you need to have a prior and you need to compute the evidence.  If you assume that the prior is flat, called a noninformative prior, then the mode of the posterior is the same as the maximum of the likelihood.  There was a great battle between the Bayesians and the frequentists in the mid twentieth century, but the Bayesians have basically won in math and statistics departments.  However, Bayesian ideas are only now entering the lexicon of fields that use statistics, like medicine, biology, and economics.
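
To see the flat-prior claim explicitly, write the prior as a constant c.  The constant cancels between the numerator and the evidence,

P(\theta|D) = \frac{L(\theta)\,c}{\int L(\theta')\,c\,d\theta'} \propto L(\theta)

so the posterior is proportional to the likelihood and the two share the same maximum.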

One of the reasons it took the Bayesian approach so long to be accepted is that it was very difficult to compute the posterior.  The evidence can be written as

P(D)=\int P(D|\theta)P(\theta) d\theta

which is an integral of the likelihood over the prior distribution of the parameters.  In other words, to compute the posterior distribution you have to run the model over the entire parameter space you are interested in and sum it all up.  This takes a lot of computational power, but any desktop machine can now easily handle it.  In fact, the easiest way to compute the posterior is to use the Markov Chain Monte Carlo (MCMC) method, which I posted on earlier.  What you do is simply keep all the parameter sets generated during a run (after some burn-in period).  The stationary distribution of the MCMC is the posterior probability distribution.  In other words, instead of keeping only the parameter set that minimized the error, you save all the parameters generated by the MCMC run, and the histogram of them is the posterior density.  You can then take the mode, median, or mean of the posterior as your best estimate.  For the error, you can take the standard deviation of the posterior, or the interval between the 10th and 90th percentiles, and so forth.
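
Here is a minimal random-walk Metropolis sketch along those lines, continuing the code above and assuming a flat prior so that the posterior is proportional to the likelihood.  The step sizes, chain length, and burn-in are illustrative and would need tuning in practice.

```python
rng = np.random.default_rng(2)
n_steps, burn_in = 50_000, 5_000
step = 0.05 * np.abs(theta0) + 1e-3   # proposal widths (nonzero for X0 too)

theta = theta0.copy()
logL = log_likelihood(theta, t_data, g_data, I_func)
samples = []
for i in range(n_steps):
    # Real code should also guard against unphysical parameter values here.
    proposal = theta + step * rng.standard_normal(theta.size)
    logL_prop = log_likelihood(proposal, t_data, g_data, I_func)
    # Metropolis rule: accept with probability min(1, L_prop / L_current).
    if np.log(rng.random()) < logL_prop - logL:
        theta, logL = proposal, logL_prop
    if i >= burn_in:
        samples.append(theta.copy())

samples = np.array(samples)
best = np.mean(samples, axis=0)                      # posterior mean estimate
lo, hi = np.percentile(samples, [10, 90], axis=0)    # 10th-90th percentile band
```

Histogramming any column of samples, e.g. np.histogram(samples[:, 0]), then gives the marginal posterior density for that parameter.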

One of the biggest open questions in this approach is that there are no good theorems about the convergence rate of the MCMC, so you really have no idea how long you have to run.  This again is where fast computers partially save you.  I simply run for a long time, wait until nothing much changes, and go with that.  It's no worse than using any optimization algorithm when you don't know whether you've reached a minimum.  The last thing I will say is that you can use a method called parallel tempering, which is like simulated annealing, to speed up the posterior computation and also do model comparison in the same run.  I'll give the details of parallel tempering in the next post.

12 thoughts on “Bayesian parameter estimation”

  1. Actually, to my surprise, they are covering Bayesian statistics in medical school AGAIN, this time with an emphasis on pre-test vs. post-test probability and that the prevalence of the disease in the population really matters (not just the specificity of the test). They always sort of gloss over Bayes Theorem itself but they do a reasonably good job of getting the essential ideas across, especially as applied to interpreting and applying literature information to the patients.

    Score one for Bayesians in medical school – except that the med students complain like crazy anytime “useless” things like Bayes’ Theorem are presented…

  2. Great article! I am only a few weeks into machine learning and I’m looking for simple, clean, and well-written articles. Write more about Bayesian inference!

  3. Love your posts on Bayesian inference as well! Great stuff and keep it up! Super helpful and well written!

  4. This was very helpful. I am an undergraduate student and need help implementing this in Matlab.
    Is it possible to implement this in Matlab? Please post more information, if any, about the same model in Matlab.
    Thank you.

  5. There’s one thing I still find confusing. You state that ‘a’ in your MCMC algorithm is the standard deviation of the parameter value, which is the standard deviation of the posterior probability density for the parameter. But to determine the posterior probability density you need to run the MCMC which in itself contains information about the posterior.

  6. The stationary distribution of the MCMC is the posterior distribution. In some sense it turns the guess distribution into the posterior. It is not circular if that is what you are worried about. You can stick any guess distribution in and the same posterior will arise once you have reached convergence.

    The Metropolis algorithm assumes that the guess distribution is symmetric (i.e. the probability of the guess given the current state is the same as the probability of the current state given the guess). If this distribution is not symmetric then you use the Metropolis-Hastings algorithm. You can find the formula in any textbook or Wikipedia.
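
    For reference, with a proposal density q, Metropolis-Hastings accepts a move from \theta to \theta' with probability

    A = \min\left(1, \frac{P(\theta')\,q(\theta|\theta')}{P(\theta)\,q(\theta'|\theta)}\right)

    which reduces to the Metropolis rule \min(1, P(\theta')/P(\theta)) when q is symmetric.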

  7. Wonderful posts. Many things are getting clearer about posterior probability. I would like you to relate it to machine learning.
