As I have posted before, I never learned any statistics during my education as a theoretical physicist/applied mathematician. However, it became fairly apparent after I entered biology (although I managed to avoid it for a few years) that fitting models to data and estimating parameters is unavoidable. As I have opined multiple times previously, Bayesian inference and the Markov Chain Monte Carlo (MCMC) method is the best way to do this. I thought I would provide the details on how to implement the algorithm since most of the treatments on the subject are often couched in statistical language that may be hard to decipher for someone with a physics or dynamical systems background. I’ll do this in a series of posts and in a nonstandard order. Instead of building up the entire Bayesian formalism, I thought I would show how the MCMC method can be used to do the job and then later show how it fits into the grand scheme of things to do even more.
Suppose you have data , which consists of some measurements (either a scalar or a vector) at discrete time points
and a proposed model, which produces a time series
, where
represents a set of free parameters that changes the function. This model could be a system of differential equations or just some proposed function like a polynomial. The goal is to find the set of parameters that best fits the data and to evaluate how good the model is. Now, the correct way to do this is to use Bayesian inference and model comparison, which can be computed using the MCMC. However, the MCMC can also be used just to get the parameters in the sense of finding the best fit according to some criterion.
The first thing you must do is to choose an error function. The most common is the sum of square errors given by
where the is some estimate of the error of the data. Often, you can use the experimental uncertainty or just make some guess. Generally, the answer doesn’t depend too much on
. Next, you need to construct a likelihood function. To fully explain what this means, I would need to go into some details about statistics but suffice it to say that usually a Gaussian works perfectly well so I basically use
Hence, the likelihood is maximal when the error is minimal. The method of Maximum Likelihood is simply finding the set of parameters that maximizes the likelihood function.
I’ll now give the simple Metropolis algorithm to do a maximum likelihood estimation. I’ll give it in words first and then write it out in pseudo code. The first thing is to start with some initial guess for the parameter values and compute . If it is more than one parameter you just make a guess for all of them at once. Next propose a new set of parameter values and calculate
and compute the likelihood ratio
The likelihood ratio is a positive real number. You now pick a uniform random number between 0 and 1. If the likelihood ratio is bigger than
you replace the current parameter set with the proposed set and then repeat. At each step of the process you store the value of the current set. In some cases, the current set won’t change but you still keep it (this will be useful when we get to the Bayesian inference). You then run this for a long time and pick the parameter set that had the maximum likelihood. Generally, you just save the parameters after some burn in period.
In pseudo Matlab code:
theta1 = zeros(1,p); % initial guess for p parameters solve (y,theta1); % solve gives the vector y at the desired time points % for parameters theta1 chi1= sum((D - y)^2);
for n = 1: total; theta2 =theta1+ a.* randn(1,p) % p is the number of parameters and % a is the guess standard deviation solve(y,theta2); chi2=sum((D-y)^2);
ratio=exp((-chi2+chi1)/2*sigma^2);
if rand < ratio theta1=theta2; chi1=chi2; end
if n > burntime thetasave(n-burntime)=theta1; chisave(n-burntime)=chi1; end
end Pick thetasave such that chisave is minimal.
Two questions you will have is how long do you run and how do you choose the guesses. Unfortunately, there are no firm answers. A simple thing to do is to pick a Gaussian random variable around the current value as a guess. However, if you have a bound on the parameter, such as it has to be positive, then you need to use the Metropolis-Hastings algorithm, which generalizes to having non-symmetric guesses. As for how long to run, I usually just run the MCMC, increase the time and run again and see how much the answer changes. If the parameters start to converge then you can be confident that you have run long enough. In principle, if you wait long enough the MCMC will sample all of guess space and find the global optimum. In practice, it can get stuck in local minima for a long time before getting out.
June 27, 2010 at 06:03
Nice article. I would love to see where this goes. I have a few questions:
If the mean square error is used as my cost function, why can’t I do simple gradient descent with multiple random initializations? I guess if you don’t have a gradient, then there’s no choice (and it’s not Bayesian). Intuitively this method seems like it’s going to take a long time to find the global optimal solution. Do you know if there is a bound on the convergence rate?
June 27, 2010 at 21:12
You could certainly do a gradient descent and in fact that would be the standard way of doing maximum likelihood estimates. But as you point out, you would need to estimate the gradient and you could get stuck in local minima, depending on how you choose your initial conditions. There is no guaranteed way to find global optima so it’s not clear if the MCMC is going to take longer than any other method. There are no good bounds on convergence rates. Solving that problem would be a major advance. The main reason to use MCMC is that it is the simplest way to estimate the posterior distribution.
June 27, 2010 at 21:14
Interesting. Thanks for the reply. :)
August 6, 2010 at 10:49
[...] model comparison By Carson Chow This is the follow up post to my earlier one on the Markov Chain Monte Carlo (MCMC) method for fitting models to data. I really should have covered Bayesian parameter [...]
November 11, 2010 at 10:41
[...] is the third post on Bayesian inference. The other two are here and here. This probably should be the first one to read if you are completely unfamiliar with the topic. [...]
January 21, 2011 at 12:39
Very helpful. I’m curious about the standard deviation ‘a’ though. What exactly is it? Is it the standard deviation of each of the model parameters from their accepted value and if so, what is the accepted value? Or is it the standard deviation of the current guess and the data, and if so then why would each model parameter be deviated by the same amount?
January 21, 2011 at 13:17
The standard deviation of the parameter value is the standard deviation of the posterior probability density for the parameter. If you read my other posts on Bayesian parameter estimation, the stationary distribution of the MCMC is the posterior distribution of the parameter probability given the data. Given the posterior you can then choose the mean, median, or mode to represent your best guess. However, the posterior represents the best guess of what the parameter is given the likelihood function, prior and data.
October 18, 2011 at 13:19
excellent
July 5, 2012 at 19:35
Do you have a reference or text book for the all the details?
July 7, 2012 at 21:16
Try Phil Gregory’s book.
July 19, 2012 at 16:31
Hi Carson,
What do you use as the proposal distribution for choosing the next guess for the parameter values? I assume it it symmetric since the accept/ reject step is determined purely by the likelihood ratio. Am I also right in assuming that your algorithm uses uniform priors (again because the accept/reject step only depends on the likelihood ratio)?
July 19, 2012 at 17:29
Hi Kenny,
All excellent questions. I usually use a Normal proposal distribution around my current point. What do you mean by symmetric? If you mean bigger or smaller than your current value then it doesn’t need to be. If you mean symmetric between swapping proposal point and current point then it does for Metropolis but not for Metropolis-Hastings, which I did not cover. The prior will be uniform if you let the MCMC roam anywhere. However, if you put boundaries on where it can go then you have effectively included a prior on that domain. You could also modify the proposal distribution depending on where you are, which is like adding a prior.
July 20, 2012 at 13:54
Thanks Carson!
July 20, 2012 at 14:04
Hi Carson,
I also wanted to ask you about some reference information about your research in obesity. I just finished ( May 2012) my PhD in applied math from Brown (with a focus on probability, stats, and numerical analysis) and I have been reading your blog in Obesity. I am really interested in your work and I was wondering what would be some good introductory text and/or paper to help become more familiar with your research.
Best,
Kenny
July 20, 2012 at 14:16
Good question. People have suggested that I write a book:). In terms of my research, this link has all of my obesity references
http://sciencehouse.wordpress.com/2012/05/23/references/
The PLoS Comp Bio paper is a good place to start. The Lancet series summarizes the applications of the model. There is a good book by Keith Frayn on Human metabolism.
September 26, 2012 at 10:24
Interesting your procedure. It is posible for you to post a working example with few data even using a linear funtion for y (model)?
September 26, 2012 at 13:55
@Fernando, do you mean the full code or the outputs?
December 22, 2012 at 08:12
[...] on before, one of the best ways to find an optimal solution to a problem is to search randomly, the Markov Chain Monte Carlo method being the quintessential example. Randomness is useful for searching in places you [...]
March 13, 2013 at 06:54
Hi Carson,
One doubt: is this likelihood always between 0 and 1? i have tried to replicate this code and i am seeing values above 1. I was assuming it would have to be [0,1] for the comparison to the uniform number to make sense (or even the definition of the likelihood). Could you please let me know what i am missing?
Thanks for the entry!
March 13, 2013 at 08:24
Likelihood is not a probability. It can be any nonnegative number.
May 11, 2013 at 16:14
[…] parameters can be accomplished using the Markov Chain Monte Carlo, which I summarized previously here. We will start by defining the partition […]