## MCMC and fitting models to data

As I have posted before, I never learned any statistics during my education as a theoretical physicist/applied mathematician.  However, after I entered biology it became fairly apparent (although I managed to avoid it for a few years) that fitting models to data and estimating parameters is unavoidable.  As I have opined multiple times previously, Bayesian inference with the Markov Chain Monte Carlo (MCMC) method is the best way to do this.  I thought I would provide the details of how to implement the algorithm, since most treatments of the subject are couched in statistical language that may be hard to decipher for someone with a physics or dynamical systems background.  I'll do this in a series of posts, and in a nonstandard order: instead of building up the entire Bayesian formalism, I'll first show how the MCMC method can be used to do the job and later show how it fits into the grand scheme of things to do even more.

Suppose you have data $D(t_i)$, consisting of measurements (either scalars or vectors) at discrete time points $t_i$, and a proposed model that produces a time series $y(t | \theta)$, where $\theta$ represents a set of free parameters.  The model could be a system of differential equations or just some proposed function like a polynomial.  The goal is to find the set of parameters that best fits the data and to evaluate how good the model is.  The correct way to do this is to use Bayesian inference and model comparison, which can be computed using MCMC.  However, MCMC can also be used just to obtain the parameters, in the sense of finding the best fit according to some criterion.

The first thing you must do is to choose an error function.  The most common is the sum of square errors given by

$\chi^2 = \sum_i \frac{(D(t_i)-y(t_i | \theta))^2}{2\sigma^2}$

where $\sigma$ is some estimate of the error in the data.  Often you can use the experimental uncertainty, or just make a guess; generally the answer doesn't depend too much on $\sigma$.  Next, you need to construct a likelihood function.  To fully explain what this means I would need to go into some details about statistics, but suffice it to say that a Gaussian usually works perfectly well, so I basically use

$P(D | \theta) = \exp(-\chi^2)$

Hence, the likelihood is maximal when the error is minimal.  (Strictly, the Gaussian likelihood carries a normalization factor, but it cancels in the ratios used below, so I drop it.)  The method of maximum likelihood simply finds the set of parameters that maximizes the likelihood function.
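As a concrete illustration, here is a short Python sketch of the error and likelihood functions above.  The straight-line model, the data values, and $\sigma$ are all made up for the example:

```python
import numpy as np

# Hypothetical toy setup: data D(t_i) measured at five time points,
# with a model y(t | theta) = theta[0] + theta[1]*t (a straight line).
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
D = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
sigma = 0.5  # assumed measurement error

def chi_squared(theta):
    y = theta[0] + theta[1] * t           # model prediction at the data times
    return np.sum((D - y) ** 2) / (2 * sigma ** 2)

def likelihood(theta):
    return np.exp(-chi_squared(theta))    # Gaussian likelihood, up to normalization

# A parameter set close to the data has a larger likelihood than one far from it.
print(likelihood(np.array([1.0, 2.0])) > likelihood(np.array([0.0, 0.0])))  # prints True
```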

I'll now give the simple Metropolis algorithm for doing a maximum likelihood estimate.  I'll give it in words first and then in pseudocode.  First, start with some initial guess for the parameter values and compute $\chi^2_{\rm current}$.  If there is more than one parameter, you make a guess for all of them at once.  Next, propose a new set of parameter values, calculate $\chi^2_{\rm proposed}$, and compute the likelihood ratio

$\frac{P(D | \theta_{\rm proposed})}{P(D | \theta_{\rm current}) }= \exp(-\chi^2_{\rm proposed}+\chi^2_{\rm current})$

The likelihood ratio is a positive real number.  Now pick a uniform random number $r$ between 0 and 1.  If the likelihood ratio is bigger than $r$, replace the current parameter set with the proposed set, and repeat.  (Note that any proposal that decreases $\chi^2$ is always accepted, since the ratio then exceeds 1.)  At each step of the process you store the value of the current set.  In some cases the current set won't change, but you still keep it (this will be useful when we get to Bayesian inference).  You then run this for a long time and pick the parameter set that had the maximum likelihood.  Generally, you only save the parameters after some burn-in period.

In pseudo Matlab code:

theta1 = zeros(1,p);
% initial guess for the p parameters

y = solve(theta1);
% solve gives the model time series y at the data time points
% for the parameters theta1

chi1 = sum((D - y).^2);
for n = 1:total
  theta2 = theta1 + a.*randn(1,p);
  % a is the proposal (guess) standard deviation

  y = solve(theta2);

  chi2 = sum((D - y).^2);
  ratio = exp((chi1 - chi2)/(2*sigma^2));
  if rand < ratio
    theta1 = theta2;
    chi1 = chi2;
  end
  if n > burntime
    thetasave(n-burntime,:) = theta1;
    chisave(n-burntime) = chi1;
  end
end

Then pick the row of thetasave for which chisave is minimal.
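For readers who want something they can actually run, here is a Python translation of the pseudocode above, fitting a hypothetical straight line to synthetic data.  The data, proposal width `a`, chain length, and burn-in are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: a line y = 2 + 3t with Gaussian noise.
t = np.linspace(0.0, 4.0, 20)
sigma = 0.3
D = 2.0 + 3.0 * t + sigma * rng.normal(size=t.size)

def chi_squared(theta):
    y = theta[0] + theta[1] * t
    return np.sum((D - y) ** 2) / (2 * sigma ** 2)

a = 0.05          # proposal standard deviation (a guess)
total = 20000     # number of MCMC steps
burntime = 2000   # burn-in period

theta = np.array([0.0, 0.0])   # initial guess for the p = 2 parameters
chi = chi_squared(theta)
saved_theta, saved_chi = [], []

for n in range(total):
    proposal = theta + a * rng.normal(size=theta.size)
    chi_prop = chi_squared(proposal)
    # Metropolis accept/reject: likelihood ratio vs a uniform random number
    if rng.random() < np.exp(chi - chi_prop):
        theta, chi = proposal, chi_prop
    if n >= burntime:
        saved_theta.append(theta.copy())
        saved_chi.append(chi)

best = saved_theta[int(np.argmin(saved_chi))]
print(best)   # maximum-likelihood sample; with enough steps it approaches (2, 3)
```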

Two questions you will have are how long to run and how to choose the proposals.  Unfortunately, there are no firm answers.  A simple thing to do is to propose a Gaussian random variable centered on the current value.  However, if you have a bound on a parameter, such as requiring it to be positive, then you need the Metropolis-Hastings algorithm, which generalizes the scheme to non-symmetric proposals.  As for how long to run, I usually just run the MCMC, increase the time, run again, and see how much the answer changes.  If the parameters start to converge then you can be confident that you have run long enough.  In principle, if you wait long enough the MCMC will sample all of parameter space and find the global optimum.  In practice, it can get stuck in local minima for a long time before getting out.
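To illustrate the Metropolis-Hastings idea for a positivity-constrained parameter, here is a Python sketch using a Gaussian proposal truncated to positive values; the asymmetry of the truncated proposal is compensated by the ratio of normal CDFs in the acceptance ratio.  The exponential-decay model, the data, and the tuning constants are all assumptions made up for the example:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: one positive parameter, an exponential decay rate.
t = np.linspace(0.0, 3.0, 15)
sigma = 0.05
D = np.exp(-1.5 * t) + sigma * rng.normal(size=t.size)  # true rate k = 1.5

def chi_squared(k):
    return np.sum((D - np.exp(-k * t)) ** 2) / (2 * sigma ** 2)

def Phi(x):
    # standard normal CDF, used to normalize the truncated proposal
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

a = 0.1                    # proposal standard deviation (a guess)
k = 0.5                    # initial guess
chi = chi_squared(k)
samples = []

for n in range(20000):
    # propose from a Gaussian truncated to k' > 0 (resample until positive)
    k_prop = k + a * rng.normal()
    while k_prop <= 0.0:
        k_prop = k + a * rng.normal()
    chi_prop = chi_squared(k_prop)
    # Metropolis-Hastings: likelihood ratio times the proposal-asymmetry correction
    ratio = math.exp(chi - chi_prop) * Phi(k / a) / Phi(k_prop / a)
    if rng.random() < ratio:
        k, chi = k_prop, chi_prop
    if n >= 2000:
        samples.append(k)

print(np.mean(samples))   # posterior mean of the decay rate
```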

### 21 Responses to “MCMC and fitting models to data”

1. Memming Says:

Nice article. I would love to see where this goes. I have a few questions:
If the mean square error is used as my cost function, why can’t I do simple gradient descent with multiple random initializations? I guess if you don’t have a gradient, then there’s no choice (and it’s not Bayesian). Intuitively this method seems like it’s going to take a long time to find the global optimal solution. Do you know if there is a bound on the convergence rate?

2. Carson Chow Says:

You could certainly do a gradient descent and in fact that would be the standard way of doing maximum likelihood estimates. But as you point out, you would need to estimate the gradient and you could get stuck in local minima, depending on how you choose your initial conditions. There is no guaranteed way to find global optima so it’s not clear if the MCMC is going to take longer than any other method. There are no good bounds on convergence rates. Solving that problem would be a major advance. The main reason to use MCMC is that it is the simplest way to estimate the posterior distribution.

3. Memming Says:

Interesting. Thanks for the reply. :)

4. Bayesian model comparison « Scientific Clearing House Says:

[...] model comparison By Carson Chow This is the follow up post to my earlier one on the Markov Chain Monte Carlo (MCMC) method for fitting models to data.  I really should have covered Bayesian parameter [...]

5. Bayesian parameter estimation « Scientific Clearing House Says:

[...] is the third post on Bayesian inference.  The other two are here and here. This probably should be the first one to read if you are completely unfamiliar with the topic.  [...]

6. Chris Says:

Very helpful. I’m curious about the standard deviation ‘a’ though. What exactly is it? Is it the standard deviation of each of the model parameters from their accepted value and if so, what is the accepted value? Or is it the standard deviation of the current guess and the data, and if so then why would each model parameter be deviated by the same amount?

7. Carson Chow Says:

The standard deviation of the parameter value is the standard deviation of the posterior probability density for the parameter. If you read my other posts on Bayesian parameter estimation, the stationary distribution of the MCMC is the posterior distribution of the parameter probability given the data. Given the posterior you can then choose the mean, median, or mode to represent your best guess. However, the posterior represents the best guess of what the parameter is given the likelihood function, prior and data.

8. jaipal methary Says:

excellent

9. Kenny Says:

Do you have a reference or text book for the all the details?

10. Carson Chow Says:

Try Phil Gregory’s book.

11. Kenny Says:

Hi Carson,

What do you use as the proposal distribution for choosing the next guess for the parameter values? I assume it is symmetric since the accept/reject step is determined purely by the likelihood ratio. Am I also right in assuming that your algorithm uses uniform priors (again because the accept/reject step only depends on the likelihood ratio)?

12. Carson Chow Says:

Hi Kenny,

All excellent questions. I usually use a Normal proposal distribution around my current point. What do you mean by symmetric? If you mean bigger or smaller than your current value then it doesn’t need to be. If you mean symmetric between swapping proposal point and current point then it does for Metropolis but not for Metropolis-Hastings, which I did not cover. The prior will be uniform if you let the MCMC roam anywhere. However, if you put boundaries on where it can go then you have effectively included a prior on that domain. You could also modify the proposal distribution depending on where you are, which is like adding a prior.

13. Kenny Says:

Thanks Carson!

14. Kenny Says:

Hi Carson,

I also wanted to ask you about some reference information about your research in obesity. I just finished ( May 2012) my PhD in applied math from Brown (with a focus on probability, stats, and numerical analysis) and I have been reading your blog in Obesity. I am really interested in your work and I was wondering what would be some good introductory text and/or paper to help become more familiar with your research.

Best,
Kenny

15. Carson Chow Says:

Good question. People have suggested that I write a book:). In terms of my research, this link has all of my obesity references

http://sciencehouse.wordpress.com/2012/05/23/references/

The PLoS Comp Bio paper is a good place to start. The Lancet series summarizes the applications of the model. There is a good book by Keith Frayn on Human metabolism.

16. Fernando Says:

Your procedure is interesting. Is it possible for you to post a working example with a small dataset, even using a linear function for y (the model)?

17. Carson Chow Says:

@Fernando, do you mean the full code or the outputs?

18. Is irrationality necessary? « Scientific Clearing House Says:

[...] on before, one of the best ways to find an optimal solution to a problem is to search randomly, the Markov Chain Monte Carlo method being the quintessential example. Randomness is useful for searching in places you [...]

19. Jose Says:

Hi Carson,

One doubt: is this likelihood always between 0 and 1? I have tried to replicate this code and I am seeing values above 1. I was assuming it would have to be in [0,1] for the comparison to the uniform number to make sense (or even for the definition of the likelihood). Could you please let me know what I am missing?

Thanks for the entry!

20. Carson Chow Says:

Likelihood is not a probability. It can be any nonnegative number.

21. Bayesian model comparison part 2 | Scientific Clearing House Says:

[…] parameters can be accomplished using the Markov Chain Monte Carlo, which I summarized previously here. We will start by defining the partition […]