MCMC and fitting models to data

As I have posted before, I never learned any statistics during my education as a theoretical physicist/applied mathematician.  However, it became fairly apparent after I entered biology (although I managed to avoid it for a few years) that fitting models to data and estimating parameters is unavoidable.   As I have opined multiple times previously, Bayesian inference and the Markov Chain Monte Carlo (MCMC) method is the best way to do this.   I thought I would provide the details on how to implement the algorithm since most of the treatments on the subject are often couched in statistical language that may be hard to decipher for someone with a physics or dynamical systems background.  I’ll do this in a series of posts and in a nonstandard order.  Instead of building up the entire Bayesian formalism, I thought I would show how the MCMC method can be used to do the job and then later show how it fits into the grand scheme of things to do even more.

Suppose you have data D(t_i), which consists of some measurements (either a scalar or a vector) at discrete time points t_i and a proposed model, which produces a time series y(t | \theta),  where \theta represents a set of free parameters that changes the function.  This model could be a system of differential equations or  just some proposed function like a polynomial.    The goal is to find the set of parameters that best fits the data and to evaluate how good the model is.  Now, the correct way to do this is to use Bayesian inference and model comparison, which can be computed using the MCMC.   However, the MCMC can also be used just to get the parameters in the sense of finding the best fit according to some criterion.

The first thing you must do is to choose an error function.  The most common is the sum of square errors given by

\chi^2 = \sum_i \frac{(D(t_i)-y(t_i | \theta))^2}{2\sigma^2}

where the \sigma is some estimate of the error of the data.   Often, you can use the experimental uncertainty or just make some guess.  Generally, the answer doesn’t depend too much on \sigma.  Next, you need to construct a likelihood function.  To fully explain what this means, I would need to go into some details about statistics but suffice it to say that usually a Gaussian works perfectly well so I basically use

P(D | \theta) = \exp(-\chi^2)

Hence, the likelihood is maximal when the error is minimal.  The method of Maximum Likelihood is simply finding the set of parameters that maximizes the likelihood function.

I’ll now give the simple Metropolis algorithm to do a maximum likelihood estimation.  I’ll give it in words first and then write it out in pseudo code.  The first thing is to start with some initial guess for the parameter values and compute \chi^2_{\rm current}.  If it is more than one parameter you just make a guess for all of them at once.  Next  propose a new set of parameter values and calculate \chi^2_{\rm proposed} and compute the likelihood ratio

\frac{P(D | \theta_{\rm proposed})}{P(D | \theta_{\rm current}) }= \exp(-\chi^2_{\rm proposed}+\chi^2_{\rm current})

The likelihood ratio is a positive real number.  You now pick a uniform random number r between 0 and 1.  If the likelihood ratio is bigger than r you replace the current parameter set with the proposed set and then repeat.  At each step of the process you store the value of the current set.  In some cases, the current set won’t change but you still keep it (this will be useful when we get to the Bayesian inference).  You then run this for a long time and pick the parameter set that had the maximum likelihood.  Generally,  you just save the parameters after some burn in period.

In pseudo Matlab code:

theta1 = zeros(1,p);
%  initial guess for p parameters

solve (y,theta1);
% solve  gives the vector y at the desired time points
% for parameters theta1

chi1= sum((D - y)^2);
for n = 1: total;
theta2 =theta1+ a.* randn(1,p)
% p is the number of parameters and
% a is the guess standard deviation

solve(y,theta2);

chi2=sum((D-y)^2);
ratio=exp((-chi2+chi1)/2*sigma^2);
if rand < ratio
theta1=theta2;
chi1=chi2;
end
if n > burntime
thetasave(n-burntime)=theta1;
chisave(n-burntime)=chi1;
end
end

Pick thetasave such that chisave is minimal.

Two questions you will have is how long do you run and how do you choose the guesses.  Unfortunately, there are no firm answers.  A  simple thing to do is to pick a Gaussian random variable around the current value as a guess.  However, if you have a bound on the parameter, such as it has to be positive, then you need to use the Metropolis-Hastings algorithm, which generalizes to having non-symmetric guesses.  As for how long to run, I usually just run the MCMC, increase the time and run again and see how much the answer changes.  If the parameters start to converge then you can be confident that you have run long enough.  In principle, if you wait long enough the MCMC will sample all of guess space and find the global optimum.  In practice, it can get stuck in local minima for a long time before getting out.

Jun 29, 2015 clarification: My definition of \chi^2 is not the same as what is generally used in statistics. For convenience, I defined \chi^2 with a 2 in the denominator such that the likelihood would be given by \exp(-\chi^2). However, in statistics, one would not include the factor of 2 in the \chi^2 denominator, and the thus the likelihood would be \exp(-\chi^2/2). This doesn’t really make a difference for the algorithm but could be an issue if you wanted to compare to other results.

Advertisements

52 thoughts on “MCMC and fitting models to data

  1. Nice article. I would love to see where this goes. I have a few questions:
    If the mean square error is used as my cost function, why can’t I do simple gradient descent with multiple random initializations? I guess if you don’t have a gradient, then there’s no choice (and it’s not Bayesian). Intuitively this method seems like it’s going to take a long time to find the global optimal solution. Do you know if there is a bound on the convergence rate?

    Like

  2. You could certainly do a gradient descent and in fact that would be the standard way of doing maximum likelihood estimates. But as you point out, you would need to estimate the gradient and you could get stuck in local minima, depending on how you choose your initial conditions. There is no guaranteed way to find global optima so it’s not clear if the MCMC is going to take longer than any other method. There are no good bounds on convergence rates. Solving that problem would be a major advance. The main reason to use MCMC is that it is the simplest way to estimate the posterior distribution.

    Liked by 1 person

  3. Very helpful. I’m curious about the standard deviation ‘a’ though. What exactly is it? Is it the standard deviation of each of the model parameters from their accepted value and if so, what is the accepted value? Or is it the standard deviation of the current guess and the data, and if so then why would each model parameter be deviated by the same amount?

    Like

  4. The standard deviation of the parameter value is the standard deviation of the posterior probability density for the parameter. If you read my other posts on Bayesian parameter estimation, the stationary distribution of the MCMC is the posterior distribution of the parameter probability given the data. Given the posterior you can then choose the mean, median, or mode to represent your best guess. However, the posterior represents the best guess of what the parameter is given the likelihood function, prior and data.

    Like

  5. Hi Carson,

    What do you use as the proposal distribution for choosing the next guess for the parameter values? I assume it it symmetric since the accept/ reject step is determined purely by the likelihood ratio. Am I also right in assuming that your algorithm uses uniform priors (again because the accept/reject step only depends on the likelihood ratio)?

    Liked by 1 person

  6. Hi Kenny,

    All excellent questions. I usually use a Normal proposal distribution around my current point. What do you mean by symmetric? If you mean bigger or smaller than your current value then it doesn’t need to be. If you mean symmetric between swapping proposal point and current point then it does for Metropolis but not for Metropolis-Hastings, which I did not cover. The prior will be uniform if you let the MCMC roam anywhere. However, if you put boundaries on where it can go then you have effectively included a prior on that domain. You could also modify the proposal distribution depending on where you are, which is like adding a prior.

    Liked by 1 person

  7. Hi Carson,

    I also wanted to ask you about some reference information about your research in obesity. I just finished ( May 2012) my PhD in applied math from Brown (with a focus on probability, stats, and numerical analysis) and I have been reading your blog in Obesity. I am really interested in your work and I was wondering what would be some good introductory text and/or paper to help become more familiar with your research.

    Best,
    Kenny

    Like

  8. Interesting your procedure. It is posible for you to post a working example with few data even using a linear funtion for y (model)?

    Like

  9. Hi Carson,

    One doubt: is this likelihood always between 0 and 1? i have tried to replicate this code and i am seeing values above 1. I was assuming it would have to be [0,1] for the comparison to the uniform number to make sense (or even the definition of the likelihood). Could you please let me know what i am missing?

    Thanks for the entry!

    Like

  10. Carson,
    Thx for your post, it has helped me to learn and progress in this important area of Bayesian inference. I’ve played with some home-brewed MCMC today and have been able to fit a few models nicely. However, when generating samples without any noise as incoming data (e.g. perfectly sitting on a line in case of a linear model ax+b),
    the means of the two MCMC chains converge nicely to the true a and b, but the posterior distributions of a and b are still pretty wide, and practically independent on the width of the MCMC updating process.This translates into quite some residual uncertainty in the fit parameters, no matter how many ‘perfect on a line’ datapoints were entered. Intuitively, I would say that there is hardly any uncertainty left. How should I interpret the wide distributions of a and b?
    Peter

    Like

  11. @Pierre A likelihood is not a probability because the integral over that conditional variable need not be one. So while \int P(x|a) dx= 1, \int P(x|a) da does not need to be 1. In some cases, it will be 1 in which case it acts like a probability but there is no constraint that it must.

    Like

  12. Thanks for the most comprehensive introduction to MCMC that I’ve seen to date!

    I may have missed something fairly basic here, but could you possibly explain the reasoning behind comparing the likelihood ratio to a random number between 0 and 1 (rather than just 1)?

    Like

  13. @Jane If you only compared the likelihood ratio to 1 then you are in effect doing a gradient descent, which can get trapped in local minima. Comparing to a random number let’s you go “uphill” every once in awhile. The MCMC will then wander over the objective function landscape, spending more time when the likelihood is high and less time when it is low. The histogram of the parameter values it visits in this wandering trek is proportional to the Bayesian posterior of this parameter.

    Like

  14. @Valerio There are many textbooks on Bayesian Inference. Gregory’s book is pretty good if you’re a physicist and Gelman et al. is good if you have more of a stats background.

    Like

  15. Hi Carson Chow, thank you very much for your reply.
    I got the second book. Unfortunately I cannot find the part about Markov Chain Monte Carlo and Chi Square. I am interested precisaly in that. Your method is interesting, but I am not sure it is correct, so I was looking for any other source that presented it. Is this method (applying MCMC to X^2) your invention?

    Valerio

    Like

  16. Jose, I cannot follow that link, can you give me the doi or the article title? Thank you very much!

    Like

  17. Sorry, something is wrong with the URL. Try google: ” A Comparison of Least-Squares and Bayesian Techniques in Fitting the Orbits of Extrasolar Planets.” by Driscoll, P.; Fischer, D.

    Like

  18. sir,
    can you post pseudo code for estimating two parameters(a,b) in a linear fit (ax+b). actually i working on it. but i am not satisfied with the results i am getting. so it will be helpful if you post pseudo code

    Like

  19. Hi all (and specially Carson),

    First of all, thanks for your post, Carson. Where can I read more about the justification behind constructing a likelihood function like you do ? If you have a reference which is understandable for an engineering mathematics level, that would be great.

    For parameter estimation in a DAE model considering data with replicates:

    I had thought to determine the likelihood of the data given a certain run of the model with certain values of parameters this way:

    1- Run the deterministic model, retrieving the observed variable output value, for a certain value of the independent variable which I have measurements.

    2- Think of this output values as the mean of a gaussian distribution, its variance coming from an estimation of the observed values variance (since I have replicates I can do this for each value of the independent variable).

    3- The likelihood of the data given this value of parameters becomes the product of each observed replicate’s probability in the gaussian distribution constructed around the output of the model.

    Does this procedure make any sense ? Is it equivalent to just using a least-squares likelihood like you do around the observed mean for each value of the independent variable ?

    I would appreciate any help or reference

    Like

  20. @Claudio A book on Bayesian inference meant for physicists and engineers is one by Phil Gregory

    http://www.cambridge.org/us/academic/subjects/statistics-probability/statistics-physical-sciences-and-engineering/bayesian-logical-data-analysis-physical-sciences-comparative-approach-mathematica-support

    Determining the likelihood function depends on what you are interested in. The least squares likelihood in my example is used in a normal distribution so I believe my example produced is the same as yours. The sum of squares in the exponent of a Gaussian is equivalent to the product of individual Gaussians.

    Like

  21. Hi Carson,

    do you have a reference for choosing the sigma in the likelihood function? I like your article and but a paper or book reference would help me a lot.

    Thanks!

    Eike

    Like

  22. I learned a lot from your article. I have a question about a in theta2 =theta1+ a.* randn(1,p). I still did not understand how to get this a. And is this a the same for all the parameters, or for different parameters we have different a.
    Thanks

    Like

  23. @Xiao There is no rigorous theory behind the choice of a. The algorithm will work for any choice. You want a to be as big as possible so you search more space but not so small that you never accept any guesses. A general rule of thumb is to choose an a so that the MCMC algorithm is accepting a guess about one third of the time. There is still a lot of art in parameter fitting.

    Like

  24. @Carson Thank you for your reply. What I want to do is to use the sum of several gaussian function to fit my data. I wrote a matlab code based on the pseudo code. I tried to change different a, but the result is quite different every time. It does not converge. Do you mind my sending my data and code to you and have a look at it?
    Thanks

    Like

  25. @Carson They fit my data. I have tried nonlinear least square and EM algorithm to use Gaussian function to fit my data. And they can give me a pretty good result. I just want to try to use this method on my problem.

    Like

  26. Excellent post. It helped me starting out with MCMC parameter estimation. May I ask how fitting positive parameters only work in practice?

    “However, if you have a bound on the parameter, such as it has to be positive, then you need to use the Metropolis-Hastings algorithm, which generalizes to having non-symmetric guesses. ”

    More specifically, how do I generate a new parameter proposal? And what happens to the likelihood function? I’m a real beginner so please excuse my ignorance.

    Like

  27. @rblilja You can keep parameters positive by choosing a positive only distribution such as a gamma distribution or a truncated normal distribution, i.e. a normal distribution where you ignore negative numbers. Let g(x’ | x) be the guess probability density, i.e. given positive current x’ select positive x with probability g(x’ | x) dx. Make sure to normalize g. The modification to the MCMC is the Hastings part of Metropolis-Hastings. Instead of using the ratios of likelihoods, you test with this function (P(D | x’)g(x | x’))/(P(D | x)g(x’ | x)).

    Like

  28. @Carson Thank you. I decided to go for a gamma distribution as target, ending up with the Matlab code below. I have as well been inspired by the examples found here: https://darrenjw.wordpress.com/2012/06/04/metropolis-hastings-mcmc-when-the-proposal-and-target-have-differing-support/

    I’m not sure I followed your advice entirely, but I end up with results that looks feasible and are quite consistent from run to run. I’m not sure how to calculate the guess probability for the acceptance ratio since I’m using different standard deviations when generating a new proposal (param_sigma). Perhaps I’m doing this completely wrong?

    If you can point out any mistakes or suggest an improvement I would be more than happy. Probability is a weak spot for me.

    % Init all vectors and matrices, setup initial theta and calculate initial cost etc.
    data_sigma = 1.0;
    param_sigma = [0.01 0.1 0.01 1.0];

    for i=1:iterations

    % Generate a non-negative proposal
    theta_proposal(i,:) = theta_current(i,:) + param_sigma.*randn(1, dim);
    %
    while sum(theta_proposal(i,:) < zeros(1,dim))

    theta_proposal(i,:) = theta_current(i,:) + param_sigma.*randn(1, dim);
    end

    % Evaluate cost of proposed theta (fcost = sum((data-f(theta)').^2);)
    chi_proposal(i) = fcost(theta_proposal(i,:), data) / 2*data_sigma*data_sigma;

    % Gamma(2,1) target distribution
    P_proposal = chi_proposal(i)*exp(-chi_proposal(i));
    P_current = chi_current(i)*exp(-chi_current(i));

    % Guess probability density
    g_proposal = normalCdf(chi_proposal(i), 0, 1); % what sigma to use here?
    g_current = normalCdf(chi_current(i), 0, 1); % what sigma to use here?

    % Calculate acceptance ratio
    acceptance_ratio(i) = P_proposal*g_current / P_current*g_proposal;

    fprintf('Iteration %d of %d:\n', i, iterations);
    fprintf('\tTheta current: %s\n', sprintf('%f ', theta_current(i,:)));
    fprintf('\tTheta current cost: %f\n', chi_current(i));
    fprintf('\tTheta proposal: %s\n', sprintf('%f ', theta_proposal(i,:)));
    fprintf('\tTheta proposal cost: %f\n', chi_proposal(i));

    uniform(i) = rand;

    % Descide if taking the proposed step
    if uniform(i) < acceptance_ratio(i)

    % Take step
    theta_current(i+1,:) = theta_proposal(i,:);
    chi_current(i+1) = chi_proposal(i);
    else
    % Remain
    theta_current(i+1,:) = theta_current(i,:);
    chi_current(i+1) = chi_current(i);
    end
    end

    Like

  29. I just would like to emphasise that sometimes sigma in the Chi square is important. If you have two models that give two different fits (one better than the other) and that you compute the 2 Chi squares, depending on your observed data, the value will be very different with and without sigma. In other words, if you have data lying on a large range of Y values (several order of magnitude), then, without sigma, the low Y values will have a lower importance than the high Y values, as such, when the model is completely wrong at reproducing the low Y values, the Chi square won’t change much compare to a good model, as long as the large Y values are well represented. If you add a sigma (could be 1% of the observed data), then the low Y values will have the same “weight” as the large Y values, and the relative Chi square between two models, one representing well the low Y values and one badly representing the low Y values will be much larger. Hope I’m clear, not easy to explain with word, but I think this is important to say since in your text you mention that sigma is not a determinant parameter.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s