Bayesian model comparison Part 2

May 11, 2013 (updated May 13, 2013) Carson Chow Bayes, Pedagogy, Probability, Statistics

In a previous post, I summarized the Bayesian approach to model comparison, which requires calculating the Bayes factor between two models. Here I will show one computational approach that I use, called thermodynamic integration, which is borrowed from molecular dynamics. Recall that we need to compute the model likelihood function

$P(D|M)=\int P(D|M,\theta)P(\theta|M) d\theta$ (1)

for each model, where $P(D|M,\theta)$ is just the parameter-dependent likelihood function we used to find the posterior probabilities for the parameters of the model.
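To make equation (1) concrete, here is a minimal sketch that evaluates the integral by brute-force quadrature for a one-parameter toy model. The model is my own illustrative assumption, not something from the post: data $y_i \sim N(\theta, \sigma^2)$ with prior $\theta \sim N(0, \tau^2)$, chosen because its model likelihood has a closed form to compare against. All function names are hypothetical.

```python
import numpy as np

def log_likelihood(theta, y, sigma):
    """ln P(D|M,theta) for i.i.d. y_i ~ Normal(theta, sigma^2); theta may be an array."""
    theta = np.atleast_1d(theta)
    resid = y[None, :] - theta[:, None]  # shape (n_theta, n_data)
    return (-0.5 * y.size * np.log(2 * np.pi * sigma**2)
            - 0.5 * np.sum(resid**2, axis=1) / sigma**2)

def log_prior(theta, tau):
    """ln P(theta|M) for the assumed prior theta ~ Normal(0, tau^2)."""
    theta = np.atleast_1d(theta)
    return -0.5 * np.log(2 * np.pi * tau**2) - 0.5 * theta**2 / tau**2

def log_evidence_quadrature(y, sigma, tau, n_grid=20001):
    """ln P(D|M) from equation (1), integrating over theta on a fixed grid."""
    theta = np.linspace(-10 * tau, 10 * tau, n_grid)
    log_f = log_likelihood(theta, y, sigma) + log_prior(theta, tau)
    shift = log_f.max()  # factor out the peak so exp() does not underflow
    f = np.exp(log_f - shift)
    integral = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(theta))  # trapezoidal rule
    return shift + np.log(integral)

def log_evidence_exact(y, sigma, tau):
    """Closed form: y ~ Normal(0, sigma^2 I + tau^2 11^T) after integrating out theta."""
    n, s2, t2 = y.size, sigma**2, tau**2
    log_det = (n - 1) * np.log(s2) + np.log(s2 + n * t2)
    quad = (np.sum(y**2) - t2 * np.sum(y) ** 2 / (s2 + n * t2)) / s2
    return -0.5 * (n * np.log(2 * np.pi) + log_det + quad)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.normal(1.0, 1.0, size=20)
    print(log_evidence_quadrature(y, 1.0, 2.0), log_evidence_exact(y, 1.0, 2.0))
```

Quadrature like this only works when there are one or two parameters, which is why a sampling-based approach is needed for realistic models.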

The integration over the parameters can be accomplished using Markov Chain Monte Carlo (MCMC), which I summarized in a previous post. We will start by defining the partition function

$Z(\beta) = \int P(D|M,\theta)^\beta P(\theta| M) d\theta$ (2)

where $\beta$ is an inverse temperature. The derivative of the log of the partition function gives

$\frac{d}{d\beta}\ln Z(\beta)=\frac{\int d\theta \ln[P(D |\theta,M)] P(D | \theta, M)^\beta P(\theta|M)}{\int d\theta \ P(D | \theta, M)^\beta P(\theta | M)}$ (3)

which is equal to the ensemble average of $\ln P(D|\theta,M)$ under the distribution proportional to $P(D|\theta,M)^\beta P(\theta|M)$. If we assume that the MCMC run at inverse temperature $\beta$ has reached stationarity, then we can replace the ensemble average with a time average $\frac{1}{T}\sum_{i=1}^T \ln P(D|\theta_i, M)$ over the sampled parameters $\theta_i$. Integrating (3) over $\beta$ from 0 to 1 gives

$\ln Z(1) = \ln Z(0) + \int_0^1 \langle \ln P(D|M,\theta)\rangle d\beta$

From (1) and (2), we see that $Z(1)=P(D|M)$, which is what we want to compute, and $Z(0)=\int P(\theta|M) d\theta=1$, so that $\ln Z(0)=0$.
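The identity can be checked numerically without any sampling in a case where the ensemble average is known in closed form. The sketch below assumes a toy model of my own choosing (not from the post): $y_i \sim N(\theta, \sigma^2)$ with prior $\theta \sim N(0, \tau^2)$. For this model the distribution proportional to $P(D|\theta,M)^\beta P(\theta|M)$ is itself Gaussian, so $\langle \ln P(D|\theta,M) \rangle$ at each $\beta$ can be written down exactly and integrated on a fine $\beta$ grid. Function names are hypothetical.

```python
import numpy as np

def mean_log_likelihood(beta, y, sigma, tau):
    """Closed-form <ln P(D|theta,M)> at inverse temperature beta (scalar or array)."""
    beta = np.asarray(beta, dtype=float)
    n, s1, s2 = y.size, np.sum(y), np.sum(y**2)
    # Tempered posterior for theta is Normal(mean, var)
    var = 1.0 / (beta * n / sigma**2 + 1.0 / tau**2)
    mean = beta * s1 / sigma**2 * var
    # E[sum_i (y_i - theta)^2] under Normal(mean, var)
    expected_sq = s2 - 2.0 * mean * s1 + n * (mean**2 + var)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - 0.5 * expected_sq / sigma**2

def log_evidence_by_beta_integral(y, sigma, tau, n_beta=100001):
    """ln Z(1) = integral over beta in [0, 1] of <ln P(D|theta,M)>, using ln Z(0) = 0."""
    betas = np.linspace(0.0, 1.0, n_beta)
    f = mean_log_likelihood(betas, y, sigma, tau)
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(betas))  # trapezoidal rule

def log_evidence_exact(y, sigma, tau):
    """Closed form: y ~ Normal(0, sigma^2 I + tau^2 11^T) after integrating out theta."""
    n, s2, t2 = y.size, sigma**2, tau**2
    log_det = (n - 1) * np.log(s2) + np.log(s2 + n * t2)
    quad = (np.sum(y**2) - t2 * np.sum(y) ** 2 / (s2 + n * t2)) / s2
    return -0.5 * (n * np.log(2 * np.pi) + log_det + quad)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.normal(1.0, 1.0, size=20)
    print(log_evidence_by_beta_integral(y, 1.0, 2.0), log_evidence_exact(y, 1.0, 2.0))
```

In this toy model the integrand changes fastest near $\beta=0$, where the sampled distribution moves from the prior toward the posterior, which is why the check uses a very fine grid.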

Hence, to perform Bayesian model comparison, we simply run the MCMC for each model at different temperatures (i.e. use $P(D|M,\theta)^\beta$ as the likelihood in the standard MCMC) and then integrate the averaged log likelihoods over $\beta$ at the end to obtain $\ln Z(1)$. For a Gaussian likelihood function, changing temperature is equivalent to changing the data “error”: raising the likelihood to the power $\beta$ replaces the variance $\sigma^2$ in the exponent with $\sigma^2/\beta$. The higher the temperature, the larger the presumed error. In practice, I usually run at seven to ten different values of $\beta$ and use a simple trapezoidal rule to integrate over $\beta$. I can even do parameter inference and model comparison in the same MCMC run.
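The procedure can be sketched end to end as follows. This is a minimal illustration under my own assumptions, not the code behind the post: a one-parameter toy model ($y_i \sim N(\theta, \sigma^2)$, prior $\theta \sim N(0, \tau^2)$), a plain random-walk Metropolis sampler with a fixed step size, and ten evenly spaced values of $\beta$ (the post does not say how the values are spaced). All function names and tuning constants are hypothetical.

```python
import numpy as np

def log_likelihood(theta, y, sigma):
    """ln P(D|M,theta) for i.i.d. y_i ~ Normal(theta, sigma^2)."""
    return (-0.5 * y.size * np.log(2 * np.pi * sigma**2)
            - 0.5 * np.sum((y - theta) ** 2) / sigma**2)

def log_prior(theta, tau):
    """ln P(theta|M) for the assumed prior theta ~ Normal(0, tau^2)."""
    return -0.5 * np.log(2 * np.pi * tau**2) - 0.5 * theta**2 / tau**2

def mean_log_like_at_beta(beta, y, sigma, tau, rng,
                          n_steps=8000, burn_in=2000, step=0.7):
    """Time average of ln P(D|M,theta) from Metropolis sampling of L^beta * prior."""
    theta = 0.0
    ll, lp = log_likelihood(theta, y, sigma), log_prior(theta, tau)
    total = 0.0
    for i in range(n_steps):
        prop = theta + step * rng.normal()
        ll_prop, lp_prop = log_likelihood(prop, y, sigma), log_prior(prop, tau)
        # Only the likelihood is tempered; the prior enters at full strength
        log_ratio = beta * (ll_prop - ll) + (lp_prop - lp)
        if np.log(rng.random()) < log_ratio:
            theta, ll, lp = prop, ll_prop, lp_prop
        if i >= burn_in:
            total += ll  # accumulate the untempered log likelihood
    return total / (n_steps - burn_in)

def log_evidence_ti(y, sigma, tau, betas, rng):
    """ln P(D|M) by trapezoidal integration of the averages over beta."""
    means = np.array([mean_log_like_at_beta(b, y, sigma, tau, rng) for b in betas])
    return np.sum(0.5 * (means[1:] + means[:-1]) * np.diff(betas))

def log_evidence_exact(y, sigma, tau):
    """Closed-form ln P(D|M) for this toy model, used only as a check."""
    n, s2, t2 = y.size, sigma**2, tau**2
    log_det = (n - 1) * np.log(s2) + np.log(s2 + n * t2)
    quad = (np.sum(y**2) - t2 * np.sum(y) ** 2 / (s2 + n * t2)) / s2
    return -0.5 * (n * np.log(2 * np.pi) + log_det + quad)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.normal(0.5, 1.0, size=10)
    betas = np.linspace(0.0, 1.0, 10)
    print("thermodynamic integration:", log_evidence_ti(y, 1.0, 1.0, betas, rng))
    print("closed form:              ", log_evidence_exact(y, 1.0, 1.0))
```

The estimate differs from the closed form through both Monte Carlo noise and the coarse $\beta$ grid. The Bayes factor between two models is the exponential of the difference of their two $\ln Z(1)$ values. The chain at $\beta=1$ samples the ordinary posterior, so its samples can also be kept for parameter inference.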

Erratum, 2013-5-13: I just fixed an error in the final formula.

8 thoughts on “Bayesian model comparison Part 2”

ishi says:

May 13, 2013 at 01:23

nice post though i don’t know many of the terms well. does remind me a (‘it-for’) bit of arxiv.org/abs/0901.4075v1 and arxiv.org/abs/1204.648

i do like using the idea of ‘temperature’ for some reason—‘getting hot i here’. i b(r)ought this up to some economists and they werent too sure. some say temperature is intuitively how we feel at times, or average kinetic energy, or a lagrangian multiplier (curve fitting parameter). yakovenko at u md says something its like the average income in economics — like bentham from the 1800’s). i’m agnostic.

ishi says:

May 13, 2013 at 01:30

ps i see yakoneko has a review written for foley arxiv.org/abs/1204.6483v1

Cheng Ly says:

May 13, 2013 at 08:59

Very nice post. In practice, does the chosen distribution of theta matter even though it is integrated out via P(theta|M)?

Carson Chow says:

May 13, 2013 at 09:42

@ishi temperature is like the width of the distribution

@cheng The prior distribution for theta doesn’t matter. The formula assumes it is normalized correctly but you could account for it even if it isn’t.

Carson Chow says:

May 13, 2013 at 09:53

@cheng Actually, if you don’t set ln Z(0)=0 then it doesn’t matter what the prior is as long as it is integrable.

ishi says:

May 16, 2013 at 04:13

refreshing memory, temperature says how many microstates are accessible–at T=0, just one, at T=infinity every particle can be in any microstate with equal probability. or, its the inverse of the change in entropy with respect to energy. the view of it as an ‘average’ basically says as everyone in lake wobegon rises up above average, they have more opportunities to be anything so long as the exponential form is maintained—- they can’t be anything all the time . (it might be nice to use some non-archimedian, surreal, or other logic however in which you have an expanding universe with evolving temperature in which people can be eveything all the time (multiverse in ‘the bottle’ (gil scott heron))).

lcnature says:

January 6, 2014 at 22:49

Hi Carson,
Thanks a lot for this post! If a subset of my parameters belong to Dirichlet distribution (being bounded and having dependence on each other), is there any caution I need to take?

Carson Chow says:

January 6, 2014 at 23:13

@Icnature The MCMC is not foolproof. If the parameters depend on each other strongly, so that they sit on some low-dimensional manifold, then the MCMC will fail to converge. However, if you have a good prior then including that will help.


Scientific Clearing House

Carson C. Chow
