John Ioannidis has a recent paper in Nature Reviews Neuroscience arguing that many results in neuroscience are wrong. The argument follows his previous papers of why most published results are wrong (see here and here) but emphasizes the abundance of studies with small sample sizes in neuroscience. This both reduces the chances of finding true positives and increases the chances of obtaining false positives. Under powered studies are also susceptible to what is called the “winner’s curse” where the effect sizes of true positives are artificially amplified. My take is that any phenomenon with a small effect should be treated with caution even if it is real. If you really wanted to find what causes a given disease then you probably want to find something that is associated with all cases, not just in a small percentage of them.
Archive for the ‘Bayes’ Category
In a previous post, I summarized the Bayesian approach to model comparison, which requires the calculation of the Bayes factor between two models. Here I will show one computational approach that I use called thermodynamic integration borrowed from molecular dynamics. Recall, that we need to compute the model likelihood function
for each model where is just the parameter dependent likelihood function we used to find the posterior probabilities for the parameters of the model.
The integration over the parameters can be accomplished using the Markov Chain Monte Carlo, which I summarized previously here. We will start by defining the partition function
where is an inverse temperature. The derivative of the log of the partition function gives
which is equal to the ensemble average of . However, if we assume that the MCMC has reached stationarity then we can replace the ensemble average with a time average . Integrating (3) over from 0 to 1 gives
From (1) and (2), we see that , which is what we want to compute and .
Hence, to perform Bayesian model comparison, we simply run the MCMC for each model at different temperatures (i.e. use as the likelihood in the standard MCMC) and then integrate the log likelihoods over at the end. For a Gaussian likelihood function, changing temperature is equivalent to changing the data “error”. The higher the temperature the larger the presumed error. In practice, I usually run at seven to ten different values of and use a simple trapezoidal rule to integrate over . I can even do parameter inference and model comparison in the same MCMC run.
Erratum, 2013-5-2013, I just fixed an error in the final formula
Different internal Bayesian likelihood functions may be why we disagree. The recent shooting tragedies in Colorado and Wisconsin have set off a new round of arguments about gun control. Following the debate has made me realize that the reason the two sides can’t agree (and why different sides of almost every controversial issue can’t agree) is that their likelihood functions are completely different. The optimal way to make inferences about the world is to use Bayesian inference and there is some evidence that we are close to optimal in some circumstances. Nontechnically, Bayesian inference is a way to update the strength of your belief in something (i.e. probability) given new data. What you do is to combine your prior probability with the likelihood of the data given your internal model of the issue (and then normalize to get a posterior probability). For a more technical treatment of Bayesian inference, see here. I posted previously (see here) that I thought that drastic differences in prior probabilities is why people don’t seem to update their beliefs when faced with overwhelming evidence to the contrary. However, I’m starting to realize that the main reason might be that they have completely different models of the world, which in technical terms is their likelihood function.
Consider the issue of gun control. The anti-gun control side argue that “guns don’t kill people, people kill people’ and that restricting access to guns won’t prevent determined malcontents from coming up with some means to kill. The pro-gun control side argues that the amount of gun violence is inversely proportional to the ease of access to guns. After all, you would be hard pressed to kill twenty people in a movie theatre with a sword. The difference in these two viewpoints can be summarized by their models of the world. The anti-gun control people believe that the distribution of the will of people who would commit violence looks like this
where the horizontal line represents a level of gun restriction. In this world view, no amount of gun restriction would prevent these people from undertaking their nefarious designs. On the other hand, the pro-gun control side believes that the same distribution looks like this
in which case, the higher you set the barrier the fewer the number of crimes committed. Given these two views of the world, it is clear why new episodes of gun violence like the recent ones in Colorado and Wisconsin do not change people’s minds. What you would need to do is to teach a new likelihood function to the other side and that may take decades if at all.
I listened today to the podcast of science fiction writer Bruce Sterling’s Long Now Foundation talk from 2004 on “The Singularity: Your Future as a Black Hole”. The talk is available here. Sterling describes some of the ideas of mathematician and science fiction writer Vernor Vinge’s conception of the singularity as a scary moment in time where super human intelligence ends the human era and we have no way to predict what will happen. I won’t address the issue of whether or not such a moment in time will or not happen in the near future or ever. I’ve posted about it in the past (e.g. see here, here and here). What I do want to discuss is whether or not there can exist events or phenomena that are so incomprehensible that it will reduce us to a quivering pile of mush. I think an excellent starting point is former US Secretary of Defense, Donald Rumsfeld’s infamous speech from 2002 regarding the link between Iraq and weapons of mass destruction prior to the Iraq war, where he said:
[T]here are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – there are things we do not know we don’t know.
While Rumsfeld was mocked by the popular media for this seemingly inane statement, I actually think (geopolitical consequences aside) that it was the deepest thing I ever heard him say. Rumsfeld is a Bayesian! There is a very important distinction between known unknowns and unknown unknowns. In the first case, we can assign a probability to the event. In the second we cannot. Stock prices are known unknowns, while black swans are unknown unknowns. (Rumsfeld’s statement predates Nassim Taleb’s book.) My question is not whether we can predict black swans (by definition we cannot) but whether something can ever occur that we wouldn’t even be able to describe it much less understand it.
In Bayesian language, a known unknown would be any event for which a prior probability exists. Unknown unknowns are events for which there is no prior. It’s not just that the prior is zero, but also that it is not included in our collection (i.e. sigma algebra) of possibilities. Now, the space of all possible things is uncountably infinite so it seems likely that there will be things that you cannot imagine. However, I claim that by simply acknowledging that there exist things that I cannot possible ever imagine, is sufficient to remove the surprise. We’ve witnessed enough in the post-modern world, to assign a nonzero prior to the possibility that anything can and will happen. That is not to say that we won’t be very upset or disturbed by some event. We may read about some horrific act of cruelty tomorrow that will greatly perturb us but it won’t be inconceivably shocking. The entire world could vanish tomorrow and be replaced by an oversized immersion blender and while I wouldn’t be very happy about it and would be extremely puzzled by how it happened, I would not say that it was impossible. Perhaps I won’t be able to predict what will happen after the singularity arrives but I won’t be surprised by it.
Biophysical Journal, Volume 102, Issue 3, 399-406, 8 February 2012
William J. Heuett, Bernard V. Miller, Susan B. Racette, John O. Holloszy, Carson C. Chow, and Vipul Periwal
Abstract: Inference of the insulin secretion rate (ISR) from C-peptide measurements as a quantification of pancreatic β-cell function is clinically important in diseases related to reduced insulin sensitivity and insulin action. ISR derived from C-peptide concentration is an example of nonparametric Bayesian model selection where a proposed ISR time-course is considered to be a “model”. An inferred value of inaccessible continuous variables from discrete observable data is often problematic in biology and medicine, because it is a priori unclear how robust the inference is to the deletion of data points, and a closely related question, how much smoothness or continuity the data actually support. Predictions weighted by the posterior distribution can be cast as functional integrals as used in statistical field theory. Functional integrals are generally difficult to evaluate, especially for nonanalytic constraints such as positivity of the estimated parameters. We propose a computationally tractable method that uses the exact solution of an associated likelihood function as a prior probability distribution for a Markov-chain Monte Carlo evaluation of the posterior for the full model. As a concrete application of our method, we calculate the ISR from actual clinical C-peptide measurements in human subjects with varying degrees of insulin sensitivity. Our method demonstrates the feasibility of functional integral Bayesian model selection as a practical method for such data-driven inference, allowing the data to determine the smoothing timescale and the width of the prior probability distribution on the space of models. In particular, our model comparison method determines the discrete time-step for interpolation of the unobservable continuous variable that is supported by the data. Attempts to go to finer discrete time-steps lead to less likely models.
I think the recent uproar over the acquittal of Casey Anthony clearly shows how our internal system of inference can be at odds with the American judicial system. For those of you who don’t pay attention to the mainstream media, Casey Anthony was a young mother of a toddler that was found dead. What captivated the American public was that the toddler had been missing for a month before Anthony reported it. She lied to her parents and the authorities about the whereabouts of the child and even appeared celebratory in public during the period of the child’s disappearance. In the court of public opinion, Anthony was clearly guilty. The fact that a mother showed no anxiety whatsoever over the disappearance of a child clearly indicates that she was the culprit.
The American judicial system requires that if there is any reasonable doubt of guilt then a person must be acquitted. The burden of proof is on the prosecution. In this case, there were no witnesses and no physical evidence linking Anthony to the death of the child or that the child was even murdered. Thus there was the remote possibility that Anthony was not responsible for the child’s death but simply took advantage of the situation, as macabre as that may be. Even though the probability of these two unlikely events – a mother wishing to be free of her child and a child going missing – is exceedingly low, it is still non zero and thus the jury was forced to acquit.
Now if a single piece of evidence had linked Anthony to the crime, say a fingerprint or DNA sample, then she most likely would have been found guilty. The interesting aspect of this is that there is also an equally low probability that someone could have planted the evidence to frame her. Thus, reasonable doubt is not a global quantity according to the law. It is not sufficient that the total probability that the accused is guilty be high, it also matters if it is high in each of several categories, i.e. motive, opportunity, and direct physical evidence or witnesses. Circumstantial evidence is insufficient to convict a criminal. However, it appears that our brains do not work this way. We seem to take the global probability of guilt and go with that.
I can never remember the form of distributions so here is a cheat sheet for the density functions of some commonly used ones.
The most recent dust-up over statistical significance and being wrong regards the publication of a paper on ESP in a top psychology journal. Here’s a link to a paper criticizing the result using Bayesian analysis. Below is an excerpt from the journal Science.
Science: The decision by a top psychology journal to publish a paper on extrasensory perception (ESP) has sparked a lively discussion on blogs and in the mainstream media. The paper’s author, Daryl Bem, a respected social psychologist and professor emeritus at Cornell University, argues that the results of nine experiments he conducted with more than 1000 college students provide statistically significant evidence of an ability to predict future events. Not surprisingly, word that the paper will appear in an upcoming issue of the Journal of Personality and Social Psychology (JPSP) has provoked outrage from pseudoscience debunkers and counteraccusations of closed-mindedness from those willing to consider the possibility of psychic powers.
It has also rekindled a long-running debate about whether the statistical tools commonly used in psychology—and most other areas of science—too often lead researchers astray. “The real lesson to be learned from this is not that ESP exists, it’s that the methods we’re using aren’t protecting us against spurious results,” says David Krantz, a statistician at Columbia University…
…But statisticians have long argued that there are problems lurking in the weeds when it comes to standard statistical methods like the t test. What scientists generally want to know is the probability that a given hypothesis is true given the data they’ve observed. But that’s not what a p-value tells them, says Adrian Raftery, a statistician at the University of Washington, Seattle. In Bem’s erotic photo experiment, the p-value of less than .01 means, by definition, that there’s less than a 1% chance he would have observed these data—or data pointing to an even stronger ESP effect—if ESP does not exist. “Some people would turn that around and say there’s a 99% chance there’s something going on, but that’s wrong,” Raftery says.
Not only does this type of thinking reflect a misunderstanding of what a p-value is, but it also overestimates the probability that an effect is real, Raftery and other statisticians say. Work by Raftery, for example, suggests that p-values in the .001 to .01 range reflect a true effect only 86% to 92% of the time. The problem is more acute for larger samples, which can give rise to a small p-value even when the effect is negligible for practical purposes, Raftery says.
He and others champion a different approach based on so-called Bayesian statistics. Based on a theory developed by Thomas Bayes, an 18th century English minister, these methods are designed to determine the probability that a hypothesis is true given the data a researcher has observed. It’s a more intuitive approach that’s conceptually more in line with the goals of scientists, say its advocates. Also, unlike the standard approach, which assumes that each new experiment takes place in a vacuum, Bayesian statistics takes prior knowledge into consideration.
This is the third post on Bayesian inference. The other two are here and here. This probably should be the first one to read if you are completely unfamiliar with the topic. Suppose you are trying to model some system and you have a model that you want to match some data. The model could be a function, a set of differential equations, or anything with parameters that can be adjusted. To make this concrete, consider this classic differential equation model of the response of glucose to insulin in the blood:
where is glucose concentration, is insulin, is the action of insulin in some remote compartment, and there are five free parameters . If you include the initial values of and there are seven free parameters. The data consist of measurements of glucose and insulin at discrete time points . The goal is to find a set of free parameters so that the model fits the data at those time points.
This is the follow up post to my earlier one on the Markov Chain Monte Carlo (MCMC) method for fitting models to data. I really should have covered Bayesian parameter estimation before this post but as an inadvertent demonstration of the simplicity of the Bayesian approach I’ll present these ideas in random order. Although it won’t be immediately apparent in this post, the MCMC is the only computational method you ever really need to learn to do all Bayesian inference.
It often comes up in biology and other fields that you have several possible models that could explain some data and you want to know which model is best. The first thing you could do is to see which model fits the data best. However, if a model has more parameters then certainly it will fit the model better. You could always have a model with as many parameters as data points and that would fit the data perfectly. So, it is necessary to balance how good you fit the data with the complexity of the model. There have been several proposals to address this issue. For example, the Akaike Information Criterion (AIC) or Bayes Information Criterion (BIC). However, as I will show here, these criteria are just approximations to Bayesian model comparison.
As I have posted before, I never learned any statistics during my education as a theoretical physicist/applied mathematician. However, it became fairly apparent after I entered biology (although I managed to avoid it for a few years) that fitting models to data and estimating parameters is unavoidable. As I have opined multiple times previously, Bayesian inference and the Markov Chain Monte Carlo (MCMC) method is the best way to do this. I thought I would provide the details on how to implement the algorithm since most of the treatments on the subject are often couched in statistical language that may be hard to decipher for someone with a physics or dynamical systems background. I’ll do this in a series of posts and in a nonstandard order. Instead of building up the entire Bayesian formalism, I thought I would show how the MCMC method can be used to do the job and then later show how it fits into the grand scheme of things to do even more.
Suppose you have data , which consists of some measurements (either a scalar or a vector) at discrete time points and a proposed model, which produces a time series , where represents a set of free parameters that changes the function. This model could be a system of differential equations or just some proposed function like a polynomial. The goal is to find the set of parameters that best fits the data and to evaluate how good the model is. Now, the correct way to do this is to use Bayesian inference and model comparison, which can be computed using the MCMC. However, the MCMC can also be used just to get the parameters in the sense of finding the best fit according to some criterion.
I just gave a seminar in the math department at the University of Iowa today. I gave a talk that was similar to the one I gave at the Mathematical Biosciences Institute on Bayesian Inference for dynamical systems. My slides are here. I was lucky I made it to give this talk. My flight from Chicago O’Hare to Cedar Rapids, Iowa was cancelled yesterday and I was rebooked on another flight tonight, which wouldn’t have been of much use for my talk this afternoon. There was one fight left last evening to Cedar Rapids but bad weather cancelled several flights yesterday so there were many stranded travelers. I was number 38 on the standby list and thought I had no chance to make it out that night. However, on a whim I decided to wait it out and I started to move up the list because other people had evidently given up as well. To my great surprise and relief I was the last person to get on the plane. There was a brief scare when they asked me to get off because we exceeded the weight limitation but then they changed their mind and let us all fly. (Someone else had kindly volunteered to take my place). I learned two lessons. One is to keep close watch of your flight at all times so you can get on the standby list as soon as possible and two is that even number 38 on a plane that only seats 50 can still make it.
Karen M. Ong, John A. Blackford, Jr., Benjamin L. Kagan, S. Stoney Simons, Jr., and Carson C. Chow. A theoretical framework for gene induction and experimental comparisons PNAS 200911095; published ahead of print March 29, 2010, doi:10.1073/pnas.0911095107
This is an open access article so it can be downloaded directly from the PNAS website here.
This is a paper where group theory appears unexpectedly. The project grew out of a chance conversation between Stoney Simons and myself in 2004. I had arrived recently at the NIH and was invited to give a presentation at the NIDDK retreat. I spoke about how mathematics could be applied to obesity research and I’ll talk about the culmination of that effort in my invited plenary talk at the joint SIAM Life Sciences and Annual meeting this summer. Stoney gave a summary of his research on steroid-mediated gene induction. He showed some amazing data of the dose response curve of the amount of gene product induced for a given amount of steroid. In these experiments, different concentrations of steroid are added to cells in a dish and then after waiting awhile, the total amount of gene product in the form of luciferase is measured. Between steroid and gene product is a lot of messy and complicated biology starting with steroid binding to a steroid receptor, which translocates to the nucleus while interacting with many molecules on the way. The steroid-receptor complex then binds to DNA, which triggers a transcription cascade involving many other factors binding to DNA, which gives rise to mRNA, which is then translated into luciferase and then measured as photons. Amazingly, the dose response curve after all of this was fit almost perfectly ( > 95%) by a Michaelis-Menton or first order Hill function
where P is the amount of gene product, [S] is the concentration of steroid, Amax is the maximum possible amount of product and EC50 is the concentration of steroid giving half the maximum of product. Stoney also showed that Amax and EC50 could be moved in various directions by the addition of various cofactors. I remember thinking to myself during his talk that there must be a nice mathematical explanation for this. After my talk, Stoney came up to me and asked me if I thought I could model his data. We’ve been in sync like this ever since.
The recent tragedy at Fort Hood has people griping about missed signals that could have been used to prevent the attack. However, I will argue that is likely to be impossible to ever have a system that can screen out all terrorists without also flagging a lot of innocent people. The calculation is a simple exercise in probability theory that is often given in first year statistic classes.
Suppose we have a system in place that gives a yes Y or no response of whether or not a person is a terrorist T. Let P(T) be the probablity that a given person is a terrorist, P(T|Y) be the probability that a person is a terrorist given that the test said yes. Thus P(~T|Y)=1-P(T|Y) is the probability that one is not a terrorist even though the test said so. Using Bayes theorem we have that
P(~T|Y)=P(Y|~T) P(~T)/P(Y) (*)
where P(Y)=P(Y|T)P(T) + P(Y|~T)P(~T) is the probability of getting a yes result. Now, the probability of being a terrorist is very low. Out of the 300 million or so people in the US a small number are probably potential terrorists. The US military has over a million people on active service. Hence, the probability of not being a terrorist is very high.
From (*), we see that in order to have a low probability of flagging an innocent person we need to have P(Y|~T)P(~T)<< P(Y|T)P(T), or P(Y|~T)<< P(Y|T) P(T)/P(~T). Since P(T) is very small, P(T)/P(~T)~ P(T), so if the true positive probability P(Y|T) was near one (i.e. a test that catches all terrorists), we need the false positive probability P(Y|~T) to be much smaller than the probability of having a terrorist, which means we need a test that gives false positives at a rate of less than 1 in a million. The problem is that the true positive and false positive probabilities will be correlated. The more sensitive the test the more likely it is to get a false positive. So if you set your threshold to be very low so P(Y|T) is very high (i.e. make sure you never miss a terrorist), you’ll most certainly have P(Y|~T) to also be high. I doubt you’ll ever have a test where P(Y|T) is near one while P(Y|~T) is less than one in a million. So basically, if we want to catch all the terrorists, we’ll also have to flag a lot of innocent people.
I’m currently at the Mathematical Biosciences Institute for a workshop on Computational challenges in integrative biological modeling. The slides for my talk on using Bayesian methods for parameter estimation and model comparison are here.
Abstract: Differential equation models are often used to model biological systems. An important and difficult problem is how to estimate parameters and decide which model among possible models is the best. I will argue that Bayesian inference provides a self-consistent framework to do both tasks. In particular, Bayesian parameter estimation provides a natural measure of parameter sensitivity and Bayesian model comparison automatically evaluates models by rewarding fit to the data while penalizing the number of parameters. I will give examples of employing these approaches on ODE and PDE models.
This past Sunday, economist Paul Krugman was lamenting in a book review of Justin Fox’s book “Myth of the Rational Market” (which he liked very much) that despite this current financial crisis and previous crises, like the failure of the hedge fund Long Term Capital Management, people still believe in efficient markets as strongly as ever. The efficient market hypothesis is the basis of most of modern finance and assumes that the price of a security is always correct and that you can never beat the market. So artificial bubbles should never occur. Krugman wonders what it will take to ever change people’s minds.
I want to show here that there might be no amount of evidence that will ever change their minds and they can still be perfectly rational in the Bayesian sense. The argument can also apply to all other controversial topics. I think it is generally believed in intellectual circles that the reason there is so much disagreement on these issues is that the other side is either stupid, deluded or irrational. I want to point out that believing in something completely wrong even in the face of overwhelming evidence may arise in perfectly rational beings. That is not to say that faulty reasoning does not exist and can be dangerous. It just explains why two perfectly reasonable and intelligent people can disagree so alarmingly.
The seeds of the modern era could arguably be traced to the Enlightenment and the invention of rationality. I say invention because although we may be universal computers and we are certainly capable of applying the rules of logic, it is not what we naturally do. What we actually use, as coined by E.T. Jaynes in his iconic book Probability Theory: The Logic of Science, is plausible reasoning. Jaynes is famous for being a major proponent of Bayesian inference during most of the second half of the last century. However, to call Jaynes’s book a book about Bayesian statistics is to wholly miss Jayne’s point, which is that probability theory is not about measures on sample spaces but a generalization of logical inference. In the Jaynes view, probabilities measure a degree of plausibility.
I think a perfect example of how unnatural the rules of formal logic are is to consider the simple implication
which means – If A is true then B is true. By the rules of formal logic, if A is false then B can be true or false (i.e. a false premise can prove anything). Conversely, if B is true, then A can be true or false. The only valid conclusion you can deduce from is that if B is false then A is false. Implication is equivalent to the logical statement , where means negation and means logical OR.
A paper I’ve been trying to get published for two years will finally appear in the American Journal of Physiology – Regulatory, Integrative, and Comparative Physiology. The goal of this paper was to develop a quantitative model for how insulin suppresses free fatty acid levels in the blood. A little background for those unfamiliar with human metabolism. All of the body’s cells burn fuel and for most cells this can be fat, carbohydrate or protein. The brain, however, can only burn glucose, which is a carbohydrate, and ketone bodies, which are made when the body is short of glucose. Why the brain can’t burn fat is still a mystery. It is not because fat cannot cross the blood brain barrier as is sometimes claimed. Thus, the body has a reason to regulate glucose levels in the blood. It does this through hormones, the most well known of which is insulin. (more…)