Here are my slides for my recent COVID-19 talk at the Centre for Applied Mathematics in BioScience and Medicine (CAMBAM). It’s an updated version of the one I gave to the FDA.
Right now there are hundreds if not thousands of Covid-19 models floating out there. Some are better than others and some have much more influence than others and the ones that have the most influence are not necessarily the best. There must be a better way of doing this. The world’s greatest minds convened in Los Alamos in WWII and built two atomic bombs. Metaphors get thrown around with reckless abandon but if there ever was a time for a Manhattan project, we need one now. Currently, the world’s scientific community has mobilized to come up with models to predict the spread and effectiveness of mitigation efforts, to produce new therapeutics and to develop new vaccines. But this is mostly going on independently.
Would it be better if we were to coordinate all of this activity. Right now at the NIH, there is an intense effort to compile all the research that is being done in the NIH Intramural Program and to develop a system where people can share reagents and materials. There are other coordination efforts going on worldwide as well. This website contains a list of open source computational resources. This link gives a list of data scientists who have banded together. But I think we need a world wide plan if we are ever to return to normal activity. Even if some nation has eliminated the virus completely within its borders there is always a chance of reinfection from outside.
In terms of models, they seem to have very weak predictive ability. This is probably because they are all over fit. We don’t really understand all the mechanisms of SARS-CoV-2 propagation. The case or death curves are pretty simple and as Von Neumann or Ulam or someone once said, “give me 4 parameters and I can fit an elephant, give me 5 and I can make it’s tail wiggle.” Almost any model can fit the curve but to make a projection into the future, you need to get the dynamics correct and this I claim, we have not done. What I am thinking of proposing is to have the equivalent of FDA approval for predictive models. However, instead of a binary decision of approval non-approval, people could submit there models for a predictive score based on some cross validation scheme or prediction on a held out set. You could also submit as many times as you wish to have your score updated. We could then pool all the models and produce a global Bayesian model averaged prediction and see if that does better. Let me know if you wish to participate or ideas on how to do this better.
Here are the slides for my webinar at FDA today .
First draft of our paper can be downloaded from this link. I’ll post more details about it later.
Global prediction of unreported SARS-CoV2 infection from observed COVID-19 cases
Carson C. Chow 1*, Joshua C. Chang 2,3, Richard C. Gerkin 4*, Shashaank Vattikuti 1*
1Mathematical Biology Section, LBM, NIDDK, National Institutes of Health2Epidemiology and Biostatistics Section, Rehabilitation Medicine, Clinical Center, National Institutes of Health 3 mederrata 4 School of Life Sciences, Arizona State University
Summary: Estimation of infectiousness and fatality of the SARS-CoV-2 virus in the COVID-19 global pandemic is complicated by ascertainment bias resulting from incomplete and non-representative samples of infected individuals. We developed a strategy for overcoming this bias to obtain more plausible estimates of the true values of key epidemiological variables. We fit mechanistic Bayesian latent-variable SIR models to confirmed COVID-19 cases, deaths, and recoveries, for all regions (countries and US states) independently. Bayesian averaging over models, we find that the raw infection incidence rate underestimates the true rate by a factor, the case ascertainment ratio CARt that depends upon region and time. At the regional onset of COVID-19, the predicted global median was 13 infections unreported for each case confirmed (CARt = 0.07 C.I. (0.02, 0.4)). As the infection spread, the median CARt rose to 9 unreported cases for every one diagnosed as of April 15, 2020 (CARt = 0.1 C.I. (0.02, 0.5)). We also estimate that the median global initial reproduction number R0 is 3.3 (C.I (1.5, 8.3)) and the total infection fatality rate near the onset is 0.17% (C.I. (0.05%, 0.9%)). However the time-dependent reproduction number Rt and infection fatality rate as of April 15 were 1.2 (C.I. (0.6, 2.5)) and 0.8% (C.I. (0.2%,4%)), respectively. We find that there is great variability between country- and state-level values. Our estimates are consistent with recent serological estimates of cumulative infections for the state of New York, but inconsistent with claims that very large fractions of the population have already been infected in most other regions. For most regions, our estimates imply a great deal of uncertainty about the current state and trajectory of the epidemic.
I have officially thrown my hat into the ring and joined the throngs of would-be Covid-19 modelers to try to estimate (I deliberately do not use predict) the progression of the pandemic. I will pull rank and declare that I kind of do this type of thing for a living. What I (and my colleagues whom I have conscripted) are trying to do is to estimate the fraction of the population that has the SARS-CoV-2 virus but has not been identified as such. This was the goal of the Lourenco et al. paper I wrote about previously that pulled me into this pit of despair. I argued that fitting to deaths alone, which is what they do, is insufficient for constraining the model and thus it has no predictive ability. What I’m doing now is seeing whether it is possible to do the job if you fit not only deaths but also the number of cases reported and cases recovered. You then have a latent variable model where the observed variables are cases, cases that die, and cases that recover, and the latent variables are the infected that are not identified and the susceptible population. Our plan is to test a wide range of models with various degrees of detail and complexity and use Bayesian Model Comparison to see which does a better job. We will apply the analysis to global data. We’re getting numbers now but I’m not sure I trust them yet so I’ll keep computing for a few more days. The full goal is to quantify the uncertainty in a principled way.
2020-04-06: edited for purging typos
Here is my response to the paper from Oxford (Lourenco et al.) arguing that novel coronavirus infection may already be widespread in the UK and Italy. The result is based on fitting a disease spreading model, called an SIR model, to the cumulative number of deaths. SIR models usually consist of ordinary differential equations (ODEs) for the fraction of people in a given population who are susceptible to the infectious agent (S), the number infected (I), and the number recovered (R). There is one other state in the model, which is the fraction who die from the disease (D). The SIR model considers transitions between these states. In the case of ODEs, the states are treated as continuous quantities, which is not a bad approximation for a large population, and each equation in the model describes the rate of change of a state (hence differential equation). There are parameters in the model governing the rate of different interactions in each equation, for example there is a parameter for the rate of increase in S whenever an S interacts with an I, and then there is a rate of loss of an I, which transitions into either R or D. The Oxford group model D somewhat differently. Instead of a transition from I into D they consider that a fraction of (1-S) will die with some delay between time of infection and death.
They estimate the model parameters by fitting the model to the cumulative number of deaths. They did this instead of fitting directly to I because that is unreliable as many people who have Covid-19 have not been tested. They also only fit to the first 15 days from the first recorded death since they want to model what happens before social distancing was implemented. They find that the model is consistent with a scenario where the probability that an infected person gets severe enough to be flagged is low and thus the disease is much more wide spread than expected. I redid the analysis without assuming that the parameters need to have particular values (called priors in Bayesian inference and machine learning) and showed that a wide range of parameters will fit the data. This is because the model is under-constrained by death data alone so even unrealistic parameters can work. To be fair, the authors only proposed that this is a possibility and thus the population should be tested for anti-bodies to the coronavirus (SARS-CoV-2) to see if indeed there may already be herd immunity in place. However, the press has run with the result and that is why I think it is important to examine the result more closely.
In my previous post, I gave an example of how fake news could lead to a scenario of no update of posterior probabilities. However, this situation could occur just from the knowledge of technology. When I was a child, fantasy and science fiction movies always had a campy feel because the special effects were unrealistic looking. When Godzilla came out of Tokyo Harbour it looked like little models in a bathtub. The Creature from the Black Lagoon looked like a man in a rubber suit. I think the first science fiction movie that looked astonishing real was Stanley Kubrick’s 1968 masterpiece 2001: A Space Odyssey, which adhered to physics like no others before and only a handful since. The simulation of weightlessness in space was marvelous and to me the ultimate attention to detail was the scene in the rotating space station where a mild curvature in the floor could be perceived. The next groundbreaking moment was the 1993 film Jurassic Park, which truly brought dinosaurs to life. The first scene of a giant sauropod eating from a tree top was astonishing. The distinction between fantasy and reality was forever gone.
The effect of this essentially perfect rendering of anything into a realistic image is that we now have a plausible reason to reject any evidence. Photographic evidence can be completely discounted because the technology exists to create completely fabricated versions. This is equally true of audio tapes and anything your read on the Internet. In Bayesian terms, we now have an internal model or likelihood function that any data could be false. The more cynical you are the closer this constant is to one. Once the likelihood becomes insensitive to data then we are in the same situation as before. Technology alone, in the absence of fake news, could lead to a world where no one ever changes their mind. The irony could be that this will force people to evaluate truth the way they did before such technology existed, which is that you believe people (or machines) that you trust through building relationships over long periods of time.
Much has been written of the role of fake news in the US presidential election. While we will never know how much it actually contributed to the outcome, as I will show below, it could certainly affect people’s beliefs. Psychology experiments have found that humans often follow Bayesian inference – the probability we assign to an event or action is updated according to Bayes rule. For example, suppose is the probability we assign to whether climate change is real; is our probability that climate change is false. In the Bayesian interpretation of probability, this would represent our level of belief in climate change. Given new data (e.g. news), we will update our beliefs according to
What this means is that our posterior probability or belief that climate change is true given the new data, , is equal to the probability that the new data came from our internal model of a world with climate change (i.e. our likelihood), multiplied by our prior probability that climate change is real, divided by the probability of obtaining such data in all possible worlds, . According to the rules of probability, the latter is given by , which is the sum of the probability the data came from a world with climate change and that from one without.
This update rule can reveal what will happen in the presence of new data including fake news. The first thing to notice is that if is zero, then there is no update. In this binary case, this means that if we believe that climate change is absolutely false or true then no data will change our mind. In the case of multiple outcomes, any outcome with zero prior (has no support) will never change. So if we have very specific priors, fake news is not having an impact because no news is having an impact. If we have nonzero priors for both true and false then if the data is more likely from our true model then our posterior for true will increase and vice versa. Our posteriors will tend towards the direction of the data and thus fake news could have a real impact.
For example, suppose we have an internal model where we expect the mean annual temperature to be 10 degrees Celsius with a standard deviation of 3 degrees if there is no climate change and a mean of 13 degrees with climate change. Thus if the reported data is mostly centered around 13 degrees then our belief of climate change will increase and if it is mostly centered around 10 degrees then it will decrease. However, if we get data that is spread uniformly over a wide range then both models could be equally likely and we would get no update. Mathematically, this is expressed as – if then . From the Bayesian update rule, the posterior will be identical to the prior. In a world of lots of misleading data, there is no update. Thus, obfuscation and sowing confusion is a very good strategy for preventing updates of priors. You don’t need to refute data, just provide fake examples and bury the data in a sea of noise.
It’s a week away from 2015, and I have transitioned completely away from Matlab. Julia is winning the platform attention battle. It is very easy to code in and it is very fast. I just haven’t gotten around to learning python much less pyDSTool (sorry Rob). I kind of find the syntax of Python (with all the periods between words) annoying. Wally Xie and I have also been trying to implement some of our MCMC runs in Stan but we have had trouble making it work. Our model requires integrating ODEs and the ODE solutions from Stan (using our own solver) do not match our Julia code or the gold standard XPP. Maybe we are missing something obvious in the Stan syntax but our code is really very simple. Thus, we are going back to doing our Bayesian posterior estimates in Julia. However, I do plan to revisit Stan if they (or we) can write a debugger for it.
I’ve been asked in a comment to give a sample of pseudo code for an MCMC algorithm to fit a linear model to some data. See here for the original post on MCMC. With a linear model, you can write down the answer in closed form (see here), so it is a good model to test your algorithm and code. Here it is in pseudo-Julia code:
# initial guess for parameters a and b
# construct chi squared, where D is the data vector and x is the vector of the
# independent quantity
chi = norm(D - (a*x +b))^2;
for n = 1 : total;
# Make random guesses for new normally distributed a and b with mean old a and b
# and standard deviation asig and bsig
at = a + asig * randn()
bt = b + bsig * randn() chit = norm(D - (at*x + bt))^2;
# Take ratio of likelihoods, sigma is the data uncertainty
ratio=exp((-chit + chi)/(2*sigma^2));
# Compare the ratio to a uniform random number between 0 and 1,
# keep new parameters if ratio exceeds random number
if rand() < ratio a = at;
b = bt; chi = chit; end
# Keep running until convergence
I predicted that there would be an eventual push back on Big Data and it seems that it has begun. Gary Marcus and Ernest Davis of NYU had an op-ed in the Times yesterday outlining nine issues with Big Data. I think one way to encapsulate many of the critiques is that you will never be able to do true prior free data modeling. The number of combinations in a data set grows as the factorial of the number of elements, which grows faster than an exponential. Hence, Moore’s law can never catch up. At some point, someone will need to exercise some judgement in which case Big Data is not really different from the ordinary data that we deal with all the time.
I think one of the things that tends to lead us astray when we try to understand complex phenomena like evolution, disease, or the economy, is that we have this idea that they must have a single explanation. For example, recently two papers have been published in high profile journals trying to explain mammal monogamy. Although monogamy is quite common in birds it only occurs in 5% of mammals. Here is Carl Zimmer’s summary. The study in Science, which surveyed 2545 mammal species, argued that monogamy arises when females are solitary and sparse. Males must then commit to one since dates are so hard to find. The study in PNAS examined 230 primate species, for which monogamy occurs at the higher rate of 27%, and used Bayesian inference to argue that monogamy arises to prevent male infanticide. It’s better to help out at home rather than go around killing other men’s babies. Although both of these arguments are plausible, there need not be a single universal explanation. Each species could have its own set of circumstances that led to monogamy involving these two explanations and others. However, while we should not be biased towards a single explanation, we shouldn’t also throw up our hands like Hayek and argue that no complex phenomenon can be understood. Some phenomena will have simpler explanations than others but since the Kolmogorov complexity is undecidable there is no algorithm that can tell you which is which. We will just have to struggle with each problem as it comes.
I’m currently in Mt. Snow, Vermont to give a talk at the Gordon Research Conference on Computer Aided Drug Design. Yes, I know nothing about drug design. I am here because the organizer, Anthony Nicholls, asked me to give a pedagogical talk on Bayesian Inference. My slides are here. I only arrived yesterday but the few talks I’ve seen have been quite interesting. One interesting aspect of this conference is that many of the participants are from industry. The evening sessions are meant to be of more general interest. Last night were two talks about how to make science more reproducible. As I’ve posted before, many published results are simply wrong. The very enterprising Elizabeth Iorns has started something called the Reproducibility Initiative. I am not completely clear about how it works but it is part of another entity she started called Science Exchange, which helps to facilitate collaborations with a fee-for-service model. The Reproducibility Initiative piggy backs on Science Exchange by providing a service (for a fee) to validate any particular result. Papers that pass approval get a stamp of approval. It is expected that pharma would be interested in using this service so they can inexpensively check if possible drug targets actually hold up. Many drugs fail at phase three of clinical trials because they’ve been shown to be ineffective and this may be due to the target being wrong to start with.
On a final note, I flew to Albany and drove here. Unlike in the past when I would have printed out a map, I simply assumed that I could use Google Maps on my smart phone to get here. However, Google Maps doesn’t really know where Mt. Snow is. It tried to take me up a dirt road to the back of the ski resort. Also, just after I turned up the road, the phone signal disappeared so I was blind and had no paper backup. I was suspicious that this was the right way to go so I turned back to the main highway in hopes of finding a signal or a gas station to ask for directions. A few miles down Route 9, I finally did get a signal and also found a sign that led me the way. Google Maps still tried to take me the wrong way. I should have followed what I always tell my daughter – Always have a backup plan.
John Ioannidis has a recent paper in Nature Reviews Neuroscience arguing that many results in neuroscience are wrong. The argument follows his previous papers of why most published results are wrong (see here and here) but emphasizes the abundance of studies with small sample sizes in neuroscience. This both reduces the chances of finding true positives and increases the chances of obtaining false positives. Under powered studies are also susceptible to what is called the “winner’s curse” where the effect sizes of true positives are artificially amplified. My take is that any phenomenon with a small effect should be treated with caution even if it is real. If you really wanted to find what causes a given disease then you probably want to find something that is associated with all cases, not just in a small percentage of them.
In a previous post, I summarized the Bayesian approach to model comparison, which requires the calculation of the Bayes factor between two models. Here I will show one computational approach that I use called thermodynamic integration borrowed from molecular dynamics. Recall, that we need to compute the model likelihood function
for each model where is just the parameter dependent likelihood function we used to find the posterior probabilities for the parameters of the model.
The integration over the parameters can be accomplished using the Markov Chain Monte Carlo, which I summarized previously here. We will start by defining the partition function
where is an inverse temperature. The derivative of the log of the partition function gives
which is equal to the ensemble average of . However, if we assume that the MCMC has reached stationarity then we can replace the ensemble average with a time average . Integrating (3) over from 0 to 1 gives
From (1) and (2), we see that , which is what we want to compute and .
Hence, to perform Bayesian model comparison, we simply run the MCMC for each model at different temperatures (i.e. use as the likelihood in the standard MCMC) and then integrate the log likelihoods over at the end. For a Gaussian likelihood function, changing temperature is equivalent to changing the data “error”. The higher the temperature the larger the presumed error. In practice, I usually run at seven to ten different values of and use a simple trapezoidal rule to integrate over . I can even do parameter inference and model comparison in the same MCMC run.
Erratum, 2013-5-2013, I just fixed an error in the final formula
Different internal Bayesian likelihood functions may be why we disagree. The recent shooting tragedies in Colorado and Wisconsin have set off a new round of arguments about gun control. Following the debate has made me realize that the reason the two sides can’t agree (and why different sides of almost every controversial issue can’t agree) is that their likelihood functions are completely different. The optimal way to make inferences about the world is to use Bayesian inference and there is some evidence that we are close to optimal in some circumstances. Nontechnically, Bayesian inference is a way to update the strength of your belief in something (i.e. probability) given new data. What you do is to combine your prior probability with the likelihood of the data given your internal model of the issue (and then normalize to get a posterior probability). For a more technical treatment of Bayesian inference, see here. I posted previously (see here) that I thought that drastic differences in prior probabilities is why people don’t seem to update their beliefs when faced with overwhelming evidence to the contrary. However, I’m starting to realize that the main reason might be that they have completely different models of the world, which in technical terms is their likelihood function.
Consider the issue of gun control. The anti-gun control side argue that “guns don’t kill people, people kill people’ and that restricting access to guns won’t prevent determined malcontents from coming up with some means to kill. The pro-gun control side argues that the amount of gun violence is inversely proportional to the ease of access to guns. After all, you would be hard pressed to kill twenty people in a movie theatre with a sword. The difference in these two viewpoints can be summarized by their models of the world. The anti-gun control people believe that the distribution of the will of people who would commit violence looks like this
where the horizontal line represents a level of gun restriction. In this world view, no amount of gun restriction would prevent these people from undertaking their nefarious designs. On the other hand, the pro-gun control side believes that the same distribution looks like this
in which case, the higher you set the barrier the fewer the number of crimes committed. Given these two views of the world, it is clear why new episodes of gun violence like the recent ones in Colorado and Wisconsin do not change people’s minds. What you would need to do is to teach a new likelihood function to the other side and that may take decades if at all.
I listened today to the podcast of science fiction writer Bruce Sterling’s Long Now Foundation talk from 2004 on “The Singularity: Your Future as a Black Hole”. The talk is available here. Sterling describes some of the ideas of mathematician and science fiction writer Vernor Vinge’s conception of the singularity as a scary moment in time where super human intelligence ends the human era and we have no way to predict what will happen. I won’t address the issue of whether or not such a moment in time will or not happen in the near future or ever. I’ve posted about it in the past (e.g. see here, here and here). What I do want to discuss is whether or not there can exist events or phenomena that are so incomprehensible that it will reduce us to a quivering pile of mush. I think an excellent starting point is former US Secretary of Defense, Donald Rumsfeld’s infamous speech from 2002 regarding the link between Iraq and weapons of mass destruction prior to the Iraq war, where he said:
[T]here are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – there are things we do not know we don’t know.
While Rumsfeld was mocked by the popular media for this seemingly inane statement, I actually think (geopolitical consequences aside) that it was the deepest thing I ever heard him say. Rumsfeld is a Bayesian! There is a very important distinction between known unknowns and unknown unknowns. In the first case, we can assign a probability to the event. In the second we cannot. Stock prices are known unknowns, while black swans are unknown unknowns. (Rumsfeld’s statement predates Nassim Taleb’s book.) My question is not whether we can predict black swans (by definition we cannot) but whether something can ever occur that we wouldn’t even be able to describe it much less understand it.
In Bayesian language, a known unknown would be any event for which a prior probability exists. Unknown unknowns are events for which there is no prior. It’s not just that the prior is zero, but also that it is not included in our collection (i.e. sigma algebra) of possibilities. Now, the space of all possible things is uncountably infinite so it seems likely that there will be things that you cannot imagine. However, I claim that by simply acknowledging that there exist things that I cannot possible ever imagine, is sufficient to remove the surprise. We’ve witnessed enough in the post-modern world, to assign a nonzero prior to the possibility that anything can and will happen. That is not to say that we won’t be very upset or disturbed by some event. We may read about some horrific act of cruelty tomorrow that will greatly perturb us but it won’t be inconceivably shocking. The entire world could vanish tomorrow and be replaced by an oversized immersion blender and while I wouldn’t be very happy about it and would be extremely puzzled by how it happened, I would not say that it was impossible. Perhaps I won’t be able to predict what will happen after the singularity arrives but I won’t be surprised by it.
Biophysical Journal, Volume 102, Issue 3, 399-406, 8 February 2012
William J. Heuett, Bernard V. Miller, Susan B. Racette, John O. Holloszy, Carson C. Chow, and Vipul Periwal
Abstract: Inference of the insulin secretion rate (ISR) from C-peptide measurements as a quantification of pancreatic β-cell function is clinically important in diseases related to reduced insulin sensitivity and insulin action. ISR derived from C-peptide concentration is an example of nonparametric Bayesian model selection where a proposed ISR time-course is considered to be a “model”. An inferred value of inaccessible continuous variables from discrete observable data is often problematic in biology and medicine, because it is a priori unclear how robust the inference is to the deletion of data points, and a closely related question, how much smoothness or continuity the data actually support. Predictions weighted by the posterior distribution can be cast as functional integrals as used in statistical field theory. Functional integrals are generally difficult to evaluate, especially for nonanalytic constraints such as positivity of the estimated parameters. We propose a computationally tractable method that uses the exact solution of an associated likelihood function as a prior probability distribution for a Markov-chain Monte Carlo evaluation of the posterior for the full model. As a concrete application of our method, we calculate the ISR from actual clinical C-peptide measurements in human subjects with varying degrees of insulin sensitivity. Our method demonstrates the feasibility of functional integral Bayesian model selection as a practical method for such data-driven inference, allowing the data to determine the smoothing timescale and the width of the prior probability distribution on the space of models. In particular, our model comparison method determines the discrete time-step for interpolation of the unobservable continuous variable that is supported by the data. Attempts to go to finer discrete time-steps lead to less likely models.
I think the recent uproar over the acquittal of Casey Anthony clearly shows how our internal system of inference can be at odds with the American judicial system. For those of you who don’t pay attention to the mainstream media, Casey Anthony was a young mother of a toddler that was found dead. What captivated the American public was that the toddler had been missing for a month before Anthony reported it. She lied to her parents and the authorities about the whereabouts of the child and even appeared celebratory in public during the period of the child’s disappearance. In the court of public opinion, Anthony was clearly guilty. The fact that a mother showed no anxiety whatsoever over the disappearance of a child clearly indicates that she was the culprit.
The American judicial system requires that if there is any reasonable doubt of guilt then a person must be acquitted. The burden of proof is on the prosecution. In this case, there were no witnesses and no physical evidence linking Anthony to the death of the child or that the child was even murdered. Thus there was the remote possibility that Anthony was not responsible for the child’s death but simply took advantage of the situation, as macabre as that may be. Even though the probability of these two unlikely events – a mother wishing to be free of her child and a child going missing – is exceedingly low, it is still non zero and thus the jury was forced to acquit.
Now if a single piece of evidence had linked Anthony to the crime, say a fingerprint or DNA sample, then she most likely would have been found guilty. The interesting aspect of this is that there is also an equally low probability that someone could have planted the evidence to frame her. Thus, reasonable doubt is not a global quantity according to the law. It is not sufficient that the total probability that the accused is guilty be high, it also matters if it is high in each of several categories, i.e. motive, opportunity, and direct physical evidence or witnesses. Circumstantial evidence is insufficient to convict a criminal. However, it appears that our brains do not work this way. We seem to take the global probability of guilt and go with that.