COVID-19 Paper

First draft of our paper can be downloaded from this link.  I’ll post more details about it later.

Global prediction of unreported SARS-CoV2 infection from observed COVID-19 cases

Carson C. Chow 1*, Joshua C. Chang 2,3, Richard C. Gerkin 4*, Shashaank Vattikuti 1*

1 Mathematical Biology Section, LBM, NIDDK, National Institutes of Health; 2 Epidemiology and Biostatistics Section, Rehabilitation Medicine, Clinical Center, National Institutes of Health; 3 mederrata; 4 School of Life Sciences, Arizona State University

*For correspondence contact,,,

Summary: Estimation of infectiousness and fatality of the SARS-CoV-2 virus in the COVID-19 global pandemic is complicated by ascertainment bias resulting from incomplete and non-representative samples of infected individuals. We developed a strategy for overcoming this bias to obtain more plausible estimates of the true values of key epidemiological variables. We fit mechanistic Bayesian latent-variable SIR models to confirmed COVID-19 cases, deaths, and recoveries, for all regions (countries and US states) independently. Using Bayesian averaging over models, we find that the raw infection incidence rate underestimates the true rate by a factor, the case ascertainment ratio CARt, that depends upon region and time. At the regional onset of COVID-19, the predicted global median was 13 infections unreported for each case confirmed (CARt = 0.07, C.I. (0.02, 0.4)). As the infection spread, the median CARt rose to 9 unreported cases for every one diagnosed as of April 15, 2020 (CARt = 0.1, C.I. (0.02, 0.5)). We also estimate that the median global initial reproduction number R0 is 3.3 (C.I. (1.5, 8.3)) and the total infection fatality rate near the onset is 0.17% (C.I. (0.05%, 0.9%)). However, the time-dependent reproduction number Rt and infection fatality rate as of April 15 were 1.2 (C.I. (0.6, 2.5)) and 0.8% (C.I. (0.2%, 4%)), respectively. We find that there is great variability between country- and state-level values. Our estimates are consistent with recent serological estimates of cumulative infections for the state of New York, but inconsistent with claims that very large fractions of the population have already been infected in most other regions. For most regions, our estimates imply a great deal of uncertainty about the current state and trajectory of the epidemic.


Exponential growth

The spread of covid-19 and epidemics in general is all about exponential growth. Although I deal with exponential growth daily in my work, I don’t have an immediate intuitive grasp of what it means in real life until I actually think about it. I usually just write down functions of the form e^{rt} and treat them as abstract quantities. In fact, I often just look at differential equations of the form

\frac{dx}{dt} = r x

for which x = A e^{rt} is a solution, where A is any constant, because the derivative of an exponential function is an exponential function. One confusing aspect of discussing exponential growth with lay people is that the above equation is called a linear differential equation, and mathematicians or physicists will often simply say that the dynamics are linear or even that the growth is linear. What we really mean is that the describing equation is a linear differential equation, whose solution grows exponentially if r is positive and decays exponentially if r is negative.
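As a quick check of the claim above, differentiating the proposed solution gives back the equation, and setting t = 0 shows that A is just the starting value:

```latex
\frac{d}{dt}\left( A e^{rt} \right) = r A e^{rt} = r x,
\qquad x(0) = A e^{0} = A
```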

If we let t represent time (say in days), then r is a growth rate. What x = e^{rt} actually means is that every 1/r days x, e.g. the number of people with covid-19, increases by a factor of e, which is about 2.718 (so if r is 1 per day, this happens every day). Now, while this is the most convenient number for mathematical manipulation, it is definitely not the most convenient number for intuition. A common alternative is to use 2 as the base of the exponential, rather than e. With a doubling time of one day, this means that tomorrow you will have double what you have today. I think 2 doesn’t really convey how fast exponential growth is, because you say, well, I start with 1 today, tomorrow I have 2, and the day after 4. Like the famous Simpsons episode in which Homer, making sure that Bart is a name safe from mockery, goes through the alphabet and says “Aart, Bart, Cart, Dart, Eart, okay Bart is fine,” you will have to go to day 10 before you get to a thousandfold increase (1024 to be precise), which still doesn’t seem too bad. But by day 30 you have 2 to the power of 30, or 2^{30}, which is about a billion. On day 31 you have 2 billion and on day 32 you have 4 billion. It’s growing really fast.

Now, I think exponential growth really hits home if you use 10 as a base because now you simply add a zero for every day: 1, 10, 100, 1000, 10000, 100000. Imagine if your stock was increasing by a factor of 10 every day. Starting from a dollar you would have a million dollars by day 6 and a billion by day 9, be richer than Bill Gates by day 11, and have more money than the entire world by the third week. Now, the only difference between exponential growth with base 2 versus base 10 is the rate of growth. This is where logarithms are very useful because they are the inverse of an exponential. So, if x = 2^3 then \log_2 x = 3 (i.e. the log of x in base 2 is 3). Log just gives you the power of an exponential (in that base). For example, 10^x = 2^{(\log_2 10) x}, where \log_2 10 is about 3.3. So an exponential process that is doubling every day is increasing by a factor of ten every 3.3 days.
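Here is a small Python sketch of the base conversion, in case it helps to see the numbers. The function names are made up for illustration; the only fact being used is that changing the base of an exponential just rescales the exponent by a ratio of logarithms.

```python
import math

def days_per_factor(factor, doubling_time_days=1.0):
    """Days needed to grow by `factor` if the quantity doubles every `doubling_time_days`."""
    return math.log2(factor) * doubling_time_days

def rebase_rate(rate, old_base, new_base):
    """Return new_rate such that old_base**(rate * t) == new_base**(new_rate * t)."""
    return rate * math.log(old_base) / math.log(new_base)

# Doubling every day means a tenfold increase roughly every 3.32 days.
tenfold_days = days_per_factor(10)

# 10**x written as a power of 2: the exponent gets multiplied by log2(10), about 3.32.
rate_in_base_2 = rebase_rate(1.0, old_base=10, new_base=2)

print(f"days per tenfold increase: {tenfold_days:.2f}")
print(f"10**x = 2**({rate_in_base_2:.2f} * x)")
```

The same function with a 5-day doubling time, days_per_factor(10, 5.0), gives the time for a tenfold increase at that slower rate.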

The thing about exponential growth is that most of the action is in the last few days. This is probably the hardest part to comprehend. So the reason they were putting hospital beds in the convention center in New York weeks ago is that if the number of covid-19 cases is doubling every 5 days, then even if you are 100-fold under capacity today, you will be over capacity in 7 doublings, or 35 days, and one doubling after that you will have more than twice as many cases as you can handle.
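The arithmetic in that example can be sketched in a few lines of Python. The function name and the notion of "headroom" (capacity divided by current cases) are my own labels for illustration, and the calculation assumes pure exponential growth with a fixed doubling time.

```python
import math

def days_until_over_capacity(headroom, doubling_time_days):
    """Count whole doublings until cases exceed capacity.

    headroom: capacity divided by current cases (100 means 100-fold under capacity).
    Returns (number of doublings, number of days).
    """
    # Smallest whole number n with 2**n > headroom.
    doublings = math.floor(math.log2(headroom)) + 1
    return doublings, doublings * doubling_time_days

def load_after(doublings, headroom):
    """Cases as a multiple of capacity after the given number of doublings."""
    return 2 ** doublings / headroom

print(days_until_over_capacity(100, 5))   # (7, 35)
print(load_after(7, 100))                 # 1.28 times capacity
print(load_after(8, 100))                 # 2.56 times capacity, one doubling later
```

Six doublings only gets you to 64 percent of capacity, which is why things look fine right up until they don't.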

Flattening the curve means slowing the growth rate. If you can cut the rate of growth in half, then the doubling time goes from 5 to 10 days and you have 70 days before you get to 7 doublings and max out capacity. If you can make the rate of growth negative, then the number of cases will decrease and the epidemic will flame out. There are two ways you can make the growth rate go negative. The first is to use external measures like social distancing and the second is to reach nonlinear saturation, which I’ll discuss in a later post. This is also why social distancing measures seem so extreme and unnecessary: you need to apply them before there is a problem, and if they work then those beds in the convention center never get used. It’s a no-win situation: if it works then it will seem like an overreaction, and if it doesn’t then hospitals will be overwhelmed. Given that 7 billion lives and livelihoods are at stake, it is not a decision to be taken lightly.
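To see why halving the growth rate doubles the time you have, write the exponential in base 2. The symbol T_2 for the doubling time is my notation here, not something defined earlier:

```latex
x(t) = A e^{rt} = A \, 2^{t / T_2}, \qquad T_2 = \frac{\ln 2}{r}

r \to \frac{r}{2} \quad \Longrightarrow \quad T_2 \to \frac{\ln 2}{r/2} = 2 \, T_2
```

With T_2 = 5 days, 7 doublings take 7 \times 5 = 35 days; with the rate halved, T_2 = 10 days and the same 7 doublings take 70 days.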


Covid-19 modeling

I have officially thrown my hat into the ring and joined the throngs of would-be Covid-19 modelers to try to estimate (I deliberately do not say predict) the progression of the pandemic. I will pull rank and declare that I kind of do this type of thing for a living. What my conscripted colleagues and I are trying to do is estimate the fraction of the population that has the SARS-CoV-2 virus but has not been identified as such. This was the goal of the Lourenco et al. paper I wrote about previously that pulled me into this pit of despair. I argued that fitting to deaths alone, which is what they do, is insufficient for constraining the model and thus it has no predictive ability. What I’m doing now is seeing whether it is possible to do the job if you fit not only deaths but also the number of cases reported and cases recovered. You then have a latent variable model where the observed variables are cases, cases that die, and cases that recover, and the latent variables are the infected that are not identified and the susceptible population. Our plan is to test a wide range of models with various degrees of detail and complexity and use Bayesian model comparison to see which does a better job. We will apply the analysis to global data. We’re getting numbers now but I’m not sure I trust them yet, so I’ll keep computing for a few more days. The full goal is to quantify the uncertainty in a principled way.
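To make the observed versus latent split concrete, here is a toy forward simulation in Python. This is only a sketch under my own assumptions and not the models we are actually fitting: the compartment structure, the parameter names (beta, sigma, gamma_u, gamma_c, mu) and all the numerical values are made up for illustration, and there is no fitting or Bayesian inference in it at all.

```python
def simulate(days, n=1_000_000, i0=10.0, beta=0.4, sigma=0.05,
             gamma_u=0.1, gamma_c=0.08, mu=0.002, dt=0.1):
    """Forward-Euler integration of a toy SIR-type model.

    Latent (unobserved) compartments:
      S = susceptible, I = infected but not identified, U = removed without ever being identified
    Observed compartments:
      C = active confirmed cases, R = recovered cases, D = cases that died
    All rates are per day and are hypothetical values, not estimates.
    """
    s, i, c, r, d, u = n - i0, i0, 0.0, 0.0, 0.0, 0.0
    steps_per_day = round(1 / dt)
    history = []
    for day in range(days + 1):
        history.append({"day": day, "S": s, "I": i, "C": c, "R": r, "D": d, "U": u})
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt   # assumes only unidentified infected transmit
            identified = sigma * i * dt              # latent infected become confirmed cases
            hidden_removed = gamma_u * i * dt        # recover or die without being counted
            recovered = gamma_c * c * dt
            died = mu * c * dt
            s -= new_infections
            i += new_infections - identified - hidden_removed
            c += identified - recovered - died
            r += recovered
            d += died
            u += hidden_removed
    return history

history = simulate(60)
day30 = history[30]
# Ratio of unidentified to confirmed active infections on day 30 in this toy run.
print(day30["I"] / day30["C"])
```

In a fit, only C, R and D would be compared with data, and S, I and U would have to be inferred, which is why the ratio of unidentified to confirmed infections is the quantity of interest.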

2020-04-06: edited to purge typos