Why we need a national response

It seems quite clear now that we do not do a very good job of projecting COVID-19 progression. There are many reasons. One is that it is hard to predict how people and governments will behave. A fraction of the population will practice social distancing and withdraw from usual activity in the absence of any governmental mandates, another fraction will not do anything different regardless of official policy, and the rest are somewhere in between. I for one get more scared of this thing the more I learn about it. Who knows what the long-term consequences will be, particularly for autoimmune disease. The virus triggers a massive immune response everywhere in the body, and that response could easily develop a memory against your own cells in addition to the virus.

The virus also spreads in local clusters that may reach local saturation before infecting new clusters, but the cross-cluster transmission events are low probability and hard to detect. The virus reached American shores in early January, and maybe even earlier, but most of those early introductions died out. This is because the transmission rate is highly variable. A mean reproduction number of 3 could mean that everyone has R = 3, or that most people transmit with R less than 1 while a small number of people (or events) transmit with very high R. (Nassim Nicholas Taleb has written copiously on the hazards of highly variable (fat-tailed) distributions. For those with mathematical backgrounds, I highly recommend reading his technical volumes, The Technical Incerto. Even if you don't believe most of what he says, you can still learn a lot.) Thus it is hard to predict when an event will start a local epidemic, although large gatherings of people (e.g. weddings, conventions) are a good place to start. Once the epidemic starts, it grows exponentially and then starts to saturate, either by running out of people in the locality to infect, or by people changing their behavior, or, more likely, both. Parts of New York may be above the herd immunity threshold now.
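To illustrate this numerically, here is a short Julia sketch that draws secondary infections from a negative binomial distribution, a standard way of modeling overdispersed transmission. The mean of 3 and the dispersion parameter of 0.1 are illustrative assumptions, not estimates for SARS-CoV-2.

using Distributions, Statistics, Random

Random.seed!(1)
meanR = 3.0   # assumed mean reproduction number
k = 0.1       # assumed dispersion parameter; smaller k means a fatter tail
# NegativeBinomial(r, p) has mean r*(1-p)/p, so pick p to match the desired mean
p = k / (k + meanR)
secondary = rand(NegativeBinomial(k, p), 100_000)
println("mean secondary infections ≈ ", mean(secondary))
println("fraction infecting at most one other person ≈ ", mean(secondary .<= 1))
println("fraction of all transmission from the top 10% of spreaders ≈ ",
    sum(sort(secondary, rev=true)[1:10_000]) / sum(secondary))

With these assumed values, the mean is 3 but most individuals infect nobody or one person, while a small minority account for most of the transmission, which is the point about fat tails made above.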

Thus at this point, I think we need to take a page out of Taleb's book (like, literally, as my daughter would say) and not worry too much about forecasting. We can use it as a guide, but we have enough information to know that most people are susceptible, about a third will be asymptomatic if infected (which doesn't mean they won't have long-term consequences), about a tenth to a fifth will be counted as a case, and a few percent of those will die, which depends strongly on age and pre-existing conditions. We can wait around for a vaccine or herd immunity and in the process let many more people die (I don't know how many, but I do know that the total number of deaths is a nondecreasing quantity), or we can act now everywhere to shut this down and impose a strict quarantine on anyone entering the country until they have tested negative 3 times with a high-specificity PCR test (and maybe 8 out of 17 times with a low-specificity, low-sensitivity antigen test).

Acting now everywhere means either 1) shutting everything down for at least two weeks: no Amazon or Grubhub or Doordash deliveries, no going to Costco and Walmart, not even going to the supermarket. It means paying everyone in the country without an income some substantial fraction of their salary. It means distributing a two-week supply of food to everyone. It means truly essential workers, like people keeping the electricity on and hospital workers, living in a quarantine bubble hotel, like the NBA and NHL; or 2) testing everyone who wants to leave their house every day and paying them to quarantine at home or in a hotel if they test positive. Both plans require national coordination and a lot of effort. The CARES Act package has run out and we are heading for economic disaster while the pandemic rages on. As a recent president once said, "What have you got to lose?"

The battle over academic freedom

In the wake of George Floyd’s death, almost all of institutional America put out official statements decrying racism and some universities initiated policies governing allowable speech and research. This was followed by the expected dissent from those who worry that academic freedom is being suppressed (see here, here, and here for some examples). Then there is the (in)famous Harper’s Magazine open letter decrying Cancel Culture, which triggered a flurry of counter responses (e.g. see here and here).

While some faculty members in the humanities and (non-life) sciences are up in arms over the thought of a committee of their peers judging what should be allowable research, I do wish to point out that their colleagues over on the Medical campus have had to get approval for human and animal research for decades. Research on human subjects must first pass through an Institutional Review Board (IRB) while animal experiments must clear the Institutional Animal Care and Use Committee (IACUC). These panels ensure that the proposed work is ethical, sound, and justified. Even research that is completely noninvasive, such as analysis of genetic data, must pass scrutiny to ensure the data are not misused and subject identities are strongly protected. Almost all faculty members would agree that this step is important and necessary. History is rife with questionable research, ranging from the careless to the criminal. Is it so unreasonable to extend such a concept to the rest of campus?

How to make a fast but bad COVID-19 test good

Among the myriad problems we are having with the COVID-19 pandemic, testing speed is one we could actually improve. The standard test for the presence of SARS-CoV-2 virus uses PCR (polymerase chain reaction), which amplifies targeted viral RNA. It is accurate (high specificity) but requires relatively expensive equipment and reagents that are currently in short supply. There are reports of wait times of over a week, which renders a test useless for contact tracing.

An alternative to PCR is an antigen test, which tests for the presence of protein fragments associated with the virus. These tests can in principle be very cheap and fast, and could even be administered on paper strips. They are generally much less reliable than PCR and thus have not been widely adopted. However, as I show below, by applying the test multiple times the noise can be suppressed and a poor test can be made arbitrarily good.

The performance of a binary test is usually gauged by two quantities: sensitivity and specificity. Sensitivity is the probability that you test positive given that you actually are infected (the true positive rate). Specificity is the probability that you test negative given that you actually are not infected (the true negative rate). For a pandemic, sensitivity is more important than specificity because missing someone who is infected means you could put lots of people at risk, while a false positive just means the person falsely testing positive is inconvenienced (provided they cooperatively self-isolate). Current PCR tests have very high specificity but relatively low sensitivity (as low as 0.7), and since we don't have enough capacity to retest, a lot of infected people who are tested could be escaping detection.

The way to make any test have arbitrarily high sensitivity and specificity is to apply it multiple times and take some sort of average. However, you want to do this with the fewest number of applications. Suppose we administer n tests on the same subject, where a single test has sensitivity q and specificity r. The probability of getting more than k positive tests if the person is truly positive is Q(k,n,q) = 1 - CDF(k|n,q), where CDF is the cumulative distribution function of the binomial distribution (i.e. the probability that the number of binomially distributed events is less than or equal to k). If the person is truly negative, then the probability of k or fewer positives is R(k,n,r) = CDF(k|n,1-r). We thus want to find the minimal n for a desired sensitivity and specificity, q' and r'. This means that we need to solve the constrained optimization problem: find the minimal n under the constraints that k < n, Q(k,n,q) \ge q', and R(k,n,r) \ge r'. Q decreases and R increases with increasing k, and vice versa for n. We can easily solve this problem by sequentially increasing n and scanning through k until the two constraints are met. I've included the Julia code to do this below. For example, starting with a test with sensitivity .7 and specificity 1 (like a PCR test), you can create a new test with greater than .95 sensitivity and specificity by administering the test 3 times and declaring a positive if at least one application is positive. However, if the specificity drops to .7, then you would need to find more than 8 positives out of 17 applications to get a composite test with 95% sensitivity and specificity.


using Distributions

# Probability of more than k positive tests out of n applications
# when the subject is truly positive and each test has sensitivity q.
function Q(k, n, q)
    d = Binomial(n, q)
    return 1 - cdf(d, k)
end

# Probability of k or fewer positive tests out of n applications
# when the subject is truly negative and each test has specificity r.
function R(k, n, r)
    d = Binomial(n, 1 - r)
    return cdf(d, k)
end

# Find the smallest n and threshold k such that declaring a subject positive
# when more than k of n applications are positive gives sensitivity >= qp
# and specificity >= rp.
function optimizetest(q, r, qp=.95, rp=.95)
    nout = 0
    kout = 0
    for n in 1:100
        for k in 0:n-1
            # println(R(k,n,r), " ", Q(k,n,q))  # uncomment to trace the search
            if R(k, n, r) >= rp && Q(k, n, q) >= qp
                kout = k
                nout = n
                break
            end
        end
        if nout > 0
            break
        end
    end
    return nout, kout
end
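For example, running the function on the two cases described above should reproduce the numbers quoted there (a usage sketch; the commented outputs are what the discussion above implies):

# Sensitivity 0.7, specificity 1.0 (PCR-like): 3 applications, positive if more than 0 are positive
println(optimizetest(0.7, 1.0))   # expect (3, 0)

# Sensitivity and specificity both 0.7 (a poor antigen test): more than 8 positives out of 17
println(optimizetest(0.7, 0.7))   # expect (17, 8), consistent with the discussion above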

Remember the ventilator

According to our model, the global death rate due to Covid-19 is around 1 percent of all infected (including unreported). However, if it were not for modern medicine, and in particular the ventilator, the death rate would be much higher. Additionally, the pandemic first raged in the developed world and is only recently engulfing parts of the world where medical care is not as ubiquitous, although this may be mitigated by a younger populace in those places. The delay between the appearance of a Covid-19 case and death is also fairly long; our model predicts a mean of over 50 days. So the lower US death rate compared to April could change in a month or two, when the effects of the recent surges in the US south and west are finally felt.

An overview of mechanical ventilation in the intensive care unit

How long and how high for Covid-19

Cases of Covid-19 are trending back up globally and in the US. The world has nearly reached 10 million cases, with over 2.3 million in the US. There is still a lot we don't understand about SARS-CoV-2 transmission, but I am confident we are nowhere near herd immunity. Our model consistently shows that total SARS-CoV-2 infections outnumber official Covid-19 cases by a factor of 5 to 10 (a case ascertainment ratio of roughly 0.1 to 0.2). That means that the US has fewer than 25 million infections while the world has fewer than 100 million.

Herd immunity means that, for a fixed reproduction number R0, the number of active infections will trend downward once the fraction of the population still susceptible falls below 1/R0, or equivalently once the total fraction infected is higher than 1 - 1/R0. Thus, for an R0 of 4, three quarters of the population needs to be infected to reach herd immunity. However, the total fraction that will eventually be infected, as I will show below, will be

1 -\frac{e^{-R_0}}{1- R_0e^{-R_0}}

which is considerably higher. Thus, mitigation efforts to reduce R0 will reduce the total number infected. (2020-06-27: This expression is not accurate when R0 is near 1. For a formula in that regime, see Addendum.)

Some regions in Western Europe, East Asia, and even the US have managed to suppress R0 below 1, and cases there are trending downward. In the absence of reintroduction of SARS-CoV-2 carriers, Covid-19 can be eliminated in these regions. However, as the recent spikes in China, South Korea, and Australia have shown, this requires continual vigilance. As long as any person remains infected in the world, there is always a chance of re-emergence. As long as new cases are increasing or plateauing, R0 remains at or above 1. As I mentioned before, plateauing is not a natural feature of epidemic prediction models, which generally either go up or go down. Plateauing requires either continuous adjustment of R0 through feedback or propagation through the population as a wave front, like a lawn mower cutting grass. The latter is what actually seems to be going on. Regions rise and fall in succession. As one region reaches a peak and comes down, either through mitigation or rapid spread of SARS-CoV-2, Covid-19 takes hold in another. We saw China and East Asia rise and fall, then Italy, then the rest of Western Europe, then New York and New Jersey, and so forth in series, not in parallel. Now it is spreading throughout the rest of the USA, South America, and Eastern Europe. Africa has been spared so far, but it is probably next, as cases are beginning to explode in South Africa.

A reduction in R0 also delays the time to reach the peak. As a simple example, consider the standard SIR model

\frac{ds}{dt} = -\beta sl

\frac{dl}{dt} = \beta sl -\sigma l

where s is the fraction of the population susceptible to SARS-CoV-2 infection and l is the fraction of the population actively infectious. Below are simulations of the pandemic progression for R0 = 4 and 2.

[Figure: simulated SIR epidemic curves (s and l versus time) for R0 = 4 and R0 = 2]
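For anyone who wants to reproduce these simulations, here is a minimal Julia sketch that integrates the two equations above with a simple Euler step. The values \sigma = 0.1 per day and l_0 = 10^{-5} are illustrative assumptions, not fits.

# Euler integration of ds/dt = -beta*s*l, dl/dt = beta*s*l - sigma*l.
# sigma = 0.1 per day and l0 = 1e-5 are assumptions for illustration only.
function sir(R0; sigma=0.1, l0=1e-5, dt=0.1, days=400)
    beta = R0 * sigma
    s, l = 1.0 - l0, l0
    lpeak, tpeak, t = l, 0.0, 0.0
    while t < days
        ds = -beta * s * l
        dl = beta * s * l - sigma * l
        s += ds * dt
        l += dl * dt
        t += dt
        if l > lpeak
            lpeak, tpeak = l, t
        end
    end
    return s, lpeak, tpeak
end

for R0 in (4.0, 2.0)
    s_end, lpeak, tpeak = sir(R0)
    println("R0 = ", R0, ": peak at day ", round(Int, tpeak),
        ", peak infectious fraction ≈ ", round(lpeak, digits=3),
        ", never infected ≈ ", round(s_end, digits=3))
end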

We see that halving R0 basically doubles the time to reach the peak but much more than doubles the fraction of people who never get infected. We can see why this is true by analyzing the equations. Dividing the two SIR equations gives

\frac{dl}{ds} = \frac{\sigma l -\beta sl}{\beta sl},

which integrates to l =  \frac{\sigma}{\beta} \ln s - s + C. If we suppose that initially s=1 and l = l_0<<1 then we get

l =  \frac{1}{R_0} \ln s + 1 - s + l_0 (*)

where R_0 = \beta/\sigma is the reproduction number. The total number infected will be 1-s for l=0. Rearranging gives

s = e^{-R_0(1+l_0-s)}

If we assume that R_0 s <<1 and ignore l_0 we can expand the exponential and solve for s to get

s \approx \frac{e^{-R_0}}{1- R_0e^{-R_0}}

This is the fraction of the population that never gets infected, which is also the probability that you won’t be infected. It gets smaller as R_0 increases. So reducing R_0 can exponentially reduce your chances of being infected.
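To put some numbers on this, here is a short Julia sketch evaluating the approximate expression above for a few values of R_0 (the approximation is only valid away from R_0 = 1; see the addendum below):

# Approximate fraction never infected, s ≈ exp(-R0)/(1 - R0*exp(-R0))
never_infected(R0) = exp(-R0) / (1 - R0 * exp(-R0))
for R0 in (2.0, 3.0, 4.0)
    println("R0 = ", R0, ": fraction never infected ≈ ", round(never_infected(R0), digits=3))
end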

To figure out how long it takes to reach the peak, we substitute equation (*) into the SIR equation for s to get

\frac{ds}{dt} = -\beta(\frac{1}{R_0} \ln s + 1 - s + l_0) s

We compute the time to peak, T, by separating variables and integrating both sides. The peak is reached when s = 1/R_0.  We must thus compute

T= \int_0^T dt =\int_{1/R_0}^1 \frac{ds}{ \beta(\frac{1}{R_0} \ln s + 1 - s +l_0) s}

We can't do this integral exactly, but if we set s = 1 - z with z << 1, then we can expand \ln s \approx -z and obtain

T= \int_0^T dt =\int_0^{l_p} \frac{dz}{ \beta(-\frac{1}{R_0}z  + z +l_0) (1-z)}

where l_p = 1-1/R_0. This can be re-expressed as

T=\frac{1}{ \beta (l_0+l_p)}\int_0^{l_p} (\frac{1}{1-z} + \frac{l_p}{l_p z + l_0}) dz

which is integrated to

T= \frac{1}{ \beta (l_0+l_p)} (-\ln(1-l_p) + \ln (l_p^2 + l_0)-\ln l_0)

If we assume that l_0<< l_p, then we get an expression

T \approx \frac{\ln (R_0 l_p^2/l_0)}{\sigma (R_0 -1)}

So, T is proportional to the recovery time 1/\sigma and inversely related to R_0 as expected, but if l_0 is very small (say 0.00001) compared to R_0 (say 3) then \ln (R_0/l_0) can be big (around 10), which may explain why it takes so long for the pandemic to get started in a region. If the prevalence is very low in a region, the time it takes for a super-spreader event to make an impact could be much longer than expected (on the order of ten times the infection clearance time, which could be two weeks or more).
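To gauge how rough this estimate is, the following Julia sketch compares the approximation with a direct numerical evaluation of the exact integral for T above. The values \sigma = 0.1 per day and l_0 = 10^{-5} are assumptions for illustration, and the QuadGK package is assumed to be available for the quadrature.

using QuadGK

# Compare the exact time to peak (numerical integral) with the approximation
# ln(R0*lp^2/l0)/(sigma*(R0-1)). sigma and l0 are illustrative assumptions.
sigma, l0 = 0.1, 1e-5
for R0 in (2.0, 3.0, 4.0)
    beta = R0 * sigma
    lp = 1 - 1 / R0
    T_exact, _ = quadgk(s -> 1 / (beta * (log(s) / R0 + 1 - s + l0) * s), 1 / R0, 1)
    T_approx = log(R0 * lp^2 / l0) / (sigma * (R0 - 1))
    println("R0 = ", R0, ": exact T ≈ ", round(T_exact, digits=1),
        " days, approximate T ≈ ", round(T_approx, digits=1), " days")
end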


Addendum 2020-06-26: Fixed typos in equations and added clarifying text to last paragraph


Addendum 2020-06-27: The approximation for total infected is not very good when R_0 is near 1. A better one can be obtained by expanding the exponential to quadratic order, in which case you get the new formula for the fraction never infected

s = \frac{1}{{R_{0}}^2} ( e^{R_0} - R_0 - \sqrt{(e^{R_0}-R_0)^2 - 2{R_0}^2})

However, for R_0 near 1, a better expansion is to substitute z = 1-s into equation (*) and obtain

l =  \frac{1}{R_0} \ln (1-z) + z + l_0

Setting l=0, rearranging, and exponentiating, we obtain

1 - z = e^{-R_0(l_0+z)}, which can be expanded to yield

1- z = e^{-R_0 l_0}(1 - R_0 z + R_0^2 z^2/2)

Solving for z gives the total fraction infected to be

z = (R_0 -e^{R_0l_0} + \sqrt{(R_0-e^{R_0l_0})^2 - 2 R_0^2(1-e^{R_0l_0})})/R_0^2

This took me much longer than it should have.
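As a quick numerical check, here is a Julia sketch comparing the exact root of 1 - z = e^{-R_0(l_0+z)}, found by simple bisection, with the quadratic formula above and with the earlier approximation. The values R_0 = 1.2 and l_0 = 10^{-4} are arbitrary illustrative choices.

# Exact final size from 1 - z = exp(-R0*(l0 + z)), found by bisection.
function finalsize_exact(R0, l0)
    f(z) = 1 - z - exp(-R0 * (l0 + z))
    a, b = 1e-8, 1.0 - 1e-8   # f(a) > 0 and f(b) < 0
    for _ in 1:100
        m = (a + b) / 2
        if f(m) > 0
            a = m
        else
            b = m
        end
    end
    return (a + b) / 2
end

# Quadratic-order approximation derived above.
finalsize_quadratic(R0, l0) =
    (R0 - exp(R0 * l0) + sqrt((R0 - exp(R0 * l0))^2 - 2 * R0^2 * (1 - exp(R0 * l0)))) / R0^2

# Earlier approximation, 1 - exp(-R0)/(1 - R0*exp(-R0)), which breaks down near R0 = 1.
finalsize_old(R0) = 1 - exp(-R0) / (1 - R0 * exp(-R0))

R0, l0 = 1.2, 1e-4
println("exact z ≈ ", finalsize_exact(R0, l0))
println("quadratic approximation ≈ ", finalsize_quadratic(R0, l0))
println("earlier approximation ≈ ", finalsize_old(R0))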


The fatal flaw of the American Covid-19 response

The United States has surpassed 2 million official Covid-19 cases and 115 thousand deaths. After three months of lockdown, the country has had enough and is reopening. Although the lockdown achieved its initial goal of slowing the growth of the pandemic so that hospitals would not be overwhelmed, the battle has not been won. We're not at the beginning of the end; we may not even be at the end of the beginning. If everyone in the world could go into complete isolation, the pandemic would be over in two weeks. Instead, it is passed from one person to the next in a tragic relay race. As long as a single person is shedding the SARS-CoV-2 virus and comes in contact with another person, the pandemic will continue. The pandemic in the US is not heading for extinction. We are not near herd immunity and R0 is not below one. In the most optimistic yet plausible scenario, 30 million people have already been infected and 200 million will never get it, either by having some innate immunity or by avoiding it through sheltering or luck. However, that still leaves over 100 million who are susceptible, of whom about a million will die if they all catch it.

However, the lack of effectiveness of the response is not the fatal flaw. No, the fatal flaw is that the US Covid-19 response asks one set of citizens to sacrifice for the benefit of another set. The Covid-19 pandemic is a story of three groups of people. The fortunate third can work from home, and the lockdown is mostly just an inconvenience. They still get paychecks while supplies and food can be delivered to their homes. Sure, it has been stressful and many have forgone essential medical care, but they can basically ride this out for as long as it takes. The second group, who own or work in shuttered businesses, have lost their income. The federal rescue package is keeping some of them afloat, but that runs out in August. The choice they have is to reopen and risk getting infected or to go hungry and homeless. Finally, the third group is working to allow the first group to remain in their homes. They are working on farms, in food processing plants, and in grocery stores. They are cutting lawns, fixing leaking pipes, and delivering goods. They are working in hospitals and nursing homes, taking care of the sick and of the children of those who must work. They are also the ones who are most likely to get infected and spread it to their families or the people they are trying to take care of. They are dying so others may live.

A lockdown can only work in a society if the essential workers are adequately protected and those without incomes are supported. Each worker should have an N100 mask, be trained how to wear it and be tested weekly. People in nursing homes should be wearing hazmat suits. Everyone who loses income should be fully compensated. In a fair society, everyone should share the risks and the pain equally.

Covid-19 continues to spread

The global plateau turned out to be just a pause and the growth in new cases continues. The rise seems to be driven mostly by increases in Brazil, India, and, until very recently, Russia, with plateaus in the US and European countries as they relax their mitigation policies. The pandemic is not over by a long shot. There will most certainly be further growth in the near future.

How much Covid-19 testing do we need?

There is a simple way to estimate how much SARS-CoV-2 PCR testing we need to start diminishing the COVID-19 pandemic. Suppose we test everyone at a rate f with a PCR test that has 100% sensitivity, which means we do not miss anyone who is positive, although we could still have false positives. The rate at which we find positives is f p, where p is the prevalence of infectious individuals in the population. If positive individuals are isolated from the rest of the population, with probability q, until they are no longer infectious, then the rate of reduction in prevalence is fqp. To reduce the pandemic, this number needs to be higher than the rate of pandemic growth, which is given by \beta s p, where s is the fraction of the population susceptible to SARS-CoV-2 infection and \beta is the rate of transmission from an infected individual to a susceptible upon contact. Thus, to reduce the pandemic, we need to test at a rate higher than \beta s/q.

In the initial stages of the pandemic s is one and \beta = R_0/\sigma, where R_0 is the mean reproduction number, which is probably around 3.7, and \sigma is the mean duration of infectiousness, which is probably around 10 to 20 days. This gives an estimate of the initial \beta of somewhere around 0.3 per day. Thus, in the early stages of the pandemic, we would need to test everyone at least two or three times per week, provided positives are isolated. However, if people wear masks and avoid crowds then \beta could be reduced. If we can get it smaller then we can test less frequently. Currently, the global average of the effective reproduction number is around one, so we would need to test everyone every two or three weeks. If positives don't isolate with high probability, we need to test at a higher rate to compensate. This threshold rate will also go down as s goes down.
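A back-of-the-envelope version of this calculation in Julia (R_0 = 3.7, a 14-day infectious period, s = 1, and an isolation probability q = 0.9 are assumptions for illustration):

# Threshold testing rate f > beta*s/q, expressed also as a testing interval.
# R0 = 3.7, a 14-day infectious period, s = 1, and q = 0.9 are illustrative assumptions.
R0 = 3.7
infectious_days = 14
s = 1.0
q = 0.9
beta = R0 / infectious_days            # per day
f_threshold = beta * s / q             # tests per person per day
println("threshold testing rate ≈ ", round(f_threshold, digits=2), " per person per day")
println("i.e. test everyone roughly every ", round(1 / f_threshold, digits=1), " days")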

In fact, you can just test randomly at rate f and monitor the positive rate. If the positive rate trends downward then you are testing enough. If it is going up then test more. In any case, we may need less testing capability than we originally thought, but we do need to test the entire population and not just suspected cases.

The Covid-19 plateau

For the past five weeks, the appearance rate of new Covid-19 cases has plateaued at about a hundred thousand new cases per day. Just click on the Daily Cases tab on the JHU site to see for yourself. This is quite puzzling because while individual nations and regions are rising, falling, and plateauing independently, the global total is flat as a pancake. A simple resolution to this seeming paradox was proposed by the economist John Cochrane (see his post here). The solution is rather simple but the implications, which I go into in more detail below, are far reaching. The short answer is that if the world (either through behavior or policy) reacts to the severity of Covid-19 incrementally, then a plateau will arise. When cases go up, people socially distance and the number goes down; when cases go down, they relax a little and it goes back up again.

This can be made more precise with the now-famous SIR model. For the uninitiated, SIR stands for the Susceptible-Infected-Recovered model. It is a simple dynamical model of disease propagation that has been in use for almost a century. The basic premise of an SIR model is that at any given time, each member of the population is either infected with the virus (I), susceptible to infection (S), or recovered from infection and no longer susceptible (R). Each time an S comes across an I, it has a chance of being infected and becoming another I. An I will recover (or die) at some rate and become an R. The simplest way to implement an SIR model is to assume that people interact completely randomly and uniformly across the population and that the rates of transmission and recovery are uniform as well. This is of course a gross simplification that ignores the complexity and diversity of social interactions, the mechanisms of actual viral transmission, and the progression of disease within individuals. However, even though it misses all of these nuances, it captures many of the important dynamics of epidemics. In differential equation form, the SIR model is written as

\frac{dS}{dt} = -\frac{\beta}{N} S I

\frac{dI}{dt} = \frac{\beta}{N} S I - \sigma I   (SIR model)

where N is the total number of people in the population of interest. Here, S and I are in units of numbers of people. The left hand sides of these equations are derivatives with respect to time, or rates. They have dimensions or units of people per unit time, say per day. From this we can infer that \beta and \sigma must have units of inverse days (per day) since S, I, and N all have units of numbers of people. Thus \beta is the infection rate per day and \sigma is the recovery/death rate per day. The equation assumes that the probability of an S meeting an I is I/N. If there were one infected person in a population of a hundred, and you were to interact completely randomly with everyone, then the chance you would run into an infected person is 1/100. Actually, it would be 1/99, but in a large population the difference becomes insignificant. Right away, we can see a problem with this assumption. I interact regularly with perhaps a few hundred people per week or month, but the chance of me meeting a person who had just come from Australia in a typical week is extremely low. Thus, it is not at all clear what we should use for N in the model. The local population, the regional population, the national population?

The model assumes that once an S has run into an I, the rate of transmission of the virus is \beta. The total rate of decrease of S is the product of \beta and SI/N. The rate of change of I is given by the increase due to interactions with S and the decrease due to recovery/death \sigma I. These terms all have units of person per day. Once you understand the basic principles of constructing differential equations, you can model anything, which is what I like to do. For example, I modeled the temperature dynamics of my house this winter and used it to optimize my thermostat settings. In a post from a long time ago, I used it to model how best to boil water.

Given the SIR model, you can solve the equations to find how I and S change in time. The SIR model is a system of nonlinear differential equations that does not have what is called a closed-form solution, meaning you can't write down that I(t) is some nice function like e^{t} or \sin(t). However, you can solve it numerically on a computer or infer properties of the dynamics directly without actually solving it. For example, if \beta SI/N is initially greater than \sigma I, then dI/dt is positive and thus I will increase with time. On the other hand, since dS/dt is always negative, S will decrease in time. As S decreases, the growth rate per infected person, \beta S/N, falls while the loss rate per infected person, \sigma, stays constant, so at some point \beta SI/N will equal \sigma I and dI/dt=0. This is a stationary point of I. However, it is only a momentary stationary point because S keeps decreasing, which will make I start to decrease too, and thus this stationary point is a maximum. In the SIR model, the stationary point is given by the condition

\frac{dI}{dt} = 0 = \frac{\beta}{N}SI -\sigma I (Stationary condition)

which you can solve to get either I = 0 or \beta S/N = \sigma. The I=0 point is not a peak but just reflects the fact that there is no epidemic if there are no infections. The other condition gives the peak:

\frac{S}{N} = \frac{\sigma}{\beta} \equiv \frac{1}{R_0}

where R_0 is the now-famous R naught, or initial reproduction number. It is the average number of people infected by a single infectious person; since \beta is the infection rate and \sigma is the infection disappearance rate, their ratio is a dimensionless number. The stationary condition gives the herd immunity threshold. When the susceptible fraction reaches S^*/N = 1/R_0, the pandemic will begin to decline. This is usually expressed in terms of the fraction who have been infected and are no longer susceptible, 1 - 1/R_0. The 70% number you have heard comes from the fact that 1 - 1/R_0 is approximately 70% for R_0 = 3.3, the presumed value for Covid-19.

A plateau in the number of new cases per day is an indication that we are at a stationary point in I. This is because only a fraction of the total infected are counted as cases, and if we assume that the case detection rate is uniform across all I, then the number of new cases per day is proportional to I. Thus, a plateau in cases means we are at a stationary point in I, which we saw above only occurs at a single instant in time. One resolution to this paradox would be if the peak is broad enough that it looks like a plateau. We can compute how broad the peak is from the second derivative, which gives the rate of change of the rate of change. This is the curvature of the peak. Taking the second derivative of the I equation in the SIR model gives

\frac{d^2I}{dt^2} = \frac{\beta}{N} (\frac{dS}{dt} I + S\frac{dI}{dt}) - \sigma \frac{dI}{dt}

Using dI/dt=0 and the formula for S^* at the peak, the curvature is

\frac{d^2I}{dt^2} = \frac{\beta}{N} \frac{dS}{dt} I =-\left( \frac{\beta}{N}\right)^2 S^* I^2 =- \frac{I^2\beta^2}{NR_0}

It is negative because at a peak the slope is decreasing. (A hill is easier to climb as you round the top.) There could be an apparent plateau if the curvature is very small, which is true if I^2 \beta^2 is small compared to N R_0. However, this would also mean we are already at the herd immunity threshold, which our paper and recent antibody surveys predict to be unlikely given what we know about R_0.

If a broad peak at the herd immunity threshold does not explain the plateau in new global daily cases, then what does? Cochrane's theory is that \beta depends on I. He postulated that \beta = \beta_0 e^{-\alpha I/N}, where \beta_0 is the initial infectivity rate, but any decreasing function will do. When I goes up, \beta goes down. Cochrane attributes this to human behavior but it could also be a result of policy and government mandate. If you plug this into the stationary condition you get

\frac{\beta_0}{N} S^* e^{-\alpha I^*/N} -\sigma = 0

or

I^* =-\frac{N}{\alpha} \log(\frac{N\sigma}{S^*\beta_0})

and the effective reproduction number is one.

However, this is still only a quasi-stationary state because if I is a constant I^*, then S will decrease as dS/dt = -\frac{\beta}{N}SI^*, which has solution

S(t) = N e^{-(\beta I^*/N) t}      (S)

Plugging this into the equation for  I^* gives

I^* =-\frac{N}{\alpha} \log(\frac{\sigma}{\beta_0}e^{(\beta I^*/N) t}) = \frac{N}{\alpha}\log R_0 - \frac{\beta I^*}{\alpha} t

which means that I is not really plateaued but is decreasing slowly as

I^* = \frac{N}{\alpha+\beta t}\log R_0

We can establish perfect conditions for a plateau if we work backwards. Suppose again that I has plateaued at I^*.  Then, S(t) is given by equation (S).  Substituting this into the (Stationary Condition) above then gives 0 = \beta(t)  e^{-(\beta I^*/N) t} -\sigma or

\beta(t) = \sigma e^{(\beta I^*/N) t}

which means that the global plateau is due to us first reducing \beta to near \sigma, which halted the spread locally, and then gradually relaxing pandemic mitigation measures so that \beta(t) is creeping upwards back to its original value.
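Here is a minimal Julia sketch of the feedback mechanism using Cochrane's postulated form \beta = \beta_0 e^{-\alpha I/N}. The parameter values (\beta_0 = 0.3, \sigma = 0.1, \alpha = 500, and the initial conditions) are assumptions chosen only to make the plateau visible, not fits to data.

# SIR with behavioral feedback beta(I) = beta0*exp(-alpha*I/N), Euler integration.
# All parameter values are illustrative assumptions, not fits to data.
function feedback_sir(; N=1e7, beta0=0.3, sigma=0.1, alpha=500.0, I0=100.0, days=300, dt=0.1)
    S, I = N - I0, I0
    for day in 1:days
        for _ in 1:round(Int, 1/dt)
            beta = beta0 * exp(-alpha * I / N)
            dS = -beta * S * I / N
            dI = beta * S * I / N - sigma * I
            S += dS * dt
            I += dI * dt
        end
        if day % 50 == 0
            println("day ", day, ": I ≈ ", round(Int, I), ", S/N ≈ ", round(S / N, digits=3))
        end
    end
end

feedback_sir()

With these assumed values, I climbs quickly, then hovers at a roughly constant level that slowly drifts downward as S is depleted, which is the quasi-stationary plateau described above.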

The Covid-19 plateau is both good news and bad news. It is good news because we are not seeing exponential growth of the pandemic. Globally, it is more or less contained. The bad news is that at a rate of a hundred thousand new cases per day, it will take a long time before we reach herd immunity. If we make the naive assumption that we won't reach herd immunity until 5 billion people are infected, then this pandemic could blunder along for 5\times 10^9/10^5 = 5\times 10^4 = 50000 days! In other words, the pandemic will keep circling the world forever since over that time span, babies will be born and grow up. Most likely, it will become less virulent and will just join the panoply of diseases we currently live with, like the various varieties of the common cold (some of which are also caused by coronaviruses) and the flu.

New paper in Molecular Psychiatry

Genomic analysis of diet composition finds novel loci and associations with health and lifestyle

S. Fleur W. Meddens, et al.

Abstract

We conducted genome-wide association studies (GWAS) of relative intake from the macronutrients fat, protein, carbohydrates, and sugar in over 235,000 individuals of European ancestries. We identified 21 unique, approximately independent lead SNPs. Fourteen lead SNPs are uniquely associated with one macronutrient at genome-wide significance (P < 5 × 10^-8), while five of the 21 lead SNPs reach suggestive significance (P < 1 × 10^-5) for at least one other macronutrient. While the phenotypes are genetically correlated, each phenotype carries a partially unique genetic architecture. Relative protein intake exhibits the strongest relationships with poor health, including positive genetic associations with obesity, type 2 diabetes, and heart disease (rg ≈ 0.15–0.5). In contrast, relative carbohydrate and sugar intake have negative genetic correlations with waist circumference, waist-hip ratio, and neighborhood deprivation (|rg| ≈ 0.1–0.3) and positive genetic correlations with physical activity (rg ≈ 0.1 and 0.2). Relative fat intake has no consistent pattern of genetic correlations with poor health but has a negative genetic correlation with educational attainment (rg ≈ −0.1). Although our analyses do not allow us to draw causal conclusions, we find no evidence of negative health consequences associated with relative carbohydrate, sugar, or fat intake. However, our results are consistent with the hypothesis that relative protein intake plays a role in the etiology of metabolic dysfunction.

A Covid-19 Manhattan Project

Right now there are hundreds if not thousands of Covid-19 models floating around out there. Some are better than others, some have much more influence than others, and the ones that have the most influence are not necessarily the best. There must be a better way of doing this. The world's greatest minds convened in Los Alamos in WWII and built two atomic bombs. Metaphors get thrown around with reckless abandon, but if there ever was a time for a Manhattan Project, we need one now. Currently, the world's scientific community has mobilized to come up with models to predict the spread and the effectiveness of mitigation efforts, to produce new therapeutics, and to develop new vaccines. But this is mostly going on independently.

Would it be better if we were to coordinate all of this activity? Right now at the NIH, there is an intense effort to compile all the research that is being done in the NIH Intramural Program and to develop a system where people can share reagents and materials. There are other coordination efforts going on worldwide as well. This website contains a list of open source computational resources. This link gives a list of data scientists who have banded together. But I think we need a worldwide plan if we are ever to return to normal activity. Even if some nation has eliminated the virus completely within its borders, there is always a chance of reintroduction from outside.

In terms of the models, they seem to have very weak predictive ability. This is probably because they are all overfit. We don't really understand all the mechanisms of SARS-CoV-2 propagation. The case and death curves are pretty simple, and as von Neumann or Ulam or someone once said, "give me 4 parameters and I can fit an elephant, give me 5 and I can make its tail wiggle." Almost any model can fit the curve, but to make a projection into the future you need to get the dynamics correct, and this, I claim, we have not done. What I am thinking of proposing is to have the equivalent of FDA approval for predictive models. However, instead of a binary decision of approval or non-approval, people could submit their models for a predictive score based on some cross-validation scheme or prediction on a held-out set. You could also submit as many times as you wish to have your score updated. We could then pool all the models and produce a global Bayesian model-averaged prediction and see if that does better. Let me know if you wish to participate or have ideas on how to do this better.

COVID-19 Paper

First draft of our paper can be downloaded from this link.  I’ll post more details about it later.

Global prediction of unreported SARS-CoV2 infection from observed COVID-19 cases

Carson C. Chow 1*, Joshua C. Chang 2,3, Richard C. Gerkin 4*, Shashaank Vattikuti 1*

1 Mathematical Biology Section, LBM, NIDDK, National Institutes of Health; 2 Epidemiology and Biostatistics Section, Rehabilitation Medicine, Clinical Center, National Institutes of Health; 3 mederrata; 4 School of Life Sciences, Arizona State University

*For correspondence contact carsonc@nih.gov, josh@mederrata.com, rgerkin@asu.edu, vattikutis@nih.gov

Summary: Estimation of infectiousness and fatality of the SARS-CoV-2 virus in the COVID-19 global pandemic is complicated by ascertainment bias resulting from incomplete and non-representative samples of infected individuals.  We developed a strategy for overcoming this bias to obtain more plausible estimates of the true values of key epidemiological variables.  We fit mechanistic Bayesian latent-variable SIR models to confirmed COVID-19 cases, deaths, and recoveries, for all regions (countries and US states) independently. Bayesian averaging over models, we find that the raw infection incidence rate underestimates the true rate by a factor, the case ascertainment ratio CARt that depends upon region and time. At the regional onset of COVID-19, the predicted global median was 13 infections unreported for each case confirmed (CARt = 0.07 C.I. (0.02, 0.4)). As the infection spread, the median CARt rose to 9 unreported cases for every one diagnosed as of April 15, 2020 (CARt = 0.1 C.I. (0.02, 0.5)).  We also estimate that the median global initial reproduction number R0 is 3.3 (C.I (1.5, 8.3)) and the total infection fatality rate near the onset is 0.17% (C.I. (0.05%, 0.9%)). However the time-dependent reproduction number Rt and infection fatality rate as of April 15 were 1.2 (C.I. (0.6, 2.5)) and 0.8% (C.I. (0.2%,4%)), respectively.  We find that there is great variability between country- and state-level values.  Our estimates are consistent with recent serological estimates of cumulative infections for the state of New York, but inconsistent with claims that very large fractions of the population have already been infected in most other regions.  For most regions, our estimates imply a great deal of uncertainty about the current state and trajectory of the epidemic.


Exponential growth

The spread of covid-19, and of epidemics in general, is all about exponential growth. Although I deal with exponential growth in my work daily, I don't have an immediate intuitive grasp of what it means in real life until I actually think about it. I write down functions of the form e^{rt} and treat them as abstract quantities. In fact, I often just look at differential equations of the form

\frac{dx}{dt} = r x

for which x = A e^{rt} is a solution, where A is any constant, because the derivative of an exponential function is an exponential function. One confusing aspect of discussing exponential growth with lay people is that the above equation is called a linear differential equation, and mathematicians or physicists will often simply say the dynamics are linear or even that the growth is linear, although what we really mean is that the describing equation is a linear differential equation, which has exponential growth, or exponential decay if r is negative.

If we let t represent time (say in days) then r is a growth rate. What x = e^{rt} means, taking r = 1 per day for concreteness, is that every day x, e.g. the number of people with covid-19, increases by a factor of e, which is about 2.718. Now, while this is the most convenient number for mathematical manipulation, it is definitely not the most convenient number for intuition. A common alternative is to use 2 as the base of the exponent, rather than e. This then means that tomorrow you will have double what you have today. I think 2 doesn't really convey how fast exponential growth is because you say, well, I start with 1 today, then tomorrow I have 2 and the day after 4, and, like the famous Simpsons episode where Homer, making sure that Bart is a name safe from mockery, goes through the alphabet saying "Aart, Bart, Cart, Dart, Eart, okay Bart is fine," you will have to go to day 10 before you get a thousandfold increase (1024 to be precise), which still doesn't seem too bad. But by day 30 you have 2 to the power of 30, or 2^{30}, which is about a billion. On day 31 you have 2 billion and on day 32 you have 4 billion. It's growing really fast.

Now, I think exponential growth really hits home if you use 10 as a base, because then you simply add a zero every day: 1, 10, 100, 1000, 10000, 100000. Imagine if your stock was increasing by a factor of 10 every day. Starting from a dollar you would have a million dollars by day 6 and a billion by day 9, be richer than Bill Gates by day 11, and have more money than the entire world by the third week. The only difference between exponential growth with base 2 versus base 10 is the rate of growth. This is where logarithms are very useful, because they are the inverse of an exponential. So, if x = 2^3 then \log_2 x = 3 (i.e. the log of x in base 2 is 3). The log just gives you the power of the exponential (in that base). For example, 10^x = 2^{(\log_2 10) x}, where \log_2 (10) = 3.3. So an exponential process that is doubling every day is increasing by a factor of ten every 3.3 days.
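The conversion is one line of arithmetic, shown here in Julia only for concreteness:

# Doubling every day is the same as growing tenfold every log2(10) ≈ 3.32 days.
println(log2(10))   # ≈ 3.32
println(2^10)       # 1024: ten doublings give roughly a thousandfold increase
println(2^30)       # 1073741824: thirty doublings give roughly a billionfold increase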

The thing about exponential growth is that most of the action is in the last few days. This is probably the hardest part to comprehend. The reason they were putting hospital beds in the convention center in New York weeks ago is that if the number of covid-19 cases is doubling every 5 days, then even if you are 100-fold under capacity today, you will be over capacity in 7 doublings, or 35 days, and the next week you will have twice as many as you can handle.

Flattening the curve means slowing the growth rate. If you can slow the rate of growth by a half, then you have 70 days before you get to 7 doublings and max out capacity. If you can make the rate of growth negative, then the number of cases will decrease and the epidemic will flame out. There are two ways you can make the growth rate go negative. The first is to use external measures like social distancing, and the second is to reach nonlinear saturation, which I'll discuss in a later post. This is also why social distancing measures seem so extreme and unnecessary: you need to apply them before there is a problem, and if they work then those beds in the convention center never get used. It's a no-win situation; if it works it will seem like an overreaction, and if it doesn't, hospitals will be overwhelmed. Given that 7 billion lives and livelihoods are at stake, it is not a decision to be taken lightly.
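The arithmetic behind those capacity numbers, as a small Julia sketch (the 100-fold headroom and the 5- and 10-day doubling times are the illustrative values used above):

# Days until capacity is exceeded, given current headroom and the doubling time.
days_to_capacity(headroom, doubling_days) = doubling_days * ceil(log2(headroom))
println(days_to_capacity(100, 5))    # 35.0 days with a 5-day doubling time
println(days_to_capacity(100, 10))   # 70.0 days if the growth rate is halved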


Covid-19 modeling

I have officially thrown my hat into the ring and joined the throngs of would-be Covid-19 modelers to try to estimate (I deliberately do not use predict) the progression of the pandemic. I will pull rank and declare that I kind of do this type of thing for a living. What I (and my colleagues whom I have conscripted) are trying to do is to estimate the fraction of the population that has the SARS-CoV-2 virus but has not been identified as such. This was the goal of the Lourenco et al. paper I wrote about previously that pulled me into this pit of despair. I argued that fitting to deaths alone, which is what they do, is insufficient for constraining the model and thus it has no predictive ability. What I’m doing now is seeing whether it is possible to do the job if you fit not only deaths but also the number of cases reported and cases recovered. You then have a latent variable model where the observed variables are cases, cases that die, and cases that recover, and the latent variables are the infected that are not identified and the susceptible population. Our plan is to test a wide range of models with various degrees of detail and complexity and use Bayesian Model Comparison to see which does a better job. We will apply the analysis to global data. We’re getting numbers now but I’m not sure I trust them yet so I’ll keep computing for a few more days. The full goal is to quantify the uncertainty in a principled way.

2020-04-06: edited for purging typos

Response to Oxford paper on covid-19

Here is my response to the paper from Oxford (Lourenco et al.) arguing that novel coronavirus infection may already be widespread in the UK and Italy. The result is based on fitting a disease spreading model, called an SIR model, to the cumulative number of deaths. SIR models usually consist of ordinary differential equations (ODEs) for the fraction of people in a given population who are susceptible to the infectious agent (S), the fraction infected (I), and the fraction recovered (R). There is one other state in the model, which is the fraction who die from the disease (D). The SIR model considers transitions between these states. In the case of ODEs, the states are treated as continuous quantities, which is not a bad approximation for a large population, and each equation in the model describes the rate of change of a state (hence differential equation). There are parameters in the model governing the rates of the different interactions in each equation; for example, there is a parameter for the rate of decrease in S (and increase in I) whenever an S interacts with an I, and there is a rate of loss of an I, which transitions into either R or D. The Oxford group models D somewhat differently. Instead of a transition from I into D, they consider that a fraction of (1-S) will die with some delay between time of infection and death.

They estimate the model parameters by fitting the model to the cumulative number of deaths. They did this instead of fitting directly to I because that is unreliable, as many people who have Covid-19 have not been tested. They also only fit to the first 15 days from the first recorded death, since they want to model what happens before social distancing was implemented. They find that the model is consistent with a scenario where the probability that an infected person gets severe enough to be flagged is low, and thus the disease is much more widespread than expected. I redid the analysis without assuming that the parameters need to have particular values (such assumptions are called priors in Bayesian inference and machine learning) and showed that a wide range of parameters will fit the data. This is because the model is under-constrained by death data alone, so even unrealistic parameters can work. To be fair, the authors only proposed that this is a possibility and thus that the population should be tested for antibodies to the coronavirus (SARS-CoV-2) to see if indeed there may already be herd immunity in place. However, the press has run with the result, and that is why I think it is important to examine it more closely.

The relevant Covid-19 fatality rate

Much has been written in the past few days about whether the case fatality rate (CFR) for Covid-19 is actually much lower than the original estimate of about 3 to 4%. Globally, the CFR is highly variable, ranging from half a percent in Germany to nearly 10% in Italy. The difference could be due to underlying differences in the populations or to the extent of testing. South Korea, which has done very wide-scale testing, has a CFR of around 1.5%. However, whether the CFR is high or low is not the important parameter. The number we must determine is the population fatality rate. Even if most of the people who become infected with SARS-CoV-2 have mild or even no symptoms, so that the CFR is low, if most people are susceptible and the entire world population gets the virus, then even a tenth of a percent of 7 billion is still a very large number (7 million people).

What we don't know yet is how much of the population is susceptible. Data from the cruise ship Diamond Princess showed that about 20% of the passengers and crew became infected, but there were still some social distancing measures in place after the first case was detected, so this does not necessarily imply that 80% of the world population is innately immune. A recent paper from Oxford argues that about half of the UK population may already have been infected and is no longer susceptible. However, I redid their analysis and find that widespread infection, although possible, is not very likely (details to follow — famous last words), but this can and should be verified by testing for antibodies in the population. The bottom line is that we need to test, test, and test, both for the virus and for antibodies, before we will know how bad this will be.