Interesting article in the New York Times today about how, to this day, people still do not know how magician David Berglass did his “Any Card At Any Number” trick. In this trick, a magician asks a person or two to name a card (e.g. Queen of Hearts) and a number (e.g. 37) and then, in a variety of ways, produces a deck where that card appears at that position in the deck. The supposed standard way to do the trick is for the magician to manipulate the cards in some way, but Berglass does his trick without ever touching the cards. He can even do the trick impromptu when you visit him by leading you to a deck of cards somewhere in the room or in his pocket that has the card in the correct place.

Now, I have no idea how he or anyone does this trick but one way to do it is to use “full enumeration”, i.e. hide decks where every possibility is accounted for; the trick is then to remember which deck has that choice. So then the question is how many decks would you need? Well, the minimal number of decks is 52 because a particular card could be in one of 52 positions. But how many more decks would you need? The answer is zero. 52 is all you need because for any particular arrangement of cards, each card is in one position. Then all you do is rotate all the cards by one, so the card in the first position is now in the second position for deck 2, the card in position 52 moves to position 1, and so on. Since each rotation puts a given card in a different position, the 52 rotations together cover every card at every position. What the magician can do is then hide 52 decks and remember the order of each deck. In the article he picked the reporter’s card to within 1 but claimed he may only be able to do it to within 2. That means he’s hiding the decks in groups of, say, 3 or 4, then points you to that location and lets you choose which deck.
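The rotation argument can be checked directly. The following is a minimal Julia sketch, not a claim about how Berglass actually does it; the card labels 1 to 52 and the helper name `find_deck` are assumptions made purely for illustration.

```julia
# Label the cards 1..52 (arbitrary labels) in one fixed starting order.
deck = collect(1:52)

# Deck s+1 is the starting deck rotated by s places: the card in position 1
# moves to position 1+s, and cards pushed past position 52 wrap to the front.
rotations = [circshift(deck, s) for s in 0:51]

# Hypothetical helper: which of the 52 decks has `card` at position `pos`?
# In the starting order card c sits at position c, so the shift is pos - card.
find_deck(card, pos) = mod(pos - card, 52) + 1

# Every (card, position) request is satisfied by one of the 52 rotations.
covered = all(rotations[find_deck(c, p)][p] == c for c in 1:52, p in 1:52)
println("All 52 x 52 requests covered: ", covered)
```

So under this scheme the only thing to memorize is the shift that maps a named card to a named position.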


# The probability of extraterrestrial life

Since the discovery of exoplanets nearly 3 decades ago, most astronomers, at least the public facing ones, seem to agree that it is just a matter of time before they find signs of life, such as the presence in the atmosphere of volatile gases associated with life like methane or oxygen. I’m an agnostic on the existence of life outside of Earth because we don’t have any clue as to how easy or hard it is for life to form. To me, it is equally possible that the visible universe is teeming with life or that we are alone. We simply do not know.

But what would happen if we find life on another planet? How would that change our expected probability for life in the universe? MIT astronomer Sara Seager once made an offhand remark in a podcast that finding another planet with life would make it very likely there were many more. But is this true? Does the existence of another planet with life mean a dramatic increase in the probability of life in the universe? We can find out by doing the calculation.

Suppose you believe that the probability of life on a planet is $p$ (i.e. the fraction of planets with life) and this probability is uniform across the universe. Then if you search $n$ planets, the probability for the number of planets with life you will find is given by a Binomial distribution. The probability that there are $x$ planets with life is given by the expression $P(x|n,p) = N(x,n)\, p^x (1-p)^{n-x}$, where $N(x,n)$ is a factor (the binomial coefficient) such that the sum of $P(x|n,p)$ over $x$ from zero to $n$ is 1. By Bayes' theorem, the posterior probability for $p$ (yes, that would be the probability of a probability) is given by

$$P(p|x,n) = \frac{P(x|n,p)\,P(p)}{P(x|n)}$$

where $P(x|n) = \int_0^1 P(x|n,p)\,P(p)\,dp$. As expected, the posterior depends strongly on the prior. A convenient way to express the prior probability is to use a Beta distribution

$$P(p|\alpha,\beta) = B(\alpha,\beta)^{-1}\, p^{\alpha-1} (1-p)^{\beta-1} \qquad (*)$$

where $B(\alpha,\beta)$ is again a normalization constant (the Beta function). The mean of a Beta distribution is given by $E(p) = \alpha/(\alpha+\beta)$ and the variance, which is a measure of uncertainty, is given by $\mathrm{Var}(p) = \alpha\beta/((\alpha+\beta)^2(\alpha+\beta+1))$. The posterior distribution for $p$ after observing $x$ planets with life out of $n$ will be

$$P(p|x,n,\alpha,\beta) = \mathcal{N}\, p^{\alpha+x-1} (1-p)^{\beta+n-x-1}$$

where $\mathcal{N}$ is a normalization factor. This is again a Beta distribution. The Beta distribution is called the conjugate prior for the Binomial because its form is preserved in the posterior.

Applying Bayes' theorem to the prior in equation (*), we see that the mean and variance of the posterior become $(\alpha+x)/(\alpha+\beta+n)$ and $(\alpha+x)(\beta+n-x)/((\alpha+\beta+n)^2(\alpha+\beta+n+1))$, respectively. Now let’s consider how our priors have updated. Suppose our prior was $\alpha = \beta = 1$, which gives a uniform distribution for $p$ on the range 0 to 1. It has a mean of 1/2 and a variance of 1/12. If we find one planet with life after checking 10,000 planets then our expected $p$ becomes 2/10002 with a variance of about $2 \times 10^{-8}$. The observation of a single planet has greatly reduced our uncertainty and we now expect about 1 in 5000 planets to have life. Now what happens if we find no planets? Then our expected $p$ only drops to 1 in 10000 and the variance is of the same order. So, the difference between finding a planet versus not finding a planet only halves our posterior if we had no prior bias. But suppose we are really skeptical and have a prior with $\alpha = 0$ and $\beta = 1$ so our expected probability is zero with zero variance. The observation of a single planet increases our posterior to 1 in 10001 with about the same small variance. However, if we find a single planet out of much fewer observations like 100, then our expected probability for life would be even higher but with more uncertainty. In any case, Sara Seager’s intuition is correct: finding a planet would be a game changer and not finding one shouldn’t really discourage us that much.
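The numbers quoted above can be reproduced with a few lines of Julia. This is only a sketch under the stated Beta-Binomial assumptions: `α` and `β` stand for the two Beta prior parameters, `x` for the planets found with life, `n` for the planets searched, and the function names are my own.

```julia
# Posterior parameters for a Beta(α, β) prior after x planets with life out of n.
posterior_params(α, β, x, n) = (α + x, β + n - x)

# Mean and variance of a Beta(a, b) distribution.
beta_mean(a, b) = a / (a + b)
beta_var(a, b) = a * b / ((a + b)^2 * (a + b + 1))

# Each case is (prior α, prior β, planets with life x, planets searched n).
cases = [(1, 1, 1, 10_000), (1, 1, 0, 10_000), (0, 1, 1, 10_000), (1, 1, 1, 100)]
for (α, β, x, n) in cases
    a, b = posterior_params(α, β, x, n)
    println("prior Beta($α, $β), x = $x, n = $n: mean = ", beta_mean(a, b),
            ", variance = ", beta_var(a, b))
end
```

Changing the entries of `cases` is an easy way to see how strongly the conclusion depends on the prior and on the number of planets searched.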

# The machine learning president

For the past four years, I have been unable to post with any regularity. I have dozens of unfinished posts sitting in my drafts folder. I would start with a thought but then get stuck, which had previously been somewhat unusual for me. Now on this first hopeful day I have had for the past four trying years, I am hoping I will be able to post more regularly again.

Prior to what I will now call the Dark Years, I viewed all of history through an economic lens. I bought into the standard twentieth century leftist academic notion that wars, conflicts, social movements, and cultural changes all have economic underpinnings. But I now realize that this is incorrect or at least incomplete. Economics surely plays a role in history but what really motivates people are stories, and stories are what led us to the Dark Years and perhaps what will get us out.

Trump became president because he had a story. The insurrectionists who stormed the Capitol had a story. It was a bat shit crazy lunatic story but it was still a story. However, the tragic thing about the Trump story (or rather my story of the Trump story) is that it is an unintentional algorithmically generated story. Trump is the first (and probably not last) purely machine learning president (although he may not consciously know that). Everything he did was based on the feedback he got from his Twitter Tweets and Fox News. His objective function was attention and he would do anything to get more attention. Of the many lessons we will take from the Dark Years, one should be how machine learning and artificial intelligence can go so very wrong. Trump’s candidacy and presidency was based on a simple stochastic greedy algorithm for attention. He would Tweet randomly and follow up on the Tweets that got the most attention. However, the problem with a greedy algorithm (and yes that is a technical term that just happens to coincidentally be apropos) is that once you follow a path it is hard to make a correction. I actually believe that if some of Trump’s earliest Tweets from say 2009-2014 had gone another way, he could have been a different president. Unfortunately, one of his early Tweet themes that garnered a lot of attention was on the Obama birther conspiracy. This lit up both racist Twitter and a counter reaction from liberal Twitter, which led him further to the right and ultimately to the presidency. His innate prejudices biased him towards a darker path and he did go completely unhinged after he lost the election but he is unprincipled and immature enough to change course if he had enough incentive to do so.

Unlike standard machine learning for categorizing images or translating languages, the Trump machine learning algorithm changes the data. Every Tweet alters the audience and the reinforcing feedback between Trump’s Tweets and the reaction to them can manufacture discontent out of nothing. A person could just happen to follow Trump because they like The Apprentice, the reality show Trump starred in, and be having a bad day because they missed the bus or didn’t get a promotion. Then they see a Trump Tweet, follow the link in it and suddenly they find a conspiracy theory that “explains” why they feel disenchanted. They retweet and this repeats. Trump sees what goes viral and Tweets more on the same topic. This positive feedback loop just generated something out of random noise. The conspiracy theorizing then starts its own reinforcing feedback loop and before you know it we have a crazed mob bashing down the Capitol doors with impunity.

Ironically Trump, who craved and idolized power, failed to understand the power he actually had and if he had a better algorithm (or just any strategy at all), he would have been reelected in a landslide. Even before he was elected, Trump had already won over the far right and he could have started moving in any direction he wished. He could have moderated on many issues. Even maintaining his absolute ignorance of how governing actually works, he could have had his wall by having it be part of actual infrastructure and immigration bills. He could have directly addressed the COVID-19 pandemic. He would not have lost much of his base and would have easily gained an extra 10 million votes. Maybe, just maybe, if liberal Twitter had simply ignored the early incendiary Tweets and only responded to the more productive ones, they could have moved him a bit too. Positive reinforcement is how they train animals after all.

Now that Trump has shown how machine learning can win a presidency, it is only a matter of time before someone harnesses it again and more effectively. I just hope that person is not another narcissistic sociopath.

# Math solved Zeno’s paradox

On some rare days when the sun is shining and I’m enjoying a well made kouign-amann (my favourite comes from b.patisserie in San Francisco but Patisserie Poupon in Baltimore will do the trick), I find a brief respite from my usual depressed state and take delight, if only for a brief moment, in the fact that mathematics completely resolved Zeno’s paradox. To me, it is the quintessential example of how mathematics can fully solve a philosophical problem and it is a shame that most people still don’t seem to know or understand this monumental fact. Although there are probably thousands of articles on Zeno’s paradox on the internet (I haven’t bothered to actually check), I feel like visiting it again today even without a kouign-amann in hand.

I don’t know what the original statement of the paradox is but they all involve motion from one location to another like walking towards a wall or throwing a javelin at a target. When you walk towards a wall, you must first cross half the distance, then half the remaining distance, and so on forever. The paradox is thus: How then can you ever reach the wall, or a javelin reach its target, if it must traverse an infinite number of intervals? This paradox is completely resolved by the concept of the mathematical limit, which Newton used to invent calculus in the seventeenth century. I think understanding the limit is the greatest leap a mathematics student must take in all of mathematics. It took mathematicians two centuries to fully formalize it although we don’t need most of that machinery to resolve Zeno’s paradox. In fact, you need no more than middle school math to solve one of history’s most famous problems.

The solution to Zeno’s paradox stems from the fact that if you move at constant velocity then it takes half the time to cross half the distance and the sum of an infinite number of intervals that are half as long as the previous interval adds up to a finite number. That’s it! It doesn’t take forever to get anywhere because you are adding an infinite number of things that get infinitesimally smaller. The sum of a bunch of terms is called a series and the sum of an infinite number of terms is called an infinite series. The beautiful thing is that we can compute this particular infinite series exactly, which is not true of all series.

Expressed mathematically, the total time it takes for an object traveling at constant velocity to reach its target is

$$T = \frac{d}{2v} + \frac{d}{4v} + \frac{d}{8v} + \cdots$$

which can be rewritten as

$$T = \frac{d}{v} \sum_{n=1}^{\infty} \frac{1}{2^n}$$

where $d$ is the distance and $v$ is the velocity. This infinite series is technically called a geometric series because the ratio of two subsequent terms in the series is always the same. The terms are related geometrically like the volumes of n-dimensional cubes when you halve the length of the sides (e.g. 1-cube (line, and volume is length), 2-cube (square, and volume is area), 3-cube (good old cube and volume), 4-cube (hypercube and hypervolume), etc.).

For simplicity we can take $d/v = 1$. So to compute the time it takes to travel the distance, we must compute:

$$T = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \cdots$$

To solve this sum, the first thing is to notice that we can factor out $1/2$ and obtain

$$T = \frac{1}{2}\left(1 + \frac{1}{2} + \frac{1}{4} + \cdots\right)$$

The quantity inside the bracket is just the original series plus 1, i.e.

$$1 + \frac{1}{2} + \frac{1}{4} + \cdots = 1 + T$$

and thus we can substitute this back into the original expression for $T$ and obtain

$$T = \frac{1}{2}(1 + T)$$

Now, we simply solve for $T$ and I’ll actually go over all the algebraic steps. First multiply both sides by 2 and get

$$2T = 1 + T$$

Now, subtract $T$ from both sides and you get the beautiful answer that $T = 1$. We then have the amazing fact that

$$\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \cdots = 1$$

I never get tired of this. In fact this generalizes to any geometric series

$$a + a^2 + a^3 + \cdots = \frac{a}{1-a}$$

for any $a$ whose magnitude is less than 1. The more compact way to express this is

$$\sum_{n=1}^{\infty} a^n = \frac{a}{1-a}$$

Now, notice that in this formula if you set $a = 1$, you get $1/0$, which is infinity. Since $1^n = 1$ for any $n$, this tells you that if you try to add up an infinite number of ones, you’ll get infinity. Now if you set $a > 1$ you’ll get a negative number. Does this mean that the sum of an infinite number of positive numbers greater than 1 is a negative number? Well no, because the series is only defined for $|a| < 1$, which is called the domain of convergence. If you go outside of the domain, you can still get an answer but it won’t be the answer to your question. You always need to be careful when you add and subtract infinite quantities. Depending on the circumstance it may or may not give you sensible answers. Getting that right is what math is all about.
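A quick numerical check makes the convergence concrete. The sketch below is in Julia, with `partial_sum` and `closed_form` as names chosen here for illustration and `a` standing for the common ratio of the geometric series. It adds up the first few terms for a ratio of 1/2 and then shows what goes wrong for a ratio of 2.

```julia
# Partial sum a + a^2 + ... + a^N of a geometric series.
partial_sum(a, N) = sum(a^n for n in 1:N)

# Closed form of the infinite sum, meaningful only when |a| < 1.
closed_form(a) = a / (1 - a)

# For a = 1/2 the partial sums creep up towards the closed form.
for N in (1, 2, 5, 10, 20, 50)
    println("a = 1/2, N = ", N, ": partial sum = ", partial_sum(0.5, N))
end
println("a = 1/2: closed form = ", closed_form(0.5))

# Outside the domain of convergence the formula and the partial sums disagree.
println("a = 2: closed form = ", closed_form(2.0),
        ", 10-term partial sum = ", partial_sum(2.0, 10))
```

For a ratio of 2 the partial sums just keep growing while the formula returns a negative number, which is the domain-of-convergence warning in numerical form.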

# How to make a fast but bad COVID-19 test good

Among the myriad of problems we are having with the COVID-19 pandemic, faster testing is one we could actually improve. The standard test for the presence of SARS-CoV-2 virus uses PCR (polymerase chain reaction), which amplifies targeted viral RNA. It is accurate (high specificity) but requires relatively expensive equipment and reagents that are currently in short supply. There are reports of wait times of over a week, which renders a test useless for contact tracing.

An alternative to PCR is an antigen test that tests for the presence of protein fragments associated with COVID-19. These tests can in principle be very cheap and fast, and could even be administered on paper strips. They are generally much less reliable than PCR and thus have not been widely adopted. However, as I show below, by applying the test multiple times the noise can be suppressed and a poor test can be made arbitrarily good.

The performance of a binary test is usually gauged by two quantities: sensitivity and specificity. Sensitivity is the probability that you test positive given that you actually are positive, i.e. are infected (true positive rate). Specificity is the probability that you test negative if you actually are negative (true negative rate). For a pandemic, sensitivity is more important than specificity because missing someone who is infected means you could put lots of people at risk while a false positive just means the person falsely testing positive is inconvenienced (provided they cooperatively self-isolate). Current PCR tests have very high specificity but relatively low sensitivity (as low as 0.7) and since we don’t have enough capability to retest, a lot of tested infected people could be escaping detection.

The way to make any test have arbitrarily high sensitivity and specificity is to apply it multiple times and take some sort of average. However, you want to do this with the fewest number of applications. Suppose we administer $n$ tests on the same subject. The probability of getting more than $k$ positive tests if the person is positive is $Q(k,n,q) = 1 - \mathrm{CDF}(k|n,q)$, where $q$ is the sensitivity of a single test and $\mathrm{CDF}$ is the cumulative distribution function of the Binomial distribution (i.e. the probability that the number of Binomial distributed events is less than or equal to $k$). If the person is negative then the probability of $k$ or fewer positives is $R(k,n,r) = \mathrm{CDF}(k|n,1-r)$, where $r$ is the specificity of a single test. We thus want to find the minimal $n$ given a desired sensitivity and specificity, $q'$ and $r'$. This means that we need to solve the constrained optimization problem: find the minimal $n$ under the constraint that $k < n$, $Q(k,n,q) \ge q'$ and $R(k,n,r) \ge r'$. $Q$ decreases and $R$ increases with increasing $k$ and *vice versa* for $n$. We can easily solve this problem by sequentially increasing $n$ and scanning through $k$ until the two constraints are met. I’ve included the Julia code to do this below. For example, starting with a test with sensitivity .7 and specificity 1 (like a PCR test), you can create a new test with greater than .95 sensitivity and specificity by administering the test 3 times and looking for at least one positive test. However, if the specificity drops to .7 then you would need to find more than 8 positives out of 17 applications to be 95% sure you have COVID-19.

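```julia
using Distributions

# Probability of more than k positives in n tests when the subject is positive
# (q is the sensitivity of a single test).
function Q(k, n, q)
    d = Binomial(n, q)
    return 1 - cdf(d, k)
end

# Probability of k or fewer positives in n tests when the subject is negative
# (r is the specificity of a single test).
function R(k, n, r)
    d = Binomial(n, 1 - r)
    return cdf(d, k)
end

# Find the smallest n (and threshold k) that meets the desired sensitivity qp
# and specificity rp. Returns (0, 0) if no n up to 100 works.
function optimizetest(q, r, qp=.95, rp=.95)
    nout = 0
    kout = 0
    for n in 1:100
        for k in 0:n-1
            println(R(k, n, r), " ", Q(k, n, q))
            if R(k, n, r) >= rp && Q(k, n, q) >= qp
                kout = k
                nout = n
                break
            end
        end
        if nout > 0
            break
        end
    end
    return nout, kout
end
```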

# Slides for Covid-19 talk

Here are my slides for my recent COVID-19 talk at the Centre for Applied Mathematics in BioScience and Medicine (CAMBAM). It’s an updated version of the one I gave to the FDA.

# A Covid-19 Manhattan Project

Right now there are hundreds if not thousands of Covid-19 models floating out there. Some are better than others and some have much more influence than others and the ones that have the most influence are not necessarily the best. There must be a better way of doing this. The world’s greatest minds convened in Los Alamos in WWII and built two atomic bombs. Metaphors get thrown around with reckless abandon but if there ever was a time for a Manhattan project, we need one now. Currently, the world’s scientific community has mobilized to come up with models to predict the spread and effectiveness of mitigation efforts, to produce new therapeutics and to develop new vaccines. But this is mostly going on independently.

Would it be better if we were to coordinate *all* of this activity? Right now at the NIH, there is an intense effort to compile all the research that is being done in the NIH Intramural Program and to develop a system where people can share reagents and materials. There are other coordination efforts going on worldwide as well. This website contains a list of open source computational resources. This link gives a list of data scientists who have banded together. But I think we need a worldwide plan if we are ever to return to normal activity. Even if some nation has eliminated the virus completely within its borders there is always a chance of reinfection from outside.

In terms of models, they seem to have very weak predictive ability. This is probably because they are all overfit. We don’t really understand all the mechanisms of SARS-CoV-2 propagation. The case or death curves are pretty simple and as Von Neumann or Ulam or someone once said, “give me 4 parameters and I can fit an elephant, give me 5 and I can make its tail wiggle.” Almost any model can fit the curve but to make a projection into the future, you need to get the dynamics correct and this, I claim, we have not done. What I am thinking of proposing is to have the equivalent of FDA approval for predictive models. However, instead of a binary decision of approval or non-approval, people could submit their models for a predictive score based on some cross validation scheme or prediction on a held out set. You could also submit as many times as you wish to have your score updated. We could then pool all the models and produce a global Bayesian model averaged prediction and see if that does better. Let me know if you wish to participate or have ideas on how to do this better.

# Covid-19 talk

Here are the slides for my webinar at the FDA today.

# COVID-19 Paper

First draft of our paper can be downloaded from this link. I’ll post more details about it later.

Global prediction of unreported SARS-CoV2 infection from observed COVID-19 cases

Carson C. Chow 1*, Joshua C. Chang 2,3, Richard C. Gerkin 4*, Shashaank Vattikuti 1*

1 Mathematical Biology Section, LBM, NIDDK, National Institutes of Health; 2 Epidemiology and Biostatistics Section, Rehabilitation Medicine, Clinical Center, National Institutes of Health; 3 mederrata; 4 School of Life Sciences, Arizona State University

*For correspondence contact carsonc@nih.gov, josh@mederrata.com, rgerkin@asu.edu, vattikutis@nih.gov

**Summary:** Estimation of infectiousness and fatality of the SARS-CoV-2 virus in the COVID-19 global pandemic is complicated by ascertainment bias resulting from incomplete and non-representative samples of infected individuals. We developed a strategy for overcoming this bias to obtain more plausible estimates of the true values of key epidemiological variables. We fit mechanistic Bayesian latent-variable SIR models to confirmed COVID-19 cases, deaths, and recoveries, for all regions (countries and US states) independently. Bayesian averaging over models, we find that the raw infection incidence rate underestimates the true rate by a factor, the case ascertainment ratio $\mathrm{CAR}_t$, that depends upon region and time. At the regional onset of COVID-19, the predicted global median was 13 infections unreported for each case confirmed ($\mathrm{CAR}_t$ = 0.07 C.I. (0.02, 0.4)). As the infection spread, the median $\mathrm{CAR}_t$ rose to 9 unreported cases for every one diagnosed as of April 15, 2020 ($\mathrm{CAR}_t$ = 0.1 C.I. (0.02, 0.5)). We also estimate that the median global initial reproduction number $R_0$ is 3.3 (C.I. (1.5, 8.3)) and the total infection fatality rate near the onset is 0.17% (C.I. (0.05%, 0.9%)). However the time-dependent reproduction number $R_t$ and infection fatality rate as of April 15 were 1.2 (C.I. (0.6, 2.5)) and 0.8% (C.I. (0.2%, 4%)), respectively. We find that there is great variability between country- and state-level values. Our estimates are consistent with recent serological estimates of cumulative infections for the state of New York, but inconsistent with claims that very large fractions of the population have already been infected in most other regions. For most regions, our estimates imply a great deal of uncertainty about the current state and trajectory of the epidemic.

# Exponential growth

The spread of covid-19 and epidemics in general is all about exponential growth. Although I deal with exponential growth in my work daily, I don’t have an immediate intuitive grasp for what it means in life until I actually think about it. I write down functions of the form $e^{rt}$ and treat them as some abstract quantity. In fact, I often just look at differential equations of the form

$$\frac{dx}{dt} = r x$$

for which $x = A e^{rt}$ is a solution, where $A$ is any constant number, because the derivative of an exponential function is an exponential function. One confusing aspect of discussing exponential growth with lay people is that the above equation is called a linear differential equation and often mathematicians or physicists will simply say the dynamics are linear or even the growth is linear, although what we really mean is that the describing equation is a linear differential equation, which has exponential growth if $r$ is positive or exponential decay if $r$ is negative.

If we let $t$ represent time (say days) then $r$ is a growth rate. What $x = e^{rt}$ actually means, taking $r = 1$ per day, is that every day $x$, e.g. the number of people with covid-19, is increasing by a factor of $e$, which is about 2.718. Now, while this is the most convenient number for mathematical manipulation, it is definitely not the most convenient number for intuition. A common alternative is to use 2 as the base of the exponent, rather than $e$. This then means that tomorrow you will have double what you have today. I think 2 doesn’t really convey how fast exponential growth is because you say, well, I start with 1 today and then tomorrow I have 2 and the day after 4. It is like the famous Simpsons episode where Homer, making sure that Bart is a name safe from mockery, goes through the alphabet and says “Aart, Bart, Cart, Dart, Eart, okay Bart is fine.” You will have to go to day 10 before you get to a thousand fold increase (1024 to be precise), which still doesn’t seem too bad. But 30 days later you have 2 to the power of 30 or $2^{30}$, which is about a billion. On day 31 you have 2 billion and on day 32 you have 4 billion. It’s growing really fast.

Now, I think exponential growth really hits home if you use 10 as a base because now you simply add a zero for every day: 1, 10, 100, 1000, 10000, 100000. Imagine if your stock was increasing by a factor of 10 every day. Starting from a dollar you would have a million dollars by day 6 and a billion by day 9, be richer than Bill Gates by day 11, and have more money than the entire world by the third week. Now, the only difference between exponential growth with base 2 versus base 10 is the rate of growth. This is where logarithms are very useful because they are the inverse of an exponential. So, if $x = 2^3$ then $\log_2 x = 3$ (i.e. the log of $x$ in base 2 is 3). Log just gives you the power of an exponential (in that base). For example $10^{t} = 2^{ct}$, where $c = \log_2 10 \approx 3.3$. So an exponential process that is doubling every day is increasing by a factor of ten every 3.3 days.
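To put numbers on the comparison between bases, here is a small Julia sketch. The function name `growth` and the variable `c` are chosen here for illustration, and the starting value of 1 is an assumption matching the examples in the text.

```julia
# Size after t days of a quantity that starts at 1 and is multiplied by `base` each day.
growth(base, t) = float(base)^t

for t in (1, 10, 20, 30)
    println("day ", t, ": base 2 gives ", growth(2, t), ", base 10 gives ", growth(10, t))
end

# Changing the base only rescales time: 10 = 2^c with c = log2(10).
c = log2(10)
println("a daily doubling grows tenfold every ", round(c, digits=2), " days")
```

The two columns describe the same kind of curve; the base 10 one simply runs about 3.3 times faster.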

The thing about exponential growth is that most of the action is in the last few days. This is probably the hardest part to comprehend. So the reason they were putting hospital beds in the convention center in New York weeks ago is because if the number of covid-19 cases is doubling every 5 days then even if you are 100 fold under capacity today, you will be over capacity in 7 doublings or 35 days and the next week you will have twice as many as you can handle.

Flattening the curve means slowing the growth rate. If you can slow the rate of growth by a half then you have 70 days before you get to 7 doublings and max out capacity. If you can make the rate of growth negative then the number of cases will decrease and the epidemic will flame out. There are two ways you can make the growth rate go negative. The first is to use external measures like social distancing and the second is to reach nonlinear saturation, which I’ll discuss in a later post. This is also why social distancing measures seem so extreme and unnecessary: you need to apply them before there is a problem and if they work then those beds in the convention center never get used. It’s a no-win situation. If it works then it will seem like an overreaction and if it doesn’t then hospitals will be overwhelmed. Given that 7 billion lives and livelihoods are at stake, it is not a decision to be taken lightly.
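The capacity arithmetic is just counting doublings, as in this Julia sketch. The names `doublings_needed` and `days_to_capacity` are hypothetical, `headroom` means how many fold under capacity you are today, and halving the growth rate is treated as doubling the doubling time.

```julia
# Doublings needed to reach or exceed capacity when you are `headroom`-fold under it today.
doublings_needed(headroom) = ceil(Int, log2(headroom))

# Days until capacity is reached for a given doubling time (in days).
days_to_capacity(headroom, doubling_time) = doublings_needed(headroom) * doubling_time

println(days_to_capacity(100, 5))   # doubling every 5 days
println(days_to_capacity(100, 10))  # growth rate halved, so doubling every 10 days
```

Because the count of doublings depends only on the logarithm of the headroom, even a thousand fold margin buys only ten doublings.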

# Covid-19 modeling

I have officially thrown my hat into the ring and joined the throngs of would-be Covid-19 modelers to try to estimate (I deliberately do not use predict) the progression of the pandemic. I will pull rank and declare that I kind of do this type of thing for a living. What I (and my colleagues whom I have conscripted) are trying to do is to estimate the fraction of the population that has the SARS-CoV-2 virus but has not been identified as such. This was the goal of the Lourenco et al. paper I wrote about previously that pulled me into this pit of despair. I argued that fitting to deaths alone, which is what they do, is insufficient for constraining the model and thus it has no predictive ability. What I’m doing now is seeing whether it is possible to do the job if you fit not only deaths but also the number of cases reported and cases recovered. You then have a latent variable model where the observed variables are cases, cases that die, and cases that recover, and the latent variables are the infected that are not identified and the susceptible population. Our plan is to test a wide range of models with various degrees of detail and complexity and use Bayesian Model Comparison to see which does a better job. We will apply the analysis to global data. We’re getting numbers now but I’m not sure I trust them yet so I’ll keep computing for a few more days. The full goal is to quantify the uncertainty in a principled way.

2020-04-06: edited for purging typos

# Response to Oxford paper on covid-19

Here is my response to the paper from Oxford (Lourenco et al.) arguing that novel coronavirus infection may already be widespread in the UK and Italy. The result is based on fitting a disease spreading model, called an SIR model, to the cumulative number of deaths. SIR models usually consist of ordinary differential equations (ODEs) for the fraction of people in a given population who are susceptible to the infectious agent (S), the fraction infected (I), and the fraction recovered (R). There is one other state in the model, which is the fraction who die from the disease (D). The SIR model considers transitions between these states. In the case of ODEs, the states are treated as continuous quantities, which is not a bad approximation for a large population, and each equation in the model describes the rate of change of a state (hence differential equation). There are parameters in the model governing the rate of different interactions in each equation. For example, there is a parameter for the rate of increase in I (and the matching decrease in S) whenever an S interacts with an I, and then there is a rate of loss of an I, which transitions into either R or D. The Oxford group model D somewhat differently. Instead of a transition from I into D they consider that a fraction of (1-S) will die with some delay between time of infection and death.
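For readers who have not seen one, a generic SIR model with a death compartment can be sketched in a few lines of Julia. This is the textbook version with a direct transition from I into D, not the Oxford group's delayed formulation, and the parameter names and values (`beta`, `gamma`, `f`, `I0`, `dt`) are illustrative assumptions rather than fitted numbers.

```julia
# Forward-Euler integration of an SIR model with a death compartment.
# S, I, R, D are fractions of the population; all parameter values are illustrative.
function simulate_sird(; beta=0.3, gamma=0.1, f=0.01, I0=1e-6, days=200, dt=0.1)
    S, I, R, D = 1 - I0, I0, 0.0, 0.0
    for _ in 1:round(Int, days / dt)
        new_infections = beta * S * I * dt   # S -> I whenever an S meets an I
        removals = gamma * I * dt            # I -> R or D
        S -= new_infections
        I += new_infections - removals
        R += (1 - f) * removals              # fraction 1 - f of removals recover
        D += f * removals                    # fraction f of removals die
    end
    return (S=S, I=I, R=R, D=D)
end

final = simulate_sird()
println(final)
```

Very different combinations of `beta`, `gamma` and `f` can produce similar early death curves, which is the sense in which deaths alone leave such a model under-constrained.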

They estimate the model parameters by fitting the model to the cumulative number of deaths. They did this instead of fitting directly to I because that is unreliable as many people who have Covid-19 have not been tested. They also only fit to the first 15 days from the first recorded death since they want to model what happens before social distancing was implemented. They find that the model is consistent with a scenario where the probability that an infected person gets severe enough to be flagged is low and thus the disease is much more widespread than expected. I redid the analysis without assuming that the parameters need to have particular values (called priors in Bayesian inference and machine learning) and showed that a wide range of parameters will fit the data. This is because the model is under-constrained by death data alone so even unrealistic parameters can work. To be fair, the authors only proposed that this is a possibility and thus the population should be tested for antibodies to the coronavirus (SARS-CoV-2) to see if indeed there may already be herd immunity in place. However, the press has run with the result and that is why I think it is important to examine the result more closely.

# The tragedy of low probability events

We live in an age of fear and yet life (in the US at least) is the safest it has ever been. Megan McArdle blames coddling parents and the media in a Washington Post column. She argues that cars and swimming pools are much more dangerous than school shootings and kidnappings yet we mostly ignore the former and obsess about the latter. However, to me dying from an improbable event is just so much more tragic than dying from an expected one. I would be much less despondent meeting St. Peter at the Pearly Gates if I happened to expire from cancer or heart disease than if I were to be hit by an asteroid while weeding my garden. We are so scared now because we have never been safer. We would fear terrorist attacks less if they were more frequent. This is the reason that I would never want a major increase in lifespan. I most certainly would like to last long enough to see my children become independent but anything beyond that is bonus time. Nothing could be worse to me than immortality. The pain of any tragedy would be unbearable. Life would consist of an endless accumulation of sad memories. The way out is to forget but that to me is no different from death. What would be the point of living forever if you were to erase much of it. What would a life be if you forgot the people and things that you loved? To me that is no life at all.

# Optimizing luck

Each week on the NPR podcast How I Built This, host Guy Raz interviews a founder of a successful enterprise like James Dyson or Ben and Jerry. At the end of most segments, he’ll ask the founder how much of their success they attribute to luck and how much to talent. In most cases, the founder will modestly say that luck played a major role but some will add that they did take advantage of the luck when it came. One common thread for these successful people is that they are extremely resilient and aren’t afraid to try something new when things don’t work at first.

There are two ways to look at this. On the one hand there is certainly some selection bias. For each one of these success stories there are probably hundreds of others who were equally persistent and worked equally hard but did not achieve the same success. It is like the infamous con where you send 1024 people a two-outcome prediction about a stock, telling half of them it will go up and the other half it will go down. The prediction will be correct for 512 of them so the next week you send those people another prediction and so on. After 10 weeks, one person will have received the correct prediction 10 times in a row and will think you are infallible. You then charge them a king’s ransom for the next one.
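The arithmetic of the con can be checked with a few lines of Python. This is only a sketch: the function name `remaining_believers` is made up for illustration, and it assumes the mailing list is split exactly in half each week so that one half always receives the correct call.

```python
def remaining_believers(recipients: int, weeks: int) -> int:
    """Count people who have seen only correct predictions after `weeks` rounds.

    Assumption: each week half the list is told "up" and half "down",
    so exactly half of the remaining recipients see a correct prediction.
    """
    for _ in range(weeks):
        recipients //= 2  # only the half with the correct call stays on the list
    return recipients


if __name__ == "__main__":
    for week in range(11):
        print(week, remaining_believers(1024, week))
```

Since 1024 is 2 to the 10th power, ten halvings leave exactly one person who has seen ten correct predictions in a row.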

Yet, it may be possible to optimize luck and you can see this with Jensen’s inequality. Suppose $x$ represents some combination of your strategy and effort level and $\phi(x)$ is your outcome function. Jensen’s inequality states that the average or expectation value of a convex function (i.e. a function that bends upwards) is greater than (or equal to) the function of the expectation value. Thus, $E[\phi(x)] \ge \phi(E[x])$. In other words, if your outcome function is convex then your average outcome will be larger just by acting in a random fashion. During “convex” times, the people who just keep trying different things will invariably be more successful than those who do nothing. They were lucky (or they recognized) that their outcome was convex but their persistence and willingness to try anything was instrumental in their success. The flip side is that if they were in a nonconvex era, their random actions would have led to a much worse outcome. So, do you feel lucky?
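Jensen’s inequality is easy to check numerically. The sketch below is only an illustration under assumed choices: the random "actions" are drawn uniformly between 0 and 2, the convex outcome function is taken to be the square, and the concave one the square root. None of these specifics come from the argument above; they are just convenient stand-ins.

```python
import random
import statistics


def jensen_gap(outcome, actions):
    """Return E[outcome(x)] - outcome(E[x]) over the sampled actions."""
    mean_of_outcomes = statistics.fmean(outcome(x) for x in actions)
    outcome_of_mean = outcome(statistics.fmean(actions))
    return mean_of_outcomes - outcome_of_mean


rng = random.Random(0)  # fixed seed so the run is repeatable
actions = [rng.uniform(0.0, 2.0) for _ in range(10_000)]

# Convex outcome (assumed: x squared): randomness helps on average.
convex_gap = jensen_gap(lambda x: x ** 2, actions)

# Concave outcome (assumed: square root): randomness hurts on average.
concave_gap = jensen_gap(lambda x: x ** 0.5, actions)

print("convex gap:", convex_gap)
print("concave gap:", concave_gap)
```

The gap is positive for the convex function and negative for the concave one, which is the “flip side” of trying things at random when the outcome function bends the wrong way.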

# Catch-22 of our era

The screen on my wife’s iPhone was shattered this week and she had not backed up the photos. The phone seems to still be functioning otherwise so we plugged it into the computer to try to back it up but it requires us to unlock the phone and we can’t enter in the password. My wife refused to pay the 99 cents or whatever Apple charges to increase the disk space for iCloud to automatically back up the phone, so I suggested we just pay the ransom money and then the phone will back up automatically. I currently pay both Apple and Dropbox extortion money. However, since she hadn’t logged onto iCloud in maybe ever, it sent a code to her phone under the two-factor authentication scheme to type in to the website, but of course we can’t see it on her broken screen so that idea is done. We called Apple and they said you could try to change the number on her iCloud account to my phone but that was two days ago and they haven’t complied. So my wife gave up and tried to order a new phone. Under the new system of her university, which provides her phone, she can get a phone if she logs onto this site to request it. The site requires VPN and in order to get VPN she needs to, you guessed it, type in a code sent to her phone. So you need a functioning phone to order a new phone. Basically, tech products are not very good. Software still kind of sucks and is not really improving. My Apple Mac is much worse now than it was 10 years ago. I still have trouble projecting stuff on a screen. I will never get into a self driving car made by any tech company. I’ll wait for Toyota to make one; my (Japanese) car always works (my Audi was terrible).

# Missing the trend

I have been fortunate to have been born at a time when I had the opportunity to witness the birth of several of the major innovations that shape our world today. I have also managed to miss out on capitalizing on every single one of them. You might make a lot of money betting against what I think.

I was a postdoctoral fellow in Boulder, Colorado in 1993 when my very tech savvy advisor John Cary introduced me and his research group to the first web browser Mosaic shortly after it was released. The web was the wild west in those days with just a smattering of primitive personal sites authored by early adopters. The business world had not discovered the internet yet. It was an unexplored world and people were still figuring out how to utilize it. I started to make a list of useful sites but unlike Jerry Yang and David Filo, who immediately thought of doing the same thing and forming a company, it did not remotely occur to me that this activity could be monetized. Even though I struggled to find a job in 1994, was fairly adept at programming, watched the rise of Yahoo! and the rest of the internet startups, and had friends at Stanford and Silicon Valley, it still did not occur to me that perhaps I could join in too.

Just months before impending unemployment, I managed to talk my way into being the first post doc of Jim Collins, who just started as a non-tenure track research assistant professor at Boston University. Midway through my time with Jim, we had a meeting with Charles Cantor, who was a professor at BU then, about creating engineered organisms that could eat oil. Jim subsequently recruited graduate student Tim Gardner, now CEO of Riffyn, to work on this idea. I thought we should create a genetic Hopfield network and I showed Tim how to use XPP to simulate the various models we came up with. However, my idea seemed too complicated to implement biologically so when I went to Switzerland to visit Wulfram Gerstner at the end of 1997, Tim and Jim, freed from my meddling influence, were able to create the genetic toggle switch and the field of synthetic biology was born.

I first learned about Bitcoin in 2009 and had even thought about mining some. However, I then heard an interview with one of the early developers, Gavin Andresen, and he failed to understand that because the supply of Bitcoins is finite, prices denominated in it would necessarily deflate over time. I was flabbergasted that he didn’t comprehend the basics of economics and was convinced that Bitcoin would eventually fail. Still, I could have mined thousands of Bitcoins on a laptop back then, which would be worth tens of millions today. I do think blockchains are an important innovation and my former post-bac fellow Wally Xie is even the CEO of the blockchain startup QChain. Although I do not know where cryptocurrencies and blockchains will be in a decade, I do know that I most likely won’t have a role.

I was in Pittsburgh during the late nineties/early 2000’s in one of the few places where neural networks/deep learning, still called connectionism, was king. Geoff Hinton had already left Carnegie Mellon for London by the time I arrived at Pitt but he was still revered in Pittsburgh and I met him in London when I visited UCL. I actually thought the field had great promise and even tried to lobby our math department to hire someone in machine learning for which I was summarily dismissed and mocked. I recruited Michael Buice to work on the path integral formulation for neural networks because I wanted to write down a neural network model that carried both rate and correlation information so I could implement a correlation based learning rule. Michael even proposed that we work on an algorithm to play Go but obviously I demurred. Although I missed out on this current wave of AI hype, and probably wouldn’t have made an impact anyway, this is the one area where I may get a second chance in the future.

# Jürgen Moser Lecture

The SIAM Jürgen Moser Lecture prize is now open for nominations.

# Technology and inference

In my previous post, I gave an example of how fake news could lead to a scenario of no update of posterior probabilities. However, this situation could occur just from the knowledge of technology. When I was a child, fantasy and science fiction movies always had a campy feel because the special effects were unrealistic looking. When Godzilla came out of Tokyo Harbour it looked like little models in a bathtub. The Creature from the Black Lagoon looked like a man in a rubber suit. I think the first science fiction movie that looked astonishingly real was Stanley Kubrick’s 1968 masterpiece 2001: A Space Odyssey, which adhered to physics like no others before and only a handful since. The simulation of weightlessness in space was marvelous and to me the ultimate attention to detail was the scene in the rotating space station where a mild curvature in the floor could be perceived. The next groundbreaking moment was the 1993 film Jurassic Park, which truly brought dinosaurs to life. The first scene of a giant sauropod eating from a tree top was astonishing. The distinction between fantasy and reality was forever gone.

The effect of this essentially perfect rendering of anything into a realistic image is that we now have a plausible reason to reject any evidence. Photographic evidence can be completely discounted because the technology exists to create completely fabricated versions. This is equally true of audio tapes and anything you read on the Internet. In Bayesian terms, we now have an internal model or likelihood function that any data could be false. The more cynical you are the closer this constant is to one. Once the likelihood becomes insensitive to data then we are in the same situation as before. Technology alone, in the absence of fake news, could lead to a world where no one ever changes their mind. The irony could be that this will force people to evaluate truth the way they did before such technology existed, which is that you believe people (or machines) that you trust through building relationships over long periods of time.

# Fake news and beliefs

Much has been written of the role of fake news in the US presidential election. While we will never know how much it actually contributed to the outcome, as I will show below, it could certainly affect people’s beliefs. Psychology experiments have found that humans often follow Bayesian inference: the probability we assign to an event or action is updated according to Bayes rule. For example, suppose $P(T)$ is the probability we assign to whether climate change is real; $P(F) = 1 - P(T)$ is our probability that climate change is false. In the Bayesian interpretation of probability, this would represent our level of belief in climate change. Given new data $D$ (e.g. news), we will update our beliefs according to

$P(T|D) = \frac{P(D|T) P(T)}{P(D)}$

What this means is that our posterior probability or belief that climate change is true given the new data, $P(T|D)$, is equal to the probability that the new data came from our internal model of a world with climate change, $P(D|T)$ (i.e. our likelihood), multiplied by our prior probability that climate change is real, $P(T)$, divided by the probability of obtaining such data in all possible worlds, $P(D)$. According to the rules of probability, the latter is given by $P(D) = P(D|T)P(T) + P(D|F)P(F)$, which is the sum of the probability the data came from a world with climate change and that from one without.

This update rule can reveal what will happen in the presence of new data including fake news. The first thing to notice is that if $P(T)$ is zero, then there is no update. In this binary case, this means that if we believe that climate change is absolutely false or true then no data will change our mind. In the case of multiple outcomes, any outcome with zero prior (has no support) will never change. So if we have very specific priors, fake news is not having an impact because no news is having an impact. If we have nonzero priors for both true and false then if the data is more likely from our true model then our posterior for true will increase and vice versa. Our posteriors will tend towards the direction of the data and thus fake news could have a real impact.

For example, suppose we have an internal model where we expect the mean annual temperature to be 10 degrees Celsius with a standard deviation of 3 degrees if there is no climate change and a mean of 13 degrees with climate change. Thus if the reported data is mostly centered around 13 degrees then our belief of climate change will increase and if it is mostly centered around 10 degrees then it will decrease. However, if we get data that is spread uniformly over a wide range then both models could be equally likely and we would get no update. Mathematically, this is expressed as follows: if $P(D|T) = P(D|F)$ then $P(D) = P(D|T)(P(T) + P(F)) = P(D|T)$. From the Bayesian update rule, the posterior will be identical to the prior. In a world of lots of misleading data, there is no update. Thus, obfuscation and sowing confusion are a very good strategy for preventing updates of priors. You don’t need to refute data, just provide fake examples and bury the data in a sea of noise.
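The temperature example can be run as a small Python sketch of the Bayesian update. Several things here are assumptions made for illustration: the likelihoods are taken to be Gaussian, the climate change model is given the same 3 degree standard deviation as the no-change model, each reported temperature is treated as independent, and the function names are invented.

```python
import math


def normal_pdf(x, mean, sd):
    """Gaussian density, used here as the assumed likelihood of one data point."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def update(prior_true, datum, mean_true=13.0, mean_false=10.0, sd=3.0):
    """One application of Bayes rule for the binary true/false hypothesis."""
    like_true = normal_pdf(datum, mean_true, sd)    # likelihood with climate change
    like_false = normal_pdf(datum, mean_false, sd)  # likelihood without
    evidence = like_true * prior_true + like_false * (1.0 - prior_true)
    return like_true * prior_true / evidence


def update_all(prior_true, data):
    """Apply the update sequentially, treating the data as independent."""
    belief = prior_true
    for datum in data:
        belief = update(belief, datum)
    return belief


print(update_all(0.5, [13.0, 12.5, 13.5]))      # data near 13: belief goes up
print(update_all(0.5, [10.0, 9.5, 10.5]))       # data near 10: belief goes down
print(update_all(0.0, [13.0, 13.0]))            # zero prior: never updates
print(update_all(0.5, [1.5, 21.5, 6.5, 16.5]))  # noise spread evenly around 11.5
```

In the last call the data are spread symmetrically about the midpoint of the two means, so the two models explain them equally well and the belief ends up back at the prior, which is the no-update case described above.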

# Revolution vs incremental change

I think that the dysfunction and animosity we currently see in the US political system and election is partly due to the underlying belief that meaningful change cannot be effected through slow evolution but rather requires an abrupt revolution where the current system is torn down and rebuilt. There is some merit to this idea. Sometimes the structure of a building can be so damaged that it would be easier to demolish and rebuild rather than repair and renovate. Mathematically, this can be expressed as a system being stuck in a local minimum (where getting to the global minimum is desired). In order to get to the true global optimum, you need to get worse before you can get better. When fitting nonlinear models to data, dealing with local minima is a major problem and the reason that a stochastic MCMC algorithm that does occasionally go uphill works so much better than gradient descent, which only goes downhill.

However, the recent success of deep learning may dispel this notion when the dimension is high enough. Deep learning, which is a multi-layer neural network that can have millions of parameters, is the quintessence of a high dimensional model. Yet, it seems to be able to work just fine using the back propagation algorithm, which is a form of gradient descent. The reason could be that in high enough dimensions, local minima are rare and the majority of critical points (places where the slope is zero) are high dimensional saddle points, where there is always a way out in some direction. In order to have a local minimum, the matrix of second derivatives in all directions (i.e. Hessian matrix) must be positive definite (i.e. have all positive eigenvalues). As the dimension of the matrix gets larger and larger there are simply more ways for one eigenvalue to be negative and that is all you need to provide an escape hatch. So in a high dimensional system, gradient descent may work just fine and there could be an interesting tradeoff between a parsimonious model with few parameters that is difficult to fit versus a high dimensional model that is easy to fit. Now the usual danger of having too many parameters is that you overfit and thus you fit the noise at the expense of the signal and have no ability to generalize. However, deep learning models seem to be able to overcome this limitation.
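The eigenvalue counting argument can be illustrated with a toy experiment in Python. This is a sketch under a strong assumption: the Hessian at a critical point is modeled as a random symmetric matrix with Gaussian entries, which real loss surfaces need not resemble. Positive definiteness is tested by attempting a Cholesky factorization, which succeeds only when every eigenvalue is positive. The function names are invented for the example.

```python
import math
import random


def is_positive_definite(a):
    """Return True if symmetric matrix `a` admits a Cholesky factorization."""
    n = len(a)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    return False  # a non-positive pivot means some eigenvalue is not positive
                chol[i][i] = math.sqrt(d)
            else:
                chol[i][j] = (a[i][j] - s) / chol[j][j]
    return True


def random_symmetric(n, rng):
    """Assumed toy Hessian: symmetrized matrix of independent Gaussian entries."""
    m = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    return [[0.5 * (m[i][j] + m[j][i]) for j in range(n)] for i in range(n)]


def fraction_minima(n, trials, rng):
    """Fraction of random critical points of dimension n that are local minima."""
    hits = sum(is_positive_definite(random_symmetric(n, rng)) for _ in range(trials))
    return hits / trials


rng = random.Random(1)  # fixed seed so the run is repeatable
fractions = {n: fraction_minima(n, 4000, rng) for n in (1, 2, 3, 4)}
for n, frac in fractions.items():
    print(n, frac)
```

In this toy model the fraction of critical points that are minima drops quickly as the dimension grows, so almost every critical point has at least one downhill direction.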

Hence, if the dimension is high enough evolution can work while if it is too low then you need a revolution. So the question is what is the dimensionality of governance and politics. In my opinion, the historical record suggests that revolutions generally do not lead to good outcomes and even when they do small incremental changes seem to get you to a similar place. For example, the US and France had bloody revolutions while Canada and England did not and they all have arrived at similar liberal democratic systems. In fact, one could argue that a constitutional monarchy (like Canada and Denmark), where the head of state is a figurehead, is more stable and benign than a republic, like Venezuela or Russia (e.g. see here). This distinction could have pertinence for the current US election if a group of well-meaning people, who believe that the two major parties do not have any meaningful difference, do not vote or vote for a third party. They should keep in mind that incremental change is possible and small policy differences can and do make a difference in people’s lives.