The machine learning president

For the past four years, I have been unable to post with any regularity. I have dozens of unfinished posts sitting in my drafts folder. I would start with a thought but then get stuck, which was previously unusual for me. Now, on the first hopeful day I have had in four trying years, I hope I will be able to post regularly again.

Prior to what I will now call the Dark Years, I viewed all of history through an economic lens. I bought into the standard twentieth-century leftist academic notion that wars, conflicts, social movements, and cultural changes all have economic underpinnings. But I now realize that this is incorrect, or at least incomplete. Economics surely plays a role in history, but what really motivates people are stories. Stories are what led us into the Dark Years, and perhaps stories are what will lead us out.

Trump became president because he had a story. The insurrectionists who stormed the Capitol had a story. It was a batshit crazy lunatic story, but it was still a story. However, the tragic thing about the Trump story (or rather my story of the Trump story) is that it is an unintentional, algorithmically generated story. Trump is the first (and probably not the last) purely machine learning president, although he may not consciously know it. Everything he did was based on the feedback he got from his tweets and from Fox News. His objective function was attention, and he would do anything to get more of it. Of the many lessons we will take from the Dark Years, one should be how machine learning and artificial intelligence can go so very wrong. Trump's candidacy and presidency were based on a simple stochastic greedy algorithm for attention. He would tweet randomly and follow up on the tweets that got the most attention. The problem with a greedy algorithm (and yes, that is a technical term that just happens to be apropos) is that once you follow a path, it is hard to make a correction. I actually believe that if some of Trump's earliest tweets, from say 2009-2014, had gone another way, he could have been a different president. Unfortunately, one of his early tweet themes that garnered a lot of attention was the Obama birther conspiracy. This lit up both racist Twitter and a counter-reaction from liberal Twitter, which pushed him further to the right and ultimately to the presidency. His innate prejudices biased him towards a darker path, and he did go completely unhinged after he lost the election, but he is unprincipled and immature enough that he could have changed course had he had enough incentive to do so.

Unlike standard machine learning for categorizing images or translating languages, the Trump machine learning algorithm changes the data. Every tweet alters the audience, and the reinforcing feedback between Trump's tweets and the reactions to them can manufacture discontent out of nothing. A person could happen to follow Trump because they liked The Apprentice, the reality show Trump starred in, and be having a bad day because they missed the bus or didn't get a promotion. Then they see a Trump tweet, follow the link in it, and suddenly they find a conspiracy theory that "explains" why they feel disenchanted. They retweet and the cycle repeats. Trump sees what goes viral and tweets more on the same topic. This positive feedback loop generates something out of random noise. The conspiracy theorizing then starts its own reinforcing feedback loop, and before you know it we have a crazed mob bashing down the Capitol doors with impunity.
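For fun, here is a toy simulation of this kind of attention-greedy feedback loop. Everything in it is invented for illustration (the topics, the numbers, the feedback rule); it is a cartoon of a stochastic greedy algorithm with audience feedback, not a model of anything real:

```python
import random

random.seed(1)

# Topics start with equal appeal. Each post on a topic gets engagement
# proportional to the topic's current appeal plus noise, and that engagement
# feeds back to increase the topic's appeal (the audience shifts).
appeal = {"policy": 1.0, "conspiracy": 1.0, "sports": 1.0}
counts = {t: 0 for t in appeal}

for step in range(1000):
    if step < 50:                      # early phase: post randomly
        topic = random.choice(list(appeal))
    else:                              # later: greedily follow attention
        topic = max(appeal, key=appeal.get)
    engagement = appeal[topic] * random.uniform(0.5, 1.5)
    appeal[topic] += 0.1 * engagement  # positive feedback loop
    counts[topic] += 1

print(counts)  # one topic ends up dominating, decided largely by early noise
```

The point of the cartoon: whichever topic happens to pull ahead during the random early phase gets locked in by the feedback, which is exactly the path-dependence of a greedy algorithm.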

Ironically, Trump, who craved and idolized power, failed to understand the power he actually had. If he had had a better algorithm (or just any strategy at all), he would have been reelected in a landslide. Even before he was elected, Trump had already won over the far right and could have started moving in any direction he wished. He could have moderated on many issues. Even maintaining his absolute ignorance of how governing actually works, he could have had his wall by making it part of actual infrastructure and immigration bills. He could have directly addressed the COVID-19 pandemic. He would not have lost much of his base and would have easily gained an extra 10 million votes. Maybe, just maybe, if liberal Twitter had simply ignored the early incendiary tweets and only responded to the more productive ones, they could have moved him a bit too. Positive reinforcement is how they train animals, after all.

Now that Trump has shown how machine learning can win a presidency, it is only a matter of time before someone harnesses it again and more effectively. I just hope that person is not another narcissistic sociopath.

Revolution vs incremental change

I think that the dysfunction and animosity we currently see in the US political system and election is partly due to the underlying belief that meaningful change cannot be effected through slow evolution but rather requires an abrupt revolution, where the current system is torn down and rebuilt. There is some merit to this idea. Sometimes the structure of a building is so damaged that it is easier to demolish and rebuild than to repair and renovate. Mathematically, this can be expressed as a system being stuck in a local minimum when the global minimum is desired. In order to get to the true global optimum, you need to get worse before you can get better. When fitting nonlinear models to data, dealing with local minima is a major problem, and it is the reason that a stochastic MCMC algorithm, which does occasionally go uphill, works so much better than gradient descent, which only goes downhill.
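A minimal one-dimensional illustration (the double-well function and the temperature are arbitrary choices of mine): gradient descent started in the shallow basin stays there, while a Metropolis walker that occasionally accepts uphill moves finds the deeper minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Double-well function: local minimum near x = 0.62, global minimum near x = -0.78.
f = lambda x: x**4 - x**2 + 0.3 * x
df = lambda x: 4 * x**3 - 2 * x + 0.3

# Gradient descent only goes downhill, so it stays in the basin it starts in.
x_gd = 1.0
for _ in range(1000):
    x_gd -= 0.01 * df(x_gd)
print(round(x_gd, 2))       # stuck near the shallow local minimum

# Metropolis: accept uphill moves with probability exp(-(increase in f)/T).
x, fx = 1.0, f(1.0)
samples = []
for _ in range(20000):
    prop = x + 0.5 * rng.standard_normal()
    fp = f(prop)
    if fp < fx or rng.random() < np.exp(-(fp - fx) / 0.1):
        x, fx = prop, fp
    samples.append(x)
x_best = min(samples, key=f)
print(round(x_best, 2))     # near the global minimum
```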

However, the recent success of deep learning may dispel this notion when the dimension is high enough. Deep learning, a multi-layer neural network that can have millions of parameters, is the quintessence of a high dimensional model. Yet it seems to work just fine using the backpropagation algorithm, which is a form of gradient descent. The reason could be that in high enough dimensions local minima are rare, and the majority of critical points (places where the slope is zero) are saddle points, from which there is always a way out in some direction. In order to have a local minimum, the matrix of second derivatives in all directions (i.e. the Hessian matrix) must be positive definite (i.e. have all positive eigenvalues). As the dimension of the matrix gets larger, there are simply more ways for one eigenvalue to be negative, and one negative eigenvalue is all you need for an escape hatch. So in a high dimensional system gradient descent may work just fine, and there could be an interesting tradeoff between a parsimonious model that has few parameters but is difficult to fit and a high dimensional model that is easy to fit. The usual danger of having too many parameters is that you overfit: you fit the noise at the expense of the signal and lose the ability to generalize. However, deep learning models seem to be able to overcome this limitation.
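The eigenvalue argument is easy to check numerically. Here is a quick sketch, using random symmetric Gaussian matrices as stand-ins for Hessians at random critical points (which is of course a strong assumption, but it captures the counting argument):

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_positive_definite(n, trials=2000):
    """Fraction of random symmetric Gaussian matrices with all eigenvalues positive."""
    hits = 0
    for _ in range(trials):
        a = rng.standard_normal((n, n))
        h = (a + a.T) / 2                      # a random symmetric "Hessian"
        if np.all(np.linalg.eigvalsh(h) > 0):  # positive definite = local minimum
            hits += 1
    return hits / trials

for n in [1, 2, 4, 8]:
    print(n, frac_positive_definite(n))
# The fraction collapses rapidly with dimension:
# almost all critical points are saddles.
```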

Hence, if the dimension is high enough, evolution can work, while if it is too low you need a revolution. So the question is: what is the dimensionality of governance and politics? In my opinion, the historical record suggests that revolutions generally do not lead to good outcomes, and even when they do, small incremental changes seem to get you to a similar place. For example, the US and France had bloody revolutions while Canada and England did not, and they have all arrived at similar liberal democratic systems. In fact, one could argue that a constitutional monarchy (like Canada or Denmark), where the head of state is a figurehead, is more stable and benign than a republic like Venezuela or Russia (e.g. see here). This distinction could have pertinence for the current US election if a group of well-meaning people, who believe that the two major parties have no meaningful differences, do not vote or vote for a third party. They should keep in mind that incremental change is possible and that small policy differences can and do make a difference in people's lives.

Code platform update

It’s a week away from 2015, and I have transitioned completely away from Matlab. Julia is winning the platform attention battle. It is very easy to code in and it is very fast. I just haven’t gotten around to learning python much less pyDSTool (sorry Rob). I kind of find the syntax of Python (with all the periods between words) annoying. Wally Xie and I have also been trying to implement some of our MCMC runs in Stan but we have had trouble making it work.  Our model requires integrating ODEs and the ODE solutions from Stan (using our own solver) do not match our Julia code or the gold standard XPP. Maybe we are missing something obvious in the Stan syntax but our code is really very simple. Thus, we are going back to doing our Bayesian posterior estimates in Julia. However, I do plan to revisit Stan if they (or we) can write a debugger for it.

Big Data backlash

I predicted that there would be an eventual push back against Big Data, and it seems that it has begun. Gary Marcus and Ernest Davis of NYU had an op-ed in the Times yesterday outlining nine issues with Big Data. I think one way to encapsulate many of the critiques is that you will never be able to do truly prior-free data modeling. The number of combinations in a data set grows as the factorial of the number of elements, which grows faster than any exponential. Hence, Moore's law can never catch up. At some point, someone will need to exercise some judgement, in which case Big Data is not really different from the ordinary data that we deal with all the time.
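The arithmetic behind the factorial claim is quick to check. A sketch, assuming for illustration a Moore's-law doubling every two years:

```python
import math

# Number of orderings of n elements vs the Moore's-law doublings
# (one per ~2 years) needed just to enumerate them.
for n in [10, 20, 40, 80]:
    doublings = math.log2(math.factorial(n))
    print(f"n={n}: {n}! needs {doublings:.0f} doublings, "
          f"~{2 * doublings:.0f} years of Moore's law")
```

Even for n = 40 elements, the number of orderings already requires more than a century and a half of doublings, and the gap only widens with n.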

Paper on compressed sensing and genomics

New paper on the arXiv. The next step after the completion of the Human Genome Project was the search for genes associated with diseases such as autism or diabetes. However, after spending hundreds of millions of dollars, we find that there are very few common variants of genes with large effects. This doesn't mean that there aren't genes with large effects; the growth hormone gene definitely has a large effect on height. It just means that variations of genes that are common among people have small effects on the phenotype. Given the results of Fisher, Wright, Haldane and colleagues, this was probably to be expected as the most likely scenario, and recent results measuring narrow-sense heritability directly from genetic markers (e.g. see this) confirm this view.

Current GWAS microarrays consider about one to two million markers, and this number is increasing rapidly. Narrow-sense heritability refers to the additive or linear genetic variance, which means the phenotype is given by the linear model y = Z\beta + \eta, where y is the phenotype vector, Z is the genotype matrix, \beta contains all the genetic effects we want to recover, and \eta contains all the nonadditive components, including environmental effects. This is a classic linear regression problem. The problem comes when the number of coefficients \beta far exceeds the number of people in your sample, which is the case in genomics. Compressed sensing is a field of high dimensional statistics that addresses this specific problem. People such as David Donoho, Emmanuel Candes and Terence Tao have proven, under fairly general conditions, that if the number of nonzero coefficients is sparse compared to the number of samples, then the effects can be completely recovered using L1-penalized optimization algorithms such as the lasso or approximate message passing. In this paper, we show that these ideas can be applied to genomics.
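To make the setup concrete, here is a toy sketch of sparse recovery in the regime where markers far outnumber samples. This is not the paper's code; it is a minimal lasso solved by iterative soft-thresholding (ISTA), with dimensions and effect sizes I made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy GWAS-like setup: p markers, n people, s true nonzero effects (p >> n).
n, p, s = 200, 1000, 10
Z = rng.standard_normal((n, p)) / np.sqrt(n)       # normalized genotype-like matrix
beta_true = np.zeros(p)
support = rng.choice(p, s, replace=False)
beta_true[support] = 3.0 * rng.choice([-1, 1], s)  # strong sparse effects
y = Z @ beta_true + 0.01 * rng.standard_normal(n)  # y = Z beta + eta

def ista(Z, y, lam=0.01, steps=2000):
    """Minimize ||y - Z b||^2 / 2 + lam * ||b||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(Z, 2) ** 2                  # Lipschitz constant of the gradient
    b = np.zeros(Z.shape[1])
    for _ in range(steps):
        b = b - Z.T @ (Z @ b - y) / L              # gradient step on the quadratic
        b = np.sign(b) * np.maximum(np.abs(b) - lam / L, 0)  # soft threshold
    return b

beta_hat = ista(Z, y)
top = np.argsort(-np.abs(beta_hat))[:s]            # the s largest estimated effects
print(sorted(top) == sorted(support))              # check they sit on the true support
```

With 10 nonzero effects and 200 samples, the sparsity condition is comfortably satisfied, so the L1 solution lands on the true loci even though the regression is wildly underdetermined.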

Here is Steve Hsu's summary of the paper:

Application of compressed sensing to genome wide association studies and genomic selection

(Submitted on 8 Oct 2013)

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h^2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h^2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.

Comments: Main paper (27 pages, 4 figures) and Supplement (5 figures) combined
Subjects: Genomics (q-bio.GN); Applications (stat.AP)
Cite as: arXiv:1310.2264 [q-bio.GN]
(or arXiv:1310.2264v1 [q-bio.GN] for this version)

MCMC and fitting models to data

As I have posted before, I never learned any statistics during my education as a theoretical physicist/applied mathematician.  However, it became fairly apparent after I entered biology (although I managed to avoid it for a few years) that fitting models to data and estimating parameters is unavoidable.   As I have opined multiple times previously, Bayesian inference and the Markov Chain Monte Carlo (MCMC) method is the best way to do this.   I thought I would provide the details on how to implement the algorithm since most of the treatments on the subject are often couched in statistical language that may be hard to decipher for someone with a physics or dynamical systems background.  I’ll do this in a series of posts and in a nonstandard order.  Instead of building up the entire Bayesian formalism, I thought I would show how the MCMC method can be used to do the job and then later show how it fits into the grand scheme of things to do even more.

Suppose you have data D(t_i), which consists of some measurements (either a scalar or a vector) at discrete time points t_i and a proposed model, which produces a time series y(t | \theta),  where \theta represents a set of free parameters that changes the function.  This model could be a system of differential equations or  just some proposed function like a polynomial.    The goal is to find the set of parameters that best fits the data and to evaluate how good the model is.  Now, the correct way to do this is to use Bayesian inference and model comparison, which can be computed using the MCMC.   However, the MCMC can also be used just to get the parameters in the sense of finding the best fit according to some criterion.
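To preview where the series is going, here is a minimal Metropolis MCMC sketch for a made-up example: fitting \theta = (a, b) in y(t|\theta) = a e^{-bt} to noisy data, assuming Gaussian errors so the likelihood is proportional to e^{-\chi^2/2}. The model, the data, and all tuning numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data D(t_i) from y(t) = a * exp(-b t) with true (a, b) = (2, 0.5).
t = np.linspace(0, 5, 20)
sigma = 0.05
data = 2.0 * np.exp(-0.5 * t) + sigma * rng.standard_normal(t.size)

def chi2(theta):
    a, b = theta
    return np.sum((data - a * np.exp(-b * t)) ** 2) / sigma**2

# Metropolis: random-walk proposals; accept downhill moves always and uphill
# moves with probability exp(-(increase in chi^2)/2).
theta = np.array([1.0, 1.0])        # initial guess, deliberately wrong
c = chi2(theta)
chain = []
for _ in range(20000):
    prop = theta + 0.02 * rng.standard_normal(2)
    cp = chi2(prop)
    if cp < c or rng.random() < np.exp(-(cp - c) / 2):
        theta, c = prop, cp
    chain.append(theta)

posterior = np.array(chain[10000:]) # discard the first half as burn-in
print(posterior.mean(axis=0))       # posterior means, close to (2, 0.5)
```

The same few lines contain all the moving parts the later posts will elaborate: a proposal distribution, an acceptance rule, burn-in, and posterior summaries from the retained samples.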


Slides for two talks

I just gave a seminar in the math department at the University of Iowa today. The talk was similar to the one I gave at the Mathematical Biosciences Institute on Bayesian inference for dynamical systems. My slides are here. I was lucky I made it to give this talk. My flight from Chicago O'Hare to Cedar Rapids, Iowa was cancelled yesterday and I was rebooked on another flight tonight, which wouldn't have been of much use for my talk this afternoon. There was one flight left last evening to Cedar Rapids, but bad weather had cancelled several flights yesterday, so there were many stranded travelers. I was number 38 on the standby list and thought I had no chance of making it out that night. However, on a whim I decided to wait it out, and I started to move up the list because other people had evidently given up as well. To my great surprise and relief, I was the last person to get on the plane. There was a brief scare when they asked me to get off because we exceeded the weight limit, but then they changed their minds and let us all fly. (Someone else had kindly volunteered to take my place.) I learned two lessons: one, keep close watch on your flight at all times so you can get on the standby list as soon as possible; and two, even number 38 on a plane that seats only 50 can still make it.

I also gave a talk at Northwestern University in March on the gene induction project that I described here.  My slides are here.

Dollar cost averaging

It is often recommended that when investing money in a mutual fund or security you use a strategy called dollar cost averaging. This simply means that you invest a fixed amount of money at periodic intervals, rather than, say, buying a fixed number of shares at periodic intervals or investing a lump sum. The argument is that when the price of the fund is low you buy more shares, and when it is high you buy fewer shares. In this way you can ride out the fluctuations in the stock price.

Evaluating which strategy is best is a great example of how concepts in calculus, and in particular Jensen's inequality, enter everyday life, and one more reason why an average person should learn more math. The question is how the investment strategies compare over time. Let p_t be the price of the fund at time t. In dollar cost averaging, you invest s dollars at periodic intervals (say, each week). So the number of shares you buy each week is n_t = s/p_t. Over a time period T you will have purchased \sum_t s/p_t = sT (1/T)\sum_t 1/p_t shares, which can be rewritten as sT E[1/p_t], where E is the expectation value or average. Since you have invested sT dollars, you have paid \tilde p_t = 1/E[1/p_t] per share on average, which is also called the harmonic mean.

If on the other hand you decided to buy a fixed number of shares n per week, then each week would cost n p_t. Over the time period T you will have spent \sum_t n p_t and have nT shares. Thus you have paid \bar{p_t} = E[p_t] per share on average. Now the question is how the mean \bar{p_t} compares to the harmonic mean \tilde p_t. This is where Jensen's inequality comes in, which says that for a convex function f(x), E[f(x)] \ge f(E[x]). A convex function is one for which a straight line joining any two points on its graph lies on or above the function. Since 1/x is a convex function, E[1/p_t] \ge 1/E[p_t], which can be rearranged to show that \bar{p_t} \ge \tilde{p_t}. So dollar cost averaging is always better than or as good as buying the same number of shares each week. Of course, if you buy on a day when the price is below the harmonic mean over the time horizon you are considering, you will always do better.
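A quick numerical check of the argument, with hypothetical prices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weekly fund prices over a year.
p = rng.uniform(50.0, 150.0, size=52)

mean_price = p.mean()               # avg price/share buying n shares each week
harmonic_mean = 1 / np.mean(1 / p)  # avg price/share investing s dollars each week

print(round(mean_price, 2), round(harmonic_mean, 2))
# Jensen's inequality (1/x is convex) guarantees harmonic_mean <= mean_price,
# so dollar cost averaging never pays more per share.
```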

Connecting the dots

As I posted previously, I am highly skeptical that any foolproof system can be developed to screen for potential threats.  My previous argument was that in order to have a zero false negative rate for identifying terrorists, it would be impossible not to also have a significant false positive rate.  In other words, the only way to guarantee that a terrorist doesn't board a plane is to not let anyone board.  A way to visualize this graphically is with what is called a receiver operating characteristic or ROC curve, which is a plot of the true positive rate versus the false positive rate of a binary classification test as some parameter, usually a discrimination threshold, is varied.  Ideally, one would like a curve that jumps to a true positive rate of 1 at zero false positive rate.  The area under the ROC curve (AROC) is the usual measure of how good a discriminator is, so a perfect discriminator has AROC = 1.  In my experience with biological systems, it is pretty difficult to make a test with an AROC greater than 90%.  Additionally, ROC curves are usually somewhat smooth, so they only reach a true positive rate of 1 at a false positive rate of 1.
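As an illustration, here is a quick computation of a ROC curve and its area for a hypothetical score-based screener whose threat and non-threat score distributions overlap (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Scores assigned by a hypothetical screener; the two populations overlap.
neg = rng.normal(0.0, 1.0, 2000)    # scores for innocuous cases
pos = rng.normal(1.5, 1.0, 2000)    # scores for true threats

# Sweep the discrimination threshold to trace out the ROC curve.
for th in [0.0, 1.0, 2.0]:
    tpr = (pos > th).mean()          # true positive rate at this threshold
    fpr = (neg > th).mean()          # false positive rate at this threshold
    print(f"threshold {th}: tpr={tpr:.2f}, fpr={fpr:.2f}")

# AROC equals the probability a random positive outscores a random negative.
aroc = (pos[:, None] > neg[None, :]).mean()
print(f"AROC = {aroc:.2f}")
```

Even with a respectable AROC, the only way to push the true positive rate to 1 is to slide the threshold down into the bulk of the innocuous scores, which is the false-positive tradeoff in the argument above.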

Practicalities aside, is there any mathematical reason why a perfect or near-perfect discriminator couldn't be designed?  This to me is the more interesting question.  My guess is that deciding whether a person is a terrorist is an NP-hard problem, which is why it is so insidious.  For any problem in NP, it is simple to verify an answer but hard to find one.  Connecting all the dots to show that someone is a terrorist is a straightforward matter if you already know that they are one.  The same is true of proving the Riemann Hypothesis or solving the 3D Ising model: the solution is obvious if you know the answer.  If terrorist finding is NP-hard, then for a large enough population (and I think 5 billion qualifies) no method and no achievable amount of computational power is sufficient to do the job perfectly.

Nonlinear saturation

The classic Malthus argument is that populations grow exponentially until they are nonlinearly suppressed by famine or disease due to overcrowding.  However, the lesson of the twentieth century is that populations can be checked for other reasons as well.  This is not necessarily a refutation of Malthus per se, but rather a recognition that the quantity limiting populations need not be restricted to food or health.  There seems to be a threshold of economic prosperity beyond which family income or personal freedom becomes the rate-limiting step for a bigger family.  Developed nations such as Japan and Germany are approaching zero population growth and trending towards negative growth.  Russia currently has negative growth.

Hence, we can slow down population growth by increasing economic growth.  China is starting to see a very steep decline in population growth in the big cities like Shanghai that is independent of the one child policy.  The emerging middle class is now taking into account the cost of raising a child and how it would affect their lifestyle.  In a very poor country, the cost of raising a child is not really an issue.  In fact, if the probability of a child making it to adulthood is low and help from children is the only way for the elderly to survive then it is logical to have as many children as possible.  In this case, the classic Malthus argument with food and health (and aid) as the rate limiting quantities applies.

I bring this point up now because it is crucial for the current debate about what to do about climate change.  One way of mitigating human impact on the environment is to slow down population growth.  However, the most humane and effective method of doing that is to increase economic growth, which will then lead to an increase in emissions.  For example, in 2005 the US produced 23.5 tonnes of CO2 equivalents per person (which, incidentally, is not the highest in the world and less than half that of world leader Qatar), while China produced about 5.5 tonnes and Niger 0.1 tonnes.  (This does not account for the extra emissions due to changes in land use.)  In absolute terms, China already produces more greenhouse gases than the US, and India is not far behind.  On the other hand, the population growth rate of Niger is 3.6%, India's is 1.5% and dropping, China's is 0.6%, and the US's is 1%.  So, when we increase economic prosperity, we can reduce population growth and presumably suffering, but we will also increase greenhouse gas emissions.

Given that the world economy and agricultural system are entirely based on fossil fuels, it is also true that, at least in the short term, a restriction of carbon emissions will slow or reduce economic growth.  Thus, even though climate change could have a catastrophic outcome for the future, curbing economic growth could also be bad.  For a developing nation, and even some developed nations, the choice may not be so clear cut.  It is thus no wonder that the current debate in Copenhagen is so contentious.  I think that unless the developed world can demonstrate that viable economic growth and prosperity can be achieved with reduced carbon emissions, the rest of the world and many people will remain skeptical.  I don't think the leaders in the climate change community realize that this skepticism about carbon restrictions may not be entirely irrational.

Darts and Diophantine equations

Darts is a popular spectator sport in the UK. I had access to cable television recently, so I was able to watch a few games.  What I find interesting about professional darts is that the players must solve a Diophantine equation to win.  For those who know nothing of the game, it involves throwing a small pointed projectile at an enumerated circular target board.


A dart that lands in a given sector of the board earns that score.  The center circle of the board, called the bullseye, is worth 50 points.  The ring around the bullseye is worth 25 points.  The wedges are worth the score given by the number on the perimeter.  However, if you land in the inner ring you get triple the score of the wedge, and if you land in the outer ring you get double the score.  Hence, the maximum number of points for one dart is the triple twenty, worth 60 points.
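The checkout problem — finishing a leg, which under standard rules requires the last dart to land on a double (the bullseye counts as the double of 25) — can be brute-forced. A sketch of the search space:

```python
from itertools import product

# Every score a single dart can achieve, tagged with its multiplier.
singles = list(range(1, 21)) + [25]                  # wedges plus the outer bull
shots = [(s, 1) for s in singles]                    # plain sector
shots += [(2 * s, 2) for s in singles]               # doubles (2 * 25 = bullseye)
shots += [(3 * s, 3) for s in range(1, 21)]          # trebles (no treble bull)

def checkouts(remaining):
    """All three-dart combinations summing to `remaining`, last dart a double."""
    outs = []
    for a, b, c in product(shots, repeat=3):
        if a[0] + b[0] + c[0] == remaining and c[1] == 2:
            outs.append((a[0], b[0], c[0]))
    return outs

print(max(s for s, _ in shots))   # 60: the treble twenty
print(checkouts(170))             # the famous maximum checkout: T20, T20, bullseye
```

Solving for which dart values hit a required total exactly, subject to the double-out constraint, is precisely the small Diophantine problem the players solve in their heads between throws.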


The NP economy

I used to believe that one day all human labour would be automated (e.g. see here).  Upon further reflection, I realize that I was wrong.  Whether machines will someday replace all humans depends crucially on whether P is equal to NP.  The jobs that will eventually be automated are the ones that can be solved easily with an algorithm.  In computer science parlance, these are problems in the computational complexity class P (solvable in polynomial time).  For example, traditional travel agents have disappeared faster than bluefin tuna because their task is simple to automate.  However, not all travel agents will disappear.  The ones that survive will be more like concierges who put together complex travel arrangements or negotiate with many parties.

Eventually, the jobs that humans hold (barring a collapse of civilization as we know it) will involve solving problems in the complexity class NP or harder.  That is not to say that machines won't be doing some of these jobs, only that the advantage of machines over humans will not be as clear cut.  It is true that if we could fully reproduce a human and make it faster and bigger, it could do everything a human does, only better; but as I blogged about before, I think it will be difficult to exactly reproduce humans.  Additionally, for some very hard problems that don't have any good approximation schemes, blind luck will play an important role in coming up with solutions.  Balancing different human-centric priorities will also be important, and that may be best left for humans to do.  Even if it turns out that P=NP, there could still be some jobs that humans can do, like working on undecidable problems.

So what are some jobs that will be around in the NP economy?  Well, I think mathematicians will still be employed.  Theorems can be verified in polynomial time, but there are no known algorithms in P to generate them.  That is not to say that there won't be robot mathematicians, and mathematicians will certainly use automated theorem proving programs to help them (e.g. see here).  However, I think the human touch will always have some use.  Artists and comedians will also have jobs in the future.  These are professions that require intimate knowledge of what it is like to be human.  Again, there will be machine comics and artists, but they won't fully replace humans.  I also think that craftsmen like carpenters, stone masons, basket weavers and so forth could make a comeback.  They will have to exhibit some exceptional artistry to survive, but the demand for them could increase, since some people will always long for the human touch in their furniture and houses.

The question then is whether or not there will be enough NP jobs to go around and whether or not everyone is able and willing to hold one.  To some, an NP economy will be almost Utopian – everyone will have interesting jobs.    However, there may be some people who simply don’t want or can’t do an NP job.   What will happen to them?  I think that will be a big (probably undecidable) problem that will face society in the not too distant future, provided we make it that far.

Incentives in health care

The American health care system relies on a "fee for service" model, in which physicians are reimbursed for the procedures they perform.  I think this is a perfect example of how organizational structure, and in particular incentives, can affect outcomes.  Free market proponents argue that the only system that can optimally distribute goods and services is a free market.  I tangentially posted on efficient markets a short time ago.  However, even with a free market, the rules of the game determine what it means to win.  For example, when physicians are reimbursed for procedures, it makes sense for them to perform as many procedures as possible.  If the choice is between an inexpensive therapy and an expensive one and there is no clear cut evidence for the benefit of either, then why choose the inexpensive one?  A provocative article in the New Yorker by Atul Gawande shows what can happen when this line of thought is taken to the extreme.  Another unintended consequence of the fee for service model may be that there is no incentive to recruit individuals for clinical studies, as detailed in this article by Gina Kolata in the New York Times.  The interesting thing about both of these examples is that they are independent of whether health insurance is private, public or single payer; Gawande's article was mostly about Medicare, which is government run.  An alternative to fee for service is "fee for outcome", where physicians are rewarded for having healthier patients.  Gawande favours the Mayo Clinic model, where the physicians have a fixed salary and focus on maximizing patient care.  There must be a host of other possible compensation models, which I'm sure economists have explored.  However, perhaps this is also a (critically important) problem where ideas from physics and applied math might be useful.

Waiting in airports

I was visiting the University of Utah this past week.  I gave talks on the Kinetic Theory of Coupled Oscillators and on Deriving Moment Equations for Neural Networks.  On my way to the airport, I wondered what the optimal arrival time would be so that you spend the least amount of time waiting in the airport, balanced against the cost of missing a flight.  If you make some basic assumptions, it's not too hard to derive a condition for the optimum.  Let's say the only thing we're concerned about is minimizing wasted time.  Then what we want to do is balance the average time waiting in airports against the average time lost making up for a missed flight.
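Before the derivation, here is a numerical sketch of the tradeoff under assumptions I am inventing purely for illustration (a gamma-distributed time to reach the gate, and a fixed time penalty for missing the flight):

```python
import numpy as np

rng = np.random.default_rng(0)

# Getting to the gate takes a random time X; if you arrive T minutes before
# departure you wait T - X, and if X > T you miss the flight and lose M minutes.
M = 300.0                                       # minutes lost to a missed flight
X = rng.gamma(shape=9, scale=5, size=100_000)   # time to reach gate, mean 45 min

def expected_waste(T):
    wait = np.where(X <= T, T - X, M)   # wait if you make it, pay M if you don't
    return wait.mean()

Ts = np.arange(30, 181)
best = min(Ts, key=expected_waste)
print(best)   # comfortably later than the mean travel time, well before paranoia
```

The Monte Carlo optimum sits where an extra minute of guaranteed waiting exactly buys an M-weighted reduction in miss probability, which is the balance condition the post derives.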
