Revolution vs incremental change

I think that the dysfunction and animosity we currently see in the US political system and election is partly due to the underlying belief that meaningful change cannot be effected through slow evolution but rather requires an abrupt revolution where the current system is torn down and rebuilt. There is some merit to this idea. Sometimes the structure of a building can be so damaged that it would be easier to demolish and rebuild rather than repair and renovate. Mathematically, this can be expressed as a system being stuck in a local minimum (where getting to the global minimum is desired). In order to get to the true global optimum, you need to get worse before you can get better. When fitting nonlinear models to data, dealing with local minima is a major problem and the reason that a stochastic MCMC algorithm that does occasionally go uphill works so much better than gradient descent, which only goes downhill.

However, the recent success of deep learning may dispel this notion when the dimension is high enough. Deep learning, which is a multi-layer neural network that can have millions of parameters is the quintessence of a high dimensional model. Yet, it seems to be able to work just fine using the back propagation algorithm, which is a form of gradient descent. The reason could be that in high enough dimensions, local minima are rare and the majority of critical points (places where the slope is zero) are high dimensional saddle points, where there is always a way out in some direction. In order to have a local minimum, the matrix of second derivatives in all directions (i.e. Hessian matrix) must be positive definite (i.e. have all positive eigenvalues). As the dimension of the matrix gets larger and larger there are simply more ways for one eigenvalue to be negative and that is all you need to provide an escape hatch. So in a high dimensional system, gradient descent may work just fine and there could be an interesting tradeoff between a parsimonious model with few parameters but difficult to fit versus a high dimensional model that is easy to fit. Now the usual danger of having too many parameters is that you overfit and thus you fit the noise at the expense of the signal and have no ability to generalize. However, deep learning models seem to be able to overcome this limitation.

Hence, if the dimension is high enough evolution can work while if it is too low then you need a revolution. So the question is what is the dimensionality of governance and politics. In my opinion, the historical record suggests that revolutions generally do not lead to good outcomes and even when they do small incremental changes seem to get you to a similar place. For example, the US and France had bloody revolutions while Canada and the England did not and they all have arrived at similar liberal democratic systems. In fact, one could argue that a constitutional monarchy (like Canada and Denmark), where the head of state is a figure head is more stable and benign than a republic, like Venezuela or Russia (e.g. see here). This distinction could have pertinence for the current US election if a group of well-meaning people, who believe that the two major parties do not have any meaningful difference, do not vote or vote for a third party. They should keep in mind that incremental change is possible and small policy differences can and do make a difference in people’s lives.

Code platform update

It’s a week away from 2015, and I have transitioned completely away from Matlab. Julia is winning the platform attention battle. It is very easy to code in and it is very fast. I just haven’t gotten around to learning python much less pyDSTool (sorry Rob). I kind of find the syntax of Python (with all the periods between words) annoying. Wally Xie and I have also been trying to implement some of our MCMC runs in Stan but we have had trouble making it work.  Our model requires integrating ODEs and the ODE solutions from Stan (using our own solver) do not match our Julia code or the gold standard XPP. Maybe we are missing something obvious in the Stan syntax but our code is really very simple. Thus, we are going back to doing our Bayesian posterior estimates in Julia. However, I do plan to revisit Stan if they (or we) can write a debugger for it.

Big Data backlash

I predicted that there would be an eventual push back on Big Data and it seems that it has begun. Gary Marcus and Ernest Davis of NYU had an op-ed in the Times yesterday outlining nine issues with Big Data. I think one way to encapsulate many of the critiques is that you will never be able to do true prior free data modeling. The number of combinations in a data set grows as the factorial of the number of elements, which grows faster than an exponential. Hence, Moore’s law can never catch up. At some point, someone will need to exercise some judgement in which case Big Data is not really different from the ordinary data that we deal with all the time.

Paper on compressed sensing and genomics

New paper on the arXiv. The next step after the completion of the Human Genome Project, was the search for genes associated with diseases such as autism or diabetes. However, after spending hundreds of millions of dollars, we find that there are very few common variants of genes with large effects. This doesn’t mean that there aren’t genes with large effect. The growth hormone gene definitely has a large effect on height. It just means that variations of genes that are common among people have small effects on the phenotype. Given the results of Fisher, Wright, Haldane and colleagues, this was probably expected as the most likely scenario and recent results measuring narrow-sense heritability directly from genetic markers (e.g. see this) confirms this view.

Current GWAS microarrays consider about a million or two markers and this is increasing rapidly. Narrow-sense heritability refers to the additive or linear genetic variance, which means the phenotype is given by the linear model y= Z\beta + \eta, where y is the phenotype vector, Z is the genotype matrix, \beta are all the genetic effects we want to recover, and \eta are all the nonadditive components including environmental effects. This is a classic linear regression problem. The problem comes when the number of coefficients \beta far exceeds the number of people in your sample, which is the case for genomics. Compressed sensing is a field of high dimensional statistics that addresses this specific problem. People such as David Donoho, Emmanuel Candes and Terence Tao have proven under fairly general conditions that if the number of nonzero coefficients are sparse compared to the number samples, then the effects can be completely recovered using L1 penalized optimization algorithms such as the lasso or approximate message passing. In this paper, we show that these ideas can be applied to genomics.

Here is Steve Hsu’s summary of the paper

Application of compressed sensing to genome wide association studies and genomic selection

(Submitted on 8 Oct 2013)

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.

Comments: Main paper (27 pages, 4 figures) and Supplement (5 figures) combined
Subjects: Genomics (q-bio.GN); Applications (stat.AP)
Cite as: arXiv:1310.2264 [q-bio.GN]
(or arXiv:1310.2264v1 [q-bio.GN] for this version)

MCMC and fitting models to data

As I have posted before, I never learned any statistics during my education as a theoretical physicist/applied mathematician.  However, it became fairly apparent after I entered biology (although I managed to avoid it for a few years) that fitting models to data and estimating parameters is unavoidable.   As I have opined multiple times previously, Bayesian inference and the Markov Chain Monte Carlo (MCMC) method is the best way to do this.   I thought I would provide the details on how to implement the algorithm since most of the treatments on the subject are often couched in statistical language that may be hard to decipher for someone with a physics or dynamical systems background.  I’ll do this in a series of posts and in a nonstandard order.  Instead of building up the entire Bayesian formalism, I thought I would show how the MCMC method can be used to do the job and then later show how it fits into the grand scheme of things to do even more.

Suppose you have data D(t_i), which consists of some measurements (either a scalar or a vector) at discrete time points t_i and a proposed model, which produces a time series y(t | \theta),  where \theta represents a set of free parameters that changes the function.  This model could be a system of differential equations or  just some proposed function like a polynomial.    The goal is to find the set of parameters that best fits the data and to evaluate how good the model is.  Now, the correct way to do this is to use Bayesian inference and model comparison, which can be computed using the MCMC.   However, the MCMC can also be used just to get the parameters in the sense of finding the best fit according to some criterion.

Continue reading

Slides for two talks

I just gave a seminar in the math department at the University of Iowa today. I gave a talk that was similar to the one I gave at the Mathematical Biosciences Institute on Bayesian Inference for dynamical systems.  My slides are here.  I was lucky I made it to give this talk.  My flight  from Chicago O’Hare to Cedar Rapids, Iowa was cancelled yesterday and I was rebooked on another flight tonight, which wouldn’t have been of much use for my talk this afternoon.  There was one fight left last evening to Cedar Rapids but bad weather cancelled several flights yesterday so there were many stranded travelers.  I was number 38 on the standby list and thought I had no chance to make it out that night.  However, on a whim I decided to wait it out and I started to move up the list because other people had evidently given up as well.  To my great surprise and relief I was the last person to get on the plane.  There was a brief scare when they asked me to get off because we exceeded the weight limitation but then they changed their mind and let us all fly.  (Someone else had kindly volunteered to take my place).  I learned two lessons.  One is to keep close watch of your flight at all times so you can get on the standby list as soon as possible and two is that even number 38 on a plane that only seats 50 can still make it.

I also gave a talk at Northwestern University in March on the gene induction project that I described here.  My slides are here.

Dollar cost averaging

It is often recommended  that when investing money in a mutual fund or security to use a strategy called  dollar cost averaging.  This simply means you invest a fixed amount of money at periodic intervals rather than say buy a fixed number of shares at periodic intervals or invest as a lump sum.  The argument is that when the price of the fund is low you buy more shares and when it is high you buy fewer shares.  In this way you can ride out the fluctuations in the stock price.

Evaluating which strategy is best is a great example of how concepts in calculus and in particular Jensen’s inequality enter every day life and more reasons for why an average person should learn more math.  The question is how do the investment strategies compare over time. Let p_t be the price of the fund at time t.  In dollar cost averaging, you would invest s dollars at periodic intervals (say each week).  So the number of shares you would buy each week is n_t=s/p_t.  Over a time period T you will have purchased \sum_t s/p_t = s T *(1/T)\sum_t 1/p_t  shares, which can be rewritten as sT E[1/p_t], where E is the expectation value or average.  Since you have invested s T dollars, you have paid \tilde p_t=1/E[1/p_t] per share on average, which is also called the harmonic mean.

If on the other hand you decided to buy a fixed number of shares n per week  then it would cost n p_t.   Over the time period T you will have spend \sum_t n p_t and have n T shares.  Thus you have paid \bar{p_t}= E[p_t] on average.    Now the question is how does the mean \bar{p_t} compare to the harmonic mean \tilde p_t.  This is where Jensen’s inequality comes in, which says that for a convex function f(x)E[f(x)]\ge f(E[x]).  A convex function is a function where a straight line joining any two points on the function is greater than or equal to the function.  Since 1/x is a convex function, this means that E[1/p_t] \ge 1/E[p_t] which can be rearrange to show that \bar{p}\ge \tilde{p}.  So dollar cost averaging is always better than or as good as buying the same number of shares each week.  Now of course if you buy on a day when the price is below the harmonic mean of a time horizon you are looking at you will always do better .