Code platform update

It’s a week away from 2015, and I have transitioned completely away from Matlab. Julia is winning the platform attention battle. It is very easy to code in and it is very fast. I just haven’t gotten around to learning Python, much less pyDSTool (sorry Rob). I kind of find the syntax of Python (with all the periods between words) annoying. Wally Xie and I have also been trying to implement some of our MCMC runs in Stan, but we have had trouble making it work. Our model requires integrating ODEs, and the ODE solutions from Stan (using our own solver) do not match our Julia code or the gold standard XPP. Maybe we are missing something obvious in the Stan syntax, but our code is really very simple. Thus, we are going back to doing our Bayesian posterior estimates in Julia. However, I do plan to revisit Stan if they (or we) can write a debugger for it.

Big Data backlash

I predicted that there would be an eventual pushback against Big Data, and it seems that it has begun. Gary Marcus and Ernest Davis of NYU had an op-ed in the Times yesterday outlining nine issues with Big Data. I think one way to encapsulate many of the critiques is that you will never be able to do truly prior-free data modeling. The number of combinations in a data set grows as the factorial of the number of elements, which grows faster than any exponential. Hence, Moore’s law can never catch up. At some point, someone will need to exercise some judgment, in which case Big Data is not really different from the ordinary data that we deal with all the time.
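
To see how badly the race is lost, here is a back-of-the-envelope sketch in Julia. The doubling time of two years is just the stylized Moore's law number, and the factorial stands in for the combinatorial growth; both are assumptions for illustration only.

    # Compare combinatorial growth (n!) with a stylized Moore's law
    # (computing power doubling every two years -- an assumption).
    for n in (10, 20, 40, 80)
        log2_combinations = sum(log2, 1:n)            # log2(n!) computed without overflow
        years_of_doublings = 2 * log2_combinations    # years of doublings needed to keep pace
        println("n = $n: log2(n!) ≈ ", round(log2_combinations, digits = 1),
                ", ≈ ", round(years_of_doublings, digits = 1), " years of Moore's law")
    end

Even at n = 80, exhaustively considering every ordering would take centuries of doublings, which is the sense in which hardware alone can never substitute for judgment.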

Paper on compressed sensing and genomics

New paper on the arXiv. The next step after the completion of the Human Genome Project was the search for genes associated with diseases such as autism or diabetes. However, after spending hundreds of millions of dollars, we find that there are very few common variants of genes with large effects. This doesn’t mean that there aren’t genes with large effects. The growth hormone gene definitely has a large effect on height. It just means that variations of genes that are common among people have small effects on the phenotype. Given the results of Fisher, Wright, Haldane and colleagues, this was probably expected as the most likely scenario, and recent results measuring narrow-sense heritability directly from genetic markers (e.g. see this) confirm this view.

Current GWAS microarrays consider about a million or two markers, and this number is increasing rapidly. Narrow-sense heritability refers to the additive or linear genetic variance, which means the phenotype is given by the linear model y = Z\beta + \eta, where y is the phenotype vector, Z is the genotype matrix, \beta contains all the genetic effects we want to recover, and \eta contains all the nonadditive components, including environmental effects. This is a classic linear regression problem. The problem comes when the number of coefficients in \beta far exceeds the number of people in your sample, which is the case in genomics. Compressed sensing is a field of high-dimensional statistics that addresses this specific problem. People such as David Donoho, Emmanuel Candes and Terence Tao have proven under fairly general conditions that if the nonzero coefficients are sparse, i.e. few in number compared to the number of samples, then the effects can be completely recovered using L1-penalized optimization algorithms such as the lasso or approximate message passing. In this paper, we show that these ideas can be applied to genomics.
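
To give a flavor of how sparse recovery works, here is a toy Julia sketch. It is not the code from the paper: the genotype matrix is replaced by a random matrix, the effect sizes, noise level, and penalty are made up, and the L1 problem is solved with plain iterative soft-thresholding (ISTA) rather than the lasso and approximate message passing solvers used in the paper.

    using LinearAlgebra, Random

    # Toy sparse recovery: n individuals, p markers, only k nonzero effects.
    Random.seed!(1)
    n, p, k = 500, 2000, 20
    Z = randn(n, p)                                       # stand-in for a genotype matrix
    beta_true = zeros(p)
    beta_true[randperm(p)[1:k]] .= rand([-1.0, 1.0], k)   # sparse effects
    sigma = 0.5
    y = Z * beta_true .+ sigma .* randn(n)                # phenotype y = Z*beta + eta

    soft(x, t) = sign(x) * max(abs(x) - t, 0.0)           # soft-thresholding operator

    function ista(Z, y, lambda; iters = 500)
        L = opnorm(Z)^2                                   # Lipschitz constant of the gradient
        beta = zeros(size(Z, 2))
        for _ in 1:iters
            grad = Z' * (Z * beta .- y)
            beta = soft.(beta .- grad ./ L, lambda / L)
        end
        return beta
    end

    lambda = sigma * sqrt(2 * n * log(p))                 # rough universal-threshold choice
    beta_hat = ista(Z, y, lambda)
    found = findall(!iszero, beta_hat)
    truth = findall(!iszero, beta_true)
    println("recovered ", length(intersect(found, truth)), " of $k true loci, with ",
            length(setdiff(found, truth)), " spurious ones")

In this toy setup the sample size is far above the sparsity level, so essentially all of the planted loci come back; the interesting regime in the paper is the phase transition to complete selection as the sample size is increased past a multiple of the number of nonzero loci.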

Here is Steve Hsu’s summary of the paper

Application of compressed sensing to genome wide association studies and genomic selection

(Submitted on 8 Oct 2013)

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h^2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h^2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.

Comments: Main paper (27 pages, 4 figures) and Supplement (5 figures) combined
Subjects: Genomics (q-bio.GN); Applications (stat.AP)
Cite as: arXiv:1310.2264 [q-bio.GN]
(or arXiv:1310.2264v1 [q-bio.GN] for this version)

MCMC and fitting models to data

As I have posted before, I never learned any statistics during my education as a theoretical physicist/applied mathematician. However, it became fairly apparent after I entered biology (although I managed to avoid it for a few years) that fitting models to data and estimating parameters is unavoidable. As I have opined multiple times previously, Bayesian inference using the Markov Chain Monte Carlo (MCMC) method is the best way to do this. I thought I would provide the details on how to implement the algorithm, since most treatments of the subject are couched in statistical language that may be hard to decipher for someone with a physics or dynamical systems background. I’ll do this in a series of posts and in a nonstandard order. Instead of building up the entire Bayesian formalism, I thought I would show how the MCMC method can be used to do the job and then later show how it fits into the grand scheme of things to do even more.

Suppose you have data D(t_i), which consists of some measurements (either a scalar or a vector) at discrete time points t_i, and a proposed model that produces a time series y(t | \theta), where \theta represents a set of free parameters that change the output. This model could be a system of differential equations or just some proposed function like a polynomial. The goal is to find the set of parameters that best fits the data and to evaluate how good the model is. Now, the correct way to do this is to use Bayesian inference and model comparison, which can be computed using the MCMC. However, the MCMC can also be used just to get the parameters, in the sense of finding the best fit according to some criterion.
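
As a preview of the series, here is a minimal random-walk Metropolis sketch in Julia. The exponential decay model, noise level, and proposal width are all made up for illustration; the target is simply exp(-\chi^2/2), i.e. a Gaussian error criterion, and the best sample is taken as the fit.

    using Random

    # Fit theta = (a, b) of the toy model y(t|theta) = a*exp(-b*t) to noisy data.
    Random.seed!(2)
    model(t, theta) = theta[1] * exp(-theta[2] * t)

    t = 0.0:0.5:10.0
    theta_true = [2.0, 0.4]
    D = model.(t, Ref(theta_true)) .+ 0.1 .* randn(length(t))     # synthetic data D(t_i)

    chi2(theta) = sum((D .- model.(t, Ref(theta))).^2) / 0.1^2    # squared error / sigma^2

    function metropolis(theta0; steps = 20_000, stepsize = 0.02)
        theta, c = copy(theta0), chi2(theta0)
        samples = Vector{Vector{Float64}}()
        for _ in 1:steps
            proposal = theta .+ stepsize .* randn(length(theta))  # random-walk proposal
            c_new = chi2(proposal)
            if rand() < exp((c - c_new) / 2)                      # Metropolis accept/reject
                theta, c = proposal, c_new
            end
            push!(samples, copy(theta))
        end
        return samples
    end

    samples = metropolis([1.0, 1.0])
    best = samples[argmin(chi2.(samples))]
    println("best-fit parameters: ", round.(best, digits = 3))

The same chain of samples also approximates the posterior distribution over \theta, which is where the full Bayesian machinery in the rest of the series comes in.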

Slides for two talks

I just gave a seminar in the math department at the University of Iowa today. I gave a talk that was similar to the one I gave at the Mathematical Biosciences Institute on Bayesian Inference for dynamical systems. My slides are here. I was lucky I made it to give this talk. My flight from Chicago O’Hare to Cedar Rapids, Iowa was cancelled yesterday and I was rebooked on another flight tonight, which wouldn’t have been of much use for my talk this afternoon. There was one flight left last evening to Cedar Rapids, but bad weather had cancelled several flights yesterday so there were many stranded travelers. I was number 38 on the standby list and thought I had no chance to make it out that night. However, on a whim I decided to wait it out, and I started to move up the list because other people had evidently given up as well. To my great surprise and relief, I was the last person to get on the plane. There was a brief scare when they asked me to get off because we exceeded the weight limitation, but then they changed their mind and let us all fly. (Someone else had kindly volunteered to take my place.) I learned two lessons. One is to keep close watch of your flight at all times so you can get on the standby list as soon as possible, and two is that even number 38 on a plane that only seats 50 can still make it.

I also gave a talk at Northwestern University in March on the gene induction project that I described here.  My slides are here.

Dollar cost averaging

It is often recommended that when investing money in a mutual fund or security you use a strategy called dollar cost averaging. This simply means you invest a fixed amount of money at periodic intervals rather than, say, buying a fixed number of shares at periodic intervals or investing a lump sum. The argument is that when the price of the fund is low you buy more shares, and when it is high you buy fewer shares. In this way you can ride out the fluctuations in the stock price.

Evaluating which strategy is best is a great example of how concepts in calculus, and in particular Jensen’s inequality, enter everyday life, and one more reason why the average person should learn more math. The question is how the investment strategies compare over time. Let p_t be the price of the fund at time t. In dollar cost averaging, you would invest s dollars at periodic intervals (say each week). So the number of shares you would buy each week is n_t = s/p_t. Over a time period T you will have purchased \sum_t s/p_t = sT \left( \frac{1}{T} \sum_t 1/p_t \right) shares, which can be rewritten as sT E[1/p_t], where E is the expectation value or average. Since you have invested sT dollars, you have paid \tilde{p} = 1/E[1/p_t] per share on average, which is also called the harmonic mean.

If on the other hand you decided to buy a fixed number of shares n per week, then it would cost n p_t each week. Over the time period T you will have spent \sum_t n p_t and have nT shares. Thus you have paid \bar{p} = E[p_t] per share on average. Now the question is how the mean \bar{p} compares to the harmonic mean \tilde{p}. This is where Jensen’s inequality comes in, which says that for a convex function f(x), E[f(x)] \ge f(E[x]). A convex function is a function for which a straight line joining any two points on the graph lies on or above the function. Since 1/x is a convex function for positive prices, this means that E[1/p_t] \ge 1/E[p_t], which can be rearranged to show that \bar{p} \ge \tilde{p}. So dollar cost averaging is always better than or as good as buying the same number of shares each week. Now of course, if you buy on a day when the price is below the harmonic mean over the time horizon you are looking at, you will always do better.
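
Here is a quick numerical check in Julia. The lognormal weekly price path is just an assumption for illustration; any positive, fluctuating price series would do.

    using Random, Statistics

    # Compare the average price per share paid by the two strategies.
    Random.seed!(3)
    p = 100.0 .* exp.(cumsum(0.02 .* randn(260)))    # five years of weekly prices (toy model)

    dca_price   = 1 / mean(1 ./ p)   # fixed dollars per week: harmonic mean of prices
    fixed_price = mean(p)            # fixed shares per week: arithmetic mean of prices

    println("dollar cost averaging pays ≈ ", round(dca_price, digits = 2), " per share")
    println("fixed-share buying pays ≈ ", round(fixed_price, digits = 2), " per share")
    @assert dca_price <= fixed_price  # Jensen: the harmonic mean never exceeds the mean

Whatever random path you draw, the assert never fires, which is Jensen's inequality at work.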

Connecting the dots

As I posted previously, I am highly skeptical that any foolproof system can be developed to screen for potential threats. My previous argument was that in order to have a zero false negative rate for identifying terrorists, it would be impossible not to also have a relatively significant false positive rate. In other words, the only way to guarantee that a terrorist doesn’t board a plane is to not let anyone board. A way to visualize this graphically is with what is called a receiver operating characteristic or ROC curve, which is a plot of the true positive rate versus the false positive rate for a binary classification test as some parameter, usually a discrimination threshold, is changed. Ideally, one would like a curve that jumps to a true positive rate of 1 at zero false positive rate. The area under the ROC curve (AROC) is the usual measure of how good a discriminator is. So a perfect discriminator has AROC = 1. In my experience with biological systems, it is pretty difficult to make a test with an AROC of greater than 90%. Additionally, ROC curves are usually somewhat smooth, so that they only reach a true positive rate of 1 at a false positive rate of 1.
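
For concreteness, here is a small Julia sketch of how an ROC curve and its area are computed for a toy score-based screen. The Gaussian score distributions are made up and are not meant to model any real test.

    using Random, Statistics

    # Toy ROC curve: sweep a discrimination threshold over classifier scores.
    Random.seed!(4)
    scores_pos = 1.0 .+ randn(1000)   # scores for true positives (one sigma of separation)
    scores_neg = randn(1000)          # scores for true negatives

    thresholds = vcat(Inf, sort(vcat(scores_pos, scores_neg), rev = true))
    tpr = [mean(scores_pos .>= c) for c in thresholds]   # true positive rate
    fpr = [mean(scores_neg .>= c) for c in thresholds]   # false positive rate

    # Area under the ROC curve by the trapezoidal rule.
    aroc = sum(diff(fpr) .* (tpr[1:end-1] .+ tpr[2:end]) ./ 2)
    println("AROC ≈ ", round(aroc, digits = 3))

With only one standard deviation separating the two score distributions, the AROC comes out around 0.76, well short of a perfect discriminator, and the curve only reaches a true positive rate of 1 when the false positive rate is also 1.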

Practicalities aside, is there any mathematical reason why a perfect or near perfect discriminator couldn’t be designed? This to me is the more interesting question. My guess is that deciding if a person is a terrorist is an NP-hard question, which is why it is so insidious. For any problem in NP, it is simple to verify a proposed answer but hard to find one. Connecting all the dots to show that someone is a terrorist is a straightforward matter if you already know that they are a terrorist. The same is true of proving the Riemann Hypothesis or solving the 3D Ising model: the solution is obvious once you know the answer. If terrorist finding is NP-hard, then for a large enough population, and I think 5 billion qualifies, no method and no achievable amount of computational power is sufficient to do the job perfectly.