I predicted that there would be an eventual pushback on Big Data, and it seems that it has begun. Gary Marcus and Ernest Davis of NYU had an op-ed in the Times yesterday outlining nine issues with Big Data. I think one way to encapsulate many of the critiques is that you will never be able to do truly prior-free data modeling. The number of combinations in a data set grows as the factorial of the number of elements, which grows faster than any exponential. Hence, Moore's law can never catch up. At some point, someone will need to exercise some judgement, in which case Big Data is not really different from the ordinary data that we deal with all the time.
New paper on the arXiv. The next step after the completion of the Human Genome Project was the search for genes associated with diseases such as autism or diabetes. However, after spending hundreds of millions of dollars, we find that there are very few common variants of genes with large effects. This doesn't mean that there aren't genes with large effect. The growth hormone gene definitely has a large effect on height. It just means that variations of genes that are common among people have small effects on the phenotype. Given the results of Fisher, Wright, Haldane and colleagues, this was probably expected as the most likely scenario, and recent results measuring narrow-sense heritability directly from genetic markers (e.g. see this) confirm this view.
Current GWAS microarrays consider about a million or two markers and this number is increasing rapidly. Narrow-sense heritability refers to the additive or linear genetic variance, which means the phenotype is given by the linear model $y = X\beta + \epsilon$, where $y$ is the phenotype vector, $X$ is the genotype matrix, $\beta$ are all the genetic effects we want to recover, and $\epsilon$ are all the nonadditive components including environmental effects. This is a classic linear regression problem. The problem comes when the number of coefficients far exceeds the number of people in your sample, which is the case for genomics. Compressed sensing is a field of high-dimensional statistics that addresses this specific problem. People such as David Donoho, Emmanuel Candes and Terence Tao have proven under fairly general conditions that if the number of nonzero coefficients is small (sparse) compared to the number of samples, then the effects can be completely recovered using L1-penalized optimization algorithms such as the lasso or approximate message passing. In this paper, we show that these ideas can be applied to genomics.
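As a toy illustration of the idea (the sizes, penalty, and solver below are my illustrative choices, not the paper's), here is noiseless sparse recovery of the linear model with the lasso, solved by simple iterative soft-thresholding (ISTA):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, s = 200, 1000, 10             # sample size, markers, nonzero effects
X = rng.standard_normal((n, p)) / np.sqrt(n)   # toy "genotype" matrix
beta = np.zeros(p)
idx = rng.choice(p, s, replace=False)
beta[idx] = rng.uniform(0.5, 2.0, s) * rng.choice([-1.0, 1.0], s)
y = X @ beta                         # noiseless case, i.e. h^2 = 1

# lasso via ISTA: gradient step on the least-squares term, then soft threshold
lam = 0.01                           # L1 penalty strength
L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the gradient
b = np.zeros(p)
for _ in range(5000):
    g = b - X.T @ (X @ b - y) / L
    b = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)

support_true = set(np.flatnonzero(beta))
support_hat = set(np.flatnonzero(np.abs(b) > 0.1))
```

With these dimensions the planted support should be recovered exactly: n = 200 samples for s = 10 nonzero effects is well past the compressed sensing phase transition, even though p = 1000 coefficients far exceeds n.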
Here is Steve Hsu's summary of the paper:
Application of compressed sensing to genome wide association studies and genomic selection
We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.
As I have posted before, I never learned any statistics during my education as a theoretical physicist/applied mathematician. However, it became fairly apparent after I entered biology (although I managed to avoid it for a few years) that fitting models to data and estimating parameters is unavoidable. As I have opined multiple times previously, Bayesian inference with the Markov Chain Monte Carlo (MCMC) method is the best way to do this. I thought I would provide the details of how to implement the algorithm, since most treatments of the subject are couched in statistical language that may be hard to decipher for someone with a physics or dynamical systems background. I'll do this in a series of posts and in a nonstandard order. Instead of building up the entire Bayesian formalism, I thought I would show how the MCMC method can be used to do the job, and then later show how it fits into the grand scheme of things to do even more.
Suppose you have data $D = \{d_1, d_2, \dots, d_n\}$, which consists of some measurements (either a scalar or a vector) at discrete time points $t_1, t_2, \dots, t_n$, and a proposed model, which produces a time series $y(t;\theta)$, where $\theta$ represents a set of free parameters that changes the function. This model could be a system of differential equations or just some proposed function like a polynomial. The goal is to find the set of parameters $\theta$ that best fits the data and to evaluate how good the model is. Now, the correct way to do this is to use Bayesian inference and model comparison, which can be computed using the MCMC. However, the MCMC can also be used just to get the parameters in the sense of finding the best fit according to some criterion.
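To make this concrete, here is a minimal random-walk Metropolis sketch that fits a two-parameter exponential model to synthetic data. The model, noise level, step size, and (implicitly flat) priors are illustrative assumptions of mine, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic data: d_i = a*exp(-b*t_i) + Gaussian noise
a_true, b_true, sigma = 2.0, 0.5, 0.05
t = np.linspace(0.0, 5.0, 50)
d = a_true * np.exp(-b_true * t) + sigma * rng.standard_normal(t.size)

def chi2(theta):
    """Sum of squared residuals scaled by the noise variance."""
    a, b = theta
    return np.sum((d - a * np.exp(-b * t)) ** 2) / (2 * sigma ** 2)

# random-walk Metropolis: propose a jiggle, accept with prob min(1, e^{-dchi2})
theta = np.array([1.0, 1.0])        # deliberately poor starting guess
cost = chi2(theta)
samples = []
for i in range(20000):
    prop = theta + 0.02 * rng.standard_normal(2)
    c = chi2(prop)
    if np.log(rng.random()) < cost - c:
        theta, cost = prop, c
    if i >= 5000:                   # discard burn-in
        samples.append(theta.copy())

samples = np.asarray(samples)
a_hat, b_hat = samples.mean(axis=0)
```

The mean of the retained samples estimates the parameters, and their spread estimates the uncertainty, which is exactly what a point-estimate fit does not give you.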
I just gave a seminar in the math department at the University of Iowa today. I gave a talk that was similar to the one I gave at the Mathematical Biosciences Institute on Bayesian Inference for dynamical systems. My slides are here. I was lucky I made it to give this talk. My flight from Chicago O'Hare to Cedar Rapids, Iowa was cancelled yesterday and I was rebooked on another flight tonight, which wouldn't have been of much use for my talk this afternoon. There was one flight left last evening to Cedar Rapids, but bad weather had cancelled several flights yesterday so there were many stranded travelers. I was number 38 on the standby list and thought I had no chance to make it out that night. However, on a whim I decided to wait it out, and I started to move up the list because other people had evidently given up as well. To my great surprise and relief I was the last person to get on the plane. There was a brief scare when they asked me to get off because we exceeded the weight limitation, but then they changed their mind and let us all fly. (Someone else had kindly volunteered to take my place.) I learned two lessons. One is to keep a close watch on your flight at all times so you can get on the standby list as soon as possible, and two is that even number 38 on a plane that only seats 50 can still make it.
It is often recommended that when investing money in a mutual fund or security, you use a strategy called dollar cost averaging. This simply means you invest a fixed amount of money at periodic intervals rather than, say, buy a fixed number of shares at periodic intervals or invest as a lump sum. The argument is that when the price of the fund is low you buy more shares, and when it is high you buy fewer shares. In this way you can ride out the fluctuations in the stock price.
Evaluating which strategy is best is a great example of how concepts in calculus, and in particular Jensen's inequality, enter everyday life, and more reason why an average person should learn more math. The question is how the investment strategies compare over time. Let $P_t$ be the price of the fund at time $t$. In dollar cost averaging, you would invest $M$ dollars at periodic intervals (say each week). So the number of shares you would buy each week is $M/P_t$. Over a time period $T$ you will have purchased $\sum_{t=1}^{T} M/P_t$ shares, which can be rewritten as $T M E[1/P]$, where $E[\cdot]$ is the expectation value or average. Since you have invested $TM$ dollars, you have paid $1/E[1/P]$ per share on average, which is also called the harmonic mean.
If on the other hand you decided to buy a fixed number of shares $N$ per week, then each week it would cost $N P_t$. Over the time period you will have spent $\sum_{t=1}^{T} N P_t = T N E[P]$ and have $TN$ shares. Thus you have paid $E[P]$ per share on average. Now the question is how the mean $E[P]$ compares to the harmonic mean $1/E[1/P]$. This is where Jensen's inequality comes in, which says that for a convex function $\phi$, $E[\phi(P)] \ge \phi(E[P])$. A convex function is a function where a straight line joining any two points on the function lies on or above the function. Since $1/P$ is a convex function for $P > 0$, this means that $E[1/P] \ge 1/E[P]$, which can be rearranged to show that $1/E[1/P] \le E[P]$. So dollar cost averaging is always better than or as good as buying the same number of shares each week. Now of course if you invest a lump sum on a day when the price is below the harmonic mean over the time horizon you are looking at, you will do even better.
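A quick numerical check of the inequality (the price series below is a synthetic random walk, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# a year of weekly fund prices (synthetic random walk, purely illustrative)
P = 100.0 * np.exp(np.cumsum(0.02 * rng.standard_normal(52)))

M = 100.0                                 # dollars invested per week (DCA)
dca_shares = np.sum(M / P)
dca_price = (M * P.size) / dca_shares     # = harmonic mean of the prices

N = 1.0                                   # shares bought per week instead
fixed_cost = np.sum(N * P)
fixed_price = fixed_cost / (N * P.size)   # = arithmetic mean of the prices
```

Since the harmonic mean of a varying positive series is strictly less than its arithmetic mean, dollar cost averaging always pays the lower average price per share here, exactly as Jensen's inequality guarantees.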
As I posted previously, I am highly skeptical that any foolproof system can be developed to screen for potential threats. My previous argument was that in order to have a zero false negative rate for identifying terrorists, it would be impossible not to also have a relatively significant false positive rate. In other words, the only way to guarantee that a terrorist doesn't board a plane is to not let anyone board. A way to visualize this graphically is with what is called a receiver operating characteristic or ROC curve, which is a plot of the true positive rate versus the false positive rate for a binary classification test as some parameter, usually a discrimination threshold, is changed. Ideally, one would like a curve that jumps to a true positive rate of 1 at zero false positive rate. The area under the ROC curve (AROC) is the usual measure of how good a discriminator is. So a perfect discriminator has AROC = 1. In my experience with biological systems, it is pretty difficult to make a test with an AROC of greater than 90%. Additionally, ROC curves are usually somewhat smooth, so that they only reach a true positive rate of 1 at a false positive rate of 1.
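As a concrete (synthetic, purely illustrative) example, here is an ROC curve for a scalar score that only partially separates two Gaussian classes, with the area computed as the probability that a random positive outscores a random negative:

```python
import numpy as np

rng = np.random.default_rng(3)

# scores for 2000 negatives and 2000 positives: overlapping Gaussians,
# so no threshold separates them perfectly
neg = rng.normal(0.0, 1.0, 2000)
pos = rng.normal(1.5, 1.0, 2000)

# sweep the discrimination threshold to trace out the ROC curve
thresholds = np.linspace(-4.0, 6.0, 501)
tpr = np.array([(pos > th).mean() for th in thresholds])  # true positive rate
fpr = np.array([(neg > th).mean() for th in thresholds])  # false positive rate

# AROC equals the probability a random positive outscores a random negative
auc = (pos[:, None] > neg[None, :]).mean()
```

For this amount of class overlap the AROC comes out around 0.85, and no point on the curve reaches a true positive rate of 1 without a substantial false positive rate, which is the whole problem with screening.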
Practicalities aside, is there any mathematical reason why a perfect or near-perfect discriminator couldn't be designed? This to me is the more interesting question. My guess is that deciding whether a person is a terrorist is an NP-hard question, which is why it is so insidious. For any problem in NP, it is simple to verify an answer but hard to find one. Connecting all the dots to show that someone is a terrorist is a straightforward matter if you already know that they are a terrorist. This is also true of proving the Riemann Hypothesis or solving the 3D Ising model: the solution is easy to check if you know the answer. If terrorist finding is NP-hard, then for a large enough population, and I think 5 billion qualifies, no method and no achievable amount of computational power is sufficient to do the job perfectly.
The classic Malthus argument is that populations will grow exponentially until they are nonlinearly suppressed by famine or disease due to overcrowding. However, the lesson of the twentieth century is that populations can be checked for other reasons as well. This is not necessarily a refutation of Malthus per se, but rather an indication that the quantity that limits population growth need not be restricted to food or health. There seems to be a threshold of economic prosperity above which family income or personal freedom becomes the rate-limiting step for a bigger family. Developed nations such as Japan and Germany are approaching zero population growth and trending towards negative growth. Russia currently has negative growth.
Hence, we can slow down population growth by increasing economic growth. China is starting to see a very steep decline in population growth in the big cities like Shanghai that is independent of the one child policy. The emerging middle class is now taking into account the cost of raising a child and how it would affect their lifestyle. In a very poor country, the cost of raising a child is not really an issue. In fact, if the probability of a child making it to adulthood is low and help from children is the only way for the elderly to survive then it is logical to have as many children as possible. In this case, the classic Malthus argument with food and health (and aid) as the rate limiting quantities applies.
I bring this point up now because it is crucial for the current debate about what to do about climate change. One way of mitigating human impact on the environment is to slow down population growth. However, the most humane and effective method of doing that is to increase economic growth, which will then lead to an increase in emissions. For example, in 2005, the US produced 23.5 tonnes of CO2 equivalents per person (which incidentally is not the highest in the world and less than half that of world leader Qatar), while China produced about 5.5 tonnes and Niger 0.1 tonnes. (This is not accounting for the extra emissions due to changes in land use.) In absolute terms, China already produces more greenhouse gases than the US, and India is not far behind. On the other hand, the population growth rate of Niger is 3.6%, India's is 1.5% and dropping, China's is 0.6% and the US's is 1%. So, when we increase economic prosperity, we can reduce population growth and presumably suffering, but we will also increase greenhouse gas emissions.
Given that the world economy and agricultural system are entirely based on fossil fuels, it is also true that, at least in the short term, a restriction of carbon emissions will slow down or reduce economic growth. Thus, even though climate change could have a catastrophic outcome for the future, curbing economic growth could also be bad. For a developing nation, and even some developed nations, the choice may not be so clear cut. It is thus no wonder that the current debate in Copenhagen is so contentious. I think that unless the developed world can demonstrate that viable economic growth and prosperity can be achieved with reduced carbon emissions, the rest of the world and many people will remain skeptical. I don't think the leaders in the climate change community realize that this skepticism about carbon restrictions may not be entirely irrational.
Darts is a popular spectator sport in the UK. I had access to cable television recently so I was able to watch a few games. What I find interesting about professional darts is that the players must solve a Diophantine equation to win. For those who know nothing of the game, it involves throwing a small pointed projectile at an enumerated target board that looks like this:
A dart that lands in a given sector on the board obtains that score. The center circle of the board, called the bull's eye, is worth 50 points. The ring around the bull's eye is worth 25 points. The wedges are worth the score given by the number on the perimeter. However, if you land in the inner ring you get triple the score of the wedge, and if you land in the outer ring you get double the score. Hence, the maximum number of points for one dart is the triple twenty, worth 60 points.
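To see the Diophantine flavor concretely, here is a sketch that enumerates three-dart finishes. One assumption of mine that the post doesn't mention: in the professional game a leg must end on a double (or the bull), which is what makes the arithmetic puzzle interesting.

```python
from itertools import product

# every score a single dart can achieve, labeled by region
singles = {f"S{n}": n for n in range(1, 21)}
doubles = {f"D{n}": 2 * n for n in range(1, 21)}
triples = {f"T{n}": 3 * n for n in range(1, 21)}
darts = {**singles, **doubles, **triples, "25": 25, "Bull": 50}

def checkouts(target):
    """Three-dart combinations totaling `target` that end on a double
    or the bull (the standard finishing rule)."""
    finishers = {**doubles, "Bull": 50}
    outs = []
    for (n1, v1), (n2, v2) in product(darts.items(), repeat=2):
        for n3, v3 in finishers.items():
            if v1 + v2 + v3 == target:
                outs.append((n1, n2, n3))
    return outs
```

Running checkouts(170) turns up the famous T20, T20, Bull finish, while checkouts(169) comes back empty: 169 is one of the "bogey" numbers that cannot be finished in three darts, which is exactly the kind of solvability question a Diophantine equation asks.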
I used to believe that one day all human labour would be automated (e.g. see here). Upon further reflection, I realize that I was wrong. The question of whether or not machines will someday replace all humans depends crucially on whether or not P is equal to NP. Jobs that will eventually be automated will be the ones that can be solved easily with an algorithm. In computer science parlance, these are problems in the computational complexity class P (solvable in polynomial time). For example, traditional travel agents have disappeared faster than bluefin tuna because their task is pretty simple to automate. However, not all travel agents will disappear. The ones that survive will be more like concierges who put together complex travel arrangements or negotiate with many parties.
Eventually, the jobs that humans will hold (barring a collapse of civilization as we know it) will involve solving problems in the complexity class NP (or harder). That is not to say that machines won't be doing some of these jobs, only that the advantage of machines over humans will not be as clear cut. It is true that if we could fully reproduce a human and make it faster and bigger, it could do everything that a human could do, only better; but as I blogged about before, I think it will be difficult to exactly reproduce humans. Additionally, for some very hard problems that don't even have good approximation schemes, blind luck will play an important role in coming up with solutions. Balancing different human-centric priorities will also be important, and that may be best left for humans to do. Even if it turns out that P = NP, there could still be some jobs for humans, like working on undecidable problems.
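The verify-easy/find-hard asymmetry at the heart of this argument can be seen in a toy subset-sum example (an NP-complete problem; the numbers below are mine):

```python
from itertools import combinations

def verify(nums, subset, target):
    """Checking a proposed certificate takes polynomial time."""
    return sum(subset) == target and all(x in nums for x in subset)

def search(nums, target):
    """Finding a certificate naively takes time exponential in len(nums)."""
    for r in range(1, len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == target:
                return combo
    return None
```

verify runs in time linear in the size of the certificate, while search may have to examine all 2^n subsets in the worst case; no known algorithm avoids that exponential blowup in general.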
So what are some jobs that will be around in the NP economy? Well, I think mathematicians will still be employed. Theorems can be verified in polynomial time, but there are no known algorithms in P to generate them. That is not to say that there won't be robot mathematicians, and mathematicians will certainly use automated theorem proving programs to help them (e.g. see here). However, I think the human touch will always have some use. Artists and comedians will also have jobs in the future. These are professions that require intimate knowledge of what it is like to be human. Again, there will be machine comics and artists but they won't fully replace humans. I also think that craftsmen like carpenters, stone masons, basket weavers and so forth could make a comeback. They will have to exhibit some exceptional artistry to survive, but the demand for them could increase since some people will always long for the human touch in their furniture and houses.
The question then is whether or not there will be enough NP jobs to go around and whether or not everyone is able and willing to hold one. To some, an NP economy will be almost Utopian – everyone will have interesting jobs. However, there may be some people who simply don’t want or can’t do an NP job. What will happen to them? I think that will be a big (probably undecidable) problem that will face society in the not too distant future, provided we make it that far.
The American health care system relies on a "fee for service" model, in which physicians are reimbursed for the procedures they perform. I think this is a perfect example of how organizational structure, and in particular incentives, can affect outcomes. Free market proponents argue that the only system that can optimally distribute goods and services is a free market. I tangentially posted on efficient markets a short time ago. However, even with a free market, the rules of the game determine what it means to win. For example, when physicians are reimbursed for procedures, it makes sense for them to perform as many procedures as possible. If it is the choice between an inexpensive therapy and an expensive one and there is no clear cut evidence for the benefit of either, then why choose the inexpensive option? A provocative article in the New Yorker by Atul Gawande shows what can happen when this line of thought is taken to the extreme. Another unintended consequence of the fee for service model may be that there is no incentive to recruit individuals for clinical studies, as detailed in this article by Gina Kolata in the New York Times. The interesting thing about both of these examples is that they are independent of whether health insurance is private, public or single payer. Gawande's article was mostly about Medicare, which is government run. An alternative to fee for service is "fee for outcome", where physicians are rewarded for having healthier patients. Gawande favours the Mayo Clinic model, where the physicians have a fixed salary and focus on maximizing patient care. There must be a host of different compensation models possible, which I'm sure economists have explored. However, perhaps this is also a (critically important) problem where ideas from physics and applied math might be useful.
I was visiting the University of Utah this past week. I gave talks on the Kinetic Theory of Coupled Oscillators and on Deriving Moment Equations for Neural Networks. On my way to the airport I wondered what would be the optimal arrival time, so that the time spent waiting in the airport is balanced against the cost of missing a flight. If you make some basic assumptions, it's not too hard to derive a condition for the optimum. Let's say the only thing we're concerned about is minimizing wasted time. Then what we want to do is balance the average time spent waiting in airports against the average time lost making up for a missed flight.
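Here is one way to set that up (every number below is an illustrative assumption of mine: travel time to the gate modeled as normal with mean mu and spread sd, and a missed flight costing R extra minutes). With a buffer of t minutes, the expected wasted time is C(t) = (t - mu) + R p(t), where p(t) is the probability of missing the flight, so the optimum satisfies -p'(t) = 1/R:

```python
import math

# All numbers are illustrative assumptions:
mu, sd = 60.0, 15.0   # mean and spread (minutes) of time needed to reach the gate
R = 300.0             # extra minutes lost if you miss the flight

def p_miss(t):
    """P(time to gate > t) under the normal model of travel time."""
    return 0.5 * math.erfc((t - mu) / (sd * math.sqrt(2.0)))

def expected_waste(t):
    """Average slack at the gate plus the expected cost of missing."""
    return (t - mu) + R * p_miss(t)

# grid-search the optimal buffer; at the optimum, -p'(t) = 1/R
ts = [mu + 0.1 * k for k in range(1000)]
t_opt = min(ts, key=expected_waste)
```

With these numbers the optimal buffer comes out to roughly two standard deviations above the mean travel time, and a larger rebooking penalty R pushes the optimum later still, which matches the intuition that a costly miss justifies more waiting.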