John Ioannidis has a recent paper in Nature Reviews Neuroscience arguing that many results in neuroscience are wrong. The argument follows his previous papers on why most published results are wrong (see here and here) but emphasizes the abundance of studies with small sample sizes in neuroscience. This both reduces the chances of finding true positives and increases the chances of obtaining false positives. Underpowered studies are also susceptible to what is called the “winner’s curse”, where the effect sizes of true positives are artificially amplified. My take is that any phenomenon with a small effect should be treated with caution even if it is real. If you really wanted to find what causes a given disease then you would probably want to find something that is associated with all cases, not just a small percentage of them.
In a previous post, I summarized the Bayesian approach to model comparison, which requires the calculation of the Bayes factor between two models. Here I will show one computational approach that I use called thermodynamic integration, borrowed from molecular dynamics. Recall that we need to compute the model likelihood function

$$P(D|M) = \int P(D|\theta, M)\, P(\theta|M)\, d\theta \qquad (1)$$

for each model $M$, where $P(D|\theta, M)$ is just the parameter dependent likelihood function we used to find the posterior probabilities for the parameters $\theta$ of the model.
The integration over the parameters can be accomplished using Markov Chain Monte Carlo (MCMC), which I summarized previously here. We will start by defining the partition function

$$Z(\beta) = \int P(D|\theta, M)^{\beta}\, P(\theta|M)\, d\theta \qquad (2)$$

where $\beta$ is an inverse temperature. The derivative of the log of the partition function gives

$$\frac{d}{d\beta} \ln Z(\beta) = \frac{1}{Z(\beta)} \int \ln P(D|\theta, M)\, P(D|\theta, M)^{\beta}\, P(\theta|M)\, d\theta \qquad (3)$$

which is equal to the ensemble average of $\ln P(D|\theta, M)$ at inverse temperature $\beta$. However, if we assume that the MCMC has reached stationarity then we can replace the ensemble average with a time average $\langle \ln P(D|\theta, M) \rangle_{\beta}$. Integrating (3) over $\beta$ from 0 to 1 gives

$$\ln Z(1) - \ln Z(0) = \int_0^1 \langle \ln P(D|\theta, M) \rangle_{\beta}\, d\beta \qquad (4)$$

From (1) and (2), we see that $Z(1) = P(D|M)$, which is what we want to compute, and $Z(0) = 1$ because the prior is normalized.

Hence, to perform Bayesian model comparison, we simply run the MCMC for each model at different temperatures (i.e. use $P(D|\theta, M)^{\beta}$ as the likelihood in the standard MCMC) and then integrate the log likelihoods over $\beta$ at the end. For a Gaussian likelihood function, changing temperature is equivalent to changing the data “error”. The higher the temperature (i.e. the smaller $\beta$), the larger the presumed error. In practice, I usually run at seven to ten different values of $\beta$ and use a simple trapezoidal rule to integrate over $\beta$. I can even do parameter inference and model comparison in the same MCMC run, since the $\beta = 1$ chain samples the posterior.
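As a sketch of this recipe, here is a toy implementation under assumed specifics that are not from the post: a conjugate Gaussian model (data $y_i \sim N(\mu, 1)$, prior $\mu \sim N(0, 1)$) chosen so the exact evidence is available for comparison, a simple Metropolis sampler, and 21 equally spaced inverse temperatures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model (illustrative assumptions): y_i ~ N(mu, 1), mu ~ N(0, 1).
y = np.array([0.5, 1.2, -0.3, 0.8, 1.0])
n = len(y)

def log_like(mu):
    return -0.5 * np.sum((y - mu) ** 2) - 0.5 * n * np.log(2 * np.pi)

def log_prior(mu):
    return -0.5 * mu ** 2 - 0.5 * np.log(2 * np.pi)

def mean_loglike(beta, steps=8000, burn=1000):
    """Metropolis chain targeting prior * likelihood^beta;
    returns the time average <ln L>_beta."""
    mu = 0.0
    lp = log_prior(mu) + beta * log_like(mu)
    samples = []
    for t in range(steps):
        prop = mu + rng.normal(0.0, 1.0)
        lp_prop = log_prior(prop) + beta * log_like(prop)
        if np.log(rng.random()) < lp_prop - lp:
            mu, lp = prop, lp_prop
        if t >= burn:
            samples.append(log_like(mu))
    return np.mean(samples)

# d(ln Z)/d(beta) = <ln L>_beta; integrate from 0 to 1 with the trapezoidal rule.
betas = np.linspace(0.0, 1.0, 21)
means = np.array([mean_loglike(b) for b in betas])
log_evidence_ti = np.sum(0.5 * (means[1:] + means[:-1]) * np.diff(betas))

# Exact log evidence for this conjugate model: marginally y ~ N(0, I + 11^T).
Sigma = np.eye(n) + np.ones((n, n))
exact = (-0.5 * y @ np.linalg.solve(Sigma, y)
         - 0.5 * np.linalg.slogdet(Sigma)[1]
         - 0.5 * n * np.log(2 * np.pi))
print(log_evidence_ti, exact)
```

With a conjugate model the thermodynamic integral can be checked against the closed-form evidence; for a real model only the MCMC estimate is available, and one typically monitors convergence of the $\beta$-averages instead.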
Erratum, May 2013: I just fixed an error in the final formula.
Nate Silver has been hailed in the media as a vindicated genius for correctly predicting the election. He was savagely attacked before the election for predicting that Obama would win handily. Kudos also go to Sam Wang, Pollster.com, electoral-vote.com, and all others who simply took the data obtained from the polls seriously. Hence, the real credit should go to all of the polling organizations for collectively not being statistically biased. It didn’t matter if single organizations were biased one way or the other as long as they were not correlated in their biases. The true power of prediction in this election was that the errors of the various pollsters were independently distributed. However, even if you didn’t take the data at face value, you could still reasonably predict the election. Obama had an inherent advantage because he had more paths to winning 270 electoral votes. Suppose there were 8 battleground states and Romney needed to win at least 6 of them. Hence, Romney had 28 ways to win while Obama had 228 ways to win. If the win probability were approximately a half in each of these states, which is what a lot of people claimed, then Romney had slightly more than a one in ten chance of winning, which is close to the odds given by Sam. The only way Romney’s odds would increase is if the state results were correlated in his favour. However, it would take a lot of correlated bias to make Romney the favourite.
Erratum, Nov 9 2012: Romney actually has 37 ways and Obama 219 in my example. The total must add up to 2^8=256. I forgot to include the fact that Romney could also win 7 of 8 states or all states in his paths to winning.
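The corrected counts in the erratum are easy to check directly (the 8-state, win-at-least-6 setup is the post’s hypothetical):

```python
from math import comb

states = 8
need = 6  # hypothetical: Romney must carry at least 6 of 8 battlegrounds

# Ways to win exactly k states is C(8, k); Romney needs k = 6, 7, or 8.
romney_ways = sum(comb(states, k) for k in range(need, states + 1))
obama_ways = 2 ** states - romney_ways  # every other outcome goes to Obama

# With a 50% win probability in each state, all 2^8 outcomes are equally likely.
p_romney = romney_ways / 2 ** states
print(romney_ways, obama_ways, round(p_romney, 3))  # 37 219 0.145
```

The 37/256 ≈ 14.5% figure is indeed “slightly more than a one in ten chance.”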
The US presidential election on Nov. 6 is expected to be particularly close. The polling has been vigorous and there are many statistical prediction web sites. One of them, the Princeton Election Consortium, is run by neuroscientist Sam Wang at Princeton University. For any non-American readers, the president is not elected directly by the citizens but through what is called the electoral college. This is a set of 538 electoral voters that are selected by the individual states. The electoral votes are allotted according to the number of congressional districts per state plus two. Hence, low population states are over-represented. Almost all states agree that the candidate who takes the plurality of the votes in that state wins all the electoral votes of that state. Maine and Nebraska are the two exceptions, allotting electoral votes according to who wins each congressional district. Thus in order to predict who will win, one must predict who will get at least 270 electoral votes. Most of the states are not competitive so the focus of the candidates (and media) is on a handful of so-called battleground states like Ohio and Colorado. Currently, Sam Wang predicts that President Obama will win the election with a median of 319 votes. Sam estimates the Bayesian probability for Obama’s re-election to be 99.6%. Nate Silver at another popular website (Five Thirty Eight), predicts that Obama will win 305 electoral votes and has a re-election probability of 83.7%.
These estimates are made by using polling data with a statistical model. Nate Silver uses national and state polls along with some economic indicators, although the precise model is unknown. Sam Wang uses only state polls. I’ll describe his method here. The goal is to estimate the probability distribution for the number of electoral votes a specific candidate will receive. The state space consists of $2^{51}$ possibilities (50 states plus the District of Columbia). I will assume that Maine and Nebraska do not split their votes along congressional districts although it is a simple task to include that possibility. Sam assumes that the individual states are statistically independent so that the joint probability distribution factorizes completely. He then takes the median of the polls for each state over some time window to represent the probability of that given state. The polling data is comprised of the voting preferences of a sample for a given candidate. The preferences are converted into probabilities using a normal distribution. He then computes the probability for all combinations. Suppose that there are just two states with win probabilities for your candidate of $p_1$ and $p_2$. The probability of your candidate winning both states is $p_1 p_2$, state 1 but not state 2 is $p_1 (1 - p_2)$, and so forth. If the states have $E_1$ and $E_2$ electoral votes respectively then if they win both states they will win $E_1 + E_2$ votes and so forth. To keep the bookkeeping simple, Sam uses the trick of expressing the probability distribution as a polynomial of a dummy variable $x$. So the probability distribution is

$$[p_1 x^{E_1} + (1 - p_1)][p_2 x^{E_2} + (1 - p_2)] = p_1 p_2\, x^{E_1 + E_2} + p_1 (1 - p_2)\, x^{E_1} + (1 - p_1) p_2\, x^{E_2} + (1 - p_1)(1 - p_2)$$
Hence, the coefficient of each term is the probability for the number of electoral votes given by the exponent of $x$. The expression for 51 “states” is $\prod_{i=1}^{51} \left[ p_i x^{E_i} + (1 - p_i) \right]$ and this can be evaluated quickly on a desktop computer. One can then take the median or mean of the distribution for the predicted number of electoral votes. The sum of the probabilities for electoral votes greater than 269 gives the winning probability, although Sam uses a more sophisticated method for his predicted probabilities. The model does assume that the state probabilities are independent. Sam tries to account for possible correlations by using what he calls a meta-margin, in which he calculates how much the probabilities (in terms of preference) would need to move for the leading candidate to lose. Also, the state polls will likely pick up any correlations as the election gets closer.
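The polynomial trick amounts to repeated convolution, which is a few lines of code. Here is a sketch with three invented states (the win probabilities and electoral-vote counts are made up for illustration):

```python
import numpy as np

# Hypothetical example: (win probability, electoral votes) per state.
states = [(0.8, 18), (0.6, 29), (0.3, 15)]

total_ev = sum(ev for _, ev in states)
dist = np.zeros(total_ev + 1)
dist[0] = 1.0  # the "empty product": zero votes with probability 1

# Multiplying by p*x^E + (1-p) is a shift-by-E with weight p
# plus a stay-in-place with weight 1-p.
for p, ev in states:
    new = (1.0 - p) * dist
    new[ev:] += p * dist[:dist.size - ev]
    dist = new

mean_ev = np.arange(dist.size) @ dist
win_prob = dist[total_ev // 2 + 1:].sum()  # majority threshold (analog of 270)
print(mean_ev, win_prob)
```

The full 51-state product works the same way, just with a longer loop and a 539-entry array.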
Most statistical models predict that Obama will be re-elected with fairly high probability but the national polls are showing that the race is almost tied. This discrepancy is a puzzle. Silver’s hypothesis for why is here and Sam’s is here. One of the sources for error in polls is that they must predict who will vote. The 2008 election had a voter turnout of a little less than 62%. That means that an election can be easily won or lost based on turnout alone, which makes one wonder about democracy.
Nov 4: dead link is updated
In an effort to make published science less wrong, psychologist Brian Nosek and collaborators have started what is called the Open Science Framework. The idea is that all results from experiments can be openly documented for everyone to see. This way, negative results that are locked away in the proverbial “file drawer” will be available. In light of the fact that many high impact results turn out to be wrong (e.g. see here and here), we definitely needed to do something and I think this is a good start. You can hear Nosek talk about this on Econtalk here.
Different internal Bayesian likelihood functions may be why we disagree. The recent shooting tragedies in Colorado and Wisconsin have set off a new round of arguments about gun control. Following the debate has made me realize that the reason the two sides can’t agree (and why different sides of almost every controversial issue can’t agree) is that their likelihood functions are completely different. The optimal way to make inferences about the world is to use Bayesian inference and there is some evidence that we are close to optimal in some circumstances. Nontechnically, Bayesian inference is a way to update the strength of your belief in something (i.e. probability) given new data. What you do is combine your prior probability with the likelihood of the data given your internal model of the issue (and then normalize to get a posterior probability). For a more technical treatment of Bayesian inference, see here. I posted previously (see here) that I thought that drastic differences in prior probabilities are why people don’t seem to update their beliefs when faced with overwhelming evidence to the contrary. However, I’m starting to realize that the main reason might be that they have completely different models of the world, which in technical terms is their likelihood function.
Consider the issue of gun control. The anti-gun control side argues that “guns don’t kill people, people kill people” and that restricting access to guns won’t prevent determined malcontents from coming up with some means to kill. The pro-gun control side argues that the amount of gun violence is inversely proportional to the ease of access to guns. After all, you would be hard pressed to kill twenty people in a movie theatre with a sword. The difference in these two viewpoints can be summarized by their models of the world. The anti-gun control people believe that the distribution of the will of people who would commit violence looks like this
where the horizontal line represents a level of gun restriction. In this world view, no amount of gun restriction would prevent these people from undertaking their nefarious designs. On the other hand, the pro-gun control side believes that the same distribution looks like this
in which case, the higher you set the barrier the fewer the number of crimes committed. Given these two views of the world, it is clear why new episodes of gun violence like the recent ones in Colorado and Wisconsin do not change people’s minds. What you would need to do is to teach a new likelihood function to the other side and that may take decades if at all.
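The point can be made schematic: give two observers the same prior and the same datum but different likelihood functions, and Bayes’ rule moves their beliefs in different directions. All numbers below are invented for illustration:

```python
# Hypothesis H: "stricter gun laws reduce mass shootings."
# Datum D: a shooting occurred under a moderate level of restriction.
def posterior(prior, like_H, like_notH):
    """Bayes' rule for a binary hypothesis."""
    return prior * like_H / (prior * like_H + (1 - prior) * like_notH)

# Camp 1's model: determined attackers succeed regardless of restrictions,
# so D is equally likely under H and not-H; the datum is uninformative.
p1 = posterior(0.5, 0.10, 0.10)

# Camp 2's model: under H an attack at this restriction level is less likely,
# so observing one counts as evidence against H.
p2 = posterior(0.5, 0.05, 0.10)

print(p1, p2)  # camp 1 stays at 0.5; camp 2 drops to 1/3
```

Identical data, identical priors, different likelihoods: one posterior never moves and the other does, which is the disagreement in miniature.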
I was listening to physicist and science writer Leonard Mlodinow on an All in the Mind podcast this morning. He was talking about his new book, Subliminal, which is about recent neuroscience results on neural processes that operate in the absence of conscious awareness. During the podcast, which was quite good, Mlodinow quoted a result that said 95% of all professors think they are above average, and then he went on to say that we all know that only 50% can be. It’s unfortunate that Mlodinow, who wrote an excellent book on probability theory, would make such a statement. I am sure that he knows that only 50% of all professors can be better than the median, but nearly any fraction of them could be better than the average (i.e. mean). He used average in the colloquial sense, but knowing the difference between median and mean is crucial for the average, or should I say median, person to make informed decisions.
It could be that on a professor performance scale, 5% of all professors are phenomenally bad, while the rest are better but clumped together. In this case, it would be absolutely true that 95% of all professors are better than the mean. Also, if the professors actually obeyed such a distribution then comparing to the mean would be more informative than the median, because what every student should do is simply avoid the really bad professors. However, for something like income, which has a broad distribution with a fat tail, comparing yourself to the median is probably more informative because it will tell you where you place in society. The mean salary of a lecture hall filled with mathematicians would increase perhaps a hundredfold if James Simons (of Chern-Simons theory, and CEO of one of the most successful hedge funds, Renaissance Technologies) were to suddenly walk into the room. However, the median would hardly budge. All the children in Lake Wobegon really could be above average.
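The professor example is easy to make concrete. With a hypothetical score distribution in which 5% are phenomenally bad and the other 95% are clumped together:

```python
import numpy as np

# Hypothetical "professor performance" scores: 5 terrible, 95 clumped.
scores = np.array([0.0] * 5 + [75.0] * 95)

mean = scores.mean()          # dragged down by the bottom 5%
median = np.median(scores)    # sits in the clump
above_mean = (scores > mean).mean()
print(mean, median, above_mean)  # 71.25 75.0 0.95
```

Here 95% of professors really are above the mean, while exactly 50% can ever be above the median.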
Biophysical Journal, Volume 102, Issue 3, 399-406, 8 February 2012
William J. Heuett, Bernard V. Miller, Susan B. Racette, John O. Holloszy, Carson C. Chow, and Vipul Periwal
Abstract: Inference of the insulin secretion rate (ISR) from C-peptide measurements as a quantification of pancreatic β-cell function is clinically important in diseases related to reduced insulin sensitivity and insulin action. ISR derived from C-peptide concentration is an example of nonparametric Bayesian model selection where a proposed ISR time-course is considered to be a “model”. An inferred value of inaccessible continuous variables from discrete observable data is often problematic in biology and medicine, because it is a priori unclear how robust the inference is to the deletion of data points, and a closely related question, how much smoothness or continuity the data actually support. Predictions weighted by the posterior distribution can be cast as functional integrals as used in statistical field theory. Functional integrals are generally difficult to evaluate, especially for nonanalytic constraints such as positivity of the estimated parameters. We propose a computationally tractable method that uses the exact solution of an associated likelihood function as a prior probability distribution for a Markov-chain Monte Carlo evaluation of the posterior for the full model. As a concrete application of our method, we calculate the ISR from actual clinical C-peptide measurements in human subjects with varying degrees of insulin sensitivity. Our method demonstrates the feasibility of functional integral Bayesian model selection as a practical method for such data-driven inference, allowing the data to determine the smoothing timescale and the width of the prior probability distribution on the space of models. In particular, our model comparison method determines the discrete time-step for interpolation of the unobservable continuous variable that is supported by the data. Attempts to go to finer discrete time-steps lead to less likely models.
The New York Times has some interesting articles online right now. There is a series of interesting essays on the Future of Computing in the Science section and the philosophy blog The Stone has a very nice post by Alva Noe on Art and Neuroscience. I think Noe’s piece eloquently phrases several ideas that I have tried to get across recently, which is that while mind may arise exclusively from brain this doesn’t mean that looking at the brain alone will explain everything that the mind does. Neuroscience will not make psychology or art history obsolete. The reason is simply a matter of computational complexity or even more simply combinatorics. It goes back to Philip Anderson’s famous article More is Different (e.g. see here), where he argued that each field has its own set of fundamental laws and rules and thinking at a lower level isn’t always useful.
For example, suppose that what makes me enjoy or like a piece of art is set by a hundred or so on-off neural switches. Then there are $2^{100}$ different ways I could appreciate art. Now, I have no idea if a hundred is correct but suffice it to say that anything above 50 or so makes the number of combinations so large that it will take Moore’s law a long time to catch up and anything above 300 makes it virtually impossible to handle computationally in our universe with a classical computer. Thus, if art appreciation is sufficiently complex, meaning that it involves a few hundred or more neural parameters, then Big Data on the brain alone will not be sufficient to obtain insight into what makes a piece of art special. Some sort of reduced description would be necessary and that already exists in the form of art history. That is not to say that data mining how people respond to art may not provide some statistical information on what would constitute a masterpiece. After all, Netflix is pretty successful in predicting what movies you will like based on what you have liked before and what other people like. However, there will always be room for the art critic.
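A quick back-of-the-envelope check of the combinatorics (the $10^{80}$ atom count is the usual order-of-magnitude estimate, not a precise figure):

```python
# n binary neural "switches" give 2**n possible states, and each Moore's-law
# doubling of computing power only buys the capacity for one more switch.
n = 100
states = 2 ** n
print(states > 10 ** 30)   # True: over 10^30 states for 100 switches

# 300 switches exceed the ~10^80 atoms in the observable universe,
# so enumerating the states classically is hopeless.
print(2 ** 300 > 10 ** 80)  # True
```

Going from 50 switches to 100 takes 50 doublings, i.e. on the order of a century of Moore’s law at one doubling every two years.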
If I had to compress everything that ails us today into one word it would be correlations. Basically, everything bad that has happened recently from the financial crisis to political gridlock is due to undesired correlations. That is not to say that all correlations are bad. Obviously, a system without any correlations is simply noise. You would certainly want the activity on an assembly line in a factory to be correlated. Useful correlations are usually serial in nature like an invention leads to a new company. Bad correlations are mostly parallel like all the members in Congress voting exclusively along party lines, which reduces an assembly with hundreds of people into just two. A recession is caused when everyone in the economy suddenly decides to decrease spending all at once. In a healthy economy, people would be uncorrelated so some would spend more when others spend less and the aggregate demand would be about constant. When people’s spending habits are tightly correlated and everyone decides to save more at the same time then there would be less demand for goods and services in the economy so companies must lay people off resulting in even less demand leading to a vicious cycle.
The financial crisis that triggered the recession was due to the collapse of the housing bubble, another unwanted correlated event. This was exacerbated by collateralized debt obligations (CDOs), which are financial instruments that were doomed by unwanted correlations. In case you haven’t followed the crisis, here’s a simple explanation. Say you have a set of loans where you think the default rate is 50%. Hence, given a hundred mortgages, you know fifty will fail but you don’t know which. The way to make a triple A bond out of these risky mortgages is to lump them together and divide the lump into tranches that have different seniority (i.e. get paid off sequentially). So the most senior tranche will be paid off first and have the highest bond rating. If fifty of the hundred loans go bad, the senior tranche will still get paid. This is great as long as the mortgages are only weakly correlated and you know what that correlation is. However, if the mortgages fail together then all the tranches will be bad. This is what happened when the bubble collapsed. Correlations in how people responded to the collapse made it even worse. When some CDOs started to fail, people panicked collectively and didn’t trust any CDOs even though some of them were still okay. The market for CDOs became frozen so people who had them and wanted to sell them couldn’t even at a discount. This is why the federal government stepped in. The bail out was deemed necessary because of bad correlations. Just between you and me, I would have let all the banks just fail.
We can quantify the effect of correlations in a simple example, which will also show the difference between sample mean and population mean. Let’s say you have some variable $x_i$ that estimates some quantity. The expectation value (population mean) is $E[x_i] = \mu$. The variance of $x_i$, $\sigma^2 = E[x_i^2] - \mu^2$, gives an estimate of the square of the error. If you wanted to decrease the error of the estimate then you can take more measurements. So let’s consider a sample of $N$ measurements. The sample mean is $\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i$. The expectation value of the sample mean is $E[\bar{x}] = \mu$. The variance of the sample mean is

$$\mathrm{Var}(\bar{x}) = E[\bar{x}^2] - \mu^2 = \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \left( E[x_i x_j] - \mu^2 \right)$$

Let $C = E[x_i x_j] - \mu^2$, for $i \neq j$, be the correlation between two measurements. Hence, $E[x_i x_j] - \mu^2 = \sigma^2 \delta_{ij} + C (1 - \delta_{ij})$. The variance of the sample mean is thus

$$\mathrm{Var}(\bar{x}) = \frac{\sigma^2}{N} + \left( 1 - \frac{1}{N} \right) C$$

If the measurements are uncorrelated ($C = 0$) then the variance is $\sigma^2 / N$, i.e. the standard deviation or error is decreased by the square root of the number of samples. However, if there are nonzero correlations then the error can only be reduced to the amount of correlations $\sqrt{C}$, no matter how many measurements you take. Thus, correlations give a lower bound on the error of any estimate. Another way to think about this is that correlations reduce entropy, and lower entropy means less information. One way to cure our current problems is to destroy parallel correlations.
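A quick simulation illustrates the variance floor, using an equicorrelated Gaussian model (an assumption for illustration; with $\sigma^2 = 1$ the pairwise covariance is just the correlation $\rho$):

```python
import numpy as np

rng = np.random.default_rng(1)

def var_of_sample_mean(n, rho, trials=20000):
    """Monte Carlo variance of the mean of n unit-variance measurements
    with pairwise correlation rho (equicorrelated Gaussian construction)."""
    shared = rng.normal(size=(trials, 1))   # component common to all n
    noise = rng.normal(size=(trials, n))    # independent components
    x = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise
    return x.mean(axis=1).var()

n = 100
v_uncorr = var_of_sample_mean(n, 0.0)  # theory: 1/n = 0.01
v_corr = var_of_sample_mean(n, 0.2)    # theory: rho + (1 - rho)/n = 0.208
print(v_uncorr, v_corr)
```

With no correlation the variance falls like $1/N$; with $\rho = 0.2$ it is pinned near 0.2 no matter how large $N$ gets.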
A new paper on the evolution of overconfidence (arXiv:0909.4043v2) will appear shortly in Nature. (Hat tip to J.L. Delatre). It is well known in psychology that people generally overvalue themselves and it has always been a puzzle as to why. This paper argues that under certain plausible conditions, it may have been evolutionarily advantageous to be overconfident. One of the authors is James Fowler who has garnered recent fame for claiming with Nicholas Christakis that medically noninfectious phenomena such as obesity and divorce are socially contagious. I have always been skeptical of these social network results and it seems like there has been some recent push back. Statistician and blogger Andrew Gelman has a summary of the critiques here. The problem with these papers falls in line with the same problems of many other clinical papers that I have posted on before (e.g. see here and here). The evolution of overconfidence paper does not rely on statistics but on a simple evolutionary model.
The model considers competition between two parties for some scarce resource. Each party possesses some heritable attribute and the one with the higher value of that attribute will win a contest and obtain the resource. The model allows for three outcomes in any interaction: 1) winning a competition and obtaining the resource with value W-C (where C is the cost of competing), 2) claiming the resource without a fight with value W, and 3) losing a competition with a value -C. The parties assess their own and their opponent’s attributes before deciding to compete. If both parties had perfect information, participating in a contest would be unnecessary. Both parties would realize who would win and the stronger of the two would claim the prize. However, because of errors and biases in assessing attributes, resources will be contested. Overconfidence is represented as a positive bias in assessing oneself. The authors chose a model that was simple enough to explicitly evaluate the outcomes of all possible situations and show that when the reward for winning is sufficiently large compared to the cost, then overconfidence is evolutionarily stable.
Here I will present a simpler toy model of why the result is plausible. Let P be the probability that a given party will win a competition on average and let Q be the probability that they will engage in a competition. Hence, Q is a measure of overconfidence. Suppose both parties engage with probability Q and a fight only occurs when both engage. Using these values, we can then compute the expectation value of an interaction:

$$E = Q^2 P (W - C) + Q (1 - Q) W - Q^2 (1 - P) C$$

(i.e. the probability of a competition and winning is $Q^2 P$, the probability of winning and not having to fight is $Q(1-Q)$, the probability of losing a competition is $Q^2 (1-P)$, and it doesn’t cost anything to not compete.) The derivative of E with respect to Q is

$$\frac{dE}{dQ} = 2QP(W - C) + (1 - 2Q)W - 2Q(1 - P)C = W - 2Q\left[ (1 - P)W + C \right]$$

Hence, we see that if $(2P - 1)W > 2C$, i.e. the reward of winning sufficiently exceeds the cost of competing, then the derivative is positive for all Q between 0 and 1 and the expectation value is guaranteed to increase with increasing confidence. Of course this simple demonstration doesn’t prove that overconfidence is a stable strategy but it does affirm Woody Allen’s observation that “95% of life is just showing up.”
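A sanity check of the toy model under one self-consistent reading (the assumption, not from the paper, is that a fight only occurs when both parties engage, each with probability Q):

```python
# Expected payoff when both parties show up with probability Q and a fight
# happens only if both engage. P, W, C as in the toy model above.
def expected_payoff(Q, P, W, C):
    return (Q * Q * P * (W - C)      # compete and win
            + Q * (1 - Q) * W        # claim the resource unopposed
            - Q * Q * (1 - P) * C)   # compete and lose

# dE/dQ = W - 2*Q*((1 - P)*W + C); when (2P - 1)*W > 2*C it stays positive
# over the whole range of Q, so more confidence always pays.
P, W, C = 0.7, 10.0, 1.0  # illustrative values: (2P-1)*W = 4 > 2*C = 2
Qs = [i / 100 for i in range(101)]
payoffs = [expected_payoff(q, P, W, C) for q in Qs]
print(payoffs == sorted(payoffs))  # monotonically increasing in Q
```

With these numbers $E(Q) = 10Q - 4Q^2$, which rises over the whole interval and is maximized by showing up every time.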
J Clin Endocrinol Metab. 2011 May 18. [Epub ahead of print] Higher Acute Insulin Response to Glucose May Determine Greater Free Fatty Acid Clearance in African-American Women. Chow CC, Periwal V, Csako G, Ricks M, Courville AB, Miller BV 3rd, Vega GL, Sumner AE. Laboratory of Biological Modeling (C.C.C., V.P.), National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892; Departments of Laboratory Medicine (G.C.) and Nutrition (A.B.C.), Clinical Center, National Institutes of Health, and Clinical Endocrinology Branch (M.R., B.V.M., A.E.S.), National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892; and Center for Human Nutrition (G.L.V.), University of Texas Southwestern Medical Center at Dallas, Dallas, Texas 75235. Context: Obesity and diabetes are more common in African-Americans than whites. Because free fatty acids (FFA) participate in the development of these conditions, studying race differences in the regulation of FFA and glucose by insulin is essential. Objective: The objective of the study was to determine whether race differences exist in glucose and FFA response to insulin. Design: This was a cross-sectional study. Setting: The study was conducted at a clinical research center. Participants: Thirty-four premenopausal women (17 African-Americans, 17 whites) matched for age [36 ± 10 yr (mean ± sd)] and body mass index (30.0 ± 6.7 kg/m(2)). Interventions: Insulin-modified frequently sampled iv glucose tolerance tests were performed with data analyzed by separate minimal models for glucose and FFA. Main Outcome Measures: Glucose measures were insulin sensitivity index (S(I)) and acute insulin response to glucose (AIRg). FFA measures were FFA clearance rate (c(f)). Results: Body mass index was similar but fat mass was higher in African-Americans than whites (P < 0.01). 
Compared with whites, African-Americans had lower S(I) (3.71 ± 1.55 vs. 5.23 ± 2.74 [×10(-4) min(-1)/(microunits per milliliter)] (P = 0.05) and higher AIRg (642 ± 379 vs. 263 ± 206 mU/liter(-1) · min, P < 0.01). Adjusting for fat mass, African-Americans had higher FFA clearance, c(f) (0.13 ± 0.06 vs. 0.08 ± 0.05 min(-1), P < 0.01). After adjusting for AIRg, the race difference in c(f) was no longer present (P = 0.51). For all women, the relationship between c(f) and AIRg was significant (r = 0.64, P < 0.01), but the relationship between c(f) and S(I) was not (r = -0.07, P = 0.71). The same pattern persisted when the two groups were studied separately. Conclusion: African-American women were more insulin resistant than white women, yet they had greater FFA clearance. Acutely higher insulin concentrations in African-American women accounted for higher FFA clearance. PMID: 21593106 [PubMed - as supplied by publisher]
Am J Clin Nutr. 2011 Jul;94(1):66-74. Epub 2011 May 11. Estimating changes in free-living energy intake and its confidence interval. Hall KD, Chow CC. Laboratory of Biological Modeling, National Institute of Diabetes and Digestive and Kidney Diseases, Bethesda, MD. Background: Free-living energy intake in humans is notoriously difficult to measure but is required to properly assess outpatient weight-control interventions. Objective: Our objective was to develop a simple methodology that uses longitudinal body weight measurements to estimate changes in energy intake and its 95% CI in individual subjects. Design: We showed how an energy balance equation with 2 parameters can be derived from any mathematical model of human metabolism. We solved the energy balance equation for changes in free-living energy intake as a function of body weight and its rate of change. We tested the predicted changes in energy intake by using weight-loss data from controlled inpatient feeding studies as well as simulated free-living data from a group of "virtual study subjects" that included realistic fluctuations in body water and day-to-day variations in energy intake. Results: Our method accurately predicted individual energy intake changes with the use of weight-loss data from controlled inpatient feeding experiments. By applying the method to our simulated free-living virtual study subjects, we showed that daily weight measurements over periods >28 d were required to obtain accurate estimates of energy intake change with a 95% CI of <300 kcal/d. These estimates were relatively insensitive to initial body composition or physical activity level. Conclusions: Frequent measurements of body weight over extended time periods are required to precisely estimate changes in energy intake in free-living individuals. Such measurements are feasible, relatively inexpensive, and can be used to estimate diet adherence during clinical weight-management programs. 
PMCID: PMC3127505 [Available on 2012/7/1] PMID: 21562087 [PubMed - in process]
Noted sociologist Richard Florida has an opinion piece in the Financial Times and The Atlantic (see here) about how immigration may be responsible for the recent decline in violent crime in cities. Many explanations have been given for why crime has decreased since the nineteen nineties. The bestselling book Freakonomics suggested that the decline was because of legalized abortion, which meant fewer unwanted children who would go on to be criminals. Florida shows that there is a strong negative correlation between the presence of large immigrant communities and the crime rate. Again, like all epidemiological results, this correlation may or may not be significant, much less have causal value. That is not to say that it is not correct. Immigrant neighborhoods may have a greater sense of a small town community that discourages crime, but if the opposite correlation were found an equally plausible just-so story could be concocted. I think the crucial point about this result, along with all other explanations of complex phenomena, is that we are drawn towards single universal explanations. I am drawn to them all the time, even though there may not be a reason why there should even be an explanation in a few hundred bits. The opposite view would be that a single explanation is implausible. After all, something like crime involves millions upon millions of degrees of freedom, so why should it be compressible to a few hundred bits? However, if the phenomenon is consistent across many cities and regions then maybe a global explanation is in order. The variance around the mean is also important. If the variability is low then a universal explanation carries more weight. I think that we tend to either embrace reduced descriptions or reject them outright based on the ideas presented. However, in many cases, a careful examination of the data may at least tell us whether a reduced description is warranted.
The Snowbird meeting finishes today. I think it has been highly successful with over 800 participants. Last night, I was on the “forward looking panel” moderated by Alan Champneys and one of the questions asked was what defines nonlinear dynamics and this meeting. I gave a rather flip answer about how we are now in the age of machine learning and statistics and this meeting is everything in applied math that is not that. Of course that is not true given that data assimilation was a major part of this meeting and Sara Solla gave a brilliant talk on applying the generalized linear model to neural data to estimate functional connectivity in the underlying cortical circuit.
Given some time to reflect on that question, I think the common theme of Snowbird is the concept of taking a complicated system and reducing it to something simpler that can be analyzed. What we do is create nontrivial models that can be treated mathematically. This is distinctly different from other branches of applied math like optimization and numerical methods. However, one difference between previous meetings and now is that the main tools for analyzing these reduced systems used to be methods of dynamical systems such as geometric singular perturbation theory (e.g. see here) and bifurcation theory. Today, a much wider range of methods is being utilized.
Another question posed was whether there was too much biology at this meeting. I said yes because I thought there were too many parallel sessions. Although I said it partly tongue in cheek, I think there are both good and bad things about biology being heavily represented. It is good that biology is doing well and attracting lots of people, but it would be a bad thing if the meeting becomes so large that it devolves into multiple concurrent meetings where people only go to the talks that are directly related to what they already know. In a meeting with fewer parallel sessions one has more chance to learn something new and see something unexpected. I really have no idea what should be done about this, if anything at all.
Finally, a question about how data will be relevant to our field was posed from the audience. My answer was that the big trend right now was in massive data mining but I thought that it had overpromised and would eventually fail to deliver. Eventually, dynamical systems methods will be required to help reduce and analyze the data. However, I do want to add that data will play a bigger and bigger role in dynamical systems research. In the past, we mostly strived to just qualitatively match experiments but now the data has improved to the point that we can try to quantitatively match it. This will require using statistics. Readers of this blog will know that I have been an advocate of using Bayesian methods. I really believe that the marriage of dynamical systems and statistics will have great impact. Statistics is about fitting models to data but the models used are rather simple and generally not motivated by mechanisms. Our field is about producing models based on the underlying mechanisms. It will be a perfect fit.
A recently published paper in the Journal of the National Cancer Institute reports that heavy coffee consumption can reduce the risk of prostate cancer. This story made the rounds in the popular press, as would be expected. The paper is based on a longitudinal study of 47,911 health professionals. What the authors found was that men who consumed more than six cups of coffee per day lowered their risk of developing prostate cancer by 18%. While this sounds impressive, one must weigh it against the fact that the probability of getting prostate cancer was around 10% over the 20 years of the study. So six or more cups of coffee per day lowered the risk from about 10% to about 8%. A reduction, yes, but probably not enough to start drinking massive amounts of coffee for this purpose alone. Stated another way, the risk was reduced from 529 cancers per 100,000 person-years to 425. Now, the study also found that the reduction in risk for lethal prostate cancer was around fifty percent. However, the baseline risk of lethal prostate cancer is much lower, so there the risk drops from 79 cancers per 100,000 person-years to 34.
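The relative versus absolute risk arithmetic above is worth checking directly. Here is a minimal sketch; the rates are the ones quoted in the post, and the helper function is my own illustration, not anything from the paper:

```python
# Absolute effect of a relative risk reduction, using the coffee-study
# numbers quoted in the post. The helper is illustrative arithmetic only.

def absolute_drop(base_rate, relative_reduction):
    """Return the new rate after applying a relative risk reduction."""
    return base_rate * (1.0 - relative_reduction)

# Overall prostate cancer: an 18% relative reduction on a ~10% baseline
print(absolute_drop(0.10, 0.18))   # about 0.082, i.e. 10% falls to roughly 8%

# Per 100,000 person-years: 529 cases with an 18% reduction
print(absolute_drop(529, 0.18))    # about 434, close to the 425 reported

# Lethal prostate cancer: a ~50% relative reduction on 79 cases
print(absolute_drop(79, 0.5))      # 39.5, in the ballpark of the 34 reported
```

The point is that a headline-friendly relative reduction can correspond to a modest absolute one when the baseline risk is small.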
Now, the problem with these epidemiological studies is that there are so many confounders; although the authors were extremely careful in trying to account for them, they are still dealing with very uncertain data. Previous studies of coffee consumption showed no effect on prostate cancer risk. There is also the problem of multiple comparisons: I'm sure the authors looked at risks for all sorts of diseases, and this one turned out to be statistically significant. As I have posted before (see here and here), many if not most high-profile clinical results turn out to be wrong, and for good systematic reasons. Now, it is biologically plausible that coffee could have some effect in reducing cancer; it contains lots of bioactive molecules and antioxidants, and we should test these directly. My take is that until there is a solid biophysical explanation for a clinical effect, the jury is always out.
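The multiple-comparisons point is easy to see in simulation. Here is a toy sketch of my own (a simple z-test on null data, not the study's actual analysis): if you test many outcomes where nothing is going on, a few will still clear p < 0.01 by chance alone.

```python
import math
import random

random.seed(0)

def null_pvalue(n=50):
    """Two-sided z-test p-value for the mean of n draws from N(0, 1),
    where the null hypothesis (mean = 0) is true by construction."""
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = sum(xs) / n * math.sqrt(n)           # sample mean scaled by sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability

# Test 1,000 hypothetical "outcomes" against null data.
trials = 1000
hits = sum(null_pvalue() < 0.01 for _ in range(trials))
print(hits)  # on the order of 10: roughly 1% of null tests come up "significant"
```

With a thousand candidate associations in hand, a handful of spurious "discoveries" at the 1% level is exactly what one should expect.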
Any doubts that computers can do natural language processing ended dramatically yesterday as IBM’s Watson computer defeated the world’s two best players in the TV quiz show Jeopardy. Although the task was constrained, it clearly shows that it won’t be too long before we’ll have computers that can understand most of what we say. This Nova episode gives a nice summary of the project. A description of the strategy and algorithms used by the program’s designers can be found here.
I think there are two lessons to be learned from Watson. The first is that machine learning will lead the way towards strong AI. (Sorry, Robin Hanson, it won't be brain emulation.) Although the designers incorporated some hard-coded algorithms, the engine behind Watson was supervised learning from examples. The second lesson is that we may already have all the algorithms we need to get there. The Watson team didn't have to invent any dramatically new algorithms; what was novel was the way they integrated many existing ones. This is analogous to what I called the Hopfield Hypothesis: we may already know enough biology to understand how the brain works. What we don't yet understand is how these elements combine.
Addendum: Here is a YouTube link for the show last night.
I can never remember the form of distributions, so here is a cheat sheet for the density functions of some commonly used ones.
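Since the cheat sheet itself does not reproduce well here, the following is a minimal stand-in in pure Python. The particular distributions are my own picks of commonly used ones; the original post's list may differ:

```python
import math

# Standard density / mass functions written out explicitly as a reference.

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def exponential_pdf(x, lam=1.0):
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def gamma_pdf(x, k, theta):
    # shape k, scale theta
    return x ** (k - 1) * math.exp(-x / theta) / (math.gamma(k) * theta ** k) if x > 0 else 0.0

def beta_pdf(x, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B if 0 < x < 1 else 0.0

def poisson_pmf(n, lam):
    return lam ** n * math.exp(-lam) / math.factorial(n)

print(round(normal_pdf(0.0), 4))  # 1/sqrt(2*pi), about 0.3989
```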
The most recent dust-up over statistical significance and being wrong concerns the publication of a paper on ESP in a top psychology journal. Here's a link to a paper criticizing the result using Bayesian analysis. Below is an excerpt from the journal Science.
Science: The decision by a top psychology journal to publish a paper on extrasensory perception (ESP) has sparked a lively discussion on blogs and in the mainstream media. The paper’s author, Daryl Bem, a respected social psychologist and professor emeritus at Cornell University, argues that the results of nine experiments he conducted with more than 1000 college students provide statistically significant evidence of an ability to predict future events. Not surprisingly, word that the paper will appear in an upcoming issue of the Journal of Personality and Social Psychology (JPSP) has provoked outrage from pseudoscience debunkers and counteraccusations of closed-mindedness from those willing to consider the possibility of psychic powers.
It has also rekindled a long-running debate about whether the statistical tools commonly used in psychology—and most other areas of science—too often lead researchers astray. “The real lesson to be learned from this is not that ESP exists, it’s that the methods we’re using aren’t protecting us against spurious results,” says David Krantz, a statistician at Columbia University…
…But statisticians have long argued that there are problems lurking in the weeds when it comes to standard statistical methods like the t test. What scientists generally want to know is the probability that a given hypothesis is true given the data they’ve observed. But that’s not what a p-value tells them, says Adrian Raftery, a statistician at the University of Washington, Seattle. In Bem’s erotic photo experiment, the p-value of less than .01 means, by definition, that there’s less than a 1% chance he would have observed these data—or data pointing to an even stronger ESP effect—if ESP does not exist. “Some people would turn that around and say there’s a 99% chance there’s something going on, but that’s wrong,” Raftery says.
Not only does this type of thinking reflect a misunderstanding of what a p-value is, but it also overestimates the probability that an effect is real, Raftery and other statisticians say. Work by Raftery, for example, suggests that p-values in the .001 to .01 range reflect a true effect only 86% to 92% of the time. The problem is more acute for larger samples, which can give rise to a small p-value even when the effect is negligible for practical purposes, Raftery says.
He and others champion a different approach based on so-called Bayesian statistics. Based on a theory developed by Thomas Bayes, an 18th century English minister, these methods are designed to determine the probability that a hypothesis is true given the data a researcher has observed. It’s a more intuitive approach that’s conceptually more in line with the goals of scientists, say its advocates. Also, unlike the standard approach, which assumes that each new experiment takes place in a vacuum, Bayesian statistics takes prior knowledge into consideration.
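The gap between "p < .01" and "99% chance the effect is real" is easy to demonstrate with a toy simulation. The sketch below is my own construction under assumed numbers (a 50% prior probability of a real effect, a modest effect size), not Raftery's actual calculation:

```python
import math
import random

random.seed(1)

def pvalue(effect, n=50):
    """Two-sided z-test p-value for the mean of n draws from N(effect, 1)."""
    xs = [random.gauss(effect, 1.0) for _ in range(n)]
    z = sum(xs) / n * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

# Suppose half of all tested hypotheses are real effects of modest size.
# Among results reaching p < 0.01, what fraction reflect a true effect?
true_hits = false_hits = 0
for _ in range(5000):
    if random.random() < 0.5:
        true_hits += pvalue(0.3) < 0.01   # a real but modest effect
    else:
        false_hits += pvalue(0.0) < 0.01  # no effect at all

frac = true_hits / (true_hits + false_hits)
print(frac)  # well below the naive 99%
```

Even under these fairly generous assumptions, the fraction of "significant" results that are real falls short of the 99% that the p-value seems to promise, and it drops further as the prior probability of a real effect shrinks.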