Science writer Jonah Lehrer has a nice article in the New Yorker on the “Decline Effect”, where the significance of many published results, mostly but not exclusively in clinical and psychological fields, tends to decline with time or disappear entirely. The article cites the work of John Ioannidis, which I summarized here, on why most published results are false. The article posits several explanations that basically boil down to selection bias and multiple comparisons.

I believe this problem stems from the fact that the reward structure in science is biased towards positive results: the more surprising, and hence unlikely, the result, the greater the impact and attention. Generally, a result is deemed statistically significant if the probability of that result arising by random chance (technically, if the null hypothesis were true) is less than 5%. By this criterion alone, up to 5% of published results could be basically noise. However, the actual fraction is much higher because of selection bias and multiple comparisons.
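The 5% threshold is easy to see in a quick simulation. Here is a minimal sketch (my own illustration, not from the article): run many "studies" in which the null hypothesis is true by construction, and count how often a simple two-sample z-test crosses p &lt; 0.05 anyway.

```python
import random
import math

random.seed(1)

def z_test_p(n=50):
    """Two-sided z-test p-value comparing two samples drawn from the
    SAME unit-normal distribution, so the null hypothesis is true."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    mean_diff = sum(a) / n - sum(b) / n
    se = math.sqrt(2.0 / n)  # standard error with known unit variance
    z = abs(mean_diff) / se
    # two-sided p-value from the normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

trials = 10000
false_positives = sum(z_test_p() < 0.05 for _ in range(trials))
print(false_positives / trials)  # close to 0.05 by construction
```

Roughly one run in twenty comes out "significant" even though there is nothing to find, which is exactly what the threshold promises.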

In order to estimate significance, a sample set must be defined, and it is quite easy for conscious or unconscious selection bias to arise when deciding what data to include. For example, in clinical trials, some subjects or data points will be excluded for various reasons, and this can tilt the outcome towards a positive result. Multiple comparisons may be even harder to avoid. For every published result of an investigator, there are countless "failed" results. For example, suppose you want to show pesticides cause cancer and you test different pesticides until one shows an effect. Most likely you will assess significance only for that particular pesticide, not for all the others that didn't show an effect. In some sense, to be truly fair, one should include all experiments that one has ever conducted when assessing significance, with the odd implication that the criterion for significance becomes more stringent as you age.
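The pesticide screen above has a simple arithmetic behind it. If you test k independent null hypotheses each at alpha = 0.05, the chance that at least one comes up "significant" is 1 - (1 - alpha)^k, which grows quickly with k (the pesticide framing is just the hypothetical from the text):

```python
# Family-wise false-positive rate when screening k independent
# true-null hypotheses, each tested at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 10, 20, 50):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:3d} tests -> chance of at least one 'significant' hit: {p_any:.2f}")
```

By twenty pesticides you are more likely than not to have a publishable "effect"; by fifty it is nearly guaranteed.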

This is a tricky thing. Suppose you do a study and measure ten things but you blind yourself to the results. You then pick one of the items and test for significance. In this case, have you done ten measurements or one? The criterion for significance will be much more stringent if it is the former. However, there are probably a hundred other things you could have measured but didn’t. Should significance be tested against a null hypothesis that includes all potential events including those you didn’t measure?
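To put a number on "much more stringent": the standard corrections for a pre-registered family of m tests shrink the per-test threshold accordingly. A Bonferroni correction tests each of the ten measurements at 0.05/10; the Šidák version, exact for independent tests, is nearly identical. (This is a standard textbook correction, not something proposed in the article.)

```python
# Per-test significance threshold needed to keep the family-wise
# error rate at 0.05 across m = 10 pre-registered measurements.
alpha, m = 0.05, 10
bonferroni = alpha / m                      # simple, conservative bound
sidak = 1 - (1 - alpha) ** (1 / m)          # exact for independent tests
print(bonferroni)        # 0.005
print(round(sidak, 5))   # about 0.00512
```

So picking one measurement out of ten and testing it at 0.05 is roughly ten times too lenient, and the question in the paragraph above, whether the family should also include everything you could have measured, has no such tidy answer.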

I think the only way this problem will be solved is with an overhaul of the way science is practiced. First of all, negative results must be taken as seriously as positive ones. In some sense, the results of all experiments need to be published, or at least made public in some database. Second, the concept of statistical significance needs to be abolished. There cannot be some artificial dividing line between significance and nonsignificance. Adopting a Bayesian approach would help: people would just report the probability that the result is true, given some prior and likelihood. In fact, the p-values used for assessing significance could be converted to Bayesian probabilities fairly easily. However, I doubt very much these proposals will be adopted any time soon.
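As one illustration of such a conversion (my own addition, not the post's prescription): Sellke, Bayarri and Berger (2001) showed that for p &lt; 1/e, the Bayes factor in favor of the null hypothesis is bounded below by -e * p * ln(p). Combined with a prior, this turns a p-value into a lower bound on the posterior probability that the null is true:

```python
import math

def min_posterior_null(p, prior_null=0.5):
    """Lower bound on P(null | data) from a p-value, using the
    Sellke-Bayarri-Berger minimum Bayes factor -e * p * ln(p).
    Valid for p < 1/e; prior_null is the prior P(null)."""
    assert 0 < p < 1 / math.e
    bf = -math.e * p * math.log(p)              # minimum Bayes factor for the null
    odds = bf * prior_null / (1 - prior_null)   # posterior odds of the null
    return odds / (1 + odds)

for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: P(null | data) >= {min_posterior_null(p):.3f}")
```

Under a 50/50 prior, a p-value of 0.05 still leaves at least a 29% chance that the null is true, which is a long way from the "1 in 20" that significance thresholds suggest to most readers.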