Science writer Jonah Lehrer has a nice article in the New Yorker on the “Decline Effect”, where the significance of many published results, mostly but not exclusively in clinical and psychological fields, tends to decline with time or disappear entirely. The article cites the work of John Ioannidis, which I summarized here, on why most published results are false. The article posits several explanations that basically boil down to selection bias and multiple comparisons.
I believe this problem stems from the fact that the reward structures in science are biased towards positive results: the more surprising, and hence unlikely, the result, the greater the impact and attention. Generally, a result is deemed statistically significant if the probability of it arising by random chance (technically, if the null hypothesis were true) is less than 5%. By this criterion alone, something like 5% of published results are basically noise. However, the actual fraction is much higher because of selection bias and multiple comparisons.
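To make the arithmetic concrete, here is a minimal sketch (my own illustration, not from the article) of how a 5% threshold behaves when there is no real effect at all. The sample sizes and the choice of a t-test are arbitrary assumptions.

```python
# A minimal sketch of why a 5% significance threshold guarantees a baseline
# of false positives. Assumes numpy and scipy; numbers are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000   # hypothetical experiments where the null is TRUE
n_per_group = 30

false_positives = 0
for _ in range(n_experiments):
    # Two groups drawn from the SAME distribution: any "effect" is pure noise.
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"Fraction significant despite no real effect: {false_positives / n_experiments:.3f}")
# Prints roughly 0.05: the threshold itself sets the floor on how much
# published "signal" is actually noise.
```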
In order to estimate significance, a sample set must be defined, and it is quite easy for conscious or unconscious selection bias to arise when deciding what data to include. For example, in clinical trials, some subjects or data points will be excluded for various reasons, and this can bias the study towards a positive result. Multiple comparisons may be even harder to avoid. For every published result of an investigator, there are countless “failed” results. For example, suppose you want to show that pesticides cause cancer and you test different pesticides until one shows an effect. Most likely you will assess significance only for that particular pesticide, not for all the others that didn’t show an effect. In some sense, to be truly fair, one should include all experiments one has ever conducted when assessing significance, with the odd implication that the criterion for significance becomes more stringent as you age.
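The pesticide example can be simulated directly. The sketch below is my own illustration with made-up numbers: it screens 20 hypothetical pesticides that have no effect whatsoever and records how often at least one of them clears the 5% bar anyway.

```python
# A minimal sketch of the multiple-comparisons problem described above:
# screen many null "pesticides" and report only the winner.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_pesticides = 20        # candidate exposures, none with a real effect
n_per_group = 50
n_screens = 2_000        # repeat the whole screening exercise many times

at_least_one_hit = 0
for _ in range(n_screens):
    p_values = []
    for _ in range(n_pesticides):
        exposed = rng.normal(0, 1, n_per_group)   # no true difference
        control = rng.normal(0, 1, n_per_group)
        _, p = stats.ttest_ind(exposed, control)
        p_values.append(p)
    if min(p_values) < 0.05:
        at_least_one_hit += 1

print(f"Screens yielding a 'significant' pesticide: {at_least_one_hit / n_screens:.2f}")
print(f"Analytic value 1 - 0.95**20 = {1 - 0.95**20:.2f}")   # about 0.64
```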
This is a tricky thing. Suppose you do a study and measure ten things but blind yourself to the results. You then pick one of the items and test it for significance. In this case, have you done ten measurements or one? The criterion for significance will be much more stringent if it is the former. However, there are probably a hundred other things you could have measured but didn’t. Should significance be tested against a null hypothesis that includes all potential measurements, including those you never made?
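One conventional, if crude, answer is a family-wise correction such as Bonferroni, which simply divides the threshold by the number of comparisons. The post doesn’t endorse any particular correction, so take this only as an illustration of how quickly the criterion tightens.

```python
# A minimal sketch of how a family-wise correction (Bonferroni, the simplest
# choice) tightens the per-test threshold as the number of comparisons grows.
family_wise_alpha = 0.05

# one test, the ten measured items, or everything you could have measured
for m in (1, 10, 100):
    per_test_alpha = family_wise_alpha / m
    print(f"{m:>3} comparisons -> each test must reach p < {per_test_alpha:.4f}")

# 1 comparison    -> p < 0.05
# 10 comparisons  -> p < 0.005
# 100 comparisons -> p < 0.0005
```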
I think the only way this problem will be solved is with an overhaul of the way science is practiced. First, negative results must be taken as seriously as positive ones. In some sense, the results of all experiments need to be published, or at least made public in some database. Second, the concept of statistical significance needs to be abolished. There cannot be some artificial dividing line between significance and nonsignificance. Adopting a Bayesian approach would help. People would simply report the probability that the result is true, given some prior and likelihood. In fact, the P values used for assessing significance could easily be converted to Bayesian probabilities. However, I doubt very much that these proposals will be adopted any time soon.
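One standard recipe for such a conversion (not necessarily the one intended here) is the minimum Bayes factor bound of Sellke, Bayarri and Berger, which says the data can favor the alternative over the null by at most a factor of 1/(-e p ln p). A short sketch, assuming an even prior on the null:

```python
# A minimal sketch of one way to turn a P value into a Bayesian probability,
# using the minimum Bayes factor bound -e*p*ln(p) (Sellke, Bayarri & Berger).
import math

def posterior_null(p_value, prior_null=0.5):
    """Lower bound on P(null | data) given a P value and a prior on the null."""
    assert 0 < p_value < 1 / math.e, "bound only valid for p < 1/e"
    bf_null = -math.e * p_value * math.log(p_value)   # minimum Bayes factor for the null
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bf_null
    return posterior_odds / (1 + posterior_odds)

for p in (0.05, 0.01, 0.001):
    print(f"p = {p:<6} ->  P(null true | data) >= {posterior_null(p):.2f}")
# With a 50/50 prior, p = 0.05 still leaves at least ~0.29 probability
# that the null hypothesis is true.
```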