# Guest Post: On Proving Too Much in Scientific Data Analysis

I asked Rick Gerkin to write a summary of his recent eLife paper commenting on a much-hyped Science paper on how many odours we can discriminate.

On Proving Too Much in Scientific Data Analysis

by Richard C. Gerkin

Last year, Science published a paper by a group at Rockefeller University claiming that humans can discriminate at least a trillion smells. This was remarkable and exciting because, as the authors noted, there are far fewer than a trillion mutually discriminable colors or pure tones, and yet olfaction has been commonly believed to be much duller than vision or audition, at least in humans. Could it in fact be much sharper than the other senses?

After the paper came out in Science, two rebuttals were published in eLife. The first was by Markus Meister, an olfaction and vision researcher and computational neuroscientist at Caltech. My colleague Jason Castro and I had a separate rebuttal. The original authors have also posted a re-rebuttal of our two papers (mostly of Meister's paper), which has not yet been peer reviewed. Here I'll discuss the source of the original claim, and the logical underpinnings of the counterclaims that Meister, Castro, and I have made.

How did the original authors support their claim in the Science paper? Proving this claim by brute force would have been impractical, so the authors selected a representative set of 128 odorous molecules and then tested a few hundred random 30-component mixtures of those molecules. Since many mixture stimuli can be constructed in this way but only a small fraction can be practically tested, they tried to extrapolate their experimental results to the larger space of possible mixtures. They relied on a statistical transformation of the data, followed by a theorem from the mathematics of error-correcting codes, to estimate — from the data they collected — a lower bound on the actual number of discriminable olfactory stimuli.

The two rebuttals in eLife are mostly distinct from one another but have a common thread: both effectively identify the Science paper's analysis framework with the logical fallacy of 'proving too much', which can be thought of as a form of reductio ad absurdum. An argument 'proves too much' when it (or an argument of parallel construction) can prove things that are known to be false. For example, the 11th-century theologian St. Anselm's ontological argument [ed note: see previous post] for the existence of God states (in abbreviated form): "God is the greatest possible being. A being that exists is greater than one that doesn't. If God does not exist, we can conceive of an even greater being, namely one that does exist. Therefore God exists." But this proves too much because the same argument can be used to prove the existence of the greatest island, the greatest donut, etc., by making arguments of parallel construction about those hypothetical items, e.g. "The Lost Island is the greatest possible island…", as shown by Anselm's contemporary Gaunilo of Marmoutiers.

One could investigate further to identify more specific errors of logic in Anselm's argument, but this can be tricky and time-consuming. Philosophers have spent centuries doing just that, with varying levels of success. But simply showing that an argument proves too much is sufficient to at least call its conclusion into question. This makes 'proves too much' a rhetorically powerful approach. In the context of a scientific rebuttal, leading with a demonstration that this fallacy has occurred piques enough reader interest to justify a dissection of more specific technical errors. Both eLife rebuttals use this approach, first showing that the analysis framework proves too much, and then exploring the source(s) of the error in greater detail.

How does one show that a particular detailed mathematical analysis 'proves too much' about experimental data? Let me reduce the analysis in the Science paper to its essentials and abstract away the other mathematical details. The most basic claim in that paper is based upon what I will call 'the analysis framework':

$d = g(\mathrm{data})$

$z^* = f(d)$

$z > z^*$

The authors did three basic things. First, they extracted a critical parameter $d$ from their data set using a statistical procedure I'll call $g$; $d$ represents an average threshold for discriminability, corresponding to the number of components by which two mixtures must differ to be barely discriminable. Second, they fed this derived value $d$ into a function $f$ that produces a number of odorous mixtures, $z^*$. Finally, they argued that the number $z^*$ so obtained necessarily underestimates the 'true' number of discriminable smells, owing to the particular form of $f$. Each step and proposition can be investigated:

1) How does the quantity $d$ behave as the data or the form of $g$ varies? That is, is $g$ the 'right thing' to do to the data?
2) What implicit assumptions does $f$ make about the sense of smell, and are these assumptions reasonable?
3) Is the stated inequality — which says that any number $z^*$ derived using $f$ will always underestimate the true value $z$ — really valid?
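To make the framework concrete, here is a minimal runnable sketch. Everything in it is a hypothetical stand-in: the paper's real $g$ involves per-pair discrimination statistics and its real $f$ is a specific bound from coding theory; the function bodies below only mimic their general shape (a data reduction to one number $d$, then a sphere-packing-style count).

```python
from math import comb

def g(thresholds):
    """Stand-in for the paper's statistical procedure: reduce the raw
    discrimination data to a single average threshold d (the number of
    differing components at which mixtures become barely discriminable)."""
    return sum(thresholds) / len(thresholds)

def f(d, n_molecules=128, mixture_size=30):
    """Stand-in for the coding-theory step: a sphere-packing-style bound.
    Count all 30-component mixtures, then divide by the number of
    mixtures within d/2 component swaps of any given one."""
    total = comb(n_molecules, mixture_size)
    ball = sum(comb(mixture_size, k) * comb(n_molecules - mixture_size, k)
               for k in range(int(d // 2) + 1))
    return total // ball

# Hypothetical per-subject thresholds standing in for 'data':
d = g([14, 15, 16])
z_star = f(d)   # the claimed lower bound on discriminable smells
```

The structural point the sketch makes is that $z^*$ depends on the data only through the single number $d$, so any instability in $d$ propagates directly into the headline estimate.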

What are the rebuttals about? Meister's paper rejects equation 2 on the grounds that $f$ is unjustified for the problem at hand. Castro and I are also critical of $f$, but focus more on equations 1 and 3, criticizing the robustness of $g$ and demonstrating that the inequality $z > z^*$ should be reversed (the latter of which I will not discuss further here). So together we called everything about the analysis framework into question. However, all parties are enthusiastic about the data itself, as well as its importance, so great care should be taken to distinguish the quality of the data from the validity of the interpretation.

In Meister's paper, he shows that the analysis framework proves too much by using simulations of simple models, with either synthetic data or the actual data from the original paper. These simulations show that the original analysis framework can generate all sorts of values for $z^*$ which are known to be false by construction. For example, he shows that a synthetic organism constructed to have exactly 3 odor percepts necessarily produces data which, when the analysis framework is applied, yield values of $z^* \gg 3$. Since we know by construction that the correct answer is 3, the analysis framework must be flawed. This kind of demonstration of 'proving too much' is also known by the more familiar term 'positive control': a control where a specific non-null outcome can be expected in advance if everything is working correctly. When, instead of the correct outcome, the analysis framework produces an incredible outcome reminiscent of the one reported in the Science paper, that framework proves too much.
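A toy version of such a positive control fits in a few lines. To be clear, this is my own stand-in, not Meister's actual simulation: a synthetic organism maps every mixture to one of exactly 3 percepts (via a hash), and a hypothetical sphere-packing-style bound stands in for the paper's $f$.

```python
import random
from math import comb

random.seed(0)
N, S = 128, 30   # molecule panel size and mixture size, as in the paper

def percept(mixture):
    # Synthetic organism: every mixture maps to one of exactly 3 percepts,
    # so by construction this organism can smell only 3 distinct things.
    return hash(frozenset(mixture)) % 3

def fraction_discriminated(n_diff, trials=500):
    """Fraction of random mixture pairs differing in n_diff components
    that the synthetic organism tells apart."""
    hits = 0
    for _ in range(trials):
        a = random.sample(range(N), S)
        kept = random.sample(a, S - n_diff)
        outside = [m for m in range(N) if m not in a]
        b = kept + random.sample(outside, n_diff)
        hits += percept(a) != percept(b)
    return hits / trials

# Estimate d as the smallest difference discriminated on >50% of trials.
d = next(n for n in range(1, S + 1) if fraction_discriminated(n) > 0.5)

def f(d):
    # Hypothetical sphere-packing-style bound standing in for the paper's f.
    ball = sum(comb(S, k) * comb(N - S, k) for k in range(d // 2 + 1))
    return comb(N, S) // ball

z_star = f(d)   # vastly larger than the true answer of 3
```

Because the organism's percepts are unrelated to mixture overlap, roughly two-thirds of pairs are discriminated at every difference level, the estimated $d$ comes out tiny, and the framework returns an astronomical $z^*$ for a creature that can smell exactly 3 things.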

Meister then explores the reason the equations are flawed, and identifies the flaw in $f$. Imagine making a map of all odors, wherein similar-smelling odors are near each other on the map, and dissimilar-smelling odors are far apart. Let the distance between odors on the map be highly predictive of their perceptual similarity. How many dimensions must this map have to be accurate? We know the answer for a map of color vision: 3. Using only hue (H), saturation (S), and lightness (L), any perceptible color can be constructed, and any two nearby colors in an HSL map are perceptually similar, while any two distant colors in such a map are perceptually dissimilar. The hue and saturation subspace of that map is familiar as the 'color wheel', and has been understood for more than a century. In that map, hue is the angular dimension, saturation is the radial dimension, and lightness (if it were shown) would be perpendicular to the other two.

Meister argues that $f$ must be based upon a corresponding perceptual map. Since no such reliable map exists for olfaction, Meister argues, we cannot even begin to construct an $f$ for the smell problem; in fact, the $f$ actually used in the Science paper assumes a map with 128 dimensions, corresponding to the dimensionality of the stimulus, not the (unknown) dimensionality of the perceptual space. By using such a high-dimensional version of $f$, a very large value of $z^*$ is guaranteed, but unwarranted.
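The dimensionality point can be illustrated numerically. Using a generic Hamming-type sphere-packing bound (again a toy stand-in of my own, not the paper's exact $f$), the number of mutually discriminable points at a fixed discrimination threshold explodes as the assumed dimensionality of the map grows:

```python
from math import comb

def packing_bound(dims, d):
    """Sphere-packing-style count of points in {0,1}^dims with pairwise
    Hamming distance greater than d: the total volume of the space
    divided by the volume of a radius-(d//2) Hamming ball."""
    ball = sum(comb(dims, k) for k in range(d // 2 + 1))
    return 2 ** dims // ball

# Same discrimination threshold, wildly different answers depending
# on the assumed dimensionality of the perceptual map:
for dims in (3, 8, 32, 128):
    print(dims, packing_bound(dims, 14))
```

With a 3-dimensional map the bound collapses to about 1; with 128 dimensions it is astronomical. That is the sense in which assuming the 128-dimensional stimulus space as the perceptual map guarantees a huge $z^*$ regardless of what the organism can actually smell.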

In my paper with Castro, we show that the original paper proves too much in a different way. We show that very similar datasets (differing only in the number of subjects, the number of experiments, or the number of molecules) or very similar analytical choices (differing only in the statistical significance criterion or the discriminability threshold used) produce vastly different estimates for $z^*$, differing by tens of orders of magnitude from the reported value. Even trivial differences produce absurd results such as 'all possible odors can be discriminated' or 'at most 1 odor can be discriminated'. The differences were trivial in the sense that equally reasonable experimental designs and analyses could have proceeded, and have proceeded, according to these differences. But the resulting conclusions are obviously false, and therefore the analysis framework has proved too much. This kind of demonstration of 'proving too much' differs from that in Meister's paper. Whereas he showed that the analysis framework produces specific values that are known to be incorrect, we showed that it can produce any value at all under equally reasonable assumptions. For many of those assumptions, we don't know whether the values it produces are correct or not; after all, there may be $10^4$ or $10^8$ or $10^{12}$ discriminable odors — we don't know. But if all values are equally justified, the framework proves too much.

We then showed the technical source of the error: a very steep dependence of $d$ on incidental features of the study design, mediated by $g$, which is then amplified exponentially by a steep nonlinearity in $f$. I'll illustrate with a more well-known example from gene expression studies. When identifying genes thought to be differentially expressed in some disease or phenotype of interest, there is always a statistical significance threshold (e.g. $p<0.01$ or $p<0.001$) used for selection. After correcting for multiple comparisons, some number of genes pass the threshold and are identified as candidates for involvement in the phenotype. With a liberal threshold, e.g. $p<0.05$, many candidates will be identified (say 50). With a more moderate threshold, e.g. $p<0.005$, fewer candidates will be identified (say 10). With a stricter threshold, e.g. $p<0.001$, fewer still (say 2). This sensitivity is well known in gene expression studies. We showed that the function $g$ in the original paper has a similar sensitivity.

Now suppose some researcher went a step further and said, "If there are $N$ candidate genes involved in inflammation, and each has two expression levels, then there are $2^N$ inflammation phenotypes". Then the estimate for the number of inflammation phenotypes might be:

$2^2 = 4$ at $p<0.001$,

$2^{10} = 1024$ at $p<0.005$, and

$2^{50} \approx 1.1 \times 10^{15}$ at $p<0.05$.

Any particular claim about the number of inflammation phenotypes from this approach would be arbitrary, incredibly sensitive to the significance threshold, and not worth considering seriously. One could obtain nearly any number of inflammation phenotypes one wanted, just by setting the significance threshold accordingly (and all of those thresholds, in different contexts, are considered reasonable in experimental science).
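This threshold sensitivity is easy to reproduce in simulation. The numbers below are synthetic (random p-values for a hypothetical screen of 10,000 genes, 100 of them truly 'involved'), but the qualitative behavior, with candidate counts and hence $2^N$ swinging over many orders of magnitude as the threshold moves, is the point:

```python
import random

random.seed(1)

# Hypothetical screen: 9,900 null genes (uniform p-values) plus
# 100 involved genes (p-values concentrated near zero).
p_values = ([random.random() for _ in range(9900)] +
            [random.random() * 0.01 for _ in range(100)])

def candidates(p_values, alpha):
    """Number of genes passing the significance threshold alpha."""
    return sum(p < alpha for p in p_values)

for alpha in (0.001, 0.005, 0.05):
    n = candidates(p_values, alpha)
    print(f"p<{alpha}: {n} candidate genes -> 2^{n} ~ {2.0**n:.2g} phenotypes")
```

Every threshold here is one that some reviewer, somewhere, would call reasonable, yet the exponentiated 'phenotype count' changes by hundreds of orders of magnitude between them.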

But this is essentially what the function $f$ does in the original paper. By analogy, $g$ is the thresholding step, $d$ is the number of candidate genes, and $z^*$ is the number of inflammation phenotypes. And while all of the possible values for $z^*$ in the Science paper are arbitrary, a wide range of them would have been unimpressively small, another wide range would have been comically large, and only the 'goldilocks zone' produced the impressive but still plausible value reported in the paper. This is something that I think can and does happen to all scientists. If your first set of analysis decisions gives you a really exciting result, you may forget to check whether other reasonable sets of decisions would give you similar results, or whether instead they would give you any and every result under the sun. This robustness check can prevent you from proving too much — which really means proving nothing at all.

2015-08-14: typos fixed

## 3 thoughts on “Guest Post: On Proving Too Much in Scientific Data Analysis”

1. its an interesting exercise and concept, but looks like something one might see as a sample problem in a textbook or in the back of a math monthly or martin gardner's math games book. straight-up combinatorics and then some sort of mapping to biological space. (i didnt get it but did wonder about the 2^50 p value.)

i wonder if the god proof, islands, etc. can be extended to numbers (eg transfinites).

my favorite proof of god which i believe is true is 'you know He exists because She's never there when you need It'.

this also reminds me of the 'kay coloring' problem in psychology related to the sapir-whorf idea—how many colors can people discern, and do all cultures which name the colors actually just use different dialects to say the same thing.

it doesnt appear this paper had any data beyond the 128 figure—did they even ask anyone if the combinations were discernible, and also whether there were errors of discrimination (ie inability to consistently say which is the fine wine versus the cheap wine, or the great artwork, book, or scientific theory from the second-rate, bogus, or randomly generated one).

sometimes scientific papers come across as though they were written combinatorially by a computer which just combines words and equations so the researcher can have a long lunch—its done after s/he returns. (i have several publications with einstein, hawking, darwin, freud, george bush, s weinberg i wrote this way using some generator on the computer; one can also write literary papers this way—postmodern generator). few artworks can compete with warhol's soup can or a cell phone pic. (simkin of U Cal has studied this).

in social sciences, philosophy and law it often appears that the way to get a publication is: you go to the library and go to your field—eg law—then you walk 5 feet to another section and pull out a random book or journal (eg in mathematical economics), then copy that down and change the word economics to law and the authors' names and you are all done. eg you have a 'language instinct or organ' (chomsky), and then you have a number one, a moral one, a gender one, a violence one etc. depending on your field.

its a big business producing them of course—the signal/noise ratio compared to the benefit/cost one is an open question—maybe big data can find causal signatures. then there are issues of applications—some think its great we have excellent inexpensive hardware and software for everything (spying, hacking, marketing, recruiting) while others want various regulations and discernment, though these might vary in the us, china, iran, etc.


2. ps i glanced at the papers—i see they actually test on human subjects a bit. they also say similar reasoning says humans can discriminate a million colors—i thought there were 7, except for paints. i dont even get the logic really—from 3 to a million for colors. this does seem to be a very complex form of the bipartite matching problem—eg brian josephson's (physics) discussion of esp by a russian girl, who was presented with 7 people with different health problems, told the health problems, and asked to say who had which one based on her esp. she got 4 out of 7—some said it was statistically significant while others said she needed 5 correct matches using a different p value. this type of problem arises when matching people to colleges, jobs, relationships, etc.


3. ps2—looks like maybe that p value was a typo.

(i am into patterns—once i found my brother's wife's ring on her porch at night—a little shining thing in a crack in rainwater—when i was hanging out there—i showed it to someone and said maybe just throw it in the trash but it turned out it was seen as valuable—i'm sortuh persona non grata there—she had been asking everyone at her job and neighborhood if they had seen it. i also notice patterns in nature—there may be only 3, or 7, or 128 primary ones, but i have found over 10 species of snakes right in dc—the copperheads are poisonous. if you pick some up they emanate a bad smell. most people dont notice them).

on patterns, see for example 'ulf grenander' 'pattern theory' (wikipedia). or, jung with wolfgang pauli on synchronicity—wolf gang (had pep–you should spin your statistics) was the ruler of sin (chronic) city with st paul, a refugee on the road to damascus—grenander taught where i went but i only saw his book. (its one more dialect in the tower of babel called science, in class of petri nets and fuzzy set theory). i also notice i have excellent dxslekic typing skills—good for math (always put a minus sign when it should be +) and programming (one can write a program in 3 hours and then debug it for 3 months).

i am leading tourism tours to damascus—its a very historical place. we'll be there for the fireworks season. after that, we'll fly to china for more fireworks. and maybe a movie—straight outta compton—then the second half of the tour will be to cross the pacific, stop by hawaii to see relatives–howlies–and then on to the 'burning man' festival in california, and for fitness and obesity we'll do a hike in hell's canyon on the border of oregon/idaho. the tour will conclude by taking a brief detour into the wallawalla mtns (made famous by william o douglas—i actually met him once—he was quite old and i shook his hand—he saved the C and O canal trail and said in a book 'go east young man') then go through the bitterroots, see some trout, stop by north dakota, maybe wahpeton minnesota, and then take the 'trail of tears' to NC and then take the appalachian trail back to harper's ferry and the c&o to dc. this should take about 5 hours if u walk fast.
