James Lee and I just published a paper entitled “The causal meaning of Fisher’s average effect” in the journal Genetics Research. The paper can be obtained here. This paper is the brainchild of James and I just helped him out with some of the proofs. James’s take on the paper can be read here. The paper resolves a puzzle about the incommensurability of Ronald Fisher’s two definitions of the average effect noted by population geneticist D.S. Falconer three decades ago.
Fisher was well known for both brilliance and obscurity and people have long puzzled over the meaning of some of his work. The concept of the average effect is extremely important for population genetics but it is not very well understood. The field of population genetics was invented in the early twentieth century by luminaries such as Fisher, Sewall Wright, and JBS Haldane to reconcile Darwin’s theory of evolution with Mendelian genetics. This is a very rich field that has been somewhat forgotten. People in mathematical, systems, computational, and quantitative biology really should be fully acquainted with the field.
For those who are unacquainted with genetics, here is a quick primer to understand the paper. Certain traits, like eye colour or the ability to roll your tongue, are affected by your genes. Prior to the discovery of the structure of DNA, it was not clear what genes were, except that they were the primary discrete unit of genetic inheritance. These days it usually refers to some region on the genome. Mendel’s great insight was that genes come in pairs, which we now know to correspond to the two copies of each of the 23 chromosomes we have. A variant of a particular gene is called an allele. Traits can depend on genes (or more accurately genetic loci) linearly or nonlinearly. Consider a quantitative trait that depends on a single genetic locus that has two alleles, which we will call a and A. This means that a person will have one of three possible genotypes: 1) homozygous in A (i.e. have two A alleles), 2) heterozygous (have one of each), or 3) homozygous in a (i.e. have no A alleles). If the locus is linear then if you plot the measure of the trait (e.g. height) against the number of A alleles, you will get a straight line. For example, suppose allele A contributes a tenth of a centimetre to height. Then people with one A allele will be on average one tenth of a centimetre taller than those with no A alleles and those with two A alleles will be two tenths taller. The familiar notion of dominance is a nonlinear effect. So for example, the ability to roll your tongue is controlled by a single gene. There is a dominant rolling allele and a recessive nonrolling allele. If you have at least one rolling allele, you can roll your tongue.
The average effect of a gene substitution is the average change in a trait if one allele is substituted for another. A crucial part of population genetics is that you always need to consider averages. This is because genes are rarely completely deterministic. They can be influenced by the environment or other genes. Thus, in order to define the effect of the gene, you need to average over these other influences. This then leads to a somewhat ambiguous definition of average effect and Fisher actually came up with two. The first, and as James would argue the primary definition, is a causal one in that we want to measure the average effect of a gene if you experimentally substituted one allele for another prior to development and influence by the environment. A second correlation definition would simply be to plot the trait against the number of alleles as in the example above. The slope would then be the average effect. This second definition looks at the correlation between the gene and the trait but as the old saying goes “correlation does not imply causation”. For example, the genetic loci may not have any effect on the trait but happens to be strongly correlated with a true causal locus (in the population you happen to be examining). Distinguishing between genes that are merely associated with a trait from ones that are actually causal remains an open problem in genome wide association studies.
Our paper goes over some of the history and philosophy of the tension between these two definitions. We wrote the paper because these two definitions do not always agree and we show under what conditions they do agree. The main reason they don’t agree is that averages will depend on the background over which you average. For a biallelic gene, there are 2 alleles but 3 genotypes. The distribution of alleles in a population is governed by two parameters. It’s not enough to specify the frequency of one allele. You also need to know the correlation between alleles. The regression definition matches the causal definition if a particular function representing this correlation is held fixed while the experimental allele substitutions under the causal definition are carried out. We also considered the multi-allele and multi-loci case in as much generality as we could.