New paper in Molecular Psychiatry

Genomic analysis of diet composition finds novel loci and associations with health and lifestyle

S. Fleur W. Meddens, et al.

Abstract

We conducted genome-wide association studies (GWAS) of relative intake from the macronutrients fat, protein, carbohydrates, and sugar in over 235,000 individuals of European ancestries. We identified 21 unique, approximately independent lead SNPs. Fourteen lead SNPs are uniquely associated with one macronutrient at genome-wide significance (P < 5 × 10^−8), while five of the 21 lead SNPs reach suggestive significance (P < 1 × 10^−5) for at least one other macronutrient. While the phenotypes are genetically correlated, each phenotype carries a partially unique genetic architecture. Relative protein intake exhibits the strongest relationships with poor health, including positive genetic associations with obesity, type 2 diabetes, and heart disease (rg ≈ 0.15–0.5). In contrast, relative carbohydrate and sugar intake have negative genetic correlations with waist circumference, waist-hip ratio, and neighborhood deprivation (|rg| ≈ 0.1–0.3) and positive genetic correlations with physical activity (rg ≈ 0.1 and 0.2). Relative fat intake has no consistent pattern of genetic correlations with poor health but has a negative genetic correlation with educational attainment (rg ≈ −0.1). Although our analyses do not allow us to draw causal conclusions, we find no evidence of negative health consequences associated with relative carbohydrate, sugar, or fat intake. However, our results are consistent with the hypothesis that relative protein intake plays a role in the etiology of metabolic dysfunction.

New paper on GWAS

Genet Epidemiol. 2018 Dec;42(8):783-795. doi: 10.1002/gepi.22161. Epub 2018 Sep 24.

The accuracy of LD Score regression as an estimator of confounding and genetic correlations in genome-wide association studies.

Author information:
1. Department of Psychology, University of Minnesota Twin Cities, Minneapolis, Minnesota.
2. Mathematical Biology Section, Laboratory of Biological Modeling, NIDDK, National Institutes of Health, Bethesda, Maryland.

Abstract

To infer that a single-nucleotide polymorphism (SNP) either affects a phenotype or is in linkage disequilibrium with a causal site, we must have some assurance that any SNP-phenotype correlation is not the result of confounding with environmental variables that also affect the trait. In this study, we study the properties of linkage disequilibrium (LD) Score regression, a recently developed method for using summary statistics from genome-wide association studies to ensure that confounding does not inflate the number of false positives. We do not treat the effects of genetic variation as a random variable and thus are able to obtain results about the unbiasedness of this method. We demonstrate that LD Score regression can produce estimates of confounding at null SNPs that are unbiased or conservative under fairly general conditions. This robustness holds in the case of the parent genotype affecting the offspring phenotype through some environmental mechanism, despite the resulting correlation over SNPs between LD Scores and the degree of confounding. Additionally, we demonstrate that LD Score regression can produce reasonably robust estimates of the genetic correlation, even when its estimates of the genetic covariance and the two univariate heritabilities are substantially biased.

KEYWORDS: causal inference; genetic correlation; heritability; population stratification; quantitative genetics

PMID: 30251275 DOI: 10.1002/gepi.22161
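The intercept-versus-slope logic of LD Score regression can be sketched in a few lines. This is a toy simulation, not the authors' code: the LD Scores, sample size, heritability, and intercept below are invented, and Gaussian noise stands in for chi-square sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy LD Score regression: E[chi2_j] = N*h2*l_j/M + intercept, where l_j is
# SNP j's LD Score and an intercept above 1 signals confounding.
M, N, h2, intercept = 10_000, 50_000, 0.4, 1.05   # invented values
ld_scores = rng.gamma(shape=4.0, scale=25.0, size=M)
chi2 = N * h2 * ld_scores / M + intercept + rng.normal(0, 2.0, size=M)

# Regress chi2 on LD Scores: the slope recovers h2, the intercept confounding.
X = np.column_stack([np.ones(M), ld_scores])
(est_intercept, slope), *_ = np.linalg.lstsq(X, chi2, rcond=None)
est_h2 = slope * M / N
print(f"intercept ≈ {est_intercept:.2f}, h2 ≈ {est_h2:.2f}")
```

The real method uses weighted regression and block jackknife standard errors, but the separation of confounding (intercept) from polygenic signal (slope) is the core idea.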

Are humans successful because they are cruel?

According to Wikipedia, the class Mammalia has 29 orders (e.g. Carnivora), 156 families (e.g. Ursidae), 1,258 genera (e.g. Ursus), and nearly 6,000 species (e.g. the polar bear). Some orders, like Chiroptera (bats) and Rodentia, are very large, with many families, genera, and species. Some are really small, like Orycteropodidae, which has only one species – the aardvark. Humans are in the order Primates, which has quite a few families and genera. Almost all of them live in tropical or subtropical areas, and almost all have small populations, many of them endangered. The exception of course is humans, the only remaining species of the genus Homo. The other genera in the great ape family Hominidae – gorillas, orangutans, chimpanzees, and bonobos – are all in big trouble.

I think most people would attribute the incomparable success of humans to their resilience, intelligence, and ingenuity. However, another important factor could be their bottomless capacity for intentional cruelty. Although there seems to have been a decline in violence throughout history, as documented in Steven Pinker’s recent book, there is still no shortage of examples. Take a listen to this recent EconTalk podcast with Mike Munger on how the South rationalized slavery. It could very well be that what made modern humans dominate the earth and wipe out all the other Homo species along the way was not that they were more intelligent but that they were more cruel and rapacious. Neanderthals and Denisovans may have been happy sitting around the campfire after a hunt, while humans needed to raid every nearby tribe and kill them.

New review paper on GWAS

Comput Struct Biotechnol J. 2015 Nov 23;14:28-34
Uncovering the Genetic Architectures of Quantitative Traits.
Lee JJ, Vattikuti S, Chow CC.

Abstract
The aim of a genome-wide association study (GWAS) is to identify loci in the human genome affecting a phenotype of interest. This review summarizes some recent work on conceptual and methodological aspects of GWAS. The average effect of gene substitution at a given causal site in the genome is the key estimand in GWAS, and we argue for its fundamental importance. Implicit in the definition of average effect is a linear model relating genotype to phenotype. The fraction of the phenotypic variance ascribable to polymorphic sites with nonzero average effects in this linear model is called the heritability, and we describe methods for estimating this quantity from GWAS data. Finally, we show that the theory of compressed sensing can be used to provide a sharp estimate of the sample size required to identify essentially all sites contributing to the heritability of a given phenotype.
KEYWORDS:
Average effect of gene substitution; Compressed sensing; GWAS; Heritability; Population genetics; Quantitative genetics; Review; Statistical genetics
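The average effect of gene substitution highlighted in the abstract is, under the implied linear model, just a regression slope of phenotype on allele count. A minimal simulated sketch (the allele frequency and effect size are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Under the linear model, the average effect of gene substitution at a site
# is the least-squares slope of phenotype on allele count.
n, p, alpha = 100_000, 0.3, 0.25           # sample size, allele freq, true effect
g = rng.binomial(2, p, size=n)             # genotypes coded 0/1/2
y = alpha * g + rng.normal(0, 1, size=n)   # phenotype = additive effect + noise

alpha_hat = np.cov(g, y)[0, 1] / np.var(g, ddof=1)   # Cov(g, y) / Var(g)
print(round(float(alpha_hat), 2))                    # ≈ 0.25
```

The variance this site contributes is 2p(1−p)·alpha², and summing such terms over causal sites gives the (narrow-sense) heritable variance discussed in the review.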

Paper on new myopia associated gene

The prevalence of nearsightedness, or myopia, has almost doubled in the past thirty years, from about 25% to 44%. No one knows why, but it is probably a gene-environment effect, like obesity. This recent paper in PLoS Genetics, APLP2 Regulates Refractive Error and Myopia Development in Mice and Humans, sheds light on the subject. It reports that a variant of the APLP2 gene is associated with myopia in people if they read a lot as children. Below is a figure from the GWAS showing the increase in myopia (more negative is more myopic) with age for those with the risk variant (GA) and for time spent reading. The effect size is pretty large, and a myopic effect of APLP2 is seen in monkeys, mice, and humans. Thus, I think this result will hold up. The authors also show that the APLP2 gene is involved in retinal signaling, particularly in amacrine cells. It is thus consistent with the theory that myopia is the result of feedback from the retina during development. Hence, if you are constantly focused on near objects, the eye will develop to accommodate that. So maybe you should send your 7-year-old outside to play instead of sitting inside reading or playing video games.
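A gene-environment interaction of this kind is usually modeled with an interaction term in a regression. Here is a simulated sketch in that spirit; every coefficient is invented for illustration, none comes from the APLP2 paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# G×E sketch: refraction worsens with childhood reading time mainly
# in risk-allele carriers. All coefficients are invented.
n = 50_000
risk = rng.binomial(1, 0.3, size=n)        # carries the risk variant (GA)
reading = rng.uniform(0, 3, size=n)        # hours/day reading as a child
refraction = -0.2 * reading - 0.8 * risk * reading + rng.normal(0, 1, size=n)

# Fit refraction ~ 1 + reading + risk + reading*risk by least squares.
X = np.column_stack([np.ones(n), reading, risk, reading * risk])
coefs, *_ = np.linalg.lstsq(X, refraction, rcond=None)
print(np.round(coefs, 2))   # interaction coefficient should come out near -0.8
```

A large, significant interaction coefficient is what distinguishes "the variant matters only if you read a lot" from a purely additive genetic effect.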

Brave New World

Read Steve Hsu’s Nautilus article on super-intelligence. If so-called IQ-related genetic variants are truly additive, then his estimates are probably correct. His postulated being could possibly understand the fine details of any topic in less than a day. Instead of taking several years to learn enough differential geometry to develop Einstein’s general relativity (which is what it took Einstein), a super-intelligence could perhaps do it in an afternoon or during a coffee break. Personally, I believe that nothing is free and that there will always be tradeoffs. I’m not sure what the cost of super-intelligence will be, but there will likely be something. Variability in a population is always good for the population, although not so great for each individual. An effective way to make a species go extinct is to remove variability. If pests had no genetic variability then it would be a simple matter to eliminate them with some toxin. Perhaps humans will be able to innovate fast enough to buffer themselves against environmental changes. Maybe cognitive variability can compensate for genetic variability. I really don’t know.

Heritability in twins

Nature Genetics recently published a meta-analysis of virtually all twin studies over the last half century:

Tinca J C Polderman, Beben Benyamin, Christiaan A de Leeuw, Patrick F Sullivan, Arjen van Bochoven, Peter M Visscher & Danielle Posthuma. Meta-analysis of the heritability of human traits based on fifty years of twin studies. Nature Genetics 47,702–709 (2015) doi:10.1038/ng.3285.

One of the authors, Peter Visscher, is perhaps the most influential and innovative thinker in human genetics at this moment, and this paper continues his string of insightful results. The paper examined close to eighteen thousand traits in almost three thousand publications, representing fifteen million twins. The main goal was to use all the available data to recompute the heritability estimates for all of these traits. The first thing they found was that the traits were highly skewed towards psychiatric and cognitive phenotypes; people who study heritability are mostly interested in mental function. They then checked for publication bias, in which only results with high heritability get published. They used multiple methods, but these basically checked whether estimated effect sizes were correlated with sample size, and they found no such bias. Their most interesting result, which I will comment on more below, was that the average heritability across all traits was 0.488, which means that on average genes and environment contribute equally. However, heritability varies widely across domains, with eye, ear, nose, and throat function the most heritable and social values the least. The largest influence of shared environmental effects was on bodily functions, infections, and social values. Hence, staying healthy depends on your environment, which is why a child who may be stunted in their impoverished village can thrive if moved to Minnesota. It also shows why attitudes on social issues can and do change. Finally, the paper addressed two important technical issues, which I will expand on below: 1) previous studies may have been underestimating heritability, and 2) heritability is mostly additive.

Heritability is the fraction of the variance of a trait due to genetic variance. Here is a link to a previous post explaining heritability although as my colleague Vipul Periwal points out, it is full of words and has no equations. Briefly, there are two types of heritability – broad sense and narrow sense. Broad sense heritability, H^2 = Var(G)/Var(P), is the total genetic variance divided by the phenotypic variance. Narrow sense heritability h^2 = Var(A)/Var(P) is the linear or additive genetic variance divided by the phenotypic variance. A linear regression of the standardized trait of the children against the average of the standardized trait of the parents is an estimate of the narrow sense heritability. It captures the linear part while the broad sense heritability includes the linear and nonlinear contributions, which include dominance and gene-gene effects (epistasis). To estimate (narrow-sense) heritability from twins, Polderman et al. used what is called Falconer’s formula and took twice the difference in the correlation of a trait between identical (monozygotic) and fraternal (dizygotic) twins (h^2 = 2(r_{MZ} - r_{DZ})). The idea is that any difference between identical twins must be environmental (nongenetic), while dizygotic twins share only half of their segregating genes, so twice the difference between the two correlations isolates the additive genetic part. They also used another Falconer formula to estimate the shared environmental variance, c^2 = 2 r_{DZ} - r_{MZ}, since this “cancels out” the genetic part. Their paper then boiled down to doing a meta-analysis of r_{DZ} and r_{MZ}. Meta-analysis is a nuanced topic but it boils down to weighting results from different studies by some estimate of how large the errors are. They used the DerSimonian-Laird random-effects approach, which is implemented in R. The Falconer formulas estimate the narrow sense heritability but many of the previous studies were interested in nonadditive genetic effects as well.
Typically, what those studies did was use either an ACE (additive, common environment, unique environment) or an ADE (additive, dominance, unique environment) model. They decided which model to use by looking at the sign of c^2: if it was positive they used ACE, and if it was negative they used ADE. Polderman et al. showed that this decision algorithm biases the heritability estimate downward.
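Falconer's formulas amount to a couple of lines of arithmetic. The twin correlations below are invented for illustration:

```python
# Falconer's formulas applied to invented twin correlations:
r_mz, r_dz = 0.66, 0.40        # MZ and DZ trait correlations (illustrative)

h2 = 2 * (r_mz - r_dz)         # narrow-sense heritability
c2 = 2 * r_dz - r_mz           # shared-environment component
e2 = 1 - r_mz                  # what's left for MZ twins: unique environment

# A negative c2 here is what would have triggered the ADE model in the
# older studies, with the downward bias noted above.
print(round(h2, 2), round(c2, 2), round(e2, 2))   # 0.52 0.14 0.34
```

Note that the three components sum to one by construction, which is why only the two correlations are needed.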

If the heritability of a trait is mostly additive then you would expect that r_{MZ} = 2 r_{DZ}, and they found that this was observed in 69% of the traits. Of the top 20 traits, eight showed nonadditivity, and these mostly related to behavioral and cognitive functions. Of these eight, seven showed that the correlation between monozygotic twins was smaller than twice that of dizygotic twins, which implies that nonlinear genetic effects tend to work against each other. This makes sense to me since it would seem that as you accumulate additive variants that increase a phenotype, you will start to hit rate-limiting effects that dampen them. In other words, it seems plausible that the major nonlinearity in genetics is a saturation effect.

The most striking result was that the average heritability across all of the traits was about 0.5. Is an average value of 0.5 obvious or deep? I honestly do not know. When I told theoretical neuroscientist Fred Hall this result, he thought it was obvious and should be expected from maximum entropy considerations, which would assume that the distribution of h^2 would be uniform or at least symmetric about 0.5. This sounds plausible but as I have asserted many times – biology is the result of an exponential amplification of exponentially unlikely events. Traits that are heritable are by definition those that have variation across the population. Some traits, like the number of limbs, have no variance but are entirely genetic. Other traits, like your favourite sports team, are highly variable but not genetic, even though there is a high probability that your favourite team will be the same as your parent’s or sibling’s. Traits that are highly heritable include height and cognitive function. Personality, on the other hand, is not highly heritable. One of the biggest puzzles in population genetics is why there is any variability in a trait to start with. Natural selection prunes out variation exponentially fast, so if any gene is selected for, it should be fixed very quickly. Hence, it seems equally plausible that traits with high variability would have low heritability. The studied traits were also biased towards mental function, and different domains have different heritabilities. Thus, if the traits were sampled differently, the average heritability could easily deviate from 0.5. I therefore think the null hypothesis should be that the h^2 = 0.5 value is a coincidence, but I’m open to a deeper explanation.

A software tool to investigate these results can be found here. An enterprising student could do some subsampling of the traits to see how likely 0.5 would hold up if our historical interests in phenotypes were different.
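The subsampling exercise could start along these lines, with synthetic heritabilities standing in for real ones (the domain distributions below are invented; actual values would come from the paper's online tool):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic h2 values for two domains with different typical heritabilities.
h2_cognitive = rng.beta(5, 5, size=300)   # mean near 0.5 (invented)
h2_social = rng.beta(3, 7, size=100)      # mean near 0.3 (invented)

# Reweight the domains in the sample and watch the average h2 move off 0.5.
cognitive_heavy = np.concatenate(
    [rng.choice(h2_cognitive, 150), rng.choice(h2_social, 50)]).mean()
social_heavy = np.concatenate(
    [rng.choice(h2_cognitive, 50), rng.choice(h2_social, 150)]).mean()
print(round(float(cognitive_heavy), 2), round(float(social_heavy), 2))
```

The point of the exercise: if the grand mean of 0.5 is fragile under plausible reweightings of which traits researchers happened to study, it is more likely a coincidence than a deep fact.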

Thanks go to Rick Gerkin for suggesting this topic.

Paper on new version of Plink

The paper describing the updated version of the genome analysis software tool Plink has just been published.

Second-generation PLINK: rising to the challenge of larger and richer datasets
Christopher C Chang, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, and James J Lee

GigaScience 2015, 4:7  doi:10.1186/s13742-015-0047-8

Abstract
Background
PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format.

Findings
To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0).

Conclusions
The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

Keywords: GWAS; Population genetics; Whole-genome sequencing; High-density SNP genotyping; Computational statistics

 

This project started out with us trying to do some genomic analysis that involved computing various distance metrics on sequence space. Programming virtuoso Chris Chang stepped in and decided to write some code to speed up the computations. His program, originally called wdist, was so good and fast that we kept asking him to put in more capabilities. Eventually, he had basically replicated the suite of functions that Plink performed, so he asked Shaun Purcell, the author of Plink, whether he could just call his code Plink too, and Shaun agreed. We then ran a series of tests on various machines to check the speed-ups compared to the original Plink and GCTA. If you do any GWAS analysis at all, I highly recommend you check out Plink 1.9.
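The "bit-level parallelism" mentioned in the abstract comes from packing genotypes two bits apiece and tallying alleles with popcounts, so one machine word processes dozens of samples at once. A toy illustration of the idea (not PLINK's actual code):

```python
# PLINK stores genotypes in two bits each; the spirit of its bit-level
# parallelism can be sketched by counting alleles with popcounts on a
# packed integer.
genotypes = [0, 1, 2, 2, 1, 0, 2, 1]      # minor-allele counts per person

packed = 0
for i, g in enumerate(genotypes):         # pack 2 bits per genotype
    packed |= g << (2 * i)

low_mask = int("01" * len(genotypes), 2)  # selects the value-1 bit of each pair
low_bits = packed & low_mask
high_bits = (packed >> 1) & low_mask      # the value-2 bits, shifted down

# One popcount tallies many genotypes at once; allele count = low + 2*high.
allele_count = bin(low_bits).count("1") + 2 * bin(high_bits).count("1")
print(allele_count, sum(genotypes))       # both 9
```

In C, the two popcounts become single hardware instructions over 64-bit words, which is where the orders-of-magnitude speedups on allele-frequency and distance computations come from.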

Journal Club

Here is the paper I’ll be covering in the Laboratory of Biological Modeling, NIDDK, Journal Club tomorrow

Morphological and population genomic evidence that human faces have evolved to signal individual identity

Michael J. Sheehan & Michael W. Nachman

Abstract: Facial recognition plays a key role in human interactions, and there has been great interest in understanding the evolution of human abilities for individual recognition and tracking social relationships. Individual recognition requires sufficient cognitive abilities and phenotypic diversity within a population for discrimination to be possible. Despite the importance of facial recognition in humans, the evolution of facial identity has received little attention. Here we demonstrate that faces evolved to signal individual identity under negative frequency-dependent selection. Faces show elevated phenotypic variation and lower between-trait correlations compared with other traits. Regions surrounding face-associated single nucleotide polymorphisms show elevated diversity consistent with frequency-dependent selection. Genetic variation maintained by identity signalling tends to be shared across populations and, for some loci, predates the origin of Homo sapiens. Studies of human social evolution tend to emphasize cognitive adaptations, but we show that social evolution has shaped patterns of human phenotypic and genetic diversity as well.

Incompetence is the norm

People have been justly anguished by the recent gross mishandling of the Ebola patients in Texas and Spain and the risible lapse in security at the White House. The conventional wisdom is that these demonstrations of incompetence are a recent phenomenon signifying a breakdown in governmental competence. However, I think that incompetence has always been the norm; any semblance of competence in the past is due mostly to luck and the fact that people do not exploit incompetent governance because of a general tendency towards docile cooperativity (as well as incompetence of bad actors). In many ways, it is quite amazing how reliably citizens of the US and other OECD members respect traffic laws, pay their bills, and service their debts on time. This is a huge boon to an economy since excessive resources do not need to be spent on enforcing rules. This does not hold in some if not many developing nations where corruption is a major problem (c.f. this op-ed in the Times today). In fact, it is still an evolutionary puzzle as to why agents cooperate for the benefit of the group even though it is an advantage for an individual to defect. Cooperativity is also not likely to be all genetic since immigrants tend to follow the social norms of their adopted country, although there could be a self-selection effect here. However, the social pressure to cooperate could evaporate quickly if there is the perception of a lack of enforcement, as evidenced by looting following natural disasters or the abundance of insider trading in the finance industry. Perhaps, as suggested by the work of Karl Sigmund and other evolutionary theorists, cooperativity is a transient phenomenon and will eventually be replaced by the evolutionarily more stable state of noncooperativity. In that sense, perceived incompetence could be rising but not because we are less able but because we are less cooperative.

Linear and nonlinear thinking

A linear system is one where the whole is precisely the sum of its parts. You can know how different parts will act together by simply knowing how they act in isolation. A nonlinear function lacks this nice property. For example, consider a linear function f(x). It satisfies the property that f(a x + b y) = a f(x) + b f(y). The function of the sum is the sum of the functions. One important point to note is that what is considered to be the paragon of linearity, namely a line on a graph, i.e. f(x) = mx + b is not linear since f(x + y) = m (x + y) + b \ne f(x)+ f(y). The y-intercept b destroys the linearity of the line. A line is instead affine, which is to say a linear function shifted by a constant. A linear differential equation has the form

\frac{dx}{dt} = M x

where x can be in any dimension.  Solutions of a linear differential equation can be multiplied by any constant and added together.
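Both points — the affine "line" failing the additivity test, and superposition of solutions of the linear differential equation — can be checked numerically (a one-dimensional case, for simplicity):

```python
import math

# The affine map f(x) = m*x + b fails the additivity test that defines
# linearity: the intercept b is what breaks it.
m, b = 2.0, 1.0
f = lambda x: m * x + b
print(f(3.0 + 5.0), f(3.0) + f(5.0))   # 17.0 vs 18.0 — not equal

# Superposition for dx/dt = a*x: solutions x(t) = x0 * exp(a*t) can be
# scaled and added, and the combination is again a solution.
a, t = -0.5, 2.0
sol = lambda x0: x0 * math.exp(a * t)
lhs = sol(2 * 1.0 + 3 * 4.0)           # evolve the combined initial condition
rhs = 2 * sol(1.0) + 3 * sol(4.0)      # combine the evolved solutions
print(abs(lhs - rhs) < 1e-12)          # True
```

The same check works with x a vector and M a matrix; the solution operator is then the matrix exponential, which is still linear in the initial condition.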

Linearity is thus essential for engineering. If you are designing a bridge then you simply add as many struts as you need to support the predicted load. Electronic circuit design is also linear in the sense that you combine as many logic circuits as you need to achieve your end. Imagine if bridge mechanics were completely nonlinear so that you had no way to predict how a bunch of struts would behave when assembled together. You would then have to test each combination to see how they work. Now, real bridges are not entirely linear but the deviations from pure linearity are mild enough that you can make predictions or have rules of thumb of what will work and what will not.

Chemistry is an example of a system that is highly nonlinear. You can’t know how a compound will act just based on the properties of its components. For example, you can’t simply mix glass and steel together to get a strong and hard transparent material. You need to be clever in coming up with something like gorilla glass used in iPhones. This is why engineering new drugs is so hard. Although organic chemistry is quite sophisticated in its ability to synthesize various compounds there is no systematic way to generate molecules of a given shape or potency. We really don’t know how molecules will behave until we create them. Hence, what is usually done in drug discovery is to screen a large number of molecules against specific targets and hope. I was at a computer-aided drug design Gordon conference a few years ago and you could cut the despair and angst with a knife.

That is not to say that engineering is completely hopeless for nonlinear systems. Most nonlinear systems act linearly if you perturb them gently enough. That is why linear regression is so useful and prevalent. Hence, even though the global climate system is highly nonlinear, it probably acts close to linearly for small changes. Thus I feel confident that we can predict the increase in temperature for a 5% or 10% change in the concentration of greenhouse gases, but much less confident in what will happen if we double or treble them. How linearly a system acts depends on how close it is to a critical or bifurcation point. If the climate is very far from a bifurcation then it could act linearly over a large range, but if we’re near a bifurcation then who knows what will happen if we cross it.

I think biology is an example of a nonlinear system with a wide linear range. Recent research has found that many complex traits and diseases like height and type 2 diabetes depend on a large number of linearly acting genes (see here). Their genetic effects are additive. Any nonlinear interactions they have with other genes (i.e. epistasis) are tiny. That is not to say that there are no nonlinear interactions between genes. It only suggests that common variations are mostly linear. This makes sense from an engineering and evolutionary perspective. It is hard to do either in a highly nonlinear regime. You need some predictability if you make a small change. If changing an allele had completely different effects depending on what other genes were present then natural selection would be hard pressed to act on it.

However, you also can’t have a perfectly linear system because you can’t make complex things. An exclusive OR logic circuit cannot be constructed without a threshold nonlinearity. Hence, biology and engineering must involve “the linear combination of nonlinear gadgets”. A bridge is the linear combination of highly nonlinear steel struts and cables. A computer is the linear combination of nonlinear logic gates. This occurs at all scales as well. In biology, you have nonlinear molecules forming a linear genetic code. Two nonlinear mitochondria may combine mostly linearly in a cell and two liver cells may combine mostly linearly in a liver.  This effective linearity is why organisms can have a wide range of scales. A mouse liver is thousands of times smaller than a human one but their functions are mostly the same. You also don’t need very many nonlinear gadgets to have extreme complexity. The genes between organisms can be mostly conserved while the phenotypes are widely divergent.
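The XOR point can be made concrete: no single linear unit computes XOR, but a linear combination of threshold gates does. This is the standard two-layer construction, with weights chosen by hand:

```python
# XOR from "the linear combination of nonlinear gadgets": two threshold
# units combined linearly, then thresholded once more.
step = lambda z: 1 if z > 0 else 0       # the threshold nonlinearity

def xor(a, b):
    h1 = step(a + b - 0.5)               # OR-like unit: fires if a or b
    h2 = step(a + b - 1.5)               # AND-like unit: fires only if both
    return step(h1 - h2 - 0.5)           # linear combination, thresholded

print([xor(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```

Inside each layer everything is a weighted sum; the complexity comes entirely from interleaving those sums with the nonlinear threshold.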

New paper on genomics

James Lee and I have a new paper out: Lee and Chow, Conditions for the validity of SNP-based heritability estimation, Human Genetics, 2014. As I summarized earlier (e.g. see here and here), heritability is a measure of the proportion of the variance of some trait (like height or cholesterol levels) due to genetic factors. The classical way to estimate heritability is to regress standardized (mean zero, standard deviation one) phenotypes of close relatives against each other. In 2010, Jian Yang, Peter Visscher and colleagues developed a way to estimate heritability directly from the data obtained in Genome Wide Association Studies (GWAS), sometimes called GREML.  Shashaank Vattikuti and I quickly adopted this method and computed the heritability of metabolic syndrome traits as well as the genetic correlations between the traits (link here). Unfortunately, our methods section has a lot of typos but the corrected Methods with the Matlab code can be found here. However, I was puzzled by the derivation of the method provided by the Yang et al. paper.  This paper is our resolution.  The technical details are below the fold.
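At the heart of the GREML approach is the genetic relationship matrix (GRM) built from standardized genotypes. A minimal simulated sketch of its construction (sizes and allele frequencies are invented; real analyses use GCTA or similar tools, and estimate the variance components by REML rather than stopping here):

```python
import numpy as np

rng = np.random.default_rng(4)

# Build a GRM: standardize each SNP to mean 0 / variance 1, then A = Z Z^T / M.
n, M = 500, 2_000                                  # individuals, SNPs
p = rng.uniform(0.05, 0.5, size=M)                 # allele frequencies
G = rng.binomial(2, p, size=(n, M)).astype(float)  # genotype matrix (0/1/2)
Z = (G - 2 * p) / np.sqrt(2 * p * (1 - p))         # standardize each SNP column
A = Z @ Z.T / M                                    # genetic relationship matrix

print(A.shape, round(float(np.mean(np.diag(A))), 2))
```

The diagonal averages to one for unrelated individuals, and the off-diagonal entries are the tiny realized relationships among "unrelated" people that GREML exploits to estimate heritability.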

 

Continue reading

The Stephanie Event

You should read this article in Esquire about the advent of personalized cancer treatment for a heroic patient named Stephanie Lee. Here is Steve Hsu’s blog post. The cost of sequencing is almost at the point where everyone can have their normal and tumor cells completely sequenced to look for mutations, as Stephanie did. The team at Mt. Sinai Hospital in New York described in the article inserted some of the mutations into a fruit fly and then checked to see what drugs killed it. The Stephanie Event was the oncology board meeting at Sinai where the treatment for Stephanie Lee’s colon cancer, which had spread to the liver, was discussed. They decided on a standard protocol but would use the individualized therapy based on the fly experiments if the standard treatments failed. The article was beautifully written, combining a compelling human story with science.

Paper on compressed sensing and genomics

New paper on the arXiv. The next step after the completion of the Human Genome Project, was the search for genes associated with diseases such as autism or diabetes. However, after spending hundreds of millions of dollars, we find that there are very few common variants of genes with large effects. This doesn’t mean that there aren’t genes with large effect. The growth hormone gene definitely has a large effect on height. It just means that variations of genes that are common among people have small effects on the phenotype. Given the results of Fisher, Wright, Haldane and colleagues, this was probably expected as the most likely scenario and recent results measuring narrow-sense heritability directly from genetic markers (e.g. see this) confirms this view.

Current GWAS microarrays consider about a million or two markers and this is increasing rapidly. Narrow-sense heritability refers to the additive or linear genetic variance, which means the phenotype is given by the linear model y = Z\beta + \eta, where y is the phenotype vector, Z is the genotype matrix, \beta are all the genetic effects we want to recover, and \eta are all the nonadditive components including environmental effects. This is a classic linear regression problem. The problem comes when the number of coefficients \beta far exceeds the number of people in your sample, which is the case for genomics. Compressed sensing is a field of high-dimensional statistics that addresses this specific problem. People such as David Donoho, Emmanuel Candes, and Terence Tao have proven under fairly general conditions that if the nonzero coefficients are sparse relative to the number of samples, then the effects can be completely recovered using L1-penalized optimization algorithms such as the lasso or approximate message passing. In this paper, we show that these ideas can be applied to genomics.
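The sparse-recovery setup can be sketched with a simple L1-penalized solver (ISTA: a gradient step on the least-squares loss followed by soft-thresholding). This is a toy illustration on simulated data, not the paper's pipeline, and the sizes and penalty are illustrative rather than tuned:

```python
import numpy as np

rng = np.random.default_rng(5)

# Sparse linear model y = Z beta + eta with far more markers than samples.
n, M, k = 200, 1_000, 10                  # samples, markers, nonzero effects
Z = rng.normal(size=(n, M)) / np.sqrt(n)  # columns roughly unit norm
beta = np.zeros(M)
beta[rng.choice(M, size=k, replace=False)] = rng.normal(0, 1, size=k)
y = Z @ beta + 0.01 * rng.normal(size=n)  # small nonadditive/noise term

# ISTA: minimize 0.5*||y - Z b||^2 + lam*||b||_1.
lam = 0.02
step = 1.0 / np.linalg.norm(Z, 2) ** 2    # 1 / squared spectral norm
b = np.zeros(M)
for _ in range(2_000):
    b = b + step * (Z.T @ (y - Z @ b))                  # gradient step
    b = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)  # soft-threshold

print(f"max coefficient error: {np.max(np.abs(b - beta)):.3f}")
```

With k = 10 nonzero effects and n = 200 samples, we are well inside the regime where L1 methods recover the signal despite M = 1,000 unknowns; shrinking n toward k is exactly where the phase transition studied in the paper appears.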

Here is Steve Hsu’s summary of the paper

Application of compressed sensing to genome wide association studies and genomic selection

(Submitted on 8 Oct 2013)

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h^2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h^2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.

Comments: Main paper (27 pages, 4 figures) and Supplement (5 figures) combined
Subjects: Genomics (q-bio.GN); Applications (stat.AP)
Cite as: arXiv:1310.2264 [q-bio.GN]
(or arXiv:1310.2264v1 [q-bio.GN] for this version)

Heritability and additive genetic variance

Most people have an intuitive notion of heritability being the genetic component of why close relatives tend to resemble each other more than strangers. More technically, heritability is the fraction of the variance of a trait within a population that is due to genetic factors. This is the pedagogical post on heritability that I promised in a previous post on estimating heritability from genome wide association studies (GWAS).

One of the most important facts about uncertainty, and something everyone should know but often doesn’t, is that when you add two imprecise quantities, the average of the sum is the sum of the averages, but the total error (i.e. the standard deviation) is not the sum of the standard deviations; it is the square root of the sum of their squares. In other words, when you add two uncorrelated noisy variables, the variance of the sum is the sum of the variances. Hence the error grows as the square root of the number of quantities you add, not linearly, as had been assumed for centuries. There is a great article in the American Scientist from 2007 called The Most Dangerous Equation giving a history of some calamities that resulted from not knowing how variances sum. The variance of a trait can thus be expressed as the sum of the genetic variance and the environmental variance, where environment just means everything that is not correlated with genetics. The heritability is the ratio of the genetic variance to the trait variance.
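A quick numerical check of how variances sum, using hypothetical genetic and environmental components with standard deviations 3 and 4 (numbers chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

genetic = rng.normal(0.0, 3.0, n)       # genetic component, sd = 3
environment = rng.normal(0.0, 4.0, n)   # uncorrelated environment, sd = 4

trait = genetic + environment
print(trait.std())                      # ≈ sqrt(3**2 + 4**2) = 5, not 3 + 4 = 7
print(genetic.var() / trait.var())      # heritability ≈ 9/25 = 0.36
```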

Continue reading

New paper on population genetics

James Lee and I just published a paper entitled “The causal meaning of Fisher’s average effect” in the journal Genetics Research. The paper can be obtained here. The paper is James’s brainchild; I just helped him out with some of the proofs. James’s take on the paper can be read here. The paper resolves a puzzle about the incommensurability of Ronald Fisher’s two definitions of the average effect, noted by the population geneticist D.S. Falconer three decades ago.

Fisher was well known for both brilliance and obscurity, and people have long puzzled over the meaning of some of his work. The concept of the average effect is extremely important for population genetics, but it is not very well understood. The field of population genetics was invented in the early twentieth century by luminaries such as Fisher, Sewall Wright, and J.B.S. Haldane to reconcile Darwin’s theory of evolution with Mendelian genetics. It is a very rich field that has been somewhat forgotten. People in mathematical, systems, computational, and quantitative biology really should be fully acquainted with it.

For those who are unacquainted with genetics, here is a quick primer for understanding the paper. Certain traits, like eye colour or the ability to roll your tongue, are affected by your genes. Prior to the discovery of the structure of DNA, it was not clear what genes were, except that they were the primary discrete unit of genetic inheritance. These days the term usually refers to some region on the genome. Mendel’s great insight was that genes come in pairs, which we now know correspond to the two copies of each of our 23 chromosomes. A variant of a particular gene is called an allele. Traits can depend on genes (or more accurately genetic loci) linearly or nonlinearly. Consider a quantitative trait that depends on a single genetic locus with two alleles, which we will call a and A. A person then has one of three possible genotypes: 1) homozygous in A (i.e. two A alleles), 2) heterozygous (one of each), or 3) homozygous in a (i.e. no A alleles). If the locus is linear, then plotting the measure of the trait (e.g. height) against the number of A alleles gives a straight line. For example, suppose allele A contributes a tenth of a centimetre to height. Then people with one A allele will be on average one tenth of a centimetre taller than those with no A alleles, and those with two A alleles will be two tenths taller. The familiar notion of dominance is a nonlinear effect. For example, the ability to roll your tongue is controlled by a single gene with a dominant rolling allele and a recessive nonrolling allele: if you have at least one rolling allele, you can roll your tongue.

The average effect of a gene substitution is the average change in a trait if one allele is substituted for another. A crucial part of population genetics is that you always need to consider averages. This is because genes are rarely completely deterministic; they can be influenced by the environment or by other genes. Thus, in order to define the effect of a gene, you need to average over these other influences. This leads to a somewhat ambiguous notion of average effect, and Fisher actually came up with two definitions. The first, and as James would argue the primary, definition is causal: the average effect of a gene if you experimentally substituted one allele for another prior to development and any influence of the environment. The second, regression definition is simply to plot the trait against the number of alleles, as in the example above; the slope is then the average effect. This second definition looks at the correlation between the gene and the trait, but as the old saying goes, “correlation does not imply causation”. For example, a genetic locus may have no effect on the trait itself but happen to be strongly correlated with a true causal locus (in the population you happen to be examining). Distinguishing genes that are merely associated with a trait from ones that are actually causal remains an open problem in genome-wide association studies.
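A small simulation can make the pitfall of the regression definition concrete. All the numbers here are hypothetical (the 0.1 effect size, the allele frequency, the 90% linkage proxy): a "tag" locus with zero causal effect still shows a substantial regression slope, simply because it is correlated with the causal locus.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Causal locus: each allele adds 0.1 to the trait (hypothetical effect size).
causal = rng.binomial(2, 0.5, n)
# Tag locus: no causal effect, but in strong LD with the causal locus
# (here: identical to the causal genotype 90% of the time).
tag = np.where(rng.random(n) < 0.9, causal, rng.binomial(2, 0.5, n))

trait = 0.1 * causal + rng.normal(0.0, 0.5, n)

def avg_effect(genotype, trait):
    """Regression definition: slope of trait on allele count."""
    return np.polyfit(genotype, trait, 1)[0]

print(avg_effect(causal, trait))   # ≈ 0.1, the true causal effect
print(avg_effect(tag, trait))      # nonzero despite zero causal effect
```

Substituting alleles at the tag locus in an experiment (the causal definition) would change nothing, yet the regression definition assigns it a sizable average effect; this is the gap between Fisher's two definitions that the paper examines.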

Our paper goes over some of the history and philosophy of the tension between these two definitions. We wrote the paper because the two definitions do not always agree, and we show under what conditions they do. The main reason they can disagree is that averages depend on the background over which you average. For a biallelic gene, there are two alleles but three genotypes, so the distribution of genotypes in a population is governed by two parameters: it is not enough to specify the frequency of one allele; you also need to know the correlation between alleles. The regression definition matches the causal definition if a particular function representing this correlation is held fixed while the experimental allele substitutions under the causal definition are carried out. We also considered the multi-allele and multi-locus case in as much generality as we could.

New paper in Nature Reviews Genetics

A. Coulon, C.C. Chow, R.H. Singer, D.R. Larson. Eukaryotic transcriptional dynamics: from single molecules to cell populations. Nat Rev Genet (2013).

Abstract | Transcriptional regulation is achieved through combinatorial interactions between regulatory elements in the human genome and a vast range of factors that modulate the recruitment and activity of RNA polymerase. Experimental approaches for studying transcription in vivo now extend from single-molecule techniques to genome-wide measurements. Parallel to these developments is the need for testable quantitative and predictive models for understanding gene regulation. These conceptual models must also provide insight into the dynamics of transcription and the variability that is observed at the single-cell level. In this Review, we discuss recent results on transcriptional regulation and also the models those results engender. We show how a non-equilibrium description informs our view of transcription by explicitly considering time- and energy-dependence at the molecular level.

The problem with “just deserts”

The blogosphere is aflutter over Harvard economist Greg Mankiw‘s recent article “Defending the One Percent“. Mankiw, the former chairman of the Council of Economic Advisers under Bush 43, mostly argues against the classic utilitarian case for redistribution – that a dollar is more useful to a poor person than to a rich one. However, near the end of the paper he proposes that an alternative basis for fair income distribution is the just deserts principle, where everyone is compensated according to how much they contribute. Mankiw believes that the recent surge in income inequality is due to changes in technology that favour superstars who create much more value for the economy than the rest. He then argues that the superstars are superstars because of heritable innate qualities like IQ, and not because the economy is rigged in their favour.

The problem with this idea is that genetic ability is a shared natural resource that came through a long process of evolution and everyone who has ever lived has contributed to this process. In many ways, we’re like a huge random Monte Carlo simulation where we randomly try out lots of different gene variants to see what works best. Mankiw’s superstars are the Monte Carlo trials that happen to be successful in our current system. However, the world could change and other qualities could become more important just as physical strength was more important in the pre-information age. The ninety-nine percent are reservoirs of genetic variability that we all need to prosper. Some impoverished person alive today may possess the genetic variant to resist some future plague and save humanity. She is providing immense uncompensated economic value. The just desserts world is really nothing more than a random world; a world where you are handed a lottery ticket and you hope you win. This would be fine but one shouldn’t couch it in terms of some deeper rationale. A world with a more equitable distribution is one where we compensate the less successful for their contribution to economic progress. However, that doesn’t mean we should have a world with completely equal income distribution. Unfortunately, the human mind needs incentives to try hard so for maximal economic growth, the lottery winners must always get at least a small bonus.