# New paper on GWAS

Genet Epidemiol. 2018 Dec;42(8):783-795. doi: 10.1002/gepi.22161. Epub 2018 Sep 24.

# The accuracy of LD Score regression as an estimator of confounding and genetic correlations in genome-wide association studies.

### Author information

1. Department of Psychology, University of Minnesota Twin Cities, Minneapolis, Minnesota.
2. Mathematical Biology Section, Laboratory of Biological Modeling, NIDDK, National Institutes of Health, Bethesda, Maryland.

### Abstract

To infer that a single-nucleotide polymorphism (SNP) either affects a phenotype or is in linkage disequilibrium with a causal site, we must have some assurance that any SNP-phenotype correlation is not the result of confounding with environmental variables that also affect the trait. In this study, we study the properties of linkage disequilibrium (LD) Score regression, a recently developed method for using summary statistics from genome-wide association studies to ensure that confounding does not inflate the number of false positives. We do not treat the effects of genetic variation as a random variable and thus are able to obtain results about the unbiasedness of this method. We demonstrate that LD Score regression can produce estimates of confounding at null SNPs that are unbiased or conservative under fairly general conditions. This robustness holds in the case of the parent genotype affecting the offspring phenotype through some environmental mechanism, despite the resulting correlation over SNPs between LD Scores and the degree of confounding. Additionally, we demonstrate that LD Score regression can produce reasonably robust estimates of the genetic correlation, even when its estimates of the genetic covariance and the two univariate heritabilities are substantially biased.

#### KEYWORDS:

causal inference; genetic correlation; heritability; population stratification; quantitative genetics

PMID: 30251275

DOI: 10.1002/gepi.22161
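For readers who have not seen the method, here is a minimal sketch of the regression at its core. It assumes the standard LD Score regression model, $E[\chi^2_j] = 1 + Na + (N h^2/M)\,\ell_j$, where $\ell_j$ is the LD Score of SNP $j$, $N$ is the sample size, $M$ is the number of SNPs, and the intercept in excess of 1 reflects confounding. The function name `ldsc_fit` and the synthetic inputs are made up for illustration. The actual method uses a weighted regression with block-jackknife standard errors, which this unweighted sketch leaves out.

```python
import numpy as np

def ldsc_fit(chi2, ld_scores, n_samples, n_snps):
    """Unweighted least-squares fit of chi-square statistics on LD Scores.

    Returns (intercept, h2). Under the assumed model the intercept is
    1 + N*a (values above 1 suggest confounding) and the slope is N*h2/M.
    """
    chi2 = np.asarray(chi2, dtype=float)
    ld_scores = np.asarray(ld_scores, dtype=float)
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(ld_scores, chi2, 1)
    h2 = slope * n_snps / n_samples
    return intercept, h2

if __name__ == "__main__":
    # Synthetic, noise-free example (all numbers are hypothetical).
    rng = np.random.default_rng(1)
    ld = rng.uniform(1.0, 200.0, size=1000)
    n, m, true_h2, true_intercept = 50_000, 1_000_000, 0.4, 1.05
    chi2 = true_intercept + (n * true_h2 / m) * ld
    print(ldsc_fit(chi2, ld, n, m))
```

With real summary statistics the points scatter widely around the line, so the interesting questions are about bias and variance of the fitted intercept and slope, which is what the paper analyzes.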

# Are humans successful because they are cruel?

According to Wikipedia, the class Mammalia has 29 orders (e.g. Carnivora), 156 families (e.g. Ursidae), 1258 genera (e.g. Ursus), and nearly 6000 species (e.g. the polar bear). Some orders, like Chiroptera (bats) and Rodentia, are very large, with many families, genera, and species. Some are really small, like Tubulidentata, whose only family, Orycteropodidae, has a single living species: the aardvark. Humans are in the order Primates, which has quite a few families and genera. Almost all primates live in tropical or subtropical areas and almost all of them have small populations, many of them endangered. The exception, of course, is humans, the only remaining species of the genus Homo. The other great apes in the family Hominidae (gorillas, orangutans, chimpanzees, and bonobos) are all in big trouble.

I think most people would attribute the incomparable success of humans to their resilience, intelligence, and ingenuity. However, another important factor could be their bottomless capacity for intentional cruelty. Although violence seems to have declined throughout history, as documented in Steven Pinker’s recent book, there is still no shortage of examples. Take a listen to this recent Econtalk podcast with Mike Munger on how the South rationalized slavery. It could very well be that what made modern humans dominate Earth and wipe out all the other Homo species along the way was not that they were more intelligent but that they were more cruel and rapacious. Neanderthals and Denisovans may have been happy sitting around the campfire after a hunt, while humans needed to raid every nearby tribe and kill them.

# New review paper on GWAS

Comput Struct Biotechnol J. 2015 Nov 23;14:28-34
Uncovering the Genetic Architectures of Quantitative Traits.
Lee JJ, Vattikuti S, Chow CC.

Abstract
The aim of a genome-wide association study (GWAS) is to identify loci in the human genome affecting a phenotype of interest. This review summarizes some recent work on conceptual and methodological aspects of GWAS. The average effect of gene substitution at a given causal site in the genome is the key estimand in GWAS, and we argue for its fundamental importance. Implicit in the definition of average effect is a linear model relating genotype to phenotype. The fraction of the phenotypic variance ascribable to polymorphic sites with nonzero average effects in this linear model is called the heritability, and we describe methods for estimating this quantity from GWAS data. Finally, we show that the theory of compressed sensing can be used to provide a sharp estimate of the sample size required to identify essentially all sites contributing to the heritability of a given phenotype.
KEYWORDS:
Average effect of gene substitution; Compressed sensing; GWAS; Heritability; Population genetics; Quantitative genetics; Review; Statistical genetics

# Paper on new myopia associated gene

The prevalence of nearsightedness, or myopia, has almost doubled in the past thirty years, from about 25% to 44%. No one knows why, but it is probably a gene-environment effect, like obesity. This recent paper in PLoS Genetics, APLP2 Regulates Refractive Error and Myopia Development in Mice and Humans, sheds light on the subject. It reports that a variant of the APLP2 gene is associated with myopia in people if they read a lot as children. Below is a figure of the result of a GWAS showing the increase in myopia (more negative is more myopic) with age for those with the risk variant (GA) and for time spent reading. The effect size is pretty large and a myopic effect of APLP2 is seen in monkeys, mice, and humans. Thus, I think that this result will hold up. The authors also show that the APLP2 gene is involved in retinal signaling, particularly in amacrine cells. It is thus consistent with the theory that myopia is the result of feedback from the retina during development. Hence, if you are constantly focused on near objects, the eye will develop to accommodate that. So maybe you should send your 7-year-old outside to play instead of sitting inside reading or playing video games.

# Brave New World

Read Steve Hsu’s Nautilus article on Super-Intelligence. If so-called IQ-related genetic variants are truly additive then his estimates are probably correct. His postulated being could possibly understand the fine details of any topic in a day or less. Instead of taking several years to learn enough differential geometry to develop General Relativity (which is what it took Einstein), a super-intelligence could perhaps do it in an afternoon or during a coffee break. Personally, I believe that nothing is free and that there will always be tradeoffs. I’m not sure what the cost of super-intelligence will be but there will likely be something. Variability in a population is always good for the population although not so great for each individual. An effective way to make a species go extinct is to remove variability. If pests had no genetic variability then it would be a simple matter to eliminate them with some toxin. Perhaps humans will be able to innovate fast enough to buffer themselves against environmental changes. Maybe cognitive variability can compensate for genetic variability. I really don’t know.

# Heritability in twins

Nature Genetics recently published a meta-analysis of virtually all twin studies over the last half century:

Tinca J C Polderman, Beben Benyamin, Christiaan A de Leeuw, Patrick F Sullivan, Arjen van Bochoven, Peter M Visscher & Danielle Posthuma. Meta-analysis of the heritability of human traits based on fifty years of twin studies. Nature Genetics 47, 702–709 (2015). doi:10.1038/ng.3285.

One of the authors, Peter Visscher, is perhaps the most influential and innovative thinker in human genetics at this moment and this paper continues his string of insightful results. The paper examined close to eighteen thousand traits in almost three thousand publications, representing fifteen million twins. The main goal was to use all the available data to recompute the heritability estimates for all of these traits. The first thing they found was that the traits were highly skewed towards psychiatric and cognitive phenotypes. People who study heritability are mostly interested in mental function. They then checked to see if there was any publication bias, where people only publish results with high heritability. They used multiple methods but they basically checked whether the estimated effect sizes were correlated with sample size, and they found no such correlation. Their most interesting result, which I will comment on more below, was that the average heritability across all traits was 0.488, which means that on average genes and environment contribute equally. However, heritability does vary widely across domains: eye, ear, nose and throat function were the most heritable, and social values were the least heritable. The largest influence of shared environmental effects was for bodily functions, infections, and social values. Hence, staying healthy depends on your environment, which is why a child who may be stunted in their impoverished village can thrive if moved to Minnesota. It also shows why attitudes on social issues can and do change. Finally, the paper addressed two important technical issues, which I will expand on below: 1) previous studies may be underestimating heritability and 2) heritability is mostly additive.

Heritability is the fraction of the variance of a trait due to genetic variance. Here is a link to a previous post explaining heritability although as my colleague Vipul Periwal points out, it is full of words and has no equations. Briefly, there are two types of heritability: broad sense and narrow sense. Broad sense heritability, $H^2 = Var(G)/Var(P)$, is the total genetic variance divided by the phenotypic variance. Narrow sense heritability, $h^2 = Var(A)/Var(P)$, is the linear or additive genetic variance divided by the phenotypic variance. A linear regression of the standardized trait of the children against the average of the standardized trait of the parents is an estimate of the narrow sense heritability. It captures the linear part while the broad sense heritability includes the linear and nonlinear contributions, which include dominance and gene-gene effects (epistasis). To estimate (narrow-sense) heritability from twins, Polderman et al. used what is called Falconer’s formula and took twice the difference in the correlation of a trait between identical (monozygotic) and fraternal (dizygotic) twins ($h^2 = 2 (r_{MZ}-r_{DZ})$). The idea is that any difference between identical twins must be environmental (nongenetic), whereas fraternal twins share on average only half of their additive genetic variation, so the gap between the two correlations captures half of the additive genetic variance. They also used another Falconer formula to estimate the shared environmental variance, which is $c^2 = 2 r_{DZ} - r_{MZ}$, since this “cancels out” the genetic part. Their paper then boiled down to doing a meta-analysis of $r_{DZ}$ and $r_{MZ}$. Meta-analysis is a nuanced topic but it boils down to weighting results from different studies by some estimate of how large the errors are. They used the DerSimonian-Laird random-effects approach, which is implemented in R.

The Falconer formulas estimate the narrow sense heritability but many of the previous studies were interested in nonadditive genetic effects as well. Typically, what they did was to use either an ACE (additive, common environmental, unique environmental) or an ADE (additive, dominance, unique environmental) model. They decided on which model to use by looking at the sign of $c^2$. If it was positive then they used ACE and if it was negative they used ADE. Polderman et al. showed that this decision algorithm biases the heritability estimate downward.
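To make the pipeline concrete, here is a rough Python sketch of the two steps: pool the twin correlations across studies with a DerSimonian-Laird random-effects estimate, then plug the pooled values into the Falconer formulas. The function names and the example numbers are hypothetical, and I am assuming that raw correlations are pooled using the large-sample variance $(1-r^2)^2/(n-1)$; the paper's actual implementation in R may differ in these details.

```python
import numpy as np

def dersimonian_laird(estimates, variances):
    """Pool study estimates with the DerSimonian-Laird random-effects method.

    Returns (pooled estimate, standard error, between-study variance tau^2).
    """
    y = np.asarray(estimates, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                                   # fixed-effect weights
    y_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fixed) ** 2)            # Cochran's Q statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    # Method-of-moments estimate of tau^2, truncated at zero.
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0
    w_star = 1.0 / (v + tau2)                     # random-effects weights
    pooled = np.sum(w_star * y) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return pooled, se, tau2

def falconer(r_mz, r_dz):
    """Falconer estimates: h2 = 2(r_MZ - r_DZ) and c2 = 2 r_DZ - r_MZ."""
    return 2.0 * (r_mz - r_dz), 2.0 * r_dz - r_mz

def correlation_variance(r, n_pairs):
    """Large-sample variance of a correlation (an assumption of this sketch)."""
    r = np.asarray(r, dtype=float)
    return (1.0 - r ** 2) ** 2 / (np.asarray(n_pairs, dtype=float) - 1.0)

if __name__ == "__main__":
    # Hypothetical twin correlations and pair counts from three studies.
    r_mz, n_mz = [0.62, 0.58, 0.66], [400, 250, 900]
    r_dz, n_dz = [0.33, 0.30, 0.35], [420, 260, 880]
    pooled_mz, _, _ = dersimonian_laird(r_mz, correlation_variance(r_mz, n_mz))
    pooled_dz, _, _ = dersimonian_laird(r_dz, correlation_variance(r_dz, n_dz))
    print(falconer(pooled_mz, pooled_dz))
```

The random-effects weights $1/(v_i + \tau^2)$ are what distinguish this from a simple inverse-variance average: when studies disagree more than their sampling errors allow, large studies are down-weighted relative to a fixed-effect analysis.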

If the heritability of a trait is mostly additive then you would expect that $r_{MZ}=2 r_{DZ}$, and they found that this was observed in 69% of the traits. Of the top 20 traits, eight showed nonadditivity and these mostly related to behavioral and cognitive functions. Of these eight, seven showed that the correlation between monozygotic twins was smaller than twice that of dizygotic twins, which implies that nonlinear genetic effects tend to work against each other. This makes sense to me since it would seem that as you start to accumulate additive variants that increase a phenotype you will start to hit rate-limiting effects that will tend to dampen these effects. In other words, it seems plausible that the major nonlinearity in genetics is a saturation effect.
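As a rough illustration of how one might check the additive expectation for a single trait, the sketch below computes the deviation $d = r_{MZ} - 2 r_{DZ}$ and a z-score for it. This is not the test used in the paper. It assumes the MZ and DZ samples are independent and uses the large-sample variance $(1-r^2)^2/(n-1)$ for each correlation; the function name `additivity_z` is made up.

```python
import math

def additivity_z(r_mz, n_mz, r_dz, n_dz):
    """Deviation d = r_MZ - 2*r_DZ and its z-score under independence.

    d is zero under a purely additive model with no shared environment.
    """
    var_mz = (1.0 - r_mz ** 2) ** 2 / (n_mz - 1)
    var_dz = (1.0 - r_dz ** 2) ** 2 / (n_dz - 1)
    d = r_mz - 2.0 * r_dz
    # Var(r_MZ - 2 r_DZ) = Var(r_MZ) + 4 Var(r_DZ) for independent samples.
    se = math.sqrt(var_mz + 4.0 * var_dz)
    return d, d / se

if __name__ == "__main__":
    # Hypothetical correlations and numbers of twin pairs.
    print(additivity_z(0.50, 1000, 0.40, 1000))
```

A negative $d$ corresponds to the case described above, where the monozygotic correlation is smaller than twice the dizygotic one.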

The most striking result was that the average heritability across all of the traits was about 0.5. Is an average value of 0.5 obvious or deep? I honestly do not know. When I told theoretical neuroscientist Fred Hall this result, he thought it was obvious and should be expected from maximum entropy considerations, which would assume that the distribution of $h^2$ would be uniform or at least symmetric about 0.5. This sounds plausible but, as I have asserted many times, biology is the result of an exponential amplification of exponentially unlikely events. Traits that are heritable are by definition those that have variation across the population. Some traits, like the number of limbs, have no variance but are entirely genetic. Other traits, like your favourite sports team, are highly variable but not genetic even though there is a high probability that your favourite team will be the same as your parent’s or sibling’s favourite team. Traits that are highly heritable include height and cognitive function. Personality, on the other hand, is not highly heritable. One of the biggest puzzles in population genetics is why there is any variability in a trait to start with. Natural selection prunes out variation exponentially fast so if any gene is selected for, it should be fixed very quickly. Hence, it seems equally plausible that traits with high variability would have low heritability. The studied traits were also biased towards mental function and different domains have different heritabilities. Thus, if the traits were sampled differently, the averaged heritability could easily deviate from 0.5. I therefore think the null hypothesis should be that the $h^2 = 0.5$ value is a coincidence, but I’m open to a deeper explanation.

A software tool to investigate these results can be found here. An enterprising student could do some subsampling of the traits to see how well the 0.5 average would hold up if our historical interests in phenotypes had been different.
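Here is a minimal sketch of what such a subsampling exercise could look like, assuming you have a table of per-trait heritabilities labeled by domain. The function name `resampled_mean_h2`, the domain labels, and every number in the example are invented for illustration and are not taken from the paper.

```python
import numpy as np

def resampled_mean_h2(h2, domains, domain_weights, n_draws=1000, seed=0):
    """Average heritability when traits are redrawn with domain-specific weights.

    Each draw resamples the traits with replacement, with the probability of
    picking a trait proportional to the weight of its domain. Returns the mean
    over draws and a 95% percentile interval of the resampled averages.
    """
    rng = np.random.default_rng(seed)
    h2 = np.asarray(h2, dtype=float)
    w = np.array([domain_weights[d] for d in domains], dtype=float)
    p = w / w.sum()
    size = len(h2)
    means = np.array([
        h2[rng.choice(size, size=size, replace=True, p=p)].mean()
        for _ in range(n_draws)
    ])
    return means.mean(), np.percentile(means, [2.5, 97.5])

if __name__ == "__main__":
    # Made-up heritabilities for six traits in two hypothetical domains.
    h2 = [0.70, 0.55, 0.60, 0.30, 0.35, 0.40]
    domains = ["cognitive"] * 3 + ["social"] * 3
    # Equal interest in both domains versus a world obsessed with social traits.
    print(resampled_mean_h2(h2, domains, {"cognitive": 1.0, "social": 1.0}))
    print(resampled_mean_h2(h2, domains, {"cognitive": 1.0, "social": 5.0}))
```

Changing the weights is a crude stand-in for a different history of scientific interest; how far the average moves away from 0.5 gives a feel for how much of that number comes from which traits happened to be studied.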

Thanks go to Rick Gerkin for suggesting this topic.

# Paper on new version of Plink

The paper describing the updated version of the genome analysis software tool Plink has just been published.

Second-generation PLINK: rising to the challenge of larger and richer datasets
Christopher C Chang, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, and James J Lee

GigaScience 2015, 4:7 doi:10.1186/s13742-015-0047-8

Abstract
Background
PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format.

Findings
To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, $O(\sqrt{n})$-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0).

Conclusions
The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

Keywords: GWAS; Population genetics; Whole-genome sequencing; High-density SNP genotyping; Computational statistics

This project started out with us trying to do some genomic analysis that involved computing various distance metrics on sequence space. Programming virtuoso Chris Chang stepped in and decided to write some code to speed up the computations. His program, originally called wdist, was so good and fast that we kept asking him to put in more capabilities. Eventually, he had basically replicated the suite of functions that Plink performed, so he contacted Shaun Purcell, the author of Plink, to ask if he could just call his code Plink too, and Shaun agreed. We then ran a series of tests on various machines to check the speed-ups compared to the original Plink and gcta. If you do any GWAS at all, I highly recommend you check out Plink 1.9.