# New paper on GWAS

2018 Dec;42(8):783-795. doi: 10.1002/gepi.22161. Epub 2018 Sep 24.

# The accuracy of LD Score regression as an estimator of confounding and genetic correlations in genome-wide association studies.

### Author information

1
Department of Psychology, University of Minnesota Twin Cities, Minneapolis, Minnesota.
2
Mathematical Biology Section, Laboratory of Biological Modeling, NIDDK, National Institutes of Health, Bethesda, Maryland.

### Abstract

To infer that a single-nucleotide polymorphism (SNP) either affects a phenotype or is linkage disequilibrium with a causal site, we must have some assurance that any SNP-phenotype correlation is not the result of confounding with environmental variables that also affect the trait. In this study, we study the properties of linkage disequilibrium (LD) Score regression, a recently developed method for using summary statistics from genome-wide association studies to ensure that confounding does not inflate the number of false positives. We do not treat the effects of genetic variation as a random variable and thus are able to obtain results about the unbiasedness of this method. We demonstrate that LD Score regression can produce estimates of confounding at null SNPs that are unbiased or conservative under fairly general conditions. This robustness holds in the case of the parent genotype affecting the offspring phenotype through some environmental mechanism, despite the resulting correlation over SNPs between LD Scores and the degree of confounding. Additionally, we demonstrate that LD Score regression can produce reasonably robust estimates of the genetic correlation, even when its estimates of the genetic covariance and the two univariate heritabilities are substantially biased.

#### KEYWORDS:

causal inference; genetic correlation; heritability; population stratification; quantitative genetics

PMID:

30251275

DOI:

10.1002/gepi.22161

# The low carb war continues

Last month, a paper in the British Journal of Medicine on the effect of low carb diets on energy expenditure, with senior author David Ludwig, made a big splash in the popular press and also instigated a mini-Twitter war. The study, which cost somewhere in the neighborhood of 12 million dollars, addressed the general question of whether a person will burn more energy on a low carbohydrate diet compared to an average or high carb diet. In particular, the study looked at the time period after weight loss where people are susceptible to regaining weight. The argument is that it will be easier to maintain weight loss on a low carb diet since you will be burning more energy. Recent intensive studies by my colleague Kevin Hall and others have found that low carb diets had little effect if any on energy expenditure, so this paper was somewhat of a surprise and gave hope to low carb aficionados. However, Kevin found some possible flaws, which he points out in an official response to BMJ and a BioRxiv paper, which then prompted a none-too-pleased response from Ludwig, which you can follow on Twitter. The bottom line is that the low carb effect size depends on the baseline point you compare too. In the original study plan, the baseline point was chosen to be energy expenditure prior to the weight loss phase of the study. In the publication, the baseline point was changed to after the weight loss but before the weight loss maintenance phase. If the original baseline was chosen, the low carb effect is no longer significant. The authors claim that they were blinded to the data and changed the baseline for technical reasons so this did not represent a case of p-hacking where one tries multiple combinations until something significant turns up. It seems pretty clear to me that low carbs do not have much of a metabolic effect but that is not to say that low carb diets are not effective. The elephant in the room is still appetite. It is possible that you are simply less hungry on a low carb diet and thus you eat less. Also, when you eliminate a whole category of food, there is just less food to eat. That could be the biggest effect of all.

# The tragedy of low probability events

We live in an age of fear and yet life (in the US at least) is the safest it has ever been. Megan McArdle blames coddling parents and the media in a Washington Post column. She argues that cars and swimming pools are much more dangerous than school shootings and kidnappings yet we mostly ignore the former and obsess about the latter. However, to me dying from an improbable event is just so much more tragic than dying from an expected one. I would be much less despondent meeting St. Peter at the Pearly Gates if I happened to expire from cancer or heart disease than if I were to be hit by an asteroid while weeding my garden. We are so scared now because we have never been safer. We would fear terrorist attacks less if they were more frequent. This is the reason that I would never want a major increase in lifespan. I most certainly would like to last long enough to see my children become independent but anything beyond that is bonus time. Nothing could be worse to me than immortality. The pain of any tragedy would be unbearable. Life would consist of an endless accumulation of sad memories. The way out is to forget but that to me is no different from death. What would be the point of living forever if you were to erase much of it. What would a life be if you forgot the people and things that you loved? To me that is no life at all.

# Sampling from a probability distribution

When simulating any system with randomness, sampling from a probability distribution is necessary. Usually, you’ll just need to sample from a normal or uniform distribution and thus can use a built-in random number generator. However, for the time when a built-in function does not exist for your distribution, here’s a simple algorithm.

Let’s say you have some probability density function (PDF) $\rho(x)$ on some domain $[x_{\min},x_{\max}]$ and you want to generate a set of numbers that follows this probability law. This is actually very simple to do although those new to the field may not know how. Generally, any programming language or environment has a built-in random number generator for real numbers (up to machine precision) between 0 and 1. Thus, what you want to do is to find a way to transform these random numbers, which are uniformly distributed on the domain $[0,1]$ to the random numbers you want. The first thing to notice is that the cumulative distribution function (CDF) for your PDF, $P(x) = \int_{x_{\min}}^x \rho(x') dx'$ is a function that ranges over the interval $[0,1]$, since it is a probability. Thus, the algorithm is: 1) generate a random number $r$ between o and 1, and 2) plug it into the inverse of $P(x)$. The numbers you get out satisfy your distribution (i.e. $P(x)=r \Rightarrow x = P^{-1}(r)$).

If you’re lucky enough to have a closed form expression of the inverse of the CDF $P^{-1}(r)$ then you can just plug the random numbers directly into your function, otherwise you have to do a little more work. If you don’t have a closed form expression for the CDF then you can just solve the equation $P(y)-r=0$ for $x$. Newton’s method will generally converge very quickly.  You just loop over the command x = x – (P(x)-r)/rho(x) until you reach the tolerance you want. If you don’t have a closed form expression of the CDF then the simplest way is construct the CDF numerically by computing the N dimensional vector $v[j]=\sum_{i=0}^j \rho(ih+x_{\min}) h$, for some discretization parameter $h$ such that $x_{\max}= Nh+x_{\min}$ .  If the PDF is defined on the entire real line then you’ll have to decide on what you want the minimum $x_{\min}$ and maximum (through $N$) $x_{\max}$ to be. If the distribution has thin tails, then you just need to go out until the probability is smaller than your desired error tolerance. You then find the index $j$ such that $v[j]$ is closest to $r$ (i.e. in Julia you could use findmin(abs(v[j]-r))) and your random number is $x = hj+x_{\min}$.

Here is why the algorithm works. What we are doing is transforming the PDF of $r$, which we can call $u(r)$, to the desired PDF for $x$. The transformation we use is $r = P(x)$. Thus, $u(r) dr = \rho(x) dx$ or $\rho(x) = u(r) dr/dx$. Now, $u(r)$ is just 1 on the interval $[0,1]$ and $dr/dx$ is the PDF of $x$ as expected. Thus, inverting the CDF for uniform random numbers always gives you a new set of random numbers that satisfies the new PDF. Trivial perhaps, but really useful.

Erratum: 2019-04-05.  It was pointed out to me that my algorithm involving direct summation of the CDF had some errors.  I hope they have now been corrected.

# R vs Matlab

It has now been almost half a year since I switched from Matlab to open source software and I’ve been amazed at how easy the transition has been. I had planned to replace Matlab with Python, Julia, and R but I have found that R and Julia have been sufficient for my requirements. Maxima is also more than adequate to replace Mathematica. I have become particularly attached to R especially after I started to use R Studio as the interface. I had only used R before as just a statistics tool but it really is a complete programming platform like Matlab. It has very nice graphics capabilities and I find the language very nice to program in. I really like its use of lists where I can pile sets of any type and any size into one object. I also like how R Studio can save your work into Projects, which keeps the whole environment and history in one place. I can then switch between multiple projects and everything comes back. The only thing I miss from Matlab is the command completion history feature, where I could easily find a previous command by just typing the first few letters. Also, I haven’t quite figured out how to output R data into a text file seamlessly yet. I seem to always get extraneous row or column information. I use Julia when I want to write a code that needs to loop fast but for everything else I’ve been reaching for R.

# New paper on genomics

James Lee and I have a new paper out: Lee and Chow, Conditions for the validity of SNP-based heritability estimation, Human Genetics, 2014. As I summarized earlier (e.g. see here and here), heritability is a measure of the proportion of the variance of some trait (like height or cholesterol levels) due to genetic factors. The classical way to estimate heritability is to regress standardized (mean zero, standard deviation one) phenotypes of close relatives against each other. In 2010, Jian Yang, Peter Visscher and colleagues developed a way to estimate heritability directly from the data obtained in Genome Wide Association Studies (GWAS), sometimes called GREML.  Shashaank Vattikuti and I quickly adopted this method and computed the heritability of metabolic syndrome traits as well as the genetic correlations between the traits (link here). Unfortunately, our methods section has a lot of typos but the corrected Methods with the Matlab code can be found here. However, I was puzzled by the derivation of the method provided by the Yang et al. paper.  This paper is our resolution.  The technical details are below the fold.

# Paper on compressed sensing and genomics

New paper on the arXiv. The next step after the completion of the Human Genome Project, was the search for genes associated with diseases such as autism or diabetes. However, after spending hundreds of millions of dollars, we find that there are very few common variants of genes with large effects. This doesn’t mean that there aren’t genes with large effect. The growth hormone gene definitely has a large effect on height. It just means that variations of genes that are common among people have small effects on the phenotype. Given the results of Fisher, Wright, Haldane and colleagues, this was probably expected as the most likely scenario and recent results measuring narrow-sense heritability directly from genetic markers (e.g. see this) confirms this view.

Current GWAS microarrays consider about a million or two markers and this is increasing rapidly. Narrow-sense heritability refers to the additive or linear genetic variance, which means the phenotype is given by the linear model $y= Z\beta + \eta$, where $y$ is the phenotype vector, $Z$ is the genotype matrix, $\beta$ are all the genetic effects we want to recover, and $\eta$ are all the nonadditive components including environmental effects. This is a classic linear regression problem. The problem comes when the number of coefficients $\beta$ far exceeds the number of people in your sample, which is the case for genomics. Compressed sensing is a field of high dimensional statistics that addresses this specific problem. People such as David Donoho, Emmanuel Candes and Terence Tao have proven under fairly general conditions that if the number of nonzero coefficients are sparse compared to the number samples, then the effects can be completely recovered using L1 penalized optimization algorithms such as the lasso or approximate message passing. In this paper, we show that these ideas can be applied to genomics.

Here is Steve Hsu’s summary of the paper

Application of compressed sensing to genome wide association studies and genomic selection

(Submitted on 8 Oct 2013)

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.

 Comments: Main paper (27 pages, 4 figures) and Supplement (5 figures) combined Subjects: Genomics (q-bio.GN); Applications (stat.AP) Cite as: arXiv:1310.2264 [q-bio.GN] (or arXiv:1310.2264v1 [q-bio.GN] for this version)