Here is the backstory for the paper in the previous post. Immediately after the human genome project was completed a decade ago, people set out to discover the genes responsible for diseases. Traditionally, genes had been discovered by tracing the incidence of a disease in families that exhibit it. This was a painstaking process – a classic example being the discovery of the gene for Huntington’s disease. The completion of the human genome project provided a simpler approach. The first step was to create what is known as a haplotype map or HapMap. This is a catalog of all the common genome differences between people. The genomes of any two humans differ by only about 0.1%. These differences include Single Nucleotide Polymorphisms (SNPs), where a given base (A, C, G, T) is changed, and Copy Number Variations (CNVs), where there are differences in the number of copies of segments of DNA. There are about 10 million common SNPs.
The Genome Wide Association Study (GWAS) usually looks for differences in SNPs between people with and without a disease. The working hypothesis at that time was that common diseases like Alzheimer’s disease or Type II diabetes should be due to differences in common SNPs (i.e. common disease, common variant). People thought that the genes for many of these diseases would be found within a decade. Towards this end, companies like Affymetrix and Illumina began making microarray chips, which consist of tiny wells with snippets of complementary DNA for a small segment of the sequence around each SNP. SNP variants are found by seeing what types of DNA fragments in genetic samples (e.g. saliva) get bound to the complementary strands in the array. A classic GWAS then considers the differences in SNP variants observed in disease and control groups. For any finite sample, there will always be fluctuations so a statistical criterion must be established to evaluate whether a variant is significant. This is done by computing the p-value, i.e. the probability that a difference at least as large as the one observed would arise by random chance. However, in a set of a million SNPs, which is standard for a chip, the probability of at least one SNP being randomly associated with a disease is roughly a million fold higher if the SNPs are all independent (i.e. for small probabilities it’s approximately the sum of the probabilities of each event; see Bonferroni correction), so the significance threshold for each SNP must be made correspondingly stricter. However, since SNPs are not always independent, this is a very conservative criterion.
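As a rough illustration of the Bonferroni argument above, the sketch below divides an assumed overall error rate of 0.05 by the million SNPs on a chip and then checks the chance of at least one false positive under the independence assumption. The function names and the 0.05 figure are illustrative choices, not taken from any particular study.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the overall error rate near alpha."""
    return alpha / n_tests


def familywise_error(p_per_test, n_tests):
    """Chance of at least one false positive among independent tests."""
    return 1.0 - (1.0 - p_per_test) ** n_tests


alpha = 0.05        # assumed overall (family-wise) error rate
n_snps = 1_000_000  # chip size quoted in the text

threshold = bonferroni_threshold(alpha, n_snps)
uncorrected = familywise_error(alpha, n_snps)     # using 0.05 for every SNP
corrected = familywise_error(threshold, n_snps)   # using the corrected cutoff

print(threshold, uncorrected, corrected)
```

Without the correction a false positive is essentially guaranteed, while with it the overall error rate stays just under the assumed 0.05.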
The first set of results from GWAS started to be published shortly after I arrived at the NIH in 2004 and they weren’t very promising. A small number of SNPs were found for some common diseases but they only conferred a very small increase in risk even when the diseases were thought to be highly heritable. Heritability is a measure of the proportion of phenotypic variation between people explained by genetic variation. (I’ll give a primer on heritability in a following post.) The one notable exception was age-related macular degeneration, for which five SNPs were found to be associated with a 2 to 3 fold increase in risk. The difference between the heritability of a disease as measured by classical genetic methods and what was found to be explained by SNPs came to be known as the “missing heritability” problem. I thought this was an interesting puzzle but didn’t think much about it until I saw the results from this paper (summarized on Steve Hsu’s blog), which showed that different European subpopulations could be separated by just projecting onto two principal components of 300,000 SNPs from 6000 people. (Principal components are the eigenvectors of the 6000 by 6000 correlation matrix of the 300,000 dimensional SNP vectors of each person.) This was a revelation to me because it implied that although single genes did not carry much weight, the collection of all the genes certainly did. I decided right then that I would show that the missing heritability for obesity was contained in the “pattern” of SNPs rather than in single SNPs. I did this without really knowing anything at all about genetics or heritability.
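The principal component idea can be sketched on toy data. The example below is only an illustration under strong assumptions: 40 simulated people in two groups, 200 SNPs, and allele frequencies of 0.2 versus 0.8 at every SNP, which is far more differentiated than real subpopulations. It builds the person-by-person correlation matrix of standardized SNP vectors and finds its leading eigenvector by power iteration. A real analysis would use a linear algebra library and the actual genotype data.

```python
import random


def standardize_columns(genotypes):
    """Center and scale each SNP (column); drop SNPs with no variation."""
    n = len(genotypes)
    columns = []
    for snp in zip(*genotypes):
        mean = sum(snp) / n
        var = sum((g - mean) ** 2 for g in snp) / n
        if var > 0:
            sd = var ** 0.5
            columns.append([(g - mean) / sd for g in snp])
    return [list(row) for row in zip(*columns)]  # back to people x SNPs


def relationship_matrix(z):
    """Person-by-person correlation matrix of standardized SNP vectors."""
    m = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / m for zj in z] for zi in z]


def leading_component(k, n_iter=200):
    """Top eigenvector of a symmetric matrix by power iteration."""
    v = [1.0] + [0.0] * (len(k) - 1)
    for _ in range(n_iter):
        w = [sum(kij * vj for kij, vj in zip(row, v)) for row in k]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


rng = random.Random(0)  # fixed seed so the toy data are reproducible


def person(freq, n_snps=200):
    """Genotype = number of copies (0, 1 or 2) of one allele at each SNP."""
    return [(rng.random() < freq) + (rng.random() < freq) for _ in range(n_snps)]


# Hypothetical toy populations: 20 people each, very different allele frequencies.
genotypes = [person(0.2) for _ in range(20)] + [person(0.8) for _ in range(20)]
pc1 = leading_component(relationship_matrix(standardize_columns(genotypes)))
print([round(x, 2) for x in pc1])
```

In this toy setup the sign of each person's entry in the first component indicates which group they came from, even though no single SNP was singled out.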
The first thing I did was to see if I could obtain data. I immediately discovered that the results of all NIH funded GWAS are posted on the database dbGaP after the study authors have published their initial findings. This is a fantastic resource and an example of how government can be useful. For example, the results from the famous Framingham Heart Study are on there. It took me nearly a year to figure out how to navigate the required protocols to get the data. During that year, I started to read up on the subject. I highly recommend the textbook by Lynch and Walsh. I also recruited my fellow Shashaank Vattikuti to work on the project with me. I thought he would be ideal because he always thinks big and is completely unafraid of working on risky and challenging things.
My first thought was that the missing heritability was due to epistasis, which means that genes interact so that two genes may not do much alone but confer a huge effect when combined. This is the hypothesis of a recent paper by Eric Lander. However, as I read more I began to believe that while epistasis is probably crucial for biology it is not for heritability. The reason is quite simple – traits are determined by genotypes but we inherit genes. To explain this, I’ll need to review some basics of genetics. Humans are diploids (as are most mammals), which means we have two sets of chromosomes so that each SNP is paired. The pair could be homozygous, meaning the same, or heterozygous, meaning different. SNP variants are called alleles so consider a bi-allelic SNP with an A allele and a C allele. (Do not confuse the two chromosomes with the two strands of the DNA molecule, in which case A is always paired with T and C with G.) Hence, there are two homozygous genotypes, AA and CC, and one heterozygous genotype, AC, at one SNP. If only the number of copies of a given allele matters for a phenotype or fitness, then the allele is called additive. So if allele A carries a value of x then your phenotype is 0 for CC, x for AC and 2x for AA. If the combination of alleles is non-additive at a given SNP (or any gene locus), then the SNP has dominance. For example, if A is completely dominant then CC has value zero while AC and AA both have value x. Non-additivity across SNPs or gene loci is called epistasis. Additivity is the same as linearity. Non-additive effects are departures from perfect linearity in a linear regression of the phenotype on the genotype.
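The last point, that non-additive effects show up as departures from a straight line, can be checked with a few lines of code. This is a minimal sketch assuming a single SNP, x = 1, and the three genotypes weighted equally (in a real population they would be weighted by their frequencies). The function names are made up for the illustration.

```python
def genotype_value(genotype, x, model="additive"):
    """Phenotypic value of a bi-allelic genotype such as 'AC'."""
    n_a = genotype.count("A")          # copies of allele A: 0, 1 or 2
    if model == "additive":
        return n_a * x                 # CC -> 0, AC -> x, AA -> 2x
    if model == "dominant":
        return x if n_a > 0 else 0.0   # A completely dominant: AC equals AA
    raise ValueError(model)


def linear_fit(xs, ys):
    """Least-squares slope and intercept of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def residuals(model, x=1.0):
    """Departures from the best straight line through the three genotypes."""
    counts = [0, 1, 2]                 # number of A alleles in CC, AC, AA
    values = [genotype_value(g, x, model) for g in ("CC", "AC", "AA")]
    slope, intercept = linear_fit(counts, values)
    return [v - (intercept + slope * c) for c, v in zip(counts, values)]


additive_residuals = residuals("additive")
dominant_residuals = residuals("dominant")
print(additive_residuals, dominant_residuals)
```

The additive model leaves no residual, while complete dominance leaves a residual at every genotype. Even so, the straight line still captures part of the dominant effect, which is why the linear term can account for a sizable share of the variance when the underlying biology is not additive.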
The combinatorics explode exponentially if you consider the genotype of combinations of SNPs. Children inherit one set of chromosomes from each parent. If the parents are not inbred and breeding is more or less random, then the genotypes will become less correlated in each subsequent generation. Two alleles are said to be in linkage disequilibrium if they are correlated. They are in equilibrium, or not in linkage disequilibrium, if they are uncorrelated. (I always find this double negative to be confusing.) Hence, over long periods of time allele combinations (within and between loci) will tend to be broken up. Thus in order for a given allele to persist over time it must confer some evolutionary advantage on its own. This is why I believe epistasis probably plays a small role in heritability.
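The claim that allele combinations get broken up over time can be made quantitative with a standard population genetics result that is not derived in this post: under random mating, the linkage disequilibrium coefficient D between two loci shrinks by a factor of (1 − r) each generation, where r is the recombination rate between them. The sketch below just evaluates that formula. The starting value of D and the recombination rates are arbitrary illustrative numbers.

```python
def ld_after(d0, recomb_rate, generations):
    """Linkage disequilibrium D after some generations of random mating."""
    return d0 * (1.0 - recomb_rate) ** generations


d0 = 0.25  # assumed starting value (the maximum possible D)

# Unlinked loci (r = 0.5): the correlation halves every generation.
unlinked = [ld_after(d0, 0.5, t) for t in range(5)]

# Tightly linked loci (r = 0.01): the decay is much slower but still steady.
linked_100 = ld_after(d0, 0.01, 100)

print(unlinked, linked_100)
```

Combinations of alleles on different chromosomes are scrambled within a handful of generations, while tightly linked ones can persist for hundreds.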
In the summer of 2010, Peter Visscher’s group published a groundbreaking paper (Yang et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42:565-569, 2010) that showed that half of the known heritability of height could be found in the GWAS SNPs if taken together as a whole. Up to that point, Shashaank and I were trying to show using principal components that people with similar BMI are more clustered but this paper told me how I should think of the problem and I immediately switched gears. It took Shashaank and me a few months to write the code to run the algorithms to repeat this analysis for BMI. Juen Guo helped us manage and organize the databases for the project. However, just as we were about to submit our paper in 2011, Visscher’s group published another important paper (Yang J et al., Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet 43: 519–525, 2011), which applied the analysis to BMI. We then redid the analysis for more metabolic syndrome traits and also computed the genetic correlation between traits. We then found an error in our code during the revision process and had to redo all the computations, which take a lot of computer time. I had no appreciation for what memory was good for until I did this project and had to invert and manipulate huge matrices. I now max out my 64 GB of memory in many computations and could easily find use for a TB of memory. I will go through the details of the method and our computation in a following post.
5 thoughts on “Heritability and GWAS”
There are two conceptual issues which are distinct but I’m not sure if you are addressing both or just mainly the first. The first is recognizing that a few single genes will rarely be very important for heritability, so the genome must be considered as a high-dimensional vector. The second is that (possibly) this high-dimensional vector has few useful low-dimensional representations that help explain heritability, so for example the first several principal components will also not really help explain heritability. There is sufficient linkage disequilibrium (i.e. lack of correlation?) across individuals in the population that you need to operate on the high-dimensional representation from beginning to end.
Another question is how you get a preliminary estimate of what the variance explained might be, before doing the most sophisticated analyses. Are there useful methods for calculating upper or lower bounds on this problem that don’t require days of computation and 100s of GBs of memory? For example, what does the canonical correlation coefficient between the matrix of individual genomes and the matrix of individual metabolic traits tell you about heritability, or at least about how well they are related? I assume due to computational constraints and the lack of correlations, most non-linear approaches are not helpful, but are there non-linearities somewhere in the problem that you must consider to answer the problem correctly? I am thinking by analogy to generalized linear models, where one only deals with linear combinations of terms, and then e.g. exponentiates to get a prediction.
I would say a priori, that heritability does not imply how many causal variants there are. It could be a few with large effects, many with small effects or a combination. The data seems to indicate that it is many of small effect. We also find empirically, that the data cannot be explained by a few principal components. In fact, dominant principal components are usually taken out because they indicate population structure that may not have anything to do with the trait. The classic example is that if Chinese people are more genetically related to each other than to Europeans then one could infer a genetic basis for chopstick use. Using the first few principal components as a fixed effect gets rid of these trivial correlation effects.
As to your second point, I’m not sure how to answer. You could always use family pedigree to estimate genetic relationship to compute heritability. You can also use what is called Haseman-Elston regression where you essentially regress the phenotypic covariance against the genetic covariance and the slope is related to the heritability. This is very fast although you still have to compute the genetic relationship matrix which requires handling large matrices. This is akin to your example of taking the canonical correlation coefficient.
Not sure what you mean by nonlinear approaches. We are estimating narrow sense heritability, which is the linear part of heritability. All the nonlinear genetic interactions are lumped in with the noise. You can use a generalized linear model (e.g. logistic regression) to estimate heritability for case-control situations.
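For readers who want to see the Haseman-Elston idea mentioned above in concrete form, here is a minimal sketch. It assumes the genetic relationship matrix has already been computed, standardizes the phenotypes, and regresses the product of each pair's phenotypes on their genetic relationship. The function name and the toy numbers are made up, and the slope is only loosely "related to" heritability in the sense described above. This is not the code used for the paper.

```python
def haseman_elston(phenotypes, grm):
    """Slope of y_i * y_j on A_ij over all pairs i < j (y standardized)."""
    n = len(phenotypes)
    mean = sum(phenotypes) / n
    sd = (sum((p - mean) ** 2 for p in phenotypes) / n) ** 0.5
    y = [(p - mean) / sd for p in phenotypes]

    relatedness, products = [], []
    for i in range(n):
        for j in range(i + 1, n):      # off-diagonal pairs only
            relatedness.append(grm[i][j])
            products.append(y[i] * y[j])

    m = len(relatedness)
    mx, mz = sum(relatedness) / m, sum(products) / m
    sxz = sum((a - mx) * (b - mz) for a, b in zip(relatedness, products))
    sxx = sum((a - mx) ** 2 for a in relatedness)
    return sxz / sxx


# Hypothetical toy input: four people forming two related pairs.
toy_phenotypes = [1.2, 0.9, -0.8, -1.3]
toy_grm = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
print(haseman_elston(toy_phenotypes, toy_grm))
```

Four people are far too few to give a meaningful estimate. The point is only that the regression itself is cheap once the relationship matrix exists, since it is a single pass over the pairs.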