Here is the backstory for the paper in the previous post. Immediately after the human genome project was completed a decade ago, people set out to discover the genes responsible for diseases. Traditionally, genes had been discovered by tracing the incidence of a disease in families that exhibit it. This was a painstaking process, a classic example being the discovery of the gene for Huntington’s disease. The completion of the human genome project provided a simpler approach. The first step was to create what is known as a haplotype map or HapMap. This is a catalog of all the common genome differences between people. The genomes of any two humans differ by only about 0.1%. These differences include Single Nucleotide Polymorphisms (SNPs), where a given base (A, C, G, T) is changed, and Copy Number Variations (CNVs), where there are differences in the number of copies of segments of DNA. There are about 10 million common SNPs.
The Genome Wide Association Study (GWAS) usually looks for differences in SNPs between people with and without diseases. The working hypothesis at that time was that common diseases like Alzheimer’s disease or Type II diabetes should be due to differences in common SNPs (i.e. common disease, common variant). People thought that the genes for many of these diseases would be found within a decade. Towards this end, companies like Affymetrix and Illumina began making microarray chips, which consist of tiny wells with snippets of complementary DNA (cDNA) of a small segment of the sequence around each SNP. SNP variants are found by seeing what types of DNA fragments in genetic samples (e.g. saliva) get bound to the complementary strands in the array. A classic GWAS then considers the differences in SNP variants observed in disease and control groups. For any finite sample, there will always be fluctuations, so a statistical criterion must be established to evaluate whether a variant is significant. This is done by computing the p-value, the probability that a difference at least as large as the one observed would occur by random chance. However, in a set of a million SNPs, which is standard for a chip, the probability of at least one SNP being randomly associated with a disease is roughly a million-fold higher if the SNPs are all independent (i.e. it’s approximately the sum of the probabilities of each event). The usual remedy is to divide the significance threshold for each SNP by the number of SNPs tested (see Bonferroni correction). Since SNPs are not always independent, this is a very conservative criterion.
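To make the multiple-testing arithmetic concrete, here is a minimal sketch of the Bonferroni correction. The family-wise level of 0.05 is a conventional choice that I am assuming for illustration, and the function names are made up. The second function assumes all tests are independent, which real SNPs are not.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


def family_wise_error_rate(p_cutoff: float, n_tests: int) -> float:
    """Chance of at least one false positive among n_tests independent tests."""
    return 1.0 - (1.0 - p_cutoff) ** n_tests


n_snps = 1_000_000  # SNPs on a typical chip, as in the text
alpha = 0.05        # assumed family-wise significance level

cutoff = bonferroni_threshold(alpha, n_snps)

print("per-SNP cutoff:", cutoff)
print("family-wise error rate with the cutoff:", family_wise_error_rate(cutoff, n_snps))
print("family-wise error rate without it:", family_wise_error_rate(alpha, n_snps))
```

Without the correction, a false positive somewhere on the chip is essentially guaranteed. With it, the chance stays just under the chosen level.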
The first set of results from GWAS started to be published shortly after I arrived at the NIH in 2004 and they weren’t very promising. A small number of SNPs were found for some common diseases but they only conferred a very small increase in risk, even when the diseases were thought to be highly heritable. Heritability is a measure of the proportion of phenotypic variation between people explained by genetic variation. (I’ll give a primer on heritability in a following post.) The one notable exception was age-related macular degeneration, for which five SNPs were found to be associated with a 2 to 3 fold increase in risk. The difference between the heritability of a disease as measured by classical genetic methods and what was found to be explained by SNPs came to be known as the “missing heritability” problem. I thought this was an interesting puzzle but didn’t think much about it until I saw the results from this paper (summarized on Steve Hsu’s blog), which showed that different European subpopulations could be separated by just projecting onto two principal components of 300,000 SNPs of 6000 people. (Principal components are the eigenvectors of the 6000 by 6000 correlation matrix of the 300,000 dimensional SNP vectors of each person.) This was a revelation to me because it implied that although single genes did not carry much weight, the collection of all the genes certainly did. I decided right then that I would show that the missing heritability for obesity was contained in the “pattern” of SNPs rather than in single SNPs. I did this without really knowing anything at all about genetics or heritability.
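Here is a toy sketch of what projecting people onto a principal component means. The genotypes are made up (six people, eight SNPs, two obvious groups), and everything here is my own assumption, not the method of the paper mentioned above. In particular, I center each SNP and use a person-by-person covariance-like matrix instead of a true correlation matrix. I also use simple power iteration instead of a full eigendecomposition.

```python
import math

# Hypothetical genotypes: copies (0, 1 or 2) of one allele at each of 8 SNPs.
genotypes = [
    [2, 2, 1, 2, 0, 0, 1, 0],  # people 0-2: first made-up subpopulation
    [2, 1, 2, 2, 0, 1, 0, 0],
    [1, 2, 2, 2, 1, 0, 0, 0],
    [0, 0, 1, 0, 2, 2, 1, 2],  # people 3-5: second made-up subpopulation
    [0, 1, 0, 0, 2, 1, 2, 2],
    [1, 0, 0, 0, 1, 2, 2, 2],
]


def center_columns(rows):
    """Subtract each SNP's mean across people."""
    n_people = len(rows)
    means = [sum(col) / n_people for col in zip(*rows)]
    return [[g - m for g, m in zip(row, means)] for row in rows]


def person_by_person_matrix(rows):
    """Average product of centered genotypes for every pair of people."""
    n_snps = len(rows[0])
    return [[sum(a * b for a, b in zip(r1, r2)) / n_snps for r2 in rows]
            for r1 in rows]


def leading_eigenvector(matrix, n_iter=100):
    """Power iteration: repeatedly multiply by the matrix and renormalize."""
    # The start vector must not be orthogonal to the leading eigenvector.
    v = [1.0] + [0.0] * (len(matrix) - 1)
    for _ in range(n_iter):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


similarity = person_by_person_matrix(center_columns(genotypes))
pc1 = leading_eigenvector(similarity)

for person, score in enumerate(pc1):
    print(person, round(score, 3))
```

In this toy data the sign of each person's score on the first component separates the two groups, even though no single SNP does so cleanly.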
The first thing I did was to see if I could obtain data. I immediately discovered that the results of all NIH-funded GWAS are posted on the database dbGaP after the study authors have published their initial findings. This is a fantastic resource and an example of how government can be useful. For example, the results from the famous Framingham Heart Study are on there. It took me nearly a year to figure out how to navigate the required protocols to get the data. During that year, I started to read up on the subject. I highly recommend the textbook by Lynch and Walsh. I also recruited my fellow Shashaank Vattikuti to work on the project with me. I thought he would be ideal because he always thinks big and is completely unafraid of working on risky and challenging things.
My first thought was that the missing heritability was due to epistasis, which means that genes interact so that two genes may not do much alone but confer a huge effect when combined. This is the hypothesis of a recent paper by Eric Lander. However, as I read more I began to believe that while epistasis is probably crucial for biology, it is not for heritability. The reason is quite simple: traits are determined by genotypes but we inherit genes. To explain this, I’ll need to review some basics of genetics. Humans are diploid (as are most mammals), which means we have two sets of chromosomes so that each SNP is paired. The pair could be homozygous, meaning the two copies are the same, or heterozygous, meaning they are different. SNP variants are called alleles, so consider a bi-allelic SNP with an A allele and a C allele. (Do not confuse the two chromosomes with the two strands of the DNA molecule, in which case A is always paired with T and C with G.) Hence, there are two homozygous genotypes, AA and CC, and one heterozygous genotype, AC, at one SNP. If only the number of copies of a given allele matters for a phenotype or fitness, then the allele is called additive. So if allele A carries a value of x then your phenotype is 0 for CC, x for AC and 2x for AA. If the combination of alleles is non-additive at a given SNP (or any gene locus), then the SNP has dominance. For example, if A is completely dominant then AC and AA both have value x while CC has value zero. Non-additivity across SNPs or gene loci is called epistasis. Additivity is the same as linearity. Non-additive effects are departures from perfect linearity in a linear regression of the phenotype on the genotype.
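The last point can be checked with a tiny sketch: regress the genotype value on the number of A alleles and look at what is left over. The numbers are illustrative (x = 1), and the fit is an unweighted least-squares line through the three genotypes. That is a simplifying assumption of mine, since a real analysis would weight genotypes by how common they are in the population.

```python
def regression_residuals(values):
    """Fit value = intercept + slope * (number of A alleles); return the residuals."""
    genotypes = ["CC", "AC", "AA"]
    counts = [g.count("A") for g in genotypes]   # 0, 1, 2 copies of A
    ys = [values[g] for g in genotypes]
    n = len(counts)
    mean_c = sum(counts) / n
    mean_y = sum(ys) / n
    sxx = sum((c - mean_c) ** 2 for c in counts)
    sxy = sum((c - mean_c) * (y - mean_y) for c, y in zip(counts, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_c
    return [y - (intercept + slope * c) for c, y in zip(counts, ys)]


x = 1.0  # illustrative value carried by the A allele

# Additive: only the number of A alleles matters.
additive = {"CC": 0.0, "AC": x, "AA": 2 * x}

# Non-additive: the heterozygote AC has the same value as the AA homozygote.
non_additive = {"CC": 0.0, "AC": x, "AA": x}

additive_residuals = regression_residuals(additive)
non_additive_residuals = regression_residuals(non_additive)

print("additive residuals:    ", additive_residuals)
print("non-additive residuals:", non_additive_residuals)
```

The additive genotype values fall exactly on a line, so the residuals vanish. In the non-additive case the line still captures part of the effect, and the nonzero residuals are the dominance deviations.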
The number of possible genotypes explodes exponentially if you consider combinations of SNPs. Children inherit one set of chromosomes from each parent. If the parents are not inbred and breeding is more or less random, then the genotypes will become less correlated in each subsequent generation. Two alleles are said to be in linkage disequilibrium if they are correlated. They are in equilibrium or not in linkage disequilibrium if they are uncorrelated. (I always find this double negative to be confusing.) Hence, over long periods of time allele combinations (within and between loci) will tend to be broken up. Thus, in order for a given allele to persist over time, it must confer some evolutionary advantage on its own. This is why I believe epistasis probably plays a small role in heritability.
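Here is a rough sketch of how fast allele combinations get broken up. A standard measure of linkage disequilibrium between an allele A at one locus and an allele B at another is D = p_AB - p_A p_B, which is zero when the two are uncorrelated. Under random mating, D shrinks by a factor of (1 - r) each generation, where r is the recombination fraction between the loci. The haplotype frequencies and recombination fractions below are made up for illustration.

```python
def ld_coefficient(p_ab: float, p_a: float, p_b: float) -> float:
    """D = p_AB - p_A * p_B; zero means the two alleles are uncorrelated."""
    return p_ab - p_a * p_b


def ld_after_generations(d0: float, recomb_fraction: float, generations: int) -> float:
    """Expected D after some generations of random mating."""
    return d0 * (1.0 - recomb_fraction) ** generations


# Hypothetical frequencies: A and B each at 50%, found together 40% of the time.
d0 = ld_coefficient(p_ab=0.4, p_a=0.5, p_b=0.5)

for t in (0, 1, 5, 10, 50):
    unlinked = ld_after_generations(d0, 0.5, t)   # e.g. loci on different chromosomes
    linked = ld_after_generations(d0, 0.01, t)    # assumed tightly linked loci
    print(t, round(unlinked, 6), round(linked, 6))
```

For unlinked loci the correlation halves every generation, so it is essentially gone within ten generations. Tightly linked loci hold on to it for much longer.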
In the summer of 2010, Peter Visscher’s group published a groundbreaking paper (Yang et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42:565-569, 2010) that showed that half of the known heritability of height could be found in the GWAS SNPs if taken together as a whole. Up to that point, Shashaank and I were trying to show using principal components that people with similar BMI are more clustered, but this paper told me how I should think of the problem and I immediately switched gears. It took Shashaank and me a few months to write the code to run the algorithms to repeat this analysis for BMI. Juen Guo helped us manage and organize the databases for the project. However, just as we were about to submit our paper in 2011, Visscher’s group published another important paper (Yang J et al., Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet 43: 519–525, 2011), which applied the analysis to BMI. We then redid the analysis for more metabolic syndrome traits and also computed the genetic correlation between traits. We then found an error in our code during the revision process and had to redo all the computations, which take a lot of computer time. I had no appreciation for what memory was good for until I did this project and had to invert and manipulate huge matrices. I now max out my 64 GB of memory in many computations and could easily find use for a TB of memory. I will go through the details of the method and our computation in a following post.