People have been justly anguished by the recent gross mishandling of the Ebola patients in Texas and Spain and the risible lapse in security at the White House. The conventional wisdom is that these demonstrations of incompetence are a recent phenomenon signifying a breakdown in governmental competence. However, I think that incompetence has always been the norm; any semblance of competence in the past is due mostly to luck and the fact that people do not exploit incompetent governance because of a general tendency towards docile cooperativity (as well as the incompetence of bad actors). In many ways, it is quite amazing how reliably citizens of the US and other OECD members respect traffic laws, pay their bills and service their debts on time. This is a huge boon to an economy since excessive resources do not need to be spent on enforcing rules. This does not hold in some if not many developing nations where corruption is a major problem (cf. this op-ed in the Times today). In fact, it is still an evolutionary puzzle as to why agents cooperate for the benefit of the group even though it is an advantage for an individual to defect. Cooperativity is also not likely to be all genetic since immigrants tend to follow the social norms of their adopted country, although there could be a self-selection effect here. However, the social pressure to cooperate could evaporate quickly if there is a perception of a lack of enforcement, as evidenced by looting following natural disasters or the abundance of insider trading in the finance industry. Perhaps, as suggested by the work of Karl Sigmund and other evolutionary theorists, cooperativity is a transient phenomenon and will eventually be replaced by the evolutionarily more stable state of noncooperativity. In that sense, perceived incompetence could be rising but not because we are less able but because we are less cooperative.
Archive for the ‘Genetics’ Category
A linear system is one where the whole is precisely the sum of its parts. You can know how different parts will act together simply by knowing how they act in isolation. A nonlinear function lacks this nice property. For example, consider a linear function f(x) = ax. It satisfies the property that f(x + y) = f(x) + f(y). The function of the sum is the sum of the functions. One important point to note is that what is considered to be the paragon of linearity, namely a line on a graph, i.e. f(x) = ax + b, is not linear, since f(x + y) = a(x + y) + b, which is not equal to f(x) + f(y). The y-intercept destroys the linearity of the line. A line is instead affine, which is to say a linear function shifted by a constant. A linear differential equation has the form

dx/dt = A x

where x can be in any dimension (with A a matrix of matching dimension). Solutions of a linear differential equation can be multiplied by any constant and added together.
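This superposition property can be checked numerically. Here is a minimal sketch (my illustration, not from the post) that integrates dx/dt = Ax with forward Euler for two initial conditions and verifies that the corresponding linear combination of solutions is itself the solution for the combined initial condition. The particular matrix A is an arbitrary choice.

```python
import numpy as np

A = np.array([[0.0, 1.0], [-2.0, -0.5]])  # an arbitrary linear system (damped oscillator)

def euler(x0, T=5.0, h=1e-3):
    """Integrate dx/dt = A x with forward Euler starting from x0."""
    x = np.array(x0, dtype=float)
    for _ in range(int(T / h)):
        x = x + h * (A @ x)
    return x

x1 = euler([1.0, 0.0])
x2 = euler([0.0, 1.0])
combo = euler(3.0 * np.array([1.0, 0.0]) + 2.0 * np.array([0.0, 1.0]))

# Superposition: the solution for the combined initial condition equals
# the same combination of the individual solutions.
assert np.allclose(3.0 * x1 + 2.0 * x2, combo)
```

Because each Euler step is itself a linear map, the identity holds to machine precision; for a nonlinear equation it would fail immediately.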
Linearity is thus essential for engineering. If you are designing a bridge then you simply add as many struts as you need to support the predicted load. Electronic circuit design is also linear in the sense that you combine as many logic circuits as you need to achieve your end. Imagine if bridge mechanics were completely nonlinear so that you had no way to predict how a bunch of struts would behave when assembled together. You would then have to test each combination to see how they work. Now, real bridges are not entirely linear but the deviations from pure linearity are mild enough that you can make predictions or have rules of thumb of what will work and what will not.
Chemistry is an example of a system that is highly nonlinear. You can’t know how a compound will act just based on the properties of its components. For example, you can’t simply mix glass and steel together to get a strong and hard transparent material. You need to be clever in coming up with something like the Gorilla Glass used in iPhones. This is why engineering new drugs is so hard. Although organic chemistry is quite sophisticated in its ability to synthesize various compounds, there is no systematic way to generate molecules of a given shape or potency. We really don’t know how molecules will behave until we create them. Hence, what is usually done in drug discovery is to screen a large number of molecules against specific targets and hope. I was at a computer-aided drug design Gordon conference a few years ago and you could cut the despair and angst with a knife.
That is not to say that engineering is completely hopeless for nonlinear systems. Most nonlinear systems act linearly if you perturb them gently enough. That is why linear regression is so useful and prevalent. Hence, even though the global climate system is a highly nonlinear system, it probably acts close to linear for small changes. Thus I feel confident that we can predict the increase in temperature for a 5% or 10% change in the concentration of greenhouse gases but much less confident in what will happen if we double or treble them. How linearly a system will act depends on how close it is to a critical or bifurcation point. If the climate is very far from a bifurcation then it could act linearly over a large range, but if we’re near a bifurcation then who knows what will happen if we cross it.
I think biology is an example of a nonlinear system with a wide linear range. Recent research has found that many complex traits and diseases like height and type 2 diabetes depend on a large number of linearly acting genes (see here). Their genetic effects are additive. Any nonlinear interactions they have with other genes (i.e. epistasis) are tiny. That is not to say that there are no nonlinear interactions between genes. It only suggests that common variations are mostly linear. This makes sense from an engineering and evolutionary perspective. It is hard to do either in a highly nonlinear regime. You need some predictability if you make a small change. If changing an allele had completely different effects depending on what other genes were present then natural selection would be hard pressed to act on it.
However, you also can’t have a perfectly linear system because you can’t make complex things. An exclusive OR logic circuit cannot be constructed without a threshold nonlinearity. Hence, biology and engineering must involve “the linear combination of nonlinear gadgets”. A bridge is the linear combination of highly nonlinear steel struts and cables. A computer is the linear combination of nonlinear logic gates. This occurs at all scales as well. In biology, you have nonlinear molecules forming a linear genetic code. Two nonlinear mitochondria may combine mostly linearly in a cell and two liver cells may combine mostly linearly in a liver. This effective linearity is why organisms can have a wide range of scales. A mouse liver is thousands of times smaller than a human one but their functions are mostly the same. You also don’t need very many nonlinear gadgets to have extreme complexity. The genes between organisms can be mostly conserved while the phenotypes are widely divergent.
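The XOR claim can be made concrete. This sketch (my illustration, not from the post) shows that the best purely affine fit to the XOR truth table still makes errors, while a linear combination of thresholded linear units reproduces it exactly — a literal "linear combination of nonlinear gadgets".

```python
import numpy as np

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Best affine fit w.x + b via least squares
A = np.hstack([X, np.ones((4, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_error = np.max(np.abs(A @ coef - y))  # cannot reach zero for XOR

def step(z):
    """Threshold nonlinearity: 1 if z > 0, else 0."""
    return (z > 0).astype(float)

def xor(a, b):
    """XOR as OR minus AND, each a thresholded linear unit."""
    or_gate = step(a + b - 0.5)
    and_gate = step(a + b - 1.5)
    return or_gate - and_gate

assert linear_error > 0.4                    # the affine model fails
assert np.all(xor(X[:, 0], X[:, 1]) == y)    # threshold gates succeed
```

The least-squares fit ends up predicting 0.5 on every input, an error of 0.5 on all four rows, while the two-gate construction is exact.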
James Lee and I have a new paper out: Lee and Chow, Conditions for the validity of SNP-based heritability estimation, Human Genetics, 2014. As I summarized earlier (e.g. see here and here), heritability is a measure of the proportion of the variance of some trait (like height or cholesterol levels) due to genetic factors. The classical way to estimate heritability is to regress standardized (mean zero, standard deviation one) phenotypes of close relatives against each other. In 2010, Jian Yang, Peter Visscher and colleagues developed a way to estimate heritability directly from the data obtained in Genome Wide Association Studies (GWAS), sometimes called GREML. Shashaank Vattikuti and I quickly adopted this method and computed the heritability of metabolic syndrome traits as well as the genetic correlations between the traits (link here). Unfortunately, our methods section has a lot of typos but the corrected Methods with the Matlab code can be found here. However, I was puzzled by the derivation of the method provided by the Yang et al. paper. This paper is our resolution. The technical details are below the fold.
You should read this article in Esquire about the advent of personalized cancer treatment for a heroic patient named Stephanie Lee. Here is Steve Hsu’s blog post. The cost of sequencing is almost at the point where everyone can have their normal and tumor cells completely sequenced to look for mutations like Stephanie. The team at Mt. Sinai Hospital in New York described in the article inserted some of the mutations into a fruit fly and then checked to see what drugs killed it. The Stephanie Event was the oncology board meeting at Sinai where the treatment for Stephanie Lee’s colon cancer, which had spread to the liver, was discussed. They decided on a standard protocol but would use the individualized therapy based on the fly experiments if the standard treatments failed. The article was beautifully written, combining a compelling human story with science.
New paper on the arXiv. The next step after the completion of the Human Genome Project was the search for genes associated with diseases such as autism or diabetes. However, after spending hundreds of millions of dollars, we find that there are very few common variants of genes with large effects. This doesn’t mean that there aren’t genes with large effects. The growth hormone gene definitely has a large effect on height. It just means that variations of genes that are common among people have small effects on the phenotype. Given the results of Fisher, Wright, Haldane and colleagues, this was probably expected as the most likely scenario, and recent results measuring narrow-sense heritability directly from genetic markers (e.g. see this) confirm this view.
Current GWAS microarrays consider about a million or two markers and this is increasing rapidly. Narrow-sense heritability refers to the additive or linear genetic variance, which means the phenotype is given by the linear model y = Xβ + e, where y is the phenotype vector, X is the genotype matrix, β contains all the genetic effects we want to recover, and e contains all the nonadditive components including environmental effects. This is a classic linear regression problem. The problem comes when the number of coefficients far exceeds the number of people in your sample, which is the case for genomics. Compressed sensing is a field of high-dimensional statistics that addresses this specific problem. People such as David Donoho, Emmanuel Candes and Terence Tao have proven under fairly general conditions that if the nonzero coefficients are sparse compared to the number of samples, then the effects can be completely recovered using L1-penalized optimization algorithms such as the lasso or approximate message passing. In this paper, we show that these ideas can be applied to genomics.
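As a toy illustration of the idea (a sketch of my own, not the paper’s code), the following recovers a sparse effect vector from fewer samples than markers using a simple L1-penalized solver (ISTA, a proximal-gradient cousin of the lasso). The dimensions, seed, and penalty are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 200, 5             # samples, markers, nonzero effects
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
beta[support] = rng.choice([-1.0, 1.0], size=k)
y = X @ beta                      # noiseless phenotypes

def ista(X, y, lam, iters=5000):
    """Minimize 0.5*||y - Xb||^2 + lam*||b||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(X, 2) ** 2           # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        g = X.T @ (X @ b - y)               # gradient of the quadratic part
        z = b - g / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return b

b_hat = ista(X, y, lam=1e-3)
recovered = set(np.argsort(-np.abs(b_hat))[:k])
assert recovered == set(support)   # the true sparse support is found
```

Even though there are twice as many markers as people, the five true effects come out of the L1 solver cleanly — exactly the regime the paper analyzes.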
Here is Steve Hsu’s summary of the paper
Application of compressed sensing to genome wide association studies and genomic selection
We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.
Most people have an intuitive notion of heritability being the genetic component of why close relatives tend to resemble each other more than strangers. More technically, heritability is the fraction of the variance of a trait within a population that is due to genetic factors. This is the pedagogical post on heritability that I promised in a previous post on estimating heritability from genome wide association studies (GWAS).
One of the most important facts about uncertainty and something that everyone should know but often doesn’t is that when you add two imprecise quantities together, while the average of the sum is the sum of the averages of the individual quantities, the total error (i.e. standard deviation) is not the sum of the standard deviations but the square root of the sum of the square of the standard deviations or variances. In other words, when you add two uncorrelated noisy variables, the variance of the sum is the sum of the variances. Hence, the error grows as the square root of the number of quantities you add and not linearly as it had been assumed for centuries. There is a great article in the American Scientist from 2007 called The Most Dangerous Equation giving a history of some calamities that resulted from not knowing about how variances sum. The variance of a trait can thus be expressed as the sum of the genetic variance and environmental variance, where environment just means everything that is not correlated to genetics. The heritability is the ratio of the genetic variance to the trait variance.
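A quick numerical check of this fact (a sketch with arbitrary standard deviations): adding independent noise with standard deviations 3 and 4 gives a sum with standard deviation sqrt(3² + 4²) = 5, not 3 + 4 = 7.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
a = rng.normal(0.0, 3.0, N)   # noisy quantity with sd = 3
b = rng.normal(0.0, 4.0, N)   # independent noisy quantity with sd = 4

sd_sum = np.std(a + b)
# Variances add for uncorrelated variables, so the sd of the sum
# is sqrt(3**2 + 4**2) = 5, not the sum of sds (7).
assert abs(sd_sum - 5.0) < 0.1
```

The same decomposition underlies heritability: trait variance splits into genetic variance plus environmental variance, and heritability is the genetic share of that sum.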
James Lee and I just published a paper entitled “The causal meaning of Fisher’s average effect” in the journal Genetics Research. The paper can be obtained here. This paper is the brainchild of James; I just helped him out with some of the proofs. James’s take on the paper can be read here. The paper resolves a puzzle about the incommensurability of Ronald Fisher’s two definitions of the average effect noted by population geneticist D.S. Falconer three decades ago.
Fisher was well known for both brilliance and obscurity and people have long puzzled over the meaning of some of his work. The concept of the average effect is extremely important for population genetics but it is not very well understood. The field of population genetics was invented in the early twentieth century by luminaries such as Fisher, Sewall Wright, and JBS Haldane to reconcile Darwin’s theory of evolution with Mendelian genetics. This is a very rich field that has been somewhat forgotten. People in mathematical, systems, computational, and quantitative biology really should be fully acquainted with the field.
For those who are unacquainted with genetics, here is a quick primer to understand the paper. Certain traits, like eye colour or the ability to roll your tongue, are affected by your genes. Prior to the discovery of the structure of DNA, it was not clear what genes were, except that they were the primary discrete unit of genetic inheritance. These days it usually refers to some region on the genome. Mendel’s great insight was that genes come in pairs, which we now know to correspond to the two copies of each of the 23 chromosomes we have. A variant of a particular gene is called an allele. Traits can depend on genes (or more accurately genetic loci) linearly or nonlinearly. Consider a quantitative trait that depends on a single genetic locus that has two alleles, which we will call a and A. This means that a person will have one of three possible genotypes: 1) homozygous in A (i.e. have two A alleles), 2) heterozygous (have one of each), or 3) homozygous in a (i.e. have no A alleles). If the locus is linear then if you plot the measure of the trait (e.g. height) against the number of A alleles, you will get a straight line. For example, suppose allele A contributes a tenth of a centimetre to height. Then people with one A allele will be on average one tenth of a centimetre taller than those with no A alleles and those with two A alleles will be two tenths taller. The familiar notion of dominance is a nonlinear effect. So for example, the ability to roll your tongue is controlled by a single gene. There is a dominant rolling allele and a recessive nonrolling allele. If you have at least one rolling allele, you can roll your tongue.
The average effect of a gene substitution is the average change in a trait if one allele is substituted for another. A crucial part of population genetics is that you always need to consider averages. This is because genes are rarely completely deterministic. They can be influenced by the environment or other genes. Thus, in order to define the effect of the gene, you need to average over these other influences. This then leads to a somewhat ambiguous definition of average effect and Fisher actually came up with two. The first, and as James would argue the primary definition, is a causal one in that we want to measure the average effect of a gene if you experimentally substituted one allele for another prior to development and influence by the environment. A second, correlational definition would simply be to plot the trait against the number of alleles as in the example above. The slope would then be the average effect. This second definition looks at the correlation between the gene and the trait but as the old saying goes “correlation does not imply causation”. For example, a genetic locus may have no effect on the trait but happen to be strongly correlated with a true causal locus (in the population you happen to be examining). Distinguishing genes that are merely associated with a trait from ones that are actually causal remains an open problem in genome wide association studies.
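The regression definition can be illustrated with a simulation (mine, not from the paper): give each copy of allele A an additive effect of 0.1 cm plus environmental noise, and the fitted slope of trait on allele count recovers the average effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
freq_A = 0.5
# Number of A alleles per person: 0, 1 or 2 (Hardy-Weinberg proportions)
count = rng.binomial(2, freq_A, n)
# Additive effect of 0.1 per A allele, plus non-genetic noise
trait = 0.1 * count + rng.normal(0.0, 0.5, n)

# The regression (correlational) average effect is the fitted slope
slope, intercept = np.polyfit(count, trait, 1)
assert abs(slope - 0.1) < 0.03
```

In this purely additive, single-locus toy the two definitions coincide; the paper’s point is that with linkage between loci the regression slope and the causal substitution effect can come apart.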
Our paper goes over some of the history and philosophy of the tension between these two definitions. We wrote the paper because these two definitions do not always agree and we show under what conditions they do agree. The main reason they don’t agree is that averages will depend on the background over which you average. For a biallelic gene, there are 2 alleles but 3 genotypes. The distribution of alleles in a population is governed by two parameters. It’s not enough to specify the frequency of one allele. You also need to know the correlation between alleles. The regression definition matches the causal definition if a particular function representing this correlation is held fixed while the experimental allele substitutions under the causal definition are carried out. We also considered the multi-allele and multi-loci case in as much generality as we could.
A Coulon, CC Chow, RH Singer, DR Larson Eukaryotic transcriptional dynamics: from single molecules to cell populations. Nat Gen Reviews (2013).
Abstract | Transcriptional regulation is achieved through combinatorial interactions between regulatory elements in the human genome and a vast range of factors that modulate the recruitment and activity of RNA polymerase. Experimental approaches for studying transcription in vivo now extend from single-molecule techniques to genome-wide measurements. Parallel to these developments is the need for testable quantitative and predictive models for understanding gene regulation. These conceptual models must also provide insight into the dynamics of transcription and the variability that is observed at the single-cell level. In this Review, we discuss recent results on transcriptional regulation and also the models those results engender. We show how a non-equilibrium description informs our view of transcription by explicitly considering time- and energy-dependence at the molecular level.
The blogosphere is aflutter over Harvard economist and former chairman of the Council of Economic Advisers to Bush 43, Greg Mankiw‘s recent article “Defending the One Percent“. Mankiw’s paper mostly argues against the classic utilitarian reason for redistribution – a dollar is more useful to a poor person than a rich one. However, near the end of the paper he proposes that an alternative basis for fair income distribution should be the just deserts principle where everyone is compensated according to how much they contribute. Mankiw believes that the recent surge in income inequality is due to changes in technology that favour superstars who create much more value for the economy than the rest. He then argues that the superstars are superstars because of heritable innate qualities like IQ and not because the economy is rigged in their favour.
The problem with this idea is that genetic ability is a shared natural resource that came through a long process of evolution and everyone who has ever lived has contributed to this process. In many ways, we’re like a huge random Monte Carlo simulation where we randomly try out lots of different gene variants to see what works best. Mankiw’s superstars are the Monte Carlo trials that happen to be successful in our current system. However, the world could change and other qualities could become more important just as physical strength was more important in the pre-information age. The ninety-nine percent are reservoirs of genetic variability that we all need to prosper. Some impoverished person alive today may possess the genetic variant to resist some future plague and save humanity. She is providing immense uncompensated economic value. The just deserts world is really nothing more than a random world; a world where you are handed a lottery ticket and you hope you win. This would be fine but one shouldn’t couch it in terms of some deeper rationale. A world with a more equitable distribution is one where we compensate the less successful for their contribution to economic progress. However, that doesn’t mean we should have a world with completely equal income distribution. Unfortunately, the human mind needs incentives to try hard so for maximal economic growth, the lottery winners must always get at least a small bonus.
The US Supreme Court ruled today that human genes cannot be patented. Here is the link to the New York Times article. The specific case regards Myriad Genetics, which held a patent that controlled the rights to all tests for the BRCA1 and BRCA2 genes implicated in breast cancer. The patent essentially blocked most research on the BRCA genes. The immediate effect will be that genetic testing will become cheaper and more widespread. People will argue that not allowing genes to be patented will discourage further innovation. I doubt it. Most discoveries, like genes, come from basic federally funded research. Any company can now develop a test for any newly discovered gene. Patent law has been broken for decades and this is just one small step to correcting it.
I want to stress that there is nothing wrong with the results in the paper. The mistakes are typographical in the sense that the formulas in the methods were transcribed incorrectly from our code. It was just pointed out to me that the errata could be misinterpreted. What happened was that MS Word kept turning our equations into pictures so we couldn’t edit them, so we retyped them over and over again. Transcription errors then started to creep in and we were so adapted to the equations that we didn’t notice anymore. Not a good excuse but unfortunately that is what happened.
I just discovered some glaring typographical errors in the Methods Section of our recent paper: Heritability and genetic correlations explained by common SNPS for metabolic syndrome traits. The corrected Methods can be obtained here. I will see if I can include an errata on the PLoS Genetics website as well. The results in the paper are fine.
edited on May 9 to clear up a possible misconception.
I gave a short talk at the NIH yesterday describing our recent work on heritability and GWAS. The slides are here.
Here is the backstory for the paper in the previous post. Immediately after the human genome project was completed a decade ago, people set out to discover the genes responsible for diseases. Traditionally, genes had been discovered by tracing the incidence in families that exhibit the disease. This was a painstaking process – a classic example being the discovery of the gene for Huntington’s disease. The completion of the human genome project provided a simpler approach. The first thing was to create what is known as a haplotype map or HapMap. This is a catalog of all the common genome differences between people. Human genomes differ from one another by only about 0.1%. These differences include Single Nucleotide Polymorphisms (SNPs), where a given base (A,C,G,T) is changed, and Copy Number Variations (CNVs), where there are differences in the number of copies of segments of DNA. There are about 10 million common SNPs.
The Genome Wide Association Study (GWAS) usually looks for differences in SNPs between people with and without diseases. The working hypothesis at that time was that common diseases like Alzheimer’s disease or Type II diabetes should be due to differences in common SNPs (i.e. common disease, common variant). People thought that the genes for many of these diseases would be found within a decade. Towards this end, companies like Affymetrix and Illumina began making microarray chips, which consist of tiny wells with snippets of complement DNA (cDNA) of a small segment of the sequence around each SNP. SNP variants are found by seeing what types of DNA fragments in genetic samples (e.g. saliva) get bound to the complement strands in the array. A classic GWAS study then considers the differences in SNP variants observed in disease and control groups. For any finite sample, there will always be fluctuations so a statistical criterion must be established to evaluate whether a variant is significant. This is done by computing the probability or p-value of the occurrence of a variant due to random chance. However, in a set of a million SNPs, which is standard for a chip, the probability that at least one SNP is randomly associated with a disease is up to a million-fold higher if the SNPs are all independent (it is bounded by the sum of the probabilities of each event; see the Bonferroni correction). However, since SNPs are not always independent, this is a very conservative criterion.
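The multiple-testing inflation can be checked directly (a sketch using a small number of tests for speed): with m independent null tests at level α, the chance that at least one is “significant” is 1 − (1 − α)^m ≈ mα, which is why GWAS adopts thresholds near 0.05/10⁶ = 5 × 10⁻⁸.

```python
import numpy as np

rng = np.random.default_rng(3)
m, alpha, trials = 100, 0.01, 20_000

# Under the null, p-values are uniform on [0, 1].
# Count experiments with at least one false "hit" among m tests.
p = rng.uniform(size=(trials, m))
family_wise = np.mean((p < alpha).any(axis=1))

analytic = 1.0 - (1.0 - alpha) ** m      # exact family-wise error rate
assert abs(family_wise - analytic) < 0.02

# Bonferroni: testing each SNP at alpha/m keeps the family-wise rate near alpha
bonf = np.mean((p < alpha / m).any(axis=1))
assert bonf < alpha + 0.005
```

With m = 100 and α = 0.01 the family-wise error rate is already about 63%, which is why a million-SNP chip needs such a stringent per-test threshold.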
The first set of results from GWAS started to be published shortly after I arrived at the NIH in 2004 and they weren’t very promising. A small number of SNPs were found for some common diseases but they only conferred a very small increase in risk even when the diseases were thought to be highly heritable. Heritability is a measure of the proportion of phenotypic variation between people explained by genetic variation. (I’ll give a primer on heritability in a following post.) The one notable exception was age-related macular degeneration for which five SNPs were found to be associated with a 2 to 3 fold increase in risk. The difference between the heritability of a disease as measured by classical genetic methods and what was found to be explained by SNPs came to be known as the “missing heritability” problem. I thought this was an interesting puzzle but didn’t think much about it until I saw the results from this paper (summarized on Steve Hsu’s blog), which showed that different European subpopulations could be separated by just projecting onto two principal components of 300,000 SNPs of 6000 people. (Principal components are the eigenvectors of the 6000 by 6000 correlation matrix of the 300,000 dimensional SNP vectors of each person.) This was a revelation to me because it implied that although single genes did not carry much weight, the collection of all the genes certainly did. I decided right then that I would show that the missing heritability for obesity was contained in the “pattern” of SNPs rather than in single SNPs. I did this without really knowing anything at all about genetics or heritability.
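The principal-components result is easy to reproduce in miniature (a simulation of my own, with made-up allele frequencies): two subpopulations whose SNP frequencies have drifted slightly apart separate cleanly on the first principal component of the genotype matrix, even though no single SNP is very informative.

```python
import numpy as np

rng = np.random.default_rng(4)
n_per, p_snps = 100, 1000

# Two subpopulations whose allele frequencies have drifted apart slightly
base = rng.uniform(0.2, 0.8, p_snps)
drift = rng.normal(0.0, 0.1, p_snps)
f1, f2 = base, np.clip(base + drift, 0.05, 0.95)

G1 = rng.binomial(2, f1, size=(n_per, p_snps))   # genotypes: 0, 1 or 2
G2 = rng.binomial(2, f2, size=(n_per, p_snps))
G = np.vstack([G1, G2]).astype(float)
G -= G.mean(axis=0)                              # center each SNP column

# Principal component scores from the SVD of the centered genotype matrix
U, S, Vt = np.linalg.svd(G, full_matrices=False)
pc1 = U[:, 0] * S[0]

labels = np.array([0] * n_per + [1] * n_per)
pred = (pc1 > np.median(pc1)).astype(int)
accuracy = max(np.mean(pred == labels), np.mean(pred != labels))
assert accuracy > 0.9    # PC1 alone separates the two subpopulations
```

No individual SNP here carries much signal; it is the aggregate pattern across a thousand weakly drifted loci that makes the groups separable, which is the same intuition behind looking for heritability in the full SNP pattern.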
PLoS Genet 8(3): e1002637. doi:10.1371/journal.pgen.1002637
Shashaank Vattikuti, Juen Guo, and Carson C. Chow
Abstract: We used a bivariate (multivariate) linear mixed-effects model to estimate the narrow-sense heritability (h2) and heritability explained by the common SNPs (hg2) for several metabolic syndrome (MetS) traits and the genetic correlation between pairs of traits for the Atherosclerosis Risk in Communities (ARIC) genome-wide association study (GWAS) population. MetS traits included body-mass index (BMI), waist-to-hip ratio (WHR), systolic blood pressure (SBP), fasting glucose (GLU), fasting insulin (INS), fasting triglycerides (TG), and fasting high-density lipoprotein (HDL). We found the percentage of h2 accounted for by common SNPs to be 58% of h2 for height, 41% for BMI, 46% for WHR, 30% for GLU, 39% for INS, 34% for TG, 25% for HDL, and 80% for SBP. We confirmed prior reports for height and BMI using the ARIC population and independently in the Framingham Heart Study (FHS) population. We demonstrated that the multivariate model supported large genetic correlations between BMI and WHR and between TG and HDL. We also showed that the genetic correlations between the MetS traits are directly proportional to the phenotypic correlations.
Author Summary: The narrow-sense heritability of a trait such as body-mass index is a measure of the variability of the trait between people that is accounted for by their additive genetic differences. Knowledge of these genetic differences provides insight into biological mechanisms and hence treatments for diseases. Genome-wide association studies (GWAS) survey a large set of genetic markers common to the population. They have identified several single markers that are associated with traits and diseases. However, these markers do not seem to account for all of the known narrow-sense heritability. Here we used a recently developed model to quantify the genetic information contained in GWAS for single traits and shared between traits. We specifically investigated metabolic syndrome traits that are associated with type 2 diabetes and heart disease, and we found that for the majority of these traits much of the previously unaccounted for heritability is contained within common markers surveyed in GWAS. We also computed the genetic correlation between traits, which is a measure of the genetic components shared by traits. We found that the genetic correlation between these traits could be predicted from their phenotypic correlation.
I am very happy that this paper is finally out. It has been a three year long ordeal. I’ll write about the story and background for this paper later.
One of the big news stories last week was the publication in Science on the genomic sequence of a hundred year old Aboriginal Australian. The analysis finds that the Aboriginal Australians are descendants of an early migration to Asia between 62,000 and 75,000 years ago and this migration is different from the one that gave rise to modern Asians 25,000 to 38,000 years ago. I have often been amazed that humans were able to traverse harsh terrain and open water into the complete unknown. However, I briefly watched a documentary on CNBC last night about Apocalypse 2012 that made me understand this much better. Evidently, there is a fairly large group of people who believe the world will end in 2012. (This is independent of the group that thought the world would end earlier this year.) The prediction is based on the fact that a large cycle in the Mayan calendar will supposedly end in 2012. According to some of the believers, the earth’s rotation will reverse and that will cause massive earthquakes and tsunamis. These believers have thus managed to recruit followers and start building colonies in the mountains to try to survive. People are taking this extremely seriously. I think this ability to change the course of one’s entire life on the flimsiest of evidence is what led our ancestors to leave Africa and head into the unknown. People will get ideas in their heads and nothing will stop them from pursuing them. It’s what led us to populate every corner of the world and reshape much of the surface of the earth. It also suggests that the best optimization algorithms that seek a global maximum may be ones that have some ‘momentum’ so that they can leave local maxima and head downhill to find higher peaks elsewhere.
University of Oregon physicist Steve Hsu has just launched an initiative with BGI to study the genetic basis of intelligence. See here for a summary of the project. They are presently recruiting participants to be genotyped so check out the website if you are interested. I have agreed to serve as a consultant on the project. The reason for my involvement is that my recent research has involved estimating the heritability of various obesity and Type 2 diabetic traits from genomic data and these methods could be useful for the project. I am also interested in the genetic and molecular basis of complex cognitive disorders like autism, schizophrenia and addiction. I will make some expository posts on these topics in the near future.
I enter this project with full awareness that the topic of heritability and intelligence is a lightning rod for controversy. I have even argued in the past that this is a topic that might be best left alone because any knowledge gained could be easily misused. However, it is also clear that this project will happen with or without me and I decided that the potential benefits outweighed the costs. When it comes to treating cognitive disorders like autism, schizophrenia and addiction, we still are in the dark ages and a large genome study of cognition could lead to potential new treatments. BGI also has the will and wherewithal to devote large amounts of resources to this project.