New paper in Molecular Psychiatry

Genomic analysis of diet composition finds novel loci and associations with health and lifestyle

S. Fleur W. Meddens, et al.


We conducted genome-wide association studies (GWAS) of relative intake from the macronutrients fat, protein, carbohydrates, and sugar in over 235,000 individuals of European ancestries. We identified 21 unique, approximately independent lead SNPs. Fourteen lead SNPs are uniquely associated with one macronutrient at genome-wide significance (P < 5 × 10−8), while five of the 21 lead SNPs reach suggestive significance (P < 1 × 10−5) for at least one other macronutrient. While the phenotypes are genetically correlated, each phenotype carries a partially unique genetic architecture. Relative protein intake exhibits the strongest relationships with poor health, including positive genetic associations with obesity, type 2 diabetes, and heart disease (rg ≈ 0.15–0.5). In contrast, relative carbohydrate and sugar intake have negative genetic correlations with waist circumference, waist-hip ratio, and neighborhood deprivation (|rg| ≈ 0.1–0.3) and positive genetic correlations with physical activity (rg ≈ 0.1 and 0.2). Relative fat intake has no consistent pattern of genetic correlations with poor health but has a negative genetic correlation with educational attainment (rg ≈−0.1). Although our analyses do not allow us to draw causal conclusions, we find no evidence of negative health consequences associated with relative carbohydrate, sugar, or fat intake. However, our results are consistent with the hypothesis that relative protein intake plays a role in the etiology of metabolic dysfunction.

Did microbes cause the Great Dying?

In one of my very first posts almost a decade ago, I wrote about the end-Permian extinction 250 million years ago, which was the greatest mass extinction thus far. In that post I covered research that had ruled out an asteroid impact and found evidence of global warming, possibly due to volcanos, as a cause. Now, a recent paper in PNAS proposes that a horizontal gene transfer event from bacteria to archaea may have been the main cause for the increase of methane and CO2. This paper is one of the best papers I have read in a long time, combining geological field work, mathematical modeling, biochemistry, metabolism, and evolutionary phylogenetic analysis to make a compelling argument for their hypothesis.

Their case hinges on several pieces of evidence. The first comes from well-dated carbon isotopic records from China.  The data shows a steep plunge in the isotopic ratio (i.e ratio between the less abundant but heavier carbon 13 and the lighter more abundant carbon 12) in the inorganic carbonate reservoir with a moderate increase in the organic reservoir. In the earth’s carbon cycle, the organic reservoir comes from the conversion of atmospheric CO2 into carbohydrates via photosynthesis, which prefers carbon 12 to carbon 13. Organic carbon is returned to inorganic form through oxidation by animals eating photosynthetic organisms or by the burning of stored carbon like trees or coal. A steep drop in the isotopic ratio means that there was an extra surge of carbon 12 into the inorganic reservoir. Using a mathematical model, the authors show that in order to explain the steep drop, the inorganic reservoir must have grown superexponentially (faster than exponential). This requires some runaway positive feedback loop that is difficult to explain by geological processes such as volcanic activity, but is something that life is really good at.

The increased methane would have been oxidized to CO2 by other microbes, which would have lowered the oxygen concentration. This would allow for more efficient fermentation and thus more acetate fuel for the archaea to make more methane. The authors showed in another simple mathematical model how this positive feedback loop could lead to superexponential growth. Methane and CO2 are both greenhouse gases and their increase would have caused significant global warming. Anaerobic methane oxidation could also lead to the release of poisonous hydrogen sulfide.

They then considered what microbe could have been responsible. They realized that during the late Permian, a lot of organic material was being deposited in the sediment. The organic reservoir (i.e. fossil fuels, methane hydrates, soil organic matter, peat, etc) was much larger back then than today, as if someone or something used it up at some point. One of the end products of fermentation of this matter would be acetate and that is something archaea like to eat and convert to methane. There are two types of archaea that can do this and one is much more efficient than the other at high acetate concentrations. This increased efficiency was also shown recently to have arisen by a horizontal gene transfer event from a bacterium. A phylogenetic analysis of all known archaea showed that the progenitor of the efficient methanogenic one likely arose 250 million years ago.

The final piece of evidence is that the archaea need nickel to make methane. The authors then looked at the nickel concentrations in their Chinese geological samples and found a sharp increase in nickel immediately before the steep drop in the isotopic ratio. They postulate that the source of the nickel was the massive Siberian volcano eruptions at that time (and previously proposed as the cause of the increased methane and CO2).

This scenario required the unlikely coincidence of several events –  lots of excess organic fuel, low oxygen (and sulfate), increased nickel, and a horizontal gene transfer event. If any of these were missing, the Great Dying may not have taken place. However, given that there have been only 5 mass extinctions, although we may be currently inducing the 6th, low probability events may be required for such calamitous events. This paper should also give us some pause about introducing genetically modified organisms into the environment. While most will probably be harmless, you never know when one will be the match that lights the fire.



New paper on genomics

James Lee and I have a new paper out: Lee and Chow, Conditions for the validity of SNP-based heritability estimation, Human Genetics, 2014. As I summarized earlier (e.g. see here and here), heritability is a measure of the proportion of the variance of some trait (like height or cholesterol levels) due to genetic factors. The classical way to estimate heritability is to regress standardized (mean zero, standard deviation one) phenotypes of close relatives against each other. In 2010, Jian Yang, Peter Visscher and colleagues developed a way to estimate heritability directly from the data obtained in Genome Wide Association Studies (GWAS), sometimes called GREML.  Shashaank Vattikuti and I quickly adopted this method and computed the heritability of metabolic syndrome traits as well as the genetic correlations between the traits (link here). Unfortunately, our methods section has a lot of typos but the corrected Methods with the Matlab code can be found here. However, I was puzzled by the derivation of the method provided by the Yang et al. paper.  This paper is our resolution.  The technical details are below the fold.


Continue reading

New paper on GPCRs

New paper in PloS One:

Fatakia SN, Costanzi S, Chow CC (2011) Molecular Evolution of the Transmembrane Domains of G Protein-Coupled Receptors. PLoS ONE 6(11): e27813. doi:10.1371/journal.pone.0027813


G protein-coupled receptors (GPCRs) are a superfamily of integral membrane proteins vital for signaling and are important targets for pharmaceutical intervention in humans. Previously, we identified a group of ten amino acid positions (called key positions), within the seven transmembrane domain (7TM) interhelical region, which had high mutual information with each other and many other positions in the 7TM. Here, we estimated the evolutionary selection pressure at those key positions. We found that the key positions of receptors for small molecule natural ligands were under strong negative selection. Receptors naturally activated by lipids had weaker negative selection in general when compared to small molecule-activated receptors. Selection pressure varied widely in peptide-activated receptors. We used this observation to predict that a subgroup of orphan GPCRs not under strong selection may not possess a natural small-molecule ligand. In the subgroup of MRGX1-type GPCRs, we identified a key position, along with two non-key positions, under statistically significant positive selection.

New paper on TREK channels

This paper came out of a collaboration instigated by my former fellow Sarosh Fatakia.  We applied our method using mutual information (see here and here) to a family of potassium channels known as TREK channels.

Structural models of TREK channels and their gating mechanism

A.L. Milac, A. Anishkin, S.N. Fatakia, C.C. Chow, S. Sukharev, and H. R. Guy

Mechanosensitive TReK channels belong to the family of K2p channels, a family of widely distributed, well-modulated channels that uniquely have two similar or identical subunits, each with two TM1-p-TM2 motifs. Our goal is to build viable structural models of TReK channels, as representatives of K2p channels family. The structures available to be used as templates belong to the 2TM channels superfamily. These have low sequence similarity and different structural features: four symmetrically arranged subunits, each having one TM1-p-TM2 motif. Our model building strategy used two subunits of the template (KcsA) to build one subunit of the target (TREK-1).  Our models of the closed channel were adjusted to differ substantially from those of the template, e.g., TM2 of the second repeat is near the axis of the pore whereas TM2 of the first repeat is far from the axis.  Segments linking the two repeats and immediately following the last TM segment were modeled ab initio as α-helices based on helical periodicities of hydrophobic and hydrophilic residues, highly conserved and poorly conserved residues and statistically related positions from multiple sequence alignments. The models were further refined by 2-fold symmetry-constrained MD simulations using a protocol we developed previously. We also built models of the open state and suggest a possible tension-activated gating mechanism characterized by helical motion with 2-fold symmetry. Our models are consistent with deletion/truncation mutagenesis and thermodynamic analysis of gating described in the accompanying paper.



addendum: link fixed

Talk at MBI

I’m currently at the Mathematical Biosciences Institute for a workshop on Computational challenges in integrative biological modeling.  The slides for my talk on using Bayesian methods for parameter estimation and model comparison are here.

Title: Bayesian approaches for parameter estimation and model evaluation of dynamical systems

Abstract: Differential equation models are often used to model biological systems. An important and difficult problem is how to estimate parameters and decide which model among possible models is the best. I will argue that Bayesian inference provides a self-consistent framework to do both tasks. In particular, Bayesian parameter estimation provides a natural measure of parameter sensitivity and Bayesian model comparison automatically evaluates models by rewarding fit to the data while penalizing the number of parameters. I will give examples of employing these approaches on ODE and PDE models.

Information theory

It took me a very long time to develop an intuitive feel for information theory and Shannon entropy.  In fact, it was only when I had to use it for an application that it finally fell together and made visceral sense.  Now, I understand fully why people, especially in the neural coding field, are so focused on it.  I also think that the concepts are actually very natural and highly useful in a wide variety of applications.  I now think in terms of entropy and mutual information all the time.

Our specific application was to develop a scheme to find important amino acid positions in families of proteins called GPCRs, which I wrote about recently.  Essentially, we have a matrix of letters from an alphabet of twenty letters.  Each row of the matrix corresponds to a protein and each column of the matrix corresponds to a specific position on the protein.  Each column is a vector of letters.  Conserved columns are those which have very little variation in the letters and these positions are thought to be functionally important.   They are also relatively easy to find.

What we were interested in was finding pairs of columns that are correlated in some way. For example, suppose whenever column 1 had the letter A (in a given row), column 2 was more likely to have the letter W, whereas when column 1 had the letter C, column 2 was less likely to have the letter P but more likely to have R and D.  You can appreciate how hard it would be to spot these correlations visually.  This is confounded by the fact that if the two columns have high variability then there will be random coincidences from time to time.  So what we really want to know is the amount of correlation that exceeds randomness. As I show below, this is exactly what mutual information quantifies.

Continue reading

Mutual information and GPCRs

A new paper has just been published on PLoS One: “Computing Highly Correlated Positions Using Mutual Information and Graph Theory for G Protein-Coupled Receptors” by Sarosh Fatakia, Stefano Costanzi and myself. G-Protein-Coupled Receptors (GPCRs)  are a very large family of cell surface receptors that are ubiquitous in biological systems.  Examples include olfactory receptors, neuromodulatory receptors like dopamine, and rhodopsin in the eye.  Most drugs in use today target GPCRs.  There are thousands of different types in humans alone.   What we do in this paper is to look for amino acid positions along the GPCR sequence that may be important for structure and function.  Presumably, GPCRs all evolved from a single ancestor protein so important positions may have coevolved, i.e. a mutation at one position would be compensated by mutations at other positions.

The way we looked for these positions was to consider an alignment that was previously computed for three classes of GPCRs.    A GPCR sequence is given by a string of letters corresponding to the 20 amino acids.  An alignment is an arrangement of the strings into a matrix, where the rows of the matrix correspond to strings that are arranged so that the columns can be considered to be equivalent positions. We only considered the transmembrane regions of the receptor so we could assume there were no insertions and deletions.  We then computed the mutual information between each pair of positions (i.e. columns of matrix) j and k.   The mutual information (MI) is given by the expression

MI(j,k) = \sum_{x,y} p_{j,k}(x,y)\log \frac{p_{j,k}(x,y)}{p_{j}(x)p_{k}(y)},

where p_j(x) is the probability of amino acid x appearing at position j, p_{j,k}(x,y) is the probability of amino acids x and y appear at sites j and k, and the sum over x and y is over all the amino acids.  Basically, MI is a measure of the “excess” of probability of the occurrence of amino acid x at position j and amino acid y at position k, over what would have occurred if they were statistically independent.  One of the problems with mutual information is that  you need a lot of data to compute it accurately.  Given that we only had a finite number of sequences in each class, error in the MI estimate was expected.  So what we did was to set a threshold value for significance compared to the null hypothesis of a set of random sequences.

To test our  hypothesis that important positions would co-evolve as a network, we  constructed a graph out of the MI matrix where the vertices were the positions and an edge was drawn between two vertices only if the MI was significant.  We then looked for interconnected subgraphs or cliques.  Finding a clique is an NP complete problem so as a surrogate we looked for high degree (connectivity) positions and ranked the positions according to degree.  We then assessed the degree significance by comparing our MI graph to a random graph. It turned out that the top 10 significant positions formed a clique and also corresponded to the binding cavity for ligands in the three GPCR structures that have been solved thus far.  The method also did not find a binding cavity in one class of GPCRs for which no cavity has been observed experimentally.  The method could be used on any protein family  to search for important positions.

note: updated Mar 7 to correct MI formula

erratum, Dec 13, 2011:  The number of human GPCRs is now thought to number less than a thousand.