Paper on new version of Plink

The paper describing the updated version of the genome analysis software tool Plink has just been published.

Second-generation PLINK: rising to the challenge of larger and richer datasets
Christopher C Chang, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, and James J Lee

GigaScience 2015, 4:7  doi:10.1186/s13742-015-0047-8

PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format.

To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, View MathML-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0).

The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

Keywords: GWAS; Population genetics; Whole-genome sequencing; High-density SNP genotyping; Computational statistics


This project started out with us trying to do some genomic analysis that involved computing various distance metrics on sequence space. Programming virtuoso Chris Chang stepped in and decided to write some code to speed up the computations. His program, originally called wdist, was so good and fast that we kept asking him to put in more capabilities. Eventually,  he had basically replicated the suite of functions that Plink performed so he contacted Shaun Purcell, the author of Plink, if he could just call his code Plink too and Shaun agreed. We then ran a series of tests on various machines to check the speed-ups compared to the original Plink and gcta. If you do any GWAS analysis at all, I highly recommend you check out Plink 1.9.

New paper in eLife

Kinetic competition during the transcription cycle results in stochastic RNA processing

Matthew L FergusonValeria de TurrisMurali PalangatCarson C ChowDaniel R Larson


Synthesis of mRNA in eukaryotes involves the coordinated action of many enzymatic processes, including initiation, elongation, splicing, and cleavage. Kinetic competition between these processes has been proposed to determine RNA fate, yet such coupling has never been observed in vivo on single transcripts. In this study, we use dual-color single-molecule RNA imaging in living human cells to construct a complete kinetic profile of transcription and splicing of the β-globin gene. We find that kinetic competition results in multiple competing pathways for pre-mRNA splicing. Splicing of the terminal intron occurs stochastically both before and after transcript release, indicating there is not a strict quality control checkpoint. The majority of pre-mRNAs are spliced after release, while diffusing away from the site of transcription. A single missense point mutation (S34F) in the essential splicing factor U2AF1 which occurs in human cancers perturbs this kinetic balance and defers splicing to occur entirely post-release.


New Papers

Two new papers are now in print:
The first is on applying compressed sensing to genomics is now published in Gigascience. The summary of the paper is here and the link is here.
The second is on steroid-regulated gene induction and can be obtained here.
Biochemistry. 2014 Mar 25;53(11):1753-67. doi: 10.1021/bi5000178. Epub 2014 Mar 11.

A kinase-independent activity of Cdk9 modulates glucocorticoid receptor-mediated gene induction.


A gene induction competition assay has recently uncovered new inhibitory activities of two transcriptional cofactors, NELF-A and NELF-B, in glucocorticoid-regulated transactivation. NELF-A and -B are also components of the NELF complex, which participates in RNA polymerase II pausing shortly after the initiation of gene transcription. We therefore asked if cofactors (Cdk9 and ELL) best known to affect paused polymerase could reverse the effects of NELF-A and -B. Unexpectedly, Cdk9 and ELL augmented, rather than prevented, the effects of NELF-A and -B. Furthermore, Cdk9 actions are not blocked either by Ckd9 inhibitors (DRB or flavopiridol) or by two Cdk9 mutants defective in kinase activity. The mode and site of action of NELF-A and -B mutants with an altered NELF domain are similarly affected by wild-type and kinase-dead Cdk9. We conclude that Cdk9 is a new modulator of GR action, that Ckd9 and ELL have novel activities in GR-regulated gene expression, that NELF-A and -B can act separately from the NELF complex, and that Cdk9 possesses activities that are independent of Cdk9 kinase activity. Finally, the competition assay has succeeded in ordering the site of action of several cofactors of GR transactivation. Extension of this methodology should be helpful in determining the site and mode of action of numerous additional cofactors and in reducing unwanted side effects.

PMID: 24559102 [PubMed – indexed for MEDLINE]
PMCID: PMC3985961 [Available on 2015/2/21]

New paper on genomics

James Lee and I have a new paper out: Lee and Chow, Conditions for the validity of SNP-based heritability estimation, Human Genetics, 2014. As I summarized earlier (e.g. see here and here), heritability is a measure of the proportion of the variance of some trait (like height or cholesterol levels) due to genetic factors. The classical way to estimate heritability is to regress standardized (mean zero, standard deviation one) phenotypes of close relatives against each other. In 2010, Jian Yang, Peter Visscher and colleagues developed a way to estimate heritability directly from the data obtained in Genome Wide Association Studies (GWAS), sometimes called GREML.  Shashaank Vattikuti and I quickly adopted this method and computed the heritability of metabolic syndrome traits as well as the genetic correlations between the traits (link here). Unfortunately, our methods section has a lot of typos but the corrected Methods with the Matlab code can be found here. However, I was puzzled by the derivation of the method provided by the Yang et al. paper.  This paper is our resolution.  The technical details are below the fold.


Continue reading

New paper on neural networks

Michael Buice and I have a new paper in Frontiers in Computational Neuroscience as well as on the arXiv (the arXiv version has fewer typos at this point). This paper partially completes the series of papers Michael and I have written about developing generalized activity equations that include the effects of correlations for spiking neural networks. It combines two separate formalisms we have pursued over the past several years. The first was a way to compute finite size effects in a network of coupled deterministic oscillators (e.g. see here, herehere and here).  The second was to derive a set of generalized Wilson-Cowan equations that includes correlation dynamics (e.g. see here, here, and here ). Although both formalisms utilize path integrals, they are actually conceptually quite different. The first formalism adapted kinetic theory of plasmas to coupled dynamical systems. The second used ideas from field theory (i.e. a two-particle irreducible effective action) to compute self-consistent moment hierarchies for a stochastic system. This paper merges the two ideas to generate generalized activity equations for a set of deterministic spiking neurons.

Paper on compressed sensing and genomics

New paper on the arXiv. The next step after the completion of the Human Genome Project, was the search for genes associated with diseases such as autism or diabetes. However, after spending hundreds of millions of dollars, we find that there are very few common variants of genes with large effects. This doesn’t mean that there aren’t genes with large effect. The growth hormone gene definitely has a large effect on height. It just means that variations of genes that are common among people have small effects on the phenotype. Given the results of Fisher, Wright, Haldane and colleagues, this was probably expected as the most likely scenario and recent results measuring narrow-sense heritability directly from genetic markers (e.g. see this) confirms this view.

Current GWAS microarrays consider about a million or two markers and this is increasing rapidly. Narrow-sense heritability refers to the additive or linear genetic variance, which means the phenotype is given by the linear model y= Z\beta + \eta, where y is the phenotype vector, Z is the genotype matrix, \beta are all the genetic effects we want to recover, and \eta are all the nonadditive components including environmental effects. This is a classic linear regression problem. The problem comes when the number of coefficients \beta far exceeds the number of people in your sample, which is the case for genomics. Compressed sensing is a field of high dimensional statistics that addresses this specific problem. People such as David Donoho, Emmanuel Candes and Terence Tao have proven under fairly general conditions that if the number of nonzero coefficients are sparse compared to the number samples, then the effects can be completely recovered using L1 penalized optimization algorithms such as the lasso or approximate message passing. In this paper, we show that these ideas can be applied to genomics.

Here is Steve Hsu’s summary of the paper

Application of compressed sensing to genome wide association studies and genomic selection

(Submitted on 8 Oct 2013)

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.

Comments: Main paper (27 pages, 4 figures) and Supplement (5 figures) combined
Subjects: Genomics (q-bio.GN); Applications (stat.AP)
Cite as: arXiv:1310.2264 [q-bio.GN]
(or arXiv:1310.2264v1 [q-bio.GN] for this version)

New paper on population genetics

James Lee and I just published a paper entitled “The causal meaning of Fisher’s average effect” in the journal Genetics Research. The paper can be obtained here. This paper is the brainchild of James and I just helped him out with some of the proofs.  James’s take on the paper can be read here. The paper resolves a puzzle about the incommensurability of Ronald Fisher’s two definitions of the average effect noted by population geneticist D.S. Falconer three decades ago.

Fisher was well known for both brilliance and obscurity and people have long puzzled over the meaning of some of his work.  The concept of the average effect is extremely important for population genetics but it is not very well understood. The field of population genetics was invented in the early twentieth century by luminaries such as Fisher, Sewall Wright, and JBS Haldane to reconcile Darwin’s theory of evolution with Mendelian genetics. This is a very rich field that has been somewhat forgotten. People in mathematical,  systems, computational, and quantitative biology really should be fully acquainted with the field.

For those who are unacquainted with genetics, here is a quick primer to understand the paper. Certain traits, like eye colour or the ability to roll your tongue, are affected by your genes. Prior to the discovery of the structure of DNA, it was not clear what genes were, except that they were the primary discrete unit of genetic inheritance. These days it usually refers to some region on the genome. Mendel’s great insight was that genes come in pairs, which we now know to correspond to the two copies of each of the 23 chromosomes we have.  A variant of a particular gene is called an allele.  Traits can depend on genes (or more accurately genetic loci) linearly or nonlinearly. Consider a quantitative trait that depends on a single genetic locus that has two alleles, which we will call a and A. This means that a person will have one of three possible genotypes: 1) homozygous in A (i.e. have two A alleles), 2) heterozygous (have one of each), or 3) homozygous in a (i.e. have no A alleles). If the locus is linear then if you plot the measure of the trait (e.g. height) against the number of A alleles, you will get a straight line. For example, suppose allele A contributes a tenth of a centimetre to height. Then people with one A allele will be on average one tenth of a centimetre taller than those with no A alleles and those with two A alleles will be two tenths taller. The familiar notion of dominance is a nonlinear effect. So for example, the ability to roll your tongue is controlled by a single gene. There is a dominant rolling allele and a recessive nonrolling allele. If you have at least one rolling allele, you can roll your tongue.

The average effect of a gene substitution is the average change in a trait if one allele is substituted for another. A crucial part of population genetics is that you always need to consider averages. This is because genes are rarely completely deterministic. They can be influenced by the environment or other genes. Thus, in order to define the effect of the gene, you need to average over these other influences. This then leads to a somewhat ambiguous definition of average effect and Fisher actually came up with two. The first, and as James would argue the primary definition, is a causal one in that we want to measure the average effect of a gene if you experimentally substituted one allele for another prior to development and influence by the environment. A second correlation definition would simply be to plot the trait against the number of alleles as in the example above. The slope would then be the average effect. This second definition looks at the correlation between the gene and the trait but as the old saying goes “correlation does not imply causation”. For example, the genetic loci may not have any effect on the trait but happens to be strongly correlated with a true causal locus (in the population you happen to be examining). Distinguishing between genes that are merely associated with a trait from ones that are actually causal remains an open problem in genome wide association studies.

Our paper goes over some of the history and philosophy of the tension between these two definitions. We wrote the paper because these two definitions do not always agree and we show under what conditions they do agree. The main reason they don’t agree is that averages will depend on the background over which you average. For a biallelic gene, there are 2 alleles but 3 genotypes. The distribution of alleles in a population is governed by two parameters. It’s not enough to specify the frequency of one allele. You also need to know the correlation between alleles. The regression definition matches the causal definition if a particular function representing this correlation is held fixed while the experimental allele substitutions under the causal definition are carried out. We also considered the multi-allele and multi-loci case in as much generality as we could.