New paper on the arXiv. The next step after the completion of the Human Genome Project was the search for genes associated with diseases such as autism or diabetes. However, after spending hundreds of millions of dollars, we find that there are very few common variants of genes with large effects. This doesn’t mean that there aren’t genes with large effects. The growth hormone gene definitely has a large effect on height. It just means that variations of genes that are common among people have small effects on the phenotype. Given the results of Fisher, Wright, Haldane and colleagues, this was probably to be expected as the most likely scenario, and recent results measuring narrow-sense heritability directly from genetic markers (e.g. see this) confirm this view.

Current GWAS microarrays consider about one or two million markers, and this number is increasing rapidly. Narrow-sense heritability refers to the additive or linear genetic variance, which means the phenotype is given by the linear model $y = Z\beta + \eta$, where $y$ is the phenotype vector, $Z$ is the genotype matrix, $\beta$ are all the genetic effects we want to recover, and $\eta$ are all the nonadditive components including environmental effects. This is a classic linear regression problem. The problem comes when the number of coefficients far exceeds the number of people in your sample, which is the case for genomics. Compressed sensing is a field of high dimensional statistics that addresses this specific problem. People such as David Donoho, Emmanuel Candes and Terence Tao have proven under fairly general conditions that if the nonzero coefficients are sparse compared to the number of samples, then the effects can be completely recovered using L1 penalized optimization algorithms such as the lasso or approximate message passing. In this paper, we show that these ideas can be applied to genomics.
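
To make the setup concrete, here is a minimal sketch of sparse recovery when markers far outnumber samples. It is not the code used in the paper. It writes the linear model as y = Z beta + eta (phenotype vector y, genotype matrix Z, effects beta, everything nonadditive in eta) and solves the lasso with plain iterative soft thresholding (ISTA), which is one simple L1 solver among many. Everything specific is an assumption for illustration: the Gaussian stand-in for genotypes (real genotypes are allele counts), the sizes n, p and s, the penalty lam, the noise level, and the 0.5 selection cutoff.

```python
import numpy as np


def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal map of the L1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_ista(Z, y, lam, n_iter=1000):
    """Minimize (1/(2n)) * ||y - Z b||^2 + lam * ||b||_1 by ISTA."""
    n, p = Z.shape
    # Step size 1/L, where L = sigma_max(Z)^2 / n is the Lipschitz constant
    # of the gradient of the least-squares term.
    step = n / np.linalg.norm(Z, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = Z.T @ (Z @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta


rng = np.random.default_rng(0)
n, p, s = 200, 1000, 5              # samples, markers, nonzero effects (p >> n)

Z = rng.standard_normal((n, p))     # illustrative stand-in for standardized genotypes
beta_true = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
beta_true[support] = rng.choice([-1.0, 1.0], size=s)
eta = 0.1 * rng.standard_normal(n)  # small nonadditive term, so heritability is near 1
y = Z @ beta_true + eta

beta_hat = lasso_ista(Z, y, lam=0.1)

# Call a marker "selected" if its estimate exceeds half the true effect size
# (an illustrative cutoff, not a rule from the paper).
selected = np.flatnonzero(np.abs(beta_hat) > 0.5)
print("true support:    ", sorted(support.tolist()))
print("selected markers:", sorted(selected.tolist()))
```

Note that the L1 penalty shrinks the surviving estimates toward zero, so in practice one would typically refit the selected markers by ordinary least squares if unbiased effect sizes are wanted.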

Here is Steve Hsu’s summary of the paper:

Application of compressed sensing to genome wide association studies and genomic selection

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability $h^2 = 1$, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For $h^2 = 0.5$, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.

Comments: Main paper (27 pages, 4 figures) and Supplement (5 figures) combined

Subjects: Genomics (q-bio.GN); Applications (stat.AP)

Cite as: arXiv:1310.2264 [q-bio.GN] (or arXiv:1310.2264v1 [q-bio.GN] for this version)

October 15, 2013 at 19:33

Nice paper. As compressed sensing of a sparse signal relies on a linear combination of a relatively small number of ‘building blocks’, does this paradigm miss nonlinear interactions between genes?

October 15, 2013 at 19:58

@Tom Yes it does but that is a feature not a bug. We are specifically looking for additive genetic variants, which are linear by definition. The nonlinear effects are lumped into the “noise” in this model.

October 21, 2013 at 22:04

sparse coefficients is a gaussian assumption. ornstein-uhlenbeck taylor series.

bring in the noise/nsense.