A new paper has just been published in PLoS One: “Computing Highly Correlated Positions Using Mutual Information and Graph Theory for G Protein-Coupled Receptors” by Sarosh Fatakia, Stefano Costanzi and myself. G protein-coupled receptors (GPCRs) are a very large family of cell surface receptors that are ubiquitous in biological systems. Examples include olfactory receptors, receptors for neuromodulators such as dopamine, and rhodopsin in the eye. Most drugs in use today target GPCRs. There are thousands of different types in humans alone. In this paper we look for amino acid positions along the GPCR sequence that may be important for structure and function. Presumably, GPCRs all evolved from a single ancestor protein, so important positions may have coevolved, i.e. a mutation at one position would be compensated by mutations at other positions.
We looked for these positions in an alignment that had previously been computed for three classes of GPCRs. A GPCR sequence is given by a string of letters corresponding to the 20 amino acids. An alignment is an arrangement of the strings into a matrix, where each row is one sequence and the rows are arranged so that each column can be considered an equivalent position across sequences. We only considered the transmembrane regions of the receptor so we could assume there were no insertions or deletions. We then computed the mutual information between each pair of positions (i.e. columns of the matrix) j and k. The mutual information (MI) is given by the expression
$$M_{jk} = \sum_{x,y} P_{jk}(x,y) \log \frac{P_{jk}(x,y)}{P_j(x)\,P_k(y)}$$

where $P_j(x)$ is the probability of amino acid x appearing at position j, $P_{jk}(x,y)$ is the joint probability of amino acids x and y appearing at positions j and k, and the sum over x and y is over all the amino acids. Basically, MI is a measure of the “excess” probability of amino acid x occurring at position j together with amino acid y at position k, over what would have occurred if the two positions were statistically independent. One of the problems with mutual information is that you need a lot of data to compute it accurately. Given that we only had a finite number of sequences in each class, error in the MI estimate was expected. So we set a threshold value for significance, relative to the null hypothesis of a set of random sequences.
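To make the MI calculation concrete, here is a minimal Python sketch using a made-up four-sequence toy alignment. The function names (`mutual_information`, `mi_matrix`, `null_threshold`) are hypothetical. The null model, which shuffles each column independently and takes a high quantile of the resulting MI values, is an assumption chosen for illustration and is not necessarily the procedure used in the paper.

```python
import math
import random
from collections import Counter


def mutual_information(col_j, col_k):
    """Plug-in estimate of MI (in nats) between two alignment columns."""
    n = len(col_j)
    count_j = Counter(col_j)               # counts of x at position j
    count_k = Counter(col_k)               # counts of y at position k
    count_jk = Counter(zip(col_j, col_k))  # joint counts of (x, y)
    mi = 0.0
    for (x, y), c in count_jk.items():
        p_xy = c / n
        # p_xy / (p_x * p_y), written in terms of raw counts
        mi += p_xy * math.log(p_xy * n * n / (count_j[x] * count_k[y]))
    return mi


def mi_matrix(alignment):
    """Symmetric matrix of MI values for every pair of columns."""
    columns = list(zip(*alignment))
    size = len(columns)
    mi = [[0.0] * size for _ in range(size)]
    for j in range(size):
        for k in range(j + 1, size):
            mi[j][k] = mi[k][j] = mutual_information(columns[j], columns[k])
    return mi


def null_threshold(alignment, quantile=0.99, n_shuffles=20, seed=0):
    """MI cutoff from column-shuffled alignments (an assumed null model)."""
    rng = random.Random(seed)
    columns = [list(c) for c in zip(*alignment)]
    null_values = []
    for _ in range(n_shuffles):
        # Shuffling each column separately destroys any covariation
        # while keeping the amino acid frequencies at each position.
        shuffled = [rng.sample(c, len(c)) for c in columns]
        for j in range(len(columns)):
            for k in range(j + 1, len(columns)):
                null_values.append(mutual_information(shuffled[j], shuffled[k]))
    null_values.sort()
    index = min(len(null_values) - 1, int(quantile * len(null_values)))
    return null_values[index]


# Toy alignment (invented): columns 0 and 1 covary perfectly,
# column 2 varies independently of column 0, column 3 is conserved.
alignment = ["ACDA", "ACEA", "GTDA", "GTEA"]
mi = mi_matrix(alignment)
threshold = null_threshold(alignment)
print(mi[0][1], mi[0][2], mi[0][3])
```

In this toy case the perfectly covarying pair gives MI = log 2, while the independent pair and the pair involving the conserved column give zero. With only four sequences the shuffled null is far too noisy to be useful, which is the finite-data problem described above.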
To test our hypothesis that important positions would co-evolve as a network, we constructed a graph out of the MI matrix where the vertices were the positions and an edge was drawn between two vertices only if the MI was significant. We then looked for fully interconnected subgraphs, or cliques. Finding the largest clique is an NP-complete problem, so as a surrogate we looked for high-degree (highly connected) positions and ranked the positions according to degree. We then assessed the significance of each degree by comparing our MI graph to a random graph. It turned out that the top 10 significant positions formed a clique and also corresponded to the binding cavity for ligands in the three GPCR structures that have been solved thus far. The method also did not find a binding cavity in one class of GPCRs for which no cavity has been observed experimentally. The method could be used on any protein family to search for important positions.
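The graph step can be sketched in a few lines of Python. Everything here is illustrative: the function names are hypothetical, the 5-by-5 MI matrix and the threshold are invented, and the random-graph comparison assumes an Erdős–Rényi graph with the same edge density, which may differ from the random graph model used in the paper.

```python
import math
from itertools import combinations


def build_graph(mi, threshold):
    """Adjacency sets: an edge joins j and k when MI exceeds the threshold."""
    n = len(mi)
    adj = {j: set() for j in range(n)}
    for j, k in combinations(range(n), 2):
        if mi[j][k] > threshold:
            adj[j].add(k)
            adj[k].add(j)
    return adj


def rank_by_degree(adj):
    """Positions sorted by decreasing degree (ties broken by index)."""
    return sorted(adj, key=lambda v: (-len(adj[v]), v))


def is_clique(adj, vertices):
    """True if every pair of the given vertices is joined by an edge."""
    return all(b in adj[a] for a, b in combinations(vertices, 2))


def degree_p_value(degree, n_vertices, edge_prob):
    """P(deg >= degree) for one vertex of an Erdos-Renyi G(n, p) graph."""
    trials = n_vertices - 1
    return sum(
        math.comb(trials, d) * edge_prob**d * (1 - edge_prob) ** (trials - d)
        for d in range(degree, trials + 1)
    )


# Invented MI matrix for five positions; 0, 1 and 2 are mutually coupled.
mi = [
    [0.00, 0.90, 0.80, 0.70, 0.00],
    [0.90, 0.00, 0.85, 0.10, 0.00],
    [0.80, 0.85, 0.00, 0.05, 0.10],
    [0.70, 0.10, 0.05, 0.00, 0.00],
    [0.00, 0.00, 0.10, 0.00, 0.00],
]
adj = build_graph(mi, threshold=0.5)
ranking = rank_by_degree(adj)
n_edges = sum(len(neighbors) for neighbors in adj.values()) // 2
edge_prob = n_edges / math.comb(len(adj), 2)

top = ranking[:3]
print(top, is_clique(adj, top))
print(degree_p_value(len(adj[top[0]]), len(adj), edge_prob))
```

Ranking by degree takes time proportional to the number of edges, and checking whether the top-ranked positions form a clique is a cheap pairwise test, so this sidesteps the hard search over all possible cliques.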
note: updated Mar 7 to correct MI formula
erratum, Dec 13, 2011: The number of human GPCRs is now thought to be fewer than a thousand.