Accurate prediction of protein–protein interactions from sequence alignments using a Bayesian method
Lukas Burger1 & Erik van Nimwegen1
1. Biozentrum, the University of Basel, and Swiss Institute of Bioinformatics, Basel, Switzerland
Correspondence to: Erik van Nimwegen, Biozentrum, the University of Basel, and Swiss Institute of Bioinformatics, Klingelbergstrasse 50/70, Basel 4056-CH, Switzerland. Tel.: +41 61 267 1576; Fax: +41 61 267 1584; Email: erik.vannimwegen@unibas.ch
Received 28 August 2007; Accepted 30 November 2007; Published online 12 February 2008
Article highlights
- We present a Bayesian network model to predict protein-protein interactions directly from amino acid sequences, without tunable parameters and without the need for any training examples.
- We successfully apply our method to both bacterial two-component systems and bacterial polyketide synthases, thereby demonstrating the high accuracy and generality of our method.
- Analysis of the predicted genome-wide two-component signaling networks shows that cognates (interacting kinase/regulator pairs which lie adjacent on the genome) and orphans (which lie isolated) form two relatively independent components of the signaling network in each genome.
- Whereas most two-component genes are predicted to have only a small number of interaction partners, we find that 10% of orphans form a separate class of 'hub' nodes that distribute and integrate signals to and from up to tens of different interaction partners.
Synopsis
A method that comprehensively and accurately predicts protein–protein interactions using only the amino-acid sequences of proteins would essentially allow the reconstruction of genome-wide protein interaction networks directly from genome sequences. Automated prediction of protein–protein interactions from amino-acid sequences is therefore one of the great outstanding challenges in computational biology. Numerous approaches have been proposed, but all suffer from serious drawbacks: some infer only functional relationships rather than direct physical interactions, others are difficult to scale up to large data sets, and most suffer from high false-positive rates (see Valencia and Pazos, 2002; Bork et al, 2004; Shoemaker and Panchenko, 2007 for reviews).
The new method presented here operates on sets of protein families for which it is known that members of one family interact with members of one or more of the other families. Multiple alignments of the sequences in each family are constructed, and the algorithm searches over all possible ways in which the proteins from different families can be paired up to form interacting pairs (see Figure 1). The best assignment of interacting pairs is, roughly speaking, the one that maximizes the statistical dependencies observed between amino acids of the interacting protein pairs. Although 'correlated mutations' have been used previously to infer protein–protein interactions (Pazos and Valencia, 2002), the current method presents several important methodological advances. First, a rigorous Bayesian network framework is used to derive the probability of each possible assignment from first principles, that is, without any tunable parameters. Second, instead of attempting to infer which residues interact, the method sums over all possible ways in which a tree of dependencies can be assigned to pairs of residues both within and between the interacting proteins. Third, the model does not require any training sets, but predicts interactions ab initio by searching the space of all possible ways in which proteins can be paired up to form interaction partners. Finally, the model assigns interaction partners for all proteins from multiple genomes in parallel, thereby maximizing the algorithm's ability to detect subtle sequence dependencies, and uses a Markov chain Monte Carlo sampling technique to automatically obtain a measure of the reliability of each prediction.
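The search over pairings and the sampling-based reliability measure can be illustrated with a toy sketch. It is not the authors' implementation: the score below is a simple summed mutual information between alignment columns, used as a stand-in for the Bayesian probability of an assignment, and all sequences are treated as one pool rather than being restricted to swaps within a genome. The function names (`mutual_information`, `score`, `sample_assignments`) are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(col_a, col_b):
    """Empirical mutual information (in nats) between two alignment columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())


def score(kinases, regulators, perm):
    """Stand-in score (assumption, not the paper's likelihood): number of
    sequences times the mutual information summed over all cross-family
    column pairs, with kinase i paired to regulator perm[i]."""
    paired = [regulators[j] for j in perm]
    total = 0.0
    for i in range(len(kinases[0])):
        col_k = [s[i] for s in kinases]
        for j in range(len(regulators[0])):
            col_r = [s[j] for s in paired]
            total += mutual_information(col_k, col_r)
    return len(kinases) * total


def sample_assignments(kinases, regulators, steps=2000, seed=0):
    """Metropolis sampling over pairings; returns, for each (kinase, regulator)
    index pair, the fraction of sampled assignments that contain it."""
    rng = random.Random(seed)
    n = len(kinases)
    perm = list(range(n))
    rng.shuffle(perm)
    current = score(kinases, regulators, perm)
    counts = Counter()
    for _ in range(steps):
        a, b = rng.sample(range(n), 2)
        perm[a], perm[b] = perm[b], perm[a]          # propose swapping two partners
        proposed = score(kinases, regulators, perm)
        if proposed >= current or rng.random() < math.exp(proposed - current):
            current = proposed                        # accept the swap
        else:
            perm[a], perm[b] = perm[b], perm[a]      # reject: undo the swap
        counts.update(enumerate(perm))                # record the current pairing
    return {pair: c / steps for pair, c in counts.items()}
```

In this sketch, a pair whose sampled frequency is close to 1 appears in almost every sampled assignment and would be read as a confident prediction, whereas a frequency spread over several partners signals an uncertain one.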
Figure 1
Illustration of the model used to assign a probability P(D|a) to the joint multiple sequence alignment D of two protein families given an assignment a of interaction partners between them. Sequences from the same genome have the same color and horizontally aligned sequences are assumed to interact. The probabilities of pairs of alignment columns (ij) depend on the number of times n_ij that each pair of amino acids occurs in the corresponding columns. A dependence tree T and the corresponding factorization of the probability P(D|a, T) of the entire alignment given the assignment and dependence tree are illustrated at the bottom of the figure.
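The factorization mentioned in the caption can be written out as follows. This is a sketch that assumes the standard form for tree-structured models; the exact expression in the paper may differ. Here D_i denotes alignment column i, the second product runs over the edges of the tree T, and the symbol R_ij and the tree prior P(T) are notation introduced for illustration only.

```latex
% Probability of the alignment given assignment a and dependence tree T:
% independent column terms, corrected by one dependence factor per tree edge.
P(D \mid a, T) \;=\; \prod_{i} P(D_i) \prod_{(i,j) \in T} R_{ij},
\qquad
R_{ij} \;=\; \frac{P(D_i, D_j)}{P(D_i)\, P(D_j)}

% Summing over all dependence trees removes the dependence on any single tree:
P(D \mid a) \;=\; \sum_{T} P(D \mid a, T)\, P(T)
```

The assignment a enters through the pair terms P(D_i, D_j) for columns i and j that belong to different families, because the pairing determines which rows of the two alignments are placed side by side.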
Applications to two sets of interacting bacterial protein families are presented in the paper: two-component systems (TCSs) and polyketide synthases (PKSs). TCSs are responsible for most of the signal transduction underlying complex bacterial behaviors (Stock et al, 2000). They typically consist of a membrane-bound sensory protein with an intracellular histidine kinase domain that, upon activation, specifically phosphorylates a receiver domain on a regulator protein, which in turn is typically activated by this phosphorylation. The method was applied to comprehensively reconstruct two-component signaling networks across all sequenced bacteria, involving thousands of kinase and regulator proteins. Comparisons of the predictions with known interactions show that the method infers interaction partners genome-wide with high accuracy. The second application, to a data set of PKSs (Thattai et al, 2007), demonstrates the generality of the method and shows that it can also accurately predict interaction partners in much smaller data sets.
Analysis of the predicted genome-wide two-component signaling networks reveals several interesting features. Two-component system genes can be divided into two classes: cognates are interacting kinase/regulator pairs that lie adjacent on the genome and are co-transcribed, and orphans are kinases and regulators that lie isolated on the genome. First, the analysis shows that cognates and orphans form two relatively independent groups, with cognates interacting predominantly with cognates and orphans predominantly with orphans.
Second, we find a difference between the two groups in the interaction propensity of their kinases and regulators. Whereas most kinases and regulators interact with only a few partners, about 10% interact with a large number of partners. Most of these 'hub' kinases and regulators are orphans (see Figure 7). The kinases in this class thus distribute a signal to a large number of downstream regulators, and the regulators in this class integrate a large number of input signals.
Figure 7
Reverse cumulative connectivity distributions of kinases (left panel) and receivers (right panel). The fraction of genes with at least a given number of interaction partners (connectivity) is shown as a function of the connectivity. Cognates are shown in red and orphans in blue. The vertical axis is shown on a logarithmic scale.
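The quantity plotted in Figure 7 can be computed as in the following minimal sketch. The function name `reverse_cumulative` and the example connectivities are illustrative and are not taken from the paper's data.

```python
from collections import Counter


def reverse_cumulative(degrees):
    """Return sorted (k, fraction) pairs, where fraction is the share of
    genes whose connectivity is at least k."""
    n = len(degrees)
    counts = Counter(degrees)
    remaining = n                      # genes with connectivity >= current k
    result = []
    for k in sorted(counts):
        result.append((k, remaining / n))
        remaining -= counts[k]         # drop genes with exactly k partners
    return result


# Illustrative connectivities for four hypothetical genes.
print(reverse_cumulative([1, 1, 2, 5]))
# [(1, 1.0), (2, 0.5), (5, 0.25)]
```

Plotting the fractions on a logarithmic vertical axis, as in the figure, makes a small class of highly connected 'hub' genes visible as a long tail.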
Finally, the Bayesian network model used in this study is a powerful generalization of the weight matrix model that is widely used to describe motifs in both DNA and protein sequences. Importantly, recent results in Bayesian network theory (Meilá and Jaakkola, 2006) have shown that calculations with this more sophisticated 'dependence tree' model are computationally tractable, and our application to predicting protein–protein interactions shows that such models can be applied in a realistic bioinformatic setting involving large amounts of data. We believe that 'dependence tree' models may have very general applications in bioinformatics, for example, for the characterization of protein domains or for improving multiple alignments of protein domains and families.
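One standard route to this kind of tractability is the matrix-tree theorem, which reduces a sum over all spanning trees to a single determinant. The sketch below assumes a symmetric matrix of non-negative edge weights (for example, one dependence factor per pair of alignment columns) and is not the authors' code; the function names are hypothetical.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [list(map(float, row)) for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det                 # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def sum_over_spanning_trees(weights):
    """Sum, over all spanning trees of the complete graph, of the product of
    the edge weights in the tree (matrix-tree theorem). `weights` is a
    symmetric n x n matrix; diagonal entries are ignored."""
    n = len(weights)
    laplacian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                laplacian[i][j] = -weights[i][j]
                laplacian[i][i] += weights[i][j]
    # Any cofactor of the weighted Laplacian gives the sum; drop row and column 0.
    minor = [row[1:] for row in laplacian[1:]]
    return determinant(minor)


# Three columns with edge weights 2, 3 and 5: the three possible trees
# contribute 2*3 + 2*5 + 3*5 = 31.
print(sum_over_spanning_trees([[0, 2, 3], [2, 0, 5], [3, 5, 0]]))
```

The determinant costs on the order of n^3 operations for n columns, whereas the number of trees grows as n^(n-2), which is why explicit enumeration is not an option for realistic alignments.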
References
- Bork P, Jensen L, von Mering C, Ramani A, Lee I, Marcotte E (2004) Protein interaction networks from yeast to human. Curr Opin Struct Biol 14: 292–299
- Meilá M, Jaakkola T (2006) Tractable Bayesian learning of tree belief networks. Stat Comput 16: 77–92
- Pazos F, Valencia A (2002) In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins 47: 219–227
- Shoemaker BA, Panchenko AR (2007) Deciphering protein–protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput Biol 3: e43
- Stock A, Robinson V, Goudreau P (2000) Two-component signal transduction. Annu Rev Biochem 69: 183–215
- Thattai M, Burak Y, Shraiman BI (2007) The origins of specificity in polyketide synthase protein interactions. PLoS Comput Biol 3: e186
- Valencia A, Pazos F (2002) Computational methods for the prediction of protein interactions. Curr Opin Struct Biol 12: 368–373


