In silico prediction of physical protein interactions and characterization of interactome orphans

Abstract

Protein-protein interactions (PPIs) are useful for understanding signaling cascades, predicting protein function, associating proteins with disease and fathoming drug mechanism of action. Currently, only 10% of human PPIs may be known, and about one-third of human proteins have no known interactions. We introduce FpClass, a data mining–based method for proteome-wide PPI prediction. At an estimated false discovery rate of 60%, we predicted 250,498 PPIs among 10,531 human proteins; 10,647 PPIs involved 1,089 proteins without known interactions. We experimentally tested 233 high- and medium-confidence predictions and validated 137 interactions, including seven novel putative interactors of the tumor suppressor p53. Compared to previous PPI prediction methods, FpClass achieved better agreement with experimentally detected PPIs. We provide an online database of annotated PPI predictions (http://ophid.utoronto.ca/fpclass/) and the prediction software (http://www.cs.utoronto.ca/~juris/data/fpclass/).

Figure 1: FpClass workflow.
Figure 2: Evaluation of FpClass using experimental PPI data sets.
Figure 3: Experimental validation of FpClass predictions.
Figure 4: Properties of d0 (orphan) proteins and genes.

Change history

  • 10 December 2014

    In the version of this article initially published online, an author (G.B.M.) was incorrectly listed twice. The error has been corrected for the print, PDF and HTML versions of this article.

References

  1. Cusick, M.E. et al. Literature-curated protein interaction datasets. Nat. Methods 6, 39–46 (2009).
  2. Pastrello, C. et al. Integration, visualization and analysis of human interactome. Biochem. Biophys. Res. Commun. 445, 757–773 (2014).
  3. Bork, P. et al. Protein interaction networks from yeast to human. Curr. Opin. Struct. Biol. 14, 292–299 (2004).
  4. Stumpf, M.P. et al. Estimating the size of the human interactome. Proc. Natl. Acad. Sci. USA 105, 6959–6964 (2008).
  5. Venkatesan, K. et al. An empirical framework for binary interactome mapping. Nat. Methods 6, 83–90 (2009).
  6. Edwards, A.M. et al. Too many roads not taken. Nature 470, 163–165 (2011).
  7. Braun, P. et al. An experimentally derived confidence score for binary protein-protein interactions. Nat. Methods 6, 91–97 (2009).
  8. Brückner, A., Polge, C., Lentze, N., Auerbach, D. & Schlattner, U. Yeast two-hybrid, a powerful tool for systems biology. Int. J. Mol. Sci. 10, 2763–2788 (2009).
  9. Wodak, S.J., Pu, S., Vlasblom, J. & Séraphin, B. Challenges and rewards of interaction proteomics. Mol. Cell. Proteomics 8, 3–18 (2009).
  10. Schwartz, A.S., Yu, J., Gardenour, K.R., Finley, R.L. Jr. & Ideker, T. Cost-effective strategies for completing the interactome. Nat. Methods 6, 55–61 (2009).
  11. Rhodes, D.R. et al. Probabilistic model of the human protein-protein interaction network. Nat. Biotechnol. 23, 951–959 (2005).
  12. Scott, M.S. & Barton, G.J. Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics 8, 239 (2007).
  13. Kim, J.H. & Pearl, J. in Proc. IJCAI 190–193 (Morgan Kaufmann, 1983).
  14. Petschnigg, J. et al. The mammalian-membrane two-hybrid assay (MaMTH) for probing membrane-protein interactions in human cells. Nat. Methods 11, 585–592 (2014).
  15. Elefsinioti, A. et al. Large-scale de novo prediction of physical protein-protein association. Mol. Cell. Proteomics 10, M111.010629 (2011).
  16. Zhang, Q.C. et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490, 556–560 (2012).
  17. D'haeseleer, P. & Church, G.M. in Proc. IEEE Comput. Syst. Bioinform. Conf. 216–223 (IEEE, 2004).
  18. Kang, H.S. et al. NABP1, a novel RORg-regulated gene encoding a single-stranded nucleic-acid-binding protein. Biochem. J. 397, 89–99 (2006).
  19. Krokeide, S.Z. et al. Human NEIL3 is mainly a monofunctional DNA glycosylase removing spiroimindiohydantoin and guanidinohydantoin. DNA Repair (Amst.) 12, 1159–1164 (2013).
  20. Menendez, D., Inga, A. & Resnick, M.A. The expanding universe of p53 targets. Nat. Rev. Cancer 9, 724–737 (2009).
  21. Wang, W. et al. Identification of rare DNA variants in mitochondrial disorders with improved array-based sequencing. Nucleic Acids Res. 39, 44–58 (2011).
  22. Vaseva, A.V. & Moll, U.M. The mitochondrial p53 pathway. Biochim. Biophys. Acta 1787, 414–420 (2009).
  23. Gordon, S., Akopyan, G., Garban, H. & Bonavida, B. Transcription factor YY1: structure, function, and therapeutic implications in cancer biology. Oncogene 25, 1125–1142 (2006).
  24. Tanikawa, C. et al. Regulation of protein citrullination through p53/PADI4 network in DNA damage response. Cancer Res. 69, 8761–8769 (2009).
  25. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
  26. Hunter, S. et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 37, D211–D215 (2009).
  27. Imming, P., Sinning, C. & Meyer, A. Drugs, their targets and the nature and number of drug targets. Nat. Rev. Drug Discov. 5, 821–834 (2006).
  28. Su, A.I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. USA 101, 6062–6067 (2004).
  29. Roth, R.B. et al. Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 7, 67–80 (2006).
  30. Wang, E.T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).
  31. Krupp, M. et al. RNA-Seq Atlas—a reference database for gene expression profiling in normal tissue by next-generation sequencing. Bioinformatics 28, 1184–1185 (2012).
  32. Uhlen, M. et al. Towards a knowledge-based Human Protein Atlas. Nat. Biotechnol. 28, 1248–1250 (2010).
  33. The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 38, D142–D148 (2010).
  34. Brown, K.R. & Jurisica, I. Unequal evolutionary conservation of human protein interactions in interologous networks. Genome Biol. 8, R95 (2007).
  35. Piccinin, S. et al. A “twist box” code of p53 inactivation: twist box: p53 interaction promotes p53 degradation. Cancer Cell 22, 404–415 (2012).
  36. Hupp, T.R., Hayward, R.L. & Vojtesek, B. Strategies for p53 reactivation in human sarcoma. Cancer Cell 22, 283–285 (2012).
  37. Sprinzak, E. & Margalit, H. Correlated sequence-signatures as markers of protein-protein interaction. J. Mol. Biol. 311, 681–692 (2001).
  38. Zhang, Y. et al. Systematic analysis, comparison, and integration of disease based human genetic association data and mouse genetic phenotypic information. BMC Med. Genomics 3, 1 (2010).
  39. Osborne, J.D. et al. Annotating the human genome with Disease Ontology. BMC Genomics 10 (suppl. 1), S6 (2009).
  40. Davis, A.P. et al. The Comparative Toxicogenomics Database: update 2011. Nucleic Acids Res. 39, D1067–D1072 (2011).
  41. Kotlyar, M., Fortney, K. & Jurisica, I. Network-based characterization of drug-regulated genes, drug targets, and toxicity. Methods 57, 499–507 (2012).
  42. Maglott, D., Ostell, J., Pruitt, K.D. & Tatusova, T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 35, D26–D31 (2007).
  43. Hedges, S.B., Dudley, J. & Kumar, S. TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics 22, 2971–2972 (2006).
  44. Toll-Riera, M. et al. Origin of primate orphan genes: a comparative genomics approach. Mol. Biol. Evol. 26, 603–612 (2009).
  45. Barshir, R. et al. The TissueNet database of human tissue protein-protein interactions. Nucleic Acids Res. 41, D841–D844 (2013).
  46. Birzele, F., Gewehr, J.E. & Zimmer, R. AutoPSI: a database for automatic structural classification of protein sequences and structures. Nucleic Acids Res. 36, D398–D401 (2008).
  47. Ward, J.J., Sodhi, J.S., McGuffin, L.J., Buxton, B.F. & Jones, D.T. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J. Mol. Biol. 337, 635–645 (2004).

Acknowledgements

This research was supported by the grants from Genome Canada via the Ontario Genomics Institute, Ontario Research Fund (GL2-01-030, RE-03-020 to I.J.), the Canadian Institutes for Health Research (#99745, #93579 to I.J., A.J.), the Natural Sciences Research Council (#203475 to I.J.), US Army Department of Defense W81XWH-12-1-0501 (to I.J.), the Italian Association for Cancer Research, the Friuli Venezia-Giulia and CRO 5xmille Intramural Grant (to R.M.), the Friuli Venezia-Giulia Exchange Program (to C.P.), the Ontario Genomics Institute (#303547 to I.S.), the Canadian Institutes of Health Research (Catalyst-NHG99091, PPP-125785 to I.S.), the Canadian Cystic Fibrosis Foundation (#300348 to I.S.), the Canadian Cancer Society (2010-700406 to I.S.), Genentech and University Health Network (GL2-01-018 to I.S.), US National Institutes of Health (NIH) PO1/PPG grant 01CA0099031 (to G.B.M., I.J.) and NCI R21 CA126700 (to Z.D., G.B.M.). Computational resources were supported by grants from the Canada Foundation for Innovation (CFI #12301, #203373, #29272, #22540a, #30865) and IBM (to I.J.). I.J. is supported by the Canada Research Chair program. This research was also supported by the University of Toronto McLaughlin Centre and the Ontario Ministry of Health and Long-Term Care (OMOHLTC). The views expressed do not necessarily reflect those of the OMOHLTC. We thank M. Vidal, D. Hill, F. Roth and the Center for Cancer Systems Biology (Dana-Farber Cancer Institute) for prepublication release of protein interaction data, funded by NIH NHGRI grant R01 HG001715.

Author information

M.K. and I.J. conceived of the project. M.K. developed the algorithm and executed computational analyses and validation. Additional validation and assay-related analyses were executed by C.P., C.C., Y.N., F.V. and F.B.-C. R.M., I.S., A.J. and G.B.M. provided guidance for biological validation experiments that were executed by F.P., A.L.S., H.L., C.P., T.N. and Z.D. M.K. and I.J. wrote the initial manuscript, and all authors were involved in results presentation, discussion and preparation of the final manuscript.

Correspondence to Igor Jurisica.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Estimating FDR of PPI predictions.

(a-b) We used the approach of D'haeseleer and Church (ref. 1) to estimate FDR. This approach calculates the FDR of a PPI dataset, D, by analyzing intersections among three PPI datasets, D, R and D′, where R is a reference set of trusted PPIs and D′ is a set of PPIs from a method similar to that of D. It is assumed that the overlap of any two datasets contains largely true positive PPIs. The number of non-overlapping true positives, IV, is calculated from the numbers of shared PPIs: IV = (II × III) / I. Then, the number of false positives, V, and the FDR are calculated. The FDR tends to be low if D has a high overlap with either D′ or R.

(c) To calculate the FDR of FpClass, we initially set D to our top 35,000 proteome-wide predictions, excluding any PPIs used in training (we subsequently calculated FDR for larger sets of FpClass predictions; panels d-g). We defined R as a set of experimentally detected interactions and D′ as the union of high-confidence predictions from previous studies by Rhodes et al., 2005, Scott et al., 2007, Elefsinioti et al., 2011, and Zhang et al., 2012. Using a similar approach, we calculated FDRs for high-confidence predictions from these previous studies. For example, to calculate the FDR for Rhodes et al., we defined D as high-confidence predictions from that study, and D′ as the union of top FpClass predictions and high-confidence predictions from the three remaining previous studies. To ensure that estimated FDRs were not due to biases of a particular reference set, we repeated FDR calculations using six reference sets. We calculated FDRs using each reference set, except when the intersection of datasets D, D′ and R comprised fewer than 5 PPIs. In such cases the FDR is indicated as NA.

(d-g) Using the approach of D'haeseleer and Church, we estimated FDRs of predicted networks of various sizes from FpClass and four previous prediction methods. The approach of D'haeseleer and Church requires a trusted reference set of PPIs. We tried four ways of defining this set: (d) using six reference sets (panel c) individually, and then calculating the median of the six resulting FDR estimates, (e) using the union of PPIs from methods that detect direct interactions (Y2H and LUMIER reference sets), (f) using the union of our six reference sets, and (g) using the union of Y2H reference sets.

1. D'haeseleer, P. & Church, G. M. Estimating and improving protein interaction error rates. Proc IEEE Comput Syst Bioinform Conf 216–223 (2004).
2. Rhodes, D. R. et al. Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 23, 951–959 (2005).
3. Scott, M. S. & Barton, G. J. Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics 8, 239 (2007).
4. Elefsinioti, A. et al. Large-scale de novo prediction of physical protein-protein association. Mol Cell Proteomics 10, M111.010629 (2011).
5. Zhang, Q. C. et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490, 556–560 (2012).
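The estimate described for Supplementary Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the FpClass code. The caption gives only IV = (II × III) / I, so the region definitions used here are assumptions read off that formula: I is taken as the three-way overlap of D, D′ and R, II and III as the two pairwise-only overlaps involving D, and V as the part of D left over after removing I to IV. The names `pair` and `estimate_fdr` and the parameter `min_triple_overlap` are hypothetical.

```python
def pair(a, b):
    """Represent an undirected protein pair so that (a, b) equals (b, a)."""
    return frozenset((a, b))


def estimate_fdr(d, d_prime, r, min_triple_overlap=5):
    """Estimate the FDR of dataset d from its overlaps with d_prime and r.

    Assumed region layout (not spelled out in the caption):
      I   = pairs shared by d, d_prime and r
      II  = pairs shared by d and r only
      III = pairs shared by d and d_prime only
      IV  = estimated true positives found only in d, IV = II * III / I
      V   = remaining pairs found only in d, treated as false positives
    Returns None (reported as NA) when region I has fewer than
    min_triple_overlap pairs.
    """
    d, d_prime, r = set(d), set(d_prime), set(r)
    region_i = len(d & d_prime & r)
    if region_i < min_triple_overlap:
        return None
    region_ii = len((d & r) - d_prime)
    region_iii = len((d & d_prime) - r)
    d_only = len(d - d_prime - r)
    region_iv = region_ii * region_iii / region_i
    # Clamp at zero: the estimate of IV can exceed the observed d-only count.
    region_v = max(d_only - region_iv, 0.0)
    return region_v / len(d)
```

For a toy dataset D of 100 pairs with 10 pairs in the three-way overlap, 10 and 20 pairs in the two pairwise-only overlaps and 60 pairs unique to D, the sketch gives IV = 20, V = 40 and an estimated FDR of 0.4.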

Supplementary Figure 2 Experimental validation of PPI predictions.

(a) Predicted interactions tested by Co-IP assays. (b-c) Predicted interactions tested by GST pull-down assays. (d) Predicted interaction partners of p53 include some of its known partners and d0 proteins. The x-axis indicates the number of top predicted partners, ranked from 1 to 2377. The y-axis indicates the number of known partners and d0 proteins among the top predicted partners.

Supplementary Figure 3 Top Gene Ontology (GO) categories among d0 genes.

(a-c) GO analysis includes genes without GO annotations. (d-f) GO analysis excludes genes without GO annotations. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR.
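As a rough illustration of the statistics named in this caption, the sketch below computes a one-sided hypergeometric enrichment P-value and adjusts a list of P-values for multiple testing. The caption does not say which FDR procedure was used, so the Benjamini-Hochberg step-up adjustment shown here is an assumption, as is the choice of the upper tail (enrichment). The function names and parameters are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement from a
    population containing `successes` annotated items."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order.
    (Assumed procedure; the caption only says 'adjusted ... using FDR'.)"""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

In a GO setting, `population` would be the number of genes considered, `successes` the genes annotated with a category, `draws` the number of d0 genes, and `k` the d0 genes carrying that annotation. One P-value per category is then passed to `benjamini_hochberg`.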

Supplementary Figure 4 Percentages of d0 proteins in drug-target classes and structural properties of d0 proteins.

(a) Main drug target classes and (b) receptor drug target classes, as defined by Imming et al. (ref. 6). Dashed lines indicate the percentage of d0 proteins in the proteome. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR. (c) SCOP (ref. 7) structural classes. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR. (d) Protein lengths from UniProt (ref. 8) and (e) protein disorder, predicted with DISOPRED (ref. 9). P-values for protein length and disorder were calculated by two-sided Mann-Whitney U tests.

6. Imming, P., Sinning, C. & Meyer, A. Drugs, their targets and the nature and number of drug targets. Nat Rev Drug Discov 5, 821–834 (2006).
7. Andreeva, A. et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 36, D419–25 (2008).
8. The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38, D142–8 (2010).
9. Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F. & Jones, D. T. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337, 635–645 (2004).

Supplementary Figure 5 Median and maximum expression of d0- and dk-encoding genes.

P-values were calculated by two-sided Mann-Whitney U tests. (a-d) Median expression of d0 and dk genes in healthy human tissues. Gene expression data were taken from (a) Su et al., 2004 (ref. 10), (b) Roth et al., 2006 (ref. 11), (c) Wang et al., 2008 (ref. 12), and (d) Krupp et al., 2012 (ref. 13). (e-h) Maximum expression of d0 and dk genes in the same datasets.

10. Su, A. I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101, 6062–6067 (2004).
11. Roth, R. B. et al. Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 7, 67–80 (2006).
12. Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).
13. Krupp, M. et al. RNA-Seq Atlas--a reference database for gene expression profiling in normal tissue by next-generation sequencing. Bioinformatics 28, 1184–1185 (2012).
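The two-sided Mann-Whitney U test used for these comparisons can be sketched with the Python standard library alone. The version below uses the normal approximation with a tie correction and no continuity correction. That choice is an assumption made for illustration, since the caption does not state which implementation or variant the authors used. The function name `mann_whitney_u_two_sided` is hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def mann_whitney_u_two_sided(x, y):
    """Return (U statistic for x, two-sided P-value).

    Uses the normal approximation with tie correction (an assumed variant);
    for very small samples an exact test would be preferable.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2
    counts = Counter(list(x) + list(y))
    # Tied values share the average of the ranks they occupy.
    rank_of = {}
    next_rank = 1
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = next_rank + (t - 1) / 2
        next_rank += t
    u_x = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    tie_term = sum(t**3 - t for t in counts.values()) / (n * (n - 1))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term)
    if var_u <= 0:
        # All observations identical: no evidence of a difference.
        return u_x, 1.0
    z = (u_x - mean_u) / sqrt(var_u)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, min(p, 1.0)
```

Here `x` and `y` would hold, for example, the median expression values of d0 and dk genes in one dataset.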

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–5, Supplementary Tables 1–10 and Supplementary Note (PDF 2171 kb)

Supplementary Software

FpClass code (ZIP 239668 kb)

Supplementary Data 1

Positive cases in our largest training set. (TXT 279 kb)

Supplementary Data 2

Predicted probabilities for protein pairs from Braun et al., 2009. (TXT 8 kb)

Supplementary Data 3

Cross-validation data: protein pairs and predicted probabilities. (TXT 48197 kb)

Supplementary Data 4

Experimentally tested predictions. (TXT 12 kb)

Supplementary Data 5

Fp60 network: predicted interactions with estimated FDR of 60%. (TXT 8784 kb)

Supplementary Data 6

D0 proteins: human proteins without experimentally detected interactions in I2D 1.95. (TXT 49 kb)

Cite this article

Kotlyar, M., Pastrello, C., Pivetta, F. et al. In silico prediction of physical protein interactions and characterization of interactome orphans. Nat Methods 12, 79–84 (2015). https://doi.org/10.1038/nmeth.3178
