We present Interactome INSIDER, a tool to link genomic variant information with structural protein–protein interactomes. Underlying this tool is the application of machine learning to predict protein interaction interfaces for 185,957 protein interactions with previously unresolved interfaces in human and seven model organisms, including the entire experimentally determined human binary interactome. Predicted interfaces exhibit functional properties similar to those of known interfaces, including enrichment for disease mutations and recurrent cancer mutations. Through 2,164 de novo mutagenesis experiments, we show that mutations of predicted and known interface residues disrupt interactions at a similar rate and much more frequently than mutations outside of predicted interfaces. To spur functional genomic studies, Interactome INSIDER (http://interactomeinsider.yulab.org) enables users to identify whether variants or disease mutations are enriched in known and predicted interaction interfaces at various resolutions. Users may explore known population variants, disease mutations, and somatic cancer mutations, or they may upload their own set of mutations for this purpose.
The authors would like to thank G. Hooker, D. Bindel, and K. Weinberger for helpful discussions and J. VanEe for technical support. This work was supported by National Institute of General Medical Sciences grants (R01 GM097358, R01 GM104424, R01 GM124559); National Cancer Institute grant (R01 CA167824); Eunice Kennedy Shriver National Institute of Child Health and Human Development grant (R01 HD082568); National Human Genome Research Institute grant (UM1 HG009393); National Science Foundation grant (DBI-1661380); and Simons Foundation Autism Research Initiative grant (367561) to H.Y.
The authors declare no competing financial interests.
Integrated supplementary information
(a) A schematic showing the five feature categories from which feature sets are optimized to train ECLAIR. (b) The portions of high-quality binary interactomes for which each feature type is available. (c) Feature aggregation strategies employed for combining multiple points of evidence into single co-evolution- and structure-based features. For co-evolution, we select the top co-evolved residue, the mean of features for the top 10 co-evolved residues, or the mean over all co-evolved residues in the partner protein. For proteins with multiple structures, we take the mean, minimum, or maximum SASA over all available structures.
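The aggregation strategies in panel (c) can be sketched in a few lines. This is a minimal illustration, not the ECLAIR implementation: the function names, the dictionary keys, and the assumption that a larger score means stronger co-evolution are all hypothetical.

```python
from statistics import mean


def aggregate_coevolution(scores, top_n=10):
    """Summarise one residue's co-evolution scores against partner residues.

    Assumption (not from the source): a larger score means stronger
    co-evolution, so "top" is the maximum score.
    """
    if not scores:
        raise ValueError("need at least one co-evolution score")
    ranked = sorted(scores, reverse=True)
    return {
        "top": ranked[0],                    # single top co-evolved residue
        "mean_top_n": mean(ranked[:top_n]),  # mean of the top 10 by default
        "mean_all": mean(ranked),            # mean over all partner residues
    }


def aggregate_sasa(sasa_values):
    """Summarise one residue's SASA across all available structures."""
    if not sasa_values:
        raise ValueError("need at least one SASA value")
    return {
        "mean": mean(sasa_values),
        "min": min(sasa_values),
        "max": max(sasa_values),
    }
```

When a residue has fewer than `top_n` partner scores, `mean_top_n` simply equals `mean_all`.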
Balance between testing/training and prediction sets of sequence- and structure-based feature depths. (a) Sources (PDB or ModBase) and number of structures used to calculate solvent-accessible surface area. (b) Number of homologous sequences used to calculate evolutionary features. (c) Sources of docked models for calculating docking-based features.
A comparison of (1) imputation and (2) an ensemble of fully-trained classifiers for handling missing data. During training, imputation must fill in gaps in feature coverage, whereas an ensemble trains independent classifiers on each feature-availability scenario. Since structural feature coverage is highly correlated with the existence of known interface residues in training, imputation will fail to predict interface residues outside of regions with structural feature coverage (red). An ensemble will predict interface residues based only on the features available and will not be biased by the missing structural feature.
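The ensemble idea can be illustrated with a toy dispatcher that trains one model per feature-availability pattern and routes each residue to the model matching the features it actually has. Everything here is an assumption made for illustration: the names `availability_key`, `train_ensemble`, and `predict_interface`, the feature names, the toy threshold model, and the choice to group training rows by exact availability pattern. ECLAIR itself may partition its training data differently.

```python
from collections import defaultdict


class MeanThresholdModel:
    """Toy stand-in for a real classifier (hypothetical, not ECLAIR's).

    Scores a row by its mean feature value and thresholds midway between
    the two class means seen in training. Assumes interface residues
    (label 1) tend to have higher feature values.
    """

    def fit(self, rows, labels):
        pos = [sum(r) / len(r) for r, y in zip(rows, labels) if y == 1]
        neg = [sum(r) / len(r) for r, y in zip(rows, labels) if y == 0]
        if not pos or not neg:
            raise ValueError("each availability group needs both classes")
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, row):
        return int(sum(row) / len(row) > self.threshold)


def availability_key(sample):
    """Names of the features that are present (not None) for this residue."""
    return tuple(sorted(k for k, v in sample.items() if v is not None))


def train_ensemble(samples, labels):
    """Train one independent model per feature-availability pattern."""
    groups = defaultdict(lambda: ([], []))
    for sample, label in zip(samples, labels):
        key = availability_key(sample)
        rows, ys = groups[key]
        rows.append([sample[k] for k in key])
        ys.append(label)
    return {key: MeanThresholdModel().fit(rows, ys)
            for key, (rows, ys) in groups.items()}


def predict_interface(ensemble, sample):
    """Use only the sub-model trained on the features this residue has."""
    key = availability_key(sample)
    return ensemble[key].predict([sample[k] for k in key])
```

Because no value is ever filled in for a missing feature, a residue without structural data is scored by a model that never saw structural data, which is the property the comparison highlights.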
(a) Training the ECLAIR classifier. (b) Four methods for optimizing machine learning algorithm hyperparameters, showing the order of trials and granularity of hyperparameter sampling spaces for optimizing two hyperparameters. (c) Cross-validation strategy using TPE to optimize hyperparameters and window sizes for both feature selection and ensemble classifier training. (d) Cross-validation results using TPE trials to select top performing feature or set of features (in red) in each feature category. (e) Comparison of four hyperparameter optimization methods’ performance (top panel) and hyperparameter and residue window sampling patterns (bottom panels) on one of the eight sub-classifiers of the ECLAIR ensemble.
(a) Number of residues predicted in each prediction confidence category. (b) Cumulative distribution of interactions with ≥ n residues classified as interface for each of the highest interface potential categories.
(a) Receiver operating characteristic (ROC) curves for each sub-classifier. (b) Precision-recall curves for each sub-classifier. (c) Distribution of raw prediction scores for each sub-classifier. For all panels, sub-classifiers plotted in blue used only sequence-based features; sub-classifiers in red used additional structure-based features. (d) Raw prediction scores compared to actual probabilities of residues in each bin to be at the interface.
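The comparison in panel (d) amounts to a calibration table: bin residues by raw score and measure the fraction in each bin that are true interface residues. The sketch below assumes raw scores lie in [0, 1] and uses equal-width bins; the function name `calibration_table`, the bin count, and the binning scheme are illustrative assumptions rather than details taken from the paper.

```python
def calibration_table(scores, labels, n_bins=10):
    """Return (bin_low, bin_high, n_residues, observed_fraction) per bin.

    scores: raw prediction scores, assumed to lie in [0, 1].
    labels: 1 for a known interface residue, 0 otherwise.
    Empty bins are omitted from the result.
    """
    counts = [0] * n_bins
    positives = [0] * n_bins
    for score, label in zip(scores, labels):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores are assumed to lie in [0, 1]")
        # A score of exactly 1.0 is placed in the last bin.
        i = min(int(score * n_bins), n_bins - 1)
        counts[i] += 1
        positives[i] += label
    return [
        (i / n_bins, (i + 1) / n_bins, counts[i], positives[i] / counts[i])
        for i in range(n_bins)
        if counts[i]
    ]
```

A well-calibrated classifier would show observed fractions close to the midpoint of each score bin.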
Supplementary Figure 7 ROC and precision-recall curves comparing ECLAIR with other popular interface residue prediction methods.
Here, only known surface residues were used to benchmark all methods. Every method has a slightly lower AUROC, since distinguishing interface from non-interface residues is harder among surface residues alone; however, ECLAIR still performs as well as or better than all tested methods.
Supplementary Figure 8 Genomic properties of predicted interface residues in interactions lacking structural features.
(a) Enrichment of disease mutations in predicted and known interfaces. (b) Enrichment of recurrent cancer mutations in predicted and known interfaces. (c) Enrichment of rare and common population variants in predicted and known interfaces. (d) Predicted deleteriousness of population variants in known and predicted interfaces (using PolyPhen-2). (e) Predicted effects of population variants in known and predicted interfaces (using EVmutation). (In a-b, significance determined by two-sided Z-test. In d-e, significance determined by a two-sided U-test. n.s. denotes not significant)
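One common way to quantify this kind of enrichment is to compare the per-residue mutation rate inside interfaces with the rate outside, and to assess the difference with a two-sided two-proportion Z-test. The caption does not specify the exact test statistic used, so the formulation below, the function name `enrichment_z_test`, and its parameters are assumptions made for illustration.

```python
import math


def enrichment_z_test(mut_in, res_in, mut_out, res_out):
    """Two-sided two-proportion Z-test on per-residue mutation rates.

    mut_in / res_in:   mutated and total residues inside interfaces.
    mut_out / res_out: mutated and total residues outside interfaces.
    Returns (enrichment_ratio, z, two_sided_p_value).
    This pooled-variance form is an assumption, not the paper's stated method.
    """
    if mut_out == 0:
        raise ValueError("enrichment ratio is undefined when mut_out is 0")
    p_in = mut_in / res_in
    p_out = mut_out / res_out
    pooled = (mut_in + mut_out) / (res_in + res_out)
    if pooled >= 1.0:
        raise ValueError("test is undefined when every residue is mutated")
    se = math.sqrt(pooled * (1 - pooled) * (1 / res_in + 1 / res_out))
    z = (p_in - p_out) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_in / p_out, z, p_value
```

An enrichment ratio above 1 with a small p-value would indicate that mutations fall in interfaces more often than expected from the rate elsewhere in the protein.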
(a) Enrichment of disease mutations in predicted and known interfaces. (b) Enrichment of recurrent cancer mutations in predicted and known interfaces. (c) Enrichment of rare and common population variants in predicted and known interfaces. (d) Predicted deleteriousness of population variants in known and predicted interfaces (using PolyPhen-2). (e) Predicted effects of population variants in known and predicted interfaces (using EVmutation). (In a-b, significance determined by two-sided Z-test. In d-e, significance determined by a two-sided U-test)
(a-c) Precision-recall curves for interfaces predicted with ECLAIR: (a) interface residues in all benchmarked interactions, (b) interface residues in interactions lacking structural features, and (c) interface domains in interactions lacking structural features. (d) Fraction of interface residues localized to domains for known interface residues in co-crystallized co-bound proteins, predicted interface residues in interactions with structural features, and predicted interface residues in interactions without structural features. (e) Enrichment of human disease mutations in domains determined by known interface residues in co-crystallized co-bound proteins, predicted interface residues in interactions with structural features, and predicted interface residues in interactions without structural features. (Significance determined by two-sided Z-test.)
Supplementary Figures 1–11 and Supplementary Notes 1–7 (PDF 3047 kb)
Comparison of ECLAIR using docking benchmark 4.0 (XLSX 12 kb)
PSI-MI binary evidence codes (XLSX 14 kb)
Training and Testing Sets (XLSX 141 kb)
Feature Selection (XLSX 16 kb)
Full sub-classifier training (XLSX 10 kb)
Comparison of ECLAIR performance with and without co-evolution (XLSX 11 kb)
ECLAIR prediction category performance using docking benchmark 4.0 (XLSX 9 kb)
Initially-trained ECLAIR vs. fully-trained ECLAIR performance (XLSX 11 kb)
ECLAIR software (ZIP 127 kb)
About this article
Cite this article
Meyer, M., Beltrán, J., Liang, S. et al. Interactome INSIDER: a structural interactome browser for genomic studies. Nat Methods 15, 107–114 (2018). https://doi.org/10.1038/nmeth.4540