EPIC: software toolkit for elution profile-based inference of protein complexes

Abstract

Protein complexes are key macromolecular machines of the cell, but their description remains incomplete. We and others previously reported an experimental strategy for global characterization of native protein assemblies based on chromatographic fractionation of biological extracts coupled to precision mass spectrometry analysis (chromatographic fractionation–mass spectrometry, CF–MS), but the resulting data are challenging to process and interpret. Here, we describe EPIC (elution profile-based inference of complexes), a software toolkit for automated scoring of large-scale CF–MS data to define high-confidence multi-component macromolecules from diverse biological specimens. As a case study, we used EPIC to map the global interactome of Caenorhabditis elegans, defining 612 putative worm protein complexes linked to diverse biological processes. These included novel subunits and assemblies unique to nematodes that we validated using orthogonal methods. The open source EPIC software is freely available as a Jupyter notebook packaged in a Docker container (https://hub.docker.com/r/baderlab/bio-epic/).

Fig. 1: EPIC workflow.
Fig. 2: EPIC parameter evaluation.
Fig. 3: Prediction, benchmarking and analysis of C. elegans protein complexes.

Data availability

The supporting co-fractionation data are available via ProteomeXchange with the identifier PXD011182. The entire WormMap network (Cytoscape format) is available on GitHub (https://github.com/BaderLab/EPIC/tree/master/WormMap) and has been submitted to the BioGRID database. Source Data for Fig. 2 are available online.

Code availability

EPIC is available via a Docker container (https://hub.docker.com/r/baderlab/bio-epic/). The EPIC software code is publicly available on GitHub (https://github.com/BaderLab/EPIC).

Acknowledgements

This study was supported by a Foundation Grant (FDN no. 148399) from the Canadian Institutes of Health Research (CIHR, to A.E.) and by US National Institutes of Health grants (nos. P41 GM103504 and GM070743, to G.D.B.). L.Z.M.H. was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Mass Spectrometry-Enabled Science and Engineering (MS-ESE) program. C. elegans strain access was supported by the NIH Office of Research Infrastructure Programs (P40 OD010440).

Author information

A.E. and G.D.B. conceived the project. L.Z.M.H. and F.G. wrote the software, performed computational analysis and wrote the manuscript. L.Z.M.H. and C.W. performed the co-fractionation experiments. J.H.T. performed the protein GFP tagging in C. elegans with assistance and guidance from M.S. and A.G.F. The AP–MS experiments were performed by E.W. with assistance from U.K. S.P. and C.X. provided technical support. G.D.B. and A.E. supervised the study and edited the paper.

Correspondence to Gary D. Bader or Andrew Emili.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Allison Doerr was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Pre-enrichment improves the dynamic range of CF/MS studies.

a) Schematic workflow of bead-based sample pre-enrichment. b) Venn diagram showing improved proteome coverage by pre-enrichment. c) Bar chart showing improved detection of low abundance proteins. d) Bar chart showing improved detection of small (low molecular mass) proteins. e) Bar chart showing the distribution of identified proteins across top 8 biological processes in GO. f) Bar chart showing the distribution of identified proteins across top 13 cellular localizations in GO. g) Bar chart showing distribution of identified proteins across top 13 molecular functions in GO.

Supplementary Figure 2 Schematic workflow for generating a training set of macromolecules.

Previously reported protein complexes, collected from the CORUM, GO and IntAct curation databases, are first mapped to the target species based on InParanoid orthology predictions. Redundancy is then minimized to generate a final set of reference assemblies.
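The mapping step described in this caption can be illustrated with a minimal Python sketch. It is not the EPIC implementation: the function names, the one-to-one ortholog table, the minimum complex size of two, and the rule of dropping complexes that are subsets of a larger one are all assumptions made for illustration, and the protein identifiers are hypothetical.

```python
def map_complexes(reference_complexes, orthologs, min_size=2):
    """Translate reference complexes into target-species proteins.

    `orthologs` is assumed to be a one-to-one mapping from reference
    protein IDs to target-species protein IDs (a simplification).
    """
    mapped = []
    for members in reference_complexes:
        target = frozenset(orthologs[m] for m in members if m in orthologs)
        # Assumption: a complex needs at least `min_size` mapped members.
        if len(target) >= min_size:
            mapped.append(target)
    return mapped


def drop_redundant(complexes):
    """Remove duplicates and complexes that are strict subsets of another.

    This is only one simple way to reduce redundancy, chosen for the sketch.
    """
    unique = set(complexes)
    kept = [c for c in unique if not any(c < other for other in unique)]
    return sorted(sorted(c) for c in kept)


# Hypothetical identifiers: H* are reference proteins, W* are worm orthologs.
reference = [{"H1", "H2", "H3"}, {"H1", "H2"}, {"H4", "H5"}, {"H6", "H7"}]
ortholog_table = {"H1": "W1", "H2": "W2", "H3": "W3", "H4": "W4"}

training_set = drop_redundant(map_complexes(reference, ortholog_table))
print(training_set)
```

In this toy input, the second complex is absorbed by the first after mapping, and the last two are discarded because fewer than two of their members have an ortholog.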

Supplementary Figure 3 Co-elution profile similarity predicts PPIs.

Plots showing the Pearson correlation coefficients (distribution density curves) obtained for a representative worm protein co-fractionation experiment; positive (CORUM derived; blue) and negative (randomized; orange) co-complex interactions, as well as the positive/negative ratio (green), are shown.
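As a minimal illustration of the similarity measure named in this caption, the following Python sketch computes the Pearson correlation coefficient between two elution profiles. The profiles are invented spectral counts across six fractions, and returning 0.0 for a flat profile is a convention assumed here, not something stated in the source.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of fractions")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat profile carries no co-elution signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical spectral counts across six chromatographic fractions.
profile_a = [0, 2, 10, 4, 0, 0]
profile_b = [0, 1, 8, 5, 0, 0]   # peaks in the same fractions as profile_a
profile_c = [6, 3, 0, 0, 1, 7]   # elutes in different fractions

print(pearson(profile_a, profile_b))  # high: the two proteins co-elute
print(pearson(profile_a, profile_c))  # negative: the profiles do not overlap
```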

Supplementary Figure 4 Correlation score cut-off setting.

Histogram of maximal correlation scores of positive protein-protein interaction pairs among all seven different correlation metrics across all 16 co-fractionation experiments. The red line indicates the cutoff chosen for EPIC.

Supplementary Figure 5 Composite score comparison for original and optimized features integrated with different sources of functional evidence.

Composite score analysis demonstrates that, for predicting complexes from EPIC analysis of CF/MS data, integration of functional associations from WormNet outperforms integration of STRING or GeneMANIA evidence. The analysis also shows that an optimized set of EPIC-derived co-elution scores predicts protein complex membership better than the set of scores reported previously.

Supplementary Figure 6 ROC curve and Precision-recall curve for co-complex PPI prediction from different input data.

The plots demonstrate that the best co-complex interaction predictions were obtained after integrating experimental data with supporting functional evidence (that is, WormNet).

Supplementary Figure 7

Pie chart showing overlap of predicted co-complex interactions with PPIs from BioGRID, iRefIndex and our previously reported conserved metazoan complex map.

Supplementary Figure 8

Detailed overview of the EPIC computational pipeline.

Supplementary Figure 9 Comparison of peptides identified using different search tools.

a) Number of peptides identified by each search tool in each co-fractionation experiment, before and after removing ‘one-hit-wonders’. There are 16 co-fractionation experiments (n = 16). b) Percentage of one-hit-wonders for each search engine. There are 16 co-fractionation experiments (n = 16). In each box plot, the red line is the median, and the lower and upper lines of the box indicate the first and third quartiles. The upper and lower whiskers extend to the largest value less than the third quartile plus 1.5 times the interquartile range (IQR) and the smallest value greater than the first quartile minus 1.5 times the IQR, respectively. All data points beyond the whiskers are plotted as individual points.
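The box-plot convention described in this caption can be made concrete with a short Python sketch. The data values and the `box_stats` helper are hypothetical, and the quartile interpolation method (`inclusive`) is an assumption, since the caption does not say which quartile definition was used.

```python
import statistics


def box_stats(values):
    """Median, quartiles, whisker ends and outliers for one box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers end at the most extreme data points inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Points beyond the whiskers are drawn individually.
        "outliers": sorted(v for v in values if v < lower_fence or v > upper_fence),
    }


# Hypothetical values, for example percentages from ten experiments.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```

Here the value 50 lies above the upper fence (third quartile plus 1.5 times the IQR), so it is reported as an individual point and the upper whisker stops at 9.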

Supplementary Figure 10 Comparison of the number of Poisson noise iterations.

Precision-recall (PR) curves (a) and receiver-operating-characteristic (ROC) curves (b) for different numbers of Poisson noise iterations used in the Pearson correlation coefficient feature.
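One plausible reading of a noise-added Pearson feature is sketched below in Python: each count is replaced by a Poisson draw, the correlation is computed, and the result is averaged over a chosen number of iterations. The exact noise model used by EPIC is not given in this section, so the use of each count as the Poisson mean, the function names, the seed handling and the example profiles are all assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile is flat (assumption)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def poisson_draw(lam, rng):
    """Draw from a Poisson distribution with mean `lam` (Knuth's method)."""
    if lam <= 0:
        return 0
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def noisy_pearson(x, y, iterations=10, seed=0):
    """Average Pearson correlation over Poisson-resampled profiles."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        noisy_x = [poisson_draw(v, rng) for v in x]
        noisy_y = [poisson_draw(v, rng) for v in y]
        total += pearson(noisy_x, noisy_y)
    return total / iterations


# Hypothetical spectral counts across six fractions.
profile_a = [0, 2, 10, 4, 0, 0]
profile_b = [0, 1, 8, 5, 0, 0]
profile_c = [6, 3, 0, 0, 1, 7]

print(noisy_pearson(profile_a, profile_b, iterations=50))
print(noisy_pearson(profile_a, profile_c, iterations=50))
```

More iterations reduce the sampling variance of the averaged score at a proportional cost in run time, which is the kind of trade-off a comparison of iteration counts would examine.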

Supplementary Figure 11 Comparison of different Bayes correlation priors.

Precision-recall (PR) curves (a) and Receiver-operating-characteristic (ROC) curves (b) for different Bayes correlation priors: uniform (Bayes1), Dirichlet-marginalized (Bayes2) and zero count-motivated (Bayes3).

Supplementary Figure 12 Global optimization of EPIC parameters by nested cross-validation.

(a) Box plot showing the complex prediction performance (composite score) of two different machine-learning classifiers (random forest, n = 1014, versus support vector machine, n = 945). (b) Box plot showing the complex prediction performance (composite score) based on the 234 results from each of the four different protein search/quantification tools. (c) Box plot showing the relationship between the number of correlation scores used and complex prediction performance (that is, composite score); n = 28, 110, 224, 280, 224, 112, 32 and 4 are the numbers of composite score results obtained with 1 to 8 correlation scores, respectively. The red arrow indicates the set of (five) correlation scores producing the highest composite score. In each box plot, the red line is the median, and the lower and upper lines of the box indicate the first and third quartiles. The upper and lower whiskers extend to the largest value less than the third quartile plus 1.5 times the interquartile range (IQR) and the smallest value greater than the first quartile minus 1.5 times the IQR, respectively. All data points beyond the whiskers are plotted as individual points.

Supplementary Figure 13 Exploring the value of additional experiments.

(a). Line plot of the number of experiments and corresponding averaged composite score. (b). Line plot of the number of experiments and the corresponding averaged value of composite score times the number of predicted protein complexes.

Supplementary information

Supplementary Information

Supplementary Figs. 1–13 and Supplementary Tables 1, 8 and 9

Reporting Summary

Supplementary Table 2

Complete list of predicted worm PPIs.

Supplementary Table 3

Complete list of predicted worm protein complexes in WormMap.

Supplementary Table 4

Results of AP–MS validation experiments.

Supplementary Table 5

Functional (GO term) enrichment on assemblies in WormMap.

Supplementary Table 6

Phenotypic enrichment analysis of complexes in WormMap.

Supplementary Table 7

Disease enrichment for human orthologs of worm protein macromolecules in WormMap.

Source data
