EPIC: software toolkit for elution profile-based inference of protein complexes

Abstract

Protein complexes are key macromolecular machines of the cell, but their description remains incomplete. We and others previously reported an experimental strategy for global characterization of native protein assemblies based on chromatographic fractionation of biological extracts coupled to precision mass spectrometry analysis (chromatographic fractionation–mass spectrometry, CF–MS), but the resulting data are challenging to process and interpret. Here, we describe EPIC (elution profile-based inference of complexes), a software toolkit for automated scoring of large-scale CF–MS data to define high-confidence multi-component macromolecules from diverse biological specimens. As a case study, we used EPIC to map the global interactome of Caenorhabditis elegans, defining 612 putative worm protein complexes linked to diverse biological processes. These included novel subunits and assemblies unique to nematodes that we validated using orthogonal methods. The open source EPIC software is freely available as a Jupyter notebook packaged in a Docker container (https://hub.docker.com/r/baderlab/bio-epic/).

Fig. 1: EPIC workflow.
Fig. 2: EPIC parameter evaluation.
Fig. 3: Prediction, benchmarking and analysis of C. elegans protein complexes.

Data availability

The supporting co-fractionation data are available via ProteomeXchange with the identifier PXD011182. The entire WormMap network (Cytoscape format) is available on GitHub (https://github.com/BaderLab/EPIC/tree/master/WormMap) and has been submitted to the BioGRID database. Source Data for Fig. 2 are available online.

Code availability

EPIC is available via a Docker container (https://hub.docker.com/r/baderlab/bio-epic/). The EPIC software code is publicly available on GitHub (https://github.com/BaderLab/EPIC).

Acknowledgements

This study was supported by a Foundation Grant (FDN no. 148399) from the Canadian Institutes of Health Research (CIHR, to A.E.), and US National Institutes of Health grants (nos. P41 GM103504, GM070743 to G.D.B.). L.Z.M.H. was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Mass Spectrometry-Enabled Science and Engineering (MS-ESE) program. C. elegans strain access was supported by the NIH Office of Research Infrastructure Programs (P40 OD010440).

Author information

Contributions

A.E. and G.D.B. conceived the project. L.Z.M.H. and F.G. wrote the software, performed computational analysis and wrote the manuscript. L.Z.M.H. and C.W. performed the co-fractionation experiments. J.H.T. performed the protein GFP tagging in C. elegans with assistance and guidance from M.S. and A.G.F. The AP–MS experiments were performed by E.W. with assistance from U.K. S.P. and C.X. provided technical support. G.D.B. and A.E. supervised the study and edited the paper.

Corresponding authors

Correspondence to Gary D. Bader or Andrew Emili.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Allison Doerr was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Pre-enrichment improves the dynamic range of CF/MS studies.

a) Schematic workflow of bead-based sample pre-enrichment. b) Venn diagram showing improved proteome coverage by pre-enrichment. c) Bar chart showing improved detection of low abundance proteins. d) Bar chart showing improved detection of small (low molecular mass) proteins. e) Bar chart showing the distribution of identified proteins across top 8 biological processes in GO. f) Bar chart showing the distribution of identified proteins across top 13 cellular localizations in GO. g) Bar chart showing distribution of identified proteins across top 13 molecular functions in GO.

Supplementary Figure 2 Schematic workflow for generating a training set of macromolecules.

Previously reported protein complexes, collected from the CORUM, GO and IntAct curation databases, are first mapped to protein complexes of the target species based on InParanoid orthology predictions. Redundancy is then minimized to generate a final set of reference assemblies.
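The mapping step in this workflow can be illustrated with a minimal Python sketch. The caption does not say how redundancy is minimized, so this sketch assumes the simplest rule: exact duplicates and strict subsets of another mapped complex are dropped. The function name `map_complexes`, the one-to-one `orthologs` dictionary and the `min_size` parameter are hypothetical, not part of EPIC.

```python
from typing import Dict, FrozenSet, Iterable, List, Set


def map_complexes(
    reference: Iterable[Set[str]],
    orthologs: Dict[str, str],
    min_size: int = 2,
) -> List[FrozenSet[str]]:
    """Map reference complexes to a target species and remove redundancy.

    Assumption: `orthologs` is a one-to-one source -> target mapping
    (real orthology predictions can be many-to-many).
    """
    mapped = set()
    for members in reference:
        # Keep only subunits that have a predicted ortholog in the target species.
        target = frozenset(orthologs[m] for m in members if m in orthologs)
        if len(target) >= min_size:
            mapped.add(target)  # the set drops exact duplicates

    # Assumed redundancy rule: drop complexes fully contained in a larger one.
    nonredundant = [c for c in mapped if not any(c < other for other in mapped)]
    return sorted(nonredundant, key=lambda c: sorted(c))


if __name__ == "__main__":
    reference = [{"H1", "H2", "H3"}, {"H1", "H2"}, {"H4", "H9"}, {"H5", "H6"}]
    orthologs = {"H1": "W1", "H2": "W2", "H3": "W3",
                 "H4": "W4", "H5": "W5", "H6": "W6"}
    for complex_ in map_complexes(reference, orthologs):
        print(sorted(complex_))
```

A merge based on an overlap threshold would be a reasonable alternative to subset removal; the choice here is only for brevity.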

Supplementary Figure 3 Co-elution profile similarity predicts PPIs.

Plots showing the Pearson correlation coefficients (distribution density curves) obtained for a representative worm protein co-fractionation experiment; positive (CORUM derived; blue) and negative (randomized; orange) co-complex interactions, as well as the positive/negative ratio (green), are shown.
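As a rough illustration of the similarity measure named in this caption, the sketch below computes the Pearson correlation coefficient between two elution profiles, each given as a list of per-fraction abundances. The example profiles and the function name `pearson` are made up for illustration and are not taken from the EPIC code or data.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length elution profiles."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-elution signal
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical spectral counts across six fractions.
    protein_a = [0, 2, 10, 4, 0, 0]
    protein_b = [0, 1, 8, 5, 0, 0]   # peaks in the same fraction as A
    protein_c = [0, 0, 0, 1, 9, 3]   # peaks in a later fraction
    print(pearson(protein_a, protein_b))
    print(pearson(protein_a, protein_c))
```

Proteins that peak in the same fractions score close to 1, whereas proteins that elute apart score near or below 0, which is the separation between positive and negative pairs that the density curves summarize.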

Supplementary Figure 4 Correlation score cut-off setting.

Histogram of the maximal correlation scores of positive protein–protein interaction pairs, taken over all seven correlation metrics and all 16 co-fractionation experiments. The red line indicates the cut-off chosen for EPIC.
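The cut-off logic described in this caption can be sketched as follows: for each protein pair, take the highest score over every metric and experiment, then keep the pair only if that maximum reaches the cut-off. The caption does not state the cut-off value, so the `0.5` used below is a placeholder, and `filter_by_max_score` is a hypothetical name rather than an EPIC function.

```python
from typing import Dict, Sequence, Tuple

Pair = Tuple[str, str]


def filter_by_max_score(
    scores: Dict[Pair, Sequence[float]],
    cutoff: float,
) -> Dict[Pair, float]:
    """Keep pairs whose best score over all metrics/experiments is >= cutoff."""
    kept = {}
    for pair, values in scores.items():
        best = max(values)  # maximal score for this pair
        if best >= cutoff:
            kept[pair] = best
    return kept


if __name__ == "__main__":
    # Hypothetical scores: one value per (metric, experiment) combination.
    scores = {
        ("A", "B"): [0.1, 0.9, 0.3],
        ("A", "C"): [0.2, 0.1, 0.4],
    }
    print(filter_by_max_score(scores, cutoff=0.5))  # placeholder cut-off
```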

Supplementary Figure 5 Composite score comparison for original and optimized features integrated with different sources of functional evidence.

Composite score analysis demonstrates that, for predicting complexes from EPIC analysis of CF/MS data, integrating functional associations from WormNet outperforms integrating STRING or GeneMANIA evidence. The analysis also shows that an optimized set of EPIC-derived co-elution scores predicts protein complex membership better than the scores reported previously.

Supplementary Figure 6 ROC curve and Precision-recall curve for co-complex PPI prediction from different input data.

The plot demonstrates that the best co-complex interaction predictions were obtained after integrating experimental data with supporting functional evidence (that is, WormNet).
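For readers unfamiliar with how such curves are built, the following sketch derives the points of a precision-recall curve and a ROC curve from classifier scores and known labels by sweeping the decision threshold from the highest score downwards. It is a generic textbook construction, not the EPIC implementation; the function name `pr_roc_points` is hypothetical, and tied scores are not given special treatment.

```python
from typing import List, Sequence, Tuple


def pr_roc_points(
    scores: Sequence[float],
    labels: Sequence[int],
) -> List[Tuple[float, float, float]]:
    """Return (precision, recall, false_positive_rate) at each rank cut-off.

    `labels` holds 1 for true co-complex pairs and 0 for negatives.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative pair")

    # Rank pairs from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda item: -item[0])
    tp = fp = 0
    points = []
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / positives          # also the true-positive rate
        fpr = fp / negatives
        points.append((precision, recall, fpr))
    return points


if __name__ == "__main__":
    for point in pr_roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]):
        print(point)
```

Plotting precision against recall gives the PR curve, and recall against the false-positive rate gives the ROC curve.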

Supplementary Figure 7

Pie chart showing overlap of predicted co-complex interactions with PPIs from BioGRID, iRefIndex and our previously reported conserved metazoan complex map.

Supplementary Figure 8

Detailed overview of the EPIC computational pipeline.

Supplementary Figure 9 Comparison of peptides identified using different search tools.

a) Number of peptides identified by each search tool in each co-fractionation experiment, before and after removing ‘one-hit-wonders’ (n = 16 co-fractionation experiments). b) Percentage of one-hit-wonders for each search engine (n = 16 co-fractionation experiments). In each box plot, the red line is the median, and the lower and upper edges of the box indicate the first and third quartiles. The upper and lower whiskers extend to the largest value less than the third quartile plus 1.5 times the interquartile range (IQR) and the smallest value greater than the first quartile minus 1.5 times the IQR, respectively. All data points beyond the whiskers are plotted as individual points.
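The box plot convention described in this caption can be made concrete with a short sketch that computes the quartiles, the 1.5 × IQR fences, the whisker ends and the outlying points for a list of values. The function name `tukey_box_stats` and the example data are illustrative only. Quartiles are computed by linear interpolation (`method="inclusive"`), which is an assumption, since plotting libraries differ slightly in how they define quartiles, and values exactly on a fence are treated as inside it.

```python
import statistics
from typing import Dict, List, Sequence, Union


def tukey_box_stats(values: Sequence[float]) -> Dict[str, Union[float, List[float]]]:
    """Quartiles, whisker ends and outliers under the 1.5 x IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr

    # Whiskers stop at the most extreme data points still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


if __name__ == "__main__":
    print(tukey_box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```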

Supplementary Figure 10 Comparison of the number of Poisson noise iterations.

Precision-recall (PR) curves (a) and receiver operating characteristic (ROC) curves (b) for different numbers of Poisson noise iterations used in the Pearson correlation coefficient feature.
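One way to picture a noise-averaged Pearson score is sketched below. The caption does not specify the noise model, so this sketch assumes that, in each iteration, every fraction's count is redrawn from a Poisson distribution whose rate is the observed count plus a small pseudocount, and that the resulting Pearson coefficients are averaged. The names `noisy_pearson` and `poisson_sample`, the `pseudocount` parameter and the example profiles are all hypothetical.

```python
import math
import random
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either profile is flat."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


def poisson_sample(rate: float, rng: random.Random) -> int:
    """Draw from Poisson(rate) with Knuth's method (suitable for small rates)."""
    if rate <= 0:
        return 0
    threshold = math.exp(-rate)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return k
        k += 1


def noisy_pearson(
    x: Sequence[int],
    y: Sequence[int],
    iterations: int = 10,
    pseudocount: float = 0.5,   # assumed value, not from the source
    seed: int = 0,
) -> float:
    """Average Pearson correlation over Poisson-resampled copies of x and y."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        noisy_x = [poisson_sample(c + pseudocount, rng) for c in x]
        noisy_y = [poisson_sample(c + pseudocount, rng) for c in y]
        total += pearson(noisy_x, noisy_y)
    return total / iterations


if __name__ == "__main__":
    a = [0, 2, 10, 4, 0, 0]
    b = [0, 1, 8, 5, 0, 0]
    for n_iter in (1, 10, 100):
        print(n_iter, noisy_pearson(a, b, iterations=n_iter))
```

More iterations reduce the variance of the averaged score at a proportional cost in run time, which is the trade-off that a comparison across iteration counts explores.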

Supplementary Figure 11 Comparison of different Bayes correlation priors.

Precision-recall (PR) curves (a) and Receiver-operating-characteristic (ROC) curves (b) for different Bayes correlation priors: uniform (Bayes1), Dirichlet-marginalized (Bayes2) and zero count-motivated (Bayes3).

Supplementary Figure 12 Global optimization of EPIC parameters by nested cross-validation.

a) Box plot showing the complex prediction performance (composite score) of two different machine-learning classifiers (random forest, n = 1014; support vector machine, n = 945). b) Box plot showing the complex prediction performance (composite score) based on the 234 results from each of four different protein search/quantification tools. c) Box plot showing the relationship between the number of correlation scores used and complex prediction performance (that is, composite score); n = 28, 110, 224, 280, 224, 112, 32 and 4 composite score results were obtained with 1 to 8 correlation scores, respectively. The red arrow indicates the set of five correlation scores producing the highest composite score. In each box plot, the red line is the median, and the lower and upper edges of the box indicate the first and third quartiles. The upper and lower whiskers extend to the largest value less than the third quartile plus 1.5 times the interquartile range (IQR) and the smallest value greater than the first quartile minus 1.5 times the IQR, respectively. All data points beyond the whiskers are plotted as individual points.

Supplementary Figure 13 Exploring the value of additional experiments.

a) Line plot of the averaged composite score against the number of experiments. b) Line plot of the averaged product of the composite score and the number of predicted protein complexes against the number of experiments.

Supplementary information

Supplementary Information

Supplementary Figs. 1–13 and Supplementary Tables 1, 8 and 9

Reporting Summary

Supplementary Table 2

Complete list of predicted worm PPIs.

Supplementary Table 3

Complete list of predicted worm protein complexes in WormMap.

Supplementary Table 4

Results of AP–MS validation experiments.

Supplementary Table 5

Functional (GO term) enrichment on assemblies in WormMap.

Supplementary Table 6

Phenotypic enrichment analysis of complexes in WormMap.

Supplementary Table 7

Disease enrichment for human orthologs of worm protein macromolecules in WormMap.

Cite this article

Hu, L.Z., Goebels, F., Tan, J.H. et al. EPIC: software toolkit for elution profile-based inference of protein complexes. Nat Methods 16, 737–742 (2019). https://doi.org/10.1038/s41592-019-0461-4
