Co-regulation map of the human proteome enables identification of protein functions

Kustatscher, Georg; Grabowski, Piotr; Schrader, Tina A.; Passmore, Josiah B.; Schrader, Michael; Rappsilber, Juri

doi:10.1038/s41587-019-0298-5

Resource
Published: 04 November 2019

Co-regulation map of the human proteome enables identification of protein functions

Nature Biotechnology volume 37, pages 1361–1371 (2019)Cite this article

14k Accesses
88 Citations
158 Altmetric
Metrics details

Subjects

Abstract

Assigning functions to the vast array of proteins present in eukaryotic cells remains challenging. To identify relationships between proteins, and thereby enable functional annotation of proteins, we determined changes in abundance of 10,323 human proteins in response to 294 biological perturbations using isotope-labeling mass spectrometry. We applied the machine learning algorithm treeClust to reveal functional associations between co-regulated human proteins from ProteomeHD, a compilation of our own data and datasets from the Proteomics Identifications database. This produced a co-regulation map of the human proteome. Co-regulation was able to capture relationships between proteins that do not physically interact or colocalize. For example, co-regulation of the peroxisomal membrane protein PEX11β with mitochondrial respiration factors led us to discover an organelle interface between peroxisomes and mitochondria in mammalian cells. We also predicted the functions of microproteins that are difficult to study with traditional methods. The co-regulation map can be explored at www.proteomeHD.net.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on Springer Link
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

**Fig. 1: Co-regulation map shows associations between human proteins.**

**Fig. 2: treeClust improves co-regulation analysis through robust selectivity for close linear relationships.**

**Fig. 3: Protein co-regulation predicts functions of unknown proteins.**

**Fig. 4: PEX11β and peroxisomal membrane interactions with mitochondria.**

**Fig. 5: Protein co-regulation enables higher precision but lower coverage than mRNA coexpression.**

Systematic detection of functional proteoform groups from bottom-up proteomic datasets

Article Open access 21 June 2021

Mapping protein states and interactions across the tree of life with co-fractionation mass spectrometry

Article Open access 15 December 2023

Meta-analysis defines principles for the design and analysis of co-fractionation mass spectrometry experiments

Article 01 July 2021

Data availability

All mass spectrometry raw files generated in-house have been deposited in the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository³⁶ with the dataset identifier PXD008888. The co-regulation map is hosted on our website at www.proteomeHD.net, and pair-wise co-regulation scores are available through STRING (https://string-db.org). A network of the top 0.5% co-regulated protein pairs can be explored interactively on NDEx (https://doi.org/10.18119/N9N30Q).

Code availability

Data analysis was performed in R 3.5.1. R scripts and input files required to reproduce the results of this manuscript are available in the following GitHub repository: https://github.com/Rappsilber-Laboratory/ProteomeHD. R scripts related specifically to the benchmarking of the treeClust algorithm using synthetic data are available in the following GitHub repository: https://github.com/Rappsilber-Laboratory/treeClust-benchmarking. The R package data.table was used for fast data processing. Figures were prepared using ggplot2, gridExtra, cowplot and viridis.

References

Havugimana, P. C. et al. A census of human soluble protein complexes. Cell 150, 1068–1081 (2012).
Article CAS PubMed PubMed Central Google Scholar
Huttlin, E. L. et al. Architecture of the human interactome defines protein communities and disease networks. Nature 545, 505–509 (2017).
Article CAS PubMed PubMed Central Google Scholar
Rolland, T. et al. A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014).
Article CAS PubMed PubMed Central Google Scholar
Foster, L. J. et al. A mammalian organelle map by protein correlation profiling. Cell 125, 187–199 (2006).
Article CAS PubMed Google Scholar
Christoforou, A. et al. A draft map of the mouse pluripotent stem cell spatial proteome. Nat. Commun. 7, 8992 (2016).
Article CAS PubMed Google Scholar
Thul, P. J. et al. A subcellular map of the human proteome. Science 356, 6340 (2017).
Article CAS Google Scholar
Costanzo, M. et al. A global genetic interaction network maps a wiring diagram of cellular function. Science 353, pii: aaf1420 (2016).
Article CAS Google Scholar
Mülleder, M. et al. Functional metabolomics describes the yeast biosynthetic regulome. Cell 167, 553–565.e12 (2016).
Article CAS PubMed PubMed Central Google Scholar
Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470 (1995).
Article CAS PubMed Google Scholar
DeRisi, J. L., Iyer, V. R. & Brown, P. O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680–686 (1997).
Article CAS PubMed Google Scholar
Hughes, T. R. et al. Functional discovery via a compendium of expression profiles. Cell 102, 109–126 (2000).
Article CAS PubMed Google Scholar
Stuart, J. M., Segal, E., Koller, D. & Kim, S. K. A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249–255 (2003).
Article CAS PubMed Google Scholar
Wang, J. et al. Proteome profiling outperforms transcriptome profiling for coexpression based gene function prediction. Mol. Cell. Proteomics 16, 121–134 (2017).
Article CAS PubMed Google Scholar
Lapek, J. D. Jr et al. Detection of dysregulated protein-association networks by high-throughput proteomics predicts cancer vulnerabilities. Nat. Biotechnol. 35, 983–989 (2017).
Article CAS PubMed PubMed Central Google Scholar
Liu, Y., Beyer, A. & Aebersold, R. On the dependency of cellular protein levels on mRNA abundance. Cell 165, 535–550 (2016).
Article CAS PubMed Google Scholar
Fortelny, N., Overall, C. M., Pavlidis, P. & Freue, G. V. C. Can we predict protein from mRNA levels? Nature 547, E19–E20 (2017).
Article CAS PubMed Google Scholar
Batada, N. N., Urrutia, A. O. & Hurst, L. D. Chromatin remodelling is a major source of coexpression of linked genes in yeast. Trends Genet. 23, 480–484 (2007).
Article CAS PubMed Google Scholar
Kustatscher, G., Grabowski, P. & Rappsilber, J. Pervasive coexpression of spatially proximal genes is buffered at the protein level. Mol. Syst. Biol. 13, 937 (2017).
Article CAS PubMed PubMed Central Google Scholar
Raj, A., Peskin, C. S., Tranchina, D., Vargas, D. Y. & Tyagi, S. Stochastic mRNA synthesis in mammalian cells. PLoS Biol. 4, e309 (2006).
Article CAS PubMed PubMed Central Google Scholar
Ebisuya, M., Yamamoto, T., Nakajima, M. & Nishida, E. Ripples from neighbouring transcription. Nat. Cell Biol. 10, 1106–1113 (2008).
Article CAS PubMed Google Scholar
Battle, A. et al. Genomic variation. Impact of regulatory variation from RNA to protein. Science 347, 664–667 (2015).
Article CAS PubMed Google Scholar
Geiger, T., Cox, J. & Mann, M. Proteomic changes resulting from gene copy number variations in cancer cells. PLoS Genet. 6, e1001090 (2010).
Article CAS PubMed PubMed Central Google Scholar
Stingele, S. et al. Global analysis of genome, transcriptome and proteome reveals the response to aneuploidy in human cells. Mol. Syst. Biol. 8, 608 (2012).
Article CAS PubMed PubMed Central Google Scholar
Wu, L. et al. Variation and genetic control of protein abundance in humans. Nature 499, 79–82 (2013).
Article CAS PubMed PubMed Central Google Scholar
Kustatscher, G. et al. Proteomics of a fuzzy organelle: interphase chromatin. EMBO J. 33, 648–664 (2014).
Article CAS PubMed PubMed Central Google Scholar
Wu, Y. et al. Multilayered genetic and omics dissection of mitochondrial activity in a mouse reference population. Cell 158, 1415–1430 (2014).
Article CAS PubMed PubMed Central Google Scholar
Kustatscher, G., Grabowski, P. & Rappsilber, J. Multiclassifier combinatorial proteomics of organelle shadows at the example of mitochondria in chromatin data. Proteomics 16, 393–401 (2016).
Article CAS PubMed PubMed Central Google Scholar
Williams, E. G. et al. Systems proteomics of liver mitochondria function. Science 352, aad0189 (2016).
Article CAS PubMed Google Scholar
Gupta, S., Turan, D., Tavernier, J. & Martens, L. The online Tabloid Proteome: an annotated database of protein associations. Nucleic Acids Res. 46, D581–D585 (2017).
Article CAS PubMed Central Google Scholar
Singh, S. A. et al. Co-regulation proteomics reveals substrates and mechanisms of APC/C-dependent degradation. EMBO J. 33, 385–399 (2014).
Article CAS PubMed PubMed Central Google Scholar
Andersen, J. S. et al. Proteomic characterization of the human centrosome by protein correlation profiling. Nature 426, 570–574 (2003).
Article CAS PubMed Google Scholar
Kirchner, M. et al. Computational protein profile similarity screening for quantitative mass spectrometry experiments. Bioinformatics 26, 77–83 (2010).
Article CAS PubMed Google Scholar
Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014).
Article CAS PubMed Google Scholar
Uhlén, M. et al. Proteomics. Tissue-based map of the human proteome. Science 347, 1260419 (2015).
Article CAS PubMed Google Scholar
Ong, S.-E. et al. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell. Proteomics 1, 376–386 (2002).
Article CAS PubMed Google Scholar
Vizcaíno, J. A. et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 44, D447–D456 (2016).
Article CAS PubMed Google Scholar
Fabregat, A. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 44, D481–D487 (2016).
Article CAS PubMed Google Scholar
Buttrey, S. E. & Whitaker, L.R. treeClust: an R package for tree-based clustering dissimilarities. R J. 7, 227–236 (2015).
Article Google Scholar
Buttrey, S. E. & Whitaker, L. R. A scale-independent, noise-resistant dissimilarity for tree-based clustering of mixed data. NPS Technical Report Archive https://calhoun.nps.edu/handle/10945/48615 (2016).
Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N. & Barabási, A. L. Hierarchical organization of modularity in metabolic networks. Science 297, 1551–1555 (2002).
Article CAS PubMed Google Scholar
Yip, A. M. & Horvath, S. Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics 8, 22 (2007).
Article CAS PubMed PubMed Central Google Scholar
The Gene Ontology Consortium. Expansion of the gene ontology knowledgebase and resources. Nucleic Acids Res. 45, D331–D338 (2017).
Article CAS Google Scholar
Van Der Maaten, L. & Hinton, G. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 26 (2008).
Google Scholar
García-Aguilar, A. & Cuezva, J. M. A review of the inhibition of the mitochondrial ATP synthase by IF1 in vivo: reprogramming energy metabolism and inducing mitohormesis. Front. Physiol. 9, 1322 (2018).
Article PubMed PubMed Central Google Scholar
The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169 (2017).
Article CAS Google Scholar
Forbes, S. A. et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 45, D777–D783 (2017).
Article CAS PubMed Google Scholar
Piñero, J. et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 45, D833–D839 (2017).
Article CAS PubMed Google Scholar
Andrews, S. J. & Rothnagel, J. A. Emerging evidence for functional peptides encoded by short open reading frames. Nat. Rev. Genet. 15, 193–204 (2014).
Article CAS PubMed Google Scholar
D’Lima, N. G. et al. A human microprotein that interacts with the mRNA decapping complex. Nat. Chem. Biol. 13, 174–180 (2017).
Article CAS PubMed Google Scholar
Chu, Q. et al. Identification of microprotein–protein interactions via APEX tagging. Biochemistry 56, 3299–3306 (2017).
Article CAS PubMed Google Scholar
Slavoff, S. A. et al. Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nat. Chem. Biol. 9, 59–64 (2013).
Article CAS PubMed Google Scholar
Meyer, B., Wittig, I., Trifilieff, E., Karas, M. & Schägger, H. Identification of two proteins associated with mammalian ATP synthase. Mol. Cell. Proteomics 6, 1690–1699 (2007).
Article CAS PubMed Google Scholar
Chen, R., Runswick, M. J., Carroll, J., Fearnley, I. M. & Walker, J. E. Association of two proteolipids of unknown function with ATP synthase from bovine heart mitochondria. FEBS Lett. 581, 3145–3148 (2007).
Article CAS PubMed Google Scholar
Fujikawa, M., Ohsakaya, S., Sugawara, K. & Yoshida, M. Population of ATP synthase molecules in mitochondria is limited by available 6.8-kDa proteolipid protein (MLQ). Genes Cells 19, 153–160 (2014).
Article CAS PubMed Google Scholar
Borner, G. H. H. et al. Multivariate proteomic profiling identifies novel accessory proteins of coated vesicles. J. Cell Biol. 197, 141–160 (2012).
Article CAS PubMed PubMed Central Google Scholar
Signorile, A., Sgaramella, G., Bellomo, F. & De Rasmo, D. Prohibitins: a critical role in mitochondrial functions and implication in diseases. Cells 8, pii: E71 (2019).
Article CAS Google Scholar
Brennan, R. et al. Investigating nucleo-cytoplasmic shuttling of the human DEAD-box helicase DDX3. Eur. J. Cell Biol. 97, 501–511 (2018).
Article CAS PubMed Google Scholar
Szklarczyk, D. et al. STRINGv11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47, D607–D613 (2019).
Article CAS PubMed Google Scholar
Schrader, M., Costello, J. L., Godinho, L. F., Azadi, A. S. & Islinger, M. Proliferation and fission of peroxisomes—an update. Biochim. Biophys. Acta 1863, 971–983 (2016).
Article CAS PubMed Google Scholar
Schrader, M., Costello, J., Godinho, L. F. & Islinger, M. Peroxisome-mitochondria interplay and disease. J. Inherit. Metab. Dis. 38, 681–702 (2015).
Article CAS PubMed Google Scholar
Devine, M. J., Birsa, N. & Kittler, J. T. Miro sculpts mitochondrial dynamics in neuronal health and disease. Neurobiol. Dis. 90, 27–34 (2016).
Article CAS PubMed Google Scholar
Costello, J. L. et al. Predicting the targeting of tail-anchored proteins to subcellular compartments in mammalian cells. J. Cell Sci. 130, 1675–1687 (2017).
Article CAS PubMed PubMed Central Google Scholar
Okumoto, K. et al. New splicing variants of mitochondrial Rho GTPase-1 (Miro1) transport peroxisomes. J. Cell Biol. 217, 619–633 (2018).
Article CAS PubMed PubMed Central Google Scholar
Castro, I. G. et al. A role for mitochondrial Rho GTPase 1 (MIRO1) in motility and membrane dynamics of peroxisomes. Traffic 19, 229–242 (2018).
Article CAS PubMed PubMed Central Google Scholar
Rodríguez-Serrano, M., Romero-Puertas, M. C., Sanz-Fernández, M., Hu, J. & Sandalio, L. M. Peroxisomes extend peroxules in a fast response to stress via a reactive oxygen species-mediated induction of the peroxin PEX11a. Plant Physiol. 171, 1665–1674 (2016).
Article PubMed PubMed Central Google Scholar
Mattiazzi Ušaj, M. et al. Genome-wide localization study of yeast Pex11 identifies peroxisome–mitochondria interactions through the ERMES complex. J. Mol. Biol. 427, 2072–2087 (2015).
Article CAS PubMed PubMed Central Google Scholar
Shai, N. et al. Systematic mapping of contact sites reveals tethers and a function for the peroxisome–mitochondria contact. Nat. Commun. 9, 1761 (2018).
Article CAS PubMed PubMed Central Google Scholar
Pickrell, J. K. et al. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768–772 (2010).
Article CAS PubMed PubMed Central Google Scholar
Barrett, T. et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41, D991–D995 (2013).
Article CAS PubMed Google Scholar
Szklarczyk, D. et al. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 45, D362–D368 (2017).
Article CAS PubMed Google Scholar
Jovanovic, M. et al. Immunogenetics. Dynamic profiling of the protein life cycle in response to pathogens. Science 347, 1259038 (2015).
Article CAS PubMed PubMed Central Google Scholar
Kustatscher, G., Wills, K. L. H., Furlan, C. & Rappsilber, J. Chromatin enrichment for proteomics. Nat. Protoc. 9, 2090–2099 (2014).
Article CAS PubMed PubMed Central Google Scholar
Alabert, C. et al. Nascent chromatin capture proteomics determines chromatin dynamics during DNA replication and identifies unknown fork components. Nat. Cell Biol. 16, 281–293 (2014).
Article CAS PubMed PubMed Central Google Scholar
Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367–1372 (2008).
Article CAS PubMed Google Scholar
Zhang, B. & Horvath, S. A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. 4, 17 (2005).
Article Google Scholar
Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008).
Article CAS PubMed PubMed Central Google Scholar
Pratt, D. et al. NDEx, the network data exchange. Cell Syst. 1, 302–305 (2015).
Article CAS PubMed PubMed Central Google Scholar
Langfelder, P. & Horvath, S. Fast R. Functions for robust correlations and hierarchical clustering. J. Stat. Softw. 46, (2012).
Sing, T., Sander, O., Beerenwinkel, N. & Lengauer, T. ROCR: visualizing classifier performance in R. Bioinformatics 21, 3940–3941 (2005).
Article CAS PubMed Google Scholar
Krijthe, J. H. Rtsne: t-distributed stochastic neighbor embedding using Barnes–Hut implementation. https://github.com/jkrijthe/Rtsne (2015).
Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal Complex Systems, 1695 (2006).
Schloerke, B. et al. GGally: extension to ‘ggplot2’. https://github.com/ggobi/ggally (2018).
Butts, C. T. sna: tools for social network analysis. https://cran.r-project.org/web/packages/sna/index.html (2016).
Alexa, A. & Rahnenfuhrer, J. topGO: Enrichment analysis for gene ontology. R version 2.30.0, https://bioconductor.org/packages/release/bioc/html/topGO.html (2016).
Binns, D. et al. QuickGO: a web-based tool for gene ontology searching. Bioinformatics 25, 3045–3046 (2009).
Article CAS PubMed PubMed Central Google Scholar
Stark, C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 34, D535–D539 (2006).
Article CAS PubMed Google Scholar
Orchard, S. et al. The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 42, D358–D363 (2014).
Article CAS PubMed Google Scholar
Szklarczyk, D. et al. STRINGv10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447–D452 (2015).
Article CAS PubMed Google Scholar
Jaccard, P. Distribution de la flore alpine dans le bassin des dranses et dans quelques régions voisines. Bull. Soc. Vaud. Sci. Nat. 37, 241–272 (1901).
Google Scholar
Costello, J. L. et al. ACBD5 and VAPB mediate membrane associations between peroxisomes and the ER. J. Cell Biol. 216, 331–342 (2017).
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

We are grateful to D. Szklarczyk for providing the mRNA Pearson correlation data used by STRING and the STRING team for testing our coregulation data and adding them as a novel evidence type to STRING 11. We also thank K. Wills, K. Nakamura, C. Alabert and A. Groth for contributing chromatin enrichment experiments, and A.S. Azadi for support with live-cell imaging. This work was supported by the Wellcome Trust through a Senior Research Fellowship to J.R. (grant no. 103139) and by the Biotechnology and Biological Sciences Research Council (grant nos. BB/N01541X/1 and BB/R016844/1 to M.S.) and grant no. H2020-MSCA-ITN-2018 812968 PERICO (to M.S.). The Wellcome Centre for Cell Biology is supported by core funding from the Wellcome Trust (grant no. 203149).

Author information

Piotr Grabowski
Present address: Data Sciences and Artificial Intelligence, Clinical Pharmacology & Safety Sciences, R&D, AstraZeneca, Cambridge, UK
These authors contributed equally: Georg Kustatscher, Piotr Grabowski.

Authors and Affiliations

Wellcome Centre for Cell Biology, University of Edinburgh, Edinburgh, UK
Georg Kustatscher & Juri Rappsilber
Division of Bioanalytics, Institute of Biotechnology, Technische Universität Berlin, Berlin, Germany
Piotr Grabowski & Juri Rappsilber
Biosciences, University of Exeter, Exeter, UK
Tina A. Schrader, Josiah B. Passmore & Michael Schrader

Authors

Georg Kustatscher
View author publications
You can also search for this author in PubMed Google Scholar
Piotr Grabowski
View author publications
You can also search for this author in PubMed Google Scholar
Tina A. Schrader
View author publications
You can also search for this author in PubMed Google Scholar
Josiah B. Passmore
View author publications
You can also search for this author in PubMed Google Scholar
Michael Schrader
View author publications
You can also search for this author in PubMed Google Scholar
Juri Rappsilber
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

G.K. and J.R. conceived the project. G.K. and P.G. conducted the data analysis. P.G. created the web application. T.A.S., J.B.P. and M.S. conducted the Pex11β analysis. All authors contributed to writing the manuscript.

Corresponding author

Correspondence to Juri Rappsilber.

Ethics declarations

Competing interests

The authors declare no competing interests.

Integrated supplementary information

Supplementary Figure 1 Peptide and protein quantitation statistics in ProteomeHD.

(a) Histogram showing the number of peptides identified per protein in ProteomeHD (10,323 proteins, light blue) and in the subset of ProteomeHD used to make the co-regulation map (5,013 proteins, dark blue). Dashed lines show the average number of peptides per protein. (b) Number of peptides per protein broken down by experiment. The average peptide number for the proteins detected in each experiment is shown. (c) Average number of SILAC ratio counts (independent observations) per protein, broken down into the 294 input experiments. (d) Sequence coverage of proteins in ProteomeHD. Dashed lines indicate the average. (e) Average sequence coverage of proteins in each input experiment. (f) The number of proteins that were quantified in the 294 experiments of ProteomeHD ranges from 817 to 6,080. The average is 3,928 proteins per SILAC ratio. (g) Number of experiments, i.e. SILAC ratios, in which proteins were quantified. Only proteins that were quantified in at least 95 experiments were used for the co-regulation analysis. On average, proteins in ProteomeHD were quantified in 112 input experiments. The average rises to 190 if only proteins used for the co-regulation analysis are considered. (h) Bar chart showing which fraction of proteins have been detected in which fraction of experiments. For example, 100% of proteins in the co-regulation map have been quantified in at least 30% of the 294 experiments. About 15% of the proteins have been quantified in at least 90% of the experiments.

Supplementary Figure 2 Impact of co-occurrence on treeClust learning.

Performance comparison of a standard treeClust application on ProteomeHD with two types of co-occurrence measures. Jaccard scores are an established co-occurrence measure (protein pairs observed in the same set of experiments would get a Jaccard score of 1, while protein pairs without any overlapping experiments would get a score of 0). We also applied treeClust to a “binary” version of ProteomeHD, where all SILAC ratios were set to 1 and all missing values were set to 0. The precision recall curve uses Reactome as a gold standard. It shows that Jaccard and “binary treeClust” work equally well but both are outperformed by the standard co-regulation analysis. Therefore, while co-occurrence of proteins across ProteomeHD does provide some information about functional associations, quantitative up- and down regulation is a far better indicator of shared protein function, at least for ProteomeHD. Notably, this also shows that treeClust can detect co-occurrence, in principle, if the data are transformed into a binary format.

Supplementary Figure 3 treeClust detects specifically positive linear associations.

We tested which types of relationships treeClust detects by using a synthetic dataset consisting of 100 variables and 200 proteins, where 0.5% of all possible protein - protein combination have a defined relationship. (a) Precision - recall (PR) analyses show that treeClust separates linear from random relationships perfectly, resulting in an area under the PR curve (AUPRC) of 1. The same result is observed for the three tested correlation-based metrics: PCC, Spearman’s rho and biweight midcorrelation (bicor). The four PR curves overlap fully. (b) TreeClust completely fails to detect exponential or logistic relationships (AUPRC = 0). In contrast, although these pairs receive lower correlation coefficients than linear pairs, they still score high enough with PCC, rho and bicor to be completely separated from the pool of random associations. No metric detects quadratic relationships. (c) Anti-correlations are not identified well by treeClust.

Supplementary Figure 4 Impact of data size and missing values on treeClust performance.

We used synthetic data to assess the impact of various data characteristics on treeClust performance. This figure complements Fig. 2. (a) Synthetic datasets of 50 samples and 500 proteins were created with increasing percentage of defined linear relationships. This has no impact on the three correlation metrics (PCC, rho and bicor), so their curves overlap fully at AUPRC 1. Treeclust performance needs > 0.3% linear relationships in the data in order to detect them successfully. Synthetic datasets were created in triplicate. Points show the average area under the precision recall curve (AUPRC) obtained for each setting. Error bars show the standard error of the mean. (b) Combinatorial impact of the number of samples and the percentage of defined linear relationships (N proteins = 500). Note that for larger datasets lower percentages of “coexpressed” proteins can be detected. (c) TreeClust, but not the three correlation metrics, is also affected by the number of available observations (proteins). N samples = 20, 0.3% linear associations. (d, e) Adding missing values to a small (n = 50 samples, n = 500 proteins) and medium (n = 100 samples, n = 1,000 proteins) dataset, respectively, has a different impact on treeClust performance. (f) Combinatorial impact of missing values and the number of proteins, showing that for large datasets with many proteins a larger percentage of missing values can be tolerated (N samples = 150).

Supplementary Figure 5 Illustration of changing the goodness-of-fit and outlier occurrence in synthetic data.

This figure illustrates the different conditions tested in Fig. 2d, e. (a) Scatterplots illustrating the effect of increasing the difference between variables, which decreases treeClust performance but not that of correlation metrics. (b) Scatterplots illustrating the effect of adding outlier data points, which decreases treeClust performance less than that of the correlation metrics.

Supplementary Figure 6 Outliers in ProteomeHD and their impact on coexpression metrics.

(a) Co-regulated protein pairs in ProteomeHD were divided into those detected by treeClust but not by PCC and vice versa. Separate comparisons were made for pairs detected by treeClust but not rho, and treeClust but not bicor. The pairs in the resulting groups were annotated using Reactome into known, biologically relevant interactions (true positives) and pairs that were unlikely to have any biological associations (false positives). Note that treeClust-specific pairs tend to be true positives, whereas correlation-specific pairs tend to be false positives. (b) This panel complements Fig. 2f. Outliers were detected in ProteomeHD via their Mahalanobis distance, i.e. these outliers are located far from the bulk of the data, but can be close to the regression line. The boxplots show that Mahalanobis outliers are more frequent in protein pairs detected specifically by rho or bicor as opposed to pairs detected specifically by treeClust. The number of protein pairs shown corresponds to n for each group as indicated in (a). (c) Removing these Mahalanobis outliers has little impact on the PCC of treeClust-, rho- or bicor-specific protein pairs, in contrast to what was observed for Pearson’s correlation (see Fig. 2g). For number of proteins shown, see panel (a). (d) A second type of outlier - regression outliers - were detected in ProteomeHD via studentized residuals. These outliers are located far away from the regression line and will decrease correlation coefficients. An example of a true association is shown, where regression outliers affect the resulting correlation. Fold-changes have been scaled to lie between 0 and 1. (e) The percentage of regression outliers is very similar in all six groups. See panel (a) for number of proteins shown. (f) Removing regression outliers increases the correlation coefficient (PCC) of protein pairs that were previously detected only by treeClust, suggesting PCC missed some of these pairs because of regression outliers. This is not the case for pairs missed by rho or bicor. See panel (a) for number of proteins shown. For boxplots, lower and upper hinges correspond to the first and third quartiles, and lower and upper whiskers extend to the smallest or largest value no further than 1.5 * IQR (inter-quartile range) from the hinge, respectively. Notches give roughly a 95% confidence interval for comparing medians.

Supplementary Figure 7 Goodness-of-fit partially explains different performance of PCC and treeClust.

This figure complements Fig. 2i. Systematic comparison of mean absolute errors (MAEs) from protein pairs that scored high with either treeClust or with PCC (see Supplementary Fig. 6a; n = 8,786 treeClust-specific protein pairs, n = 9,593 PCC-specific protein pairs). Protein pairs exclusively detected by PCC tend to have somewhat higher MAEs, possibly explaining why they are predominantly false-positive hits, in addition to the impact of outliers.

Supplementary Figure 8 Lack of genuine non-linear relationships in ProteomeHD.

(a) Exponential and logistic (sigmoid) models were fitted to all protein pairs that scored high with treeClust or the three correlation metrics. Model fit was compared through their residual sum of squares (RSS). Exponential models only fitted better than linear ones in rare cases, but logistic models often did. Around half of the protein pairs detected specifically by PCC are better explained by a logistic than a linear model. However, this is mainly driven by Mahalanobis-type outliers. Removing those strongly reduces the number of logistic models outfitting the linear ones. (b) Two example regressions where an exponential (left) or logistic (right) model fits better than a linear one. Note that this clearly reflects overfitting due to outliers rather than genuine non-linear relationships.

Supplementary Figure 9 The protein co-regulation network satisfies scale-free topology but is difficult to visualize as an interaction network.

(a) The “scale free plot” produced by the WGCNA R package using the treeClust-derived adjacency matrix. The log of the connectivity k is plotted against the log of the frequency of this connectivity. There is a linear relationship between these two variables, as indicated by the square of the Pearson correlation, R2, being 0.91. This shows that the protein co-regulation network derived from ProteomeHD using treeClust is at least approximately scale free. (b) Visualization of a weighted, undirected network with 5,013 nodes (proteins detected in at least 95 experiments) and 62,812 edges (top scoring 0.5% of links), based on the co-regulation score. Four common algorithms were used to create different network layouts, but with so many edges it is difficult to avoid the “hairball” problem.

Supplementary Figure 10 Microproteins in ProteomeHD and their connectivity.

(a) Histogram showing the number of peptides identified per microprotein (proteins < 15 kDa) in ProteomeHD and the subset of ProteomeHD used to make the co-regulation map. Dashed lines show the average number of peptides per microprotein. (b) Average number of SILAC ratio counts (independent observations) per microprotein, broken down into the 294 input experiments. (c) Histogram showing the cumulative SILAC ratio counts per microprotein across all experiments in ProteomeHD. (d) Sequence coverage of microproteins in ProteomeHD. Dashed lines indicate the average. (e) The actual peptides for one example microprotein, MP68. The numbers in brackets indicate in how many different experiments each peptide was observed. (f) Microproteins tend to have more co-regulation partners in ProteomeHD than larger proteins (median 27 vs 10 associations; n = 206 microproteins, n = 2505 other proteins). Microproteins also have more functional protein - protein associations according to STRING (median 23 vs 14; n = 521 microproteins, n = 9,261 other proteins). However, larger proteins have considerably more physical interaction partners than microproteins, according to BioGRID (median 10 vs 17; n = 815 microproteins, n = 14,918 other proteins). (g) The number of interaction partners of microproteins identified by STRING and BioGRID, broken down by the evidence type available in each resource (n = 362 microproteins for mRNA coexpression, 481 for curated databases, 505 for experimental, 908 for text mining, 636 for affinity capture, 251 for co-fractionation, 367 for in vitro and 533 for two-hybrid. We considered STRING interactions with a minimum score of 400 in the individual evidence channels (e.g. mRNA coexpression). Two STRING evidence channels (gene neighborhood and evolutionary co-occurrence) were omitted because they contribute very little. For panel (f) we considered only the most reliable STRING interactions, i.e. those with a combined interaction score above 900. For boxplots, lower and upper hinges correspond to the first and third quartiles, and lower and upper whiskers extend to the smallest or largest value no further than 1.5 * IQR (inter-quartile range) from the hinge, respectively. Notches give roughly a 95% confidence interval for comparing medians.

Supplementary Figure 11 Layout of www.proteomeHD.net.

Screenshot of the core page of www.proteomeHD.net, an interactive web-based app to explore co-regulation data. The basic elements are highlighted and explained. Note that the page also contains help and download sections.

Supplementary Figure 12 Integration of co-regulation scores with STRING (https://string-db.org).

A typical protein - protein association network in STRING, containing the Arp2/3 complex and tropomyosin 3, both of which are involved in actin cytoskeleton regulation. Network edges are colour-coded by the type of evidence available for the association. Protein co-regulation information is embedded in the gene coexpression channel. The channel view shows the channel-specific STRING score, a re-calibrated version of our co-regulation score. It also contains a pre-computed link to www.proteomeHD.net, which uses the first protein as ProteomeHD query and highlights the second protein in the results. If more than one protein isoform is available in ProteomeHD, STRING will link to the alphabetically first isoform, which is generally the main one. The link also contains a cut-off setting to match the ProteomeHD cut-off to the equivalent one selected by the user in STRING. In cases where both mRNA coexpression and protein co-regulation evidence is available for an association, their relative contribution to the STRING coexpression score is indicated (shown here as point 3).

Supplementary Figure 13 MIRO1-induced peroxisomal membrane protrusions depend on PEX11β.

(a-h) PEX5-deficient human skin fibroblasts were mock-treated (control), or transfected with Myc-Miro-PO, a peroxisome-targeted Miro1 variant, in the presence of control- or PEX11β-specific siRNA. Cells were processed for immunofluorescence using anti-Myc and anti-PEX14 antibodies (peroxisomal marker). Results are representative of three independent experiments. (b) Quantification of cells with peroxisomal protrusions. The average result of 3 independent experiments is shown, error bars indicate the mean +/- standard deviation. (a, b) Control cells occasionally contain peroxisomes with membrane protrusions (< 5 per cell; up to 5 µm in length). (c-e, b) Myc-Miro-PO induces the formation of peroxisomal membrane protrusions (> 5 per cell; > 5 µm in length). Results are representative of three independent experiments. (f-h, b) Silencing of PEX11β by siRNA significantly reduces the number of cells with peroxisomal membrane protrusions in controls and Myc-Miro-PO expressing cells. Results are representative of three independent experiments. Globular peroxisomes (arrows) with membrane protrusions (arrowheads) in (a) are highlighted. *** P < 0.001; ** P < 0.01 from a two-tailed unpaired t test; ns, not significant (p = 0.9695). Scale bars, 10 µm.

Supplementary Figure 14 Validation of treeClust and t-SNE on an independent proteomics dataset.

(a) treeClust was applied to the TMT-based cancer proteomics dataset from Lapek et al (Nature Biotechnology, 2017). It outperforms Pearson, Spearman and Bicor correlation, as shown by a Precision-Recall analysis using Reactome annotations as the gold standard. Note that treeClust builds only one decision tree per condition, i.e. 41 trees on this dataset, too few for a standard analysis. Therefore, treeClust was performed iteratively, obtaining the mean co-regulation score of 100 treeClust forests, each generated from 10 random experiments. (b) Co-regulation map for the Lapek et al dataset, made by t-SNE from treeClust scores. As in the correlation network of the original report (Fig. 2 in Lapek et al), CORUM protein complexes are colored. In contrast to a network, there is not a limited number of arbitrarily arranged, pairwise links, but the position of each protein reflects its similarity or dissimilarity to all other proteins in the map. This makes it possible to place all proteins in a functional context, not just those that are directly linked to members of the core network. It also allows for a hierarchical analysis of protein associations, with increasing distances indicating weaker co-regulation. For example, the subunits of the protein complexes in the enlarged map area (inset) are clustered together, and the distances between the complexes are larger. However, all complexes have roles in vesicular trafficking. n = 6,151 proteins shown in plot.

Supplementary Figure 15 Information content of ProteomeHD has not reached saturation yet.

We randomly removed 5%, 10% and 15% of the data points across the ProteomeHD matrix, in triplicate, and repeated treeClust learning to predict protein associations. The Precision-Recall analysis shows that removing data points decreases performance proportionally to the amount of removed data, suggesting that adding additional data would likely enhance performance further.

Supplementary information

Supplementary Information

Supplementary Figs. 1–15.

Reporting Summary

Supplementary Video 1

Interaction of peroxisomal membrane protrusions with mitochondria in COS-7 cells. See Fig. 4a–f. COS-7 cells were transfected with PEX11β-EGFP; mitochondria were stained with Mitotracker (red), and analysed by live-cell imaging using an IX81 microscope (Olympus) equipped with a CSUX1 spinning disk head (Yokogawa). A peroxisome interacts with a mitochondrion via its membrane protrusion, and follows it, occasionally detaching and re-establishing contact. Two hundred stacks of nine planes (0.5 µm thickness, 100 ms exposure) were taken in a continuous stream; 118 frames, 14× speed. Scale bar, 5 µm.

Supplementary Video 2

Note a peroxisome at the bottom, which interacts with a mitochondrion via its membrane protrusion and then wraps around it, possibly to increase the membrane contact area. Two hundred stacks of nine planes (0.5 µm thickness, 100 ms exposure) were taken in a continuous stream; 200 frames, 14× speed. Scale bar, 5 µm

Supplementary Video 3

A mitochondrion, which moves to the left, is dragging a peroxisome with a membrane protrusion with it, indicating that the organelles are tightly tethered to each other. Two hundred stacks of nine planes (0.5 µm thickness, 100 ms exposure) were taken in a continuous stream; 100 frames, 14× speed. Scale bar, 5 µm.

Supplementary Table 1

The table provides the 294 SILAC ratios for 10,323 proteins that make up ProteomeHD.

Supplementary Table 2

Contains PRIDE identifiers and brief description of each project

Supplementary Table 3

Contains the top-scoring 0.5% of protein pairs in the co-regulation analysis. A complete set of all 12.5 million pair-wise associations is available in our PRIDE repository (PXD008888).

Supplementary Table 4

Coordinates and annotations relating to Fig. 1h,i.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Kustatscher, G., Grabowski, P., Schrader, T.A. et al. Co-regulation map of the human proteome enables identification of protein functions. Nat Biotechnol 37, 1361–1371 (2019). https://doi.org/10.1038/s41587-019-0298-5

Download citation

Received: 22 December 2017
Accepted: 27 September 2019
Published: 04 November 2019
Issue Date: November 2019
DOI: https://doi.org/10.1038/s41587-019-0298-5

This article is cited by

CCDC58 is a potential biomarker for diagnosis, prognosis, immunity, and genomic heterogeneity in pan-cancer
- Kai Yang
- Yan Ma
- Ruohan Zhang
Scientific Reports (2024)
Proteome-Wide Identification of RNA-dependent proteins and an emerging role for RNAs in Plasmodium falciparum protein complexes
- Thomas Hollin
- Steven Abel
- Karine G. Le Roch
Nature Communications (2024)
The peroxisome: an update on mysteries 3.0
- Rechal Kumar
- Markus Islinger
- Michael Schrader
Histochemistry and Cell Biology (2024)
Prediction of anti-microtubular target proteins of tubulins and their interacting proteins using Gene Ontology tools
- Polani B. Ramesh Babu
Journal of Genetic Engineering and Biotechnology (2023)
Mapping protein states and interactions across the tree of life with co-fractionation mass spectrometry
- Michael A. Skinnider
- Mopelola O. Akinlaja
- Leonard J. Foster
Nature Communications (2023)