Co-regulation map of the human proteome enables identification of protein functions

Article metrics

Abstract

Assigning functions to the vast array of proteins present in eukaryotic cells remains challenging. To identify relationships between proteins, and thereby enable functional annotation of proteins, we determined changes in abundance of 10,323 human proteins in response to 294 biological perturbations using isotope-labeling mass spectrometry. We applied the machine learning algorithm treeClust to reveal functional associations between co-regulated human proteins from ProteomeHD, a compilation of our own data and datasets from the Proteomics Identifications database. This produced a co-regulation map of the human proteome. Co-regulation was able to capture relationships between proteins that do not physically interact or colocalize. For example, co-regulation of the peroxisomal membrane protein PEX11β with mitochondrial respiration factors led us to discover an organelle interface between peroxisomes and mitochondria in mammalian cells. We also predicted the functions of microproteins that are difficult to study with traditional methods. The co-regulation map can be explored at www.proteomeHD.net.

Access options

Rent or Buy article

Get time limited or full article access on ReadCube.

from$8.99

All prices are NET prices.

Fig. 1: Co-regulation map shows associations between human proteins.
Fig. 2: treeClust improves co-regulation analysis through robust selectivity for close linear relationships.
Fig. 3: Protein co-regulation predicts functions of unknown proteins.
Fig. 4: PEX11β and peroxisomal membrane interactions with mitochondria.
Fig. 5: Protein co-regulation enables higher precision but lower coverage than mRNA coexpression.

Data availability

All mass spectrometry raw files generated in-house have been deposited in the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository36 with the dataset identifier PXD008888. The co-regulation map is hosted on our website at www.proteomeHD.net, and pair-wise co-regulation scores are available through STRING (https://string-db.org). A network of the top 0.5% co-regulated protein pairs can be explored interactively on NDEx (https://doi.org/10.18119/N9N30Q).

Code availability

Data analysis was performed in R 3.5.1. R scripts and input files required to reproduce the results of this manuscript are available in the following GitHub repository: https://github.com/Rappsilber-Laboratory/ProteomeHD. R scripts related specifically to the benchmarking of the treeClust algorithm using synthetic data are available in the following GitHub repository: https://github.com/Rappsilber-Laboratory/treeClust-benchmarking. The R package data.table was used for fast data processing. Figures were prepared using ggplot2, gridExtra, cowplot and viridis.

References

  1. 1.

    Havugimana, P. C. et al. A census of human soluble protein complexes. Cell 150, 1068–1081 (2012).

  2. 2.

    Huttlin, E. L. et al. Architecture of the human interactome defines protein communities and disease networks. Nature 545, 505–509 (2017).

  3. 3.

    Rolland, T. et al. A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014).

  4. 4.

    Foster, L. J. et al. A mammalian organelle map by protein correlation profiling. Cell 125, 187–199 (2006).

  5. 5.

    Christoforou, A. et al. A draft map of the mouse pluripotent stem cell spatial proteome. Nat. Commun. 7, 8992 (2016).

  6. 6.

    Thul, P. J. et al. A subcellular map of the human proteome. Science 356, 6340 (2017).

  7. 7.

    Costanzo, M. et al. A global genetic interaction network maps a wiring diagram of cellular function. Science 353, pii: aaf1420 (2016).

  8. 8.

    Mülleder, M. et al. Functional metabolomics describes the yeast biosynthetic regulome. Cell 167, 553–565.e12 (2016).

  9. 9.

    Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470 (1995).

  10. 10.

    DeRisi, J. L., Iyer, V. R. & Brown, P. O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680–686 (1997).

  11. 11.

    Hughes, T. R. et al. Functional discovery via a compendium of expression profiles. Cell 102, 109–126 (2000).

  12. 12.

    Stuart, J. M., Segal, E., Koller, D. & Kim, S. K. A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249–255 (2003).

  13. 13.

    Wang, J. et al. Proteome profiling outperforms transcriptome profiling for coexpression based gene function prediction. Mol. Cell. Proteomics 16, 121–134 (2017).

  14. 14.

    Lapek, J. D. Jr et al. Detection of dysregulated protein-association networks by high-throughput proteomics predicts cancer vulnerabilities. Nat. Biotechnol. 35, 983–989 (2017).

  15. 15.

    Liu, Y., Beyer, A. & Aebersold, R. On the dependency of cellular protein levels on mRNA abundance. Cell 165, 535–550 (2016).

  16. 16.

    Fortelny, N., Overall, C. M., Pavlidis, P. & Freue, G. V. C. Can we predict protein from mRNA levels? Nature 547, E19–E20 (2017).

  17. 17.

    Batada, N. N., Urrutia, A. O. & Hurst, L. D. Chromatin remodelling is a major source of coexpression of linked genes in yeast. Trends Genet. 23, 480–484 (2007).

  18. 18.

    Kustatscher, G., Grabowski, P. & Rappsilber, J. Pervasive coexpression of spatially proximal genes is buffered at the protein level. Mol. Syst. Biol. 13, 937 (2017).

  19. 19.

    Raj, A., Peskin, C. S., Tranchina, D., Vargas, D. Y. & Tyagi, S. Stochastic mRNA synthesis in mammalian cells. PLoS Biol. 4, e309 (2006).

  20. 20.

    Ebisuya, M., Yamamoto, T., Nakajima, M. & Nishida, E. Ripples from neighbouring transcription. Nat. Cell Biol. 10, 1106–1113 (2008).

  21. 21.

    Battle, A. et al. Genomic variation. Impact of regulatory variation from RNA to protein. Science 347, 664–667 (2015).

  22. 22.

    Geiger, T., Cox, J. & Mann, M. Proteomic changes resulting from gene copy number variations in cancer cells. PLoS Genet. 6, e1001090 (2010).

  23. 23.

    Stingele, S. et al. Global analysis of genome, transcriptome and proteome reveals the response to aneuploidy in human cells. Mol. Syst. Biol. 8, 608 (2012).

  24. 24.

    Wu, L. et al. Variation and genetic control of protein abundance in humans. Nature 499, 79–82 (2013).

  25. 25.

    Kustatscher, G. et al. Proteomics of a fuzzy organelle: interphase chromatin. EMBO J. 33, 648–664 (2014).

  26. 26.

    Wu, Y. et al. Multilayered genetic and omics dissection of mitochondrial activity in a mouse reference population. Cell 158, 1415–1430 (2014).

  27. 27.

    Kustatscher, G., Grabowski, P. & Rappsilber, J. Multiclassifier combinatorial proteomics of organelle shadows at the example of mitochondria in chromatin data. Proteomics 16, 393–401 (2016).

  28. 28.

    Williams, E. G. et al. Systems proteomics of liver mitochondria function. Science 352, aad0189 (2016).

  29. 29.

    Gupta, S., Turan, D., Tavernier, J. & Martens, L. The online Tabloid Proteome: an annotated database of protein associations. Nucleic Acids Res. 46, D581–D585 (2017).

  30. 30.

    Singh, S. A. et al. Co-regulation proteomics reveals substrates and mechanisms of APC/C-dependent degradation. EMBO J. 33, 385–399 (2014).

  31. 31.

    Andersen, J. S. et al. Proteomic characterization of the human centrosome by protein correlation profiling. Nature 426, 570–574 (2003).

  32. 32.

    Kirchner, M. et al. Computational protein profile similarity screening for quantitative mass spectrometry experiments. Bioinformatics 26, 77–83 (2010).

  33. 33.

    Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014).

  34. 34.

    Uhlén, M. et al. Proteomics. Tissue-based map of the human proteome. Science 347, 1260419 (2015).

  35. 35.

    Ong, S.-E. et al. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell. Proteomics 1, 376–386 (2002).

  36. 36.

    Vizcaíno, J. A. et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 44, D447–D456 (2016).

  37. 37.

    Fabregat, A. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 44, D481–D487 (2016).

  38. 38.

    Buttrey, S. E. & Whitaker, L.R. treeClust: an R package for tree-based clustering dissimilarities. R J. 7, 227–236 (2015).

  39. 39.

    Buttrey, S. E. & Whitaker, L. R. A scale-independent, noise-resistant dissimilarity for tree-based clustering of mixed data. NPS Technical Report Archive https://calhoun.nps.edu/handle/10945/48615 (2016).

  40. 40.

    Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N. & Barabási, A. L. Hierarchical organization of modularity in metabolic networks. Science 297, 1551–1555 (2002).

  41. 41.

    Yip, A. M. & Horvath, S. Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics 8, 22 (2007).

  42. 42.

    The Gene Ontology Consortium. Expansion of the gene ontology knowledgebase and resources. Nucleic Acids Res. 45, D331–D338 (2017).

  43. 43.

    Van Der Maaten, L. & Hinton, G. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 26 (2008).

  44. 44.

    García-Aguilar, A. & Cuezva, J. M. A review of the inhibition of the mitochondrial ATP synthase by IF1 in vivo: reprogramming energy metabolism and inducing mitohormesis. Front. Physiol. 9, 1322 (2018).

  45. 45.

    The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169 (2017).

  46. 46.

    Forbes, S. A. et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 45, D777–D783 (2017).

  47. 47.

    Piñero, J. et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 45, D833–D839 (2017).

  48. 48.

    Andrews, S. J. & Rothnagel, J. A. Emerging evidence for functional peptides encoded by short open reading frames. Nat. Rev. Genet. 15, 193–204 (2014).

  49. 49.

    D’Lima, N. G. et al. A human microprotein that interacts with the mRNA decapping complex. Nat. Chem. Biol. 13, 174–180 (2017).

  50. 50.

    Chu, Q. et al. Identification of microprotein–protein interactions via APEX tagging. Biochemistry 56, 3299–3306 (2017).

  51. 51.

    Slavoff, S. A. et al. Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nat. Chem. Biol. 9, 59–64 (2013).

  52. 52.

    Meyer, B., Wittig, I., Trifilieff, E., Karas, M. & Schägger, H. Identification of two proteins associated with mammalian ATP synthase. Mol. Cell. Proteomics 6, 1690–1699 (2007).

  53. 53.

    Chen, R., Runswick, M. J., Carroll, J., Fearnley, I. M. & Walker, J. E. Association of two proteolipids of unknown function with ATP synthase from bovine heart mitochondria. FEBS Lett. 581, 3145–3148 (2007).

  54. 54.

    Fujikawa, M., Ohsakaya, S., Sugawara, K. & Yoshida, M. Population of ATP synthase molecules in mitochondria is limited by available 6.8-kDa proteolipid protein (MLQ). Genes Cells 19, 153–160 (2014).

  55. 55.

    Borner, G. H. H. et al. Multivariate proteomic profiling identifies novel accessory proteins of coated vesicles. J. Cell Biol. 197, 141–160 (2012).

  56. 56.

    Signorile, A., Sgaramella, G., Bellomo, F. & De Rasmo, D. Prohibitins: a critical role in mitochondrial functions and implication in diseases. Cells 8, pii: E71 (2019).

  57. 57.

    Brennan, R. et al. Investigating nucleo-cytoplasmic shuttling of the human DEAD-box helicase DDX3. Eur. J. Cell Biol. 97, 501–511 (2018).

  58. 58.

    Szklarczyk, D. et al. STRINGv11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47, D607–D613 (2019).

  59. 59.

    Schrader, M., Costello, J. L., Godinho, L. F., Azadi, A. S. & Islinger, M. Proliferation and fission of peroxisomes—an update. Biochim. Biophys. Acta 1863, 971–983 (2016).

  60. 60.

    Schrader, M., Costello, J., Godinho, L. F. & Islinger, M. Peroxisome-mitochondria interplay and disease. J. Inherit. Metab. Dis. 38, 681–702 (2015).

  61. 61.

    Devine, M. J., Birsa, N. & Kittler, J. T. Miro sculpts mitochondrial dynamics in neuronal health and disease. Neurobiol. Dis. 90, 27–34 (2016).

  62. 62.

    Costello, J. L. et al. Predicting the targeting of tail-anchored proteins to subcellular compartments in mammalian cells. J. Cell Sci. 130, 1675–1687 (2017).

  63. 63.

    Okumoto, K. et al. New splicing variants of mitochondrial Rho GTPase-1 (Miro1) transport peroxisomes. J. Cell Biol. 217, 619–633 (2018).

  64. 64.

    Castro, I. G. et al. A role for mitochondrial Rho GTPase 1 (MIRO1) in motility and membrane dynamics of peroxisomes. Traffic 19, 229–242 (2018).

  65. 65.

    Rodríguez-Serrano, M., Romero-Puertas, M. C., Sanz-Fernández, M., Hu, J. & Sandalio, L. M. Peroxisomes extend peroxules in a fast response to stress via a reactive oxygen species-mediated induction of the peroxin PEX11a. Plant Physiol. 171, 1665–1674 (2016).

  66. 66.

    Mattiazzi Ušaj, M. et al. Genome-wide localization study of yeast Pex11 identifies peroxisome–mitochondria interactions through the ERMES complex. J. Mol. Biol. 427, 2072–2087 (2015).

  67. 67.

    Shai, N. et al. Systematic mapping of contact sites reveals tethers and a function for the peroxisome–mitochondria contact. Nat. Commun. 9, 1761 (2018).

  68. 68.

    Pickrell, J. K. et al. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768–772 (2010).

  69. 69.

    Barrett, T. et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41, D991–D995 (2013).

  70. 70.

    Szklarczyk, D. et al. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 45, D362–D368 (2017).

  71. 71.

    Jovanovic, M. et al. Immunogenetics. Dynamic profiling of the protein life cycle in response to pathogens. Science 347, 1259038 (2015).

  72. 72.

    Kustatscher, G., Wills, K. L. H., Furlan, C. & Rappsilber, J. Chromatin enrichment for proteomics. Nat. Protoc. 9, 2090–2099 (2014).

  73. 73.

    Alabert, C. et al. Nascent chromatin capture proteomics determines chromatin dynamics during DNA replication and identifies unknown fork components. Nat. Cell Biol. 16, 281–293 (2014).

  74. 74.

    Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367–1372 (2008).

  75. 75.

    Zhang, B. & Horvath, S. A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. 4, 17 (2005).

  76. 76.

    Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008).

  77. 77.

    Pratt, D. et al. NDEx, the network data exchange. Cell Syst. 1, 302–305 (2015).

  78. 78.

    Langfelder, P. & Horvath, S. Fast R. Functions for robust correlations and hierarchical clustering. J. Stat. Softw. 46, (2012).

  79. 79.

    Sing, T., Sander, O., Beerenwinkel, N. & Lengauer, T. ROCR: visualizing classifier performance in R. Bioinformatics 21, 3940–3941 (2005).

  80. 80.

    Krijthe, J. H. Rtsne: t-distributed stochastic neighbor embedding using Barnes–Hut implementation. https://github.com/jkrijthe/Rtsne (2015).

  81. 81.

    Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal Complex Systems, 1695 (2006).

  82. 82.

    Schloerke, B. et al. GGally: extension to ‘ggplot2’. https://github.com/ggobi/ggally (2018).

  83. 83.

    Butts, C. T. sna: tools for social network analysis. https://cran.r-project.org/web/packages/sna/index.html (2016).

  84. 84.

    Alexa, A. & Rahnenfuhrer, J. topGO: Enrichment analysis for gene ontology. R version 2.30.0, https://bioconductor.org/packages/release/bioc/html/topGO.html (2016).

  85. 85.

    Binns, D. et al. QuickGO: a web-based tool for gene ontology searching. Bioinformatics 25, 3045–3046 (2009).

  86. 86.

    Stark, C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 34, D535–D539 (2006).

  87. 87.

    Orchard, S. et al. The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 42, D358–D363 (2014).

  88. 88.

    Szklarczyk, D. et al. STRINGv10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447–D452 (2015).

  89. 89.

    Jaccard, P. Distribution de la flore alpine dans le bassin des dranses et dans quelques régions voisines. Bull. Soc. Vaud. Sci. Nat. 37, 241–272 (1901).

  90. 90.

    Costello, J. L. et al. ACBD5 and VAPB mediate membrane associations between peroxisomes and the ER. J. Cell Biol. 216, 331–342 (2017).

Download references

Acknowledgements

We are grateful to D. Szklarczyk for providing the mRNA Pearson correlation data used by STRING and the STRING team for testing our coregulation data and adding them as a novel evidence type to STRING 11. We also thank K. Wills, K. Nakamura, C. Alabert and A. Groth for contributing chromatin enrichment experiments, and A.S. Azadi for support with live-cell imaging. This work was supported by the Wellcome Trust through a Senior Research Fellowship to J.R. (grant no. 103139) and by the Biotechnology and Biological Sciences Research Council (grant nos. BB/N01541X/1 and BB/R016844/1 to M.S.) and grant no. H2020-MSCA-ITN-2018 812968 PERICO (to M.S.). The Wellcome Centre for Cell Biology is supported by core funding from the Wellcome Trust (grant no. 203149).

Author information

G.K. and J.R. conceived the project. G.K. and P.G. conducted the data analysis. P.G. created the web application. T.A.S., J.B.P. and M.S. conducted the Pex11β analysis. All authors contributed to writing the manuscript.

Correspondence to Juri Rappsilber.

Ethics declarations

Competing interests

The authors declare no competing interests.

Integrated supplementary information

Supplementary Figure 1 Peptide and protein quantitation statistics in ProteomeHD.

(a) Histogram showing the number of peptides identified per protein in ProteomeHD (10,323 proteins, light blue) and in the subset of ProteomeHD used to make the co-regulation map (5,013 proteins, dark blue). Dashed lines show the average number of peptides per protein. (b) Number of peptides per protein broken down by experiment. The average peptide number for the proteins detected in each experiment is shown. (c) Average number of SILAC ratio counts (independent observations) per protein, broken down into the 294 input experiments. (d) Sequence coverage of proteins in ProteomeHD. Dashed lines indicate the average. (e) Average sequence coverage of proteins in each input experiment. (f) The number of proteins that were quantified in the 294 experiments of ProteomeHD ranges from 817 to 6,080. The average is 3,928 proteins per SILAC ratio. (g) Number of experiments, i.e. SILAC ratios, in which proteins were quantified. Only proteins that were quantified in at least 95 experiments were used for the co-regulation analysis. On average, proteins in ProteomeHD were quantified in 112 input experiments. The average rises to 190 if only proteins used for the co-regulation analysis are considered. (h) Bar chart showing which fraction of proteins have been detected in which fraction of experiments. For example, 100% of proteins in the co-regulation map have been quantified in at least 30% of the 294 experiments. About 15% of the proteins have been quantified in at least 90% of the experiments.

Supplementary Figure 2 Impact of co-occurrence on treeClust learning.

Performance comparison of a standard treeClust application on ProteomeHD with two types of co-occurrence measures. Jaccard scores are an established co-occurrence measure (protein pairs observed in the same set of experiments would get a Jaccard score of 1, while protein pairs without any overlapping experiments would get a score of 0). We also applied treeClust to a “binary” version of ProteomeHD, where all SILAC ratios were set to 1 and all missing values were set to 0. The precision recall curve uses Reactome as a gold standard. It shows that Jaccard and “binary treeClust” work equally well but both are outperformed by the standard co-regulation analysis. Therefore, while co-occurrence of proteins across ProteomeHD does provide some information about functional associations, quantitative up- and down regulation is a far better indicator of shared protein function, at least for ProteomeHD. Notably, this also shows that treeClust can detect co-occurrence, in principle, if the data are transformed into a binary format.

Supplementary Figure 3 treeClust detects specifically positive linear associations.

We tested which types of relationships treeClust detects by using a synthetic dataset consisting of 100 variables and 200 proteins, where 0.5% of all possible protein - protein combination have a defined relationship. (a) Precision - recall (PR) analyses show that treeClust separates linear from random relationships perfectly, resulting in an area under the PR curve (AUPRC) of 1. The same result is observed for the three tested correlation-based metrics: PCC, Spearman’s rho and biweight midcorrelation (bicor). The four PR curves overlap fully. (b) TreeClust completely fails to detect exponential or logistic relationships (AUPRC = 0). In contrast, although these pairs receive lower correlation coefficients than linear pairs, they still score high enough with PCC, rho and bicor to be completely separated from the pool of random associations. No metric detects quadratic relationships. (c) Anti-correlations are not identified well by treeClust.

Supplementary Figure 4 Impact of data size and missing values on treeClust performance.

We used synthetic data to assess the impact of various data characteristics on treeClust performance. This figure complements Fig. 2. (a) Synthetic datasets of 50 samples and 500 proteins were created with increasing percentage of defined linear relationships. This has no impact on the three correlation metrics (PCC, rho and bicor), so their curves overlap fully at AUPRC 1. Treeclust performance needs > 0.3% linear relationships in the data in order to detect them successfully. Synthetic datasets were created in triplicate. Points show the average area under the precision recall curve (AUPRC) obtained for each setting. Error bars show the standard error of the mean. (b) Combinatorial impact of the number of samples and the percentage of defined linear relationships (N proteins = 500). Note that for larger datasets lower percentages of “coexpressed” proteins can be detected. (c) TreeClust, but not the three correlation metrics, is also affected by the number of available observations (proteins). N samples = 20, 0.3% linear associations. (d, e) Adding missing values to a small (n = 50 samples, n = 500 proteins) and medium (n = 100 samples, n = 1,000 proteins) dataset, respectively, has a different impact on treeClust performance. (f) Combinatorial impact of missing values and the number of proteins, showing that for large datasets with many proteins a larger percentage of missing values can be tolerated (N samples = 150).

Supplementary Figure 5 Illustration of changing the goodness-of-fit and outlier occurrence in synthetic data.

This figure illustrates the different conditions tested in Fig. 2d, e. (a) Scatterplots illustrating the effect of increasing the difference between variables, which decreases treeClust performance but not that of correlation metrics. (b) Scatterplots illustrating the effect of adding outlier data points, which decreases treeClust performance less than that of the correlation metrics.

Supplementary Figure 6 Outliers in ProteomeHD and their impact on coexpression metrics.

(a) Co-regulated protein pairs in ProteomeHD were divided into those detected by treeClust but not by PCC and vice versa. Separate comparisons were made for pairs detected by treeClust but not rho, and treeClust but not bicor. The pairs in the resulting groups were annotated using Reactome into known, biologically relevant interactions (true positives) and pairs that were unlikely to have any biological associations (false positives). Note that treeClust-specific pairs tend to be true positives, whereas correlation-specific pairs tend to be false positives. (b) This panel complements Fig. 2f. Outliers were detected in ProteomeHD via their Mahalanobis distance, i.e. these outliers are located far from the bulk of the data, but can be close to the regression line. The boxplots show that Mahalanobis outliers are more frequent in protein pairs detected specifically by rho or bicor as opposed to pairs detected specifically by treeClust. The number of protein pairs shown corresponds to n for each group as indicated in (a). (c) Removing these Mahalanobis outliers has little impact on the PCC of treeClust-, rho- or bicor-specific protein pairs, in contrast to what was observed for Pearson’s correlation (see Fig. 2g). For number of proteins shown, see panel (a). (d) A second type of outlier - regression outliers - were detected in ProteomeHD via studentized residuals. These outliers are located far away from the regression line and will decrease correlation coefficients. An example of a true association is shown, where regression outliers affect the resulting correlation. Fold-changes have been scaled to lie between 0 and 1. (e) The percentage of regression outliers is very similar in all six groups. See panel (a) for number of proteins shown. (f) Removing regression outliers increases the correlation coefficient (PCC) of protein pairs that were previously detected only by treeClust, suggesting PCC missed some of these pairs because of regression outliers. This is not the case for pairs missed by rho or bicor. See panel (a) for number of proteins shown. For boxplots, lower and upper hinges correspond to the first and third quartiles, and lower and upper whiskers extend to the smallest or largest value no further than 1.5 * IQR (inter-quartile range) from the hinge, respectively. Notches give roughly a 95% confidence interval for comparing medians.

Supplementary Figure 7 Goodness-of-fit partially explains different performance of PCC and treeClust.

This figure complements Fig. 2i. Systematic comparison of mean absolute errors (MAEs) from protein pairs that scored high with either treeClust or with PCC (see Supplementary Fig. 6a; n = 8,786 treeClust-specific protein pairs, n = 9,593 PCC-specific protein pairs). Protein pairs exclusively detected by PCC tend to have somewhat higher MAEs, possibly explaining why they are predominantly false-positive hits, in addition to the impact of outliers.

Supplementary Figure 8 Lack of genuine non-linear relationships in ProteomeHD.

(a) Exponential and logistic (sigmoid) models were fitted to all protein pairs that scored high with treeClust or the three correlation metrics. Model fit was compared through their residual sum of squares (RSS). Exponential models only fitted better than linear ones in rare cases, but logistic models often did. Around half of the protein pairs detected specifically by PCC are better explained by a logistic than a linear model. However, this is mainly driven by Mahalanobis-type outliers. Removing those strongly reduces the number of logistic models outfitting the linear ones. (b) Two example regressions where an exponential (left) or logistic (right) model fits better than a linear one. Note that this clearly reflects overfitting due to outliers rather than genuine non-linear relationships.

Supplementary Figure 9 The protein co-regulation network satisfies scale-free topology but is difficult to visualize as an interaction network.

(a) The “scale free plot” produced by the WGCNA R package using the treeClust-derived adjacency matrix. The log of the connectivity k is plotted against the log of the frequency of this connectivity. There is a linear relationship between these two variables, as indicated by the square of the Pearson correlation, R2, being 0.91. This shows that the protein co-regulation network derived from ProteomeHD using treeClust is at least approximately scale free. (b) Visualization of a weighted, undirected network with 5,013 nodes (proteins detected in at least 95 experiments) and 62,812 edges (top scoring 0.5% of links), based on the co-regulation score. Four common algorithms were used to create different network layouts, but with so many edges it is difficult to avoid the “hairball” problem.

Supplementary Figure 10 Microproteins in ProteomeHD and their connectivity.

(a) Histogram showing the number of peptides identified per microprotein (proteins < 15 kDa) in ProteomeHD and the subset of ProteomeHD used to make the co-regulation map. Dashed lines show the average number of peptides per microprotein. (b) Average number of SILAC ratio counts (independent observations) per microprotein, broken down into the 294 input experiments. (c) Histogram showing the cumulative SILAC ratio counts per microprotein across all experiments in ProteomeHD. (d) Sequence coverage of microproteins in ProteomeHD. Dashed lines indicate the average. (e) The actual peptides for one example microprotein, MP68. The numbers in brackets indicate in how many different experiments each peptide was observed. (f) Microproteins tend to have more co-regulation partners in ProteomeHD than larger proteins (median 27 vs 10 associations; n = 206 microproteins, n = 2505 other proteins). Microproteins also have more functional protein - protein associations according to STRING (median 23 vs 14; n = 521 microproteins, n = 9,261 other proteins). However, larger proteins have considerably more physical interaction partners than microproteins, according to BioGRID (median 10 vs 17; n = 815 microproteins, n = 14,918 other proteins). (g) The number of interaction partners of microproteins identified by STRING and BioGRID, broken down by the evidence type available in each resource (n = 362 microproteins for mRNA coexpression, 481 for curated databases, 505 for experimental, 908 for text mining, 636 for affinity capture, 251 for co-fractionation, 367 for in vitro and 533 for two-hybrid. We considered STRING interactions with a minimum score of 400 in the individual evidence channels (e.g. mRNA coexpression). Two STRING evidence channels (gene neighborhood and evolutionary co-occurrence) were omitted because they contribute very little. For panel (f) we considered only the most reliable STRING interactions, i.e. those with a combined interaction score above 900. For boxplots, lower and upper hinges correspond to the first and third quartiles, and lower and upper whiskers extend to the smallest or largest value no further than 1.5 * IQR (inter-quartile range) from the hinge, respectively. Notches give roughly a 95% confidence interval for comparing medians.

Supplementary Figure 11 Layout of www.proteomeHD.net.

Screenshot of the core page of www.proteomeHD.net, an interactive web-based app to explore co-regulation data. The basic elements are highlighted and explained. Note that the page also contains help and download sections.

Supplementary Figure 12 Integration of co-regulation scores with STRING (https://string-db.org).

A typical protein - protein association network in STRING, containing the Arp2/3 complex and tropomyosin 3, both of which are involved in actin cytoskeleton regulation. Network edges are colour-coded by the type of evidence available for the association. Protein co-regulation information is embedded in the gene coexpression channel. The channel view shows the channel-specific STRING score, a re-calibrated version of our co-regulation score. It also contains a pre-computed link to www.proteomeHD.net, which uses the first protein as ProteomeHD query and highlights the second protein in the results. If more than one protein isoform is available in ProteomeHD, STRING will link to the alphabetically first isoform, which is generally the main one. The link also contains a cut-off setting to match the ProteomeHD cut-off to the equivalent one selected by the user in STRING. In cases where both mRNA coexpression and protein co-regulation evidence is available for an association, their relative contribution to the STRING coexpression score is indicated (shown here as point 3).

Supplementary Figure 13 MIRO1-induced peroxisomal membrane protrusions depend on PEX11β.

(a-h) PEX5-deficient human skin fibroblasts were mock-treated (control), or transfected with Myc-Miro-PO, a peroxisome-targeted Miro1 variant, in the presence of control- or PEX11β-specific siRNA. Cells were processed for immunofluorescence using anti-Myc and anti-PEX14 antibodies (peroxisomal marker). Results are representative of three independent experiments. (b) Quantification of cells with peroxisomal protrusions. The average result of 3 independent experiments is shown, error bars indicate the mean +/- standard deviation. (a, b) Control cells occasionally contain peroxisomes with membrane protrusions (< 5 per cell; up to 5 µm in length). (c-e, b) Myc-Miro-PO induces the formation of peroxisomal membrane protrusions (> 5 per cell; > 5 µm in length). Results are representative of three independent experiments. (f-h, b) Silencing of PEX11β by siRNA significantly reduces the number of cells with peroxisomal membrane protrusions in controls and Myc-Miro-PO expressing cells. Results are representative of three independent experiments. Globular peroxisomes (arrows) with membrane protrusions (arrowheads) in (a) are highlighted. *** P < 0.001; ** P < 0.01 from a two-tailed unpaired t test; ns, not significant (p = 0.9695). Scale bars, 10 µm.

Supplementary Figure 14 Validation of treeClust and t-SNE on an independent proteomics dataset.

(a) treeClust was applied to the TMT-based cancer proteomics dataset from Lapek et al (Nature Biotechnology, 2017). It outperforms Pearson, Spearman and Bicor correlation, as shown by a Precision-Recall analysis using Reactome annotations as the gold standard. Note that treeClust builds only one decision tree per condition, i.e. 41 trees on this dataset, too few for a standard analysis. Therefore, treeClust was performed iteratively, obtaining the mean co-regulation score of 100 treeClust forests, each generated from 10 random experiments. (b) Co-regulation map for the Lapek et al dataset, made by t-SNE from treeClust scores. As in the correlation network of the original report (Fig. 2 in Lapek et al), CORUM protein complexes are colored. In contrast to a network, there is not a limited number of arbitrarily arranged, pairwise links, but the position of each protein reflects its similarity or dissimilarity to all other proteins in the map. This makes it possible to place all proteins in a functional context, not just those that are directly linked to members of the core network. It also allows for a hierarchical analysis of protein associations, with increasing distances indicating weaker co-regulation. For example, the subunits of the protein complexes in the enlarged map area (inset) are clustered together, and the distances between the complexes are larger. However, all complexes have roles in vesicular trafficking. n = 6,151 proteins shown in plot.

Supplementary Figure 15 Information content of ProteomeHD has not reached saturation yet.

We randomly removed 5%, 10% and 15% of the data points across the ProteomeHD matrix, in triplicate, and repeated treeClust learning to predict protein associations. The Precision-Recall analysis shows that removing data points decreases performance proportionally to the amount of removed data, suggesting that adding additional data would likely enhance performance further.

Supplementary information

Supplementary Information

Supplementary Figs. 1–15.

Reporting Summary

Supplementary Video 1

Interaction of peroxisomal membrane protrusions with mitochondria in COS-7 cells. See Fig. 4a–f. COS-7 cells were transfected with PEX11β-EGFP; mitochondria were stained with Mitotracker (red), and analysed by live-cell imaging using an IX81 microscope (Olympus) equipped with a CSUX1 spinning disk head (Yokogawa). A peroxisome interacts with a mitochondrion via its membrane protrusion, and follows it, occasionally detaching and re-establishing contact. Two hundred stacks of nine planes (0.5 µm thickness, 100 ms exposure) were taken in a continuous stream; 118 frames, 14× speed. Scale bar, 5 µm.

Supplementary Video 2

Note a peroxisome at the bottom, which interacts with a mitochondrion via its membrane protrusion and then wraps around it, possibly to increase the membrane contact area. Two hundred stacks of nine planes (0.5 µm thickness, 100 ms exposure) were taken in a continuous stream; 200 frames, 14× speed. Scale bar, 5 µm

Supplementary Video 3

A mitochondrion, which moves to the left, is dragging a peroxisome with a membrane protrusion with it, indicating that the organelles are tightly tethered to each other. Two hundred stacks of nine planes (0.5 µm thickness, 100 ms exposure) were taken in a continuous stream; 100 frames, 14× speed. Scale bar, 5 µm.

Supplementary Table 1

The table provides the 294 SILAC ratios for 10,323 proteins that make up ProteomeHD.

Supplementary Table 2

Contains PRIDE identifiers and brief description of each project

Supplementary Table 3

Contains the top-scoring 0.5% of protein pairs in the co-regulation analysis. A complete set of all 12.5 million pair-wise associations is available in our PRIDE repository (PXD008888).

Supplementary Table 4

Coordinates and annotations relating to Fig. 1h,i.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Kustatscher, G., Grabowski, P., Schrader, T.A. et al. Co-regulation map of the human proteome enables identification of protein functions. Nat Biotechnol 37, 1361–1371 (2019) doi:10.1038/s41587-019-0298-5

Download citation