The proteome landscape of the kingdoms of life

Müller, Johannes B.; Geyer, Philipp E.; Colaço, Ana R.; Treit, Peter V.; Strauss, Maximilian T.; Oroshi, Mario; Doll, Sophia; Virreira Winter, Sebastian; Bader, Jakob M.; Köhler, Niklas; Theis, Fabian; Santos, Alberto; Mann, Matthias

doi:10.1038/s41586-020-2402-x

Article
Published: 17 June 2020

Nature volume 582, pages 592–596 (2020)

Abstract

Proteins carry out the vast majority of functions in all biological domains, but for technological reasons their large-scale investigation has lagged behind the study of genomes. Since the first essentially complete eukaryotic proteome was reported¹, advances in mass-spectrometry-based proteomics² have enabled increasingly comprehensive identification and quantification of the human proteome³,⁴,⁵,⁶. However, there have been few comparisons across species⁷,⁸, in stark contrast with genomics initiatives⁹. Here we use an advanced proteomics workflow (in which the peptide separation step is performed by a microstructured and extremely reproducible chromatographic system) for the in-depth study of 100 taxonomically diverse organisms. With two million peptide and 340,000 stringent protein identifications obtained in a standardized manner, we double the number of proteins with solid experimental evidence known to the scientific community. The data also provide a large-scale case study for sequence-based machine learning, as we demonstrate by experimentally confirming the predicted properties of peptides from Bacteroides uniformis. Our results offer a comparative view of the functional organization of organisms across the entire evolutionary range. A remarkably high fraction of the total proteome mass in all kingdoms is dedicated to protein homeostasis and folding, highlighting the biological challenge of maintaining protein structure in all branches of life. Likewise, a universally high fraction is involved in supplying energy resources, although these pathways range from photosynthesis through iron sulfur metabolism to carbohydrate metabolism. Generally, however, proteins and proteomes are remarkably diverse between organisms, and they can readily be explored and functionally compared at www.proteomesoflife.org.

**Fig. 1: Collection of organism samples across the tree of life, and integration of the proteomic workflow.**

**Fig. 2: Application of a deep learning model to predict peptide retention times for liquid chromatography with tandem mass spectrometry (LC-MS/MS) measurements.**

**Fig. 3: Organism-resolved integration of proteome data into a global analysis.**

**Fig. 4: Global view of the expression levels of functional groups across the 100 organisms.**

Data availability

The MS-based proteomics data have been deposited in the ProteomeXchange Consortium via the PRIDE partner repository and are available via ProteomeXchange with identifiers PXD014877 and PXD019483.

Code availability

Custom computer code is available at https://github.com/MannLabs/proteomesoflife.

References

de Godoy, L. M. F. et al. Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature 455, 1251–1254 (2008).
Aebersold, R. & Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355 (2016).
Nagaraj, N. et al. System-wide perturbation analysis with nearly complete coverage of the yeast proteome by single-shot ultra HPLC runs on a bench top Orbitrap. Mol. Cell. Proteomics 11, M111.013722 (2012).
Kim, M.-S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).
Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014).
Bekker-Jensen, D. B. et al. An optimized shotgun strategy for the rapid generation of comprehensive human proteomes. Cell Syst. 4, 587–599 (2017).
Weiss, M., Schrimpf, S., Hengartner, M. O., Lercher, M. J. & von Mering, C. Shotgun proteomics data from multiple organisms reveals remarkable quantitative conservation of the eukaryotic core proteome. Proteomics 10, 1297–1306 (2010).
Marx, H. et al. A proteomic atlas of the legume Medicago truncatula and its nitrogen-fixing endosymbiont Sinorhizobium meliloti. Nat. Biotechnol. 34, 1198–1205 (2016).
Shendure, J. et al. DNA sequencing at 40: past, present and future. Nature 550, 345–353 (2017); correction Nature 568, E11 (2019).
Kulak, N. A., Pichler, G., Paron, I., Nagaraj, N. & Mann, M. Minimal, encapsulated proteomic-sample processing applied to copy-number estimation in eukaryotic cells. Nat. Methods 11, 319–324 (2014).
Geyer, P. E. et al. Plasma proteome profiling to assess human health and disease. Cell Syst. 2, 185–195 (2016).
De Beeck, J. O. et al. Digging deeper into the human proteome: a novel nanoflow LCMS setup using micro pillar array columns (μPAC™). Preprint at bioRxiv https://doi.org/10.1101/472134 (2018).
Kulak, N. A., Geyer, P. E. & Mann, M. Loss-less nano-fractionator for high sensitivity, high coverage proteomics. Mol. Cell. Proteomics 16, 694–705 (2017).
Zhou, X.-X. et al. pDeep: predicting MS/MS spectra of peptides with deep learning. Anal. Chem. 89, 12690–12697 (2017).
Tiwary, S. et al. High-quality MS/MS spectrum prediction for data-dependent and data-independent acquisition data analysis. Nat. Methods 16, 519–525 (2019).
Gessulat, S. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat. Methods 16, 509–518 (2019).
UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47 (D1), D506–D515 (2019).
Muñoz, J. & Heck, A. J. R. From the human genome to the human proteome. Angew. Chem. Int. Edn 53, 10864–10866 (2014).
Cox, J. et al. Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Mol. Cell. Proteomics 13, 2513–2526 (2014).
Altenhoff, A. M. et al. Standardized benchmarking in the quest for orthologs. Nat. Methods 13, 425–430 (2016).
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47 (D1), D309–D314 (2019).
The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res. 47 (D1), D330–D338 (2019).
Geer, L. Y. et al. The NCBI BioSystems database. Nucleic Acids Res. 38, D492–D496 (2010).
El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Res. 47 (D1), D427–D432 (2019).
Santos, A. et al. Clinical knowledge graph integrates proteomics data into clinical decision-making. Preprint at bioRxiv https://doi.org/10.1101/2020.05.09.084897 (2020).
Cox, J. & Mann, M. 1D and 2D annotation enrichment: a statistical method integrating quantitative proteomics with complementary high-throughput data. BMC Bioinformatics 13 (Suppl 16), S12 (2012).
Chi, H. et al. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine. Nat. Biotechnol. 36, 1059–1061 (2018).
Zielinska, D. F., Gnad, F., Schropp, K., Wiśniewski, J. R. & Mann, M. Mapping N-glycosylation sites across seven evolutionarily distant species reveals a divergent substrate proteome despite a common core machinery. Mol. Cell 46, 542–548 (2012).
Wiśniewski, J. R., Wegler, C. & Artursson, P. Multiple-enzyme-digestion strategy improves accuracy and sensitivity of label- and standard-free absolute quantification to a level that is achievable by analysis with stable isotope-labeled standard spiking. J. Proteome Res. 18, 217–224 (2019).
Kelstrup, C. D. et al. Performance evaluation of the Q Exactive HF-X for shotgun proteomics. J. Proteome Res. 17, 727–738 (2018).
Scheltema, R. A. & Mann, M. SprayQc: a real-time LC-MS/MS quality monitoring system to maximize uptime using off the shelf components. J. Proteome Res. 11, 3458–3466 (2012).
Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367–1372 (2008).
Cox, J. et al. Andromeda: a peptide search engine integrated into the MaxQuant environment. J. Proteome Res. 10, 1794–1805 (2011).
Tyanova, S. et al. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods 13, 731–740 (2016).
Wichmann, C. et al. MaxQuant.Live enables global targeting of more than 25,000 peptides. Mol. Cell. Proteomics 18, 982–994 (2019).
Perez-Riverol, Y. et al. The PRIDE database and related tools and resources in 2019: improving support for quantification data. Nucleic Acids Res. 47 (D1), D442–D450 (2019).
Perkel, J. M. Why Jupyter is data scientists’ computational notebook of choice. Nature 563, 145–146 (2018).

Acknowledgements

We thank all members of the Proteomics and Signal Transduction Group and the Clinical Proteomics Group at the Max Planck Institute of Biochemistry, Martinsried, for help and discussions, and in particular I. Paron, C. Deiml, A. Strasser and B. Splettstoesser for technical assistance. We further thank the P. Bork group for supplying bacteria, the A. Pichlmair group for virus samples, F. Hosp for A. thaliana, I. Sinning for Neurospora crassa and the K.-P. Janssen group for cell line samples. Our work was partially supported by the Max Planck Society for the Advancement of Science, by the European Union’s Horizon 2020 research and innovation program with the Microb-Predict project (grant 825694), by grants from the Novo Nordisk Foundation (NNF15CC0001 and NNF15OC0016692), and by the Deutsche Forschungsgemeinschaft (DFG) project ‘Chemical proteomics inside us’ (grant 412136960).

Author information

These authors contributed equally: Johannes B. Müller, Philipp E. Geyer

Authors and Affiliations

Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
Johannes B. Müller, Philipp E. Geyer, Peter V. Treit, Maximilian T. Strauss, Mario Oroshi, Sophia Doll, Sebastian Virreira Winter, Jakob M. Bader & Matthias Mann
OmicEra Diagnostics GmbH, Planegg, Germany
Philipp E. Geyer, Maximilian T. Strauss, Sophia Doll & Sebastian Virreira Winter
NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
Ana R. Colaço, Alberto Santos & Matthias Mann
Helmholtz Zentrum München–German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Munich, Germany
Niklas Köhler & Fabian Theis
Technical University of Munich, Department of Mathematics, Garching, Germany
Fabian Theis
Li-Ka Shing Big Data Institute, University of Oxford, Oxford, UK
Alberto Santos

Contributions

J.B.M. and P.E.G. designed the experiments, performed and interpreted the MS-based proteomic analyses, carried out bioinformatics analyses and generated text and figures for the manuscript. P.V.T., S.D., S.V.W. and J.M.B. designed experiments and performed MS-based proteomics analyses. A.R.C. and A.S. integrated annotation data with proteomics data and implemented the Python code as well as graph-based structures. A.S. and M.O. implemented the web-accessible analyses. N.K., F.T. and M.T.S. carried out the machine learning analysis. M.M. supervised and guided the project, designed the experiments, interpreted MS-based proteomics data and wrote the manuscript.

Corresponding author

Correspondence to Matthias Mann.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature thanks Joshua Coon, Vera van Noort and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Comparison of the peptide retention times obtained by a μPAC and a fused silica capillary column.

a, The histograms illustrate the distribution of coefficients of variation (CVs) calculated from peptide retention times obtained by a μPAC and a fused silica capillary column. The CVs were calculated for peptides from 12 measurements of a HeLa cell digest on each column. b, All components, including lines, connectors, the column and the emitter, are displayed together with grounding and spray voltage connections. The pico tip emitter is from New Objective (catalogue number FS360-20-10-N-5-105CT).
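The coefficient of variation described in panel a can be illustrated with a minimal sketch. It assumes the CV is the sample standard deviation divided by the mean, expressed in percent; the caption does not state which estimator was used, and the peptide names and retention times below are hypothetical, not data from the study.

```python
from statistics import mean, stdev


def retention_time_cv(retention_times):
    """Return the coefficient of variation (percent) of one peptide's
    retention times across repeated measurements.

    Assumption: sample standard deviation (n - 1) divided by the mean.
    """
    if len(retention_times) < 2:
        raise ValueError("need at least two measurements")
    return 100.0 * stdev(retention_times) / mean(retention_times)


# Hypothetical retention times in minutes for two peptides (illustrative only).
runs = {
    "PEPTIDEA": [30.0, 30.2, 29.8, 30.0],
    "PEPTIDEB": [55.0, 56.0, 54.0, 55.0],
}

# One CV per peptide; a histogram of these values would correspond to panel a.
cvs = {peptide: retention_time_cv(times) for peptide, times in runs.items()}
```

A lower CV indicates more reproducible elution, which is the property the histograms compare between the two column types.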

Extended Data Fig. 2 Interlaboratory reproducibility and prediction of peptide retention time on the μPAC column.

a, The ability to produce chip-based columns in a reproducible manner, coupled with the statically fixed micrometre-sized pillars, results in highly reproducible performance and interlaboratory transferability of the μPAC-based approach. Shown are the corrected retention times of an excerpt of 5,000 peptides from the 43,000 overlapping peptides measured in two different HeLa cell digests by our Munich and Copenhagen laboratories, resulting in a Pearson correlation coefficient of peptide retention times of 0.995. b, To validate our model for predicting peptide retention times, we plot an excerpt of 1,000 peptides from the complete test-set of 54,490 peptides, with experimentally determined values on the x axis and predicted values on the y axis. The Pearson’s R² correlation value for the complete predicted peptide set is 0.99.
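The correlation statistics quoted in this legend can be reproduced in principle with a short sketch of the Pearson coefficient. The function name and the paired retention times are assumptions made for illustration; the authors' own analysis code is in the repository named under Code availability and may differ.

```python
from math import sqrt


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the centred values.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical retention times (minutes) of the same peptides in two laboratories.
lab_a = [10.0, 20.0, 30.0, 40.0]
lab_b = [10.5, 19.5, 30.5, 39.5]

r = pearson_r(lab_a, lab_b)
r_squared = r ** 2  # the quantity reported as R² for predicted vs measured values
```

The same function applies to panel b by passing measured retention times as `x` and predicted ones as `y`.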

Extended Data Fig. 3 Total numbers of identified peptides from 100 organisms across the tree of life.

Peptides uniquely identified in a single organism are colour-coded to distinguish them from peptides identified in multiple species. Orange, archaea; blue, eukaryotes; green, bacteria.

Extended Data Fig. 4 Comparison and characterization of the LSTM model for predicting peptide retention times.

a, Box plots comparing R² scores obtained from different models of peptide retention time, calculated from linear regressions of the predicted test-set retention times against the measured peptide retention times. Sample sizes are shown in b. b, Table comparing the different models of peptide retention time. The training set was reduced in size (number of peptides included) in order to account for the exponentially growing calculation time of certain models. Statistics represent the linear regression of the predicted test-set retention times against the measured retention times. c, Characterization of the LSTM model applied here for different sizes of training peptide set.

Extended Data Fig. 5 Overview of our data set of 100 organisms across the tree of life.

a, Illustration of all direct taxonomic levels below the superkingdom level that are covered by our data set. DPANN, Diapherotrites, Parvarchaeota, Aenigmarchaeota, Nanoarchaeota and Nanohaloarchaeota; FCB, Fibrobacteres, Chlorobi and Bacteroidetes; PCV, Planctomycetes, Chlamydiae and Verrucomicrobia; TACK, Thaumarchaeota, Crenarchaeota and Korarchaeota. b, Number of protein identification codes (IDs) in this study and their relation to TrEMBL IDs found in the PRIDE archive. c, Comparison of the Swiss-Prot database to the data set in this study with regards to organism and protein numbers. d, Numbers of identified protein groups and UniProt protein entries for all 100 organisms in our data set. The UniProt protein-entry identifications are colour-coded into Swiss-Prot (reviewed) and TrEMBL (predicted) entries.

Extended Data Fig. 6 Dynamic range curves for all organisms analysed here.

Protein intensities are log₁₀-scaled and plotted against the abundance rank of each protein.
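A dynamic range curve of this kind can be sketched as follows. This is a minimal illustration that assumes zero or missing intensities are dropped before the log transform; the function name and the input values are hypothetical.

```python
from math import log10


def dynamic_range_curve(intensities):
    """Return (rank, log10 intensity) pairs, with rank 1 the most abundant protein.

    Assumption: non-positive intensities are excluded because their
    logarithm is undefined.
    """
    positive = [value for value in intensities if value > 0]
    ranked = sorted(positive, reverse=True)
    return [(rank, log10(value)) for rank, value in enumerate(ranked, start=1)]


# Hypothetical protein intensities for one organism (illustrative only).
curve = dynamic_range_curve([1e6, 1e9, 0.0, 1e7])

# Span of the curve in orders of magnitude, from most to least abundant protein.
orders_of_magnitude = curve[0][1] - curve[-1][1]
```

Plotting the second element of each pair against the first gives the curve described in the legend.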

Extended Data Fig. 7 Cumulative protein intensities for all organisms analysed here.

On the x axis, proteins are ranked according to their abundance; the y axis shows the cumulative protein intensity. Proteins missing biological-process annotation are highlighted by grey lines in the background.

Extended Data Fig. 8 Quantitative analysis of different enzyme classes and functional protein domains across the tree of life.

a, We classified the contribution of peptides to the top 90% of protein mass within all 100 organisms according to the enzyme commission (EC) number, using the Unipept web-tool (https://unipept.ugent.be/). The alluvial plot illustrates the proportions of each enzyme class across all organisms in our study. b, Comparison of the three domains of life with respect to their normalized contribution of peptides to each enzyme class. c, Proteins that contribute to the top 90% of the protein mass within all 100 organisms studied herein were annotated according to their known functional protein domains, and the intensities for different functional domains of an organism were summed to display the most abundant functional protein domains across the tree of life. The intensity is displayed on a log₁₀ scale.
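The selection of proteins that make up the top 90% of protein mass can be sketched as a cumulative sum over abundance-ranked intensities. This is an assumed reading of the legend rather than the authors' implementation: summed intensity stands in for protein mass, and the function name, protein identifiers and values are hypothetical.

```python
def top_mass_proteins(intensities, fraction=0.9):
    """Return the most abundant proteins that together reach `fraction`
    of the summed intensity, in descending order of abundance."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    threshold = fraction * sum(intensities.values())
    selected, running = [], 0.0
    ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
    for protein, value in ranked:
        if running >= threshold:
            break  # the requested share of the total is already covered
        selected.append(protein)
        running += value
    return selected


# Hypothetical intensities for four proteins of one organism (illustrative only).
example = {"P1": 60.0, "P2": 25.0, "P3": 10.0, "P4": 5.0}
top90 = top_mass_proteins(example)
```

With these values the first three proteins are kept, because the two most abundant ones cover only 85% of the total.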

Extended Data Fig. 9 Quantitative analysis of specific biological processes across the tree of life.

a, Linear display showing a global view of the expression levels of functional groups across the 100 organisms from Fig. 4. Summed intensities for functional terms are shown as grey lines, with the top ten most abundant terms in all organisms colour-coded according to the top key. b, Quantitative analysis of specific biological processes from the superkingdom of eukaryotes. Proteins were annotated with biological processes, and the intensities for each annotation term within an organism were summed. Those biological processes that display differential expression across the superkingdom as well as photosynthetic processes are highlighted according to the bottom key.
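Summing intensities per annotation term within one organism can be sketched as below. The sketch assumes that a protein annotated with several terms contributes its full intensity to each of them, which the legend does not specify; the function name, identifiers and terms are hypothetical.

```python
from collections import defaultdict


def summed_intensity_per_term(protein_intensities, protein_terms):
    """Return the summed intensity for each annotation term in one organism.

    Assumption: a protein with several terms adds its full intensity to
    every one of them; proteins without annotation are ignored.
    """
    totals = defaultdict(float)
    for protein, intensity in protein_intensities.items():
        for term in protein_terms.get(protein, ()):
            totals[term] += intensity
    return dict(totals)


# Hypothetical intensities and biological-process annotations (illustrative only).
intensities = {"P1": 5.0, "P2": 3.0, "P3": 2.0}
terms = {
    "P1": ["protein folding"],
    "P2": ["protein folding", "glycolytic process"],
    "P3": [],
}

totals = summed_intensity_per_term(intensities, terms)
```

Repeating this per organism and comparing the totals across organisms gives the kind of view shown in the figure.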

Extended Data Fig. 10 Modified peptides.

Sum of modified peptides per organism, identified with pFind (http://pfind.ict.ac.cn/software/pFind3/index.html) and colour-coded for archaea (red), eukaryotes (blue) and bacteria (green).

Supplementary information

Reporting Summary

Supplementary Table 1: Organisms analyzed in the study. All organisms analyzed in the study are listed with source and taxonomy.

Supplementary Table 2: Identified and quantified protein groups. All identified protein groups for the 100 organisms are listed and quantitative information is added for quantified protein groups.

Supplementary Table 3: Reported modified peptides. Peptides with biologically relevant modifications as found by the pFind tool are listed.

Supplementary Table 4: Identified and quantified protein groups for 14 human cell lines. The deep human proteome derived from 14 human cell lines is listed with all identified and quantified protein groups.

Supplementary Table 5: Detailed summary information for technical and biological proteomics data. Technically relevant information on the mass spectrometry data for the 100 organism proteomes is listed.

Supplementary Table 6: Annotation data for the 100 most abundant proteins of the 100 organisms. The 100 most abundant protein groups per organism are listed with annotation data.

Cite this article

Müller, J.B., Geyer, P.E., Colaço, A.R. et al. The proteome landscape of the kingdoms of life. Nature 582, 592–596 (2020). https://doi.org/10.1038/s41586-020-2402-x

Received: 02 August 2019
Accepted: 27 April 2020
Published: 17 June 2020
Issue Date: 25 June 2020
DOI: https://doi.org/10.1038/s41586-020-2402-x
