Mass-spectrometry-based draft of the human proteome

Journal name: Nature
Volume: 509
Pages: 582–587
DOI: doi:10.1038/nature13319

Abstract

Proteomes are characterized by large protein-abundance differences, cell-type- and time-dependent expression patterns and post-translational modifications, all of which carry biological information that is not accessible by genomics or transcriptomics. Here we present a mass-spectrometry-based draft of the human proteome and a public, high-performance, in-memory database for real-time analysis of terabytes of big data, called ProteomicsDB. The information assembled from human tissues, cell lines and body fluids enabled estimation of the size of the protein-coding genome, and identified organ-specific proteins and a large number of translated lincRNAs (long intergenic non-coding RNAs). Analysis of messenger RNA and protein-expression profiles of human tissues revealed conserved control of protein abundance, and integration of drug-sensitivity data enabled the identification of proteins predicting resistance or sensitivity. The proteome profiles also hold considerable promise for analysing the composition and stoichiometry of protein complexes. ProteomicsDB thus enables navigation of proteomes, provides biological insight and fosters the development of proteomic technology.

At a glance

Figures

  1. Strategy for the assembly of the human proteome.
    Figure 1: Strategy for the assembly of the human proteome.

    a, Experimental workflow for the identification and quantification of proteins. b, Structure and features of ProteomicsDB. ProteomicsDB consists of a repository part for raw-data storage and an in-memory database designed for the storage, analysis and visualization of proteomic data sets. Fast computation on large data sets is backed by 160 CPUs and 2 TB of RAM.

  2. Characterization of the human proteome.
    Figure 2: Characterization of the human proteome.

    a, Chromosomal coverage of the 18,097 proteins identified in this study exceeds 90% in all but three cases. Blue bars indicate the density of proteins in a particular chromosomal region. b, Gene ontology analysis of the ‘missing proteome’ identifies GPCRs, secreted and keratin-associated proteins as the major protein classes underrepresented in proteomic experiments. CDS, coding sequence; Mt, mitochondrial DNA.

  3. Global protein expression analysis.
    Figure 3: Global protein expression analysis.

    a, Protein expression in different tissues and cell lines, showing that levels of housekeeping (GAPDH), signalling (EGFR) and tumour-associated (CTNNB1) proteins can vary substantially between tissues (grouped by colour). b, PCA showing that cell lines (circles) retain protein-expression characteristics of their respective primary tissue (triangles) and that proteomes of different organs are more diverse. c, Hierarchical clustering of the 100 most highly expressed proteins from each of 47 tissues and body fluids. Despite the presence of a large group of common proteins, clusters of organ- and fluid-selective proteins with respective biological functions can readily be identified.

  4. Functional protein expression analysis.
    Figure 4: Functional protein expression analysis.

    Quantitative expression analysis of 906 kinases and transcription factors across 24 tissues (top panel) identifies organ-selective signatures indicative of the underlying biology. The highlighted cluster in spleen contains the kinases LCK, ZAP70 and JAK and the transcription factors SIGIRR, NFKBIE and NLRC3 with strong links to the immune system (bottom panel). Yellow oval represents a cell; blue oval represents the nucleus.

  5. Integration and utility of large proteomic data collections.
    Figure 5: Integration and utility of large proteomic data collections.

    a, Analysis of mRNA and protein levels across 12 organs shows that the protein/mRNA ratio is largely conserved (top panel). The median translation rates of all transcripts across all tissues correlate well with protein abundance (bottom-left panel), leading to the ability to predict individual protein levels from the respective mRNA levels (bottom-right panel). b, Elastic net analysis for the identification of drug sensitivity (positive-effect-size) or resistance (negative-effect-size) markers against the EGFR kinase inhibitors erlotinib and lapatinib in cancer cell lines. c, Analysis of the composition and stoichiometry of the proteasome. Top-left panel, schematic structure of the ‘constitutive’ proteasome and the ‘immunoproteasome’ (marked by the suffix ‘i’). Middle-left and bottom-left panels, stoichiometry derived by iBAQ of the constitutive proteasome (grey) and the immunoproteasome (red) in the salivary gland and the lymph node. Top-right panel, expression analysis of the β1 subunit across more than 100 tissue and cell-line proteomes reveals that many cells express both forms of the proteasome. Bottom-right panel, expression correlation analysis of all β subunits across the said tissues and cell lines showing strong co-expression of the β1i, β2i and β5i subunits as well as all other β-subunits but no correlation with the expression of the corresponding β1, β2 and β5 subunits. d, ProteomicsDB enables the computation of molecular interferences in selected reaction-monitoring experiments (SRM) from experimental data. The transition of the target peptide LHYGLPVVVK (y8 fragment ion, β-catenin) is marked with an arrow. All other circles in the plot are interfering SRM transitions of other peptides found in ProteomicsDB that fall within the same mass tolerance of the experiment (here, 0.7 Da). The size of each circle indicates the severity of the interference. The inset shows that interference can be substantially reduced by the use of high-resolution fragment-ion data (here, 0.04 Da) and confining the analysis to the tissue from which a sample is derived (here, a colon sample).

  6. Peptide and protein identifications.
    Extended Data Fig. 1: Peptide and protein identifications.

    a, Spectrum viewer enabling access to more than 70 million annotated tandem mass spectra of endogenous peptides and synthetic reference standards in real time. b, Peptide length and score distribution for targets and decoys for the search engine Mascot. It is of note that the peptide- and protein-identification criteria followed a two-step process. First, for each LC-MS/MS run, we applied a global 1% target-decoy false discovery rate (FDR) cut on the level of peptide spectrum matches (PSMs, not shown); second, we applied a peptide-length-dependent local FDR cut of 5% for all PSMs and the results are depicted here. c, Same as in b but for the search engine Andromeda. d, e, Heat maps showing FDRs as a function of search engine score and peptide length. Solid lines indicate the 5% local FDR.

  7. Protein-identification quality in very large data sets.
    Extended Data Fig. 2: Protein-identification quality in very large data sets.

    a, First filtering step. The first step filters every LC-MS/MS run at 1% PSM FDR. Top panel, score distribution for target and decoy PSMs following 1% PSM FDR filtering for MaxQuant identifications. Bottom panel, the binned peptide-length distribution for target PSMs. b, Same as a but for Mascot identifications. c, Second filtering step. Same as a, but this time applying an additional 5% local length- and score-dependent FDR on the total aggregated data for MaxQuant identifications in ProteomicsDB. It is apparent that the second filtering step improves the FDR about threefold and removes most PSMs shorter than 9 amino acids. d, Same as c but for Mascot identifications in ProteomicsDB. e, Comparative analysis of protein FDR characteristics of two different approaches based on Mascot analysis. In the classical target-decoy approach, aggregation of large quantities of data leads to accumulation of large numbers of decoy proteins and a concomitant loss of true target proteins when filtering the data at 1% protein FDR. The alternative ‘picked’ target-decoy method does not suffer from this scaling problem and maintains a constant decoy rate (and therefore lower protein FDR) but at the expense of lower sensitivity of target protein detection compared to the classical target-decoy approach. Please refer to the Supplementary Information for details and a discussion on the topic. Note that the two protein FDR methods were not used in this manuscript. Instead, we used the criteria shown in a and b.

  8. Further characterization of the proteome.
    Extended Data Fig. 3: Further characterization of the proteome.

    a, Some proteins are refractory to identification using tryptic digestion because they do not generate sufficient—or any—peptides that are within the productive mass range of a mass spectrometer typically used for bottom-up proteomics. This can be improved by the use of alternative proteases; for example, chymotrypsin as shown here for one of the many keratin-associated proteins localized on chromosome 21 (detected chymotryptic peptides in red). b, c, Translation of lincRNAs is rare but does exist and can be identified (b) across all chromosomes as well as (c) in many tissues and in HeLa cells. d, Peptide-intensity distribution of protein-coding genes and non-coding transcripts. Interestingly, the abundance of translated lincRNAs is broadly similar to that of classical proteins.

  9. Further characterization of the proteome.
    Extended Data Fig. 4: Further characterization of the proteome.

    a, Proteome coverage rapidly saturates with the addition of shotgun proteomic data. Tissue proteomes saturate at approximately 16,000 proteins, but both body fluids and cell lines add small but noticeable numbers of proteins not covered in the tissues (see also b and c for a different ordering of samples). This indicates that proteome coverage is likely not to increase much more by merely adding high-throughput data (although it may increase confidence in protein identifications and will probably also increase sequence coverage). b, Same plot as a but different ordering of samples. c, Saturation plots showing that PTMs and affinity purifications each contribute distinctly to the coverage of the proteome. d, Comparison of five large-scale projects suggesting that a ‘core proteome’ of 10,000–12,000 ubiquitously expressed proteins exists. Ellipses represent the corresponding publications. e, Abundance distribution of the ‘core proteome’ based on the normalized iBAQ method. The most highly expressed 10% of proteins are dominated by proteins relating to energy production and protein synthesis. The least abundant 10% of proteins are enriched in proteins with regulatory functions. f, Tree-view summary of Gene Ontology (GO) term analysis for the proteins constituting the ‘core proteome’, showing that the core proteome is mainly concerned with biological processes relating to the homeostasis and life cycle of cells. The colours represent the broader categories of the treemap.

  10. Comparative analysis of five intensity-based label-free absolute-quantification approaches.
    Extended Data Fig. 5: Comparative analysis of five intensity-based label-free absolute-quantification approaches.

    a, Linearity of intensity (U2-OS cell line data from ref. 22) and copies per cell for absolute protein quantification (AQUA)-quantified proteins (red dots, red regression line; same cell line, ref. 30) and derived copy-number estimates (grey dots, blue regression line; from the same study). b, Total sum normalization re-scales intensity distributions of Colo-205 cell digests measured on two different mass spectrometers (Orbitrap Elite data in red, LTQ Orbitrap XL data in blue; ref. 24). c, Quantile-quantile (Q-Q) plots of the normalized data presented in b illustrating good alignment of data across 4.5 orders of magnitude. d, Empirical cumulative density function (ECDF) of error distributions derived from a showing that all five methods have merit. e, Comparison of the fold error of iBAQ and top3 as a function of the number of quantified peptides. f, Same as e but for protein length. When peptide numbers are low, iBAQ shows errors that are slightly smaller in magnitude compared to the top3 method. g, Comparison of iBAQ and total sum normalized iBAQ for heavy SILAC-labelled MCF-7 cell digests (red bars; ref. 32) and label-free quantified MCF-7 cell digests (same as MCF-7 deep proteome in a; blue bars) before (left panel) and after normalization (right panel) showing no influence of the presence of the SILAC label on quantification results. h, Comparison of iBAQ and total sum normalized iBAQ for iTRAQ reporter-ion-intensity-based quantification (red bars; MCF-7 cell digest, ref. 46) and label-free quantified MCF-7 cell digests (blue bars; same as a and c) before (left panel) and after normalization (right panel). The intensity-distribution characteristics of iTRAQ and label-free measurements are too different to allow for comparative analyses of MS1- and MS2-based quantification data. i, Normalized iBAQ distributions of 347 cell-line and tissue proteomes (all MS1 quantified) available in ProteomicsDB showing the general applicability of MS1-based quantification across many sources of biological material.

  11. Functional protein-expression analysis.
    Extended Data Fig. 6: Functional protein-expression analysis.

    Gene ontology analysis of proteins with expression levels 10-fold above average in a particular organ or body fluid invariably highlights protein signatures with direct organ-related functional significance.

  12. Protein- versus mRNA-expression analysis.
    Extended Data Fig. 7: Protein- versus mRNA-expression analysis.

    a, Comparison of mRNA and protein expression of 12 human tissues showing the generally rather poor correlation of protein and mRNA levels, implying the widespread application of transcriptional, translational and post-translational control mechanisms of protein-abundance regulation. Spearman correlation coefficients vary from 0.41 (thyroid gland) to 0.55 (kidney). ‘Corner proteins’ (0.5 logs to either side of zero) are marked in colours. b, Clustering of mRNA expression (left triangle) and protein expression (right triangle) across the 12 tissues does not reveal tissues with common profiles, suggesting that the transcriptomes and proteomes of human tissues are quite different from each other. c, The ratio of protein and mRNA level for a protein is approximately constant across many tissues. The heat map shows proteins and tissues clustered according to their protein/mRNA ratio. d, Protein abundance can be predicted from mRNA levels. Using the median ratio of protein/mRNA across 12 tissues, it is possible to predict protein levels from mRNA levels for every tissue with a good correlation coefficient, underscoring the importance of the translation rate (and mRNA levels) on protein expression.

  13. Protein markers for drug sensitivity and resistance.
    Extended Data Fig. 8: Protein markers for drug sensitivity and resistance.

    a, Elastic net analysis of protein expression and drug sensitivity for the EGFR kinase inhibitor erlotinib. Positive-effect-size values indicate that high protein expression is associated with drug sensitivity. Negative-effect-size values indicate that high protein expression is associated with drug resistance. b, Same as in a but for the EGFR kinase inhibitor lapatinib. c, Correlation analysis of the elastic net effect sizes for erlotinib and lapatinib (proteins with elastic net frequencies of less than 600 are not shown for clarity). Proteins in the top-right quadrant are common markers for drug sensitivity (including EGFR as the primary target of both drugs). Proteins in the bottom-left quadrant are common markers for drug resistance (including S100A4, a known resistance marker for lapatinib). Proteins that are strong markers for sensitivity or resistance are annotated in each plot and most proteins can be easily placed into EGFR signalling and regulation pathways.

  14. Protein complex composition and stoichiometry from shotgun proteomic data.
    Extended Data Fig. 9: Protein complex composition and stoichiometry from shotgun proteomic data.

    a, Stoichiometry of the nuclear pore complex (NPC) reconstructed from shotgun proteomics data. To illustrate that normalized iBAQ values from shotgun experiments actually reflect protein copy numbers, we reconstructed the stoichiometry of the NPC (blue bars, data from nuclear extracts of HeLa cells, ref. 39; error bars indicate standard deviation from triplicate experiments) and compared it to the stoichiometry determined in the same study using AQUA peptides and SRM experiments (red bars). Note that most of the time, the stoichiometries are in very good agreement between the methods and the stoichiometries reported in the literature. b, Stoichiometry of the α- and β-subunits of the proteasome reconstructed from shotgun proteomics data (examples). β-subunits of the constitutive proteasome are indicated in grey, immunoproteasome subunits (β1i, β2i, β5i) are indicated in red. Note that PC-3 cells are devoid of the immunoproteasome, whereas cells in the lymph node almost exclusively express this version of the molecular machine. c, Systematic assessment of the fraction of βi subunits (red bars) and β-subunits (grey bars) across 29 tissue samples and 80 cell-line samples (tissue data from human body map (this study), cell-line data from refs 22 and 24). Note that many cell lines and tissues contain both versions of the proteasome and the data also suggest that further forms of the proteasome with different subunit compositions may exist.

  15. Examples for the analytical utility of large mass-spectrometry-based data collected in ProteomicsDB.
    Extended Data Fig. 10: Examples for the analytical utility of large mass-spectrometry-based data collected in ProteomicsDB.

    a, Enumeration of post-translational modifications and protein termini. b, Computation of proteotypic peptides. Generally the same one to five peptides are identified every time a protein is identified (top panel) making proteotypic peptides useful for assessing protein identification and as reagents for targeted mass-spectrometry measurements. We note that the proteotypicity of a peptide strongly depends on the presence or absence of a chemical modification (bottom panel, here tandem mass tags (TMT) or isobaric tags for relative and absolute quantification (iTRAQ)). c, Analysis of the selectivity of SRM transitions. The top panel shows the y8 transition of the peptide LHYGLPVVVK (β-catenin, marked with an arrow) in a slice of the precursor and fragment-ion window of 0.7 Da and 0.7 Da, respectively, typically employed on triple-quadrupole mass spectrometers. The size of the circle represents the relative intensity of the y8 fragment in a full tandem mass spectrum of this peptide. All other circles are interfering peptides (extracted from the entire ProteomicsDB) that have precursor and fragment ions in the same m/z window and with varying intensities (circle size). Interference can be reduced by using high-resolution mass spectrometry (middle panel) and confining the analysis to the tissue in question (here, a colon sample, bottom panel). Such interference plots in conjunction with the proteotypicity of peptides can be valuable for the design of targeted proteomic experiments.
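
The interference analysis described for the SRM transitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the ProteomicsDB implementation: the `Transition` record, the `interfering` function, the peptide names and all m/z values are hypothetical, and the window is assumed to be a total width centred on the target (so 0.7 Da means ±0.35 Da).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Transition:
    """Hypothetical record of one precursor/fragment pair."""
    peptide: str
    precursor_mz: float
    fragment_mz: float
    rel_intensity: float  # relative fragment intensity in its spectrum
    tissue: str


def interfering(target: Transition,
                library: List[Transition],
                q1_window: float,
                q3_window: float,
                tissue: Optional[str] = None) -> List[Transition]:
    """Return transitions of other peptides that fall inside both windows.

    Windows are total widths in Da, centred on the target (an assumption).
    If `tissue` is given, only transitions observed in that tissue count.
    Results are sorted by relative intensity, strongest first.
    """
    hits = []
    for t in library:
        if t.peptide == target.peptide:
            continue  # the target cannot interfere with itself
        if tissue is not None and t.tissue != tissue:
            continue  # confine the analysis to one tissue
        same_q1 = abs(t.precursor_mz - target.precursor_mz) <= q1_window / 2
        same_q3 = abs(t.fragment_mz - target.fragment_mz) <= q3_window / 2
        if same_q1 and same_q3:
            hits.append(t)
    return sorted(hits, key=lambda t: t.rel_intensity, reverse=True)


# Toy data: all names and m/z values are invented for illustration.
target = Transition("TARGETPEPK", 500.30, 800.50, 1.0, "colon")
library = [
    Transition("PEPA", 500.10, 800.60, 0.4, "colon"),
    Transition("PEPB", 500.31, 800.51, 0.9, "liver"),
    Transition("PEPC", 500.90, 800.50, 0.7, "colon"),
]

# Unit-resolution windows (0.7 Da / 0.7 Da), all tissues.
low_res = interfering(target, library, 0.7, 0.7)
# High-resolution fragment window (0.04 Da), all tissues.
high_res = interfering(target, library, 0.7, 0.04)
# High-resolution fragment window, restricted to colon.
high_res_colon = interfering(target, library, 0.7, 0.04, tissue="colon")

print([t.peptide for t in low_res])
print([t.peptide for t in high_res])
print([t.peptide for t in high_res_colon])
```

In this toy library, narrowing the fragment window and then restricting to a single tissue each remove candidate interferences, which mirrors the trend the caption describes. Real data would additionally need charge states, retention times and a measure of interference severity.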

References

  1. UniProt Consortium. Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 41, D43–D47 (2013)
  2. Paik, Y. K. et al. The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nature Biotechnol. 30, 221–223 (2012)
  3. Uhlen, M. et al. Towards a knowledge-based Human Protein Atlas. Nature Biotechnol. 28, 1248–1250 (2010)
  4. Vizcaíno, J. A. et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nature Biotechnol. 32, 223–226 (2014)
  5. Farrah, T. et al. State of the human proteome in 2013 as viewed through PeptideAtlas: comparing the kidney, urine, and plasma proteomes for the biology- and disease-driven Human Proteome Project. J. Proteome Res. 13, 60–75 (2014)
  6. Wang, M. et al. PaxDb, a database of protein abundance averages across all three domains of life. Mol. Cell. Proteomics 11, 492–500 (2012)
  7. Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnol. 26, 1367–1372 (2008)
  8. Perkins, D. N., Pappin, D. J., Creasy, D. M. & Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999)
  9. Gupta, N., Bandeira, N., Keich, U. & Pevzner, P. A. Target-decoy approach and false discovery rate: when things may go wrong. J. Am. Soc. Mass Spectrom. 22, 1111–1120 (2011)
  10. Higdon, R. et al. IPM: An integrated protein model for false discovery rate estimation and identification in high-throughput proteomics. J. Proteomics 75, 116–121 (2011)
  11. Beausoleil, S. A., Villen, J., Gerber, S. A., Rush, J. & Gygi, S. P. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nature Biotechnol. 24, 1285–1292 (2006)
  12. Reiter, L. et al. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Mol. Cell. Proteomics 8, 2405–2417 (2009)
  13. Nagaraj, N. et al. Deep proteome and transcriptome mapping of a human cancer cell line. Mol. Syst. Biol. 7, 548 (2011)
  14. Tran, J. C. et al. Mapping intact protein isoforms in discovery mode using top-down proteomics. Nature 480, 254–258 (2011)
  15. Lane, L. et al. Metrics for the Human Proteome Project 2013–2014 and strategies for finding missing proteins. J. Proteome Res. 13, 15–20 (2014)
  16. Cabili, M. N. et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 25, 1915–1927 (2011)
  17. Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101–108 (2012)
  18. Bánfai, B. et al. Long noncoding RNAs are rarely translated in two human cell lines. Genome Res. 22, 1646–1657 (2012)
  19. Guttman, M., Russell, P., Ingolia, N. T., Weissman, J. S. & Lander, E. S. Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell 154, 240–251 (2013)
  20. Ingolia, N. T., Lareau, L. F. & Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147, 789–802 (2011)
  21. Flintoft, L. Non-coding RNA: Ribosomes, but no translation, for lincRNAs. Nature Rev. Genet. 14, 520 (2013)
  22. Geiger, T., Wehner, A., Schaab, C., Cox, J. & Mann, M. Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins. Mol. Cell. Proteomics 11, M111.014050 (2012)
  23. Mertins, P. et al. Integrated proteomic analysis of post-translational modifications by serial enrichment. Nature Methods 10, 634–637 (2013)
  24. Moghaddas Gholami, A. et al. Global proteome analysis of the NCI-60 cell line panel. Cell Rep. 4, 609–620 (2013)
  25. Shiromizu, T. et al. Identification of missing proteins in the neXtProt database and unregistered phosphopeptides in the PhosphoSitePlus database as part of the Chromosome-centric Human Proteome Project. J. Proteome Res. 12, 2414–2421 (2013)
  26. Schirle, M., Heurtier, M. A. & Kuster, B. Profiling core proteomes of human cell lines by one-dimensional PAGE and liquid chromatography-tandem mass spectrometry. Mol. Cell. Proteomics 2, 1297–1305 (2003)
  27. Fagerberg, L. et al. Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Mol. Cell. Proteomics 13, 397–406 (2014)
  28. Hughes, G. M., Teeling, E. C. & Higgins, D. G. Loss of olfactory receptor function in hominin evolution. PLoS ONE 9, e84714 (2014)
  29. Ahrné, E., Molzahn, L., Glatter, T. & Schmidt, A. Critical assessment of proteome-wide label-free absolute abundance estimation strategies. Proteomics 13, 2567–2578 (2013)
  30. Beck, M. et al. The quantitative proteome of a human cell line. Mol. Syst. Biol. 7, 549 (2011)
  31. Schwanhäusser, B. et al. Global quantification of mammalian gene expression control. Nature 473, 337–342 (2011)
  32. Geiger, T. et al. Initial quantitative proteomic map of 28 mouse tissues using the SILAC mouse. Mol. Cell. Proteomics 12, 1709–1722 (2013)
  33. Low, T. Y. et al. Quantitative and qualitative proteome characteristics extracted from in-depth integrated genomics and proteomics analysis. Cell Rep. 5, 1469–1478 (2013)
  34. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012)
  35. Koumangoye, R. B. et al. Reduced annexin A6 expression promotes the degradation of activated epidermal growth factor receptor and sensitizes invasive breast cancer cells to EGFR-targeted tyrosine kinase inhibitors. Mol. Cancer 12, 167 (2013)
  36. Klingelhöfer, J. et al. Epidermal growth factor receptor ligands as new extracellular targets for the metastasis-promoting S100A4 protein. FEBS J. 276, 5936–5948 (2009)
  37. Argenzio, E. et al. Proteomic snapshot of the EGF-induced ubiquitin network. Mol. Syst. Biol. 7, 462 (2011)
  38. Havugimana, P. C. et al. A census of human soluble protein complexes. Cell 150, 1068–1081 (2012)
  39. Ori, A. et al. Cell type-specific nuclear pores: a case in point for context-dependent stoichiometry of molecular machines. Mol. Syst. Biol. 9, 648 (2013)
  40. Hisamatsu, H. et al. Newly identified pair of proteasomal subunits regulated reciprocally by interferon gamma. J. Exp. Med. 183, 1807–1816 (1996)
  41. Nandi, D., Jiang, H. & Monaco, J. J. Identification of MECL-1 (LMP-10) as the third IFN-gamma-inducible proteasome subunit. J. Immunol. 156, 2361–2364 (1996)
  42. Mallick, P. et al. Computational prediction of proteotypic peptides for quantitative proteomics. Nature Biotechnol. 25, 125–131 (2007)
  43. Domon, B. Considerations on selected reaction monitoring experiments: implications for the selectivity and accuracy of measurements. Proteomics Clin. Appl. 6, 609–614 (2012)
  44. Gallien, S. et al. Targeted proteomic quantification on quadrupole-orbitrap mass spectrometer. Mol. Cell. Proteomics 11, 1709–1723 (2012); doi:10.1074/mcp.O112.019802
  45. Marx, H. et al. A large synthetic peptide and phosphopeptide reference library for mass spectrometry-based proteomics. Nature Biotechnol. 31, 557–564 (2013)
  46. Johannsson, H. J. et al. Retinoic acid receptor alpha is associated with tamoxifen resistance in breast cancer. Nature Commun. 4, 2175 (2013)

Author information

  1. These authors contributed equally to this work.

    • Mathias Wilhelm,
    • Judith Schlegl,
    • Hannes Hahne &
    • Amin Moghaddas Gholami

Affiliations

  1. Chair of Proteomics and Bioanalytics, Technische Universität München, Emil-Erlenmeyer Forum 5, 85354 Freising, Germany

    • Mathias Wilhelm,
    • Hannes Hahne,
    • Amin Moghaddas Gholami,
    • Harald Marx,
    • Simone Lemeer &
    • Bernhard Kuster
  2. SAP AG, Dietmar-Hopp-Allee 16, 69190 Walldorf, Germany

    • Mathias Wilhelm,
    • Judith Schlegl,
    • Marcus Lieberenz,
    • Emanuel Ziegler,
    • Lars Butzmann,
    • Siegfried Gessulat,
    • Joos-Hendrik Boese,
    • Anja Gerstmair &
    • Franz Faerber
  3. Cellzome GmbH, Meyerhofstraße 1, 69117 Heidelberg, Germany

    • Mikhail M. Savitski,
    • Toby Mathieson &
    • Marcus Bantscheff
  4. JPT Peptide Technologies GmbH, Volmerstraße 5, 12489 Berlin, Germany

    • Karsten Schnatbaum,
    • Ulf Reimer &
    • Holger Wenschuh
  5. Institute of Pathology, Technische Universität München, Trogerstraße 18, 81675 München, Germany

    • Martin Mollenhauer &
    • Julia Slotta-Huspenina
  6. Center for Integrated Protein Science Munich, Germany

    • Bernhard Kuster

Contributions

M.W., J.S., M.L., E.Z., L.B., J.-H.B., S.G., A.G., H.H., A.M.G. and B.K. designed ProteomicsDB. H.H., K.S., U.R., M.M. and J.S.-H. performed experiments. M.W., H.H., A.M.G., M.M.S., H.M., T.M., S.L. and B.K. performed data analysis. H.W., M.B., F.F. and B.K. conceptualized the study. M.W., H.H., A.M.G. and B.K. wrote the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to:

Mass-spectrometry data are available from ProteomicsDB (https://www.proteomicsdb.org) and ProteomeXchange (http://proteomecentral.proteomexchange.org; dataset identifier PXD000865).

Author details

Extended data figures and tables

Extended Data Figures

  1. Extended Data Figure 1: Peptide and protein identifications. (478 KB)

    a, Spectrum viewer enabling access to more than 70-million annotated tandem mass spectra of endogenous peptides and synthetic reference standards in real time. b, Peptide length and score distribution for targets and decoys for the search engine Mascot. It is of note that the peptide- and protein-identification criteria followed a two-step process. First, for each LC-MS/MS run, we applied a global 1% target-decoy false discovery rate (FDR) cut on the level of peptide spectrum matches (PSMs, not shown); second, we applied a peptide-length-dependent local FDR cut of 5% for all PSMs and the results are depicted here. c, Same as in a but for the search engine Andromeda. d, e, Heat maps showing FDRs as a function of search engine score and peptide length. Solid lines indicate the 5% local FDR.

  2. Extended Data Figure 2: Protein-identification quality in very large data sets. (386 KB)

    a, First filtering step. The first step filters every LC-MS/MS run at 1% PSM FDR. Top panel, score distribution for target and decoy PSMs following 1% PSM FDR filtering for Maxquant identifications. Bottom panel, the binned peptide-length distribution for target PSMs. b, Same as a but for Mascot identifications. c, Second filtering step. Same as a, but this time applying an additional 5% local length- and score-dependent FDR on the total aggregated data for Maxquant identifications in ProteomicsDB. It is apparent that the second filtering step improves the FDR about threefold and removes most PSMs shorter than 9 amino acids. d, Same as c but for Mascot identifications in ProteomicsDB. e, Comparative analysis of protein FDR characteristics of two different approaches based on Mascot analysis. In the classical target-decoy approach, aggregation of large quantities of data leads to accumulation of large numbers of decoy proteins and a concomitant loss of true target proteins when filtering the data at 1% protein FDR. The alternative ‘picked’ target-decoy method does not suffer from this scaling problem and maintains a constant decoy rate (and therefore lower protein FDR) but at the expense of lower sensitivity of target protein detection compared to the classical target-decoy approach. Please refer to the Supplementary Information for details and a discussion on the topic. Note that the two protein FDR methods were not used in this manuscript. Instead, we used the criteria shown in a and b.

  3. Extended Data Figure 3: Further characterization of the proteome. (204 KB)

    a, Some proteins are refractory to identification using tryptic digestion because they do not generate sufficient—or any—peptides that are within the productive mass range of a mass spectrometer typically used for bottom-up proteomics. This can be improved by the use of alternative proteases; for example, chymotrypsin as shown here for one of the many keratin-associated proteins localized on chromosome 21 (detected chymotryptic peptides in red). b, c, Translation of lincRNAs is rare but does exist and can be identified (b) across all chromosomes as well as (c) in many tissues and in HeLa cells. d, Peptide-intensity distribution of protein-coding genes and non-coding transcripts. Interestingly, the abundance of translated lincRNAs is broadly similar to that of classical proteins.

  4. Extended Data Figure 4: Further characterization of the proteome. (510 KB)

    a, Proteome coverage rapidly saturates with the addition of shotgun proteomic data. Tissue proteomes saturate at approximately 16,000 proteins, but both body fluids and cell lines add small but noticeable numbers of proteins not covered in the tissues (see also b and c for a different ordering of samples). This indicates that proteome coverage is unlikely to increase much more by merely adding high-throughput data (although it may increase confidence in protein identifications and will probably also increase sequence coverage). b, Same plot as a but different ordering of samples. c, Saturation plots showing that PTMs and affinity purifications each contribute distinctly to the coverage of the proteome. d, Comparison of five large-scale projects suggesting that a ‘core proteome’ of 10,000–12,000 ubiquitously expressed proteins exists. Ellipses represent the corresponding publications. e, Abundance distribution of the ‘core proteome’ based on the normalized iBAQ method. The most highly expressed 10% of proteins are dominated by proteins relating to energy production and protein synthesis. The least abundant 10% of proteins are enriched in proteins with regulatory functions. f, Tree-view summary of Gene Ontology (GO) term analysis for the proteins constituting the ‘core proteome’, showing that the core proteome is mainly concerned with biological processes relating to the homeostasis and life cycle of cells. The colours represent the broader categories of the treemap.

  5. Extended Data Figure 5: Comparative analysis of five intensity-based label-free absolute-quantification approaches. (479 KB)

    a, Linearity of intensity (U2-OS cell line data from ref. 22) and copies per cell for absolute protein quantification (AQUA)-quantified proteins (red dots, red regression line; same cell line, ref. 30) and derived copy-number estimates (grey dots, blue regression line; from the same study). b, Total sum normalization re-scales intensity distributions of Colo-205 cell digests measured on two different mass spectrometers (Orbitrap Elite data in red, LTQ Orbitrap XL data in blue; ref. 24). c, Quantile-quantile (Q-Q) plots of the normalized data presented in b illustrating good alignment of data across 4.5 orders of magnitude. d, Empirical cumulative density function (ECDF) of error distributions derived from a showing that all five methods have merit. e, Comparison of the fold error of iBAQ and top3 as a function of the number of quantified peptides. f, Same as e but for protein length. When peptide numbers are low, iBAQ shows errors that are slightly smaller in magnitude compared to the top3 method. g, Comparison of iBAQ and total sum normalized iBAQ for heavy SILAC-labelled MCF-7 cell digests (red bars; ref. 32) and label-free quantified MCF-7 cell digests (same as MCF-7 deep proteome in a; blue bars) before (left panel) and after normalization (right panel) showing no influence of the presence of the SILAC label on quantification results. h, Comparison of iBAQ and total sum normalized iBAQ for iTRAQ reporter-ion-intensity-based quantification (red bars; MCF-7 cell digest, ref. 46) and label-free quantified MCF-7 cell digests (blue bars; same as a and c) before (left panel) and after normalization (right panel). The intensity-distribution characteristics of iTRAQ and label-free measurements are too different to allow for comparative analyses of MS1- and MS2-based quantification data. i, Normalized iBAQ distributions of 347 cell-line and tissue proteomes (all MS1 quantified) available in ProteomicsDB showing the general applicability of MS1-based quantification across many sources of biological material.

  6. Extended Data Figure 6: Functional protein-expression analysis. (449 KB)

    Gene ontology analysis of proteins with expression levels 10-fold above average in a particular organ or body fluid invariably highlights protein signatures with direct organ-related functional significance.

  7. Extended Data Figure 7: Protein- versus mRNA-expression analysis. (685 KB)

    a, Comparison of mRNA and protein expression of 12 human tissues showing the generally rather poor correlation of protein and mRNA levels, implying the widespread application of transcriptional, translational and post-translational control mechanisms of protein-abundance regulation. Spearman correlation coefficients vary from 0.41 (thyroid gland) to 0.55 (kidney). ‘Corner proteins’ (0.5 logs to either side of zero) are marked in colours. b, Clustering of mRNA expression (left triangle) and protein expression (right triangle) across the 12 tissues does not reveal tissues with common profiles, suggesting that the transcriptomes and proteomes of human tissues are quite different from each other. c, The ratio of protein and mRNA level for a protein is approximately constant across many tissues. The heat map shows proteins and tissues clustered according to their protein/mRNA ratio. d, Protein abundance can be predicted from mRNA levels. Using the median ratio of protein/mRNA across 12 tissues, it is possible to predict protein levels from mRNA levels for every tissue with a good correlation coefficient, underscoring the importance of the translation rate (and mRNA levels) on protein expression.

  8. Extended Data Figure 8: Protein markers for drug sensitivity and resistance. (266 KB)

    a, Elastic net analysis of protein expression and drug sensitivity for the EGFR kinase inhibitor erlotinib. Positive-effect-size values indicate that high protein expression is associated with drug sensitivity. Negative-effect-size values indicate that high protein expression is associated with drug resistance. b, Same as in a but for the EGFR kinase inhibitor lapatinib. c, Correlation analysis of the elastic net effect sizes for erlotinib and lapatinib (proteins with elastic net frequencies of less than 600 are not shown for clarity). Proteins in the top-right quadrant are common markers for drug sensitivity (including EGFR as the primary target of both drugs). Proteins in the bottom-left quadrant are common markers for drug resistance (including S100A4, a known resistance marker for lapatinib). Proteins that are strong markers for sensitivity or resistance are annotated in each plot and most proteins can be easily placed into EGFR signalling and regulation pathways.

  9. Extended Data Figure 9: Protein complex composition and stoichiometry from shotgun proteomic data. (367 KB)

    a, Stoichiometry of the nuclear pore complex (NPC) reconstructed from shotgun proteomics data. To illustrate that normalized iBAQ values from shotgun experiments actually reflect protein copy numbers, we reconstructed the stoichiometry of the NPC (blue bars, data from nuclear extracts of HeLa cells, ref. 39; error bars indicate standard deviation from triplicate experiments) and compared it to the stoichiometry determined in the same study using AQUA peptides and SRM experiments (red bars). Note that in most cases the stoichiometries agree well between the two methods and with those reported in the literature. b, Stoichiometry of the α- and β-subunits of the proteasome reconstructed from shotgun proteomics data (examples). β-subunits of the constitutive proteasome are indicated in grey, immunoproteasome subunits (β1i, β2i, β5i) are indicated in red. Note that PC-3 cells are devoid of the immunoproteasome, whereas cells in the lymph node almost exclusively express this version of the molecular machine. c, Systematic assessment of the fraction of βi subunits (red bars) and β-subunits (grey bars) across 29 tissue samples and 80 cell-line samples (tissue data from human body map (this study), cell-line data from refs 22, 24). Note that many cell lines and tissues contain both versions of the proteasome and the data also suggest that further forms of the proteasome with different subunit compositions may exist.

  10. Extended Data Figure 10: Examples for the analytical utility of large mass-spectrometry-based data collected in ProteomicsDB. (488 KB)

    a, Enumeration of post-translational modifications and protein termini. b, Computation of proteotypic peptides. Generally the same one to five peptides are identified every time a protein is identified (top panel) making proteotypic peptides useful for assessing protein identification and as reagents for targeted mass-spectrometry measurements. We note that the proteotypicity of a peptide strongly depends on the presence or absence of a chemical modification (bottom panel, here tandem mass tags (TMT) or isobaric tags for relative and absolute quantification (iTRAQ)). c, Analysis of the selectivity of SRM transitions. The top panel shows the y8 transition of the peptide LHYGLPVVVK (β-catenin, marked with an arrow) in a slice of the precursor and fragment-ion window of 0.7 Da and 0.7 Da, respectively, typically employed on triple-quadrupole mass spectrometers. The size of the circle represents the relative intensity of the y8 fragment in a full tandem mass spectrum of this peptide. All other circles are interfering peptides (extracted from the entire ProteomicsDB) that have precursor and fragment ions in the same m/z window and with varying intensities (circle size). Interference can be reduced by using high-resolution mass spectrometry (middle panel) and confining the analysis to the tissue in question (here, a colon sample, bottom panel). Such interference plots in conjunction with the proteotypicity of peptides can be valuable for the design of targeted proteomic experiments.
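
The interference analysis described for Extended Data Figure 10c can be illustrated with a minimal Python sketch. It is not the ProteomicsDB implementation: the `Transition` record, the `find_interferences` function, the peptide names and all m/z values are hypothetical, and the 0.7 Da windows are assumed here to be total widths centred on the target transition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """One precursor/fragment pair (hypothetical record for illustration)."""
    peptide: str
    precursor_mz: float
    fragment_mz: float
    intensity: float  # relative fragment intensity in the full MS/MS spectrum


def find_interferences(target, library, precursor_window=0.7, fragment_window=0.7):
    """Return library transitions that fall inside both m/z windows of the target.

    Windows are treated as total widths centred on the target (an assumption).
    """
    half_p = precursor_window / 2
    half_f = fragment_window / 2
    return [
        t for t in library
        if t.peptide != target.peptide
        and abs(t.precursor_mz - target.precursor_mz) <= half_p
        and abs(t.fragment_mz - target.fragment_mz) <= half_f
    ]


# Illustrative values only; these are not measured m/z values.
target = Transition("TARGET", 500.0, 800.0, 1.0)
library = [
    Transition("PEPTIDE_A", 500.2, 800.1, 0.4),   # inside both windows
    Transition("PEPTIDE_B", 500.2, 801.0, 0.9),   # fragment outside window
    Transition("PEPTIDE_C", 501.0, 800.1, 0.7),   # precursor outside window
    Transition("PEPTIDE_D", 499.9, 799.95, 0.2),  # inside both windows
]

# Unit-resolution windows, as on a triple-quadrupole instrument.
low_res = find_interferences(target, library)
# Much narrower windows, standing in for high-resolution measurement.
high_res = find_interferences(target, library, 0.02, 0.02)

print([t.peptide for t in low_res])
print([t.peptide for t in high_res])
```

Narrowing the windows, or restricting `library` to transitions observed in a single tissue, reduces the candidate list in the same way the middle and bottom panels of the figure reduce the number of interfering circles.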

Supplementary information

PDF files

  1. Supplementary Information (940 KB)

    This file contains Supplementary Methods, Supplementary Notes and Supplementary References. A list of abbreviations used is also included.

Excel files

  1. Supplementary Table 1 (1 MB)

    Data available in ProteomicsDB, with search-engine-wise and project-wise comparison of the number of identified proteins/genes between the original publication and ProteomicsDB.

  2. Supplementary Table 2 (484 KB)

    List of synthetic reference peptides available in ProteomicsDB.

  3. Supplementary Table 3 (530 KB)

    Summary of lincRNA matches at the PSM, peptide and transcript level.

  4. Supplementary Table 4 (1.6 MB)

    Protein expression analysis.

  5. Supplementary Table 5 (5.8 MB)

    Gene ontology analysis of proteins in organs.

  6. Supplementary Table 6 (1.6 MB)

    mRNA / protein expression comparison.

  7. Supplementary Table 7 (5.8 MB)

    Drug sensitivity/resistance analysis.

  8. Supplementary Table 8 (544 KB)

    Protein complex analysis.

  9. Supplementary Table 9 (41.4 MB)

    List of proteotypic peptides.

Additional data