Proteomes are characterized by large protein-abundance differences, cell-type- and time-dependent expression patterns and post-translational modifications, all of which carry biological information that is not accessible by genomics or transcriptomics. Here we present a mass-spectrometry-based draft of the human proteome and a public, high-performance, in-memory database for real-time analysis of terabytes of big data, called ProteomicsDB. The information assembled from human tissues, cell lines and body fluids enabled estimation of the size of the protein-coding genome, and identified organ-specific proteins and a large number of translated lincRNAs (long intergenic non-coding RNAs). Analysis of messenger RNA and protein-expression profiles of human tissues revealed conserved control of protein abundance, and integration of drug-sensitivity data enabled the identification of proteins predicting resistance or sensitivity. The proteome profiles also hold considerable promise for analysing the composition and stoichiometry of protein complexes. ProteomicsDB thus enables navigation of proteomes, provides biological insight and fosters the development of proteomic technology.
References
- UniProt Consortium. Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 41, D43–D47 (2013)
- The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nature Biotechnol. 30, 221–223 (2012)
- Towards a knowledge-based Human Protein Atlas. Nature Biotechnol. 28, 1248–1250 (2010)
- ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nature Biotechnol. 32, 223–226 (2014)
- State of the human proteome in 2013 as viewed through PeptideAtlas: comparing the kidney, urine, and plasma proteomes for the biology- and disease-driven Human Proteome Project. J. Proteome Res. 13, 60–75 (2014)
- PaxDb, a database of protein abundance averages across all three domains of life. Mol. Cell. Proteomics 11, 492–500 (2012)
- MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnol. 26, 1367–1372 (2008)
- Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999)
- Target-decoy approach and false discovery rate: when things may go wrong. J. Am. Soc. Mass Spectrom. 22, 1111–1120 (2011)
- IPM: An integrated protein model for false discovery rate estimation and identification in high-throughput proteomics. J. Proteomics 75, 116–121 (2011)
- A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nature Biotechnol. 24, 1285–1292 (2006)
- Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Mol. Cell. Proteomics 8, 2405–2417 (2009)
- Deep proteome and transcriptome mapping of a human cancer cell line. Mol. Syst. Biol. 7, 548 (2011)
- Mapping intact protein isoforms in discovery mode using top-down proteomics. Nature 480, 254–258 (2011)
- Metrics for the Human Proteome Project 2013–2014 and strategies for finding missing proteins. J. Proteome Res. 13, 15–20 (2014)
- Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 25, 1915–1927 (2011)
- Landscape of transcription in human cells. Nature 489, 101–108 (2012)
- Long noncoding RNAs are rarely translated in two human cell lines. Genome Res. 22, 1646–1657 (2012)
- Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell 154, 240–251 (2013)
- Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147, 789–802 (2011)
- Non-coding RNA: Ribosomes, but no translation, for lincRNAs. Nature Rev. Genet. 14, 520 (2013)
- Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins. Mol. Cell. Proteomics 11, M111.014050 (2012)
- Integrated proteomic analysis of post-translational modifications by serial enrichment. Nature Methods 10, 634–637 (2013)
- Global proteome analysis of the NCI-60 cell line panel. Cell Rep. 4, 609–620 (2013)
- Identification of missing proteins in the neXtProt database and unregistered phosphopeptides in the PhosphoSitePlus database as part of the Chromosome-centric Human Proteome Project. J. Proteome Res. 12, 2414–2421 (2013)
- Profiling core proteomes of human cell lines by one-dimensional PAGE and liquid chromatography-tandem mass spectrometry. Mol. Cell. Proteomics 2, 1297–1305 (2003)
- Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Mol. Cell. Proteomics 13, 397–406 (2014)
- Loss of olfactory receptor function in hominin evolution. PLoS ONE 9, e84714 (2014)
- Critical assessment of proteome-wide label-free absolute abundance estimation strategies. Proteomics 13, 2567–2578 (2013)
- The quantitative proteome of a human cell line. Mol. Syst. Biol. 7, 549 (2011)
- Global quantification of mammalian gene expression control. Nature 473, 337–342 (2011)
- Initial quantitative proteomic map of 28 mouse tissues using the SILAC mouse. Mol. Cell. Proteomics 12, 1709–1722 (2013)
- Quantitative and qualitative proteome characteristics extracted from in-depth integrated genomics and proteomics analysis. Cell Rep. 5, 1469–1478 (2013)
- The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012)
- Reduced annexin A6 expression promotes the degradation of activated epidermal growth factor receptor and sensitizes invasive breast cancer cells to EGFR-targeted tyrosine kinase inhibitors. Mol. Cancer 12, 167 (2013)
- Epidermal growth factor receptor ligands as new extracellular targets for the metastasis-promoting S100A4 protein. FEBS J. 276, 5936–5948 (2009)
- Proteomic snapshot of the EGF-induced ubiquitin network. Mol. Syst. Biol. 7, 462 (2011)
- A census of human soluble protein complexes. Cell 150, 1068–1081 (2012)
- Cell type-specific nuclear pores: a case in point for context-dependent stoichiometry of molecular machines. Mol. Syst. Biol. 9, 648 (2013)
- Newly identified pair of proteasomal subunits regulated reciprocally by interferon gamma. J. Exp. Med. 183, 1807–1816 (1996)
- Identification of MECL-1 (LMP-10) as the third IFN-gamma-inducible proteasome subunit. J. Immunol. 156, 2361–2364 (1996)
- Computational prediction of proteotypic peptides for quantitative proteomics. Nature Biotechnol. 25, 125–131 (2007)
- Considerations on selected reaction monitoring experiments: implications for the selectivity and accuracy of measurements. Proteomics Clin. Appl. 6, 609–614 (2012)
- Targeted proteomic quantification on quadrupole-orbitrap mass spectrometer. Mol. Cell. Proteomics 11, 1709–1723 (2012); doi:10.1074/mcp.O112.019802
- A large synthetic peptide and phosphopeptide reference library for mass spectrometry-based proteomics. Nature Biotechnol. 31, 557–564 (2013)
- Retinoic acid receptor alpha is associated with tamoxifen resistance in breast cancer. Nature Commun. 4, 2175 (2013)
Extended data figures and tables
Extended Data Figures
- Extended Data Figure 1: Peptide and protein identifications. (478 KB)
a, Spectrum viewer enabling access to more than 70 million annotated tandem mass spectra of endogenous peptides and synthetic reference standards in real time. b, Peptide length and score distribution for targets and decoys for the search engine Mascot. Note that the peptide- and protein-identification criteria followed a two-step process. First, for each LC-MS/MS run, we applied a global 1% target-decoy false discovery rate (FDR) cut at the level of peptide spectrum matches (PSMs; not shown); second, we applied a peptide-length-dependent local FDR cut of 5% to all PSMs, and the results are depicted here. c, Same as in b but for the search engine Andromeda. d, e, Heat maps showing FDRs as a function of search engine score and peptide length. Solid lines indicate the 5% local FDR.
- Extended Data Figure 2: Protein-identification quality in very large data sets. (386 KB)
a, First filtering step. The first step filters every LC-MS/MS run at 1% PSM FDR. Top panel, score distribution for target and decoy PSMs following 1% PSM FDR filtering for MaxQuant identifications. Bottom panel, the binned peptide-length distribution for target PSMs. b, Same as a but for Mascot identifications. c, Second filtering step. Same as a, but this time applying an additional 5% local length- and score-dependent FDR on the total aggregated data for MaxQuant identifications in ProteomicsDB. It is apparent that the second filtering step improves the FDR about threefold and removes most PSMs shorter than 9 amino acids. d, Same as c but for Mascot identifications in ProteomicsDB. e, Comparative analysis of protein FDR characteristics of two different approaches based on Mascot analysis. In the classical target-decoy approach, aggregation of large quantities of data leads to accumulation of large numbers of decoy proteins and a concomitant loss of true target proteins when filtering the data at 1% protein FDR. The alternative ‘picked’ target-decoy method does not suffer from this scaling problem and maintains a constant decoy rate (and therefore lower protein FDR) but at the expense of lower sensitivity of target protein detection compared to the classical target-decoy approach. Please refer to the Supplementary Information for details and a discussion on the topic. Note that the two protein FDR methods were not used in this manuscript. Instead, we used the criteria shown in a and b.
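The first filtering step described above (a global target-decoy FDR cut on PSMs) can be illustrated with a minimal sketch. It assumes each PSM is reduced to a score and a decoy flag, and it estimates FDR as the ratio of decoys to targets above a score threshold; the function name and data layout are hypothetical, and the second, length- and score-dependent local FDR step is not reproduced here.

```python
def filter_psms_at_fdr(psms, max_fdr=0.01):
    """Keep target PSMs down to the lowest-ranked position at which the
    estimated FDR (#decoys / #targets so far) is still <= max_fdr.

    psms: list of (score, is_decoy) tuples; higher score = better match.
    This layout is an assumption made for illustration only.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    cutoff_index = 0  # number of top-ranked PSMs that pass
    for i, (_score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff_index = i
    # Report only target PSMs; decoys serve solely to estimate the FDR.
    return [p for p in ranked[:cutoff_index] if not p[1]]
```

In the workflow described in the caption this kind of cut would be applied to each LC-MS/MS run separately, with `max_fdr=0.01`.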
- Extended Data Figure 3: Further characterization of the proteome. (204 KB)
a, Some proteins are refractory to identification using tryptic digestion because they do not generate sufficient—or any—peptides that are within the productive mass range of a mass spectrometer typically used for bottom-up proteomics. This can be improved by the use of alternative proteases; for example, chymotrypsin as shown here for one of the many keratin-associated proteins localized on chromosome 21 (detected chymotryptic peptides in red). b, c, Translation of lincRNAs is rare but does exist and can be identified (b) across all chromosomes as well as (c) in many tissues and in HeLa cells. d, Peptide-intensity distribution of protein-coding genes and non-coding transcripts. Interestingly, the abundance of translated lincRNAs is broadly similar to that of classical proteins.
- Extended Data Figure 4: Further characterization of the proteome. (510 KB)
a, Proteome coverage rapidly saturates with the addition of shotgun proteomic data. Tissue proteomes saturate at approximately 16,000 proteins, but both body fluids and cell lines add small but noticeable numbers of proteins not covered in the tissues (see also b and c for a different ordering of samples). This indicates that proteome coverage is unlikely to increase much more by merely adding high-throughput data (although it may increase confidence in protein identifications and will probably also increase sequence coverage). b, Same plot as a but different ordering of samples. c, Saturation plots showing that PTMs and affinity purifications each contribute distinctly to the coverage of the proteome. d, Comparison of five large-scale projects suggesting that a ‘core proteome’ of 10,000–12,000 ubiquitously expressed proteins exists. Ellipses represent the corresponding publications. e, Abundance distribution of the ‘core proteome’ based on the normalized iBAQ method. The most highly expressed 10% of proteins are dominated by proteins relating to energy production and protein synthesis. The least abundant 10% of proteins are enriched in proteins with regulatory functions. f, Tree-view summary of Gene Ontology (GO) term analysis for the proteins constituting the ‘core proteome’, showing that the core proteome is mainly concerned with biological processes relating to the homeostasis and life cycle of cells. The colours represent the broader categories of the treemap.
- Extended Data Figure 5: Comparative analysis of five intensity-based label-free absolute-quantification approaches. (479 KB)
a, Linearity of intensity (U2-OS cell line data from ref. 22) and copies per cell for absolute protein quantification (AQUA)-quantified proteins (red dots, red regression line; same cell line, ref. 30) and derived copy-number estimates (grey dots, blue regression line; from the same study). b, Total sum normalization re-scales intensity distributions of Colo-205 cell digests measured on two different mass spectrometers (Orbitrap Elite data in red, LTQ Orbitrap XL data in blue; ref. 24). c, Quantile-quantile (Q-Q) plots of the normalized data presented in b illustrating good alignment of data across 4.5 orders of magnitude. d, Empirical cumulative distribution function (ECDF) of error distributions derived from a showing that all five methods have merit. e, Comparison of the fold error of iBAQ and top3 as a function of the number of quantified peptides. f, Same as e but for protein length. When peptide numbers are low, iBAQ shows errors that are slightly smaller in magnitude compared to the top3 method. g, Comparison of iBAQ and total sum normalized iBAQ for heavy SILAC-labelled MCF-7 cell digests (red bars; ref. 32) and label-free quantified MCF-7 cell digests (same as MCF-7 deep proteome in a; blue bars) before (left panel) and after normalization (right panel) showing no influence of the presence of the SILAC label on quantification results. h, Comparison of iBAQ and total sum normalized iBAQ for iTRAQ reporter-ion-intensity-based quantification (red bars; MCF-7 cell digest, ref. 46) and label-free quantified MCF-7 cell digests (blue bars; same as a and c) before (left panel) and after normalization (right panel). The intensity-distribution characteristics of iTRAQ and label-free measurements are too different to allow for comparative analyses of MS1- and MS2-based quantification data. i, Normalized iBAQ distributions of 347 cell-line and tissue proteomes (all MS1 quantified) available in ProteomicsDB showing the general applicability of MS1-based quantification across many sources of biological material.
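For readers unfamiliar with the quantities compared in this figure, the following minimal sketch shows the commonly used definitions of iBAQ (summed peptide intensity divided by the number of theoretically observable peptides), top3 (mean of the three most intense peptides) and total sum normalization. Function names and the input layout are hypothetical, and the exact scaling used in ProteomicsDB (for example any log transformation) is not specified here and may differ.

```python
def ibaq(peptide_intensities, n_theoretical_peptides):
    """iBAQ: summed peptide intensity per theoretically observable peptide."""
    return sum(peptide_intensities) / n_theoretical_peptides


def top3(peptide_intensities):
    """top3: mean intensity of the (up to) three most intense peptides."""
    best = sorted(peptide_intensities, reverse=True)[:3]
    return sum(best) / len(best)


def total_sum_normalize(abundance_by_protein):
    """Rescale abundances so that they sum to 1 within one sample,
    making samples measured on different instruments comparable."""
    total = sum(abundance_by_protein.values())
    return {protein: value / total
            for protein, value in abundance_by_protein.items()}
```

Because total sum normalization expresses every protein as a fraction of the sample total, it only makes sense when the compared samples were quantified in the same way (for example all at the MS1 level, as noted in panel h).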
- Extended Data Figure 6: Functional protein-expression analysis. (449 KB)
Gene ontology analysis of proteins with expression levels 10-fold above average in a particular organ or body fluid invariably highlights protein signatures with direct organ-related functional significance.
- Extended Data Figure 7: Protein- versus mRNA-expression analysis. (685 KB)
a, Comparison of mRNA and protein expression of 12 human tissues showing the generally rather poor correlation of protein and mRNA levels, implying the widespread application of transcriptional, translational and post-translational control mechanisms of protein-abundance regulation. Spearman correlation coefficients vary from 0.41 (thyroid gland) to 0.55 (kidney). ‘Corner proteins’ (0.5 logs to either side of zero) are marked in colours. b, Clustering of mRNA expression (left triangle) and protein expression (right triangle) across the 12 tissues does not reveal tissues with common profiles, suggesting that the transcriptomes and proteomes of human tissues are quite different from each other. c, The ratio of protein and mRNA level for a protein is approximately constant across many tissues. The heat map shows proteins and tissues clustered according to their protein/mRNA ratio. d, Protein abundance can be predicted from mRNA levels. Using the median ratio of protein/mRNA across 12 tissues, it is possible to predict protein levels from mRNA levels for every tissue with a good correlation coefficient, underscoring the importance of the translation rate (and mRNA levels) on protein expression.
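The prediction described in panel d can be sketched as follows: for one gene, take the median protein/mRNA ratio across tissues and multiply it by the mRNA level of the tissue of interest. This is a simplified illustration on linear-scale values with hypothetical function names; the units, scaling and any cross-validation used in the study are not reproduced.

```python
from statistics import median


def median_protein_mrna_ratio(protein_by_tissue, mrna_by_tissue):
    """Median protein/mRNA ratio of one gene across all tissues in which
    both values are available and the mRNA level is positive."""
    ratios = [
        protein_by_tissue[tissue] / mrna_by_tissue[tissue]
        for tissue in protein_by_tissue
        if mrna_by_tissue.get(tissue, 0) > 0
    ]
    return median(ratios)


def predict_protein_level(mrna_level, ratio):
    """Predicted protein abundance from an mRNA level and a gene-specific ratio."""
    return mrna_level * ratio
```

For an unbiased estimate in a given tissue one would typically compute the ratio from the remaining tissues only; that refinement is left out of this sketch.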
- Extended Data Figure 8: Protein markers for drug sensitivity and resistance. (266 KB)
a, Elastic net analysis of protein expression and drug sensitivity for the EGFR kinase inhibitor erlotinib. Positive-effect-size values indicate that high protein expression is associated with drug sensitivity. Negative-effect-size values indicate that high protein expression is associated with drug resistance. b, Same as in a but for the EGFR kinase inhibitor lapatinib. c, Correlation analysis of the elastic net effect sizes for erlotinib and lapatinib (proteins with elastic net frequencies of less than 600 are not shown for clarity). Proteins in the top-right quadrant are common markers for drug sensitivity (including EGFR as the primary target of both drugs). Proteins in the bottom-left quadrant are common markers for drug resistance (including S100A4, a known resistance marker for lapatinib). Proteins that are strong markers for sensitivity or resistance are annotated in each plot and most proteins can be easily placed into EGFR signalling and regulation pathways.
- Extended Data Figure 9: Protein complex composition and stoichiometry from shotgun proteomic data. (367 KB)
a, Stoichiometry of the nuclear pore complex (NPC) reconstructed from shotgun proteomics data. To illustrate that normalized iBAQ values from shotgun experiments actually reflect protein copy numbers, we reconstructed the stoichiometry of the NPC (blue bars, data from nuclear extracts of HeLa cells, ref. 39; error bars indicate standard deviation from triplicate experiments) and compared it to the stoichiometry determined in the same study using AQUA peptides and SRM experiments (red bars). Note that in most cases the stoichiometries are in very good agreement between the methods and with the stoichiometries reported in the literature. b, Stoichiometry of the α- and β-subunits of the proteasome reconstructed from shotgun proteomics data (examples). β-subunits of the constitutive proteasome are indicated in grey, immunoproteasome subunits (β1i, β2i, β5i) are indicated in red. Note that PC-3 cells are devoid of the immunoproteasome, whereas cells in the lymph node almost exclusively express this version of the molecular machine. c, Systematic assessment of the fraction of βi subunits (red bars) and β-subunits (grey bars) across 29 tissue samples and 80 cell-line samples (tissue data from the human body map (this study), cell-line data from refs 22 and 24). Note that many cell lines and tissues contain both versions of the proteasome and the data also suggest that further forms of the proteasome with different subunit compositions may exist.
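As an illustration of the kind of calculation behind panel c, the sketch below computes the fraction of immunoproteasome β-subunits from per-protein abundance values such as normalized iBAQ. The gene symbols are the standard ones for the inducible (β1i, β2i, β5i) and constitutive (β1, β2, β5) catalytic subunits; the function name, the input layout and the choice to treat missing subunits as zero are assumptions of this sketch rather than the study's procedure.

```python
# Standard gene symbols: PSMB9 = β1i, PSMB10 = β2i, PSMB8 = β5i;
# PSMB6 = β1, PSMB7 = β2, PSMB5 = β5.
IMMUNO_SUBUNITS = ("PSMB9", "PSMB10", "PSMB8")
CONSTITUTIVE_SUBUNITS = ("PSMB6", "PSMB7", "PSMB5")


def immunoproteasome_fraction(abundance_by_gene):
    """Fraction of catalytic β-subunit abundance contributed by βi subunits.

    abundance_by_gene: dict of gene symbol -> linear-scale abundance.
    Subunits absent from the dict are treated as not detected (0).
    """
    immuno = sum(abundance_by_gene.get(g, 0.0) for g in IMMUNO_SUBUNITS)
    constitutive = sum(abundance_by_gene.get(g, 0.0) for g in CONSTITUTIVE_SUBUNITS)
    total = immuno + constitutive
    if total == 0:
        raise ValueError("no catalytic β-subunits quantified in this sample")
    return immuno / total
```

A value near 0 would correspond to a sample like the PC-3 cells mentioned above, and a value near 1 to a sample dominated by the immunoproteasome.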
- Extended Data Figure 10: Examples for the analytical utility of large mass-spectrometry-based data collected in ProteomicsDB. (488 KB)
a, Enumeration of post-translational modifications and protein termini. b, Computation of proteotypic peptides. Generally the same one to five peptides are identified every time a protein is identified (top panel) making proteotypic peptides useful for assessing protein identification and as reagents for targeted mass-spectrometry measurements. We note that the proteotypicity of a peptide strongly depends on the presence or absence of a chemical modification (bottom panel, here tandem mass tags (TMT) or isobaric tags for relative and absolute quantification (iTRAQ)). c, Analysis of the selectivity of SRM transitions. The top panel shows the y8 transition of the peptide LHYGLPVVVK (β-catenin, marked with an arrow) in a slice of the precursor and fragment-ion window of 0.7 Da and 0.7 Da, respectively, typically employed on triple-quadrupole mass spectrometers. The size of the circle represents the relative intensity of the y8 fragment in a full tandem mass spectrum of this peptide. All other circles are interfering peptides (extracted from the entire ProteomicsDB) that have precursor and fragment ions in the same m/z window and with varying intensities (circle size). Interference can be reduced by using high-resolution mass spectrometry (middle panel) and confining the analysis to the tissue in question (here, a colon sample, bottom panel). Such interference plots in conjunction with the proteotypicity of peptides can be valuable for the design of targeted proteomic experiments.
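The interference analysis in panel c amounts to asking which other peptides have both a precursor and a fragment ion inside the isolation windows of the transition of interest. A minimal sketch, assuming a library of (peptide, precursor m/z, fragment m/z) tuples and windows centred on the target values; the function name, the library layout and the example peptide labels are hypothetical.

```python
def interfering_transitions(target, library,
                            precursor_window=0.7, fragment_window=0.7):
    """Return peptides whose precursor and fragment m/z both fall inside
    the windows centred on the target transition.

    target:  (precursor_mz, fragment_mz) of the transition of interest
    library: iterable of (peptide, precursor_mz, fragment_mz) tuples
    Window widths are full widths, as in the 0.7 / 0.7 setting above.
    """
    half_precursor = precursor_window / 2
    half_fragment = fragment_window / 2
    hits = []
    for peptide, precursor_mz, fragment_mz in library:
        in_precursor = abs(precursor_mz - target[0]) <= half_precursor
        in_fragment = abs(fragment_mz - target[1]) <= half_fragment
        if in_precursor and in_fragment:
            hits.append(peptide)
    return hits
```

Narrowing the windows (as with high-resolution instruments) or restricting `library` to peptides observed in one tissue reduces the number of hits, which mirrors the middle and bottom panels described above.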
- Supplementary Information (940 KB)
This file contains Supplementary Methods, Supplementary Notes and Supplementary References. A list of abbreviations used is also included.
- Supplementary Table 1 (1 MB)
Data available in ProteomicsDB and search engine and project-wise comparison of the number of identified proteins/genes between the original publication and ProteomicsDB.
- Supplementary Table 2 (484 KB)
List of synthetic reference peptides available in ProteomicsDB.
- Supplementary Table 3 (530 KB)
Summary of lincRNA matches on PSM, peptide and transcript.
- Supplementary Table 4 (1.6 MB)
Protein expression analysis.
- Supplementary Table 5 (5.8 MB)
Gene ontology analysis of proteins in organs.
- Supplementary Table 6 (1.6 MB)
mRNA / protein expression comparison.
- Supplementary Table 7 (5.8 MB)
Drug sensitivity/resistance analysis.
- Supplementary Table 8 (544 KB)
Protein complex analysis.
- Supplementary Table 9 (41.4 MB)
List of proteotypic peptides.