The availability of the human genome sequence has transformed biomedical research over the past decade. However, an equivalent map for the human proteome with direct measurements of proteins and peptides does not yet exist. Here we present a draft map of the human proteome using high-resolution Fourier-transform mass spectrometry. In-depth proteomic profiling of 30 histologically normal human samples, including 17 adult tissues, 7 fetal tissues and 6 purified primary haematopoietic cells, resulted in the identification of proteins encoded by 17,294 genes, accounting for approximately 84% of the total annotated protein-coding genes in humans. A unique and comprehensive strategy for proteogenomic analysis enabled us to discover a number of novel protein-coding regions, which include translated pseudogenes, non-coding RNAs and upstream open reading frames. This large human proteome catalogue (available as an interactive web-based resource at http://www.humanproteomemap.org) will complement available human genome and transcriptome data to accelerate biomedical research in health and disease.
Extended data figures and tables
Extended Data Figures
- Extended Data Figure 1: Summary of proteome analysis.
a, Mass error in parts per million for precursor ions of all identified peptides. b, Number of peptides detected per gene binned as shown. c, Distribution of sequence coverage of identified proteins. d–f, %FDR with a q value of <0.01 plotted against peptide length in number of amino acids, charge state of peptide ion and number of cleavage sites missed by enzyme. P values computed from two-tailed t-test are shown. Error bars indicate s.d. calculated from FDRs of multiple fetal samples. g, h, A comparison of peptides identified in this study with PeptideAtlas and GPMDB. i, Mass error in parts per million for precursor ions identified from proteogenomics analysis.
- Extended Data Figure 2: Tissue-wise gene expression and housekeeping proteins.
a, A heat map showing a partial list of poorly characterized, hypothetical genes. b, The bulk of protein mass is contributed by only a small number of genes: only 2,350 ‘housekeeping genes’ account for ~75% of proteome mass. c, The number of cell/tissue types in which each gene was observed. Some genes were restricted to a few samples, whereas others were observed in the majority of samples analysed. For example, 1,537 genes were detected in only one sample, and 2,350 genes were found in all samples. These latter genes can be defined as highly abundant ‘housekeeping proteins’. d, Distribution of genes in the RefSeq database based on the number of protein isoforms resulting from their annotated transcripts (left). Distribution of the transcripts with two or more annotated protein isoforms, based on the number of isoform-specific or shared peptides (right). e, A representative example of sequence coverage of the PSMB8 protein, along with the tissue distribution of all of its identified peptides. The MS/MS spectrum of one of the peptides is shown along with seven selected reaction monitoring (SRM) transitions.
- Extended Data Figure 3: Western blot analysis of select tissue-restricted proteins.
a, Eight proteins showing tissue-restricted expression were tested using western blot analysis in 17 adult tissues. GAPDH was used as a loading control. b, Four proteins were found to be expressed in a broad range of tissues, although bands that do not correspond to the expected molecular weight were also observed. CST, Cell Signalling Technology; SCB, Santa Cruz Biotechnology.
- Extended Data Figure 4: Identification of novel genes/ORFs and translated non-coding RNAs.
a, An example of a novel ORF in an alternate reading frame located in the 3′ UTR of the CHTF8 gene. The relative abundance of peptides from the CHTF8 protein and the protein encoded by the novel ORF is shown (bottom). b, An example of a translated non-coding RNA (NR_027693.1) identified by searching a three-frame-translated transcript database. The MS/MS spectrum of one of the five identified peptides (LEVASSPPVSEAVPR) is shown along with a similar fragmentation pattern observed from the corresponding synthetic peptide.
- Extended Data Figure 5: Human genome annotation through proteogenomic analysis using GeneSpring.
a, Four genome search specific peptides (GSSPs; red boxes) map to an upstream ORF (denoted as black hashes) located in the 5′ UTR of the SLC35A4 gene (ORF shown as blue rectangle). b, GSSPs mapping to the intergenic region between two RefSeq-annotated genes, NDUFV3 and PKNOX1. The ORF region of the human endogenous retroviral element (HERV) is depicted with dotted lines. c, GSSPs mapping to an annotated pseudogene, MAGEB6P1. The alignments of the parent gene and pseudogene are shown below the peptides.
- Extended Data Figure 6: Frequency of nucleotides surrounding translational start sites.
a, Frequency of nucleotides at positions ranging from −5 to +1 surrounding the AUG codon for confirmed translational start sites. b, Frequency of nucleotides at positions ranging from −5 to +1 surrounding the AUG codon for novel translational start sites identified in this study.
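Extended Data Figure 1a and 1i report precursor mass error in parts per million. As a minimal sketch of how such a value is conventionally computed (the signed difference between observed and theoretical m/z, scaled by the theoretical m/z), the following may help. The function names and the 10 ppm default tolerance are illustrative assumptions, not values taken from this study.

```python
def mass_error_ppm(observed_mz: float, theoretical_mz: float) -> float:
    """Return the signed precursor mass error in parts per million (ppm)."""
    if theoretical_mz <= 0:
        raise ValueError("theoretical m/z must be positive")
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz: float, theoretical_mz: float,
                     tol_ppm: float = 10.0) -> bool:
    """Check whether a precursor falls inside a symmetric ppm window.

    tol_ppm is a hypothetical default chosen for illustration only.
    """
    return abs(mass_error_ppm(observed_mz, theoretical_mz)) <= tol_ppm


if __name__ == "__main__":
    # A precursor observed 0.0025 m/z units above a theoretical m/z of 500
    error = mass_error_ppm(500.0025, 500.0)
    print(f"{error:.2f} ppm, within 10 ppm: {within_tolerance(500.0025, 500.0)}")
```

Because the error is normalized by the theoretical m/z, the same absolute deviation corresponds to a larger ppm error for lighter ions than for heavier ones.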
- Supplementary Information
This file contains a Supplementary Discussion and additional references.
- Supplementary Data
This file contains Supplementary Data.
- Supplementary Table 1
This file contains a summary of results from proteogenomics analysis; a list of peptides indicating novel signal peptide cleavage sites; and a draft map of the human proteome.
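The proteogenomic results summarized above rely in part on searching spectra against a three-frame-translated transcript database (Extended Data Figure 4b). The sketch below illustrates only the translation step, assuming the standard genetic code and forward-strand frames. The function names are hypothetical, and this is not the authors' pipeline.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate a nucleotide sequence codon by codon ('*' marks a stop).

    Codons containing ambiguous bases are reported as 'X'; trailing
    bases that do not fill a complete codon are ignored.
    """
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def three_frame_translate(transcript: str) -> dict:
    """Return translations of the three forward reading frames (0, 1, 2)."""
    return {frame: translate(transcript[frame:]) for frame in range(3)}


if __name__ == "__main__":
    for frame, protein in three_frame_translate("ATGGCCTAA").items():
        print(frame, protein)
```

In a database search, each frame's translation would typically be split at stop codons into candidate ORF segments before in silico digestion. That step is omitted here to keep the example short.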