A draft map of the human proteome

Journal name: Nature
Volume: 509
Pages: 575–581
DOI: 10.1038/nature13302

Abstract

The availability of human genome sequence has transformed biomedical research over the past decade. However, an equivalent map for the human proteome with direct measurements of proteins and peptides does not exist yet. Here we present a draft map of the human proteome using high-resolution Fourier-transform mass spectrometry. In-depth proteomic profiling of 30 histologically normal human samples, including 17 adult tissues, 7 fetal tissues and 6 purified primary haematopoietic cells, resulted in identification of proteins encoded by 17,294 genes accounting for approximately 84% of the total annotated protein-coding genes in humans. A unique and comprehensive strategy for proteogenomic analysis enabled us to discover a number of novel protein-coding regions, which includes translated pseudogenes, non-coding RNAs and upstream open reading frames. This large human proteome catalogue (available as an interactive web-based resource at http://www.humanproteomemap.org) will complement available human genome and transcriptome data to accelerate biomedical research in health and disease.
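
The coverage figure in the abstract can be checked with a little arithmetic. The sketch below uses only the two numbers quoted above (17,294 genes and "approximately 84%"). The implied total and the implied number of undetected genes are back-calculations made here, not values reported by the authors, and they inherit the rounding of the 84% figure.

```python
# Figures taken from the abstract.
genes_identified = 17_294
fraction_of_annotated = 0.84  # "approximately 84%", so treat results as rough

# Implied size of the annotated protein-coding gene set (a back-calculation,
# not a number reported in the paper).
implied_total = genes_identified / fraction_of_annotated

# Implied number of annotated genes with no protein-level evidence here.
implied_undetected = implied_total - genes_identified

print(f"implied annotated protein-coding genes: ~{implied_total:,.0f}")
print(f"implied genes without protein evidence: ~{implied_undetected:,.0f}")
```

Because the percentage is rounded, the implied total is best read as "on the order of 20,500 genes" rather than as an exact count.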

At a glance

Figures

  1. Figure 1: Overview of the workflow and comparison of data with public repositories.

    a, The adult/fetal tissues and haematopoietic cell types that were analysed to generate a draft map of the normal human proteome are shown. b, The samples were fractionated, digested and analysed on the high-resolution and high-accuracy Orbitrap mass analyser as shown. Tandem mass spectrometry data were searched against a known protein database using SEQUEST and MASCOT database search algorithms.

  2. Figure 2: Landscape of the normal human proteome.

    a, Tissue-supervised hierarchical clustering reveals the landscape of gene expression across the analysed cells and tissues. Selected tissue-restricted genes are highlighted in boxes to show some well-studied genes (black) as well as hypothetical proteins of unknown function (red). The colour key indicates the normalized spectral counts per gene detected across the tissues. n.d., not determined. b, A heat map showing tissue expression of fetal tissue-restricted genes ordered by average expression across fetal tissues (left) and a zoom-in of the top 40 most abundant genes (right). The colour key indicates the spectral counts per gene. c, An ROC curve showing a comparison of the performance of the current data set (blue, area under the curve (AUC) = 0.762) with 111 individual gene expression data sets (orange) and a composite of the 111 individual data sets (red, AUC = 0.692). d, Developmental stage-specific differential expression of protein complexes in fetal and adult liver tissues. Heat map shows protein complexes with less than or equal to half of their subunits expressed in one of the tissue types. The darker the colour, the greater the number of expressed subunits.

  3. Figure 3: Isoform-specific expression.

    a, Exon structure of three known isoforms of FYN (left) along with abundance of isoform-specific peptides detected in the indicated cells/tissues (right). The colour key indicates relative expression based on the spectral counts of isoform-specific peptides detected. b, 20S constitutive proteasome and 20S immunoproteasome core complexes. Expression of their corresponding components is depicted by a heat map (red indicates higher expression) in the Human Proteome Map portal.

  4. Figure 4: Proteogenomic analysis.

    a, An overview of the multiple databases used in the proteogenomic analysis. A subset of peptides corresponding to genome search-specific peptides were synthesized and analysed by mass spectrometry. b, Overall summary of the results from the current study.

  5. Figure 5: Translation of pseudogenes and identification of novel N termini.

    a, A heat map shows the expression of pseudogenes (listed by name or accession code number) across the analysed cells/tissues. Some pseudogenes such as VDAC1P7 and GAPDHP1 were found to be globally expressed, whereas others were more restricted in their expression or were detected only in a single cell type/tissue as indicated. b, The distribution of novel N termini detected with N-terminal acetylation is shown with respect to the location of the annotated translational start site. All sites in the 5′UTR are labelled upstream whereas those located downstream of the annotated AUG start sites are labelled as 1st Met, 2nd Met and so on.

  6. Extended Data Fig. 1: Summary of proteome analysis.

    a, Mass error in parts per million for precursor ions of all identified peptides. b, Number of peptides detected per gene binned as shown. c, Distribution of sequence coverage of identified proteins. d–f, %FDR with a q value of <0.01 plotted against peptide length in number of amino acids, charge state of peptide ion and number of cleavage sites missed by enzyme. P values computed from two-tailed t-test are shown. Error bars indicate s.d. calculated from FDRs of multiple fetal samples. g, h, A comparison of peptides identified in this study with PeptideAtlas and GPMDB. i, Mass error in parts per million for precursor ions identified from proteogenomics analysis.

  7. Extended Data Fig. 2: Tissue-wise gene expression and housekeeping proteins.

    a, A heat map shows a partial list of not well-characterized, hypothetical genes. b, The bulk of protein mass is contributed by only a small number of genes. Only 2,350 ‘housekeeping genes’ account for ~75% of proteome mass. c, The number of cell/tissue types where a gene was observed was counted. Some genes were found to be specifically restricted in a few samples while others were observed in the majority of samples analysed. For example, 1,537 genes were detected only in one sample, and 2,350 genes were found in all samples. These latter genes can be defined as highly abundant ‘housekeeping proteins’. d, Distribution of genes in the RefSeq database based on the number of protein isoforms resulting from their annotated transcripts (left). Distribution of the transcripts with two or more protein isoforms annotated based on the number of isoform-specific or shared peptides (right). e, A representative example of sequence coverage of PSMB8 protein along with tissue distribution of all of its identified peptides and the MS/MS spectrum of one of the peptides is shown along with seven selected reaction monitoring (SRM) transitions.

  8. Extended Data Fig. 3: Western blot analysis of select tissue-restricted proteins.

    a, Eight proteins showing tissue-restricted expression were tested using western blot analysis in 17 adult tissues. GAPDH was used as a loading control. b, Four proteins were found to be expressed in a broad range of tissues, although bands that do not correspond to the expected molecular weight are also observed. CST, Cell Signalling Technology; SCB, Santa Cruz Biotechnology.

  9. Extended Data Fig. 4: Identification of novel genes/ORFs and translated non-coding RNAs.

    a, An example of a novel ORF in an alternate reading frame located in the 3′UTR of the CHTF8 gene. The relative abundance of peptides from the CHTF8 protein and the protein encoded by the novel ORF is shown (bottom). b, An example of a translated non-coding RNA (NR_027693.1) identified by searching a 3-frame-translated transcript database. The MS/MS spectrum of one of the five identified peptides (LEVASSPPVSEAVPR) is shown along with a similar fragmentation pattern observed from the corresponding synthetic peptide.

  10. Extended Data Fig. 5: Human genome annotation through proteogenomic analysis using GeneSpring.

    a, Four genome search specific peptides (GSSPs; red boxes) map to an upstream ORF (denoted as black hashes) located in the 5′UTR of the SLC35A4 gene (ORF shown as blue rectangle). b, GSSP mapping in the intergenic region between two RefSeq annotated genes, NDUFV3 and PKNOX1. The ORF region is depicted in dotted lines of human endogenous retroviral element (HERV). c, GSSPs mapping to an annotated pseudogene, MAGEB6P1; the alignments of parent gene and pseudogene are shown below the peptides.

  11. Extended Data Fig. 6: Frequency of nucleotides surrounding translational start sites.

    a, Frequency of nucleotides at positions ranging from −5 to +1 surrounding the AUG codon for confirmed translational start sites. b, Frequency of nucleotides at positions ranging from −5 to +1 surrounding the AUG codon for novel translational start sites identified in this study.
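
The per-position nucleotide frequencies described for Extended Data Fig. 6 can be tabulated from a set of aligned start-site contexts. The following is a minimal sketch, not the authors' analysis code. It assumes that "−5 to +1" means the five nucleotides upstream of the AUG plus the single nucleotide immediately downstream of it. The function name `position_frequencies` and the toy sequences are hypothetical.

```python
from collections import Counter


def position_frequencies(contexts, upstream=5):
    """Return {position: {nucleotide: frequency}} for aligned contexts.

    Each context is `upstream` nucleotides, then the AUG codon, then one
    downstream nucleotide (an assumed reading of "-5 to +1").
    """
    labels = list(range(-upstream, 0)) + [1]
    freqs = {}
    for label in labels:
        # Upstream bases come first in the string; the single downstream
        # base sits right after the 3-nt start codon.
        idx = label + upstream if label < 0 else upstream + 3
        counts = Counter(seq[idx] for seq in contexts)
        total = sum(counts.values())
        freqs[label] = {nt: counts.get(nt, 0) / total for nt in "ACGU"}
    return freqs


# Hypothetical toy contexts, not data from the study.
contexts = [
    "CCACCAUGG",
    "CCGCCAUGG",
    "UCACCAUGA",
    "CCAUCAUGG",
]
# Sanity check: every context carries AUG at the expected offset.
assert all(seq[5:8] == "AUG" for seq in contexts)

freqs = position_frequencies(contexts)
for pos, dist in freqs.items():
    print(pos, dist)
```

Running the same tabulation separately on confirmed and on novel start sites would give the two panels the caption describes, which can then be compared position by position.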

Author information

Affiliations

  1. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA

    • Min-Sik Kim,
    • Derese Getnet,
    • Raghothama Chaerkady,
    • Pamela Leal-Rojas,
    • Samarjeet Prasad,
    • Tai-Chung Huang,
    • Jun Zhong,
    • Xinyan Wu,
    • Patrick G. Shaw,
    • Donald Freed,
    • Christopher J. Mitchell,
    • Steven D. Leach &
    • Akhilesh Pandey
  2. Department of Biological Chemistry, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA

    • Min-Sik Kim,
    • Raghothama Chaerkady,
    • Xinyan Wu,
    • Muhammad S. Zahari &
    • Akhilesh Pandey
  3. Institute of Bioinformatics, International Tech Park, Bangalore 560066, India

    • Sneha M. Pinto,
    • Raja Sekhar Nirujogi,
    • Srikanth S. Manda,
    • Anil K. Madugundu,
    • Dhanashree S. Kelkar,
    • Joji K. Thomas,
    • Babylakshmi Muthusamy,
    • Praveen Kumar,
    • Nandini A. Sahasrabuddhe,
    • Lavanya Balakrishnan,
    • Jayshree Advani,
    • Bijesh George,
    • Santosh Renuse,
    • Lakshmi Dhevi N. Selvan,
    • Arun H. Patil,
    • Vishalakshi Nanjappa,
    • Aneesha Radhakrishnan,
    • Tejaswini Subbannayya,
    • Rajesh Raju,
    • Manish Kumar,
    • Sreelakshmi K. Sreenivasamurthy,
    • Arivusudar Marimuthu,
    • Gajanan J. Sathe,
    • Sandip Chavan,
    • Keshava K. Datta,
    • Yashwanth Subbannayya,
    • Apeksha Sahu,
    • Soujanya D. Yelamanchi,
    • Savita Jayaram,
    • Pavithra Rajagopalan,
    • Jyoti Sharma,
    • Krishna R. Murthy,
    • Nazia Syed,
    • Renu Goel,
    • Aafaque A. Khan,
    • Sartaj Ahmad,
    • Gourav Dey,
    • Aditi Chatterjee,
    • Ravi Sirdeshmukh,
    • T. S. Keshava Prasad,
    • Harsha Gowda &
    • Akhilesh Pandey
  4. Adrienne Helis Malvin Medical Research Foundation, New Orleans, Louisiana 70130, USA

    • Derese Getnet &
    • Akhilesh Pandey
  5. The Donnelly Centre, University of Toronto, Toronto, Ontario M5S 3E1, Canada

    • Ruth Isserlin,
    • Shobhit Jain &
    • Gary D. Bader
  6. Department of Pathology, Universidad de La Frontera, Center of Genetic and Immunological Studies-Scientific and Technological Bioresource Nucleus, Temuco 4811230, Chile

    • Pamela Leal-Rojas
  7. School of Medicine, Imperial College London, South Kensington Campus, London SW7 2AZ, UK

    • Keshav Mudgal
  8. Department of Neurosurgery, Postgraduate Institute of Medical Education & Research, Chandigarh 160012, India

    • Kanchan K. Mukherjee
  9. Department of Internal Medicine, Armed Forces Medical College, Pune 411040, India

    • Subramanian Shankar
  10. Department of Neuropathology, National Institute of Mental Health and Neurosciences, Bangalore 560029, India

    • Anita Mahadevan &
    • Susarla Krishna Shankar
  11. Human Brain Tissue Repository, Neurobiology Research Centre, National Institute of Mental Health and Neurosciences, Bangalore 560029, India

    • Anita Mahadevan &
    • Susarla Krishna Shankar
  12. Department of Chemical and Biomolecular Engineering and Division of Biomedical Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong

    • Henry Lam
  13. Department of Neurology, National Institute of Mental Health and Neurosciences, Bangalore 560029, India

    • Parthasarathy Satishchandra
  14. Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland 21224, USA

    • John T. Schroeder
  15. The Sol Goldman Pancreatic Cancer Research Center, Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21231, USA

    • Anirban Maitra,
    • Marc K. Halushka,
    • Ralph H. Hruban,
    • Christine A. Iacobuzio-Donahue &
    • Akhilesh Pandey
  16. Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21231, USA

    • Anirban Maitra,
    • Charles G. Drake,
    • Ralph H. Hruban,
    • Christine A. Iacobuzio-Donahue &
    • Akhilesh Pandey
  17. Department of Surgery, Johns Hopkins University School of Medicine, Baltimore, Maryland 21231, USA

    • Steven D. Leach &
    • Christine A. Iacobuzio-Donahue
  18. Departments of Immunology and Urology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, Maryland 21231, USA

    • Charles G. Drake
  19. Department of Obstetrics and Gynecology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA

    • Candace L. Kerr
  20. Diana Helis Henry Medical Research Foundation, New Orleans, Louisiana 70130, USA

    • Akhilesh Pandey
  21. Present address: Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA.

    • Candace L. Kerr

Contributions

A.P., H.G., R.C., M.-S.K. designed the study; A.P., H.G., M.-S.K. managed the study; D.G., C.L.K., C.A.I.-D., K.R.M. collected human cells/tissues; M.-S.K., R.C., D.G. developed the pipeline of experiment and analysis; D.G., M.-S.K., S.M.P., K.M., R.C., S.R., J.Z., X.W., P.G.S., M.S.Z., T.-C.H. prepared peptide samples for LC-MS/MS; M.-S.K., R.S.N., S.M.P., R.C., D.S.K., S.R., G.J.S. performed LC-MS/MS; M.-S.K., S.M.P., S.P., S.S.M., C.J.M., J.A. and A.K.M. processed MS data and managed data; A.K.M., S.S.M., B.G., A.H.P., Y.S., M.-S.K. performed comparison analysis with PeptideAtlas, neXtProt and GPMDB; R.I., S.Jai., G.D.B. performed interaction and complex analysis; M.-S.K., S.M.P., S.S.M., P.K., A.K.M., N.A.S., R.S.N., L.B., L.D.N.S., D.S.K., V.N., A.R., T.S., M.K., S.K.Sr., G.D., A.Mar., R.R., S.C., K.K.D., A.S., S.D.Y., S.Jay., P.R., A.H.P., B.G., J.S., N.S., R.G., G.J.S., A.A.K., S.A., D.F., T.S.K.P., H.G., A.P. performed proteogenomic analysis; A.C., H.L., R.S., J.T.S., K.K.M., S.S., A.Mah., S.K.Sh., P.S., S.D.L., C.G.D., A.Mai., M.K.H., R.H.H., C.L.K., C.A.I.-D. assisted with analysis of the data; M.-S.K., S.M.P., T.-C.H., P.L.-R. performed western blot experiments; M.-S.K., J.K.T., A.K.M., B.M., S.P., S.M.P. designed the Human Proteome Map web portal; M.-S.K., A.K.M., J.K.T. generated selected reaction monitoring (SRM) database; M.-S.K., K.M., G.D., S.M.P., S.S.M. illustrated figures with help of other authors; A.P., M.-S.K., H.G. wrote the manuscript with inputs from other authors.

Competing financial interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to:

The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository with the dataset identifier PXD000561.

Supplementary information

PDF files

  1. Supplementary Information (106 KB)

    This file contains a Supplementary Discussion and additional references.

  2. Supplementary Data (3.5 MB)

    This file contains Supplementary Data.

Excel files

  1. Supplementary Table 1 (1.1 MB)

    This file contains a summary of results from proteogenomics analysis; a list of peptides indicating novel signal peptide cleavage sites; and a draft map of the human proteome.

Comments

  1. Comment #63451

    Xavier Roucou said:

    Among several significant contributions in this work, the discovery of 44 novel protein-coding open reading frames (ORFs) illustrates the complexity of the human proteome. Recently, we reported the discovery of 83,886 previously undescribed ORFs termed alternative ORFs (AltORFs) (PMID:23950983). AltORFs are defined as ORFs present in the transcriptome that are different from annotated ORFs. We detected 1,259 proteins translated from AltORFs in human biological samples (PMID:23950983). While the role and importance of this “alternative proteome” will require substantial further validation, there can be no doubt that a comprehensive description of the human proteome must include the distinct possibility of a vastly greater number of functional proteins than has been traditionally considered.
    Given the existence of the alternative proteome, it is not surprising that Kim et al. found that nearly 50% of the 35 million MS/MS spectra of human proteins did not match proteins in the NCBI’s RefSeq human protein sequence database1. In an attempt to identify these novel proteins, the authors translated the human reference genome, RefSeq transcript sequences, non-coding RNAs, and pseudogenes. Among the 193 newly identified proteins, 44 were translated from novel uORFs, ORFs located in an alternate reading frame within coding regions of annotated genes, or ORFs located in 3′-UTRs.
    The astonishing failure to have detected the alternative proteome years ago results from the fact that MS-based proteomic methods rely on existing protein sequence databases that are far from complete and therefore do not allow the assignment of all MS/MS spectra. Recent ribosome profiling and footprinting approaches have suggested the significant use of unconventional translation initiation sites in mammals (PMID:22056041, PMID:22927429, PMID:22593554), and these alternative proteins should have been detected. In order to better define the human proteome, we generated a new database of alternative ORFs (AltORFs) present in NCBI’s RefSeq human mRNA sequence database. AltORFs overlap the annotated or reference protein-coding ORF (RefORF) in an alternate reading frame, are located in the 5′- and 3′-UTR regions of an mRNA, or partially overlap with both the RefORF and a UTR region. This approach led to the discovery of 83,886 unique AltORFs with a minimum size of 40 codons (PMID:23950983). The majority of mRNAs (87%) have at least one predicted AltORF, with an average of 3.88 AltORFs per mRNA. Additionally, the evolutionary conservation of many of these reading frames suggests functional importance. These AltORFs were translated in silico and included in an alternative protein database we used to interpret unmatched MS/MS spectra.
    So far, we and others have identified nearly 1,300 alternative proteins in different human cell lines and tissues (PMID:23950983, PMID:11447126, PMID:15489325, PMID:21478263, PMID:23760502, PMID:23160002, PMID:23429522), including some of the 44 new proteins mentioned in the Kim et al. study: the alternative protein translated from the AltORF mapping to the 5′-UTR of the SLC35A4 gene (or AltSLC35A4) was detected in HeLa cells and lung tissue; AltC11orf48 was detected in HeLa cells and in colon, lung and ovary tissues; and AltCHTF8 was detected in HeLa cells (PMID:23950983). Twenty-four of the 44 novel ORFs detected by Kim et al. were, in fact, already present in our AltORF database, and 9 of the 44 proteins translated from these novel ORFs were previously detected: AltASNSD1, AltSLC35A4, AltMKKS, AltSMCR7L, AltCHTF8, AltRPP14, AltSF1, AltC11orf48, AltHNRNPUL12. In this sense, Kim et al.’s study strongly supports the existence of the alternative proteome.
    Clearly, the alternative proteins detected by Kim et al. and by our team are the proverbial tip of the iceberg. A full map of the human proteome is thus still years away, and will require several important changes in our current thinking concerning the proteome and the concept that each mature mRNA only codes for one protein.

  2. Comment #64563

    Akhilesh Pandey said:

    At the time of submission of our manuscript, we had identified and reported a protein with five different peptide sequences that were derived from an annotated non-coding RNA (NR_027693.1) using our proteogenomic analysis pipeline. A different study (Cho et al. J Biol Chem. 2013. 288, 25207-18; PubMed ID:23836911) independently confirmed this protein and designated it as PERM1. This protein is now annotated by NCBI as NM_001291366.1, while the NR entry that referred to the corresponding mRNA as non-coding RNA has been “retired.” This shows the power of global and unbiased proteogenomics methods, and we anticipate seeing more reports from other groups confirming some of the more “unusual” findings described in our study.
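
Both comments turn on ORFs that sit outside the annotated coding frame of a transcript. As an illustration of the idea only, the sketch below scans the three forward reading frames of an mRNA for ATG-to-stop ORFs of at least a minimum length. It is not the pipeline used by either group. The function name `find_orfs`, the first-ATG rule and the toy sequence are assumptions made here; the default of 40 codons mirrors the minimum AltORF size quoted in the first comment.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(mrna, min_codons=40):
    """Yield (frame, start, end) for ATG-to-stop ORFs on the forward strand.

    `start` and `end` are 0-based and end-exclusive; `end` includes the stop
    codon. `min_codons` counts sense codons only (the stop is excluded).
    ORFs that run off the end of the sequence without a stop are skipped.
    """
    seq = mrna.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                # Assumption: the first in-frame ATG opens the ORF.
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield frame, start, i + 3
                start = None


# Hypothetical toy transcript: a frame-0 ORF with a short ORF nested in frame 1.
toy_mrna = "ATGAATGGCCTGAGCTAA"

orfs = list(find_orfs(toy_mrna, min_codons=2))
for frame, start, end in orfs:
    print(frame, start, end, toy_mrna[start:end])
```

With the default threshold the toy sequence yields nothing, since neither ORF reaches 40 codons. Classifying each hit as upstream, overlapping or 3′-UTR would additionally need the coordinates of the annotated coding sequence, which this sketch does not model.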
