Proteogenomics: concepts, applications and computational strategies

Journal name:
Nature Methods


Proteogenomics is an area of research at the interface of proteomics and genomics. In this approach, customized protein sequence databases generated using genomic and transcriptomic information are used to help identify novel peptides (not present in reference protein sequence databases) from mass spectrometry–based proteomic data; in turn, the proteomic data can be used to provide protein-level evidence of gene expression and to help refine gene models. In recent years, owing to the emergence of new sequencing technologies such as RNA-seq and dramatic improvements in the depth and throughput of mass spectrometry–based proteomics, the pace of proteogenomic research has greatly accelerated. Here I review the current state of proteogenomic methods and applications, including computational strategies for building and using customized protein sequence databases. I also draw attention to the challenge of false positive identifications in proteogenomics and provide guidelines for analyzing the data and reporting the results of proteogenomic studies.
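As a rough illustration of what "novel" means in this context, the following Python sketch digests two toy protein lists in silico and reports the peptides that occur only in the customized database. The sequences, the function names (tryptic_digest, novel_peptides), the length limits and the simplified trypsin rule (cleave after K or R unless the next residue is P, no missed cleavages) are illustrative assumptions, not part of any published pipeline.

```python
def tryptic_digest(protein, min_len=6, max_len=30):
    """Simplified in silico trypsin digestion (assumed rule: cut after K/R, not before P)."""
    peptides = set()
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.add(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep the C-terminal peptide that does not end in K or R.
        peptides.add(protein[start:])
    return {p for p in peptides if min_len <= len(p) <= max_len}


def novel_peptides(reference_proteins, custom_proteins):
    """Peptides found in the customized database but not in the reference database."""
    reference = set()
    for sequence in reference_proteins:
        reference |= tryptic_digest(sequence)
    custom = set()
    for sequence in custom_proteins:
        custom |= tryptic_digest(sequence)
    return custom - reference


# Toy data (hypothetical): the custom entry carries a single F -> L substitution.
reference_db = ["MSTLLEVKAGDFTPEERQWLLPASK"]
custom_db = ["MSTLLEVKAGDLTPEERQWLLPASK"]

print(novel_peptides(reference_db, custom_db))
```

In this toy case only the peptide covering the substituted residue is reported as novel; the two unchanged peptides are shared with the reference database and would count as known.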

At a glance


  1. Figure 1: Peptide and protein identification in shotgun proteomics.

    (a) Overview of shotgun proteomics. Proteins are digested into peptides, separated using liquid chromatography (LC) coupled online to a mass spectrometer and then analyzed by the mass spectrometer, which generates tandem mass spectra. (b) Peptides are most commonly identified using a sequence database (DB) search approach. Traditionally, experimental MS/MS spectra are matched with theoretical spectra predicted for each peptide contained in a protein sequence database. Sequence tag–assisted database searching starts with extraction of short tags followed by database searching, in which the list of candidate peptides is restricted to only those peptides that contain one of the extracted sequence tags, thereby allowing for mutations in the sequences of candidate database peptides. Peptide sequence can also be extracted directly from the spectrum using de novo sequencing (extracted sequences can then be searched in a protein sequence database to find the exact or a homologous peptide).

  2. Figure 2: The concept of proteogenomics.

    In a proteogenomic approach, genomic and transcriptomic data are used to generate customized protein sequence databases to help interpret proteomic data. In turn, the proteomic data provide protein-level validation of the gene expression data and help refine gene models. The enhanced gene models can help improve protein sequence databases for traditional proteomic analysis.

  3. Figure 3: Type of peptides identified in proteogenomics.

    Peptides identified by searching customized protein sequence databases (DB) are mapped on the genome. Intergenic peptides map to regions located between annotated gene models, whereas intragenic peptides map to genomic regions contained within or in close proximity to an annotated gene model. Intragenic peptides can be further categorized according to the annotation of the corresponding gene model (for example, “Protein-coding gene,” “Long noncoding RNA (lncRNA) gene” and “Pseudogene”). The majority of peptides map to a protein-coding gene and can be divided into exon and exon-exon junction peptides. Novel peptides include peptides mapping to untranslated regions (3′ or 5′ UTR peptides), intron peptides, peptides spanning the boundary between the coding sequence region and the neighboring UTR or intron region (“Exon boundary”), peptides spanning unannotated (alternative) splice junctions (“Alt junction”), and out-of-frame peptides (“Alt frame”).

  4. Figure 4: Statistical assessment of peptide identifications in proteogenomics.

    MS/MS spectra (Spec) are searched against a customized protein sequence database (DB) that includes target sequences for the organism of interest, i.e., a reference protein database and predicted protein sequences (containing novel peptides). In addition, two 'decoy' databases (for example, containing reversed sequences) of the same sizes as the target reference and predicted databases are appended to the target databases. The best database peptide match for each spectrum is selected for further analysis. Peptide identifications are classified as known or novel (for a decoy peptide, the class—known or novel—is determined by the class of the corresponding target sequence from which the decoy was generated). When simple database search score–based filtering is used, the numbers of target and decoy peptide identifications passing a certain score threshold are counted and used to estimate the FDR corresponding to that threshold. FDR analysis should be done separately for known and novel peptides (class-specific FDR) because of differences in the number of known and novel sequences in the searched customized sequence database and because of the lower likelihood of correctly identifying a novel peptide. For more advanced methods based on computing posterior peptide probabilities, both the database search scores and the peptide class (known or novel) should be taken into consideration.
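The score-threshold counting described for Figure 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each entry is taken to be the best match for one spectrum, the FDR is estimated as the ratio of decoy to target identifications passing the threshold (one common estimator when the decoy database matches the target database in size), and the PSM record, function names and scores are all hypothetical.

```python
from collections import namedtuple

# One best-scoring peptide-spectrum match per spectrum (hypothetical record layout).
PSM = namedtuple("PSM", ["score", "is_decoy", "peptide_class"])


def estimate_fdr(psms, threshold):
    """Assumed estimator: decoys passing the threshold / targets passing the threshold."""
    targets = sum(1 for p in psms if p.score >= threshold and not p.is_decoy)
    decoys = sum(1 for p in psms if p.score >= threshold and p.is_decoy)
    return decoys / targets if targets else 0.0


def class_specific_fdr(psms, threshold):
    """Estimate the FDR separately for each peptide class (known, novel)."""
    classes = sorted({p.peptide_class for p in psms})
    return {
        c: estimate_fdr([p for p in psms if p.peptide_class == c], threshold)
        for c in classes
    }


# Toy scores (made up for illustration only).
psms = (
    [PSM(s, False, "known") for s in (60, 55, 50, 45, 40, 38, 36, 34)]
    + [PSM(s, True, "known") for s in (35, 20)]
    + [PSM(s, False, "novel") for s in (44, 33)]
    + [PSM(s, True, "novel") for s in (37, 25)]
)

print(estimate_fdr(psms, 30))        # pooled estimate over both classes
print(class_specific_fdr(psms, 30))  # separate estimates per class
```

With these toy scores the pooled estimate at a threshold of 30 is 0.2, whereas the class-specific estimates are 0.125 for known peptides and 0.5 for novel peptides. The pooled number therefore understates the error rate among the novel identifications, which is the motivation for class-specific FDR analysis.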

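The peptide categories described for Figure 3 can be illustrated with a small Python sketch that assigns a peptide to a category from its genomic coordinates. It is deliberately simplified and rests on several assumptions: a single gene model on one strand, half-open coordinates, a peptide that maps to one contiguous genomic block (so exon-exon junction peptides are not handled), no reading-frame check, and a hypothetical flank parameter standing in for "close proximity". The function name and category labels are illustrative only.

```python
def classify_peptide(pep_start, pep_end, gene_start, gene_end, coding_exons, flank=0):
    """Classify a peptide's genomic interval [pep_start, pep_end) against one gene model.

    coding_exons is a list of (start, end) half-open intervals of coding exon sequence.
    flank extends the gene on both sides to approximate "close proximity" (assumption).
    """
    # No overlap with the (flank-extended) gene model: intergenic.
    if pep_end <= gene_start - flank or pep_start >= gene_end + flank:
        return "intergenic"
    # Fully contained in one coding exon: ordinary exon peptide.
    for exon_start, exon_end in coding_exons:
        if exon_start <= pep_start and pep_end <= exon_end:
            return "exon"
    # Partial overlap with a coding exon: spans the exon edge into UTR or intron.
    for exon_start, exon_end in coding_exons:
        if pep_start < exon_end and exon_start < pep_end:
            return "exon boundary"
    # Inside or near the gene but outside coding exons (UTR, intron or flank).
    return "intragenic, outside coding exons"


# Toy gene model (hypothetical coordinates).
gene = dict(gene_start=1000, gene_end=5000, coding_exons=[(1200, 1500), (3000, 3400)])

for interval in [(1250, 1280), (1480, 1510), (2000, 2030), (6000, 6030)]:
    print(interval, classify_peptide(*interval, **gene))
```

Separating UTR from intron peptides, recognizing alternative splice junctions and detecting out-of-frame peptides would additionally require transcript-level exon annotation, multi-block peptide alignments and frame information, none of which is modeled here.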


Author information


  1. Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA.

    • Alexey I Nesvizhskii
  2. Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA.

    • Alexey I Nesvizhskii

Competing financial interests

The author declares no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Table (90 KB)

    Supplementary Table 1
