Proteogenomics is an area of research at the interface of proteomics and genomics. In this approach, customized protein sequence databases generated using genomic and transcriptomic information are used to help identify novel peptides (not present in reference protein sequence databases) from mass spectrometry–based proteomic data; in turn, the proteomic data can be used to provide protein-level evidence of gene expression and to help refine gene models. In recent years, owing to the emergence of new sequencing technologies such as RNA-seq and dramatic improvements in the depth and throughput of mass spectrometry–based proteomics, the pace of proteogenomic research has greatly accelerated. Here I review the current state of proteogenomic methods and applications, including computational strategies for building and using customized protein sequence databases. I also draw attention to the challenge of false positive identifications in proteogenomics and provide guidelines for analyzing the data and reporting the results of proteogenomic studies.
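As a rough illustration of the central idea above (finding peptides that a customized database supports but a reference database does not), the following minimal sketch digests both databases in silico and keeps the peptides unique to the customized one. It is a simplified teaching example, not a method taken from the review: the function names, the toy sequences, the simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages), and the decision to treat isoleucine and leucine as equivalent are all assumptions made here.

```python
import re


def tryptic_peptides(protein, min_len=6):
    """In-silico digest: cleave after K or R unless the next residue is P.

    Simplified rule with no missed cleavages (an assumption for illustration).
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return {p for p in pieces if len(p) >= min_len}


def normalize(peptide):
    """Treat I and L as identical, since they have the same mass."""
    return peptide.replace("I", "L")


def novel_peptides(custom_db, reference_db, min_len=6):
    """Return peptides from custom_db whose I/L-normalized form is absent
    from every reference protein's digest.

    Both arguments are hypothetical dicts mapping sequence IDs to sequences.
    """
    reference = set()
    for seq in reference_db.values():
        reference.update(normalize(p) for p in tryptic_peptides(seq, min_len))

    novel = set()
    for seq in custom_db.values():
        for pep in tryptic_peptides(seq, min_len):
            if normalize(pep) not in reference:
                novel.add(pep)
    return novel


# Toy data: the customized entry carries a single A -> V substitution.
reference_db = {"REF1": "MKTAYIAKQRQISFVK"}
custom_db = {"VAR1": "MKTAYIVKQRQISFVK"}

print(novel_peptides(custom_db, reference_db))
```

In a real analysis, candidate novel peptides would come from scored matches to tandem mass spectra rather than from sequence comparison alone, and they would need the careful false positive control that the review emphasizes.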
Supplementary information: Supplementary Table 1 (90 KB)