Abstract
The analysis of the large amount of data generated in mass spectrometry–based proteomics experiments represents a significant challenge and is currently a bottleneck in many proteomics projects. In this review we discuss critical issues related to data processing and analysis in proteomics and describe available methods and tools. We place special emphasis on the elaboration of results that are supported by sound statistical arguments.
Acknowledgements
This work was supported in part by US National Institutes of Health (NIH) National Cancer Institute Grant R01 CA126239 to A.I.N. and with federal funds from the National Heart, Lung, and Blood Institute of the NIH under contract no. N01-HV-28179 to R.A.
Cite this article
Nesvizhskii, A., Vitek, O. & Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat Methods 4, 787–797 (2007). https://doi.org/10.1038/nmeth1088