mProphet: automated data processing and statistical validation for large-scale SRM experiments

Journal name: Nature Methods
Volume: 8
Pages: 430–435
DOI: 10.1038/nmeth.1584

Abstract

Selected reaction monitoring (SRM) is a targeted mass spectrometric method that is increasingly used in proteomics for the detection and quantification of sets of preselected proteins at high sensitivity, reproducibility and accuracy. Currently, data from SRM measurements are mostly evaluated subjectively by manual inspection on the basis of ad hoc criteria, precluding the consistent analysis of different data sets and an objective assessment of their error rates. Here we present mProphet, a fully automated system that computes accurate error rates for the identification of targeted peptides in SRM data sets and maximizes specificity and sensitivity by combining relevant features in the data into a statistical model.

Figures

  1. Structure of SRM data and definition of terms.

    (a) Representation of the SRM measurement of one peptide. We name the precursor (Q1) to fragment ion (Q3) transitions, used to measure one targeted peptide, a transition group. The data resulting from the measurement of one transition or transition group are called a trace or transition group record, respectively. In one transition group record, several peak groups can be identified that potentially represent the peptide of interest. (b) Peak group features that can be used to identify a true peak group. Red indicates an unexpected behavior for true peak groups. If the peak group is derived from the targeted peptide, the peaks tend to have similar retention time profile and shape. Furthermore, the relative intensities of the fragment ions must correspond to previously measured intensity ratios (for example, from a consensus spectrum). If a reference peptide is in the sample, the relative intensities for all corresponding traces as well as peak shape and elution time should be similar for intrinsic peptide and reference.

  2. Generation of a gold-standard data set with assigned true peak groups.

    Synthetic peptides (100) in isotopically light and heavy forms were added at three different concentrations to three different background matrices (trypsinized protein extracts from L. interrogans, C. elegans and H. sapiens u2os cells) of increasing complexity. (a) Dilution series of a synthetic peptide mixed into a background matrix. The peptide was measured using SRM with five transitions in three different samples and at three different concentrations. Signal intensities (square root of counts per second) versus retention time for one transition group record at three different dilutions. Square root is used to visualize the full intensity range. The true peptide signal at 34 min, as we expected, is proportional to the peptide concentration, whereas a second signal (indicated by an asterisk) is constant among all three samples and thus designated a wrong peak group and neglected. (b) Systematic discrimination between true and false peak groups. Every peak group was compared to the peak groups of the other two dilutions in terms of retention time and expected intensity as shown here for the peak groups of the 64-fold dilution. The comparisons in black fulfill the stringent filtering. Only transition group records with one peak group fulfilling the criteria were accepted. (c) Histograms of subscore distributions of true and false peak groups; inset, corresponding ROC plots and AUC.

  3. Combining features improves the separation of true and false peak groups.

    (a) Separation of true and false target peak groups in the test data set by mProphet after training of a classifier with a semisupervised learning strategy. (b) ROC plots for all the single subscores compared with the composite mProphet score. (c) Comparison of mProphet computed and true sensitivity and FDR in the test data set. (d) Signal-to-noise ratio of peak groups versus the mProphet score in the test data set. Signals with an SNR greater than ~10 were completely separated from false peak groups. (e) Dependence of the true-false intensity correlation score separation power on the number of transitions. ROC curves for the data set using three to five transitions recorded (six to ten including the heavy transitions). (f) Dependence of the mProphet separation power on the fraction of available decoy transition groups. ROC curves for 36%, 20%, 10%, 5% and 2% decoy transition group records relative to the total amount of target data.

  4. Separation of true from false peak group signals in a total human u2os cell line lysate using decoy transitions and mProphet scoring.

    (a) Decoy transition groups were designed as pairs of decoy transitions for the endogenous isoform and the target transitions for the reference form. Both decoy and target transition groups were scored against the same reference (the spiked-in peptide). (b) Cumulative spectral counts in a shotgun mass spectrometry experiment of an off-gel electrophoresis fractionated u2os total human cell lysate of the peptides selected for targeting with SRM. (c) mProphet score distribution for target and decoy peak groups. Most of the target signals were separated from the decoy distribution. (d) Sensitivity and FDR as function of the mProphet score cutoff. Most peptides could be detected with a high confidence (FDR < 1%). (e) High-confidence low signal-to-noise identification by mProphet. (f) Dependence of mProphet separation on the number of transitions. ROC curves using two to six transitions recorded (six to ten including the heavy transitions). (g) Dependence of mProphet separation on the number of transitions when completely neglecting the reference peptide data. ROC curves for the data set using two to six transitions recorded.
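
The score statistics named in the captions above (ROC/AUC in Figures 2c and 3b, FDR as a function of the score cutoff in Figure 4d) can be illustrated with a minimal Python sketch. This is not the mProphet implementation. The function names `auc` and `decoy_fdr` are hypothetical, and the FDR estimate rests on a simplifying, conservative assumption: decoy scores are taken to model the scores of false target peak groups, and every target is treated as potentially false.

```python
def auc(true_scores, false_scores):
    """Area under the ROC curve, computed as the fraction of (true, false)
    score pairs in which the true peak group scores higher (ties count half)."""
    wins = 0.0
    for t in true_scores:
        for f in false_scores:
            if t > f:
                wins += 1.0
            elif t == f:
                wins += 0.5
    return wins / (len(true_scores) * len(false_scores))


def decoy_fdr(target_scores, decoy_scores, cutoff):
    """Conservative decoy-based FDR estimate for targets scoring >= cutoff.

    Assumption (not taken from the article): the decoy count above the cutoff
    is scaled to the size of the target set, as if all targets could be false.
    """
    targets_passing = sum(s >= cutoff for s in target_scores)
    decoys_passing = sum(s >= cutoff for s in decoy_scores)
    if targets_passing == 0:
        return 0.0
    # Expected number of false targets above the cutoff, scaled from the decoys.
    expected_false = decoys_passing * len(target_scores) / len(decoy_scores)
    return min(1.0, expected_false / targets_passing)


# Made-up scores for illustration only.
targets = [5.0, 4.0, 3.0, 0.5, 0.0, -1.0]
decoys = [1.0, 0.2, -0.5, -1.5]
print(decoy_fdr(targets, decoys, cutoff=2.0))  # no decoy passes the cutoff
print(decoy_fdr(targets, decoys, cutoff=0.0))  # two of four decoys pass
print(auc([3.0, 2.0], [1.0, 2.5]))
```

Scanning the cutoff over the observed score range gives an FDR curve of the kind described for Figure 4d. A less conservative estimate would additionally scale the decoy count by the estimated fraction of false targets.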

Change history

Corrected online 06 April 2011
In the version of this article initially published online, a 'greater than' sign was inadvertently reversed, and an author contribution was incorrectly attributed. The error has been corrected for the print, PDF and HTML versions of this article.

Author information

  1. These authors contributed equally to this work.

    • Lukas Reiter &
    • Oliver Rinner

Affiliations

  1. Biognosys AG, Zurich, Switzerland.

    • Lukas Reiter &
    • Oliver Rinner
  2. Institute of Molecular Systems Biology, Department of Biology, Swiss Federal Institute of Technology (ETH) Zurich, Zurich, Switzerland.

    • Lukas Reiter,
    • Oliver Rinner,
    • Paola Picotti,
    • Ruth Hüttenhain,
    • Martin Beck &
    • Ruedi Aebersold
  3. Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland.

    • Lukas Reiter &
    • Michael O Hengartner
  4. PhD Program in Molecular Life Sciences Zurich, Zurich, Switzerland.

    • Lukas Reiter
  5. Competence Center for Systems Physiology and Metabolic Diseases, Zurich, Switzerland.

    • Ruth Hüttenhain &
    • Ruedi Aebersold
  6. Institute for Systems Biology, Seattle, Washington, USA.

    • Mi-Youn Brusniak
  7. Faculty of Science, University of Zurich, Zurich, Switzerland.

    • Ruedi Aebersold
  8. Present addresses: Institute of Biochemistry, Department of Biology, ETH Zurich, Zurich, Switzerland (P.P.) and European Molecular Biology Laboratory, Structural and Computational Biology Unit, Heidelberg, Germany (M.B.).

    • Paola Picotti &
    • Martin Beck

Contributions

L.R., O.R., P.P., M.-Y.B. and R.A. designed the gold-standard data set. P.P. carried out the measurements on the gold-standard data set. L.R., O.R. and R.A. wrote the paper. L.R. and O.R. wrote the software and did the data analysis. L.R. did most of the statistical data analysis. R.H. contributed to the experiment involving the human plasma N-glycopeptide-enriched samples. M.B. contributed to the experiment involving the human u2os cell line. M.O.H. provided critical input on the project. R.A. supervised the project.

Competing financial interests

O.R. and L.R. are employees of Biognosys AG. This company funded parts of the work.

Supplementary information

PDF files

  1. Supplementary Text and Figures (7M)

    Supplementary Figures 1–12, Supplementary Table 1, Supplementary Results and Supplementary Note

Excel files

  1. Supplementary Data 1 (3M)

    Table of transitions, table of peak groups, table with identification statistics and classifier of the gold-standard data set analysis. The transitions sheet contains the precursor m/z (Q1), fragment ion m/z (Q3), an id that groups the transitions according to precursor (transition group id), an id for the transition (transition id), a string describing the isotopic labeling of the peptide (isotype), the collision energy used (CE), the expected retention time used for scheduled SRM (tR), the expected relative intensity of the fragment ions (relative intensity %), a string indicating whether the transition is a decoy or target (decoy) and an id to group corresponding target and decoy transition groups (target decoy transition group id). The mProphet peak groups sheet contains a row for each peak group. The most important columns are an id for a transition group measurement (transition_group_record), the features used for scoring (all columns starting with main_var or var_), a column indicating the dilution of the synthetic peptides in the specific matrix (dilution), the species used for the background matrix (background), the class of the peak group in terms of identity as determined by the dilution alignment (real_class), a boolean indicating whether the peak group was derived from decoy or target transitions (real_decoy), a boolean indicating whether the peak group was treated as decoy or target in the mProphet analysis (decoy) and the mProphet discrimination score (d_score). The mProphet all peak groups sheet contains all peak groups of the analysis, not only the ones that rank highest in one transition group record (peak_group_rank). The mProphet stat sheet relates the mProphet discrimination score (cutoff) to the false discovery rate (FDR) and the sensitivity (sens). The mProphet classifier weight sheet contains the weights that were determined using the semi-supervised learning approach.

  2. Supplementary Data 2 (4M)

    Table of transitions, table of peak groups, table with identification statistics and classifier of the human u2os cell line analysis. For a detailed description of the sheets see Supplementary Data 1 legend.

  3. Supplementary Data 3 (1M)

    Table of transitions, table of peak groups, table with identification statistics and classifier of the human plasma analysis. For a detailed description of the sheets see Supplementary Data 1 legend.

  4. Supplementary Data 4 (692M)

    Table of transitions and peak groups for the measurement of yeast target and decoy transitions in human plasma. The transitions sheet contains target transitions of yeast peptides and corresponding decoy transitions generated by two different decoy transition generation algorithms (ADD_RANDOM and REVERSE_PEP_AND_INCREASE_Q1). The mQuest peak groups sheet contains the data processed with mQuest. The mProphet analysis does not yield meaningful results because the data contain no positive target measurements. For a detailed description of the sheets see Supplementary Data 1 legend.
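
The relation between the "all peak groups" sheet and the "peak groups" sheet described in the Supplementary Data 1 legend can be sketched in Python as follows. The column names `transition_group_record`, `d_score` and `peak_group_rank` come from the legend. The function names and the example rows are hypothetical, and the sketch assumes that rank 1 denotes the peak group with the highest discrimination score within a transition group record.

```python
from collections import defaultdict


def rank_peak_groups(rows):
    """Assign peak_group_rank within each transition group record,
    with rank 1 for the highest d_score (an assumption, see lead-in)."""
    by_record = defaultdict(list)
    for row in rows:
        by_record[row["transition_group_record"]].append(row)

    ranked = []
    for record_rows in by_record.values():
        ordered = sorted(record_rows, key=lambda r: r["d_score"], reverse=True)
        for rank, row in enumerate(ordered, start=1):
            # Copy the row so the input is not modified.
            ranked.append({**row, "peak_group_rank": rank})
    return ranked


def top_peak_groups(rows):
    """Keep only the highest-ranking peak group of each record."""
    return [r for r in rank_peak_groups(rows) if r["peak_group_rank"] == 1]


# Made-up rows for illustration only.
all_peak_groups = [
    {"transition_group_record": "rec_a", "d_score": 1.0},
    {"transition_group_record": "rec_a", "d_score": 3.0},
    {"transition_group_record": "rec_b", "d_score": 0.5},
]
for row in top_peak_groups(all_peak_groups):
    print(row["transition_group_record"], row["d_score"])
```

Rows loaded from the spreadsheet with any tabular reader could be passed in as dictionaries of this shape.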