A large synthetic peptide and phosphopeptide reference library for mass spectrometry–based proteomics

Journal name:
Nature Biotechnology


We present a peptide library and data resource of >100,000 synthetic, unmodified peptides and their phosphorylated counterparts with known sequences and phosphorylation sites. Analysis of the library by mass spectrometry yielded a data set that we used to evaluate the merits of different search engines (Mascot and Andromeda) and fragmentation methods (beam-type collision-induced dissociation (HCD) and electron transfer dissociation (ETD)) for peptide identification. We also compared the sensitivities and accuracies of phosphorylation-site localization tools (Mascot Delta Score, PTM score and phosphoRS), and we characterized the chromatographic behavior of peptides in the library. We found that HCD identified more peptides and phosphopeptides than did ETD, that phosphopeptides generally eluted later from reversed-phase columns and were easier to identify than unmodified peptides and that current computational tools for proteomics can still be substantially improved. These peptides and spectra will facilitate the development, evaluation and improvement of experimental and computational proteomic strategies, such as separation techniques and the prediction of retention times and fragmentation patterns.

At a glance


  1. Figure 1: Library design and synthesis.

    (a) Number and relative abundance of nonredundant serine, threonine and tyrosine phosphorylation sites (pS, pT and pY, respectively) identified in five large-scale human phosphoproteomic data sets (refs. 19–23). (b) Hydrophobicity (GRAVY score) plotted against sequence length for the 851 peptides (black diamonds) identified in three out of the five large-scale data sets. The 96 representative 'seed' peptides in the 5–95% percentile interval (dashed box) were selected manually for subsequent library synthesis. Selected peptides are depicted in red, showing a representative distribution of both length and hydrophobicity. The selection of peptides also contains a representative distribution of phosphorylation-site positions within the sequence (Supplementary Fig. 4) and a representative distribution of lysine or arginine residues at the C terminus. (c) Schematic representation of the peptide library design, in which position x0 of a seed peptide represents the site of phosphorylation and is synthesized with serine, threonine or tyrosine, or with one of their phosphorylated forms. Positions x−1 and x+1 are each permuted with all 20 naturally occurring amino acids during synthesis, creating up to 2,400 different (phospho)peptides for each library (20 × 20 flanking combinations × 6 site variants). (d) The number of phosphorylated serine, threonine and tyrosine peptides and their relative abundances identified from LC-MS/MS analysis of the library using both HCD and ETD fragmentation. In total, 57,830 phosphopeptides with equal representation of all phosphate acceptor amino acids were identified.
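The design in Figure 1c can be illustrated with a minimal sketch that enumerates the variants of one seed peptide. The seed sequence `AAGSPEK`, the `pS`/`pT`/`pY` text notation and the function name `enumerate_library` are hypothetical choices made for this illustration; they are not taken from the study.

```python
from itertools import product

# The 20 naturally occurring amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Six site variants at x0: S, T, Y and their phosphorylated forms
# (the "pS"/"pT"/"pY" notation is an assumption of this sketch).
SITE_FORMS = ["S", "T", "Y", "pS", "pT", "pY"]


def enumerate_library(seed: str, site_index: int) -> list[str]:
    """Enumerate all library variants derived from one seed peptide.

    site_index is the 0-based position of x0 in the seed. Flanking
    positions x-1 and x+1 are varied only if they exist, so a seed with
    a terminal site yields 20 x 6 = 120 variants instead of 2,400.
    """
    if not 0 <= site_index < len(seed):
        raise ValueError("site_index is outside the seed sequence")

    has_left = site_index > 0
    has_right = site_index < len(seed) - 1

    # Unchanged parts of the seed on either side of x-1 .. x+1.
    prefix = seed[: site_index - 1] if has_left else ""
    suffix = seed[site_index + 2 :] if has_right else ""

    left_options = AMINO_ACIDS if has_left else [""]
    right_options = AMINO_ACIDS if has_right else [""]

    return [
        prefix + left + site + right + suffix
        for left, site, right in product(left_options, SITE_FORMS, right_options)
    ]


if __name__ == "__main__":
    internal = enumerate_library("AAGSPEK", 3)  # hypothetical seed, internal site
    terminal = enumerate_library("SAAGPEK", 0)  # hypothetical seed, N-terminal site
    print(len(internal), len(terminal))
```

For an internal site the count is 20 × 6 × 20 = 2,400, and for a terminal site it is 20 × 6 = 120, which agrees with the library sizes given in the captions of Figures 1 and 2.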

  2. Figure 2: Peptide library identification rate.

    (a) Total sequence coverage of each peptide library showing an overall identification rate of 63% by Mascot (70% for unmodified peptides and 57% for phosphopeptides, not adjusted for FDR). Libraries that were based on seed peptides with C- or N-terminal phosphorylation sites only contain a maximum of 120 (phospho)peptides and are therefore marked as such. (b–d) Venn diagrams showing the overlap between peptides (b), phosphopeptides (c) and nonphosphopeptides (d) identified by HCD and ETD fragmentation. (e) Number of peptides identified from each precursor charge state (2+, 3+, 4+ and 5+) by HCD only (light gray), ETD only (black) or both methods (dark gray). Further information can be found in Supplementary Figure 6.

  3. Figure 3: FDR determination for peptide identification.

    Because the sequences of all library peptides are known, FDRs can be determined by counting the number of correct and incorrect matches. (a) Local FDR in each Mascot score bin for both phosphorylated (red) and nonphosphorylated peptides (nonphospho; blue) using HCD as the fragmentation technique (for the ETD data, see Supplementary Figs. 10–13). Phosphorylated peptides seem to be identified more easily than nonphosphorylated peptides. Dashed vertical lines mark the Mascot identity (32) and homology (18) score thresholds for the database search. The identity score indicates a 5% probability that a PSM is a random event. The homology score indicates that a PSM is an outlier in the distribution of random scores (that is, the PSM is probably not a random match). (b) Global FDR plotted against Mascot score for phosphorylated (red) and nonphosphorylated peptides (blue). (c,d) Plots analogous to those shown in a and b for the search engine Andromeda (see also Supplementary Figs. 14 and 15). Colored dashed lines throughout the figure correspond to curves fit using a sum of two exponentials as indicated in the main text (see Supplementary Table 6 for the coefficients).
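The counting approach described for Figure 3 can be sketched as follows. This is an illustration only: the representation of a PSM as a `(score, is_correct)` pair, the function names and the toy scores are assumptions of this sketch, not the pipeline used in the study.

```python
def global_fdr(psms, threshold):
    """Fraction of incorrect matches among all PSMs scoring >= threshold.

    psms is an iterable of (score, is_correct) pairs, where is_correct
    is known because the library sequences are known.
    Returns None if no PSM passes the threshold.
    """
    accepted = [is_correct for score, is_correct in psms if score >= threshold]
    if not accepted:
        return None
    incorrect = sum(1 for is_correct in accepted if not is_correct)
    return incorrect / len(accepted)


def local_fdr(psms, bin_width):
    """Fraction of incorrect matches within each score bin.

    Returns a dict mapping the lower edge of each occupied bin to the
    local FDR observed in that bin.
    """
    bins = {}
    for score, is_correct in psms:
        start = int(score // bin_width) * bin_width
        total, incorrect = bins.get(start, (0, 0))
        bins[start] = (total + 1, incorrect + (0 if is_correct else 1))
    return {start: incorrect / total
            for start, (total, incorrect) in sorted(bins.items())}


if __name__ == "__main__":
    # Toy data (hypothetical scores), not taken from the library.
    toy = [(10, False), (12, True), (25, True), (28, False),
           (35, True), (38, True), (45, True)]
    print(global_fdr(toy, 20))
    print(local_fdr(toy, 10))
```

The global FDR accumulates all matches above a score cutoff, whereas the local FDR describes only the matches within one score bin, so the local value at a given score is typically higher than the global value at the same cutoff.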

  4. Figure 4: FLR determination for phosphorylated peptides.

    Because the modification sites of all library peptides are known, FLRs can be determined by counting the number of correct and incorrect matches. (a) Global FLR across MD scores for phosphorylated peptides identified by HCD (blue) and ETD (red; see also Supplementary Figs. 17–19). Colored dashed lines correspond to a curve fit using a sum of two exponentials of the form FLR = A × exp(−C × score) + B × exp(−D × score), akin to Figure 3b (see Supplementary Table 6 for coefficients). Despite the lower phosphopeptide identification success of ETD, it outperforms HCD in site localization. To reach a 1% global FLR, ETD requires an MD score of 10, whereas HCD requires a score of 20. (b) Qualitative and quantitative comparisons of PTM score, MD score and phosphoRS for phosphorylation site localization (HCD data). Although all three scores had comparable overall accuracies (main plot, generated using data from the intersection of all three tools), the Venn diagram in the inset shows the complementarity of the different tools at 1% FDR and 1% FLR. (c) Histogram of the number of correctly and incorrectly assigned spectra within probability bins provided by the three localization tools. The graphs show that all localization tools underestimate the true FLR within most bins but also indicate that this error is small for the vast majority of the data (see Supplementary Fig. 20 for further information). TP, true positives; FP, false positives. (d) Application of the FDR and FLR models derived from library spectra to the analysis of a phosphoproteomic sample generated by Ti-IMAC enrichment from human K562 cells. The results confirm the complementarity of the different localization scores at the level of 1% FDR and 1% FLR.
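As an illustration of how the two-exponential model in Figure 4a could be used, the sketch below evaluates FLR = A × exp(−C × score) + B × exp(−D × score) and searches for the score at which a target FLR is reached. The coefficient values are placeholders invented for this example; the fitted coefficients are those in Supplementary Table 6. The bisection search assumes that all four coefficients are positive, so that the model decreases monotonically with score.

```python
import math


def flr_model(score, a, b, c, d):
    """Two-exponential model: FLR = A*exp(-C*score) + B*exp(-D*score)."""
    return a * math.exp(-c * score) + b * math.exp(-d * score)


def score_for_target_flr(target, a, b, c, d, lo=0.0, hi=200.0, tol=1e-6):
    """Smallest score in [lo, hi] at which the modelled FLR is <= target.

    Assumes a, b, c, d > 0, so the model is strictly decreasing in score
    and bisection is valid.
    """
    if flr_model(lo, a, b, c, d) <= target:
        return lo
    if flr_model(hi, a, b, c, d) > target:
        raise ValueError("target FLR is not reached within the search range")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if flr_model(mid, a, b, c, d) > target:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # Placeholder coefficients for illustration only (NOT from the study).
    A, B, C, D = 0.3, 0.1, 0.5, 0.1
    cutoff = score_for_target_flr(0.01, A, B, C, D)
    print(f"score needed for 1% modelled FLR: {cutoff:.2f}")
```

The same approach applies to the FDR curves in Figure 3, which use a model of the same functional form.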

  5. Figure 5: Retention time analysis.

    (a) Distribution of the retention time shift introduced by the addition of a phosphate group to a peptide (HCD data; shifts (Δ) in retention time values were calculated by subtracting the retention time of the unmodified peptide from that of the corresponding phosphorylated peptide). The majority of phosphopeptides (70%) eluted later than their nonphosphorylated counterparts, 4% eluted within the same time window (a 25-s window, corresponding to the width of an average liquid chromatography peak at half maximum), and 26% eluted earlier. Retention time shifts seem to be independent of peptide length (bottom; see also Supplementary Fig. 22). (b) The library contains a large number of paired sequence isomers (x−1 and x+1 positions around the phosphorylation site; n = 45,516 pairs at 5% global FDR). Sixty-two percent of these positional isomers cannot be distinguished by a difference in retention time (red), whereas the remaining 38% can (blue) (left). For the indistinguishable positional isomers, no clear over- or underrepresentation of amino acids at the x+1 and x−1 positions is detectable (middle). In contrast, for the paired positional isomers that can be distinguished by retention time (right), there is a strong underrepresentation of glycine at the x+1 and x−1 positions. (c) The influence of individual amino acids on retention behavior studied using peptides for which all 156 possible amino acid permutations in the x+1 and x−1 positions were observed by HCD. The trend clearly follows the general hydrophobicity and charge properties of the amino acids as approximated by GRAVY scores and reported by others (see Supplementary Fig. 23 for further details). Amino acids are referred to by their one-letter abbreviations in b and c.
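The classification behind Figure 5a can be sketched as below. The caption does not spell out how the 25-s window is applied, so this sketch assumes that a pair counts as co-eluting when the absolute shift is at most 25 s; the function names and the toy retention times are likewise hypothetical.

```python
from collections import Counter


def classify_shift(rt_unmodified, rt_phospho, window=25.0):
    """Classify the retention time shift of a phosphopeptide (in seconds).

    The shift is rt_phospho - rt_unmodified, as in the figure caption.
    Assumption of this sketch: |shift| <= window counts as "same".
    """
    delta = rt_phospho - rt_unmodified
    if abs(delta) <= window:
        return "same"
    return "later" if delta > 0 else "earlier"


def summarize_shifts(pairs, window=25.0):
    """Fraction of pairs eluting later, in the same window, or earlier.

    pairs is a list of (rt_unmodified, rt_phospho) tuples in seconds.
    """
    if not pairs:
        raise ValueError("no peptide pairs supplied")
    counts = Counter(classify_shift(u, p, window) for u, p in pairs)
    total = len(pairs)
    return {label: counts[label] / total
            for label in ("later", "same", "earlier")}


if __name__ == "__main__":
    # Toy retention times in seconds (hypothetical, not library data).
    toy_pairs = [(1000, 1100), (1000, 1010), (1000, 900), (1000, 1200)]
    print(summarize_shifts(toy_pairs))
```

Applied to matched unmodified and phosphorylated peptide pairs, a summary of this kind yields the three fractions reported in the caption (later, same window, earlier).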


References

  1. Chen, Y., Kwon, S.W., Kim, S.C. & Zhao, Y. Integrated approach for manual evaluation of peptides identified by searching protein sequence databases with tandem mass spectra. J. Proteome Res. 4, 998–1005 (2005).
  2. Chen, Y., Zhang, J., Xing, G. & Zhao, Y. Mascot-derived false positive peptide identifications revealed by manual analysis of tandem mass spectra. J. Proteome Res. 8, 3141–3147 (2009).
  3. Keller, A. et al. Experimental protein mixture for validating tandem mass spectral analysis. OMICS 6, 207–212 (2002).
  4. Rudnick, P.A., Wang, Y., Evans, E., Lee, C.S. & Balgley, B.M. Large scale analysis of MASCOT results using a Mass Accuracy-based THreshold (MATH) effectively improves data interpretation. J. Proteome Res. 4, 1353–1360 (2005).
  5. Klimek, J. et al. The standard protein mix database: a diverse data set to assist in the production of improved Peptide and protein identification software tools. J. Proteome Res. 7, 96–103 (2008).
  6. Mallick, P. et al. Computational prediction of proteotypic peptides for quantitative proteomics. Nat. Biotechnol. 25, 125–131 (2007).
  7. Nesvizhskii, A.I., Vitek, O. & Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods 4, 787–797 (2007).
  8. Mallick, P. & Kuster, B. Proteomics: a pragmatic perspective. Nat. Biotechnol. 28, 695–709 (2010).
  9. Elias, J.E. & Gygi, S.P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 4, 207–214 (2007).
  10. Keller, A., Nesvizhskii, A.I., Kolker, E. & Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74, 5383–5392 (2002).
  11. Nesvizhskii, A.I., Keller, A., Kolker, E. & Aebersold, R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 75, 4646–4658 (2003).
  12. Shteynberg, D. et al. iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol. Cell. Proteomics 10, M111.007690 (2011).
  13. Bohrer, B.C. et al. Combinatorial libraries of synthetic peptides as a model for shotgun proteomics. Anal. Chem. 82, 6559–6568 (2010).
  14. Beausoleil, S.A., Villen, J., Gerber, S.A., Rush, J. & Gygi, S.P. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 24, 1285–1292 (2006).
  15. Bailey, C.M. et al. SLoMo: automated site localization of modifications from ETD/ECD mass spectra. J. Proteome Res. 8, 1965–1971 (2009).
  16. Lemeer, S. et al. Phosphorylation site localization in peptides by MALDI MS/MS and the Mascot Delta Score. Anal. Bioanal. Chem. 402, 249–260 (2012).
  17. Savitski, M.M. et al. Confident phosphorylation site localization using the Mascot Delta Score. Mol. Cell. Proteomics 10, M110.003830 (2011).
  18. Taus, T. et al. Universal and confident phosphorylation site localization using phosphoRS. J. Proteome Res. 10, 5354–5362 (2011).
  19. Daub, H. et al. Kinase-selective enrichment enables quantitative phosphoproteomics of the kinome across the cell cycle. Mol. Cell 31, 438–448 (2008).
  20. Olsen, J.V. et al. Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell 127, 635–648 (2006).
  21. Olsen, J.V. et al. Quantitative phosphoproteomics reveals widespread full phosphorylation site occupancy during mitosis. Sci. Signal. 3, ra3 (2010).
  22. Oppermann, F.S. et al. Large-scale proteomics analysis of the human kinome. Mol. Cell. Proteomics 8, 1751–1764 (2009).
  23. Rikova, K. et al. Global survey of phosphotyrosine signaling identifies oncogenic kinases in lung cancer. Cell 131, 1190–1203 (2007).
  24. Steen, H., Jebanathirajah, J.A., Rush, J., Morrice, N. & Kirschner, M.W. Phosphorylation analysis by mass spectrometry: myths, facts, and the consequences for qualitative and quantitative measurements. Mol. Cell. Proteomics 5, 172–181 (2006).
  25. Krokhin, O.V. Sequence-specific retention calculator. Algorithm for peptide retention prediction in ion-pair RP-HPLC: application to 300- and 100-Å pore size C18 sorbents. Anal. Chem. 78, 7785–7795 (2006).
  26. Jedrychowski, M.P. et al. Evaluation of HCD- and CID-type fragmentation within their respective detection platforms for murine phosphoproteomics. Mol. Cell. Proteomics 10, M111.009910 (2011).
  27. Nagaraj, N., D'Souza, R.C., Cox, J., Olsen, J.V. & Mann, M. Feasibility of large-scale phosphoproteomics with higher energy collisional dissociation fragmentation. J. Proteome Res. 9, 6786–6794 (2010).
  28. Swaney, D.L., McAlister, G.C. & Coon, J.J. Decision tree–driven tandem mass spectrometry for shotgun proteomics. Nat. Methods 5, 959–964 (2008).
  29. Swaney, D.L., Wenger, C.D., Thomson, J.A. & Coon, J.J. Human embryonic stem cell phosphoproteome revealed by electron transfer dissociation tandem mass spectrometry. Proc. Natl. Acad. Sci. USA 106, 995–1000 (2009).
  30. Boersema, P.J., Mohammed, S. & Heck, A.J. Phosphopeptide fragmentation and analysis by mass spectrometry. J. Mass Spectrom. 44, 861–878 (2009).
  31. Frese, C.K. et al. Improved peptide identification by targeted fragmentation using CID, HCD and ETD on an LTQ-Orbitrap Velos. J. Proteome Res. 10, 2377–2388 (2011).
  32. Zhou, H. et al. Enhancing the identification of phosphopeptides from putative basophilic kinase substrates using Ti (IV) based IMAC enrichment. Mol. Cell. Proteomics 10, M110.006452 (2011).
  33. Kapp, E.A. et al. An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: sensitivity and specificity analysis. Proteomics 5, 3475–3490 (2005).
  34. Frank, A.M. et al. Spectral archives: extending spectral libraries to analyze both identified and unidentified spectra. Nat. Methods 8, 587–591 (2011).
  35. Huang, Y. et al. A data-mining scheme for identifying peptide structural motifs responsible for different MS/MS fragmentation intensity patterns. J. Proteome Res. 7, 70–79 (2008).
  36. Lemeer, S. & Heck, A.J. The phosphoproteomics data explosion. Curr. Opin. Chem. Biol. 13, 414–420 (2009).
  37. Baker, P.R., Trinidad, J.C. & Chalkley, R.J. Modification site localization scoring integrated into a search engine. Mol. Cell. Proteomics 10, M111.008078 (2011).
  38. Chalkley, R.J. & Clauser, K.R. Modification site localization scoring: strategies and performance. Mol. Cell. Proteomics 11, 3–14 (2012).
  39. Kelstrup, C.D., Hekmat, O., Francavilla, C. & Olsen, J.V. Pinpointing phosphorylation sites: quantitative filtering and a novel site-specific x-ion fragment. J. Proteome Res. 10, 2937–2948 (2011).
  40. Krokhin, O.V. & Spicer, V. Peptide retention standards and hydrophobicity indexes in reversed-phase high-performance liquid chromatography of peptides. Anal. Chem. 81, 9522–9530 (2009).
  41. Conrads, T.P., Anderson, G.A., Veenstra, T.D., Pasa-Tolic, L. & Smith, R.D. Utility of accurate mass tags for proteome-wide protein identification. Anal. Chem. 72, 3349–3354 (2000).
  42. Moruz, L. et al. Chromatographic retention time prediction for posttranslationally modified peptides. Proteomics 12, 1151–1159 (2012).
  43. Moruz, L., Tomazela, D. & Kall, L. Training, selection, and robust calibration of retention time models for targeted proteomics. J. Proteome Res. 9, 5209–5216 (2010).
  44. Geromanos, S.J. et al. The detection, correlation, and comparison of peptide precursor and product ions from data independent LC-MS with data dependant LC-MS/MS. Proteomics 9, 1683–1695 (2009).
  45. Hoaglund-Hyzer, C.S., Li, J. & Clemmer, D.E. Mobility labeling for parallel CID of ion mixtures. Anal. Chem. 72, 2737–2740 (2000).
  46. Thingholm, T.E., Jensen, O.N. & Larsen, M.R. Analytical strategies for phosphoproteomics. Proteomics 9, 1451–1468 (2009).
  47. Kyte, J. & Doolittle, R.F. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105–132 (1982).
  48. Zhou, H. et al. Robust phosphoproteome enrichment using monodisperse microsphere-based immobilized titanium (IV) ion affinity chromatography. Nat. Protoc. 8, 461–480 (2013).


Author information


  1. Chair for Proteomics and Bioanalytics, Technische Universität München, Freising, Germany.

    • Harald Marx,
    • Simone Lemeer,
    • Jan Erik Schliep &
    • Bernhard Kuster
  2. Biomolecular Mass Spectrometry and Proteomics, Bijvoet Center for Biomolecular Research and Utrecht Institute of Pharmaceutical Sciences, Utrecht University, Utrecht, The Netherlands.

    • Lucrece Matheron,
    • Shabaz Mohammed &
    • Albert J R Heck
  3. The Netherlands Proteomics Centre, The Netherlands.

    • Lucrece Matheron,
    • Shabaz Mohammed &
    • Albert J R Heck
  4. Proteomics and Signal Transduction, Max-Planck Institute of Biochemistry, Martinsried, Germany.

    • Jürgen Cox &
    • Matthias Mann
  5. Center for Integrated Protein Science, Munich, Germany.

    • Bernhard Kuster
  6. Present address: Departments of Chemistry and Biochemistry, University of Oxford, Physical and Theoretical Chemistry Laboratory, Oxford, UK.

    • Shabaz Mohammed


H.M., J.E.S. and B.K. designed the study. S.L., J.E.S., L.M. and S.M. performed experiments. H.M., S.L., L.M. and J.C. analyzed data. H.M., S.L., A.J.R.H., M.M. and B.K. wrote the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (3 MB)

    Supplementary Figures 1–23

Excel files

  1. Supplementary Table 1 (49 KB)

    Sequence, site of phosphorylation within the sequence, length and GRAVY score (hydrophobicity) of the 851 representative sample peptides derived from the consensus of three out of the five publicly available human phosphorylation data sets used in this study.

  2. Supplementary Table 2 (16 KB)

    Peptide sequence, position of the phosphorylation site in the sequence and GRAVY score of the seed peptides used for synthesis of the libraries in this study. For each seed peptide sequence, the final number of peptides in the library is given.

  3. Supplementary Table 3 (88 MB)

    Search and classification results for HCD data acquired on an Orbitrap Velos.

  4. Supplementary Table 4 (60 MB)

    Search and classification results for ETD-FT data acquired on an Orbitrap Velos.

  5. Supplementary Table 5 (541 KB)

    Number of peptide identifications and phosphorylation site localizations at a given global or local false discovery rate (Mascot)

  6. Supplementary Table 6 (565 KB)

    Coefficients for the computation of local and global FDRs and FLRs
