Abstract
In mass-spectrometry-based proteomics, the identification and quantification of peptides and proteins heavily rely on sequence database searching or spectral library matching. The lack of accurate predictive models for fragment ion intensities impairs the realization of the full potential of these approaches. Here, we extended the ProteomeTools synthetic peptide library to 550,000 tryptic peptides and 21 million high-quality tandem mass spectra. We trained a deep neural network, termed Prosit, resulting in chromatographic retention time and fragment ion intensity predictions that exceed the quality of the experimental data. Integrating Prosit into database search pipelines led to more identifications at >10× lower false discovery rates. We show the general applicability of Prosit by predicting spectra for proteases other than trypsin, generating spectral libraries for data-independent acquisition and improving the analysis of metaproteomes. Prosit is integrated into ProteomicsDB, allowing search result re-scoring and custom spectral library generation for any organism on the basis of peptide sequence alone.
Data availability
Reference spectra are available at https://www.proteomicsdb.org, and updates to the resource are available at http://www.proteometools.org. The mass spectrometric raw data of ProteomeTools have been deposited with the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD010595. The MaxQuant and Spectronaut search data including intermediate results underlying the presented analysis have been deposited with the dataset identifier PXD010871. Learned Prosit and Elude models are deposited at https://figshare.com/projects/prosit/35582.
Code availability
Source code and scripts are available on GitHub at https://github.com/kusterlab/prosit.
References
Aebersold, R. & Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355 (2016).
Zhang, Y., Fonslow, B. R., Shan, B., Baek, M.-C. & Yates, J. R. Protein analysis by shotgun/bottom-up proteomics. Chem. Rev. 113, 2343–2394 (2013).
Mallick, P. & Kuster, B. Proteomics: a pragmatic perspective. Nat. Biotechnol. 28, 695 (2010).
Sinitcyn, P., Rudolph, J. D. & Cox, J. Computational methods for understanding mass spectrometry-based shotgun proteomics data. Annu. Rev. Biomed. Data Sci. 1, 207–234 (2018).
Cox, J. et al. Andromeda: a peptide search engine integrated into the MaxQuant environment. J. Proteome Res. 10, 1794–1805 (2011).
Perkins, D. N., Pappin, D. J. C., Creasy, D. M. & Cottrell, J. S. Probability‐based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999).
Eng, J. K., McCormack, A. L. & Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989 (1994).
Stein, S. E. & Scott, D. R. Optimization and testing of mass spectral library search algorithms for compound identification. J. Am. Soc. Mass Spectrom. 5, 859–866 (1994).
Lam, H. et al. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 7, 655–667 (2007).
Schubert, O. T. et al. Building high-quality assay libraries for targeted analysis of SWATH MS data. Nat. Protoc. 10, 426–441 (2015).
Deutsch, E. W. et al. Expanding the use of spectral libraries in proteomics. J. Proteome Res. 17, 4051–4060 (2018).
Gillet, L. C. et al. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol. Cell. Proteomics 11, O111.016717 (2012).
Lange, V., Picotti, P., Domon, B. & Aebersold, R. Selected reaction monitoring for quantitative proteomics: a tutorial. Mol. Syst. Biol. 4, 222 (2008).
Bruderer, R., Bernhardt, O. M., Gandhi, T. & Reiter, L. High‐precision iRT prediction in the targeted analysis of data‐independent acquisition and its impact on identification and quantitation. Proteomics 16, 2246–2256 (2016).
Krokhin, O. V. & Spicer, V. Generation of accurate peptide retention data for targeted and data independent quantitative LC–MS analysis: chromatographic lessons in proteomics. Proteomics 16, 2931–2936 (2016).
Moruz, L. et al. Chromatographic retention time prediction for posttranslationally modified peptides. Proteomics 12, 1151–1159 (2012).
Elias, J. E., Gibbons, F. D., King, O. D., Roth, F. P. & Gygi, S. P. Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nat. Biotechnol. 22, 214–219 (2004).
Arnold, R. J., Jayasankar, N., Aggarwal, D., Tang, H. & Radivojac, P. A machine learning approach to predicting peptide fragmentation spectra. Pac. Symp. Biocomput. 2006, 219–230 (2006).
Frank, A. M. Predicting intensity ranks of peptide fragment ions. J. Proteome Res. 8, 2226–2240 (2009).
Degroeve, S., Maddelein, D. & Martens, L. MS2PIP prediction server: compute and visualize MS2 peak intensity predictions for CID and HCD fragmentation. Nucleic Acids Res. 43, W326–W330 (2015).
Zhou, X.-X. et al. pDeep: predicting MS/MS spectra of peptides with deep learning. Anal. Chem. 89, 12690–12697 (2017).
Zolg, D. et al. PROCAL: a set of 40 peptide standards for retention time indexing, column performance monitoring, and collision energy calibration. Proteomics 17, 1700263 (2017).
Zolg, D. P. et al. Building ProteomeTools based on a complete synthetic human proteome. Nat. Methods 14, 259–262 (2017).
Wu, Y. et al. Google’s neural machine translation system: bridging the gap between human and machine translation. Preprint at https://arxiv.org/abs/1609.08144 (2016).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
Xu, K. et al. Show, attend and tell: neural image caption generation with visual attention. In Proc. International Conference on Machine Learning (eds. Bach, F. & Blei, D.) 2048–2057 (JMLR, 2015).
Krokhin, O. V. Sequence-specific retention calculator. Algorithm for peptide retention prediction in ion-pair RP-HPLC: application to 300- and 100-A pore size C18 sorbents. Anal. Chem. 78, 7785–7795 (2006).
Toprak, U. H. et al. Conserved peptide fragmentation as a benchmarking tool for mass spectrometers and a discriminating feature for targeted proteomics. Mol. Cell. Proteomics 13, 2056–2071 (2014).
Diedrich, J. K., Pinto, A. F. M. & Yates, J. R. Energy dependence of HCD on peptide fragmentation: stepped collisional energy finds the sweet spot. J. Am. Soc. Mass Spectrom. 24, 1690–1699 (2013).
Bekker-Jensen, D. B. et al. An optimized shotgun strategy for the rapid generation of comprehensive human proteomes. Cell Syst. 4, 587–599 (2017).
Bruderer, R. et al. Optimization of experimental parameters in data-independent mass spectrometry significantly increases depth and reproducibility of results. Mol. Cell. Proteomics 16, 2296–2309 (2017).
Fabre, B. et al. Spectral libraries for SWATH-MS assays for Drosophila melanogaster and Solanum lycopersicum. Proteomics 17, 1700216 (2017).
Schmidt, T. et al. ProteomicsDB. Nucleic Acids Res. 46, D1271–D1281 (2017).
Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014).
Elias, J. E. & Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 4, 207–214 (2007).
The, M., MacCoss, M. J., Noble, W. S. & Käll, L. Fast and accurate protein false discovery rates on large-scale proteomics data sets with Percolator 3.0. J. Am. Soc. Mass Spectrom. 27, 1719–1727 (2016).
Shanmugam, A. K. & Nesvizhskii, A. I. Effective leveraging of targeted search spaces for improving peptide identification in tandem mass spectrometry based proteomics. J. Proteome Res. 14, 5169–5178 (2015).
Muth, T., Benndorf, D., Reichl, U., Rapp, E. & Martens, L. Searching for a needle in a stack of needles: challenges in metaproteomics data analysis. Mol. Biosyst. 9, 578–585 (2012).
Rechenberger, J. et al. Challenges in clinical metaproteomics highlighted by the analysis of acute leukemia patients with gut colonization by multidrug-resistant enterobacteriaceae. Proteomes 7, 2 (2019).
Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834 (2014).
Muth, T. R. et al. Navigating through metaproteomics data: a logbook of database searching. Proteomics 15, 3439–3453 (2017).
Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 11, 1114 (2014).
Schumacher, F. R. et al. Building proteomic tool boxes to monitor MHC class I and class II peptides. Proteomics 17, 1600061 (2017).
Zolg, D. et al. ProteomeTools: systematic characterization of 21 post-translational protein modifications by LC-MS/MS using synthetic peptides. Mol. Cell. Proteomics 17, 1850–1863 (2018).
Wang, D. et al. A deep proteome and transcriptome abundance atlas of 29 healthy human tissues. Mol. Syst. Biol. 15, e8503 (2019).
Dorfer, V., Maltsev, S., Winkler, S. & Mechtler, K. CharmeRT: boosting peptide identifications by chimeric spectra identification and retention time prediction. J. Proteome Res. 17, 2581–2589 (2018).
Wenschuh, H. et al. Coherent membrane supports for parallel microsynthesis and screening of bioactive peptides. Pept. Sci. 55, 188–206 (2000).
Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint at https://arxiv.org/abs/1412.3555 (2014).
Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. Preprint at https://arxiv.org/abs/1409.0473 (2014).
Moruz, L., Tomazela, D. & Käll, L. Training, selection, and robust calibration of retention time models for targeted proteomics. J. Proteome Res. 9, 5209–5216 (2010).
Davis, S. et al. Expanding proteome coverage with CHarge Ordered Parallel Ion aNalysis (CHOPIN) combined with broad specificity proteolysis. J. Proteome Res. 16, 1288–1299 (2017).
Acknowledgements
This work was funded in part by the German Federal Ministry of Education and Research (BMBF, grant nos. 031L0008A and 031L0168). The Titan Xp GPUs used in this research were donated by NVIDIA Corporation. The authors thank R. Bruderer (Biognosys) for sharing spectral libraries in textual and editable format, and R. Bruderer and members of the Kuster lab for fruitful discussions.
Author information
Contributions
H.-C.E., S.A., B.K. and M.W. conceived the study. S.G., T.S., D.P.Z., J.R., K.S., J.Z., T.K., U.R., B.D., A.H., B.K. and M.W. designed experiments. S.G., T.S., D.P.Z., P.S., T.K., K.S., J.R., J.Z., B.D. and M.W. performed experiments. S.G., T.S., D.P.Z. and P.S. analyzed data. S.G., T.S., P.S. and M.W. extended the web resource. S.G., T.S., D.P.Z., B.K. and M.W. wrote the manuscript.
Ethics declarations
Competing interests
M.W. and B.K. are founders and shareholders of OmicScouts. They have no operational role in the company. K.S., J.Z., T.K., H.W. and U.R. are employees of JPT. B.D. and A.H. are employees of Thermo Fisher Scientific. S.G., H.-C.E. and S.A. are employees of SAP SE.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
Supplementary Figure 1 Overview of identified peptides in the ProteomeTools project.
(a) Recovery of synthesized peptide sequences across all four new datasets. Bars display the percentage of peptides identified in comparison to the peptides synthesized per pool of ~1000 peptides. Only identifications with an Andromeda Score of at least 50 are considered. (b) Identified peptides over Andromeda score cutoff for both the newly released dataset as well as the complete ProteomeTools peptide library. Numbers at the arbitrary cutoff of 100 are displayed for both datasets, the median Andromeda Score is indicated.
Supplementary Figure 2 The Prosit deep learning model and its training.
(a) Overview of the neural network architecture of the fragment ion intensity prediction model. The model takes precursor charge, normalized collision energy and the peptide sequence as input. First, a specific encoder is trained for every input; the encoder for precursor charge and normalized collision energy consists of one dense layer. The encoder for the peptide sequence is split into an embedding layer connected to two bidirectional recurrent neural networks (BDN) with gated recurrent units (GRU) and an attention layer. Both encoder representations are multiplied element-wise to give a fixed-size latent space representation. The decoder for fragment ion intensity prediction consists of one bidirectional GRU resulting in 6 predictions for up to 29 fragmentation positions. The indexed retention time (iRT) model uses the same encoder but dense layers as decoder. (b) Model performance for 5 random splits of the ProteomeTools data into Training, Test and Holdout. The main panel shows the best-performing models from the 5 random splits of the data. The inset details the median model error with intervals (shaded regions) ranging from the best-performing model to the worst-performing model over the 5 splits for Training, Test and Holdout. (c) Comparison of Pearson correlation and normalized spectral contrast angle (spectral angle for short) as measures of spectral similarity between predicted and measured spectra contained in the holdout set for fragment ion intensity prediction.
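The two similarity measures compared in panel (c) can be illustrated with a minimal sketch. It assumes the commonly used definition of the normalized spectral contrast angle, SA = 1 - 2·arccos(p·m)/π, computed on L2-normalized intensity vectors p and m; the function names and the example intensity vectors are hypothetical and are not taken from the Prosit source code.

```python
import math


def l2_normalize(vector):
    """Scale an intensity vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("intensity vector is all zeros")
    return [x / norm for x in vector]


def spectral_angle(predicted, measured):
    """Normalized spectral contrast angle: 1 = identical, 0 = orthogonal.

    Assumes both vectors list non-negative intensities for the same
    fragment ions in the same order.
    """
    p = l2_normalize(predicted)
    m = l2_normalize(measured)
    cosine = sum(a * b for a, b in zip(p, m))
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical fragment intensities for one peptide.
    predicted = [0.0, 0.2, 1.0, 0.5, 0.1]
    measured = [0.05, 0.25, 0.9, 0.6, 0.0]
    print("spectral angle:", spectral_angle(predicted, measured))
    print("Pearson r:     ", pearson(predicted, measured))
```

Because the arccosine stretches the region near perfect agreement, the spectral angle separates very good matches from nearly perfect ones more clearly than Pearson correlation, which saturates close to 1.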
Supplementary Figure 3 iRT prediction using SSRCalc and Prosit on the ProteomeTools holdout set.
Benchmark of the indexed retention time (iRT) prediction model of Prosit (a) in comparison to SSRCalc (b). Plotted are the predicted and measured iRT values of peptides (dots) in the holdout set. The required iRT window that would encompass 95% of all peptides is indicated by the red dashed lines.
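One way to compute an iRT window that encompasses 95% of all peptides is sketched below. It assumes the window is the central 95% interval of the residuals (predicted minus measured iRT), bounded by the 2.5th and 97.5th percentiles; the caption does not state how the window was derived, so this choice and all names in the code are illustrative assumptions.

```python
def percentile(sorted_values, q):
    """Percentile (q in [0, 100]) of a sorted list, linear interpolation."""
    if not sorted_values:
        raise ValueError("no values given")
    position = (len(sorted_values) - 1) * q / 100.0
    lower_index = int(position)
    upper_index = min(lower_index + 1, len(sorted_values) - 1)
    fraction = position - lower_index
    return (sorted_values[lower_index] * (1.0 - fraction)
            + sorted_values[upper_index] * fraction)


def irt_window_95(predicted, measured):
    """Return (lower, upper, width) of the central 95% residual interval.

    Assumption: the window is bounded by the 2.5th and 97.5th percentiles
    of the residuals predicted - measured.
    """
    residuals = sorted(p - m for p, m in zip(predicted, measured))
    lower = percentile(residuals, 2.5)
    upper = percentile(residuals, 97.5)
    return lower, upper, upper - lower


if __name__ == "__main__":
    # Hypothetical iRT values for five peptides.
    predicted = [10.2, 35.1, 60.4, 82.0, 101.3]
    measured = [11.0, 34.0, 61.5, 80.2, 100.0]
    print(irt_window_95(predicted, measured))
```

A narrower window means that a targeted or DIA analysis can restrict its retention time search range more tightly without losing peptides.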
Supplementary Figure 4 Comparison of Prosit and MS2PIP.
Subsets of the ProteomeTools holdout are used in this figure. None of the peptides and spectra in this dataset were used to train or test Prosit’s fragment intensity model. In boxplots, outliers are not shown, whiskers indicate 1.5 times the interquartile range (IQR), and black horizontal lines indicate median values. For reference, spectral angles of 0.9 and 0.7 are indicated. (a) Benchmark of Prosit’s (green) and MS2PIP’s (orange) fragment ion intensity predictions, each compared to the experimental ProteomeTools spectrum. Data are split by peptide length on a random subset of the ProteomeTools holdout dataset. (b) Same as (a) but split by precursor charge. (c) Same as (a) but split by collision energy. (d) Comparison of Prosit’s and MS2PIP’s fragment ion intensity predictions limited to spectra of the holdout set acquired at NCE 35. Orange dots denote peptides that were (likely) part of MS2PIP’s training data.
Supplementary Figure 5 Collision energy dependency of experimental and predicted spectra.
Heatmap of the median spectral angle when comparing experimental vs experimental (a), experimental vs predicted (b) and predicted vs predicted (c) spectra across 15 different normalized collision energies (NCEs) of ~40 synthetic peptides used for retention time and NCE calibration (Zolg et al. 2017).
Supplementary Figure 6 Evaluation of model overfitting on internal and external datasets.
(a) Comparison of precursor charges of peptides from the Bekker-Jensen tryptic dataset. Peptides that were also part of the ProteomeTools Holdout dataset exhibit a different precursor charge distribution than those that were not. (b) Spectral angle distributions by precursor charge for peptides in the Bekker-Jensen tryptic dataset split by whether they were also part of the ProteomeTools Holdout dataset. (c) Benchmark of Prosit’s (green) and MS2PIP’s (orange) fragment ion intensity prediction on tryptic peptides from the Bekker-Jensen dataset. The top histogram shows spectral angles for peptides that were also synthesized in the ProteomeTools project, but not used for training Prosit. The bottom histogram shows the distribution of spectral angles for peptides not part of ProteomeTools.
Supplementary Figure 7 Effect of the iRT model refinement for external datasets.
Predicted vs experimental retention times using the general Prosit indexed retention time (iRT) prediction model (a), Elude (b), or the refined Prosit model (c) on representative LC-MS/MS measurements for 4 proteases (left to right: Trypsin, Chymotrypsin, Glu-C and Lys-C) from the Bekker-Jensen et al. dataset. Model refinement for Prosit was performed using the tryptic data from the same dataset. The required retention time window that would encompass 95% of all peptides is indicated by the red dashed lines. Sample number n and Pearson correlation are indicated.
Supplementary Figure 8 Effect of the refined Prosit iRT model on DIA spectral libraries.
Evaluation of the general (top) and refined (bottom) indexed retention time (iRT) prediction model of Prosit on C. elegans (a), E. coli (b) and S. cerevisiae (c) data from the Bruderer et al. dataset. For refinement, the project-specific HEK library was used. The required iRT window that would encompass 95% of all peptides is indicated by the red dashed lines. Sample number n and Pearson correlation are indicated.
Supplementary Figure 9 DIA analysis using predicted spectral libraries.
(a) Impact of the retention time refinement using Prosit on the number of identified peptides using either the general (indicated by “-”) or refined (indicated by “+”) Prosit indexed retention time (iRT) prediction model. The number of shared (blue), gained (green) and lost (orange) identified peptide sequences is plotted with respect to the original filtered library. iRT refinement was performed using the experimental retention time of the filtered HEK-293 data. See Supplementary Figure 8 for iRT model refinement analysis. (b) Identical analysis as Figure 4 for S. cerevisiae and E. coli. (c) Re-analysis of Orbitrap/TOF-based data-independent acquisition (DIA)/SWATH datasets using predicted spectral libraries. Data and project-specific spectral libraries were obtained from public repositories. To facilitate comparisons, the original library was filtered to remove entries that Prosit is not yet able to predict (modifications other than oxidized methionine, neutral losses and peptides >30 amino acids). The original and filtered spectral libraries were queried against the DIA data using Spectronaut, and the bar charts depict the number of shared (blue), gained (green) and lost (orange) identified peptide sequences when using the original filtered library compared to the original unfiltered library. (d) Identical analysis as Figure 4; however, protein groups instead of peptides are displayed.
Supplementary Figure 10 Comparison of predicted spectra with QTOF originated spectra.
(a) Density distribution of normalized spectral contrast angles between predicted spectra and QTOF originated spectral libraries (Rosenberger et al., Schubert et al., Fabre et al.). The spectral angle is calculated based on annotated fragment ions, excluding fragments with a neutral loss, fewer than 3 amino acids and m/z <300. (b) Representative mirror spectrum of one predicted spectrum at normalized collision energy (NCE) 30 (top) vs one experimental spectrum contained in the D. melanogaster QTOF library. (c) Number of annotated fragment ions with more than 3 amino acids and m/z >300 per spectrum in the S. cerevisiae library and after prediction. (d) Density distribution of normalized spectral contrast angles between predicted spectra and DDA QTOF spectra for S. cerevisiae (Schubert et al.). Besides neutral loss fragments, all ions were accounted for. (e) Density distribution of normalized spectral contrast angles between predicted spectra and DDA QTOF spectra for S. cerevisiae (Schubert et al.) as a function of the most intense peak in the QTOF DDA spectrum.
Supplementary Figure 11 Prediction performance analysis of Prosit.
(a) Barplot of predicted spectra per second using Prosit’s fragment ion intensity prediction across several datasets investigated in this study, excluding data transformation as well as read and write operations. Numbers in each bar indicate the total number of predicted spectra. (b) Total prediction time including transformation, read and write operations plotted against the number of predicted spectra using Prosit’s fragment ion intensity prediction model for differently sized datasets.
Supplementary Figure 12 Percolator feature weights.
Barplots of final feature weights assigned by Percolator for four different proteases when using the Prosit feature set (see Supplementary Table 5 for a description of the features). The evaluated Percolator models were trained on Bekker-Jensen datasets with proteases (top to bottom): Trypsin, Chymotrypsin, Lys-C and Glu-C.
Supplementary Figure 13 FDR analysis Bekker-Jensen Trypsin.
(a) Percent of shared (blue), gained (green) and lost (red) peptide identifications when using the Prosit score set at different peptide-level FDR cutoffs in comparison to the number of identifications when using the Andromeda score set at 1% peptide-level FDR. (b) Spectral angle distributions of decoy (orange) and false negative classified target (green) peptide spectrum matches (PSMs). The top panels are filtered at 1% peptide-level FDR and the bottom panels are filtered at 0.1% peptide-level FDR. The left panels show the distributions for the Andromeda score set and the right panels for the Prosit score set.
Supplementary Figure 14 FDR comparison of Prosit and MS2PIP on Bekker-Jensen Trypsin and Bekker-Jensen Chymotrypsin.
Number of estimated true positive (#targets - #decoys at the respective false discovery rate (FDR) cutoff) peptide spectrum matches using Percolator at different peptide-level FDR cutoffs when using the Andromeda score set (blue) or the full score set based on intensity predictions (orange) (see Supplementary Table 5 for feature set description). The dashed line indicates the number of true positive identifications when using the Andromeda feature set at 1% peptide-level FDR. Top figures show the analysis for trypsin and bottom figures for chymotrypsin. On the left MS2PIP predictions were used and on the right Prosit predictions.
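The quantity plotted here, #targets - #decoys at a given FDR cutoff, can be illustrated with a simplified target-decoy sketch. It assumes FDR is estimated as #decoys / #targets among all matches scoring at or above a threshold and ignores tied scores; Percolator's own q-value estimation is more elaborate, and the function name and example data below are hypothetical.

```python
def estimated_true_positives(psms, fdr_cutoff):
    """Count accepted matches at an FDR cutoff using target-decoy counts.

    psms: iterable of (score, is_decoy) pairs, higher score = better match.
    Returns (n_targets, n_decoys, n_targets - n_decoys) for the largest
    score-ranked prefix whose estimated FDR (decoys / targets) does not
    exceed fdr_cutoff. Simplification: tied scores are not grouped.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = 0
    decoys = 0
    accepted = (0, 0)
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= fdr_cutoff:
            accepted = (targets, decoys)
    n_targets, n_decoys = accepted
    return n_targets, n_decoys, n_targets - n_decoys


if __name__ == "__main__":
    # Hypothetical scored matches: (score, is_decoy).
    example = [(10, False), (9, False), (8, False),
               (7, True), (6, False), (5, True)]
    for cutoff in (0.0, 0.25, 0.5):
        print(cutoff, estimated_true_positives(example, cutoff))
```

Subtracting the decoy count from the target count corrects for the expected number of false target matches, which is why the caption reports this difference rather than the raw number of accepted targets.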
Supplementary Figure 15 Target-decoy analysis of Prosit and MS2PIP.
(a) Comparison of the spectral angles of MS2PIP predictions to spectral angles of Prosit predictions for target (green) and decoy (orange) peptide spectrum matches (PSMs) generated by Andromeda. A random subset of 10,000 PSMs from the Bekker-Jensen tryptic dataset is shown. The PSM for peptide LVDCLSR is analysed in (c). (b) As in (a), but on a random subset of 10,000 PSMs from the Bekker-Jensen chymotrypsin dataset. (c) Analysis of a PSM for which Prosit’s prediction has a lower spectral angle than MS2PIP’s prediction. The PSM is highlighted in (a). Top: Prosit prediction compared to 12 experimental ProteomeTools spectra. Middle: Prosit prediction compared to the experimental spectrum from Bekker-Jensen. Bottom: MS2PIP prediction compared to the experimental spectrum from Bekker-Jensen. SA and R denote spectral angle and Pearson correlation, respectively.
Supplementary information
Supplementary Information
Supplementary Figs. 1–15 and Supplementary Notes
Supplementary Table 1
Peptide mapping and identifications in ProteomeTools
Supplementary Table 2
NCE effect on fragmentation
Supplementary Table 3
Comparison of Prosit to external data
Supplementary Table 4
DIA results
Supplementary Table 5
DDA results
Supplementary Table 6
Metaproteomics results
About this article
Cite this article
Gessulat, S., Schmidt, T., Zolg, D.P. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat Methods 16, 509–518 (2019). https://doi.org/10.1038/s41592-019-0426-7
DOI: https://doi.org/10.1038/s41592-019-0426-7