Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning

Gessulat, Siegfried; Schmidt, Tobias; Zolg, Daniel Paul; Samaras, Patroklos; Schnatbaum, Karsten; Zerweck, Johannes; Knaute, Tobias; Rechenberger, Julia; Delanghe, Bernard; Huhmer, Andreas; Reimer, Ulf; Ehrlich, Hans-Christian; Aiche, Stephan; Kuster, Bernhard; Wilhelm, Mathias

doi:10.1038/s41592-019-0426-7

Article
Published: 27 May 2019

Nature Methods volume 16, pages 509–518 (2019)

Abstract

In mass-spectrometry-based proteomics, the identification and quantification of peptides and proteins heavily rely on sequence database searching or spectral library matching. The lack of accurate predictive models for fragment ion intensities impairs the realization of the full potential of these approaches. Here, we extended the ProteomeTools synthetic peptide library to 550,000 tryptic peptides and 21 million high-quality tandem mass spectra. We trained a deep neural network, termed Prosit, resulting in chromatographic retention time and fragment ion intensity predictions that exceed the quality of the experimental data. Integrating Prosit into database search pipelines led to more identifications at >10× lower false discovery rates. We show the general applicability of Prosit by predicting spectra for proteases other than trypsin, generating spectral libraries for data-independent acquisition and improving the analysis of metaproteomes. Prosit is integrated into ProteomicsDB, allowing search result re-scoring and custom spectral library generation for any organism on the basis of peptide sequence alone.

**Fig. 1: Accurate retention time and fragment ion intensity prediction by deep learning.**

**Fig. 2: Collision energy calibration yields fragment intensity predictions with near-synthetic peptide spectrum quality.**

**Fig. 3: Evaluation of fragment ion intensity and iRT prediction for non-tryptic peptides.**

**Fig. 4: Prosit enables generation of in silico spectral libraries.**

**Fig. 5: Intensity prediction greatly improves database search quality.**

**Fig. 6: Prosit enables confident identification in large metaproteomic search spaces.**

Data availability

Reference spectra are available at https://www.proteomicsdb.org, and updates to the resource are available at http://www.proteometools.org. The mass spectrometric raw data of ProteomeTools have been deposited with the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD010595. The MaxQuant and Spectronaut search data including intermediate results underlying the presented analysis have been deposited with the dataset identifier PXD010871. Learned Prosit and Elude models are deposited at https://figshare.com/projects/prosit/35582.

Code availability

Source code and scripts are available on GitHub at https://github.com/kusterlab/prosit.

References

Aebersold, R. & Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355 (2016).
Zhang, Y., Fonslow, B. R., Shan, B., Baek, M.-C. & Yates, J. R. Protein analysis by shotgun/bottom-up proteomics. Chem. Rev. 113, 2343–2394 (2013).
Mallick, P. & Kuster, B. Proteomics: a pragmatic perspective. Nat. Biotechnol. 28, 695 (2010).
Sinitcyn, P., Rudolph, J. D. & Cox, J. Computational methods for understanding mass spectrometry-based shotgun proteomics data. Annu. Rev. Biomed. Data Sci. 1, 207–234 (2018).
Cox, J. et al. Andromeda: a peptide search engine integrated into the MaxQuant environment. J. Proteome Res. 10, 1794–1805 (2011).
Perkins, D. N., Pappin, D. J. C., Creasy, D. M. & Cottrell, J. S. Probability‐based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999).
Eng, J. K., McCormack, A. L. & Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989 (1994).
Stein, S. E. & Scott, D. R. Optimization and testing of mass spectral library search algorithms for compound identification. J. Am. Soc. Mass Spectrom. 5, 859–866 (1994).
Lam, H. et al. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 7, 655–667 (2007).
Schubert, O. T. et al. Building high-quality assay libraries for targeted analysis of SWATH MS data. Nat. Protoc. 10, 426–441 (2015).
Deutsch, E. W. et al. Expanding the use of spectral libraries in proteomics. J. Proteome Res. 17, 4051–4060 (2018).
Gillet, L. C. et al. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol. Cell. Proteomics 11, O111.016717 (2012).
Lange, V., Picotti, P., Domon, B. & Aebersold, R. Selected reaction monitoring for quantitative proteomics: a tutorial. Mol. Syst. Biol. 4, 222 (2008).
Bruderer, R., Bernhardt, O. M., Gandhi, T. & Reiter, L. High‐precision iRT prediction in the targeted analysis of data‐independent acquisition and its impact on identification and quantitation. Proteomics 16, 2246–2256 (2016).
Krokhin, O. V. & Spicer, V. Generation of accurate peptide retention data for targeted and data independent quantitative LC–MS analysis: chromatographic lessons in proteomics. Proteomics 16, 2931–2936 (2016).
Moruz, L. et al. Chromatographic retention time prediction for posttranslationally modified peptides. Proteomics 12, 1151–1159 (2012).
Elias, J. E., Gibbons, F. D., King, O. D., Roth, F. P. & Gygi, S. P. Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nat. Biotechnol. 22, 214–219 (2004).
Arnold, R. J., Jayasankar, N., Aggarwal, D., Tang, H. & Radivojac, P. A machine learning approach to predicting peptide fragmentation spectra. Pac. Symp. Biocomput. 2006, 219–230 (2006).
Frank, A. M. Predicting intensity ranks of peptide fragment ions. J. Proteome Res. 8, 2226–2240 (2009).
Degroeve, S., Maddelein, D. & Martens, L. MS2PIP prediction server: compute and visualize MS2 peak intensity predictions for CID and HCD fragmentation. Nucleic Acids Res. 43, W326–W330 (2015).
Zhou, X.-X. et al. pDeep: predicting MS/MS spectra of peptides with deep learning. Anal. Chem. 89, 12690–12697 (2017).
Zolg, D. et al. PROCAL: a set of 40 peptide standards for retention time indexing, column performance monitoring, and collision energy calibration. Proteomics 17, 1700263 (2017).
Zolg, D. P. et al. Building ProteomeTools based on a complete synthetic human proteome. Nat. Methods 14, 259–262 (2017).
Wu, Y. et al. Google’s neural machine translation system: bridging the gap between human and machine translation. Preprint at https://arxiv.org/abs/1609.08144 (2016).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
Xu, K. et al. Show, attend and tell: neural image caption generation with visual attention. In Proc. International Conference on Machine Learning (eds. Bach, F. & Blei, D.) 2048–2057 (JMLR, 2015).
Krokhin, O. V. Sequence-specific retention calculator. Algorithm for peptide retention prediction in ion-pair RP-HPLC: application to 300- and 100-A pore size C18 sorbents. Anal. Chem. 78, 7785–7795 (2006).
Toprak, U. H. et al. Conserved peptide fragmentation as a benchmarking tool for mass spectrometers and a discriminating feature for targeted proteomics. Mol. Cell. Proteomics 13, 2056–2071 (2014).
Diedrich, J. K., Pinto, A. F. M. & Yates, J. R. Energy dependence of HCD on peptide fragmentation: stepped collisional energy finds the sweet spot. J. Am. Soc. Mass Spectrom. 24, 1690–1699 (2013).
Bekker-Jensen, D. B. et al. An optimized shotgun strategy for the rapid generation of comprehensive human proteomes. Cell Syst. 4, 587–599 (2017).
Bruderer, R. et al. Optimization of experimental parameters in data-independent mass spectrometry significantly increases depth and reproducibility of results. Mol. Cell. Proteomics 16, 2296–2309 (2017).
Fabre, B. et al. Spectral libraries for SWATH-MS assays for Drosophila melanogaster and Solanum lycopersicum. Proteomics 17, 1700216 (2017).
Schmidt, T. et al. ProteomicsDB. Nucleic Acids Res. 46, D1271–D1281 (2017).
Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014).
Elias, J. E. & Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 4, 207–214 (2007).
The, M., MacCoss, M. J., Noble, W. S. & Käll, L. Fast and accurate protein false discovery rates on large-scale proteomics data sets with Percolator 3.0. J. Am. Soc. Mass Spectrom. 27, 1719–1727 (2016).
Shanmugam, A. K. & Nesvizhskii, A. I. Effective leveraging of targeted search spaces for improving peptide identification in tandem mass spectrometry based proteomics. J. Proteome Res. 14, 5169–5178 (2015).
Muth, T., Benndorf, D., Reichl, U., Rapp, E. & Martens, L. Searching for a needle in a stack of needles: challenges in metaproteomics data analysis. Mol. Biosyst. 9, 578–585 (2012).
Rechenberger, J. et al. Challenges in clinical metaproteomics highlighted by the analysis of acute leukemia patients with gut colonization by multidrug-resistant enterobacteriaceae. Proteomes 7, 2 (2019).
Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834 (2014).
Muth, T. R. et al. Navigating through metaproteomics data: a logbook of database searching. Proteomics 15, 3439–3453 (2017).
Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 11, 1114 (2014).
Schumacher, F. R. et al. Building proteomic tool boxes to monitor MHC class I and class II peptides. Proteomics 17, 1600061 (2017).
Zolg, D. et al. ProteomeTools: systematic characterization of 21 post-translational protein modifications by LC-MS/MS using synthetic peptides. Mol. Cell. Proteomics 17, 1850–1863 (2018).
Wang, D. et al. A deep proteome and transcriptome abundance atlas of 29 healthy human tissues. Mol. Syst. Biol. 15, e8503 (2019).
Dorfer, V., Maltsev, S., Winkler, S. & Mechtler, K. CharmeRT: boosting peptide identifications by chimeric spectra identification and retention time prediction. J. Proteome Res. 17, 2581–2589 (2018).
Wenschuh, H. et al. Coherent membrane supports for parallel microsynthesis and screening of bioactive peptides. Pept. Sci. 55, 188–206 (2000).
Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint at https://arxiv.org/abs/1412.3555 (2014).
Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. Preprint at https://arxiv.org/abs/1409.0473 (2014).
Moruz, L., Tomazela, D. & Käll, L. Training, selection, and robust calibration of retention time models for targeted proteomics. J. Proteome Res. 9, 5209–5216 (2010).
Davis, S. et al. Expanding proteome coverage with CHarge Ordered Parallel Ion aNalysis (CHOPIN) combined with broad specificity proteolysis. J. Proteome Res. 16, 1288–1299 (2017).

Acknowledgements

This work was funded in part by the German Federal Ministry of Education and Research (BMBF, grant no. 031L0008A and no. 031L0168). The Titan Xp GPUs used in this research were donated by the NVIDIA Corporation. The authors thank R. Bruderer (Biognosys) for sharing spectral libraries in textual and editable format, and R. Bruderer and members of the Kuster lab for fruitful discussions.

Author information

These authors contributed equally: Siegfried Gessulat, Tobias Schmidt.

Authors and Affiliations

Chair of Proteomics and Bioanalytics, Technical University of Munich, Freising, Germany
Siegfried Gessulat, Tobias Schmidt, Daniel Paul Zolg, Patroklos Samaras, Julia Rechenberger, Bernhard Kuster & Mathias Wilhelm
SAP SE, Potsdam, Germany
Siegfried Gessulat, Hans-Christian Ehrlich & Stephan Aiche
JPT Peptide Technologies GmbH, Berlin, Germany
Karsten Schnatbaum, Johannes Zerweck, Tobias Knaute & Ulf Reimer
Thermo Fisher Scientific, Bremen, Germany
Bernard Delanghe
Thermo Fisher Scientific, San Jose, CA, USA
Andreas Huhmer
Bavarian Center for Biomolecular Mass Spectrometry, Freising, Germany
Bernhard Kuster

Contributions

H.-C.E., S.A., B.K. and M.W. conceived the study. S.G., T.S., D.P.Z., J.R., K.S., J.Z., T.K., U.R., B.D., A.H., B.K. and M.W. designed experiments. S.G., T.S., D.P.Z., P.S., T.K., K.S., J.R., J.Z., B.D. and M.W. performed experiments. S.G., T.S., D.P.Z. and P.S. analyzed data. S.G., T.S., P.S. and M.W. extended the web resource. S.G., T.S., D.P.Z., B.K. and M.W. wrote the manuscript.

Corresponding authors

Correspondence to Bernhard Kuster or Mathias Wilhelm.

Ethics declarations

Competing interests

M.W. and B.K. are founders and shareholders of OmicScouts. They have no operational role in the company. K.S., J.Z., T.K., H.W. and U.R. are employees of JPT. B.D. and A.H. are employees of Thermo Fisher Scientific. S.G., H.-C.E. and S.A. are employees of SAP SE.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Overview of identified peptides in the ProteomeTools project.

(a) Recovery of synthesized peptide sequences across all four new datasets. Bars display the percentage of peptides identified in comparison to the peptides synthesized per pool of ~1000 peptides. Only identifications with an Andromeda score of at least 50 are considered. (b) Identified peptides over Andromeda score cutoff for both the newly released dataset and the complete ProteomeTools peptide library. Numbers at the arbitrary cutoff of 100 are displayed for both datasets, and the median Andromeda score is indicated.

Supplementary Figure 2 The Prosit deep learning model and its training.

(a) Overview of the neural network architecture of the fragment ion intensity prediction model. The model takes precursor charge, normalized collision energy and the peptide sequence as input. First, a specific encoder is trained for every input; the encoder for precursor charge and normalized collision energy consists of one dense layer. The encoder for the peptide sequence is split into an embedding layer connected to 2 bi-directional recurrent neural networks (BDN) with gated recurrent memory (GRU) units and an attention layer. Both encoder representations are element-wise multiplied to give a fixed-size latent space representation. The decoder for fragment ion intensity prediction consists of one bidirectional GRU resulting in 6 predictions for up to 29 fragmentation positions. The indexed retention time (iRT) model uses the same encoder but dense layers as decoder. (b) Model performance for 5 random splits of the ProteomeTools data into Training, Test and Holdout. The main panel shows the best performing models from the 5 random splits of the data. The inset details the median model error with intervals (shaded regions) ranging from the best performing model to the worst performing model over the 5 splits for Training, Test and Holdout. (c) Comparison of Pearson correlation and normalized spectral contrast angle (spectral angle for short) as measures of spectral similarity between predicted and measured spectra contained in the holdout set for fragment ion intensity prediction.
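The two similarity measures compared in panel (c) can be sketched as follows. The caption does not spell out the formula, so this sketch assumes the commonly used definition of the normalized spectral contrast angle, SA = 1 − 2·arccos(p·m)/π, where p and m are the L2-normalized predicted and measured intensity vectors. The function names are illustrative and are not taken from the Prosit code base.

```python
import math


def l2_normalize(intensities):
    """Scale an intensity vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in intensities))
    return [x / norm for x in intensities] if norm > 0 else list(intensities)


def spectral_angle(predicted, measured):
    """Normalized spectral contrast angle in [0, 1] for non-negative intensities; 1 means identical shape.

    Assumed definition: 1 - 2 * arccos(cosine similarity) / pi.
    """
    p = l2_normalize(predicted)
    m = l2_normalize(measured)
    cosine = sum(a * b for a, b in zip(p, m))
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding just outside [-1, 1]
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical intensities for the same fragment ions in a predicted and a measured spectrum.
predicted = [0.10, 0.80, 0.30, 0.05]
measured = [0.12, 0.75, 0.35, 0.00]
print(spectral_angle(predicted, measured), pearson(predicted, measured))
```

Under this definition, two spectra that differ only by a constant scaling factor have a spectral angle of 1, and spectra with no fragment intensities in common have a spectral angle of 0.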

Supplementary Figure 3 iRT prediction using SSRCalc and Prosit on the ProteomeTools holdout set.

Benchmark of the indexed retention time (iRT) prediction model of Prosit (a) in comparison to SSRCalc (b). Plotted are the predicted and measured iRT values of peptides (dots) in the holdout set. The required iRT window that would encompass 95% of all peptides is indicated by the red dashed lines.
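One way to compute the window that encompasses 95% of all peptides is sketched below. The caption does not state how the window is derived, so this sketch assumes a symmetric window around the diagonal whose half-width is the smallest absolute prediction error that covers 95% of the peptides. The function name `delta_t95` and the example values are hypothetical.

```python
import math


def delta_t95(predicted, measured, coverage=0.95):
    """Width of a symmetric iRT window containing `coverage` of all peptides.

    Assumption: the window is centred on predicted == measured, so its
    half-width is the k-th smallest absolute error, with k = ceil(coverage * n).
    """
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    abs_errors = sorted(abs(p - m) for p, m in zip(predicted, measured))
    k = math.ceil(coverage * len(abs_errors))  # number of peptides that must fall inside
    return 2.0 * abs_errors[k - 1]


# Hypothetical iRT values: 20 peptides with absolute errors 1, 2, ..., 20.
measured = [0.0] * 20
predicted = [float(i) for i in range(1, 21)]
print(delta_t95(predicted, measured))  # 19 of 20 peptides lie within +/-19, so the width is 38.0
```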

Supplementary Figure 4 Comparison of Prosit and MS2PIP.

Subsets of the ProteomeTools holdout are used in this figure. None of the peptides and spectra in this dataset were used to train or test Prosit’s fragment intensity model. In boxplots, outliers are not shown, whiskers indicate 1.5 interquartile range (IQR), and black horizontal lines indicate median values. For reference, spectral angles of 0.9 and 0.7 are indicated. (a) Benchmark of Prosit’s (green) and MS2PIP’s (orange) fragment ion intensity predictions, each compared to the experimental ProteomeTools spectrum. Data are split by peptide length on a random subset of the ProteomeTools holdout dataset. (b) Same as (a) but split by precursor charge. (c) Same as (a) but split by collision energy. (d) Comparison of Prosit’s and MS2PIP’s fragment ion intensity prediction limited to spectra acquired at NCE 35 of the holdout set. Orange dots denote peptides that were (likely) part of MS2PIP’s training data.

Supplementary Figure 5 Collision energy dependency of experimental and predicted spectra.

Heatmap of the median spectral angle when comparing experimental vs experimental (a), experimental vs predicted (b) and predicted vs predicted (c) spectra across 15 different normalized collision energies (NCEs) of ~40 synthetic peptides used for retention time and NCE calibration (Zolg et al. 2017).

Supplementary Figure 6 Evaluation of model overfitting on internal and external datasets.

(a) Comparison of precursor charges of peptides from the Bekker-Jensen tryptic dataset. Peptides that were also part of the ProteomeTools Holdout dataset exhibit a different precursor charge distribution than those that were not. (b) Spectral angle distributions by precursor charge for peptides in the Bekker-Jensen tryptic dataset split by whether they were also part of the ProteomeTools Holdout dataset. (c) Benchmark of Prosit’s (green) and MS2PIP’s (orange) fragment ion intensity prediction on tryptic peptides from the Bekker-Jensen dataset. The top histogram shows spectral angles for peptides that were also synthesized in the ProteomeTools project, but not used for training Prosit. The bottom histogram shows the distribution of spectral angles for peptides not part of ProteomeTools.

Supplementary Figure 7 Effect of the iRT model refinement for external datasets.

Predicted vs experimental retention times using the general Prosit indexed retention time (iRT) prediction model (a), Elude (b), or the refined Prosit model (c) on representative LC-MS/MS measurements for 4 proteases (left to right: Trypsin, Chymotrypsin, Glu-C and Lys-C) from the Bekker-Jensen et al. dataset. Model refinement for Prosit was performed using the tryptic data from the same dataset. The required retention time window that would encompass 95% of all peptides is indicated by the red dashed lines. Sample number n and Pearson correlation are indicated.

Supplementary Figure 8 Effect of the refined Prosit iRT model on DIA spectral libraries.

Evaluation of the general (top) and refined (bottom) indexed retention time (iRT) prediction model of Prosit on C. elegans (a), E. coli (b) and S. cerevisiae (c) data from the Bruderer et al. dataset. For refinement, the project-specific HEK library was used. The required iRT window that would encompass 95% of all peptides is indicated by the red dashed lines. Sample number n and Pearson correlation are indicated.

Supplementary Figure 9 DIA analysis using predicted spectral libraries.

(a) Impact of the retention time refinement using Prosit on the number of identified peptides using either the general (indicated by “-”) or refined (indicated by “+”) Prosit indexed retention time (iRT) prediction model. The number of shared (blue), gained (green) and lost (orange) identified peptide sequences is plotted with respect to the original filtered library. iRT refinement was performed using the experimental retention time of the filtered HEK-293 data. See Supplementary Figure 8 for iRT model refinement analysis. (b) Identical analysis as Figure 4 for S. cerevisiae and E. coli. (c) Re-analysis of Orbitrap/TOF-based data-independent acquisition (DIA)/SWATH datasets using predicted spectral libraries. Data and project-specific spectral libraries were obtained from public repositories. To facilitate comparisons, the original library was filtered to remove entries that Prosit is not yet able to predict (modifications other than oxidized methionine, neutral losses and peptides >30 amino acids). The original and filtered spectral libraries were queried against the DIA data using Spectronaut, and the bar charts depict the number of shared (blue), gained (green) and lost (orange) identified peptide sequences when using the original filtered library compared to the original unfiltered library. (d) Identical analysis as Figure 4, but protein groups instead of peptides are displayed.
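The shared, gained and lost categories used in these panels amount to set operations on the peptide sequences identified with each library. A minimal sketch follows, assuming identifications are compared as plain sequence strings; the function name and the example peptides are hypothetical, and the actual analysis was performed on Spectronaut results.

```python
def compare_identifications(reference_peptides, candidate_peptides):
    """Classify peptide sequences relative to a reference library search.

    shared: identified with both libraries
    gained: identified only with the candidate library
    lost:   identified only with the reference library
    """
    reference = set(reference_peptides)
    candidate = set(candidate_peptides)
    return {
        "shared": reference & candidate,
        "gained": candidate - reference,
        "lost": reference - candidate,
    }


# Hypothetical identifications from an experimental and a predicted library.
experimental_ids = ["LVDCLSR", "AAGELK", "TPEELLR"]
predicted_ids = ["LVDCLSR", "TPEELLR", "GYSFTTTAER", "VATVSLPR"]

result = compare_identifications(experimental_ids, predicted_ids)
print({category: len(peptides) for category, peptides in result.items()})
# {'shared': 2, 'gained': 2, 'lost': 1}
```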

Supplementary Figure 10 Comparison of predicted spectra with QTOF originated spectra.

(a) Density distribution of normalized spectral contrast angles between predicted spectra and QTOF-originated spectral libraries (Rosenberger et al., Schubert et al., Fabre et al.). The spectral angle is calculated based on annotated fragment ions, excluding fragments with a neutral loss, fewer than 3 amino acids or m/z <300. (b) Representative mirror spectrum of one predicted spectrum at normalized collision energy (NCE) 30 (top) vs one experimental spectrum contained in the D. melanogaster QTOF library. (c) Number of annotated fragment ions (more than 3 amino acids and m/z >300) per spectrum in the S. cerevisiae library and after prediction. (d) Density distribution of normalized spectral contrast angles between predicted spectra and DDA QTOF spectra for S. cerevisiae (Schubert et al.). Besides neutral loss fragments, all ions were accounted for. (e) Density distribution of normalized spectral contrast angles between predicted spectra and DDA QTOF spectra for S. cerevisiae (Schubert et al.) as a function of the most intense peak in the QTOF DDA spectrum.

Supplementary Figure 11 Prediction performance analysis of Prosit.

(a) Barplot of predicted spectra per second using Prosit’s fragment ion intensity prediction across several datasets investigated in this study, excluding data transformation as well as read and write operations. Numbers in each bar indicate the total number of predicted spectra. (b) Total prediction time including transformation, read and write operations plotted against the number of predicted spectra using Prosit’s fragment ion intensity prediction model for differently sized datasets.

Supplementary Figure 12 Percolator feature weights.

Barplots of final feature weights assigned by Percolator for four different proteases when using the Prosit feature set (see Supplementary Table 5 for a description of the features). The evaluated Percolator models were trained on Bekker-Jensen datasets with proteases (top to bottom): Trypsin, Chymotrypsin, Lys-C and Glu-C.

Supplementary Figure 13 FDR analysis Bekker-Jensen Trypsin.

(a) Percent of shared (blue), gained (green) and lost (red) peptide identifications when using the Prosit score set at different peptide level FDR cutoffs in comparison to the number of identifications when using the Andromeda score set at 1% peptide level FDR. (b) Spectral angle distributions of decoy (orange) and false negative classified target (green) peptide spectrum matches (PSMs). The top panels are filtered at 1% peptide level FDR and the bottom panels are filtered at 0.1% peptide level FDR. The left panels show the distributions for the Andromeda score set and the right panels for the Prosit score set.

Supplementary Figure 14 FDR comparison of Prosit and MS2PIP on Bekker-Jensen Trypsin and Bekker-Jensen Chymotrypsin.

Number of estimated true positive (#targets - #decoys at the respective false discovery rate (FDR) cutoff) peptide spectrum matches using Percolator at different peptide level FDR cutoffs when using the Andromeda score set (blue) or the full score set based on intensity predictions (orange) (see Supplementary Table 5 for feature set description). The dashed line indicates the number of true positive identifications when using the Andromeda feature set at 1% peptide level FDR. Top figures show the analysis for trypsin and bottom figures for chymotrypsin. MS2PIP predictions were used on the left and Prosit predictions on the right.
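The estimate "#targets - #decoys at the respective FDR cutoff" can be illustrated with a simple target-decoy calculation. This is only a sketch: it assumes FDR is estimated as #decoys / #targets above a score threshold and picks the most permissive threshold that still satisfies the cutoff. Percolator's own q-value estimation is more elaborate, and the function name and example scores below are hypothetical.

```python
def estimated_true_positives(psms, fdr_cutoff=0.01):
    """Estimate true positives as #targets - #decoys at a given FDR cutoff.

    psms: iterable of (score, is_decoy) pairs, where a higher score is better.
    Assumption: FDR at a score threshold is #decoys / #targets above it.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    best = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest point in the ranked list that still meets the cutoff.
        if targets > 0 and decoys / targets <= fdr_cutoff:
            best = targets - decoys
    return best


# Hypothetical scored PSMs: five targets and one decoy.
psms = [(10.0, False), (9.0, False), (8.0, False), (7.0, True), (6.0, False), (5.0, False)]
print(estimated_true_positives(psms, fdr_cutoff=0.25))  # 4 (5 targets - 1 decoy)
print(estimated_true_positives(psms, fdr_cutoff=0.0))   # 3 (only PSMs above the first decoy)
```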

Supplementary Figure 15 Target-decoy analysis of Prosit and MS2PIP.

(a) Comparison of the spectral angles of MS2PIP predictions to spectral angles of Prosit predictions for target (green) and decoy (orange) peptide spectrum matches (PSMs) generated by Andromeda. A random subset of 10,000 PSMs from the Bekker-Jensen tryptic dataset is shown. The PSM for peptide LVDCLSR is analysed in (c). (b) As in (a), but on a random subset of 10,000 PSMs from the Bekker-Jensen chymotrypsin dataset. (c) Analysis of a PSM for which Prosit’s prediction has a lower spectral angle than MS2PIP’s prediction. The PSM is highlighted in (a). Top: Prosit prediction compared to 12 experimental ProteomeTools spectra. Middle: Prosit prediction compared to the experimental spectrum from Bekker-Jensen. Bottom: MS2PIP prediction compared to the experimental spectrum from Bekker-Jensen. SA and R denote spectral angle and Pearson correlation, respectively.

Supplementary information

Supplementary Information

Supplementary Figs. 1–15 and Supplementary Notes

Reporting Summary

Supplementary Table 1

Peptide mapping and identifications in ProteomeTools

Supplementary Table 2

NCE effect on fragmentation

Supplementary Table 3

Comparison of Prosit to external data

Supplementary Table 4

DIA results

Supplementary Table 5

DDA results

Supplementary Table 6

Metaproteomics results

About this article

Cite this article

Gessulat, S., Schmidt, T., Zolg, D.P. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat Methods 16, 509–518 (2019). https://doi.org/10.1038/s41592-019-0426-7

Received: 20 August 2018
Accepted: 18 April 2019
Published: 27 May 2019
Issue Date: June 2019
DOI: https://doi.org/10.1038/s41592-019-0426-7
