DeepLC can predict retention times for peptides that carry as-yet unseen modifications

Bouwmeester, Robbin; Gabriels, Ralf; Hulstaert, Niels; Martens, Lennart; Degroeve, Sven

doi:10.1038/s41592-021-01301-5

Article
Published: 28 October 2021

Nature Methods volume 18, pages 1363–1369 (2021)

Abstract

The inclusion of peptide retention time prediction promises to remove peptide identification ambiguity in complex liquid chromatography–mass spectrometry identification workflows. However, due to the way peptides are encoded in current prediction models, accurate retention times cannot be predicted for modified peptides. This is especially problematic for fledgling open searches, which will benefit from accurate retention time prediction for modified peptides to reduce identification ambiguity. We present DeepLC, a deep learning peptide retention time predictor using peptide encoding based on atomic composition that allows the retention time of (previously unseen) modified peptides to be predicted accurately. We show that DeepLC performs similarly to current state-of-the-art approaches for unmodified peptides and, more importantly, accurately predicts retention times for modifications not seen during training. Moreover, we show that DeepLC’s ability to predict retention times for any modification enables potentially incorrect identifications to be flagged in an open search of a wide variety of proteome data.
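To make the idea of an encoding based on atomic composition concrete, the following is a minimal sketch and not the authors' implementation. It represents each residue as a vector of atom counts and adds the atoms contributed by a modification, so a modification absent from training data still maps onto features the model has seen. The residue compositions are standard chemistry; the function name `encode_peptide`, the small `MODIFICATION_ATOMS` table, and the position-to-name input format are assumptions made for illustration.

```python
from collections import Counter

# Atomic composition of each amino acid residue (free amino acid minus H2O).
RESIDUE_ATOMS = {
    "G": {"C": 2, "H": 3, "N": 1, "O": 1},
    "A": {"C": 3, "H": 5, "N": 1, "O": 1},
    "S": {"C": 3, "H": 5, "N": 1, "O": 2},
    "P": {"C": 5, "H": 7, "N": 1, "O": 1},
    "V": {"C": 5, "H": 9, "N": 1, "O": 1},
    "T": {"C": 4, "H": 7, "N": 1, "O": 2},
    "C": {"C": 3, "H": 5, "N": 1, "O": 1, "S": 1},
    "L": {"C": 6, "H": 11, "N": 1, "O": 1},
    "I": {"C": 6, "H": 11, "N": 1, "O": 1},
    "N": {"C": 4, "H": 6, "N": 2, "O": 2},
    "D": {"C": 4, "H": 5, "N": 1, "O": 3},
    "Q": {"C": 5, "H": 8, "N": 2, "O": 2},
    "K": {"C": 6, "H": 12, "N": 2, "O": 1},
    "E": {"C": 5, "H": 7, "N": 1, "O": 3},
    "M": {"C": 5, "H": 9, "N": 1, "O": 1, "S": 1},
    "H": {"C": 6, "H": 7, "N": 3, "O": 1},
    "F": {"C": 9, "H": 9, "N": 1, "O": 1},
    "R": {"C": 6, "H": 12, "N": 4, "O": 1},
    "Y": {"C": 9, "H": 9, "N": 1, "O": 2},
    "W": {"C": 11, "H": 10, "N": 2, "O": 1},
}

# Illustrative (assumed) lookup of atoms added by a few common modifications.
MODIFICATION_ATOMS = {
    "Oxidation": {"O": 1},
    "Acetyl": {"C": 2, "H": 2, "O": 1},
    "Phospho": {"H": 1, "O": 3, "P": 1},
    "Carbamidomethyl": {"C": 2, "H": 3, "N": 1, "O": 1},
}

# Fixed atom order for the output vectors.
ATOMS = ("C", "H", "N", "O", "S", "P")


def encode_peptide(sequence, modifications=None):
    """Return one atom-count vector (in ATOMS order) per residue.

    `modifications` maps 1-based residue positions to modification names.
    """
    modifications = modifications or {}
    rows = []
    for position, residue in enumerate(sequence, start=1):
        counts = Counter(RESIDUE_ATOMS[residue])
        mod_name = modifications.get(position)
        if mod_name is not None:
            # A modification simply shifts the residue's atom counts.
            counts.update(MODIFICATION_ATOMS[mod_name])
        rows.append([counts[atom] for atom in ATOMS])
    return rows


unmodified = encode_peptide("GM")
oxidized = encode_peptide("GM", {2: "Oxidation"})
```

In this sketch, oxidized methionine differs from plain methionine only by one extra oxygen in its vector, which is the property that lets a model generalize to modifications it was never trained on.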

**Fig. 1: Scatter plots of predicted against observed retention times for three of the largest data sets.**

**Fig. 2: Prediction performance in terms of three metrics for all data sets.**

**Fig. 3: Learning curves for each of the three selected data sets.**

**Fig. 4: The modification that was excluded from training is shown on the horizontal axis, and the vertical axis shows the retention time error (experimental − predicted) when the modification was either not encoded (red) or encoded (blue) during prediction.**

**Fig. 5: Each amino acid that was excluded from training is shown as a circle, where the size and color of the circle indicate the number of remaining training peptides and the chemical property, respectively.**

**Fig. 6: Predicted retention time analysis for open results of human tissue data.**
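The retention time error used in these figures (experimental − predicted) also suggests a simple way to flag suspicious identifications in an open search. The snippet below is a hedged sketch of that idea only: the function name `flag_outliers`, the tuple layout, the example values, and the fixed threshold are all assumptions for illustration and do not come from the article.

```python
def flag_outliers(psms, max_abs_error):
    """Return (peptide, error) pairs whose retention time error is too large.

    `psms` is an iterable of (peptide, observed_rt, predicted_rt) tuples,
    with both retention times in the same unit. The error follows the
    convention experimental - predicted.
    """
    flagged = []
    for peptide, observed_rt, predicted_rt in psms:
        error = observed_rt - predicted_rt
        if abs(error) > max_abs_error:
            flagged.append((peptide, error))
    return flagged


# Hypothetical identifications: (peptide, observed RT, predicted RT) in minutes.
example_psms = [
    ("PEPTIDEK", 30.0, 29.0),
    ("ACDEFGHK", 55.0, 40.0),
]
suspicious = flag_outliers(example_psms, max_abs_error=5.0)
```

In practice the threshold would be chosen from the error distribution of confident identifications in the same run rather than fixed in advance.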

Data availability

Data from the following projects were used to train and evaluate DeepLC: HeLa hf⁴², ProteomeTools⁴⁴, SWATH Library⁴¹, Plasma lumos 1h⁵⁴, DIA HF⁴³, HeLa lumos 2h⁵⁴, Pancreas⁵⁵, Xbridge⁵⁶, ATLANTIS SILICA⁵⁶, LUNA SILICA⁵⁶, LUNA hydrophilic interaction chromatography⁵⁶, strong ion exchange⁵⁶, Yeast 2h⁵⁷, HeLa lumos 1h⁵⁴, Yeast 1h⁵⁷, Arabidopsis⁵⁸, Yeast DeepRT⁵⁹, ProteomeTools PTM³⁰, Plasma lumos 2h⁵⁴ and HeLa DeepRT⁶⁰. The files of each data set and the open search results are available on Zenodo at https://zenodo.org/record/4542884. The raw files on which the open search was performed are available in the PRIDE repository under the identifier PXD000561 (ref. ³²).

Code availability

The following Python (v.3.6) libraries were used in DeepLC: Pandas (v.0.25.1)⁶¹, TensorFlow (v.1.14.0)⁶², Pyteomics (v.4.1.2)⁶³, SciPy (v.1.4.0)⁶⁴, matplotlib (v.3.1.3)⁶⁵, seaborn (v.0.10.0)⁶⁶ and Numpy (v.1.17.3)⁶⁷. Other software used for DeepLC: ThermoRawFileParser⁴⁸ (v.1.2.0), FileZilla (v.3.48.1), MS-GF+ (ref. ⁴⁹) (v.2019.08.26), Percolator⁶⁸ (v.3.4) and open-pFind²⁶ (v.3.1.5). The code used to prepare the data sets, calibrate retention times, generate DeepLC models, make predictions and reproduce the figures is available on Zenodo at https://zenodo.org/record/4542884.

The DeepLC tool including a GUI (Supplementary Fig. 17) is available for download from the following repositories and package indexes:

• GUI: https://github.com/compomics/DeepLC/releases/latest

• Python package: https://pypi.org/project/deeplc/

• Bioconda package: https://bioconda.github.io/recipes/deeplc/README.html

• Biocontainers docker image: https://quay.io/repository/biocontainers/deeplc

• Streamlit webserver: https://iomics.ugent.be/deeplc/

• Source code: https://github.com/compomics/DeepLC.

References

1. Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003).
2. Shishkova, E., Hebert, A. S. & Coon, J. J. Now, more than ever, proteomics needs better chromatography. Cell Syst. 3, 321–324 (2016).
3. Michalski, A., Cox, J. & Mann, M. More than 100,000 detectable peptide species elute in single shotgun proteomics runs but the majority is inaccessible to data-dependent LC–MS/MS. J. Proteome Res. 10, 1785–1793 (2011).
4. Bruderer, R. et al. Extending the limits of quantitative proteome profiling with data-independent acquisition and application to acetaminophen-treated three-dimensional liver microtissues. Mol. Cell. Proteom. 14, 1400–1410 (2015).
5. Moruz, L. & Käll, L. Peptide retention time prediction. Mass Spectrom. Rev. 36, 615–623 (2017).
6. Reimer, J., Spicer, V. & Krokhin, O. V. Application of modern reversed-phase peptide retention prediction algorithms to the Houghten and DeGraw dataset: peptide helicity and its effect on prediction accuracy. J. Chromatogr. A. 1256, 160–168 (2012).
7. Searle, B. C. et al. Chromatogram libraries improve peptide detection and quantification by data independent acquisition mass spectrometry. Nat. Commun. 9, 5128 (2018).
8. Guo, D., Mant, C. T., Taneja, A. K. & Hodges, R. S. Prediction of peptide retention times in reversed-phase high-performance liquid chromatography II. Correlation of observed and predicted peptide retention times factors and influencing the retention times of peptides. J. Chromatogr. A. 359, 519–532 (1986).
9. Meek, J. L. Prediction of peptide retention times in high-pressure liquid chromatography on the basis of amino acid composition. Proc. Natl Acad. Sci. USA 77, 1632–1636 (1980).
10. Palmblad, M., Ramström, M., Markides, K. E., Håkansson, P. & Bergquist, J. Prediction of chromatographic retention and protein identification in liquid chromatography/mass spectrometry. Anal. Chem. 74, 5826–5830 (2002).
11. Moruz, L., Tomazela, D. & Käll, L. Training, selection, and robust calibration of retention time models for targeted proteomics. J. Proteome Res. 9, 5209–5216 (2010).
12. Moruz, L. et al. Chromatographic retention time prediction for posttranslationally modified peptides. Proteomics 12, 1151–1159 (2012).
13. Guan, S., Moran, M. F. & Ma, B. Prediction of LC-MS/MS properties of peptides from sequence by deep learning. Mol. Cell. Proteom. 18, 2099–2107 (2019).
14. Gessulat, S. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat. Methods 16, 509–518 (2019).
15. Ma, C. et al. Improved peptide retention time prediction in liquid chromatography through deep learning. Anal. Chem. 90, 10881–10888 (2018).
16. MacLean, B. et al. Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics 26, 966–968 (2010).
17. C Silva, A. S. et al. Accurate peptide fragmentation predictions allow data driven approaches to replace and improve upon proteomics search engine scoring functions. Bioinformatics 35, 1401–1403 (2019).
18. Bertsch, A. et al. Optimal de novo design of MRM experiments for rapid assay development in targeted proteomics. J. Proteome Res. 9, 2696–2704 (2010).
19. Dorfer, V., Maltsev, S., Winkler, S. & Mechtler, K. CharmeRT: boosting peptide identifications by chimeric spectra identification and retention time prediction. J. Proteome Res. 17, 2581–2589 (2018).
20. Van Puyvelde, B. et al. Removing the hidden data dependency of DIA with predicted spectral libraries. Proteomics 20, 1900306 (2020).
21. Yang, Y. et al. In silico spectral libraries by deep learning facilitate data-independent acquisition proteomics. Nat. Commun. 11, 146 (2020).
22. Searle, B. C. et al. Generating high quality libraries for DIA MS with empirically corrected peptide predictions. Nat. Commun. 11, 1548 (2020).
23. Bouwmeester, R., Gabriels, R., Van Den Bossche, T., Martens, L. & Degroeve, S. The age of data‐driven proteomics: how machine learning enables novel workflows. Proteomics 20, 1900351 (2020).
24. Bittremieux, W., Meysman, P., Noble, W. S. & Laukens, K. Fast open modification spectral library searching through approximate nearest neighbor indexing. J. Proteome Res. 17, 3463–3474 (2018).
25. Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics. Nat. Methods 14, 513–520 (2017).
26. Chi, H. et al. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine. Nat. Biotechnol. 36, 1059–1066 (2018).
27. Na, S., Bandeira, N. & Paek, E. Fast multi-blind modification search through tandem mass spectrometry. Mol. Cell. Proteom. 11, M111.010199 (2012).
28. Creasy, D. M. & Cottrell, J. S. Unimod: protein modifications for mass spectrometry. Proteomics 4, 1534–1536 (2004).
29. Wren, S. A. C. Peak capacity in gradient ultra performance liquid chromatography (UPLC). J. Pharm. Biomed. Anal. 38, 337–343 (2005).
30. Zolg, D. P. et al. ProteomeTools: systematic characterization of 21 post-translational protein modifications by liquid chromatography tandem mass spectrometry (LC-MS/MS) using synthetic peptides. Mol. Cell. Proteom. 17, 1850–1863 (2018).
31. Colaert, N., Degroeve, S., Helsens, K. & Martens, L. Analysis of the resolution limitations of peptide identification algorithms. J. Proteome Res. 10, 5555–5561 (2011).
32. Kim, M. S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).
33. Müller, T. & Winter, D. Systematic evaluation of protein reduction and alkylation reveals massive unspecific side effects by iodine-containing reagents. Mol. Cell. Proteom. 16, 1173–1187 (2017).
34. Salz, R. et al. Personalized proteome: comparing proteogenomics and open variant search approaches for single amino acid variant detection. J. Proteome Res. 20, 3353–3364 (2021).
35. Aicheler, F. et al. Retention time prediction improves identification in nontargeted lipidomics approaches. Anal. Chem. 87, 7698–7704 (2015).
36. Creek, D. J. et al. Toward global metabolomics analysis with hydrophilic interaction liquid chromatography–mass spectrometry: improved metabolite identification by retention time prediction. Anal. Chem. 83, 8703–8710 (2011).
37. Fukushima, K. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Netw. 1, 119–130 (1988).
38. Ranzato, M., Huang, F., Boureau, Y. B. & LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proc. 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA (IEEE, 2007).
39. Parker, J. M. R., Guo, D. & Hodges, R. S. New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: Correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry 25, 5425–5432 (1986).
40. Nair, V. & Hinton, G. E. Rectified Linear Units Improve Restricted Boltzmann Machines https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf (Univ. Toronto, 2010).
41. Rosenberger, G. et al. A repository of assays to quantify 10,000 human proteins by SWATH-MS. Sci. Data 1, 140031 (2014).
42. Kelstrup, C. D. et al. Performance evaluation of the Q exactive HF-X for shotgun proteomics. J. Proteome Res. 17, 727–738 (2018).
43. Bruderer, R. et al. Optimization of experimental parameters in data-independent mass spectrometry significantly increases depth and reproducibility of results. Mol. Cell. Proteom. 16, 2296–2309 (2017).
44. Zolg, D. P. et al. Building ProteomeTools based on a complete synthetic human proteome. Nat. Methods 14, 259–262 (2017).
45. Escher, C. et al. Using iRT, a normalized retention time for more targeted measurement of peptides. Proteomics 12, 1111–1121 (2012).
46. Zolg, D. P. et al. PROCAL: A set of 40 peptide standards for retention time indexing, column performance monitoring, and collision energy calibration. Proteomics 17, 1700263 (2017).
47. Martens, L. et al. PRIDE: the proteomics identifications database. Proteomics 5, 3537–3545 (2005).
48. Hulstaert, N. et al. ThermoRawFileParser: modular, scalable, and cross-platform RAW file conversion. J. Proteome Res. 19, 537–542 (2020).
49. Kim, S. & Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 5, 5277 (2014).
50. Käll, L., Canterbury, J. D., Weston, J., Noble, W. S. & MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 4, 923–925 (2007).
51. Köster, J. & Rahmann, S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012).
52. Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367–1372 (2008).
53. Vizcaíno, J. A. et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol. 32, 223–226 (2014).
54. Li, W. et al. Assessing the relationship between mass window width and retention time scheduling on protein coverage for data-independent acquisition. J. Am. Soc. Mass. Spectrom. 30, 1396–1405 (2019).
55. Wang, D. et al. A deep proteome and transcriptome abundance atlas of 29 healthy human tissues. Mol. Syst. Biol. 15, e8503 (2019).
56. Gussakovsky, D., Neustaeter, H., Spicer, V. & Krokhin, O. V. Sequence-specific model for peptide retention time prediction in strong cation exchange chromatography. Anal. Chem. 89, 11795–11802 (2017).
57. Jarnuczak, A. F. et al. Analysis of intrinsic peptide detectability via integrated label-free and SRM-based absolute quantitative proteomics. J. Proteome Res. 15, 2945–2959 (2016).
58. Mucha, S. et al. The formation of a camalexin biosynthetic metabolon. Plant Cell 31, 2697–2710 (2019).
59. Nagaraj, N. et al. System-wide perturbation analysis with nearly complete coverage of the yeast proteome by single-shot ultra HPLC runs on a bench top orbitrap. Mol. Cell. Proteom. 11, M111.013722 (2012).
60. Sharma, K. et al. Ultradeep human phosphoproteome reveals a distinct regulatory nature of Tyr and Ser/Thr-based signaling. Cell Rep. 8, 1583–1594 (2014).
61. McKinney, W. pandas: a foundational Python library for data analysis and statistics. Python High Perform. Sci. Comput. 1–9, https://www.dlr.de/sc/en/Portaldata/15/Resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf (2011).
62. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint at arXiv.org www.tensorflow.org
63. Levitsky, L. I., Klein, J. A., Ivanov, M. V. & Gorshkov, M. V. Pyteomics 4.0: five years of development of a python proteomics framework. J. Proteome Res. 18, 709–714 (2019).
64. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
65. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
66. Waskom, M. L. Seaborn: statistical data visualization. J. Open Source Softw. 6, 3021 (2021).
67. Oliphant, T. E. A Guide to NumPy Vol. 1 (Trelgol Publishing, 2006).
68. The, M., MacCoss, M. J., Noble, W. S. & Käll, L. Fast and accurate protein false discovery rates on large-scale proteomics data sets with percolator 3.0. J. Am. Soc. Mass. Spectrom. 27, 1719–1727 (2016).

Acknowledgements

R.B. acknowledges funding from the Marie Sklodowska-Curie EU Framework for Research and Innovation Horizon 2020 MASSTRPLAN (grant no. 675132) and Vlaams Agentschap Innoveren en Ondernemen under project number HBC.2020.2205. R.G. acknowledges funding from the Research Foundation Flanders (FWO) (grant no. 1S50918N). S.D. and L.M. acknowledge funding from the European Union’s Horizon 2020 Programme (grant nos. H2020-INFRAIA-2018-1 and 823839). N.H. and L.M. acknowledge funding from the Research Foundation Flanders (FWO) (grant nos. G042518N and G028821N). L.M. acknowledges funding from Ghent University Concerted Research Action (grant no. BOF21-GOA-033).

Author information

Authors and Affiliations

VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
Robbin Bouwmeester, Ralf Gabriels, Niels Hulstaert, Lennart Martens & Sven Degroeve
Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
Robbin Bouwmeester, Ralf Gabriels, Niels Hulstaert, Lennart Martens & Sven Degroeve

Contributions

R.B., R.G. and S.D. conceived the study. R.B., R.G., L.M. and S.D. designed the experiments, analyzed the results and wrote the paper. R.G. made the tool available in Python package repositories. N.H. and R.B. built the graphical user interface.

Corresponding author

Correspondence to Lennart Martens.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Methods thanks Lukas Reiter and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Arunima Singh was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–17 and Tables 1–4.

Reporting Summary

Supplementary Table 5

Concatenation of all peptides used for training, validation and evaluation. For each peptide, the randomly assigned split group is provided.

About this article

Cite this article

Bouwmeester, R., Gabriels, R., Hulstaert, N. et al. DeepLC can predict retention times for peptides that carry as-yet unseen modifications. Nat Methods 18, 1363–1369 (2021). https://doi.org/10.1038/s41592-021-01301-5

Received: 15 April 2020
Accepted: 13 September 2021
Published: 28 October 2021
Issue Date: November 2021
DOI: https://doi.org/10.1038/s41592-021-01301-5
