DeepLC can predict retention times for peptides that carry as-yet unseen modifications

Abstract

The inclusion of peptide retention time prediction promises to remove peptide identification ambiguity in complex liquid chromatography–mass spectrometry identification workflows. However, due to the way peptides are encoded in current prediction models, accurate retention times cannot be predicted for modified peptides. This is especially problematic for fledgling open searches, which will benefit from accurate retention time prediction for modified peptides to reduce identification ambiguity. We present DeepLC, a deep learning peptide retention time predictor using peptide encoding based on atomic composition that allows the retention time of (previously unseen) modified peptides to be predicted accurately. We show that DeepLC performs similarly to current state-of-the-art approaches for unmodified peptides and, more importantly, accurately predicts retention times for modifications not seen during training. Moreover, we show that DeepLC’s ability to predict retention times for any modification enables potentially incorrect identifications to be flagged in an open search of a wide variety of proteome data.
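The idea of encoding a peptide by atomic composition rather than by amino acid identity can be illustrated with a minimal sketch. This is not DeepLC's actual feature encoder: the function name `encode_peptide`, the `MODS` table and the 1-based position convention are assumptions made for illustration. The point it shows is that a modification never seen during training still maps onto the same atom-count features as everything else.

```python
from typing import Dict, Iterable, List, Tuple

# Atoms tracked per residue (illustrative choice, not taken from the paper).
ATOMS = ("C", "H", "N", "O", "S", "P")

# Atomic composition of each amino acid residue (free amino acid minus H2O).
RESIDUES: Dict[str, Dict[str, int]] = {
    "A": {"C": 3, "H": 5, "N": 1, "O": 1},
    "R": {"C": 6, "H": 12, "N": 4, "O": 1},
    "N": {"C": 4, "H": 6, "N": 2, "O": 2},
    "D": {"C": 4, "H": 5, "N": 1, "O": 3},
    "C": {"C": 3, "H": 5, "N": 1, "O": 1, "S": 1},
    "E": {"C": 5, "H": 7, "N": 1, "O": 3},
    "Q": {"C": 5, "H": 8, "N": 2, "O": 2},
    "G": {"C": 2, "H": 3, "N": 1, "O": 1},
    "H": {"C": 6, "H": 7, "N": 3, "O": 1},
    "I": {"C": 6, "H": 11, "N": 1, "O": 1},
    "L": {"C": 6, "H": 11, "N": 1, "O": 1},
    "K": {"C": 6, "H": 12, "N": 2, "O": 1},
    "M": {"C": 5, "H": 9, "N": 1, "O": 1, "S": 1},
    "F": {"C": 9, "H": 9, "N": 1, "O": 1},
    "P": {"C": 5, "H": 7, "N": 1, "O": 1},
    "S": {"C": 3, "H": 5, "N": 1, "O": 2},
    "T": {"C": 4, "H": 7, "N": 1, "O": 2},
    "W": {"C": 11, "H": 10, "N": 2, "O": 1},
    "Y": {"C": 9, "H": 9, "N": 1, "O": 2},
    "V": {"C": 5, "H": 9, "N": 1, "O": 1},
}

# Hypothetical lookup of atoms added by a modification to the unmodified residue.
MODS: Dict[str, Dict[str, int]] = {
    "Oxidation": {"O": 1},
    "Carbamidomethyl": {"C": 2, "H": 3, "N": 1, "O": 1},
    "Phospho": {"H": 1, "O": 3, "P": 1},
    "Acetyl": {"C": 2, "H": 2, "O": 1},
}


def encode_peptide(
    sequence: str, modifications: Iterable[Tuple[int, str]] = ()
) -> List[List[int]]:
    """Return one row of atom counts per residue, in ATOMS order.

    `modifications` holds (1-based position, modification name) pairs.
    """
    rows = [[RESIDUES[aa].get(atom, 0) for atom in ATOMS] for aa in sequence]
    for position, name in modifications:
        delta = MODS[name]
        for i, atom in enumerate(ATOMS):
            rows[position - 1][i] += delta.get(atom, 0)
    return rows


# Example: carbamidomethylated cysteine at position 2, oxidized methionine at 3.
matrix = encode_peptide("ACM", [(2, "Carbamidomethyl"), (3, "Oxidation")])
```

Because the model input is a matrix of atom counts, adding a new entry to the modification table is enough to encode a new modification; no new input symbol has to be learned.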

Fig. 1: Scatter plots of predicted against observed retention times for three of the largest data sets.
Fig. 2: Prediction performance in terms of three metrics for all data sets.
Fig. 3: Learning curves for each of the three selected data sets.
Fig. 4: The modification that was excluded for training is shown on the horizontal axis, and the vertical axis shows the retention time error (experimental − predicted) when the modification was either not encoded (red) or encoded during the predictions (blue).
Fig. 5: Each amino acid that was excluded for training is shown as a circle, where the size and color of the circle indicate the number of remaining training peptides and the chemical property, respectively.
Fig. 6: Predicted retention time analysis for open search results of human tissue data.
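The retention time error used in these figures (experimental − predicted) also underlies the flagging of potentially incorrect identifications mentioned in the abstract. The following is a minimal sketch of that idea, not the procedure used in the paper: the function name `flag_outliers`, the tuple layout, the example values and the fixed threshold are all hypothetical.

```python
from typing import List, Tuple


def flag_outliers(
    psms: List[Tuple[str, float, float]], max_abs_error: float
) -> List[Tuple[str, float]]:
    """Return (peptide, error) for identifications with a large error.

    Each input tuple is (peptide, observed_rt, predicted_rt) and the
    error is computed as observed minus predicted.
    """
    flagged = []
    for peptide, observed_rt, predicted_rt in psms:
        error = observed_rt - predicted_rt
        if abs(error) > max_abs_error:
            flagged.append((peptide, error))
    return flagged


# Hypothetical identifications with retention times in minutes.
example_psms = [
    ("PEPTIDEK", 30.0, 29.0),
    ("ACDEFGHK", 45.0, 60.5),
]
suspicious = flag_outliers(example_psms, max_abs_error=5.0)
```

In practice the threshold would be chosen from the error distribution of confident identifications on the same run rather than fixed in advance.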

Data availability

Data from the following projects were used to train and evaluate DeepLC: HeLa hf (ref. 42), ProteomeTools (ref. 44), SWATH Library (ref. 41), Plasma lumos 1h (ref. 54), DIA HF (ref. 43), HeLa lumos 2h (ref. 54), Pancreas (ref. 55), Xbridge (ref. 56), ATLANTIS SILICA (ref. 56), LUNA SILICA (ref. 56), LUNA hydrophilic interaction chromatography (ref. 56), strong ion exchange (ref. 56), Yeast 2h (ref. 57), HeLa lumos 1h (ref. 54), Yeast 1h (ref. 57), Arabidopsis (ref. 58), Yeast DeepRT (ref. 59), ProteomeTools PTM (ref. 30), Plasma lumos 2h (ref. 54) and HeLa DeepRT (ref. 60). The files of each data set and the open search results are available on Zenodo at https://zenodo.org/record/4542884. The raw files on which the open search was performed are available in the PRIDE repository under the identifier PXD000561 (ref. 32).

Code availability

The following Python (v.3.6) libraries were used in DeepLC: Pandas (v.0.25.1; ref. 61), TensorFlow (v.1.14.0; ref. 62), Pyteomics (v.4.1.2; ref. 63), SciPy (v.1.4.0; ref. 64), matplotlib (v.3.1.3; ref. 65), seaborn (v.0.10.0; ref. 66) and Numpy (v.1.17.3; ref. 67). Other software used for DeepLC: ThermoRawFileParser (v.1.2.0; ref. 48), FileZilla (v.3.48.1), MS-GF+ (v.2019.08.26; ref. 49), Percolator (v.3.4; ref. 68) and open-pFind (v.3.1.5; ref. 26). Code used to prepare the data sets, calibrate retention times, generate DeepLC models, make predictions and reproduce the figures is available on Zenodo at https://zenodo.org/record/4542884.
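Retention time calibration maps model predictions onto the time scale of a specific chromatographic run. This section does not describe how the calibration is done, so the sketch below assumes the simplest possible variant, an ordinary least-squares line fitted on a few calibration peptides. The function names and example values are hypothetical, and the code on Zenodo may use a different scheme.

```python
from typing import Sequence, Tuple


def fit_linear_calibration(
    predicted: Sequence[float], observed: Sequence[float]
) -> Tuple[float, float]:
    """Fit observed = slope * predicted + intercept by least squares."""
    n = len(predicted)
    mean_x = sum(predicted) / n
    mean_y = sum(observed) / n
    sxx = sum((x - mean_x) ** 2 for x in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predicted, observed))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def calibrate(predicted_rt: float, slope: float, intercept: float) -> float:
    """Map a raw model prediction onto the time scale of the current run."""
    return slope * predicted_rt + intercept


# Hypothetical calibration peptides: raw predictions and observed times.
slope, intercept = fit_linear_calibration([10.0, 20.0, 30.0], [25.0, 45.0, 65.0])
calibrated = calibrate(40.0, slope, intercept)
```

A single straight line is only adequate when the two gradients differ by a shift and a scale; nonlinear gradients would need a more flexible mapping.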

The DeepLC tool including a GUI (Supplementary Fig. 17) is available for download from the following repositories and package indexes:

• GUI: https://github.com/compomics/DeepLC/releases/latest

• Python package: https://pypi.org/project/deeplc/

• Bioconda package: https://bioconda.github.io/recipes/deeplc/README.html

• Biocontainers docker image: https://quay.io/repository/biocontainers/deeplc

• Streamlit webserver: https://iomics.ugent.be/deeplc/

• Source code: https://github.com/compomics/DeepLC.

References

  1. Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003).

  2. Shishkova, E., Hebert, A. S. & Coon, J. J. Now, more than ever, proteomics needs better chromatography. Cell Syst. 3, 321–324 (2016).

  3. Michalski, A., Cox, J. & Mann, M. More than 100,000 detectable peptide species elute in single shotgun proteomics runs but the majority is inaccessible to data-dependent LC–MS/MS. J. Proteome Res. 10, 1785–1793 (2011).

  4. Bruderer, R. et al. Extending the limits of quantitative proteome profiling with data-independent acquisition and application to acetaminophen-treated three-dimensional liver microtissues. Mol. Cell. Proteom. 14, 1400–1410 (2015).

  5. Moruz, L. & Käll, L. Peptide retention time prediction. Mass Spectrom. Rev. 36, 615–623 (2017).

  6. Reimer, J., Spicer, V. & Krokhin, O. V. Application of modern reversed-phase peptide retention prediction algorithms to the Houghten and DeGraw dataset: peptide helicity and its effect on prediction accuracy. J. Chromatogr. A. 1256, 160–168 (2012).

  7. Searle, B. C. et al. Chromatogram libraries improve peptide detection and quantification by data independent acquisition mass spectrometry. Nat. Commun. 9, 5128 (2018).

  8. Guo, D., Mant, C. T., Taneja, A. K. & Hodges, R. S. Prediction of peptide retention times in reversed-phase high-performance liquid chromatography II. Correlation of observed and predicted peptide retention times and factors influencing the retention times of peptides. J. Chromatogr. A. 359, 519–532 (1986).

  9. Meek, J. L. Prediction of peptide retention times in high-pressure liquid chromatography on the basis of amino acid composition. Proc. Natl Acad. Sci. USA 77, 1632–1636 (1980).

  10. Palmblad, M., Ramström, M., Markides, K. E., Håkansson, P. & Bergquist, J. Prediction of chromatographic retention and protein identification in liquid chromatography/mass spectrometry. Anal. Chem. 74, 5826–5830 (2002).

  11. Moruz, L., Tomazela, D. & Käll, L. Training, selection, and robust calibration of retention time models for targeted proteomics. J. Proteome Res. 9, 5209–5216 (2010).

  12. Moruz, L. et al. Chromatographic retention time prediction for posttranslationally modified peptides. Proteomics 12, 1151–1159 (2012).

  13. Guan, S., Moran, M. F. & Ma, B. Prediction of LC-MS/MS properties of peptides from sequence by deep learning. Mol. Cell. Proteom. 18, 2099–2107 (2019).

  14. Gessulat, S. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat. Methods 16, 509–518 (2019).

  15. Ma, C. et al. Improved peptide retention time prediction in liquid chromatography through deep learning. Anal. Chem. 90, 10881–10888 (2018).

  16. MacLean, B. et al. Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics 26, 966–968 (2010).

  17. C Silva, A. S. et al. Accurate peptide fragmentation predictions allow data driven approaches to replace and improve upon proteomics search engine scoring functions. Bioinformatics 35, 1401–1403 (2019).

  18. Bertsch, A. et al. Optimal de novo design of MRM experiments for rapid assay development in targeted proteomics. J. Proteome Res. 9, 2696–2704 (2010).

  19. Dorfer, V., Maltsev, S., Winkler, S. & Mechtler, K. CharmeRT: boosting peptide identifications by chimeric spectra identification and retention time prediction. J. Proteome Res. 17, 2581–2589 (2018).

  20. Van Puyvelde, B. et al. Removing the hidden data dependency of DIA with predicted spectral libraries. Proteomics 20, 1900306 (2020).

  21. Yang, Y. et al. In silico spectral libraries by deep learning facilitate data-independent acquisition proteomics. Nat. Commun. 11, 146 (2020).

  22. Searle, B. C. et al. Generating high quality libraries for DIA MS with empirically corrected peptide predictions. Nat. Commun. 11, 1548 (2020).

  23. Bouwmeester, R., Gabriels, R., Van Den Bossche, T., Martens, L. & Degroeve, S. The age of data‐driven proteomics: how machine learning enables novel workflows. Proteomics 20, 1900351 (2020).

  24. Bittremieux, W., Meysman, P., Noble, W. S. & Laukens, K. Fast open modification spectral library searching through approximate nearest neighbor indexing. J. Proteome Res. 17, 3463–3474 (2018).

  25. Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics. Nat. Methods 14, 513–520 (2017).

  26. Chi, H. et al. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine. Nat. Biotechnol. 36, 1059–1066 (2018).

  27. Na, S., Bandeira, N. & Paek, E. Fast multi-blind modification search through tandem mass spectrometry. Mol. Cell Proteomics 11, M111.010199 (2012).

  28. Creasy, D. M. & Cottrell, J. S. Unimod: protein modifications for mass spectrometry. Proteomics 4, 1534–1536 (2004).

  29. Wren, S. A. C. Peak capacity in gradient ultra performance liquid chromatography (UPLC). J. Pharm. Biomed. Anal. 38, 337–343 (2005).

  30. Paul Zolg, D. et al. Proteometools: systematic characterization of 21 post-translational protein modifications by liquid chromatography tandem mass spectrometry (LC-MS/MS) using synthetic peptides. Mol. Cell. Proteom. 17, 1850–1863 (2018).

  31. Colaert, N., Degroeve, S., Helsens, K. & Martens, L. Analysis of the resolution limitations of peptide identification algorithms. J. Proteome Res. 10, 5555–5561 (2011).

  32. Kim, M. S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).

  33. Müller, T. & Winter, D. Systematic evaluation of protein reduction and alkylation reveals massive unspecific side effects by iodine-containing reagents. Mol. Cell. Proteom. 16, 1173–1187 (2017).

  34. Salz, R. et al. Personalized proteome: comparing proteogenomics and open variant search approaches for single amino acid variant detection. J. Proteome Res. 20, 3353–3364 (2021).

  35. Aicheler, F. et al. Retention time prediction improves identification in nontargeted lipidomics approaches. Anal. Chem. 87, 7698–7704 (2015).

  36. Creek, D. J. et al. Toward global metabolomics analysis with hydrophilic interaction liquid chromatography–mass spectrometry: improved metabolite identification by retention time prediction. Anal. Chem. 83, 8703–8710 (2011).

  37. Fukushima, K. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Netw. 1, 119–130 (1988).

  38. Ranzato, M., Huang, F., Boureau, Y. B. & LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proc. 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA (IEEE, 2007).

  39. Parker, J. M. R., Guo, D. & Hodges, R. S. New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: Correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry 25, 5425–5432 (1986).

  40. Nair, V. & Hinton, G. E. Rectified Linear Units Improve Restricted Boltzmann Machines https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf (Univ. Toronto, 2010).

  41. Rosenberger, G. et al. A repository of assays to quantify 10,000 human proteins by SWATH-MS. Sci. Data 1, 140031 (2014).

  42. Kelstrup, C. D. et al. Performance evaluation of the Q exactive HF-X for shotgun proteomics. J. Proteome Res. 17, 727–738 (2018).

  43. Bruderer, R. et al. Optimization of experimental parameters in data-independent mass spectrometry significantly increases depth and reproducibility of results. Mol. Cell. Proteom. 16, 2296–2309 (2017).

  44. Zolg, D. P. et al. Building ProteomeTools based on a complete synthetic human proteome. Nat. Methods 14, 259–262 (2017).

  45. Escher, C. et al. Using iRT, a normalized retention time for more targeted measurement of peptides. Proteomics 12, 1111–1121 (2012).

  46. Zolg, D. P. et al. PROCAL: A set of 40 peptide standards for retention time indexing, column performance monitoring, and collision energy calibration. Proteomics 17, 1700263 (2017).

  47. Martens, L. et al. PRIDE: the proteomics identifications database. Proteomics 5, 3537–3545 (2005).

  48. Hulstaert, N. et al. ThermoRawFileParser: modular, scalable, and cross-platform RAW file conversion. J. Proteome Res. 19, 537–542 (2020).

  49. Kim, S. & Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 5, 5277 (2014).

  50. Käll, L., Canterbury, J. D., Weston, J., Noble, W. S. & MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 4, 923–925 (2007).

  51. Köster, J. & Rahmann, S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012).

  52. Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367–1372 (2008).

  53. Vizcaíno, J. A. et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol. 32, 223–226 (2014).

  54. Li, W. et al. Assessing the relationship between mass window width and retention time scheduling on protein coverage for data-independent acquisition. J. Am. Soc. Mass. Spectrom. 30, 1396–1405 (2019).

  55. Wang, D. et al. A deep proteome and transcriptome abundance atlas of 29 healthy human tissues. Mol. Syst. Biol. 15, e8503 (2019).

  56. Gussakovsky, D., Neustaeter, H., Spicer, V. & Krokhin, O. V. Sequence-specific model for peptide retention time prediction in strong cation exchange chromatography. Anal. Chem. 89, 11795–11802 (2017).

  57. Jarnuczak, A. F. et al. Analysis of intrinsic peptide detectability via integrated label-free and SRM-based absolute quantitative proteomics. J. Proteome Res. 15, 2945–2959 (2016).

  58. Mucha, S. et al. The formation of a camalexin biosynthetic metabolon. Plant Cell 31, 2697–2710 (2019).

  59. Nagaraj, N. et al. System-wide perturbation analysis with nearly complete coverage of the yeast proteome by single-shot ultra HPLC runs on a bench top orbitrap. Mol. Cell. Proteomics 11, M111.013722 (2012).

  60. Sharma, K. et al. Ultradeep human phosphoproteome reveals a distinct regulatory nature of Tyr and Ser/Thr-based signaling. Cell Rep. 8, 1583–1594 (2014).

  61. McKinney, W. pandas: a foundational Python library for data analysis and statistics. Python High Perform. Sci. Comput. 1–9, https://www.dlr.de/sc/en/Portaldata/15/Resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf (2011).

  62. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint at arXiv.org www.tensorflow.org

  63. Levitsky, L. I., Klein, J. A., Ivanov, M. V. & Gorshkov, M. V. Pyteomics 4.0: five years of development of a python proteomics framework. J. Proteome Res. 18, 709–714 (2019).

  64. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

  65. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).

  66. Waskom, M. L. Seaborn: statistical data visualization. J. Open Source Softw. 6, 3021 (2021).

  67. Oliphant, T. E. A Guide to NumPy Vol. 1 (Trelgol Publishing, 2006).

  68. The, M., MacCoss, M. J., Noble, W. S. & Käll, L. Fast and accurate protein false discovery rates on large-scale proteomics data sets with percolator 3.0. J. Am. Soc. Mass. Spectrom. 27, 1719–1727 (2016).

Acknowledgements

R.B. acknowledges funding from the Marie Sklodowska-Curie EU Framework for Research and Innovation Horizon 2020 MASSTRPLAN (grant no. 675132) and Vlaams Agentschap Innoveren en Ondernemen under project number HBC.2020.2205. R.G. acknowledges funding from the Research Foundation Flanders (FWO) (grant no. 1S50918N). S.D. and L.M. acknowledge funding from the European Union’s Horizon 2020 Programme (grant nos. H2020-INFRAIA-2018-1 and 823839). N.H. and L.M. acknowledge funding from the Research Foundation Flanders (FWO) (grant nos. G042518N and G028821N). L.M. acknowledges funding from Ghent University Concerted Research Action (grant no. BOF21-GOA-033).

Author information

Contributions

R.B., R.G. and S.D. conceived the study. R.B., R.G., L.M. and S.D. designed the experiments, analyzed the results and wrote the paper. R.G. made the tool available in Python package repositories. N.H. and R.B. built the graphical user interface.

Corresponding author

Correspondence to Lennart Martens.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Methods thanks Lukas Reiter and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Arunima Singh was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–17 and Tables 1–4.

Reporting Summary

Supplementary Table 5

Concatenation of all peptides used for training, validation and evaluation. For each peptide, the randomly assigned split group is provided.

About this article

Cite this article

Bouwmeester, R., Gabriels, R., Hulstaert, N. et al. DeepLC can predict retention times for peptides that carry as-yet unseen modifications. Nat Methods 18, 1363–1369 (2021). https://doi.org/10.1038/s41592-021-01301-5

  • DOI: https://doi.org/10.1038/s41592-021-01301-5
