Deep learning prediction of glycopeptide tandem mass spectra powers glycoproteomics

A preprint version of the article is available at bioRxiv.

Abstract

Protein glycosylation, a post-translational modification of proteins by glycans, plays an important role in numerous physiological and pathological cellular functions. Glycoproteomics, the study of protein glycosylation on a proteome-wide scale, uses liquid chromatography coupled with tandem mass spectrometry (MS/MS) to obtain combined information on glycosylation site, glycosylation level and glycan structure. However, current database searching methods for glycoproteomics often struggle with glycan structure determination because structure-determining ions occur only rarely. Although spectral searching methods can leverage fragment intensity to facilitate the structure identification of glycopeptides, their application is hindered by difficulties in spectral library construction. In this work, we present DeepGP, a hybrid deep learning framework based on transformer and graph neural networks, for the prediction of MS/MS spectra and retention time of glycopeptides. Two graph neural network modules are employed to capture the branched glycan structure and predict glycan ion intensity, respectively. Additionally, a pretraining strategy is implemented to alleviate the insufficiency of glycoproteomics data. When tested on multiple biological datasets, DeepGP accurately predicts MS/MS spectra and retention time of glycopeptides, closely aligning with the experimental results. Comprehensive benchmarking of DeepGP on synthetic and biological datasets validates its effectiveness in distinguishing similar glycans. Evaluated with various decoy methods, DeepGP in combination with database searching can increase glycopeptide detection sensitivity. We anticipate that DeepGP can inspire extensive future work in glycoproteomics.
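As a rough illustration of how a predicted spectrum might be compared with an experimental one, the sketch below computes a cosine similarity over fragment ion intensities. The abstract does not state which similarity measure DeepGP uses, so the metric, the function name `cosine_similarity`, the ion labels and the intensity values are all assumptions made for illustration.

```python
import math


def cosine_similarity(predicted, experimental):
    """Cosine similarity between two spectra given as {ion_label: intensity} dicts.

    Ions missing from one spectrum are treated as having zero intensity.
    """
    ions = set(predicted) | set(experimental)
    dot = sum(predicted.get(i, 0.0) * experimental.get(i, 0.0) for i in ions)
    norm_p = math.sqrt(sum(v * v for v in predicted.values()))
    norm_e = math.sqrt(sum(v * v for v in experimental.values()))
    if norm_p == 0.0 or norm_e == 0.0:
        # An empty or all-zero spectrum has no direction to compare.
        return 0.0
    return dot / (norm_p * norm_e)


# Hypothetical intensities; labels and values are illustrative, not from the paper.
predicted = {"b2": 0.2, "y3": 1.0, "Y1": 0.6}
experimental = {"b2": 0.25, "y3": 0.9, "Y1": 0.7, "Y2": 0.1}

score = cosine_similarity(predicted, experimental)
print(f"similarity = {score:.3f}")
```

A score near 1 means the relative fragment intensities agree closely. In a spectral searching setting, candidate glycopeptides with similar glycans could in principle be ranked by such a score.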

Fig. 1: Model architecture and glycopeptide MS/MS spectra prediction.
Fig. 2: Performance of DeepGP in MS/MS prediction.
Fig. 3: DeepGP-based differentiation of similar glycan compositions on a synthetic dataset.
Fig. 4: Performance of DeepGP in glycopeptide identification.
Fig. 5: Glycopeptide identification by DeepGP in combination with pGlyco3.

Data availability

The raw datasets used in this study are available in the PRIDE database (ref. 45) under accession codes PXD005411 (ref. 7), PXD005412 (ref. 7), PXD005413 (ref. 7), PXD005553 (ref. 7), PXD005555 (ref. 7), PXD025859 (ref. 9), PXD015360 (ref. 36), PXD009654 (ref. 37), PXD023980 (ref. 18), PXD016428 (ref. 46), PXD005931 (ref. 47), PXD025455 (ref. 48), PXD009716 (ref. 49) and PXD005565 (ref. 7). Further details regarding the datasets and raw files used in this study can be found in Supplementary Table 1 and Supplementary Data 7. Source data are provided with this paper. The source data for the main figures, including statistics, are provided as a Source Data file. The source data for the supplementary figures, including statistics, are provided as Supplementary Data 8. FASTA files were from the UniProt H. sapiens reference proteome (20,600 entries), UniProt M. musculus reference proteome (17,082 entries) and UniProt S. pombe reference proteome (5,140 entries). For the synthetic glycopeptide dataset (Syn_1), the FASTA file was compiled from synthetic glycopeptide sequences as stated in the original publication.
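For readers who want to keep track of the datasets, the following sketch records the accession codes and reference numbers listed in the Data availability statement and groups them by source reference. The project page URL pattern and the helper names `project_url` and `group_by_reference` are assumptions for illustration; they are not part of DeepGP.

```python
import re
from collections import defaultdict

# Accession code -> reference number, as listed in the Data availability statement.
ACCESSIONS = {
    "PXD005411": 7, "PXD005412": 7, "PXD005413": 7, "PXD005553": 7,
    "PXD005555": 7, "PXD025859": 9, "PXD015360": 36, "PXD009654": 37,
    "PXD023980": 18, "PXD016428": 46, "PXD005931": 47, "PXD025455": 48,
    "PXD009716": 49, "PXD005565": 7,
}

# Assumed URL pattern for PRIDE Archive project pages.
PRIDE_PROJECT_URL = "https://www.ebi.ac.uk/pride/archive/projects/{accession}"


def project_url(accession):
    """Build the (assumed) PRIDE project page URL for one accession code."""
    if not re.fullmatch(r"PXD\d{6}", accession):
        raise ValueError(f"unexpected accession format: {accession!r}")
    return PRIDE_PROJECT_URL.format(accession=accession)


def group_by_reference(accessions):
    """Group accession codes by the reference number they are attributed to."""
    groups = defaultdict(list)
    for accession, ref in accessions.items():
        groups[ref].append(accession)
    return dict(groups)


for ref, codes in sorted(group_by_reference(ACCESSIONS).items()):
    print(f"ref. {ref}: {', '.join(codes)}")
```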

Code availability

DeepGP, along with the user guide, is freely available via GitHub (https://github.com/lmsac/DeepGP) and Zenodo (https://doi.org/10.5281/zenodo.11911189; ref. 50).

References

  1. Wang, Y. C., Peterson, S. E. & Loring, J. F. Protein post-translational modifications and regulation of pluripotency in human stem cells. Cell Res. 24, 143–160 (2014).

  2. Hart, G. W. & Copeland, R. J. Glycomics hits the big time. Cell 143, 672–676 (2010).

  3. Hu, H., Khatri, K. & Zaia, J. Algorithms and design strategies towards automated glycoproteomics analysis. Mass Spectrom. Rev. 36, 475–498 (2017).

  4. Hu, H., Khatri, K., Klein, J., Leymarie, N. & Zaia, J. A review of methods for interpretation of glycopeptide tandem mass spectral data. Glycoconj. J. 33, 285–296 (2016).

  5. Bojar, D. & Lisacek, F. Glycoinformatics in the artificial intelligence era. Chem. Rev. 122, 15971–15988 (2022).

  6. Zeng, W. F. et al. pGlyco: a pipeline for the identification of intact N-glycopeptides by using HCD- and CID-MS/MS and MS3. Sci. Rep. 6, 25102 (2016).

  7. Liu, M. Q. et al. pGlyco 2.0 enables precision N-glycoproteomics with comprehensive quality control and one-step mass spectrometry for intact glycopeptide identification. Nat. Commun. 8, 438 (2017).

  8. Zeng, W. F., Cao, W. Q., Liu, M. Q., He, S. M. & Yang, P. Y. Precise, fast and comprehensive analysis of intact glycopeptides and modified glycans with pGlyco3. Nat. Methods 18, 1515–1523 (2021).

  9. Shen, J. C. et al. StrucGP: de novo structural sequencing of site-specific N-glycan on glycoproteins using a modularization strategy. Nat. Methods 18, 921–929 (2021).

  10. Polasky, D. A., Yu, F. C., Teo, G. C. & Nesvizhskii, A. I. Fast and comprehensive N- and O-glycoproteomics analysis with MSFragger-Glyco. Nat. Methods 17, 1125–1132 (2020).

  11. Lu, L., Riley, N. M., Shortreed, M. R., Bertozzi, C. R. & Smith, L. M. O-Pair search with MetaMorpheus for O-glycopeptide characterization. Nat. Methods 17, 1133–1138 (2020).

  12. Medzihradszky, K. F., Maynard, J., Kaasik, K. & Bern, M. Intact N- and O-linked glycopeptide identification from HCD data using Byonic. Mol. Cell. Proteomics 13, S36 (2014).

  13. Fang, Z. et al. Glyco-Decipher enables glycan database-independent peptide matching and in-depth characterization of site-specific N-glycosylation. Nat. Commun. 13, 1900 (2022).

  14. Xiao, K. & Tian, Z. GPSeeker enables quantitative structural N-Glycoproteomics for site- and structure-specific characterization of differentially expressed N-glycosylation in hepatocellular carcinoma. J. Proteome Res. 18, 2885–2895 (2019).

  15. Peng, W. et al. MS-based glycomics and glycoproteomics methods enabling isomeric characterization. Mass Spectrom. Rev. 42, 577–616 (2023).

  16. Toghi Eshghi, S., Shah, P., Yang, W., Li, X. & Zhang, H. GPQuest: a spectral library matching algorithm for site-specific assignment of tandem mass spectra to intact N-glycopeptides. Anal. Chem. 87, 5181–5188 (2015).

  17. Li, S. J., Zhu, J. H., Lubman, D. M., Zhou, H. & Tang, H. X. GlycoSLASH: concurrent glycopeptide identification from multiple related LC-MS/MS data sets by using spectral clustering and library searching. J. Proteome Res. 22, 1501–1509 (2023).

  18. Yang, Y. et al. GproDIA enables data-independent acquisition glycoproteomics with comprehensive statistical control. Nat. Commun. 12, 6073 (2021).

  19. Zeng, W. F. et al. MS/MS spectrum prediction for modified peptides using pDeep2 trained by transfer learning. Anal. Chem. 91, 9724–9731 (2019).

  20. Zhou, X. X. et al. pDeep: predicting MS/MS spectra of peptides with deep learning. Anal. Chem. 89, 12690–12697 (2017).

  21. Tarn, C. & Zeng, W. F. pDeep3: toward more accurate spectrum prediction with fast few-shot learning. Anal. Chem. 93, 5815–5822 (2021).

  22. Gessulat, S. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat. Methods 16, 509–518 (2019).

  23. Tiwary, S. et al. High-quality MS/MS spectrum prediction for data-dependent and data-independent acquisition data analysis. Nat. Methods 16, 519 (2019).

  24. Yang, Y. et al. In silico spectral libraries by deep learning facilitate data-independent acquisition proteomics. Nat. Commun. 11, 146 (2020).

  25. Lou, R. H. et al. DeepPhospho accelerates DIA phosphoproteome profiling through in silico library generation. Nat. Commun. 12, 6685 (2021).

  26. Zong, Y. et al. DeepFLR facilitates false localization rate control in phosphoproteomics. Nat. Commun. 14, 2269 (2023).

  27. Reily, C., Stewart, T. J., Renfrow, M. B. & Novak, J. Glycosylation in health and disease. Nat. Rev. Nephrol. 15, 346–366 (2019).

  28. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (eds Burstein, J. et al.) 4171–4186 (ACL, 2019); https://doi.org/10.18653/V1/N19-1423

  29. Cao, W. et al. Recent advances in software tools for more generic and precise intact glycopeptide analysis. Mol. Cell. Proteomics 20, 100060 (2021).

  30. Liu, J. et al. Methods for peptide identification by spectral comparison. Proteome Sci. 5, 3 (2007).

  31. Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. Preprint at https://arXiv.org/1609.02907 (2016).

  32. Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? Preprint at https://arXiv.org/1810.00826 (2018).

  33. Veličković, P. et al. Graph attention networks. In Proc. 6th International Conference on Learning Representations (ICLR, 2018); https://doi.org/10.48550/arXiv.1710.10903

  34. Xiong, Z. et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J. Med. Chem. 63, 8749–8760 (2020).

  35. Vaswani, A. et al. Attention is all you need. In Proc. Advances in Neural Information Processing Systems (eds von Luxburg, U. et al.) 5999–6009 (Curran Associates, 2017); https://doi.org/10.48550/arXiv.1706.03762

  36. Zhang, Y. et al. Comparative glycoproteomic profiling of human body fluid between healthy controls and patients with papillary thyroid carcinoma. J. Proteome Res. 19, 2539–2552 (2020).

  37. Qin, H. et al. Highly efficient analysis of glycoprotein sialylation in human serum by simultaneous quantification of glycosites and site-specific glycoforms. J. Proteome Res. 18, 3439–3446 (2019).

  38. Sun, W. et al. Glycopeptide database search and de novo sequencing with PEAKS GlycanFinder enable highly sensitive glycoproteomics. Nat. Commun. 14, 4046 (2023).

  39. Polasky, D. A., Geiszler, D. J., Yu, F. & Nesvizhskii, A. I. Multiattribute glycan identification and FDR control for glycoproteomics. Mol. Cell. Proteomics 21, 100205 (2022).

  40. Zhang, S. Spectrum and Retention Time Prediction for N-Glycopeptides Using Deep Learning. Master's thesis, Univ. of Waterloo (2023).

  41. Kawahara, R. et al. Community evaluation of glycoproteomics informatics solutions reveals high-performance search strategies for serum glycopeptide analysis. Nat. Methods 18, 1304–1316 (2021).

  42. Klein, J., Carvalho, L. & Zaia, J. Expanding N-Glycopeptide identifications by fragmentation prediction and glycome network smoothing. Preprint at bioRxiv https://doi.org/10.1101/2021.02.14.431154 (2021).

  43. Zhang, Z. & Shah, B. Prediction of collision-induced dissociation spectra of common N-glycopeptides for glycoform identification. Anal. Chem. 82, 10194–10202 (2010).

  44. Yang, Y. & Fang, Q. Prediction of glycopeptide fragment mass spectra by deep learning. Nat. Commun. 15, 2448 (2024).

  45. Vizcaino, J. A. et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 44, D447–D456 (2016).

  46. Zhang, Y. et al. Glyco-CPLL: an integrated method for in-depth and comprehensive N-glycoproteome profiling of human plasma. J. Proteome Res. 19, 655–666 (2020).

  47. Bollineni, R. C., Koehler, C. J., Gislefoss, R. E., Anonsen, J. H. & Thiede, B. Large-scale intact glycopeptide identification by Mascot database search. Sci. Rep. 8, 2117 (2018).

  48. Lin, Y. et al. A panel of glycopeptides as candidate biomarkers for early diagnosis of NASH hepatocellular carcinoma using a stepped HCD Method and PRM evaluation. J. Proteome Res. 20, 3278–3289 (2021).

  49. Pioch, M., Hoffmann, M., Pralow, A., Reichl, U. & Rapp, E. glyXtool(MS): an open-source pipeline for semiautomated analysis of glycopeptide mass spectrometry data. Anal. Chem. 90, 11908–11916 (2018).

  50. Zong, Y. Code for DeepGP. Zenodo https://doi.org/10.5281/zenodo.11911189 (2024).

Acknowledgements

We thank M. Ye and Z. Fang from Dalian Institute of Chemical Physics, Chinese Academy of Sciences for assisting us in using GP-plotter. This work was supported by the Science and Technology Commission of Shanghai Municipality (grant no. 23JS1400100, L.Q.) and the National Natural Science Foundation of China (grant no. 21934001, L.Q.). The work also received support from the AI for Science project of Fudan University (X.Q. and L.Q.).

Author information

Contributions

Y.Z. did the majority of the coding work and data analysis and wrote the original draft of the manuscript. Y.W. built the deep learning model. X.Q. and X.H. assisted in the building of the deep learning model and provided the computational resources. L.Q. supervised all aspects of the work and finalized the manuscript. All authors were involved in the design of this work.

Corresponding author

Correspondence to Liang Qiao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Wen-Feng Zeng and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Tables 1–3, Figs. 1–31 and Notes 1–7.

Reporting Summary

Supplementary Data 1

Scores by DeepGP and pGlyco3 for glycopeptide candidates of MS/MS spectra from the mouse liver dataset with 1% or 100% FDR by pGlyco3.

Supplementary Data 2

Scores by DeepGP and pGlyco3 for glycopeptide candidates of MS/MS spectra from the mouse brain dataset with 1% or 100% FDR by pGlyco3.

Supplementary Data 3

The glycopeptides identified by DeepGP, pGlyco3 and StrucGP from the MS/MS spectra extracted from the mouse liver and brain datasets.

Supplementary Data 4

Glycopeptides identified by DeepGP for MS/MS spectra, removing diagnostic ions.

Supplementary Data 5

Glycopeptides identified by DeepGP + pGlyco3 and pGlyco3 alone with different decoy methods.

Supplementary Data 6

Glycopeptides additionally identified by DeepGP + pGlyco3 with different decoy methods.

Supplementary Data 7

Summary of the raw files used for each dataset from the public resources.

Supplementary Data 8

Source data for supplementary figures.

Source data

Source Data Fig. 2

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zong, Y., Wang, Y., Qiu, X. et al. Deep learning prediction of glycopeptide tandem mass spectra powers glycoproteomics. Nat Mach Intell 6, 950–961 (2024). https://doi.org/10.1038/s42256-024-00875-x

  • DOI: https://doi.org/10.1038/s42256-024-00875-x
