Tandem mass spectrum prediction for small molecules using graph transformers

A preprint version of the article is available at arXiv.

Abstract

Tandem mass spectra capture fragmentation patterns that provide key structural information about molecules. Although mass spectrometry is applied in many areas, the vast majority of small molecules lack experimental reference spectra. For over 70 years, spectrum prediction has remained a key challenge in the field. Existing deep learning methods do not leverage global structure in the molecule, potentially resulting in difficulties when generalizing to new data. In this work we propose the MassFormer model for accurately predicting tandem mass spectra. MassFormer uses a graph transformer architecture to model long-distance relationships between atoms in the molecule. The transformer module is initialized with parameters obtained through a chemical pretraining task, then fine-tuned on spectral data. MassFormer outperforms competing approaches for spectrum prediction on multiple datasets and accurately models the effects of collision energy. Gradient-based attribution methods reveal that MassFormer can identify compositional relationships between peaks in the spectrum. When applied to spectrum identification problems, MassFormer generally surpasses the performance of existing prediction-based methods.
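
The high-level pipeline in the abstract (encode the molecular graph with a transformer, condition on collision energy, decode a binned spectrum) can be caricatured in a few lines of PyTorch. The sketch below is an illustration under assumed names and dimensions, not the authors' Graphormer-based implementation; the graph-specific attention biases that make this a graph transformer are omitted for brevity, and the real model is in the linked repository.

```python
# Hedged sketch of the setup described in the abstract: a transformer encodes
# per-atom features, the pooled graph embedding is conditioned on collision
# energy, and an MLP head emits binned spectrum intensities. All names and
# dimensions are illustrative stand-ins.
import torch
import torch.nn as nn

class SpectrumPredictor(nn.Module):
    def __init__(self, atom_feat_dim=64, d_model=256, n_bins=1000):
        super().__init__()
        self.embed = nn.Linear(atom_feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Sequential(
            nn.Linear(d_model + 1, d_model),
            nn.ReLU(),
            nn.Linear(d_model, n_bins),
        )

    def forward(self, atom_feats, collision_energy):
        # atom_feats: (batch, n_atoms, atom_feat_dim); collision_energy: (batch, 1)
        h = self.encoder(self.embed(atom_feats))   # contextualized atom embeddings
        g = h.mean(dim=1)                          # pool to a graph-level embedding
        logits = self.head(torch.cat([g, collision_energy], dim=-1))
        return torch.softmax(logits, dim=-1)       # normalized binned intensities

model = SpectrumPredictor()
spectrum = model(torch.randn(2, 30, 64), torch.tensor([[20.0], [40.0]]))
print(spectrum.shape)  # torch.Size([2, 1000])
```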

Fig. 1: Overview of the method.
Fig. 2: Spectrum similarity experiments.
Fig. 3: Collision energy experiments.
Fig. 4: Explainability using gradient attributions.

Data availability

All public data from the study have been uploaded to Zenodo at https://doi.org/10.5281/zenodo.8399738 (ref. 93). Some data that support the findings of this study are available from the National Institute of Standards and Technology (NIST); however, access to these data is restricted and requires the purchase of an appropriate license or special permission from NIST.

Code availability

The code used in this study is open-source (BSD-2-Clause license) and can be found in a GitHub repository (https://github.com/Roestlab/massformer/) (ref. 94), archived with the DOI https://doi.org/10.5281/zenodo.10558852 (ref. 95).

References

  1. Gross, J. H. Mass Spectrometry—A Textbook (Springer, 2011); https://doi.org/10.1007/978-3-319-54398-7

  2. Niessen, W. M. A. & Falck, D. in Analyzing Biomolecular Interactions by Mass Spectrometry Ch. 1 (eds Kool, J. & Niessen, W. M. A.) (Wiley, 2015); https://doi.org/10.1002/9783527673391

  3. Aebersold, R. & Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355 (2016).

  4. Gowda, G. A. N. & Djukovic, D. Overview of mass spectrometry-based metabolomics: opportunities and challenges. Methods Mol. Biol. 1198, 3–12 (2014).

  5. De Vijlder, T. & Cuyckens, F. A tutorial in small molecule identification via electrospray ionization-mass spectrometry: the practical art of structural elucidation. Mass Spectrom. Rev. 37, 607–629 (2018).

  6. Peters, F. T. Recent advances of liquid chromatography-(tandem) mass spectrometry in clinical and forensic toxicology. Clin. Biochem. 44, 54–65 (2011).

  7. Van Bocxlaer, J. F. et al. Liquid chromatography-mass spectrometry in forensic toxicology. Mass Spectrom. Rev. 19, 165–214 (2000).

  8. Lebedev, A. T. Environmental mass spectrometry. Annu. Rev. Anal. Chem. 6, 163–189 (2013).

  9. Stein, S. E. & Scott, D. R. Optimization and testing of mass spectral library search algorithms for compound identification. J. Am. Soc. Mass Spectrom. 5, 859–866 (1994).

  10. Li, Y. et al. Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification. Nat. Methods 18, 1524–1531 (2021).

  11. Majewski, S. et al. The Wasserstein distance as a dissimilarity measure for mass spectra with application to spectral deconvolution. In 18th International Workshop on Algorithms in Bioinformatics (eds Parida, L. & Ukkonen, E.) 25:1–25:21 (WABI, 2018); https://doi.org/10.4230/LIPICS.WABI.2018.25

  12. Benton, H. P., Wong, D. M., Trauger, S. A. & Siuzdak, G. XCMS2: processing tandem mass spectrometry data for metabolite identification and structural characterization. Anal. Chem. 80, 6382–6389 (2008).

  13. Wishart, D. S. et al. HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res. 46, 608–617 (2018).

  14. Kim, S. et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 47, 1102–1109 (2019).

  15. Kanehisa, M., Furumichi, M., Sato, Y., Ishiguro-Watanabe, M. & Tanabe, M. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res. 49, 545–551 (2021).

  16. Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 45, 703–714 (2010).

  17. Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34, 828–837 (2016).

  18. Sawada, Y. et al. RIKEN tandem mass spectral database (ReSpect) for phytochemicals: a plant-specific MS/MS-based data resource and database. Phytochemistry 82, 38–45 (2012).

  19. MassBank of North America (MoNA, 2022); https://mona.fiehnlab.ucdavis.edu/

  20. Stein, S. Mass spectral reference libraries: an ever-expanding resource for chemical identification. Anal. Chem. 84, 7274–7282 (2012).

  21. Yang, X., Neta, P. & Stein, S. E. Quality control for building libraries from electrospray ionization tandem mass spectra. Anal. Chem. 86, 6393–6400 (2014).

  22. Guijas, C. et al. METLIN: a technology platform for identifying knowns and unknowns. Anal. Chem. 90, 3156–3164 (2018).

  23. Wiley Registry of Mass Spectral Data 2023 (Wiley, 2023); https://sciencesolutions.wiley.com/solutions/technique/gc-ms/wiley-registry-of-mass-spectral-data/

  24. Allen, F., Greiner, R. & Wishart, D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11, 98–110 (2015).

  25. Djoumbou-Feunang, Y. et al. CFM-ID 3.0: significantly improved ESI-MS/MS prediction and identification. Metabolites 9, 72 (2019).

  26. Wang, F. et al. CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification. Anal. Chem. 93, 11692–11700 (2021); https://doi.org/10.1021/acs.analchem.1c01465

  27. Wei, J. N., Belanger, D., Adams, R. P. & Sculley, D. Rapid prediction of electron-ionization mass spectrometry using neural networks. ACS Cent. Sci. 5, 700–708 (2019).

  28. Zhu, H., Liu, L. & Hassoun, S. Using graph neural networks for mass spectrometry prediction. Preprint at https://arxiv.org/abs/2010.04661 (2020).

  29. Li, X., Zhu, H., Liu, L.-p. & Hassoun, S. Ensemble spectral prediction (ESP) model for metabolite annotation. Preprint at https://arxiv.org/abs/2203.13783 (2022).

  30. Zhang, B., Zhang, J., Xia, Y., Chen, P. & Wang, B. Prediction of electron ionization mass spectra based on graph convolutional networks. Int. J. Mass Spectrom. 475, 116817 (2022).

  31. Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In 7th International Conference on Learning Representations, ICLR 2019 (OpenReview.net, 2019); https://openreview.net/forum?id=B1gabhRcYX

  32. Chen, D. et al. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proc. AAAI Conference on Artificial Intelligence Vol. 34, 3438–3445 (AAAI Press, 2020); https://doi.org/10.1609/aaai.v34i04.5747

  33. Liu, M., Gao, H. & Ji, S. Towards deeper graph neural networks. In Proc. 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 338–348 (Association for Computing Machinery, 2020); https://doi.org/10.1145/3394486.3403076

  34. Ying, C. et al. Do transformers really perform bad for graph representation? In Advances in Neural Information Processing Systems 34 (NeurIPS 2021) (eds Ranzato, M. et al.) 28877–28888 (Curran Associates, 2021).

  35. Hong, Y. et al. 3DMolMS: prediction of tandem mass spectra from 3D molecular conformations. Bioinformatics 39, btad354 (2023); https://doi.org/10.1093/bioinformatics/btad354

  36. Murphy, M. et al. Efficiently predicting high resolution mass spectra with graph neural networks. In Proc. 40th International Conference on Machine Learning (ICML 2023) Vol. 202 (eds Krause, A. et al.) 25549–25562 (PMLR, 2023).

  37. Goldman, S., Bradshaw, J., Xin, J. & Coley, C. W. Prefix-tree decoding for predicting mass spectra from molecules. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023) (eds Oh, A. et al.) 48548–48572 (Curran Associates, 2023).

  38. Zhu, R. L. & Jonas, E. Rapid approximate subset-based spectra prediction for electron ionization-mass spectrometry. Anal. Chem. 95, 2653–2663 (2023).

  39. Goldman, S., Li, J. & Coley, C. W. Generating molecular fragmentation graphs with autoregressive neural networks. Anal. Chem. 96, 3419–3428 (2024).

  40. Jin, W., Coley, C., Barzilay, R. & Jaakkola, T. Predicting organic reaction outcomes with Weisfeiler-Lehman network. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) (Curran Associates, 2017).

  41. Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).

  42. Landrum, G. RDKit: open-source cheminformatics software. Zenodo https://doi.org/10.5281/zenodo.4973812 (2021).

  43. Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminformatics 8, 61 (2016).

  44. Kind, T. et al. LipidBlast in silico tandem mass spectrometry database for lipid identification. Nat. Methods 10, 755–758 (2013).

  45. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In Proc. 34th International Conference on Machine Learning (ICML 2017) Vol. 70 (eds Precup, D. & Teh, Y. W.) 3145–3153 (PMLR, 2017).

  46. Ancona, M., Ceolini, E., Öztireli, C. & Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations, ICLR 2018 (OpenReview.net, 2018); https://openreview.net/forum?id=Sy21R9JAW

  47. Ali, A. et al. XAI for transformers: better explanations through conservative propagation. In Proc. 39th International Conference on Machine Learning Vol. 162 (eds Chaudhuri, K. et al.) 435–451 (PMLR, 2022).

  48. Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).

  49. Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).

  50. Schymanski, E. L. & Neumann, S. CASMI: and the winner is. Metabolites 3, 412–439 (2013).

  51. Schymanski, E. L. et al. Critical assessment of small molecule identification 2016: automated methods. J. Cheminform. 9, 22 (2017).

  52. Revisiting CASMI. Fiehn Laboratory https://fiehnlab.ucdavis.edu/casmi (2022).

  53. McCoy, R. T., Min, J. & Linzen, T. BERTs of a feather do not generalize together: large variability in generalization across models with similar test set performance. In Proc. 3rd BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (eds Alishahi, A. et al.) 217–227 (Association for Computational Linguistics, 2020).

  54. Zhou, X., Nie, Y., Tan, H. & Bansal, M. The curse of performance instability in analysis datasets: consequences, source, and suggestions. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B. et al.) 8215–8228 (Association for Computational Linguistics, 2020).

  55. D’Amour, A. et al. Underspecification presents challenges for credibility in modern machine learning. J. Mach. Learn. Res. 23, 1–61 (2022).

  56. Goldman, S. et al. Annotating metabolite mass spectra with domain-inspired chemical formula transformers. Nat. Mach. Intell. 5, 965–979 (2023).

  57. Shrivastava, A. D. et al. MassGenie: a transformer-based deep learning method for identifying small molecules from their mass spectra. Biomolecules 11, 1793 (2021).

  58. Stravs, M. A., Dührkop, K., Böcker, S. & Zamboni, N. MSNovelist: de novo structure generation from mass spectra. Nat. Methods 19, 865–870 (2022).

  59. Butler, T. et al. MS2Mol: a transformer model for illuminating dark chemical space from mass spectra. Preprint at https://doi.org/10.26434/chemrxiv-2023-vsmpx-v2 (2023).

  60. Jonas, E. Deep imitation learning for molecular inverse problems. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (eds Wallach, H. et al.) 4991–5001 (Curran Associates, 2019); https://proceedings.neurips.cc/paper_files/paper/2019/file/b0bef4c9a6e50d43880191492d4fc827-Paper.pdf

  61. Shanthamoorthy, P., Young, A. & Röst, H. Analyzing assay specificity in metabolomics using unique ion signature simulations. Anal. Chem. 93, 11415–11423 (2021).

  62. Hoffmann, M. A. et al. High-confidence structural annotation of metabolites absent from spectral libraries. Nat. Biotechnol. 40, 411–421 (2021); https://doi.org/10.1038/s41587-021-01045-9

  63. Scheubert, K. et al. Significance estimation for large scale metabolomics annotations by spectral matching. Nat. Commun. 8, 1494 (2017).

  64. Durant, J. L., Leland, B. A., Henry, D. R. & Nourse, J. G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 42, 1273–1280 (2002).

  65. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).

  66. Zhou, G. et al. Uni-Mol: a universal 3D molecular representation learning framework. In The 11th International Conference on Learning Representations (OpenReview.net, 2022); https://openreview.net/forum?id=6K2RM6wVqKu

  67. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017) (eds Guyon, I. et al.) (Curran Associates, 2017).

  68. Tan, Z. et al. Neural machine translation: a review of methods, resources, and tools. AI Open 1, 5–21 (2020).

  69. Janner, M., Li, Q. & Levine, S. Offline reinforcement learning as one big sequence modeling problem. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021) (eds Ranzato, M. et al.) 1273–1286 (Curran Associates, 2021).

  70. Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021 (OpenReview.net, 2021); https://openreview.net/forum?id=YicbFdNTTy

  71. Ahmadi, A. H. K., Hassani, K., Moradi, P., Lee, L. & Morris, Q. Memory-based graph networks. In 8th International Conference on Learning Representations, ICLR 2020 (OpenReview.net, 2020); https://openreview.net/forum?id=r1laNeBYPB

  72. Mialon, G., Chen, D., Selosse, M. & Mairal, J. GraphiT: encoding graph structure in transformers. Preprint at https://arxiv.org/abs/2106.05667 (2021).

  73. Maziarka, L. et al. Molecule attention transformer. Preprint at https://arxiv.org/abs/2002.08264 (2020).

  74. Rong, Y. et al. Self-supervised graph transformer on large-scale molecular data. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020) (eds Larochelle, H. et al.) 12559–12571 (Curran Associates, 2020).

  75. Hu, W. et al. Open graph benchmark: datasets for machine learning on graphs. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020) (eds Larochelle, H. et al.) 22118–22133 (Curran Associates, 2020).

  76. Velickovic, P. et al. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018 (OpenReview.net, 2018); https://openreview.net/forum?id=rJXMpikCZ

  77. Hu, W. et al. Open graph benchmark: datasets for machine learning on graphs. In Proc. 34th International Conference on Neural Information Processing Systems (eds Larochelle, H. et al.) 1855 (Curran Associates, 2020).

  78. Floyd, R. W. Algorithm 97: shortest path. Commun. ACM 5, 345 (1962).

  79. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).

  80. Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. Preprint at https://arxiv.org/abs/1607.06450 (2016).

  81. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019) Vol. 1 (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).

  82. Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th International Conference on International Conference on Machine Learning (eds Fürnkranz, J. & Joachims, T.) 807–814 (Omnipress, 2010).

  83. Hu, W. et al. OGB-LSC: a large-scale challenge for machine learning on graphs. In Proc. Neural Information Processing Systems Track on Datasets and Benchmarks (eds Vanschoren, J. & Yeung, S.) (Curran Associates, 2021).

  84. Nakata, M. & Shimazaki, T. PubChemQC project: a large-scale first-principles electronic structure database for data-driven chemistry. J. Chem. Inf. Model. 57, 1300–1308 (2017).

  85. Heller, S. R., McNaught, A., Pletnev, I., Stein, S. & Tchekhovskoi, D. InChI, the IUPAC international chemical identifier. J. Cheminformatics 7, 23 (2015).

  86. Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).

  87. Pence, H. E. & Williams, A. ChemSpider: an online chemical information resource. J. Chem. Educ. 87, 1123–1124 (2010).

  88. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (eds Wallach, H. et al.) (Curran Associates, 2019).

  89. Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds (OpenReview.net, 2019); https://rlgm.github.io/papers/2.pdf

  90. Wang, M. et al. Deep Graph Library: a graph-centric, highly-performant package for graph neural networks. Preprint at https://arxiv.org/abs/1909.01315 (2020).

  91. Li, M. et al. DGL-LifeSci: an open-source toolkit for deep learning on graphs in life science. ACS Omega 6, 27233–27238 (2021).

  92. Biewald, L. Experiment tracking with Weights & Biases. Weights & Biases http://wandb.com (2020).

  93. Young, A., Wang, B. & Röst, H. Public Data files for MassFormer. Zenodo https://doi.org/10.5281/zenodo.8399738 (2023).

  94. Young, A. Roestlab/massformer. GitHub https://github.com/Roestlab/massformer/ (2024).

  95. Young, A. Roestlab/massformer v0.4.0. Zenodo https://doi.org/10.5281/zenodo.10558852 (2024).

  96. Welch, B. L. The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika 34, 28–35 (1947).

  97. Šidák, Z. Rectangular confidence regions for the means of multivariate normal distributions. J. Am. Stat. Assoc. 62, 626–633 (1967).

Acknowledgements

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada, through the Canadian Institute for Advanced Research (CIFAR) and companies sponsoring the Vector Institute. This research was also enabled in part by support provided by Compute Ontario (https://www.computeontario.ca/) and the Digital Research Alliance of Canada (alliancecan.ca). A.Y. is supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Postgraduate Scholarship (Doctoral Program) and a Vector Institute research grant. H.R. is supported by NSERC, the Canadian Institutes for Health Research (CIHR), the Canadian Foundation for Innovation, the Canada Research Coordinating Committee (CRCC), the John R. Evans Leaders Fund and the Canada Research Chair Program. B.W. is supported by NSERC (grants: RGPIN-2020-06189 and DGECR-2020-00294), the Peter Munk Cardiac Centre AI Fund at the University Health Network and the CIFAR AI Chair Program. We thank B. Lieng, P. Shanthamoorthy, R. Montenegro-Burke and Q. Morris for helpful discussions. We thank C. Harrigan for feedback on the figures. We thank F. Wang for help with the CFM baseline experiments. We thank S. Ma, P. Fradkin, A. Toma and C. Wang for feedback on the manuscript.

Author information

Authors and Affiliations

Authors

Contributions

A.Y., H.R. and B.W. conceived the project. A.Y. wrote the computer code and ran the experiments. H.R. and B.W. supervised the work. A.Y., H.R. and B.W. wrote the manuscript.

Corresponding authors

Correspondence to Hannes Röst or Bo Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Sebastian Böcker and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Additional Spectrum Similarity Experiments.

A more detailed comparison of the deep learning models that does not involve filtering compounds based on overlap with CFM’s training set. Training set sizes (N) are indicated for each split. (a) Test set cosine similarity (1 Da bin resolution) when training and evaluating on [M+H]+ spectra. (b) Test set cosine similarity (1 Da bin resolution) when training and evaluating on all six supported precursor adducts. MassFormer demonstrates strong performance in both cases. Averages and standard deviations from 10 independently trained models are reported. Statistical significance is determined by a one-sided Welch’s t-test with Šidák correction.
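
For reference, cosine similarity at 1 Da bin resolution can be computed by accumulating peak intensities into integer-width m/z bins before taking the cosine between the two vectors. The sketch below is a minimal illustration under assumed conventions (bin width, maximum m/z, handling of empty spectra), not the repository's exact implementation.

```python
# Minimal sketch of 1 Da binned cosine similarity between two peak lists.
# Bin edges, maximum m/z and zero-handling are assumptions.
import numpy as np

def binned_cosine(mz_a, int_a, mz_b, int_b, bin_width=1.0, max_mz=1000.0):
    n_bins = int(max_mz / bin_width)
    vec_a = np.zeros(n_bins)
    vec_b = np.zeros(n_bins)
    # Accumulate intensities of peaks falling into the same bin.
    np.add.at(vec_a, (np.asarray(mz_a) / bin_width).astype(int), int_a)
    np.add.at(vec_b, (np.asarray(mz_b) / bin_width).astype(int), int_b)
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return float(vec_a @ vec_b / denom) if denom > 0 else 0.0

# Peaks landing in the same 1 Da bin count as matching despite sub-Da error.
print(binned_cosine([100.1, 250.3], [1.0, 0.5], [100.4, 250.2], [0.9, 0.6]))  # ~0.992
```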

Extended Data Fig. 2 ClassyFire Similarity Experiments.

Cosine similarity (1 Da bin resolution) on spectra corresponding to the 10 most frequent chemical classes from the NIST-Scaffold test set ([M+H]+ adducts only). The chemical classes, identified by ClassyFire, are sorted from most to least frequent on the x axis and are not necessarily disjoint. The average performance for each model (across all compounds) is indicated by a black dashed line. (a) MassFormer scores. (b) CFM scores. MassFormer performs better than CFM in each category. Strikingly, MassFormer performs best on ‘lipids and lipid-like molecules’, a class that CFM seems to struggle with. Averages and standard deviations from 10 independently trained models are reported (except for CFM, which is pretrained).

Extended Data Fig. 3 Additional Hetero-atom Peak Separability Experiments.

Linear peak classification accuracy distributions for four hetero-atoms: (a) chlorine, (b) sulfur, (c) fluorine, (d) phosphorus. For each hetero-atom, the distribution of optimal linear classification accuracy induced by the hetero-atom labelling strategy is markedly different from the random labelling distribution (higher accuracy indicates improved separability of the peaks). Sample size and statistical significance (Welch’s t-test with Šidák correction) for separability differences are provided for each plot.
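
A minimal sketch of such a separability test, with synthetic stand-in data: fit a linear classifier on per-peak feature vectors using the hetero-atom labels, repeat with randomly permuted labels, and compare cross-validated accuracies. The feature matrix and label rule below are fabricated placeholders rather than the study's attribution-derived features.

```python
# Sketch of a linear peak-separability test: compare a linear classifier's
# accuracy under the hetero-atom labelling against a permuted-label baseline.
# Features and labels here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                      # per-peak feature vectors
y_true = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # stand-in "contains Cl" label
y_rand = rng.permutation(y_true)                    # random-label baseline

clf = LogisticRegression(max_iter=1000)
acc_true = cross_val_score(clf, X, y_true, cv=5).mean()
acc_rand = cross_val_score(clf, X, y_rand, cv=5).mean()
print(f"hetero-atom labels: {acc_true:.2f}, random labels: {acc_rand:.2f}")
```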

Extended Data Fig. 4 Spectrum Identification Rank Distributions.

MassFormer ranks candidate structures more accurately than competing approaches. (a) Distributions of the matching candidate’s predicted rank for CASMI 2016, CASMI 2022 and NIST20 Outlier queries. (b) Corresponding distributions of the matching candidate’s normalized rank. Note that for both metrics, a lower score is better. MassFormer’s rank and normalized rank distributions are more strongly skewed towards lower values. Boxplot lines represent the median and interquartile range, whiskers represent 1.5 times the interquartile range and the ‘X’ symbol represents the mean.
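
As a point of reference for panels (a) and (b), a common convention for these two metrics is sketched below; the exact tie-breaking and normalization used in the paper are assumptions here. The rank counts candidates scored at least as highly as the true structure, and the normalized rank rescales it into [0, 1] by candidate-set size.

```python
# Sketch of rank and normalized rank for a single query; the convention that
# ties count against the true candidate and the [0, 1] rescaling are assumed.
import numpy as np

def rank_metrics(scores, true_idx):
    scores = np.asarray(scores)
    rank = int((scores >= scores[true_idx]).sum())    # 1 is best
    norm_rank = (rank - 1) / max(len(scores) - 1, 1)  # 0 is best, 1 is worst
    return rank, norm_rank

# The true candidate (index 1) has the highest score, so it ranks first.
print(rank_metrics([0.2, 0.9, 0.5, 0.7], true_idx=1))  # (1, 0.0)
```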

Extended Data Fig. 5 Spectrum Identification Candidate Set Statistics.

Different spectrum identification tasks vary in terms of the diversity and size of their candidate sets. (a,c,e) Distribution of Tanimoto similarities between candidate molecules and their corresponding queries for (a) CASMI 2016, (c) CASMI 2022 and (e) NIST20 Outlier. (b,d,f) Distribution of the number of candidates per query for (b) CASMI 2016, (d) CASMI 2022 and (f) NIST20 Outlier. The CASMI 2022 and NIST20 Outlier candidate sets are sampled from PubChem using the query’s molecular weight. The NIST20 Outlier dataset uses a smaller weight tolerance (0.5 ppm) than CASMI 2022 (10 ppm), resulting in fewer candidates with higher chemical similarity to the query.
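
To make the tolerance comparison concrete: a ppm tolerance defines a symmetric mass window around the query, so a 0.5 ppm window is 20 times narrower than a 10 ppm one and admits correspondingly fewer candidates. The candidate masses below are made up for illustration.

```python
# Sketch of candidate retrieval by molecular weight within a ppm tolerance.
# Candidate masses are invented for illustration.
def mass_window(query_mass, tol_ppm):
    delta = query_mass * tol_ppm * 1e-6
    return query_mass - delta, query_mass + delta

candidates = [(285.0790, "cand_a"), (285.0812, "cand_b"), (285.1100, "cand_c")]
lo, hi = mass_window(285.0800, tol_ppm=10)  # roughly +/-0.0029 Da at this mass
print([name for mass, name in candidates if lo <= mass <= hi])  # ['cand_a', 'cand_b']
```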

Extended Data Fig. 6 Additional Prediction Examples.

Twelve spectrum predictions of varying accuracy, roughly covering a range of 0.4 to 1.0 cosine similarity. All spectra are merged over multiple collision energies. The predictions are described in terms of InChIKey-14, precursor adduct, and cosine similarity with ground truth. (a) AWXJBCZMGZDXCG, [M+H]+, 0.46 (b) GLFJFDAJNJYPGW, [M+H]+, 0.47 (c) HIEYVTSQMLHJEZ, [M+H]+, 0.51 (d) JNKVBUQSDAHKDQ, [M+H-H2O]+, 0.59 (e) WDVCZSSWRMVHAU, [M+H-H2O]+, 0.65 (f) YCTAOQGPWNTYJE, [M+H]+, 0.66 (g) CILGSELJQXSDBE, [M+H]+, 0.72 (h) XCDOHVHQWSFAEN, [M+H-2H2O]+, 0.79 (i) ISNRVVKKHPECQN, [M+H-H2O]+, 0.80 (j) XBGGUPMXALFZOT, [M+H]+, 0.86 (k) DTLKTHCXEMHTIQ, [M+H-2H2O]+, 0.91 (l) BLJBQVQHDXUDTE, [M+H]+, 0.98.
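
A minimal sketch of merging binned spectra across collision energies follows. The merge rule (summing intensities, then renormalizing to the strongest peak) is an assumption for illustration; the paper's exact merging procedure may differ.

```python
# Sketch of merging binned spectra acquired at several collision energies.
# Summing then renormalizing to the base peak is an assumed merge rule.
import numpy as np

def merge_spectra(binned_spectra):
    merged = np.sum(binned_spectra, axis=0)
    peak = merged.max()
    return merged / peak if peak > 0 else merged

spectra = np.array([
    [0.0, 1.0, 0.2],   # low collision energy
    [0.5, 0.0, 0.9],   # high collision energy
])
print(merge_spectra(spectra))  # [0.4545... 0.9090... 1.0]
```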

Supplementary information

Supplementary Information

Supplementary text and Tables 1–9.

Supplementary Table 10

The complete set of average similarity scores, across all data splits, for each of the four models (CFM, FP, WLN, MF). There are 56 different methods of similarity calculation. Each variant is defined by a particular intensity transformation (no transform, log transform, square root transform, precursor peak removal), similarity function (cosine, Jensen–Shannon, Jaccard), collision energy merging strategy (merging, no merging) and score aggregation method (spectrum averaging, molecule averaging). Note that intensity transformations are not used in combination with the Jaccard similarity function, which assumes binary intensities.
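
One factorization consistent with the stated count of 56 is sketched below; how the four factors actually combine is an assumption made to match that count, since the breakdown is only fully defined in the table itself. Here, cosine and Jensen–Shannon each pair with three intensity transforms and optional precursor-peak removal, Jaccard pairs only with binarized intensities and optional precursor removal, and every combination crosses with the two merging strategies and two aggregation methods.

```python
# Enumerate one possible decomposition of the 56 similarity variants; the
# grouping of factors is an assumption made to match the stated count.
from itertools import product

transforms = ["none", "log", "sqrt"]
variants = []
for sim, merge, agg in product(
    ["cosine", "jensen_shannon", "jaccard"],
    ["merged_ce", "unmerged_ce"],
    ["spectrum_avg", "molecule_avg"],
):
    for precursor in ["keep_precursor", "remove_precursor"]:
        if sim == "jaccard":
            variants.append((sim, "binary", precursor, merge, agg))
        else:
            variants.extend((sim, t, precursor, merge, agg) for t in transforms)

print(len(variants))  # 56
```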

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Young, A., Röst, H. & Wang, B. Tandem mass spectrum prediction for small molecules using graph transformers. Nat Mach Intell 6, 404–416 (2024). https://doi.org/10.1038/s42256-024-00816-8
