Abstract
Tandem mass spectra capture fragmentation patterns that provide key structural information about molecules. Although mass spectrometry is applied in many areas, the vast majority of small molecules lack experimental reference spectra. For over 70 years, spectrum prediction has remained a key challenge in the field. Existing deep learning methods do not leverage global structure in the molecule, potentially resulting in difficulties when generalizing to new data. In this work we propose the MassFormer model for accurately predicting tandem mass spectra. MassFormer uses a graph transformer architecture to model long-distance relationships between atoms in the molecule. The transformer module is initialized with parameters obtained through a chemical pretraining task, then fine-tuned on spectral data. MassFormer outperforms competing approaches for spectrum prediction on multiple datasets and accurately models the effects of collision energy. Gradient-based attribution methods reveal that MassFormer can identify compositional relationships between peaks in the spectrum. When applied to spectrum identification problems, MassFormer generally surpasses the performance of existing prediction-based methods.
Data availability
All public data from the study have been uploaded to Zenodo at https://doi.org/10.5281/zenodo.8399738 (ref. 93). Some data that support the findings of this study are available from the National Institute of Standards and Technology (NIST). However, access to these data is restricted and requires the purchase of an appropriate license or special permission from NIST.
Code availability
The code used in this study is open-source (BSD-2-Clause license) and can be found in a GitHub repository (https://github.com/Roestlab/massformer/)94 with a DOI of https://doi.org/10.5281/zenodo.10558852 (ref. 95).
References
Gross, J. H. Mass Spectrometry—A Textbook (Springer, 2011); https://doi.org/10.1007/978-3-319-54398-7
Niessen, W. M. A. & Falck, D. in Analyzing Biomolecular Interactions by Mass Spectrometry Ch. 1 (eds Kool, J. & Niessen, W. M. A.) (Wiley, 2015); https://doi.org/10.1002/9783527673391
Aebersold, R. & Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355 (2016).
Gowda, G. A. N. & Djukovic, D. Overview of mass spectrometry-based metabolomics: opportunities and challenges. Methods Mol. Biol. 1198, 3–12 (2014).
De Vijlder, T. & Cuyckens, F. A tutorial in small molecule identification via electrospray ionization-mass spectrometry: the practical art of structural elucidation. Mass Spectrom. Rev. 37, 607–629 (2018).
Peters, F. T. Recent advances of liquid chromatography-(tandem) mass spectrometry in clinical and forensic toxicology. Clin. Biochem. 44, 54–65 (2011).
Van Bocxlaer, J. F. et al. Liquid chromatography-mass spectrometry in forensic toxicology. Mass Spectrom. Rev. 19, 165–214 (2000).
Lebedev, A. T. Environmental mass spectrometry. Annu. Rev. Anal. Chem. 6, 163–189 (2013).
Stein, S. E. & Scott, D. R. Optimization and testing of mass spectral library search algorithms for compound identification. J. Am. Soc. Mass Spectrom. 5, 859–866 (1994).
Li, Y. et al. Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification. Nat. Methods 18, 1524–1531 (2021).
Majewski, S. et al. The Wasserstein distance as a dissimilarity measure for mass spectra with application to spectral deconvolution. In 18th International Workshop on Algorithms in Bioinformatics (eds Parida, L. & Ukkonen, E.) 25:1–25:21 (WABI, 2018); https://doi.org/10.4230/LIPICS.WABI.2018.25
Benton, H. P., Wong, D. M., Trauger, S. A. & Siuzdak, G. XCMS2: processing tandem mass spectrometry data for metabolite identification and structural characterization. Anal. Chem. 80, 6382–6389 (2008).
Wishart, D. S. et al. HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res. 46, 608–617 (2018).
Kim, S. et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 47, 1102–1109 (2019).
Kanehisa, M., Furumichi, M., Sato, Y., Ishiguro-Watanabe, M. & Tanabe, M. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res. 49, 545–551 (2021).
Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 45, 703–714 (2010).
Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34, 828–837 (2016).
Sawada, Y. et al. RIKEN tandem mass spectral database (ReSpect) for phytochemicals: a plant-specific MS/MS-based data resource and database. Phytochemistry 82, 38–45 (2012).
MassBank of North America (MoNA, 2022); https://mona.fiehnlab.ucdavis.edu/
Stein, S. Mass spectral reference libraries: an ever-expanding resource for chemical identification. Anal. Chem. 84, 7274–7282 (2012).
Yang, X., Neta, P. & Stein, S. E. Quality control for building libraries from electrospray ionization tandem mass spectra. Anal. Chem. 86, 6393–6400 (2014).
Guijas, C. et al. METLIN: a technology platform for identifying knowns and unknowns. Anal. Chem. 90, 3156–3164 (2018).
Wiley Registry of Mass Spectral Data 2023 (Wiley, 2023); https://sciencesolutions.wiley.com/solutions/technique/gc-ms/wiley-registry-of-mass-spectral-data/
Allen, F., Greiner, R. & Wishart, D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11, 98–110 (2015).
Djoumbou-Feunang, Y. et al. CFM-ID 3.0: significantly improved ESI-MS/MS prediction and identification. Metabolites 9, 72 (2019).
Wang, F. et al. CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification. Anal. Chem. 93, 11692–11700 (2021); https://doi.org/10.1021/acs.analchem.1c01465
Wei, J. N., Belanger, D., Adams, R. P. & Sculley, D. Rapid prediction of electron-ionization mass spectrometry using neural networks. ACS Cent. Sci. 5, 700–708 (2019).
Zhu, H., Liu, L. & Hassoun, S. Using graph neural networks for mass spectrometry prediction. Preprint at https://arxiv.org/abs/2010.04661 (2020).
Li, X., Zhu, H., Liu, L.-p. & Hassoun, S. Ensemble spectral prediction (ESP) model for metabolite annotation. Preprint at https://arxiv.org/abs/2203.13783 (2022).
Zhang, B., Zhang, J., Xia, Y., Chen, P. & Wang, B. Prediction of electron ionization mass spectra based on graph convolutional networks. Int. J. Mass Spectrom. 475, 116817 (2022).
Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In 7th International Conference on Learning Representations, ICLR 2019 (OpenReview.net, 2019); https://openreview.net/forum?id=B1gabhRcYX
Chen, D. et al. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proc. AAAI Conference on Artificial Intelligence Vol. 34, 3438–3445 (AAAI Press, 2020); https://doi.org/10.1609/aaai.v34i04.5747
Liu, M., Gao, H. & Ji, S. Towards deeper graph neural networks. In Proc. 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 338–348 (Association for Computing Machinery, 2020); https://doi.org/10.1145/3394486.3403076
Ying, C. et al. Do transformers really perform bad for graph representation? In Advances in Neural Information Processing Systems 34 (NeurIPS 2021) (eds Ranzato, M. et al.) 28877–28888 (Curran Associates, 2021).
Hong, Y. et al. 3DMolMS: prediction of tandem mass spectra from 3D molecular conformations. Bioinformatics 39, btad354 (2023); https://doi.org/10.1093/bioinformatics/btad354
Murphy, M. et al. Efficiently predicting high resolution mass spectra with graph neural networks. In Proc. 40th International Conference on Machine Learning Vol. 202 (eds Krause, A. et al.) 25549–25562 (PMLR, 2023).
Goldman, S., Bradshaw, J., Xin, J. & Coley, C. W. Prefix-tree decoding for predicting mass spectra from molecules. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023) (eds Oh, A. et al.) 48548–48572 (Curran Associates, 2023).
Zhu, R. L. & Jonas, E. Rapid approximate subset-based spectra prediction for electron ionization-mass spectrometry. Anal. Chem. 95, 2653–2663 (2023).
Goldman, S., Li, J. & Coley, C. W. Generating molecular fragmentation graphs with autoregressive neural networks. Anal. Chem. 96, 3419–3428 (2024).
Jin, W., Coley, C., Barzilay, R. & Jaakkola, T. Predicting organic reaction outcomes with Weisfeiler-Lehman network. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) (Curran Associates, 2017).
Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).
Landrum, G. RDKit: open-source cheminformatics software. Zenodo https://doi.org/10.5281/zenodo.4973812 (2021).
Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminformatics 8, 61 (2016).
Kind, T. et al. LipidBlast in silico tandem mass spectrometry database for lipid identification. Nat. Methods 10, 755–758 (2013).
Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In Proc. 34th International Conference on Machine Learning (ICML 2017) Vol. 70 (eds Precup, D. & Teh, Y. W.) 3145–3153 (PMLR, 2017).
Ancona, M., Ceolini, E., Öztireli, C. & Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations, ICLR 2018 (OpenReview.net, 2018); https://openreview.net/forum?id=Sy21R9JAW
Ali, A. et al. XAI for transformers: better explanations through conservative propagation. In Proc. 39th International Conference on Machine Learning Vol. 162 (eds Chaudhuri, K. et al.) 435–451 (PMLR, 2022).
Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).
Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).
Schymanski, E. L. & Neumann, S. CASMI: and the winner is. Metabolites 3, 412–439 (2013).
Schymanski, E. L. et al. Critical assessment of small molecule identification 2016: automated methods. J. Cheminform. 9, 22 (2017).
Revisiting CASMI. Fiehn Laboratory https://fiehnlab.ucdavis.edu/casmi (2022).
McCoy, R. T., Min, J. & Linzen, T. BERTs of a feather do not generalize together: large variability in generalization across models with similar test set performance. In Proc. 3rd BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (eds Alishahi, A. et al.) 217–227 (Association for Computational Linguistics, 2020).
Zhou, X., Nie, Y., Tan, H. & Bansal, M. The curse of performance instability in analysis datasets: consequences, source, and suggestions. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B. et al.) 8215–8228 (Association for Computational Linguistics, 2020).
D’Amour, A. et al. Underspecification presents challenges for credibility in modern machine learning. J. Mach. Learn. Res. 23, 1–61 (2022).
Goldman, S. et al. Annotating metabolite mass spectra with domain-inspired chemical formula transformers. Nat. Mach. Intell. 5, 965–979 (2023).
Shrivastava, A. D. et al. MassGenie: a transformer-based deep learning method for identifying small molecules from their mass spectra. Biomolecules 11, 1793 (2021).
Stravs, M. A., Dührkop, K., Böcker, S. & Zamboni, N. MSNovelist: de novo structure generation from mass spectra. Nat. Methods 19, 865–870 (2022).
Butler, T. et al. MS2Mol: A transformer model for illuminating dark chemical space from mass spectra. Preprint at https://doi.org/10.26434/chemrxiv-2023-vsmpx-v2 (2023).
Jonas, E. Deep imitation learning for molecular inverse problems. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (eds Wallach, H. et al.) 4991–5001 (Curran Associates, 2019); https://proceedings.neurips.cc/paper_files/paper/2019/file/b0bef4c9a6e50d43880191492d4fc827-Paper.pdf
Shanthamoorthy, P., Young, A. & Röst, H. Analyzing assay specificity in metabolomics using unique ion signature simulations. Anal. Chem. 93, 11415–11423 (2021).
Hoffmann, M. A. et al. High-confidence structural annotation of metabolites absent from spectral libraries. Nat. Biotechnol. 40, 411–421 (2021); https://doi.org/10.1038/s41587-021-01045-9
Scheubert, K. et al. Significance estimation for large scale metabolomics annotations by spectral matching. Nat. Commun. 8, 1494 (2017).
Durant, J. L., Leland, B. A., Henry, D. R. & Nourse, J. G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 42, 1273–1280 (2002).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Zhou, G. et al. Uni-Mol: a universal 3D molecular representation learning framework. In The 11th International Conference on Learning Representations (OpenReview.net, 2022); https://openreview.net/forum?id=6K2RM6wVqKu
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017) (eds Guyon, I. et al.) (Curran Associates, 2017).
Tan, Z. et al. Neural machine translation: a review of methods, resources, and tools. AI Open 1, 5–21 (2020).
Janner, M., Li, Q. & Levine, S. Offline reinforcement learning as one big sequence modeling problem. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021) (eds Ranzato, M. et al.) 1273–1286 (Curran Associates, 2021).
Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021 (OpenReview.net, 2021); https://openreview.net/forum?id=YicbFdNTTy
Ahmadi, A. H. K., Hassani, K., Moradi, P., Lee, L., & Morris, Q. Memory-based graph networks. In 8th International Conference on Learning Representations, ICLR 2020 (OpenReview.net, 2020); https://openreview.net/forum?id=r1laNeBYPB
Mialon, G., Chen, D., Selosse, M. & Mairal, J. GraphiT: encoding graph structure in transformers. Preprint at https://arxiv.org/abs/2106.05667 (2021).
Maziarka, L. et al. Molecule attention transformer. Preprint at https://arxiv.org/abs/2002.08264 (2020).
Rong, Y. et al. Self-supervised graph transformer on large-scale molecular data. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020) (eds Larochelle, H. et al.) 12559–12571 (Curran Associates, 2020).
Hu, W. et al. Open graph benchmark: datasets for machine learning on graphs. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020) (eds Larochelle, H. et al.) 22118–22133 (Curran Associates, 2020).
Velickovic, P. et al. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018 (OpenReview.net, 2018); https://openreview.net/forum?id=rJXMpikCZ
Hu, W. et al. Open graph benchmark: datasets for machine learning on graphs. In Proc. 34th International Conference on Neural Information Processing Systems (eds Larochelle, H. et al.) 1855 (Curran Associates, 2020).
Floyd, R. W. Algorithm 97: shortest path. Commun. ACM 5, 345 (1962).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. Preprint at https://arxiv.org/abs/1607.06450 (2016).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019) Vol. 1 (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).
Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th International Conference on International Conference on Machine Learning (eds Fürnkranz, J. & Joachims, T.) 807–814 (Omnipress, 2010).
Hu, W. et al. OGB-LSC: a large-scale challenge for machine learning on graphs. In Proc. Neural Information Processing Systems Track on Datasets and Benchmarks (eds Vanschoren, J. & Yeung, S.) (Curran Associates, 2021).
Nakata, M. & Shimazaki, T. PubChemQC project: a large-scale first-principles electronic structure database for data-driven chemistry. J. Chem. Inf. Model. 57, 1300–1308 (2017).
Heller, S. R., McNaught, A., Pletnev, I., Stein, S. & Tchekhovskoi, D. InChI, the IUPAC international chemical identifier. J. Cheminformatics 7, 23 (2015).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Pence, H. E. & Williams, A. ChemSpider: an online chemical information resource. J. Chem. Educ. 87, 1123–1124 (2010).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (eds Wallach, H. et al.) (Curran Associates, 2019).
Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds (OpenReview.net, 2019); https://rlgm.github.io/papers/2.pdf
Wang, M. et al. Deep Graph Library: a graph-centric, highly-performant package for graph neural networks. Preprint at https://arxiv.org/abs/1909.01315 (2020).
Li, M. et al. DGL-LifeSci: an open-source toolkit for deep learning on graphs in life science. ACS Omega 6, 27233–27238 (2021).
Biewald, L. Experiment tracking with Weights & Biases. Weights & Biases http://wandb.com (2020).
Young, A., Wang, B. & Röst, H. Public Data files for MassFormer. Zenodo https://doi.org/10.5281/zenodo.8399738 (2023).
Young, A. Roestlab/massformer. GitHub https://github.com/Roestlab/massformer/ (2024).
Young, A. Roestlab/massformer v0.4.0. Zenodo https://doi.org/10.5281/zenodo.10558852 (2024).
Welch, B. L. The generalization of 'Student's' problem when several different population variances are involved. Biometrika 34, 28–35 (1947).
Šidák, Z. Rectangular confidence regions for the means of multivariate normal distributions. J. Am. Stat. Assoc. 62, 626–633 (1967).
Acknowledgements
Resources used in preparing this research were provided, in part, by the Province of Ontario and the Government of Canada, through the Canadian Institute for Advanced Research (CIFAR) and companies sponsoring the Vector Institute. This research was also enabled in part by support provided by Compute Ontario (https://www.computeontario.ca/) and the Digital Research Alliance of Canada (alliancecan.ca). A.Y. is supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Postgraduate Scholarship (Doctoral Program) and a Vector Institute research grant. H.R. is supported by NSERC, the Canadian Institutes of Health Research (CIHR), the Canada Foundation for Innovation, the Canada Research Coordinating Committee (CRCC), the John R. Evans Leaders Fund and the Canada Research Chair Program. B.W. is supported by NSERC (grants: RGPIN-2020-06189 and DGECR-2020-00294), the Peter Munk Cardiac Centre AI Fund at the University Health Network and the CIFAR AI Chair Program. We thank B. Lieng, P. Shanthamoorthy, R. Montenegro-Burke and Q. Morris for helpful discussions. We thank C. Harrigan for feedback on the figures. We thank F. Wang for help with the CFM baseline experiments. We thank S. Ma, P. Fradkin, A. Toma and C. Wang for feedback on the manuscript.
Author information
Authors and Affiliations
Contributions
A.Y., H.R. and B.W. conceived the project. A.Y. wrote the computer code and ran the experiments. H.R. and B.W. supervised the work. A.Y., H.R. and B.W. wrote the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Sebastian Böcker and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Additional Spectrum Similarity Experiments.
A more detailed comparison of the deep learning models that does not involve filtering compounds based on overlap with CFM's training set. Training set sizes (N) are indicated for each split. (a) Test set cosine similarity (1 Da bin resolution) when training and evaluating on [M+H]+ spectra, (b) Test set cosine similarity (1 Da bin resolution) when training and evaluating on all six supported precursor adducts. MassFormer demonstrates strong performance in both cases. Averages and standard deviations from 10 independently trained models are reported. Statistical significance is determined by one-sided Welch's t-test with Šidák correction.
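The cosine similarity metric used throughout these experiments can be illustrated with a minimal sketch (not the paper's actual implementation; the function names are ours): intensities are summed into fixed-width m/z bins, and the cosine of the angle between the two binned intensity vectors is computed.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (1 Da by default).

    `peaks` is a list of (mz, intensity) pairs.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned

def cosine_similarity(spec_a, spec_b, bin_width=1.0):
    """Cosine similarity between two peak lists after binning."""
    a = bin_spectrum(spec_a, bin_width)
    b = bin_spectrum(spec_b, bin_width)
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A spectrum compared against itself scores 1.0; spectra with no shared bins score 0.0.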
Extended Data Fig. 2 ClassyFire Similarity Experiments.
Cosine similarity (1 Da bin resolution) on spectra corresponding to the ten most frequent chemical classes from the NIST-Scaffold test set ([M+H]+ adducts only). The chemical classes, identified by ClassyFire, are sorted from most to least frequent on the x-axis and are not necessarily disjoint. The average performance for each model (across all compounds) is indicated by a black dashed line. (a) MassFormer scores, (b) CFM scores. MassFormer performs better than CFM in each category. Strikingly, MassFormer performs best on 'lipids and lipid-like molecules', a class that CFM seems to struggle with. Averages and standard deviations from 10 independently trained models are reported (except for CFM, which is pretrained).
Extended Data Fig. 3 Additional Heteroatom Peak Separability Experiments.
Linear peak classification accuracy distributions for four heteroatoms: (a) chlorine, (b) sulfur, (c) fluorine, (d) phosphorus. For each heteroatom, the distribution of optimal linear classification accuracy induced by the heteroatom labelling strategy is markedly different from the random labelling distribution (higher accuracy indicates improved separability of the peaks). Sample size and statistical significance (Welch's t-test with Šidák correction) for separability differences are provided for each plot.
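The statistical machinery referenced here (Welch's t-test for unequal variances, refs. 118 and 119) can be sketched from first principles. This is an illustrative implementation under our own naming, not code from the study; in practice one would use a library routine (for example, SciPy's `ttest_ind` with `equal_var=False`) to obtain p-values from the t-statistic and the Welch–Satterthwaite degrees of freedom.

```python
import math

def welch_t(sample_a, sample_b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom
    for two samples with possibly unequal variances."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    # Unbiased sample variances.
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

def sidak_alpha(alpha, m):
    """Per-comparison significance level under the Sidak correction
    for m simultaneous comparisons."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)
```

The Šidák-corrected threshold is slightly less conservative than the Bonferroni threshold `alpha / m` while still controlling the family-wise error rate under independence.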
Extended Data Fig. 4 Spectrum Identification Rank Distributions.
MassFormer ranks candidate structures more accurately than competing approaches. (a) Distributions of the matching candidate's predicted rank for CASMI 2016, CASMI 2022 and NIST20 Outlier queries. (b) Corresponding distributions of the matching candidate's normalized rank. Note that for both metrics, a lower score is better. MassFormer's rank and normalized rank distributions are more strongly skewed towards lower values. Boxplot lines represent the median and interquartile range, whiskers represent 1.5 times the interquartile range, and the 'X' symbol represents the mean.
Extended Data Fig. 5 Spectrum Identification Candidate Set Statistics.
Different spectrum identification tasks vary in terms of the diversity and size of their candidate sets. (a,c,e) Distribution of Tanimoto similarities between candidate molecules and their corresponding queries for (a) CASMI 2016, (c) CASMI 2022, (e) NIST20 Outlier. (b,d,f) Distribution of the number of candidates per query for (b) CASMI 2016, (d) CASMI 2022, (f) NIST20 Outlier. The CASMI 2022 and NIST20 Outlier candidate sets are sampled from PubChem using the query's molecular weight. The NIST20 Outlier dataset uses a smaller weight tolerance (0.5 ppm) than CASMI 2022 (10 ppm), resulting in fewer candidates with higher chemical similarity to the query.
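Tanimoto similarity between a candidate and a query reduces to Jaccard similarity over the on-bits of their molecular fingerprints (for example, extended-connectivity fingerprints, ref. 87, typically computed with RDKit). A minimal sketch with our own naming, taking on-bit indices as plain Python sets:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |A intersect B| / |A union B|."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 1.0  # two empty fingerprints: identical by convention
    return len(a & b) / union
```

Two identical fingerprints score 1.0; fingerprints sharing two of four total on-bits score 0.5.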
Extended Data Fig. 6 Additional Prediction Examples.
Twelve spectrum predictions of varying accuracy, roughly covering a range of 0.4 to 1.0 cosine similarity. All spectra are merged over multiple collision energies. The predictions are described in terms of InChIKey-14, precursor adduct, and cosine similarity with ground truth. (a) AWXJBCZMGZDXCG, [M+H]+, 0.46 (b) GLFJFDAJNJYPGW, [M+H]+, 0.47 (c) HIEYVTSQMLHJEZ, [M+H]+, 0.51 (d) JNKVBUQSDAHKDQ, [M+H-H2O]+, 0.59 (e) WDVCZSSWRMVHAU, [M+H-H2O]+, 0.65 (f) YCTAOQGPWNTYJE, [M+H]+, 0.66 (g) CILGSELJQXSDBE, [M+H]+, 0.72 (h) XCDOHVHQWSFAEN, [M+H-2H2O]+, 0.79 (i) ISNRVVKKHPECQN, [M+H-H2O]+, 0.80 (j) XBGGUPMXALFZOT, [M+H]+, 0.86 (k) DTLKTHCXEMHTIQ, [M+H-2H2O]+, 0.91 (l) BLJBQVQHDXUDTE, [M+H]+, 0.98.
Supplementary information
Supplementary Information
Supplementary text and Tables 1–9.
Supplementary Table 10
The complete set of average similarity scores, across all data splits, for each of the four models (CFM, FP, WLN, MF). There are 56 different methods of similarity calculation. Each variant is defined by a particular intensity transformation (no transform, log transform, square root transform, precursor peak removal), similarity function (cosine, Jensen–Shannon, Jaccard), collision energy merging strategy (merging, no merging) and score aggregation method (spectrum averaging, molecule averaging). Note that intensity transformations are not used in combination with the Jaccard similarity function, which assumes binary intensities.
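Of the three similarity functions listed, the Jensen–Shannon variant is perhaps the least standard; a minimal sketch of one variant (no intensity transform, both spectra already binned to vectors of equal length) follows. This is an illustration with our own naming, not code from the study.

```python
import math

def jensen_shannon_similarity(p, q):
    """1 minus the Jensen-Shannon divergence (base-2 logarithm) between
    two intensity vectors of equal length. Inputs are normalized to
    probability distributions; the result lies in [0, 1]."""
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [(x + y) / 2.0 for x, y in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; 0 * log(0) is taken as 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return 1.0 - jsd
```

Identical spectra score 1.0; spectra with completely disjoint peaks score 0.0, mirroring the bounds of the cosine variant.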
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Young, A., Röst, H. & Wang, B. Tandem mass spectrum prediction for small molecules using graph transformers. Nat Mach Intell 6, 404–416 (2024). https://doi.org/10.1038/s42256-024-00816-8