Abstract
Tandem mass spectra capture fragmentation patterns that provide key structural information about molecules. Although mass spectrometry is applied in many areas, the vast majority of small molecules lack experimental reference spectra. For over 70 years, spectrum prediction has remained a key challenge in the field. Existing deep learning methods do not leverage global structure in the molecule, potentially resulting in difficulties when generalizing to new data. In this work we propose the MassFormer model for accurately predicting tandem mass spectra. MassFormer uses a graph transformer architecture to model long-distance relationships between atoms in the molecule. The transformer module is initialized with parameters obtained through a chemical pretraining task, then fine-tuned on spectral data. MassFormer outperforms competing approaches for spectrum prediction on multiple datasets and accurately models the effects of collision energy. Gradient-based attribution methods reveal that MassFormer can identify compositional relationships between peaks in the spectrum. When applied to spectrum identification problems, MassFormer generally surpasses the performance of existing prediction-based methods.
Data availability
All public data from the study have been uploaded to Zenodo at https://doi.org/10.5281/zenodo.8399738 (ref. 93). Some data that support the findings of this study are available from the National Institute of Standards and Technology (NIST). However, access to these data is restricted and requires the purchase of an appropriate license or special permission from NIST.
Code availability
The code used in this study is open-source (BSD-2-Clause license) and can be found in a GitHub repository (https://github.com/Roestlab/massformer/)94 with a DOI of https://doi.org/10.5281/zenodo.10558852 (ref. 95).
References
Gross, J. H. Mass Spectrometry—A Textbook (Springer, 2011); https://doi.org/10.1007/978-3-319-54398-7
Niessen, W. M. A. & Falck, D. in Analyzing Biomolecular Interactions by Mass Spectrometry Ch. 1 (eds Kool, J. & Niessen, W. M. A.) (Wiley, 2015); https://doi.org/10.1002/9783527673391
Aebersold, R. & Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355 (2016).
Gowda, G. A. N. & Djukovic, D. Overview of mass spectrometry-based metabolomics: opportunities and challenges. Methods Mol. Biol. 1198, 3–12 (2014).
De Vijlder, T. & Cuyckens, F. A tutorial in small molecule identification via electrospray ionization-mass spectrometry: the practical art of structural elucidation. Mass Spectrom. Rev. 37, 607–629 (2018).
Peters, F. T. Recent advances of liquid chromatography-(tandem) mass spectrometry in clinical and forensic toxicology. Clin. Biochem. 44, 54–65 (2011).
Van Bocxlaer, J. F. et al. Liquid chromatography-mass spectrometry in forensic toxicology. Mass Spectrom. Rev. 19, 165–214 (2000).
Lebedev, A. T. Environmental mass spectrometry. Annu. Rev. Anal. Chem. 6, 163–189 (2013).
Stein, S. E. & Scott, D. R. Optimization and testing of mass spectral library search algorithms for compound identification. J. Am. Soc. Mass Spectrom. 5, 859–866 (1994).
Li, Y. et al. Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification. Nat. Methods 18, 1524–1531 (2021).
Majewski, S. et al. The Wasserstein distance as a dissimilarity measure for mass spectra with application to spectral deconvolution. In 18th International Workshop on Algorithms in Bioinformatics (eds Parida, L. & Ukkonen, E.) 25:1–25:21 (WABI, 2018); https://doi.org/10.4230/LIPICS.WABI.2018.25
Benton, H. P., Wong, D. M., Trauger, S. A. & Siuzdak, G. XCMS2: processing tandem mass spectrometry data for metabolite identification and structural characterization. Anal. Chem. 80, 6382–6389 (2008).
Wishart, D. S. et al. HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res. 46, 608–617 (2018).
Kim, S. et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 47, 1102–1109 (2019).
Kanehisa, M., Furumichi, M., Sato, Y., Ishiguro-Watanabe, M. & Tanabe, M. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res. 49, 545–551 (2021).
Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 45, 703–714 (2010).
Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34, 828–837 (2016).
Sawada, Y. et al. RIKEN tandem mass spectral database (ReSpect) for phytochemicals: a plant-specific MS/MS-based data resource and database. Phytochemistry 82, 38–45 (2012).
MassBank of North America (MoNA, 2022); https://mona.fiehnlab.ucdavis.edu/
Stein, S. Mass spectral reference libraries: an ever-expanding resource for chemical identification. Anal. Chem. 84, 7274–7282 (2012).
Yang, X., Neta, P. & Stein, S. E. Quality control for building libraries from electrospray ionization tandem mass spectra. Anal. Chem. 86, 6393–6400 (2014).
Guijas, C. et al. METLIN: a technology platform for identifying knowns and unknowns. Anal. Chem. 90, 3156–3164 (2018).
Wiley Registry of Mass Spectral Data 2023 (Wiley, 2023); https://sciencesolutions.wiley.com/solutions/technique/gc-ms/wiley-registry-of-mass-spectral-data/
Allen, F., Greiner, R. & Wishart, D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11, 98–110 (2015).
Djoumbou-Feunang, Y. et al. CFM-ID 3.0: significantly improved ESI-MS/MS prediction and identification. Metabolites 9, 72 (2019).
Wang, F. et al. CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification. Anal. Chem. 93, 11692–11700 (2021); https://doi.org/10.1021/acs.analchem.1c01465
Wei, J. N., Belanger, D., Adams, R. P. & Sculley, D. Rapid prediction of electron-ionization mass spectrometry using neural networks. ACS Cent. Sci. 5, 700–708 (2019).
Zhu, H., Liu, L. & Hassoun, S. Using graph neural networks for mass spectrometry prediction. Preprint at https://arxiv.org/abs/2010.04661 (2020).
Li, X., Zhu, H., Liu, L.-p. & Hassoun, S. Ensemble spectral prediction (ESP) model for metabolite annotation. Preprint at https://arxiv.org/abs/2203.13783 (2022).
Zhang, B., Zhang, J., Xia, Y., Chen, P. & Wang, B. Prediction of electron ionization mass spectra based on graph convolutional networks. Int. J. Mass Spectrom. 475, 116817 (2022).
Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In 7th International Conference on Learning Representations, ICLR 2019 (OpenReview.net, 2019); https://openreview.net/forum?id=B1gabhRcYX
Chen, D. et al. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proc. AAAI Conference on Artificial Intelligence Vol. 34, 3438–3445 (AAAI Press, 2020); https://doi.org/10.1609/aaai.v34i04.5747
Liu, M., Gao, H. & Ji, S. Towards deeper graph neural networks. In Proc. 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 338–348 (Association for Computing Machinery, 2020); https://doi.org/10.1145/3394486.3403076
Ying, C. et al. Do transformers really perform bad for graph representation? In Advances in Neural Information Processing Systems 34 (NeurIPS 2021) (eds Ranzato, M. et al.) 28877–28888 (Curran Associates, 2021).
Hong, Y. et al. 3DMolMS: prediction of tandem mass spectra from 3D molecular conformations. Bioinformatics 39, btad354 (2023); https://doi.org/10.1093/bioinformatics/btad354
Murphy, M. et al. Efficiently predicting high resolution mass spectra with graph neural networks. In Proc. 40th International Conference on Machine Learning Vol. 202 (eds Krause, A. et al.) 25549–25562 (PMLR, 2023).
Goldman, S., Bradshaw, J., Xin, J. & Coley, C. W. Prefix-tree decoding for predicting mass spectra from molecules. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023) (eds Oh, A. et al.) 48548–48572 (Curran Associates, 2023).
Zhu, R. L. & Jonas, E. Rapid approximate subset-based spectra prediction for electron ionization-mass spectrometry. Anal. Chem. 95, 2653–2663 (2023).
Goldman, S., Li, J. & Coley, C. W. Generating molecular fragmentation graphs with autoregressive neural networks. Anal. Chem. 96, 3419–3428 (2024).
Jin, W., Coley, C., Barzilay, R. & Jaakkola, T. Predicting organic reaction outcomes with Weisfeiler-Lehman network. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) (Curran Associates, 2017).
Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).
Landrum, G. RDKit: open-source cheminformatics software. Zenodo https://doi.org/10.5281/zenodo.4973812 (2021).
Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminformatics 8, 61 (2016).
Kind, T. et al. LipidBlast in silico tandem mass spectrometry database for lipid identification. Nat. Methods 10, 755–758 (2013).
Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In Proc. 34th International Conference on Machine Learning (ICML 2017) Vol. 70 (eds Precup, D. & Teh, Y. W.) 3145–3153 (PMLR, 2017).
Ancona, M., Ceolini, E., Öztireli, C. & Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations, ICLR 2018 (OpenReview.net, 2018); https://openreview.net/forum?id=Sy21R9JAW
Ali, A. et al. XAI for transformers: better explanations through conservative propagation. In Proc. 39th International Conference on Machine Learning Vol. 162 (eds Chaudhuri, K. et al.) 435–451 (PMLR, 2022).
Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).
Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).
Schymanski, E. L. & Neumann, S. CASMI: and the winner is. Metabolites 3, 412–439 (2013).
Schymanski, E. L. et al. Critical assessment of small molecule identification 2016: automated methods. J. Cheminform. 9, 22 (2017).
Revisiting CASMI. Fiehn Laboratory https://fiehnlab.ucdavis.edu/casmi (2022).
McCoy, R. T., Min, J. & Linzen, T. BERTs of a feather do not generalize together: large variability in generalization across models with similar test set performance. In Proc. 3rd BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (eds Alishahi, A. et al.) 217–227 (Association for Computational Linguistics, 2020).
Zhou, X., Nie, Y., Tan, H. & Bansal, M. The curse of performance instability in analysis datasets: consequences, source, and suggestions. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B. et al.) 8215–8228 (Association for Computational Linguistics, 2020).
D’Amour, A. et al. Underspecification presents challenges for credibility in modern machine learning. J. Mach. Learn. Res. 23, 1–61 (2022).
Goldman, S. et al. Annotating metabolite mass spectra with domain-inspired chemical formula transformers. Nat. Mach. Intell. 5, 965–979 (2023).
Shrivastava, A. D. et al. MassGenie: a transformer-based deep learning method for identifying small molecules from their mass spectra. Biomolecules 11, 1793 (2021).
Stravs, M. A., Dührkop, K., Böcker, S. & Zamboni, N. MSNovelist: de novo structure generation from mass spectra. Nat. Methods 19, 865–870 (2022).
Butler, T. et al. MS2Mol: A transformer model for illuminating dark chemical space from mass spectra. Preprint at https://doi.org/10.26434/chemrxiv-2023-vsmpx-v2 (2023).
Jonas, E. Deep imitation learning for molecular inverse problems. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (eds Wallach, H. et al.) 4991–5001 (Curran Associates, 2019); https://proceedings.neurips.cc/paper_files/paper/2019/file/b0bef4c9a6e50d43880191492d4fc827-Paper.pdf
Shanthamoorthy, P., Young, A. & Röst, H. Analyzing assay specificity in metabolomics using unique ion signature simulations. Anal. Chem. 93, 11415–11423 (2021).
Hoffmann, M. A. et al. High-confidence structural annotation of metabolites absent from spectral libraries. Nat. Biotechnol. 40, 411–421 (2021); https://doi.org/10.1038/s41587-021-01045-9
Scheubert, K. et al. Significance estimation for large scale metabolomics annotations by spectral matching. Nat. Commun. 8, 1494 (2017).
Durant, J. L., Leland, B. A., Henry, D. R. & Nourse, J. G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 42, 1273–1280 (2002).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Zhou, G. et al. Uni-Mol: a universal 3D molecular representation learning framework. In The 11th International Conference on Learning Representations (OpenReview.net, 2022); https://openreview.net/forum?id=6K2RM6wVqKu
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017) (eds Guyon, I. et al.) (Curran Associates, 2017).
Tan, Z. et al. Neural machine translation: a review of methods, resources, and tools. AI Open 1, 5–21 (2020).
Janner, M., Li, Q. & Levine, S. Offline reinforcement learning as one big sequence modeling problem. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021) (eds Ranzato, M. et al.) 1273–1286 (Curran Associates, 2021).
Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021 (OpenReview.net, 2021); https://openreview.net/forum?id=YicbFdNTTy
Ahmadi, A. H. K., Hassani, K., Moradi, P., Lee, L., & Morris, Q. Memory-based graph networks. In 8th International Conference on Learning Representations, ICLR 2020 (OpenReview.net, 2020); https://openreview.net/forum?id=r1laNeBYPB
Mialon, G., Chen, D., Selosse, M. & Mairal, J. GraphiT: encoding graph structure in transformers. Preprint at https://arxiv.org/abs/2106.05667 (2021).
Maziarka, L. et al. Molecule attention transformer. Preprint at https://arxiv.org/abs/2002.08264 (2020).
Rong, Y. et al. Self-supervised graph transformer on large-scale molecular data. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020) (eds Larochelle, H. et al.) 12559–12571 (Curran Associates, 2020).
Hu, W. et al. Open graph benchmark: datasets for machine learning on graphs. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020) (eds Larochelle, H. et al.) 22118–22133 (Curran Associates, 2020).
Velickovic, P. et al. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018 (OpenReview.net, 2018); https://openreview.net/forum?id=rJXMpikCZ
Hu, W. et al. Open graph benchmark: datasets for machine learning on graphs. In Proc. 34th International Conference on Neural Information Processing Systems (eds Larochelle, H. et al.) 1855 (Curran Associates, 2020).
Floyd, R. W. Algorithm 97: shortest path. Commun. ACM 5, 345 (1962).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. Preprint at https://arxiv.org/abs/1607.06450 (2016).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019) Vol. 1 (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).
Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th International Conference on International Conference on Machine Learning (eds Fürnkranz, J. & Joachims, T.) 807–814 (Omnipress, 2010).
Hu, W. et al. OGB-LSC: a large-scale challenge for machine learning on graphs. In Proc. Neural Information Processing Systems Track on Datasets and Benchmarks (eds Vanschoren, J. & Yeung, S.) (Curran Associates, 2021).
Nakata, M. & Shimazaki, T. PubChemQC project: a large-scale first-principles electronic structure database for data-driven chemistry. J. Chem. Inf. Model. 57, 1300–1308 (2017).
Heller, S. R., McNaught, A., Pletnev, I., Stein, S. & Tchekhovskoi, D. InChI, the IUPAC international chemical identifier. J. Cheminformatics 7, 23 (2015).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Pence, H. E. & Williams, A. ChemSpider: an online chemical information resource. J. Chem. Educ. 87, 1123–1124 (2010).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (eds Wallach, H. et al.) (Curran Associates, 2019).
Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds (OpenReview.net, 2019); https://rlgm.github.io/papers/2.pdf
Wang, M. et al. Deep Graph Library: a graph-centric, highly-performant package for graph neural networks. Preprint at https://arxiv.org/abs/1909.01315 (2020).
Li, M. et al. DGL-LifeSci: an open-source toolkit for deep learning on graphs in life science. ACS Omega 6, 27233–27238 (2021).
Biewald, L. Experiment tracking with Weights & Biases. Weights & Biases http://wandb.com (2020).
Young, A., Wang, B. & Röst, H. Public Data files for MassFormer. Zenodo https://doi.org/10.5281/zenodo.8399738 (2023).
Young, A. Roestlab/massformer. GitHub https://github.com/Roestlab/massformer/ (2024).
Young, A. Roestlab/massformer v0.4.0. Zenodo https://doi.org/10.5281/zenodo.10558852 (2024).
Welch, B. L. The generalization of 'Student's' problem when several different population variances are involved. Biometrika 34, 28–35 (1947).
Šidák, Z. Rectangular confidence regions for the means of multivariate normal distributions. J. Am. Stat. Assoc. 62, 626–633 (1967).
Acknowledgements
Resources used in preparing this research were provided, in part, by the Province of Ontario and the Government of Canada, through the Canadian Institute for Advanced Research (CIFAR) and companies sponsoring the Vector Institute. This research was also enabled in part by support provided by Compute Ontario (https://www.computeontario.ca/) and the Digital Research Alliance of Canada (alliancecan.ca). A.Y. is supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Postgraduate Scholarship (Doctoral Program) and a Vector Institute research grant. H.R. is supported by NSERC, the Canadian Institutes of Health Research (CIHR), the Canada Foundation for Innovation, the Canada Research Coordinating Committee (CRCC), the John R. Evans Leaders Fund and the Canada Research Chair Program. B.W. is supported by NSERC (grants: RGPIN-2020-06189 and DGECR-2020-00294), the Peter Munk Cardiac Centre AI Fund at the University Health Network and the CIFAR AI Chair Program. We thank B. Lieng, P. Shanthamoorthy, R. Montenegro-Burke and Q. Morris for helpful discussions. We thank C. Harrigan for feedback on the figures. We thank F. Wang for help with the CFM baseline experiments. We thank S. Ma, P. Fradkin, A. Toma and C. Wang for feedback on the manuscript.
Author information
Authors and Affiliations
Contributions
A.Y., H.R. and B.W. conceived the project. A.Y. wrote the computer code and ran the experiments. H.R. and B.W. supervised the work. A.Y., H.R. and B.W. wrote the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Sebastian Böcker and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Additional Spectrum Similarity Experiments.
A more detailed comparison of the deep learning models that does not involve filtering compounds based on overlap with CFM's training set. Training set sizes (N) are indicated for each split. (a) Test set cosine similarity (1 Da bin resolution) when training and evaluating on [M+H]+ spectra, (b) Test set cosine similarity (1 Da bin resolution) when training and evaluating on all six supported precursor adducts. MassFormer demonstrates strong performance in both cases. Averages and standard deviations from 10 independently trained models are reported. Statistical significance is determined by one-sided Welch's t-test with Šidák correction.
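The cosine similarity metric used throughout these experiments can be illustrated with a minimal sketch (not the paper's actual implementation; the function names are ours): intensities are summed into fixed-width m/z bins, and the cosine of the angle between the two binned intensity vectors is computed.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (1 Da by default).

    `peaks` is a list of (mz, intensity) pairs.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned

def cosine_similarity(spec_a, spec_b, bin_width=1.0):
    """Cosine similarity between two peak lists after binning."""
    a = bin_spectrum(spec_a, bin_width)
    b = bin_spectrum(spec_b, bin_width)
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A spectrum compared against itself scores 1.0; spectra with no shared bins score 0.0.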
Extended Data Fig. 2 ClassyFire Similarity Experiments.
Cosine similarity (1 Da bin resolution) on spectra corresponding to the ten most frequent chemical classes from the NIST-Scaffold test set ([M+H]+ adducts only). The chemical classes, identified by ClassyFire, are sorted from most to least frequent on the x-axis and are not necessarily disjoint. The average performance for each model (across all compounds) is indicated by a black dashed line. (a) MassFormer scores, (b) CFM scores. MassFormer performs better than CFM in each category. Strikingly, MassFormer performs best on 'lipids and lipid-like molecules', a class that CFM seems to struggle with. Averages and standard deviations from 10 independently trained models are reported (except for CFM, which is pretrained).
Extended Data Fig. 3 Additional Heteroatom Peak Separability Experiments.
Linear peak classification accuracy distributions for four heteroatoms: (a) chlorine, (b) sulfur, (c) fluorine, (d) phosphorus. For each heteroatom, the distribution of optimal linear classification accuracy induced by the heteroatom labelling strategy is markedly different from the random labelling distribution (higher accuracy indicates improved separability of the peaks). Sample size and statistical significance (Welch's t-test with Šidák correction) for separability differences are provided for each plot.
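The statistical machinery referenced here (Welch's t-test for unequal variances, refs. 118 and 119) can be sketched from first principles. This is an illustrative implementation under our own naming, not code from the study; in practice one would use a library routine (for example, SciPy's `ttest_ind` with `equal_var=False`) to obtain p-values from the t-statistic and the Welch–Satterthwaite degrees of freedom.

```python
import math

def welch_t(sample_a, sample_b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom
    for two samples with possibly unequal variances."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    # Unbiased sample variances.
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

def sidak_alpha(alpha, m):
    """Per-comparison significance level under the Sidak correction
    for m simultaneous comparisons."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)
```

The Šidák-corrected threshold is slightly less conservative than the Bonferroni threshold `alpha / m` while still controlling the family-wise error rate under independence.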
Extended Data Fig. 4 Spectrum Identification Rank Distributions.
MassFormer ranks candidate structures more accurately than competing approaches. (a) Distributions of the matching candidate's predicted rank for CASMI 2016, CASMI 2022 and NIST20 Outlier queries. (b) Corresponding distributions of the matching candidate's normalized rank. Note that for both metrics, a lower score is better. MassFormer's rank and normalized rank distributions are more strongly skewed towards lower values. Boxplot lines represent the median and interquartile range, whiskers represent 1.5 times the interquartile range, and the 'X' symbol represents the mean.
Extended Data Fig. 5 Spectrum Identification Candidate Set Statistics.
Different spectrum identification tasks vary in terms of the diversity and size of their candidate sets. (a,c,e) Distribution of Tanimoto similarities between candidate molecules and their corresponding queries for (a) CASMI 2016, (c) CASMI 2022, (e) NIST20 Outlier. (b,d,f) Distribution of the number of candidates per query for (b) CASMI 2016, (d) CASMI 2022, (f) NIST20 Outlier. The CASMI 2022 and NIST20 Outlier candidate sets are sampled from PubChem using the query's molecular weight. The NIST20 Outlier dataset uses a smaller weight tolerance (0.5 ppm) than CASMI 2022 (10 ppm), resulting in fewer candidates with higher chemical similarity to the query.
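Tanimoto similarity between a candidate and a query reduces to Jaccard similarity over the on-bits of their molecular fingerprints (for example, extended-connectivity fingerprints, ref. 87, typically computed with RDKit). A minimal sketch with our own naming, taking on-bit indices as plain Python sets:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |A intersect B| / |A union B|."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 1.0  # two empty fingerprints: identical by convention
    return len(a & b) / union
```

Two identical fingerprints score 1.0; fingerprints sharing two of four total on-bits score 0.5.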
Extended Data Fig. 6 Additional Prediction Examples.
Twelve spectrum predictions of varying accuracy, roughly covering a range of 0.4 to 1.0 cosine similarity. All spectra are merged over multiple collision energies. The predictions are described in terms of InChIKey-14, precursor adduct, and cosine similarity with ground truth. (a) AWXJBCZMGZDXCG, [M+H]+, 0.46 (b) GLFJFDAJNJYPGW, [M+H]+, 0.47 (c) HIEYVTSQMLHJEZ, [M+H]+, 0.51 (d) JNKVBUQSDAHKDQ, [M+H-H2O]+, 0.59 (e) WDVCZSSWRMVHAU, [M+H-H2O]+, 0.65 (f) YCTAOQGPWNTYJE, [M+H]+, 0.66 (g) CILGSELJQXSDBE, [M+H]+, 0.72 (h) XCDOHVHQWSFAEN, [M+H-2H2O]+, 0.79 (i) ISNRVVKKHPECQN, [M+H-H2O]+, 0.80 (j) XBGGUPMXALFZOT, [M+H]+, 0.86 (k) DTLKTHCXEMHTIQ, [M+H-2H2O]+, 0.91 (l) BLJBQVQHDXUDTE, [M+H]+, 0.98.
Supplementary information
Supplementary Information
Supplementary text and Tables 1–9.
Supplementary Table 10
The complete set of average similarity scores, across all data splits, for each of the four models (CFM, FP, WLN, MF). There are 56 different methods of similarity calculation. Each variant is defined by a particular intensity transformation (no transform, log transform, square root transform, precursor peak removal), similarity function (cosine, Jensen–Shannon, Jaccard), collision energy merging strategy (merging, no merging) and score aggregation method (spectrum averaging, molecule averaging). Note that intensity transformations are not used in combination with the Jaccard similarity function, which assumes binary intensities.
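Of the three similarity functions listed, the Jensen–Shannon variant is perhaps the least standard; a minimal sketch of one variant (no intensity transform, both spectra already binned to vectors of equal length) follows. This is an illustration with our own naming, not code from the study.

```python
import math

def jensen_shannon_similarity(p, q):
    """1 minus the Jensen-Shannon divergence (base-2 logarithm) between
    two intensity vectors of equal length. Inputs are normalized to
    probability distributions; the result lies in [0, 1]."""
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [(x + y) / 2.0 for x, y in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; 0 * log(0) is taken as 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return 1.0 - jsd
```

Identical spectra score 1.0; spectra with completely disjoint peaks score 0.0, mirroring the bounds of the cosine variant.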
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Young, A., Röst, H. & Wang, B. Tandem mass spectrum prediction for small molecules using graph transformers. Nat Mach Intell 6, 404–416 (2024). https://doi.org/10.1038/s42256-024-00816-8