Annotating metabolite mass spectra with domain-inspired chemical formula transformers

A preprint version of the article is available at bioRxiv.

Abstract

Metabolomics studies have identified small molecules that mediate cell signaling, competition and disease pathology, in part due to large-scale community efforts to measure tandem mass spectra for thousands of metabolite standards. Nevertheless, the majority of spectra observed in clinical samples cannot be unambiguously matched to known structures. Deep learning approaches to small-molecule structure elucidation have surprisingly failed to rival classical statistical methods, which we hypothesize is due to the lack of in-domain knowledge incorporated into current neural network architectures. Here we introduce a neural network-driven workflow for untargeted metabolomics, Metabolite Inference with Spectrum Transformers (MIST), to annotate tandem mass spectra peaks with chemical structures. Unlike existing approaches, MIST incorporates domain insights into its architecture by encoding peaks with their chemical formula representations, implicitly featurizing pairwise neutral losses and training the network to additionally predict substructure fragments. MIST performs favorably compared with both standard neural architectures and the state-of-the-art kernel method on the task of fingerprint prediction for over 70% of metabolite standards and retrieves 66% of metabolites with equal or improved accuracy, with 29% strictly better. We further demonstrate the utility of MIST by suggesting potential dipeptide and alkaloid structures for differentially abundant spectra found in an inflammatory bowel disease patient cohort.
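The two domain ideas named in the abstract, representing peaks by chemical formula and relating peaks through pairwise neutral losses, can be illustrated with a minimal sketch. This is not the authors' implementation: MIST featurizes neutral losses implicitly inside the network, whereas the code below computes them explicitly for illustration. The element vocabulary `ELEMENTS` and the helper names `formula_to_counts` and `pairwise_neutral_losses` are hypothetical, and the parser assumes simple, uncharged formulas without brackets.

```python
import re
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical element vocabulary chosen for this sketch only.
ELEMENTS = ("C", "H", "N", "O", "P", "S")


def formula_to_counts(formula: str) -> Tuple[int, ...]:
    """Parse a simple formula such as 'C6H12O6' into per-element counts."""
    counts = dict.fromkeys(ELEMENTS, 0)
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in counts:
            raise ValueError(f"unsupported element: {symbol}")
        counts[symbol] += int(number) if number else 1
    return tuple(counts[e] for e in ELEMENTS)


def pairwise_neutral_losses(
    peak_formulas: List[str],
) -> Dict[Tuple[str, str], Tuple[int, ...]]:
    """Return element-count differences for every ordered peak pair
    in which the first formula contains the second."""
    vectors = {f: formula_to_counts(f) for f in peak_formulas}
    losses = {}
    for a, b in combinations(peak_formulas, 2):
        for big, small in ((a, b), (b, a)):
            diff = tuple(x - y for x, y in zip(vectors[big], vectors[small]))
            # Keep only chemically meaningful losses: no negative counts,
            # and at least one atom actually lost.
            if all(d >= 0 for d in diff) and any(diff):
                losses[(big, small)] = diff
    return losses


if __name__ == "__main__":
    # Two illustrative peak formulas that differ by a loss of water.
    for pair, loss in pairwise_neutral_losses(["C6H12O6", "C6H10O5"]).items():
        print(pair, dict(zip(ELEMENTS, loss)))
```

Representing each peak as a count vector makes a neutral loss a simple vector difference, which is one way to see why formula-level encodings can expose fragment relationships that raw m/z values hide.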

Fig. 1: MIST replaces the critical spectrum-to-fingerprint prediction step in the computational workflow.
Fig. 2: MIST accurately predicts compound fingerprints from mass spectra.
Fig. 3: Contrastive fine-tuning improves compound annotation by database retrieval.
Fig. 4: MIST annotates putative and clinically relevant dipeptides.
Fig. 5: Putative alkaloids separate healthy and diseased IBD cohorts.
Fig. 6: Model ablations affirm the value of domain-inspired model components.

Data availability

Public data used for benchmarking MIST models, as processed by ref. 23, can be downloaded alongside our code, with full directions included at https://github.com/samgoldman97/mist and ref. 39. Data for NIST and head-to-head CSI comparisons are unavailable due to strict licensing rules around the NIST20 dataset (ref. 47). Data to repeat the retrospective study and reanalysis of IBD data can be retrieved from the MassIVE database at accessions MSV000084908 (raw data) and MSV000084908 (cohort info) and via Zenodo record 8084088 (ref. 39). PubChem (April 2022) and HMDB 5.0 data libraries used for compound retrieval are publicly accessible, with exact details for reproduction described alongside the released code.

Code availability

All code to replicate experiments, train new models and load pretrained models is available at https://github.com/samgoldman97/mist. The exact repository version used in this work has been archived with Zenodo (ref. 39).

References

  1. Xu, W. et al. Oncometabolite 2-hydroxyglutarate is a competitive inhibitor of α-ketoglutarate-dependent dioxygenases. Cancer Cell 19, 17–30 (2011).

  2. Dang, L. et al. Cancer-associated IDH1 mutations produce 2-hydroxyglutarate. Nature 462, 739–744 (2009).

  3. Torrens-Spence, M. P. et al. PBS3 and EPS1 complete salicylic acid biosynthesis from isochorismate in Arabidopsis. Mol. Plant 12, 1577–1586 (2019).

  4. Wishart, D. S. Metabolomics for investigating physiological and pathophysiological processes. Physiol. Rev. 99, 1819–1875 (2019).

  5. Bundy, J. G., Davey, M. P. & Viant, M. R. Environmental metabolomics: a critical review and future perspectives. Metabolomics 5, 3–21 (2009).

  6. Sato, Y. et al. Novel bile acid biosynthetic pathways are enriched in the microbiome of centenarians. Nature 599, 458–464 (2021).

  7. Neumann, S. & Böcker, S. Computational mass spectrometry for metabolomics: identification of metabolites and small molecules. Anal. Bioanal. Chem. 398, 2779–2788 (2010).

  8. Bittremieux, W., Wang, M. & Dorrestein, P. C. The critical role that spectral libraries play in capturing the metabolomics community knowledge. Metabolomics 18, 94 (2022).

  9. AlQuraishi, M. Machine learning in protein structure prediction. Curr. Opin. Chem. Biol. 65, 1–8 (2021).

  10. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  11. Nothias, L.-F. et al. Feature-based molecular networking in the GNPS analysis environment. Nat. Methods 17, 905–908 (2020).

  12. Watrous, J. et al. Mass spectral molecular networking of living microbial colonies. Proc. Natl Acad. Sci. USA 109, E1743–E1752 (2012).

  13. Quinn, R. A. et al. Global chemical effects of the microbiome include new bile-acid conjugations. Nature 579, 123–129 (2020).

  14. Wolf, S., Schmidt, S., Müller-Hannemann, M. & Neumann, S. In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinform. 11, 148 (2010).

  15. Ruttkies, C., Schymanski, E. L., Wolf, S., Hollender, J. & Neumann, S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J. Cheminform. 8, 1–16 (2016).

  16. Wang, F. et al. CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification. Anal. Chem. 93, 11692–11700 (2021).

  17. Allen, F., Greiner, R. & Wishart, D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11, 98–110 (2015).

  18. Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).

  19. Shen, H., Zamboni, N., Heinonen, M. & Rousu, J. Metabolite identification through machine learning-tackling CASMI challenge using FingerID. Metabolites 3, 484–505 (2013).

  20. Critical Assessment of Small Molecule Identification. CASMI http://www.casmi-contest.org/2022/index.shtml (2022).

  21. Schymanski, E. L. et al. Critical Assessment of Small Molecule Identification 2016: automated methods. J. Cheminform. 9, 1–21 (2017).

  22. Böcker, S. & Dührkop, K. Fragmentation trees reloaded. J. Cheminform. 8, 1–26 (2016).

  23. Dührkop, K. et al. Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nat. Biotechnol. 39, 462–471 (2021).

  24. Hjörleifsson Eldjárn, G. et al. Ranking microbial metabolomic and genomic links in the NPLinker framework using complementary scoring functions. PLoS Comput. Biol. 17, e1008920 (2021).

  25. Stravs, M. A., Dührkop, K., Böcker, S. & Zamboni, N. MSNovelist: de novo structure generation from mass spectra. Nat. Methods 19, 865–870 (2022).

  26. Hoffmann, M. A. et al. High-confidence structural annotation of metabolites absent from spectral libraries. Nat. Biotechnol. 40, 411–421 (2021).

  27. Tripathi, A. et al. Chemically informed analyses of metabolomics mass spectrometry data with Qemistree. Nat. Chem. Biol. 17, 146–151 (2021).

  28. Huber, F. et al. Spec2Vec: improved mass spectral similarity scoring through learning of structural relationships. PLoS Comput. Biol. 17, e1008724 (2021).

  29. Huber, F., van der Burg, S., van der Hooft, J. J. & Ridder, L. MS2DeepScore: a novel deep learning similarity measure to compare tandem mass spectra. J. Cheminform. 13, 1–14 (2021).

  30. Voronov, G. et al. Multi-scale sinusoidal embeddings enable learning on high resolution mass spectrometry data. ICLR 2023 Machine Learning for Drug Discovery workshop (2023).

  31. Wei, J. N., Belanger, D., Adams, R. P. & Sculley, D. Rapid prediction of electron-ionization mass spectrometry using neural networks. ACS Cent. Sci. 5, 700–708 (2019).

  32. Li, X., Zhu, H., Liu, L.-P. & Hassoun, S. Ensemble Spectral Prediction (ESP) model for metabolite annotation. Preprint at https://arxiv.org/abs/2203.13783 (2022).

  33. Young, A., Wang, B. & Röst, H. MassFormer: tandem mass spectrum prediction with graph transformers. Preprint at https://arxiv.org/abs/2111.04824 (2021).

  34. Shrivastava, A. D. et al. MassGenie: a transformer-based deep learning method for identifying small molecules from their mass spectra. Biomolecules 11, 1793 (2021).

  35. Litsa, E. E. et al. An end-to-end deep learning framework for translating mass spectra to de-novo molecules. Communications Chemistry 6, 132 (2023).

  36. Fan, Z., Alley, A., Ghaffari, K. & Ressom, H. W. MetFID: artificial neural network-based compound fingerprint prediction for metabolite annotation. Metabolomics 16, 104 (2020).

  37. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).

  38. Dührkop, K. Deep kernel learning improves molecular fingerprint prediction from tandem mass spectra. Bioinformatics 38, i342–i349 (2022).

  39. Goldman, S. MIST software. Zenodo https://zenodo.org/record/8084088 (2022).

  40. Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).

  41. Lee, J. et al. Set transformer: a framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning 3744–3753 (PMLR, 2019).

  42. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, (2017).

  43. Aisporna, A. et al. Neutral loss mass spectral data enhances molecular similarity analysis in METLIN. J. Am. Soc. Mass Spectrom. 33, 530–534 (2022).

  44. Karras, T., Aila, T., Laine, S. & Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. International Conference on Learning Representations (2018).

  45. Ridder, L., van der Hooft, J. J. & Verhoeven, S. Automatic compound annotation from mass spectrometry data using MAGMa. Mass Spectrom. 3, S0033–S0033 (2014).

  46. Xie, Q., Luong, M.-T., Hovy, E. & Le, Q. V. Self-training with noisy student improves ImageNet classification. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10687–10698 (IEEE, 2020).

  47. Tandem Mass Spectral Library (NIST, 2020); https://www.nist.gov/programs-projects/tandem-mass-spectral-library

  48. MassBank of North America (MoNA, 2022); https://mona.fiehnlab.ucdavis.edu/

  49. Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34, 828–837 (2016).

  50. Ludwig, M., Dührkop, K. & Böcker, S. Bayesian networks for mass spectrometric metabolite identification via molecular fingerprints. Bioinformatics 34, i333–i340 (2018).

  51. Oord, A. v. d., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748 (2018).

  52. Huber, F. et al. Matchms-processing and similarity evaluation of mass spectrometry data. J. Open Source Software 5, 2411 (2020).

  53. McInnes, L. et al. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Software 3, 861 (2018).

  54. Kim, H. W. et al. NPClassifier: a deep neural network-based structural classification tool for natural products. J. Nat. Prod. 84, 2795–2807 (2021).

  55. Mills, R. H. et al. Multi-omics analyses of the ulcerative colitis gut microbiome link Bacteroides vulgatus proteases with disease severity. Nat. Microbiol. 7, 262–276 (2022).

  56. Cao, Y. et al. Commensal microbiota from patients with inflammatory bowel disease produce genotoxic metabolites. Science 378, eabm3233 (2022).

  57. Franzosa, E. A. et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat. Microbiol. 4, 293–305 (2019).

  58. Schirmer, M. et al. Compositional and temporal changes in the gut microbiome of pediatric ulcerative colitis patients are linked to disease course. Cell Host Microbe 24, 600–610.e4 (2018).

  59. Rojas-Tapias, D. F. et al. Inflammation-associated nitrate facilitates ectopic colonization of oral bacterium Veillonella parvula in the intestine. Nat. Microbiol. 7, 1673–1685 (2022).

  60. Bezerra, G. A. et al. Bacterial protease uses distinct thermodynamic signatures for substrate recognition. Sci. Rep. 7, 2848 (2017).

  61. Wlodarska, M. et al. Indoleacrylic acid produced by commensal Peptostreptococcus species suppresses inflammation. Cell Host Microbe 22, 25–37.e6 (2017).

  62. Schymanski, E. L. & Neumann, S. The Critical Assessment of Small Molecule Identification (CASMI): challenges and solutions. Metabolites 3, 517–538 (2013).

  63. Landrum, G. RDKit: a software suite for cheminformatics, computational chemistry, and predictive modeling. Greg Landrum (2013).

  64. Malisiewicz, T., Gupta, A. & Efros, A. A. Ensemble of exemplar-SVMs for object detection and beyond. In 2011 International Conference on Computer Vision 89–96 (IEEE, 2011).

  65. Ludwig, M. et al. Database-independent molecular formula annotation using Gibbs sampling through ZODIAC. Nat. Mach. Intell. 2, 629–641 (2020).

  66. Tu, Z. & Coley, C. W. Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction. J. Chem. Inf. Model. 62, 3503–3513 (2022).

  67. Dai, Z. et al. Transformer-XL: attentive language models beyond a fixed-length context. Proc. 57th Ann. Meeting Assoc. Computational Linguistics. (2019).

  68. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).

  69. Gutmann, M. & Hyvärinen, A. Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proc. Thirteenth International Conference on Artificial Intelligence and Statistics 297–304 (JMLR, 2010).

  70. Liu, L. et al. On the Variance of the Adaptive Learning Rate and Beyond. Intern. Conf. on Learning Representations. (2019).

  71. Wishart, D. S. et al. HMDB: the Human Metabolome Database. Nucleic Acids Res. 35, D521–D526 (2007).

  72. Shinbo, Y. et al. KNApSAcK: A Comprehensive Species-Metabolite Relationship Database. In: Saito, K., Dixon, R.A., Willmitzer, L. (eds) Plant Metabolomics. Biotechnology in Agriculture and Forestry, (Springer, 2006).

  73. Kanehisa, M. The KEGG database. In Novartis Foundation Symposium 91–100 (Wiley Online Library, 2002).

  74. Kim, S. et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 47, D1102–D1109 (2019).

  75. Wishart, D. S. et al. HMDB 5.0: the Human Metabolome Database for 2022. Nucleic Acids Res. 50, D622–D631 (2022).

  76. Schmid, R. et al. Integrative analysis of multimodal mass spectrometry data in MZmine 3. Nat. Biotechnol. 41, 447–449 (2023).

Acknowledgements

We thank J. Bradshaw, R. Mercado, R. Barzilay, M. Wang, J. C. Hütter, J. Pacheco, C. Tzouanas, M. Zhu and D. Hitchcock for valuable feedback and discussion on the work. We are especially grateful to K. Dührkop and S. Böcker for providing data to directly compare to their CSI:FingerID model and for help with the SIRIUS software. This work was supported by the Machine Learning for Pharmaceutical Discovery and Synthesis consortium, as well as the National Institutes of Health (P30DK043351 and R01AI172147 to R.J.X.). S.G. thanks the Takeda Healthcare AI Fellowship for additional support.

Author information

Contributions

S.G. wrote the software and conducted experiments. G.H. adapted MAGMa substructure labelling for auxiliary model training. S.G., J.W. and C.W.C. conceptualized the project and designed model components. M.S. and R.J.X. provided prospective clinical data analysis support. S.G. and C.W.C. wrote the paper. C.W.C. supervised the work.

Corresponding author

Correspondence to Connor W. Coley.

Ethics declarations

Competing interests

C.W.C. is a scientific advisor to Enveda Therapeutics, Inc. R.J.X. is a co-founder of Celsius Therapeutics and Jnana Therapeutics, a member of the Board of Directors at MoonLake Immunotherapeutics and a member of the Scientific Advisory Board at Nestlé. The other authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Wout Bittremieux and Tomáš Pluskal for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Sections 1.1–1.8, Figs. 1–12 and Tables 1–12.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Goldman, S., Wohlwend, J., Stražar, M. et al. Annotating metabolite mass spectra with domain-inspired chemical formula transformers. Nat Mach Intell 5, 965–979 (2023). https://doi.org/10.1038/s42256-023-00708-3

  • DOI: https://doi.org/10.1038/s42256-023-00708-3
