BUDDY: molecular formula discovery via bottom-up MS/MS interrogation

Abstract

A substantial fraction of metabolic features remains undetermined in mass spectrometry (MS)-based metabolomics, and molecular formula annotation is the starting point for unraveling their chemical identities. Here we present bottom-up tandem MS (MS/MS) interrogation, a method for de novo formula annotation. Our approach prioritizes MS/MS-explainable formula candidates, implements machine-learned ranking and offers false discovery rate estimation. Compared with mathematically exhaustive formula enumeration, our approach shrinks the formula candidate space by 42.8% on average. Method benchmarking on annotation accuracy was systematically carried out on reference MS/MS libraries and real metabolomics datasets. Applied to 155,321 recurrent unidentified spectra, our approach confidently annotated >5,000 novel molecular formulae absent from chemical databases. Beyond the level of individual metabolic features, we combined bottom-up MS/MS interrogation with global optimization to refine formula annotations while revealing peak interrelationships. This approach allowed the systematic annotation of 37 fatty acid amide molecules in human fecal data. All bioinformatics pipelines are available in the standalone software BUDDY (https://github.com/HuanLab/BUDDY).

Fig. 1: Methodological comparison between top-down and bottom-up approaches for molecular formula annotation.
Fig. 2: Bottom-up approach prioritizes MS/MS-explainable candidates.
Fig. 3: Method performance evaluation.
Fig. 4: Platt calibration and FDR estimation.
Fig. 5: BUDDY discovers novel formulae absent from chemical databases.
Fig. 6: Experiment-specific global peak annotation and its application in untargeted metabolomics.

Data availability

The four reference MS/MS libraries used for evaluation can be freely downloaded at https://mona.fiehnlab.ucdavis.edu/downloads. The MS/MS spectra of five newly discovered compounds can be accessed on GNPS (CCMSLIB00005467952 and CCMSLIB00005716808) and https://www.nature.com/articles/s41592-021-01303-3#MOESM10. ARUS MS/MS libraries can be downloaded at https://chemdata.nist.gov/dokuwiki/doku.php?id=chemdata:arus, and corresponding annotation results can be accessed on Zenodo (https://doi.org/10.5281/zenodo.7495955). All LC–MS/MS datasets are available from the MassIVE repository (American Gut Project, MSV000081981; tomato, MSV000081463; Chagas disease, MSV000086988; NIST human fecal material standards, MSV000086988 and MSV000086989). Evaluation results are provided in Supplementary Tables.

Code availability

BUDDY is written in C# on the Universal Windows Platform (UWP) and currently runs on Windows 10 or higher. The standalone software can be freely downloaded from GitHub (https://github.com/HuanLab/BUDDY) and Zenodo (https://doi.org/10.5281/zenodo.7735295). The source code is also available on GitHub (https://github.com/HuanLab/BUDDY) under the MIT License.

Acknowledgements

This study was funded by the University of British Columbia Start-up Grant (grant no. F18-03001), Canada Foundation for Innovation (grant no. CFI 38159) and Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants (grant nos. RGPIN-2020-04895 and RGPIN-2022-05316). We thank A. Hui for proofreading this paper.

Author information

Contributions

S.X. and T.H. conceived the research project. S.X. developed the computational algorithms for BUDDY. X.L. assisted with the MLR feature design and Platt calibration. S.X., S.S. and B.X. constructed the standalone software platform in C#. S.X. performed method evaluations and applications. S.X. and T.H. wrote the paper and all authors approved the final version.

Corresponding author

Correspondence to Tao Huan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Corey Broeckling, Oscar Yanes and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 BUDDY graphical user interface.

BUDDY offers an intuitive graphical user interface directly downloadable from GitHub. BUDDY can simultaneously process data files from different sources. For identified metabolites, MS/MS matching information and detailed chemical descriptions are provided to help check the identification quality and further investigate biological insights. If experiment-specific global peak annotation is performed, feature connections are listed with MS/MS similarity, formula difference, and connection description information.

Extended Data Fig. 2 Generation of MS/MS-explainable candidate space.

Both even-electron and odd-electron species are considered in bottom-up MS/MS interrogation.
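One plausible way to picture an MS/MS-explainable candidate space is to pair each fragment formula with a neutral-loss formula and sum the two into a precursor candidate. The Python sketch below is a toy illustration under stated assumptions, not BUDDY's implementation: the two small formula lists are hypothetical stand-ins for real databases, the precursor is assumed to be [M+H]+, and only protonated (even-electron) fragments are matched, so the odd-electron species mentioned above would need an additional ion-mass branch. The example ion corresponds to C10H17N3O, a formula derived here for N-valeryl histamine rather than quoted from the article.

```python
import re
from collections import Counter

# Monoisotopic masses (Da), rounded to six decimals; proton mass for [M+H]+ ions.
MONO = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}
PROTON = 1.007276


def parse(formula):
    """Turn a formula string such as 'C5H9N3' into element counts."""
    pairs = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    return Counter({el: int(n) if n else 1 for el, n in pairs})


def mass(counts):
    """Monoisotopic mass of a neutral formula."""
    return sum(MONO[el] * n for el, n in counts.items())


def to_string(counts):
    """Write element counts in Hill order (C, H, then alphabetical)."""
    rest = sorted(el for el in counts if el not in ("C", "H"))
    order = [el for el in ("C", "H") if counts.get(el)] + rest
    return "".join(f"{el}{counts[el] if counts[el] > 1 else ''}" for el in order)


def within_ppm(measured, theoretical, ppm):
    return abs(measured - theoretical) <= theoretical * ppm * 1e-6


def bottom_up_candidates(precursor_mz, fragment_mzs, fragment_db, loss_db, ppm=10.0):
    """Collect precursor formulae explained by one fragment plus one neutral loss."""
    candidates = set()
    for frag_mz in fragment_mzs:
        loss_mass = precursor_mz - frag_mz  # charge stays on the fragment
        frag_hits = [f for f in fragment_db
                     if within_ppm(frag_mz, mass(parse(f)) + PROTON, ppm)]
        loss_hits = [l for l in loss_db
                     if within_ppm(loss_mass, mass(parse(l)), ppm)]
        for f in frag_hits:
            for l in loss_hits:
                candidates.add(to_string(parse(f) + parse(l)))
    return candidates


# Hypothetical inputs: one precursor, one fragment peak, two tiny formula lists.
result = bottom_up_candidates(
    precursor_mz=196.1444,
    fragment_mzs=[112.0869],
    fragment_db=["C5H9N3", "C4H6N2"],
    loss_db=["C5H8O", "H2O"],
)
print(result)  # {'C10H17N3O'}
```

Only formulae supported by at least one fragment and neutral-loss pair enter the candidate set, which is why such a space can be smaller than an exhaustive enumeration over the precursor mass alone.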

Extended Data Fig. 3 Structural complexity of tested compounds in reference MS/MS libraries.

The tested compounds cover a broad range of molecular complexity and NP-likeness scores compared to the entire chemical space of PubChem.

Extended Data Fig. 4 Potential factors impacting the formula annotation performance on tested metabolomics datasets.

a, The relative mass deviations of experimental precursor masses from theoretical masses for all identified metabolites. b, The number of reserved fragments (maximum: 50) after low-abundance noise ions and MS/MS isotopic peaks are removed in MS/MS preprocessing. c, The number of valid fragments that are subformula-explainable using the ground-truth molecular formulae.
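The quantities in panels a and b can be sketched in a few lines of Python. This is a minimal illustration, assuming a relative-intensity cutoff for noise removal (the 1% default is a hypothetical value, not taken from the article) and omitting the removal of MS/MS isotopic peaks; only the cap of 50 reserved fragments comes from the caption.

```python
def ppm_deviation(experimental_mz, theoretical_mz):
    """Relative mass deviation of an experimental m/z from theory, in ppm."""
    return (experimental_mz - theoretical_mz) / theoretical_mz * 1e6


def reserve_fragments(peaks, min_rel_intensity=0.01, max_fragments=50):
    """Drop low-abundance peaks, then keep at most `max_fragments` of the
    most intense ones. `peaks` is a list of (mz, intensity) tuples.
    The relative-intensity threshold is an assumed noise filter."""
    if not peaks:
        return []
    base_peak = max(intensity for _, intensity in peaks)
    kept = [(mz, i) for mz, i in peaks if i >= min_rel_intensity * base_peak]
    kept.sort(key=lambda peak: peak[1], reverse=True)  # most intense first
    return sorted(kept[:max_fragments])                # back to m/z order


spectrum = [(50.0, 100.0), (60.0, 0.5), (70.0, 20.0)]
print(reserve_fragments(spectrum))  # [(50.0, 100.0), (70.0, 20.0)]
```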

Extended Data Fig. 5 Validation of FDR estimation on reference MS/MS libraries.

We used four reference MS/MS libraries to evaluate FDR estimation in BUDDY. Q-Q plots of estimated FDR and exact FDR are shown. In both positive and negative ion modes, estimated FDR shows high Pearson’s correlation coefficients (r > 0.98) with exact FDR.
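A comparison of this kind can be sketched as follows. The code assumes that each annotation carries a calibrated probability of being correct and that the estimated FDR of a ranked set is the mean posterior error probability (1 − p) of its members; this is a common convention and an assumption here, since the caption does not spell out the formula BUDDY uses. The exact FDR is the observed fraction of wrong annotations, which is only available when ground truth is known, as in reference libraries.

```python
import math


def fdr_curves(annotations):
    """annotations: list of (probability_correct, is_correct) tuples.
    Returns (estimated, exact) FDR lists for the top-k annotations, k = 1..n,
    after ranking by probability in descending order."""
    ranked = sorted(annotations, key=lambda a: a[0], reverse=True)
    estimated, exact = [], []
    error_sum, wrong = 0.0, 0
    for k, (prob, correct) in enumerate(ranked, start=1):
        error_sum += 1.0 - prob          # assumed posterior error probability
        wrong += 0 if correct else 1
        estimated.append(error_sum / k)
        exact.append(wrong / k)
    return estimated, exact


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Made-up example values, not data from the article.
toy = [(0.9, True), (0.8, True), (0.6, False), (0.5, True)]
est, exact = fdr_curves(toy)
print(est, exact, pearson_r(est, exact))
```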

Extended Data Fig. 6 Method evaluation on novel compounds discovered in references.

Bottom-up MS/MS interrogation was evaluated using the MS/MS spectra of five novel compounds, three of which were confirmed in NMR experiments. The correct formulae were ranked first in all cases, with the estimated FDR ranging from 0.6 to 16.3%.

Extended Data Fig. 7 Molecular formula discovery in NIST human fecal material standards.

a, Venn diagram of annotated molecular formulae with estimated FDR < 5%. b, Scatter and density plots of m/z, retention time, DBE value, and hydrogen-to-carbon ratio (H/C) of high-confidence annotated molecular formulae. c, Manual inspection of N-valeryl histamine. The experimental MS/MS of N-valeryl histamine shows a reverse dot product score of 0.99 against the reference MS/MS of histamine. Two neutral losses representing the acyl chain were found. Valeric acid is commonly produced by the gut microbiota (for example, by Clostridia species), and histamine is a well-known diet-derived metabolite.
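The two formula-level descriptors in panel b can be computed directly from an elemental composition. The sketch below uses the standard double bond equivalent (DBE) expression for neutral molecules, DBE = C − (H + halogens)/2 + (N + P)/2 + 1, with Si counted like C. The formula C10H17N3O for N-valeryl histamine is an assumption derived here (histamine C5H9N3 plus valeric acid C5H10O2 minus H2O), not a value quoted from the article.

```python
import re


def parse_formula(formula):
    """Turn a formula string such as 'C10H17N3O' into element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def dbe(counts):
    """Double bond equivalents (rings plus double bonds) of a neutral formula."""
    tetravalent = counts.get("C", 0) + counts.get("Si", 0)
    monovalent = sum(counts.get(el, 0) for el in ("H", "F", "Cl", "Br", "I"))
    trivalent = counts.get("N", 0) + counts.get("P", 0)
    return tetravalent - monovalent / 2 + trivalent / 2 + 1


def h_to_c(counts):
    """Hydrogen-to-carbon ratio."""
    return counts["H"] / counts["C"]


composition = parse_formula("C10H17N3O")  # assumed formula, see lead-in
print(dbe(composition))     # 4.0
print(h_to_c(composition))  # 1.7
```

A DBE of 4 matches the imidazole ring (one ring, two double bonds) plus the amide carbonyl.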

Supplementary information

Supplementary Information

Supplementary Figs. 1–11 and Notes 1–18.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–22. Captions are provided within each tab.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xing, S., Shen, S., Xu, B. et al. BUDDY: molecular formula discovery via bottom-up MS/MS interrogation. Nat Methods 20, 881–890 (2023). https://doi.org/10.1038/s41592-023-01850-x

  • DOI: https://doi.org/10.1038/s41592-023-01850-x
