BUDDY: molecular formula discovery via bottom-up MS/MS interrogation

Xing, Shipei; Shen, Sam; Xu, Banghua; Li, Xiaoxiao; Huan, Tao

doi:10.1038/s41592-023-01850-x

Article
Published: 13 April 2023

Shipei Xing¹,
Sam Shen¹,
Banghua Xu¹,
Xiaoxiao Li² &
…
Tao Huan ORCID: orcid.org/0000-0001-6295-2435¹

Nature Methods volume 20, pages 881–890 (2023)

Abstract

A substantial fraction of metabolic features remains undetermined in mass spectrometry (MS)-based metabolomics, and molecular formula annotation is the starting point for unraveling their chemical identities. Here we present bottom-up tandem MS (MS/MS) interrogation, a method for de novo formula annotation. Our approach prioritizes MS/MS-explainable formula candidates, implements machine-learned ranking and offers false discovery rate estimation. Compared with mathematically exhaustive formula enumeration, our approach shrinks the formula candidate space by 42.8% on average. Method benchmarking on annotation accuracy was systematically carried out on reference MS/MS libraries and real metabolomics datasets. Applied to 155,321 recurrent unidentified spectra, our approach confidently annotated >5,000 novel molecular formulae absent from chemical databases. Beyond the level of individual metabolic features, we combined bottom-up MS/MS interrogation with global optimization to refine formula annotations while revealing peak interrelationships. This approach allowed the systematic annotation of 37 fatty acid amide molecules in human fecal data. All bioinformatics pipelines are available in the standalone software BUDDY (https://github.com/HuanLab/BUDDY).
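
The bottom-up idea in the abstract can be illustrated with a toy sketch. Instead of enumerating every formula whose mass fits the precursor, it pairs a candidate formula for a fragment with a candidate formula for the complementary neutral loss, and keeps only the sums that reproduce the precursor mass. This is not BUDDY's implementation: the function names, the hand-written formula pools, the 5 ppm tolerance and the use of neutral monoisotopic masses (charge and adducts are ignored) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Monoisotopic masses (Da) of the elements used in this toy example.
MONO_MASS = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}


def mass(formula):
    """Monoisotopic mass of a formula given as an element -> count mapping."""
    return sum(MONO_MASS[el] * n for el, n in formula.items())


def to_string(formula):
    """Render a formula in the fixed element order C, H, N, O."""
    parts = []
    for el in ("C", "H", "N", "O"):
        n = formula.get(el, 0)
        if n:
            parts.append(el if n == 1 else f"{el}{n}")
    return "".join(parts)


def bottom_up_candidates(precursor_mass, fragment_pool, loss_pool, tol_ppm=5.0):
    """Combine fragment and neutral-loss formulas into precursor candidates.

    A candidate is kept only if fragment + loss reproduces the precursor
    mass within tol_ppm, so every candidate explains at least one
    fragment/loss pair by construction.
    """
    candidates = set()
    for frag, loss in product(fragment_pool, loss_pool):
        combined = Counter(frag) + Counter(loss)
        error_ppm = (mass(combined) - precursor_mass) / precursor_mass * 1e6
        if abs(error_ppm) <= tol_ppm:
            candidates.add(to_string(combined))
    return sorted(candidates)


# Hypothetical pools for a caffeine-like precursor (neutral mass ~194.0804 Da).
fragment_pool = [
    {"C": 6, "H": 7, "N": 3, "O": 1},
    {"C": 5, "H": 7, "N": 3},
]
loss_pool = [
    {"C": 2, "H": 3, "N": 1, "O": 1},
    {"C": 1, "O": 1},
    {"H": 2, "O": 1},
]

print(bottom_up_candidates(194.0804, fragment_pool, loss_pool))
```

In this toy case only one of the six fragment/loss pairs sums to the precursor mass, which is the sense in which the candidate space is restricted to MS/MS-explainable formulae.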

**Fig. 1: Methodological comparison between top-down and bottom-up approaches for molecular formula annotation.**

**Fig. 2: Bottom-up approach prioritizes MS/MS-explainable candidates.**

**Fig. 3: Method performance evaluation.**

**Fig. 4: Platt calibration and FDR estimation.**

**Fig. 5: BUDDY discovers novel formulae absent from chemical databases.**

**Fig. 6: Experiment-specific global peak annotation and its application in untargeted metabolomics.**

Data availability

The four reference MS/MS libraries used for evaluation can be freely downloaded at https://mona.fiehnlab.ucdavis.edu/downloads. The MS/MS spectra of five newly discovered compounds can be accessed on GNPS (CCMSLIB00005467952 and CCMSLIB00005716808) and https://www.nature.com/articles/s41592-021-01303-3#MOESM10. ARUS MS/MS libraries can be downloaded at https://chemdata.nist.gov/dokuwiki/doku.php?id=chemdata:arus, and corresponding annotation results can be accessed on Zenodo (https://doi.org/10.5281/zenodo.7495955). All LC–MS/MS datasets are available from the MassIVE repository (American Gut Project, MSV000081981; tomato, MSV000081463; Chagas disease, MSV000086988; NIST human fecal material standards, MSV000086988 and MSV000086989). Evaluation results are provided in Supplementary Tables.

Code availability

BUDDY is written in C# on the Universal Windows Platform (UWP) and currently runs on Windows 10 or higher. The standalone software can be freely downloaded from GitHub (https://github.com/HuanLab/BUDDY) and Zenodo (https://doi.org/10.5281/zenodo.7735295). The source code is also available on GitHub (https://github.com/HuanLab/BUDDY) under the MIT License.

Acknowledgements

This study was funded by the University of British Columbia Start-up Grant (grant no. F18-03001), Canada Foundation for Innovation (grant no. CFI 38159) and Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants (grant nos. RGPIN-2020-04895 and RGPIN-2022-05316). We thank A. Hui for proofreading this paper.

Author information

Authors and Affiliations

Department of Chemistry, Faculty of Science, University of British Columbia, Vancouver, British Columbia, Canada
Shipei Xing, Sam Shen, Banghua Xu & Tao Huan
Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, British Columbia, Canada
Xiaoxiao Li

Contributions

S.X. and T.H. conceived the research project. S.X. developed the computational algorithms for BUDDY. X.L. assisted with the MLR feature design and Platt calibration. S.X., S.S. and B.X. constructed the standalone software platform in C#. S.X. performed method evaluations and applications. S.X. and T.H. wrote the paper and all authors approved the final version.

Corresponding author

Correspondence to Tao Huan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Corey Broeckling, Oscar Yanes and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 BUDDY graphical user interface.

BUDDY offers an intuitive graphical user interface directly downloadable from GitHub. BUDDY can simultaneously process data files from different sources. For identified metabolites, MS/MS matching information and detailed chemical descriptions are provided to help check the identification quality and further investigate biological insights. If experiment-specific global peak annotation is performed, feature connections are listed with MS/MS similarity, formula difference, and connection description information.

Extended Data Fig. 2 Generation of MS/MS-explainable candidate space.

Both even-electron and odd-electron species are considered in bottom-up MS/MS interrogation.

Extended Data Fig. 3 Structural complexity of tested compounds in reference MS/MS libraries.

The tested compounds cover a broad range of molecular complexity and NP-likeness scores compared to the entire chemical space of PubChem.

Extended Data Fig. 4 Potential factors impacting the formula annotation performance on tested metabolomics datasets.

a, The relative mass deviations of experimental precursor masses from theoretical masses for all identified metabolites. b, The number of reserved fragments (maximum: 50) after low-abundance noise ions and MS/MS isotopic peaks are removed in MS/MS preprocessing. c, The number of valid fragments that are subformula-explainable using the ground-truth molecular formulae.
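
Relative mass deviation, as in panel a, is conventionally reported in parts per million (ppm). The helper below is a minimal sketch of that convention; the function name and the example m/z values are illustrative and not taken from the paper.

```python
def ppm_error(experimental_mz, theoretical_mz):
    """Signed relative mass deviation of an observed m/z, in ppm."""
    return (experimental_mz - theoretical_mz) / theoretical_mz * 1e6


# Illustrative values: an ion observed 0.00039 m/z above its theoretical value.
print(round(ppm_error(195.08804, 195.08765), 2))
```

A positive value means the observed mass is heavier than the theoretical one; a negative value means it is lighter.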

Extended Data Fig. 5 Validation of FDR estimation on reference MS/MS libraries.

We used four reference MS/MS libraries to evaluate FDR estimation in BUDDY. Q-Q plots of estimated FDR and exact FDR are shown. In both positive and negative ion modes, estimated FDR shows high Pearson’s correlation coefficients (r > 0.98) with exact FDR.
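
The comparison between estimated and exact FDR can be sketched as follows. The caption does not spell out BUDDY's estimator, so this example assumes one common approach: if each top-ranked annotation has a calibrated probability p of being correct, the estimated FDR of an accepted set is the mean of (1 − p) over that set. The exact FDR needs ground-truth labels, which are only available for reference libraries. Function and variable names are hypothetical.

```python
def fdr_curves(probabilities, is_correct):
    """Return (estimated_fdr, exact_fdr) lists over score-ranked annotations.

    probabilities: calibrated probability that each top-ranked formula is right
                   (assumed to come from a calibration step such as Platt scaling).
    is_correct:    ground-truth labels, known only for reference libraries.
    Entry k-1 of each list describes the k most confident annotations.
    """
    ranked = sorted(zip(probabilities, is_correct), key=lambda pair: -pair[0])
    estimated, exact = [], []
    error_sum = 0.0   # running sum of posterior error probabilities (1 - p)
    false_count = 0   # running count of wrong annotations
    for k, (p, correct) in enumerate(ranked, start=1):
        error_sum += 1.0 - p
        false_count += 0 if correct else 1
        estimated.append(error_sum / k)
        exact.append(false_count / k)
    return estimated, exact


# Made-up probabilities and labels for five annotations.
probs = [0.99, 0.95, 0.90, 0.70, 0.60]
labels = [True, True, True, False, True]
estimated, exact = fdr_curves(probs, labels)
print(estimated)
print(exact)
```

Plotting the two lists against each other gives the kind of Q-Q comparison the caption describes; well-calibrated probabilities keep the points close to the diagonal.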

Extended Data Fig. 6 Method evaluation on novel compounds discovered in references.

Bottom-up MS/MS interrogation was evaluated using the MS/MS spectra of five novel compounds, three of which were confirmed by NMR experiments. The correct formulae were ranked first in all cases, with the estimated FDR ranging from 0.6 to 16.3%.

Extended Data Fig. 7 Molecular formula discovery in NIST human fecal material standards.

a, Venn diagram of annotated molecular formulae with estimated FDR < 5%. b, Scatter and density plots of m/z, retention time, double bond equivalent (DBE) value and hydrogen-to-carbon ratio (H/C) of high-confidence annotated molecular formulae. c, Manual inspection of N-valeryl histamine. The experimental MS/MS of N-valeryl histamine shows a reverse dot product score of 0.99 against the reference MS/MS of histamine. Two neutral losses representing the acyl chain were found. Valeric acid is commonly produced in the gut microbiota (for example, by Clostridia species), and histamine is a well-known diet-derived metabolite.
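
The DBE value and H/C ratio plotted in panel b follow directly from a molecular formula. The sketch below uses the standard rule DBE = C − (H + X)/2 + (N + P)/2 + 1, where X counts halogens, and a simple regex parser that handles only flat formulae without brackets or charges. The helper names are hypothetical. The example formula C10H17N3O is not stated in the caption; it is derived here by condensing histamine (C5H9N3) with valeric acid (C5H10O2) and removing H2O.

```python
import re


def parse_formula(formula):
    """Parse a flat formula string such as 'C10H17N3O' into element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def dbe(counts):
    """Double bond equivalents, assuming the usual valences of C, H, N, P and halogens."""
    halogens = sum(counts.get(x, 0) for x in ("F", "Cl", "Br", "I"))
    return (
        counts.get("C", 0)
        - (counts.get("H", 0) + halogens) / 2
        + (counts.get("N", 0) + counts.get("P", 0)) / 2
        + 1
    )


def h_to_c(counts):
    """Hydrogen-to-carbon ratio; requires at least one carbon atom."""
    return counts.get("H", 0) / counts["C"]


# Assumed formula of N-valeryl histamine (see the note above).
amide = parse_formula("C10H17N3O")
print(dbe(amide), h_to_c(amide))
```

The DBE of 4 for this formula is consistent with an imidazole ring (one ring and two double bonds) plus one amide carbonyl.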

Supplementary information

Supplementary Information

Supplementary Figs. 1–11 and Notes 1–18.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–22. Captions are provided within each tab.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xing, S., Shen, S., Xu, B. et al. BUDDY: molecular formula discovery via bottom-up MS/MS interrogation. Nat Methods 20, 881–890 (2023). https://doi.org/10.1038/s41592-023-01850-x

Received: 03 August 2022
Accepted: 15 March 2023
Published: 13 April 2023
Issue Date: June 2023
DOI: https://doi.org/10.1038/s41592-023-01850-x
