Abstract
A substantial fraction of metabolic features remains undetermined in mass spectrometry (MS)-based metabolomics, and molecular formula annotation is the starting point for unraveling their chemical identities. Here we present bottom-up tandem MS (MS/MS) interrogation, a method for de novo formula annotation. Our approach prioritizes MS/MS-explainable formula candidates, implements machine-learned ranking and offers false discovery rate estimation. Compared with mathematically exhaustive formula enumeration, our approach shrinks the formula candidate space by 42.8% on average. Annotation accuracy was systematically benchmarked on reference MS/MS libraries and real metabolomics datasets. Applied to 155,321 recurrent unidentified spectra, our approach confidently annotated >5,000 novel molecular formulae absent from chemical databases. Beyond the level of individual metabolic features, we combined bottom-up MS/MS interrogation with global optimization to refine formula annotations while revealing peak interrelationships. This approach allowed the systematic annotation of 37 fatty acid amide molecules in human fecal data. All bioinformatics pipelines are available in the standalone software BUDDY (https://github.com/HuanLab/BUDDY).
Data availability
The four reference MS/MS libraries used for evaluation can be freely downloaded at https://mona.fiehnlab.ucdavis.edu/downloads. The MS/MS spectra of five newly discovered compounds can be accessed on GNPS (CCMSLIB00005467952 and CCMSLIB00005716808) and https://www.nature.com/articles/s41592-021-01303-3#MOESM10. ARUS MS/MS libraries can be downloaded at https://chemdata.nist.gov/dokuwiki/doku.php?id=chemdata:arus, and corresponding annotation results can be accessed on Zenodo (https://doi.org/10.5281/zenodo.7495955). All LC–MS/MS datasets are available from the MassIVE repository (American Gut Project, MSV000081981; tomato, MSV000081463; Chagas disease, MSV000086988; NIST human fecal material standards, MSV000086988 and MSV000086989). Evaluation results are provided in Supplementary Tables.
Code availability
BUDDY is written in C# on the Universal Windows Platform (UWP) and currently runs on Windows 10 or higher. The standalone software can be freely downloaded from GitHub (https://github.com/HuanLab/BUDDY) and Zenodo (https://doi.org/10.5281/zenodo.7735295). The source code is also available on GitHub (https://github.com/HuanLab/BUDDY) under the MIT License.
Acknowledgements
This study was funded by the University of British Columbia Start-up Grant (grant no. F18-03001), Canada Foundation for Innovation (grant no. CFI 38159) and Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants (grant nos. RGPIN-2020-04895 and RGPIN-2022-05316). We thank A. Hui for proofreading this paper.
Author information
Contributions
S.X. and T.H. conceived the research project. S.X. developed the computational algorithms for BUDDY. X.L. assisted with the MLR feature design and Platt calibration. S.X., S.S. and B.X. constructed the standalone software platform in C#. S.X. performed method evaluations and applications. S.X. and T.H. wrote the paper and all authors approved the final version.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Corey Broeckling, Oscar Yanes and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 BUDDY graphical user interface.
BUDDY offers an intuitive graphical user interface directly downloadable from GitHub. BUDDY can simultaneously process data files from different sources. For identified metabolites, MS/MS matching information and detailed chemical descriptions are provided to help assess identification quality and explore biological insights. If experiment-specific global peak annotation is performed, feature connections are listed with MS/MS similarity, formula difference and connection description information.
Extended Data Fig. 2 Generation of MS/MS-explainable candidate space.
Both even-electron and odd-electron species are considered in bottom-up MS/MS interrogation.
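The idea of an MS/MS-explainable candidate space can be illustrated with a minimal Python sketch. It is not BUDDY's implementation: `SUBFORMULA_DB`, `lookup`, `bottom_up_candidates` and the 0.002 Da tolerance are hypothetical names and values chosen for illustration, and all masses are treated as neutral, so adducts and the even-electron/odd-electron distinction are ignored. Each fragment mass and its complementary neutral loss (precursor minus fragment) are looked up separately, and a precursor formula is kept only if it is the sum of a matching fragment formula and a matching loss formula.

```python
from collections import Counter

# Monoisotopic masses (Da) of the elements used in this sketch.
MONO = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}

# Hypothetical, tiny subformula table; a real tool would use a large database.
SUBFORMULA_DB = [
    {"C": 1, "O": 2},                  # CO2
    {"H": 2, "O": 1},                  # H2O
    {"N": 1, "H": 3},                  # NH3
    {"C": 2, "H": 7, "N": 1},          # C2H7N
    {"C": 3, "H": 5, "N": 1, "O": 1},  # C3H5NO
]

def mass(formula):
    """Monoisotopic mass of a formula given as {element: count}."""
    return sum(MONO[el] * n for el, n in formula.items())

def fmt(formula):
    """Render a formula as a string: C first, H second, then alphabetical."""
    rank = lambda el: (0 if el == "C" else 1 if el == "H" else 2, el)
    return "".join(el + (str(formula[el]) if formula[el] > 1 else "")
                   for el in sorted(formula, key=rank))

def lookup(target_mass, tol):
    """All table entries whose mass lies within tol (Da) of target_mass."""
    return [f for f in SUBFORMULA_DB if abs(mass(f) - target_mass) <= tol]

def bottom_up_candidates(precursor_mass, fragment_masses, tol=0.002):
    """Count, for each precursor formula, how many fragments can explain it."""
    support = Counter()
    for frag_mass in fragment_masses:
        loss_mass = precursor_mass - frag_mass
        explained = set()
        for frag in lookup(frag_mass, tol):
            for loss in lookup(loss_mass, tol):
                candidate = Counter(frag) + Counter(loss)
                if abs(mass(candidate) - precursor_mass) <= tol:
                    explained.add(fmt(candidate))
        support.update(explained)  # each fragment votes once per candidate
    return support

# Toy input: neutral mass of C3H7NO2 with two fragment masses.
print(dict(bottom_up_candidates(89.0477, [45.0578, 71.0371])))
```

In this toy case both fragment/loss pairs (C2H7N + CO2 and C3H5NO + H2O) sum to the same precursor formula, so it collects support from two fragments. Candidates that no fragment/loss pair can reproduce never enter the candidate space, which is the sense in which the space is smaller than an exhaustive enumeration.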
Extended Data Fig. 3 Structural complexity of tested compounds in reference MS/MS libraries.
The tested compounds cover a broad range of molecular complexity and NP-likeness scores compared to the entire chemical space of PubChem.
Extended Data Fig. 4 Potential factors impacting the formula annotation performance on tested metabolomics datasets.
a, The relative mass deviations of experimental precursor masses from theoretical masses for all identified metabolites. b, The number of reserved fragments (maximum: 50) after low-abundance noise ions and MS/MS isotopic peaks are removed in MS/MS preprocessing. c, The number of valid fragments that are subformula-explainable using the ground-truth molecular formulae.
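For readers who want to reproduce the quantity in panel a, the relative mass deviation is conventionally expressed in parts per million (ppm). The sketch below uses a hypothetical function name and made-up example masses; it is not taken from BUDDY.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Relative mass deviation of a measured m/z from its theoretical value, in ppm."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6

# Illustrative values only: a 0.001 m/z deviation at m/z 200 is about 5 ppm.
print(round(ppm_error(200.001, 200.0), 3))
```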
Extended Data Fig. 5 Validation of FDR estimation on reference MS/MS libraries.
We used four reference MS/MS libraries to evaluate FDR estimation in BUDDY. Q-Q plots of estimated FDR and exact FDR are shown. In both positive and negative ion modes, estimated FDR shows high Pearson’s correlation coefficients (r > 0.98) with exact FDR.
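The comparison between estimated and exact FDR can be sketched as follows. This is an illustration under stated assumptions, not BUDDY's code: it assumes that ranking scores are mapped to probabilities with Platt's sigmoid (the parameters `a` and `b` below are hypothetical), and that the estimated FDR of a set of accepted annotations is the mean of their posterior error probabilities (1 - p). The exact FDR is only available on reference libraries, where the ground truth is known.

```python
import math

def platt_probability(score, a, b):
    """Platt scaling: map a raw score to a probability, 1 / (1 + exp(a*score + b)).

    a and b are fitted on labeled data; a is negative when higher scores
    indicate more likely correct annotations.
    """
    return 1.0 / (1.0 + math.exp(a * score + b))

def estimated_fdr(probabilities):
    """Mean posterior error probability of the accepted annotations (assumed estimator)."""
    if not probabilities:
        return 0.0
    return sum(1.0 - p for p in probabilities) / len(probabilities)

def exact_fdr(is_correct):
    """Fraction of accepted annotations that are wrong, given ground-truth labels."""
    if not is_correct:
        return 0.0
    return sum(1 for ok in is_correct if not ok) / len(is_correct)

# Hypothetical scores and labels for four accepted annotations.
scores = [3.0, 2.0, 1.5, 0.5]
probs = [platt_probability(s, a=-2.0, b=1.0) for s in scores]
print(estimated_fdr(probs), exact_fdr([True, True, True, False]))
```

Plotting `estimated_fdr` against `exact_fdr` over a range of acceptance thresholds gives a curve of the kind summarized in this figure.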
Extended Data Fig. 6 Method evaluation on novel compounds discovered in references.
Bottom-up MS/MS interrogation was evaluated using the MS/MS spectra of five novel compounds, three of which were confirmed by NMR experiments. The correct formulae were ranked first in all cases, with the estimated FDR ranging from 0.6 to 16.3%.
Extended Data Fig. 7 Molecular formula discovery in NIST human fecal material standards.
a, Venn diagram of annotated molecular formulae with estimated FDR < 5%. b, Scatter and density plots of m/z, retention time, double-bond equivalent (DBE) value and hydrogen-to-carbon ratio (H/C) of high-confidence annotated molecular formulae. c, Manual inspection of N-valeryl histamine. The experimental MS/MS of N-valeryl histamine shows a reverse dot product score of 0.99 against the reference MS/MS of histamine. Two neutral losses representing the acyl chain were found. Valeric acid is commonly produced in the gut microbiota (for example, by Clostridia species), and histamine is a well-known diet-derived metabolite.
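The two formula-level descriptors in panel b can be computed directly from element counts. The sketch below uses the standard DBE expression for C, H, N, O, P and halogen-containing formulae; the function names are hypothetical, and the example formula C10H17N3O for N-valeryl histamine is derived here (histamine C5H9N3 plus valeric acid C5H10O2 minus H2O) rather than quoted from the source.

```python
def dbe(formula):
    """Double-bond equivalent: C - (H + halogens)/2 + (N + P)/2 + 1.

    formula maps element symbols to counts; divalent elements such as O and S
    do not change the value.
    """
    halogens = sum(formula.get(x, 0) for x in ("F", "Cl", "Br", "I"))
    trivalent = formula.get("N", 0) + formula.get("P", 0)
    return formula.get("C", 0) - (formula.get("H", 0) + halogens) / 2 + trivalent / 2 + 1

def h_to_c(formula):
    """Hydrogen-to-carbon ratio (H/C); undefined for carbon-free formulae."""
    carbons = formula.get("C", 0)
    if carbons == 0:
        raise ValueError("H/C ratio is undefined without carbon")
    return formula.get("H", 0) / carbons

# Formula derived for N-valeryl histamine (an assumption, see the text above).
n_valeryl_histamine = {"C": 10, "H": 17, "N": 3, "O": 1}
print(dbe(n_valeryl_histamine), h_to_c(n_valeryl_histamine))
```

A DBE of 4 is consistent with an imidazole ring (one ring and two double bonds) plus one amide carbonyl.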
Supplementary information
Supplementary Information
Supplementary Figs. 1–11 and Notes 1–18.
Supplementary Tables
Supplementary Tables 1–22. Captions are provided within each tab.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xing, S., Shen, S., Xu, B. et al. BUDDY: molecular formula discovery via bottom-up MS/MS interrogation. Nat Methods 20, 881–890 (2023). https://doi.org/10.1038/s41592-023-01850-x
DOI: https://doi.org/10.1038/s41592-023-01850-x