Main

Microorganisms drive the global carbon cycle1 and can establish symbiotic relationships with host organisms, influencing their health, aging and behaviour2,3,4,5,6. Microbial populations interact with different ecosystems through the alteration of available metabolite pools and the production of specialized small molecules7,8. The vast genetic potential of these communities is exemplified by human-associated microorganisms, which encode ~100 times more genes than the human genome9,10. However, this metabolic potential remains unreflected in modern untargeted metabolomics experiments, where typically <1% of the annotated molecules can be classified as microbial. This problem particularly affects mass spectrometry (MS)-based untargeted metabolomics, a common technique to investigate molecules produced or modified by microorganisms11, which famously struggles with spectral annotation of complex biological samples. This is because most spectral reference libraries are biased towards commercially available or otherwise accessible standards of primary metabolites, drugs or industrial chemicals. Even when metabolites are annotated, extensive literature searches are required to understand whether these molecules have microbial origins and to identify the respective microbial producers. Public databases, such as KEGG12, MiMeDB13, NPAtlas14 and LOTUS15, can assist in this interpretation, but they are mostly limited to well-established, largely genome-inferred metabolic models or to fully characterized and published molecular structures. In addition, while targeted metabolomics efforts aimed at interrogating the gut microbiome mechanistically have been developed16, these focus only on relatively few commercially available microbial molecules. Hence, the majority of the microbial chemical space remains unknown despite the continuous expansion of MS reference libraries. 
To fill this gap, we have developed microbeMASST (https://masst.gnps2.org/microbemasst/), a search tool that leverages public MS repository data to identify the microbial origin of known and unknown metabolites and map them to their microbial producers.

microbeMASST is a community-sourced tool that works within the GNPS ecosystem17. Users can search tandem MS (MS/MS) spectra obtained from their experiments against the GNPS/MassIVE repository and retrieve matching samples exclusively acquired from extracts of bacterial, fungal or archaeal monocultures. No other available resource or tool allows linking uncharacterized MS/MS spectra to characterized microorganisms. The microbeMASST reference database of microbial monocultures has been generated through years of community contributions and metadata curation, and it contains microorganisms isolated from plants, soils, oceans, lakes, fish, terrestrial animals and humans (Fig. 1a). All available microorganisms have been categorized according to the NCBI taxonomy18 at different taxonomic resolutions (that is, species, genus, family and so on) or mapped to the closest taxonomically accurate level, if no NCBI ID was available at the time of database creation. As of September 2023, microbeMASST includes 60,781 liquid chromatography (LC)–MS/MS files comprising >100 million MS/MS spectra mapped to 541 strains, 1,336 species, 539 genera, 264 families, 109 orders, 41 classes and 16 phyla from the three domains of life: Bacteria, Archaea and Eukaryota (Fig. 1b). Unlike MASST19, which uses a precomputed network of ~110 million MS/MS spectra to enable spectral searching, microbeMASST is based on the recently introduced Fast Search Tool (https://fasst.gnps2.org/fastsearch/)20. This tool, originally designed for proteomics, improves search speed by several orders of magnitude by indexing all the MS/MS spectra present in GNPS/MassIVE and restricting the search space to the user input parameters. Because of this, search results are returned within seconds, as opposed to 20 min per search or 24–48 h for modification-tolerant searches in the original implementation of MASST. 
In addition, microbeMASST leverages pre-curated file-associated metadata to aggregate results into easy-to-interpret taxonomic trees. This represents a major enhancement over MASST, where users have to manually inspect results tables and contextualize them, making interpretations tedious. Finally, users can leverage microbeMASST Python code to perform batch searches of thousands of MS/MS spectra by providing either a formatted MS/MS file (.mgf) or a list of Universal Spectrum Identifiers (USIs)21, which represent paths to spectra in public datasets22. This is particularly useful for creating integrated data analysis pipelines using the standard outputs (.mgf) of already established data processing tools, such as MZmine23.
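As an illustration of the batch input described above, the following minimal sketch reads spectra from MGF-formatted text using only the Python standard library. It is not the microbeMASST batch code; the function name `read_mgf` and the example spectrum (precursor mass and peaks) are hypothetical and serve only to show the general layout of an .mgf entry.

```python
from typing import Iterator


def read_mgf(text: str) -> Iterator[dict]:
    """Yield one dict per BEGIN IONS / END IONS block of MGF-formatted text."""
    spectrum = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None:
            if "=" in line:
                # Header line such as PEPMASS=405.26
                key, value = line.split("=", 1)
                spectrum["params"][key.upper()] = value
            else:
                # Peak line: m/z and intensity (extra columns are ignored)
                mz, intensity = line.split()[:2]
                spectrum["peaks"].append((float(mz), float(intensity)))


# Hypothetical single-spectrum example, not taken from any dataset.
example = """BEGIN IONS
PEPMASS=405.26
CHARGE=1+
85.03 120.0
199.15 980.5
END IONS
"""
spectra = list(read_mgf(example))
```

Each parsed spectrum carries its header parameters and a peak list, which is the minimum information a spectral search needs (precursor mass plus fragment ions and their intensities).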

Fig. 1: The microbeMASST search tool and reference database.

a, Community contributions of data and knowledge to GNPS17, ReDU57 and MassIVE from 2014 to 2022 were used to generate the microbeMASST reference database. In addition, a public invitation to deposit data in June 2022 resulted in the further deposition of LC–MS/MS files from 25 different laboratories from 15 different countries across the globe, leading to the curation of a total of 60,781 LC–MS/MS files of microbial monoculture extracts. b, microbeMASST comprises 1,858 unique lineages across three different domains of life mapped to 541 unique strains, 1,336 species, 539 genera, 264 families, 109 orders, 41 classes and 16 phyla. c, Examples of medically relevant small molecules known to be produced by bacteria or fungi. Lovastatin, a cholesterol-lowering drug originally isolated from Aspergillus genus25; salinosporamide A, a Phase III candidate to treat glioblastoma produced by Salinispora tropica27; and commendamide, a human G-protein-coupled receptor agonist28. d, microbeMASST search outputs of the three different molecules of interest confirm that they were exclusively found in monocultures of the only known producers. Pie charts display the proportion of MS/MS matches found in the deposited reference database. Blue indicates a match with a monoculture, while yellow represents a non-match. Searches were performed using MS/MS spectra deposited in the GNPS reference library: lovastatin (CCMSLIB00005435737), salinosporamide A (CCMSLIB00010013003) and commendamide (CCMSLIB00004679239). GNPS, ReDU and microbeMASST logos reproduced under a Creative Commons license CC BY 4.0.

In the microbeMASST web app (https://masst.gnps2.org/microbemasst/), users can search single MS/MS spectra and obtain matching results from the reference database of microbeMASST, providing either a USI or a precursor ion mass and its spectral fragmentation pattern (Supplementary Fig. 1). Analogue search can also be enabled to discover molecules related to the MS/MS spectrum of interest across the taxonomic tree17,19,24. The microbeMASST web app displays query results in interactive taxonomic trees, which can be downloaded as HTML files. Nodes in the trees represent specific taxa and display rich information, such as taxon scientific name, NCBI taxonomic ID, number of deposited sample data files, number of sample data files containing a match to the queried spectrum, within the user search criteria, and a proportion of the number of sample data files matching the queried spectrum to the number of total available sample data files for that specific taxon in the reference database of microbeMASST. This proportion is also visualized through pie charts. Information for an MS/MS match in a particular taxon is propagated upstream through its lineage. The reactive interface of microbeMASST enables filtering of the tree to specific taxonomic levels or to a minimum number of matches observed per taxon. In addition, three data tables are generated, linking the search job to other resources in the GNPS/MassIVE ecosystem. For example, each MS/MS query is also searched against the public MS/MS reference library of GNPS (587,213 MS/MS spectra, September 2023) to provide spectra annotations when available. The annotations to reference compounds are listed under the ‘Library matches’ tab (Supplementary Fig. 2a). The ‘Datasets matches’ tab contains information on the matching scans, displaying scientific name, NCBI taxonomic ID and taxonomic rank, number of matching fragment ions and modified cosine score together with a link to a mirror plot visualization (Supplementary Fig. 2b). 
Finally, the ‘Taxa matches’ tab informs on how many matches were found per taxon and the number of samples available for that taxon (Supplementary Fig. 2c). Quality controls (QCs) and blank samples (n = 2,902) present in the reference datasets of microbeMASST have been retained to provide information on possible contaminants and media components. In addition, data from human cell line cultures (n = 1,199) have been included to enable assessment of whether molecules can be produced by both human hosts and microorganisms. It is important to point out that microbeMASST allows linking of both partly annotated, through MS/MS match to reference library spectra, and fully uncharacterized spectra to possible microbial producers but that technical limitations inherent to mass spectrometry or the experiment itself are present. For example, the absence of a matching spectrum in a specific taxon does not necessarily indicate that it is not capable of producing the searched molecule but rather that the methodology used to acquire the data did not allow its detection. These and other limitations are described in Methods. Despite these limitations, microbeMASST can uniquely enable the discovery of links between uncharacterized MS/MS spectra and defined microorganisms, providing valuable information for future mechanistic studies.
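The per-taxon proportions and the upstream propagation of matches through each lineage, as described above, can be sketched as follows. This is a simplified illustration under stated assumptions, not the microbeMASST implementation: each file is represented by a root-to-leaf list of taxon names plus a match flag, and the function name `aggregate_matches` and the toy data are hypothetical.

```python
from collections import defaultdict


def aggregate_matches(files):
    """Count files and matching files for every taxon along each lineage.

    files: iterable of (lineage, is_match), where lineage is a root-to-leaf
    list of taxon names. Returns {taxon: (n_matched, n_total, proportion)}.
    """
    total = defaultdict(int)
    matched = defaultdict(int)
    for lineage, is_match in files:
        for taxon in lineage:  # a match at a leaf counts for all its ancestors
            total[taxon] += 1
            if is_match:
                matched[taxon] += 1
    return {t: (matched[t], total[t], matched[t] / total[t]) for t in total}


# Toy data: four files, one of which matches the queried spectrum.
files = [
    (["Bacteria", "Bacteroides"], True),
    (["Bacteria", "Bacteroides"], False),
    (["Bacteria", "Escherichia"], False),
    (["Fungi", "Aspergillus"], False),
]
stats = aggregate_matches(files)
```

In this toy case the single match in Bacteroides (1 of 2 files) is also counted at the Bacteria node (1 of 3 files), which is the quantity a pie chart on that node would display.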

Search results for lovastatin, salinosporamide A and commendamide MS/MS spectra highlight how microbeMASST can correctly connect microbial molecules to their known producers (Fig. 1c). In the case of lovastatin, a clinically used cholesterol-lowering drug originally isolated from Aspergillus terreus25, spectral matches were unique to the genus Aspergillus (Fig. 1d). The MS/MS spectrum for salinosporamide A, a Phase III candidate to treat glioblastoma26, only matched two strains of Salinispora tropica (Fig. 1d), the only known producer27. Commendamide, first observed in cultures of Bacteroides vulgatus (recently reclassified as Phocaeicola vulgatus), is a G-protein-coupled receptor agonist28. Surprisingly it had many matches to several bacterial cultures, including in Flavobacteriaceae (Algibacter, Lutibacter, Maribacter, Polaribacter, Postechiella and Winogradskyella) and Bacteroides cultures (Fig. 1d). Additional examples include searches of mevastatin, arylomycin A4, yersiniabactin, promicroferrioxamine, and the microbial bile acid conjugates29,30,31 glutamate-cholic acid (Glu-CA) and glutamate-deoxycholic acid (Glu-DCA) (Supplementary Fig. 3). Mevastatin, another cholesterol-lowering drug originally isolated from Penicillium citrinum32, was only found in samples classified as fungi. The antibiotic arylomycin A4 was observed in different Streptomyces species, and it was originally isolated from Streptomyces sp. Tue 6075 in 200233. Yersiniabactin, a siderophore originally isolated from Yersinia pestis34 whose monoculture is not yet present in the reference database of microbeMASST, was observed in Escherichia coli and Klebsiella species, consistent with previous observations35,36. Promicroferrioxamine, another siderophore, was observed to match Micromonospora chokoriensis and Streptomyces species. This molecule was originally isolated from an uncharacterized Promicromonosporaceae isolate37. 
The MS/MS spectrum of the gut microbiota-derived Glu-CA, an amidated tri-hydroxylated bile acid, was most frequently observed in cultures of Bifidobacterium species, while Glu-DCA was found only in one Bifidobacterium strain but also in two Enterococcus and Clostridium species. None of the molecules were found in cultured human cell lines, highlighting the ability of microbeMASST to distinguish MS/MS spectra of molecules that can be exclusively produced by either bacteria or fungi. It is important to acknowledge that MS/MS data generally do not differentiate stereoisomers, but they can nevertheless provide crucial information on molecular families.

microbeMASST can also be used to extract microbial information from mass spectrometry-based metabolomics studies without any a priori knowledge. To illustrate this, we reprocessed an untargeted metabolomics study with data acquired from 29 different organs and biofluids, including brain, heart, liver, blood and stool, of germ-free (GF) mice and of mice harbouring microbial communities, known as specific pathogen-free (SPF) mice30 (Fig. 2a). We extracted 10,047 consensus MS/MS spectra uniquely present in SPF mice and queried them with microbeMASST. A total of 3,262 MS/MS spectra were found to have a microbial match to the microbeMASST reference database. Of these, 837 were also found in human cell lines and for this reason were removed from further analysis. Among the remaining 2,425 MS/MS spectra, 1,673 were exclusively found in bacteria, 95 in fungi and 657 in both (Supplementary Fig. 4). These MS/MS spectra were then processed with SIRIUS38 and CANOPUS39 to tentatively annotate the metabolites and identify their chemical classes. A file containing all these spectra of interest can be explored and downloaded in .mgf format from GNPS (see Methods). To further validate the microbial origin of these MS/MS spectra, we assessed their overlap with data acquired from a different study comparing SPF mice treated with a cocktail of antibiotics to untreated controls40. Interestingly, 621 MS/MS spectra were also found in this second dataset and 512 were only present in animals not treated with antibiotics (Fig. 2b). The distribution of these spectra and their putative classes across bacterial phyla was visualized using an UpSet plot41 (Fig. 2c). Notably, most of the spectra classified as terpenoids were commonly observed across phyla, while amino acids and peptides appeared to be more phylum specific. 
Of these 512 spectra, 23% had a level 2 putative annotation according to the 2007 Metabolomics Standards Initiative42, matching the GNPS reference library (Supplementary Table 1). A level 2 annotation within the user-specified search criteria might result in MS/MS matches between molecules belonging to related families as opposed to unique molecules. Annotations included the recently described amidated microbial bile acids19,29,30,31,43,44,45,46,47,48, free bile acids originating from the hydrolysis of host-derived taurine bile acid conjugates49, keto bile acids formed via microbial oxidation of alcohols30, N-acyl-lipids belonging to a similar class of metabolites as commendamide28 (a microbial N-acyl lipid), di- and tri- peptides seen in microbial digestion of proteins50, and soyasapogenol, a by-product of the microbial digestion of complex saccharides from dietary soyasaponins30. Part of the remaining unannotated spectra can be identified as chemical modifications of the above annotated microbial metabolites through spectral similarity obtained from molecular networking (Supplementary Fig. 5). This list of annotated MS/MS spectra included metabolites that are not yet widely considered to be of microbial origins, such as the di- and tri-hydroxylated bile acids and the glycine-conjugated bile acids43. One interpretation of these findings is that microorganisms are capable of producing metabolites previously described to be only of mammalian origins. Notable examples of metabolites that have been established to be produced by both the mammalian host and bacteria include serotonin51, γ-aminobutyric acid (GABA)52 and most recently, glycocholic acid43,53,54,55. In addition, an alternative hypothesis is that microorganisms can also selectively stimulate the production of host metabolites. Other limitations regarding annotations are discussed in Methods. 
To assess whether the observations from the mouse models translate to humans, we searched and found that 455 out of the 512 MS/MS spectra of interest matched public human data (Fig. 2d). Interestingly, these spectra were found in both healthy individuals and individuals affected by different diseases, including type II diabetes, inflammatory bowel disease, Alzheimer’s disease and other conditions. These spectra were most commonly found in stool samples (n = 110,973 MS/MS matches), followed by blood, breast milk and the oral cavity, as well as other organs including the brain, skin, vagina and biofluids (for example, cerebrospinal fluid and urine) (Fig. 2e). These findings support the concept that a substantial number of microbial metabolites reach and influence distant organs in the human body56.

Fig. 2: microbeMASST can identify microbial MS/MS spectra within mouse and human datasets.

a, Workflow to extract microbial MS/MS spectra from biochemical profiles of 29 different tissues and biofluids of SPF mice that are not observed in GF mice30. The MS/MS spectra unique to SPF mice (10,047) were searched with microbeMASST. A total of 3,262 MS/MS spectra had a match; those MS/MS spectra also matching human cell lines were removed, leaving a total of 2,425 putative microbial MS/MS spectra (see Methods to download .mgf file). b, The presence of the 2,425 MS/MS spectra was evaluated in an additional animal study looking at antibiotic usage40. A total of 512 MS/MS spectra, out of the 621 overlapping, were exclusively found in animals not receiving antibiotics. c, UpSet plot of the distribution of the detected MS/MS spectra (512) across bacterial phyla. Terpenoids were more commonly observed across phyla, while amino acids and peptides appeared to be more phylum specific. d, The 512 MS/MS spectra were searched in human datasets and 455 were found to have a match. These MS/MS spectra were present in both healthy individuals and individuals affected by different diseases. e, Most of the MS/MS spectra (n = 411) matched faecal samples (n = 110,973 matches), followed by blood, oral cavity, breast milk, urine and several other organs. CSF, cerebrospinal fluid; COVID-19, coronavirus disease 2019; HIV, human immunodeficiency virus; PBI, primary bacterial infectious disease; SD, sleep disorder; AD, Alzheimer’s disease; IS, ischaemic stroke; KD, Kawasaki disease; IBD, inflammatory bowel disease; T2D, type II diabetes. GNPS and microbeMASST logos reproduced under a Creative Commons license CC BY 4.0; SIRIUS logo reproduced under a Creative Commons license CC BY 4.0-ND.

We anticipate that microbeMASST will be a key resource to enhance understanding of the role of microbial metabolites across a wide range of ecosystems, including oceans, plants, soils, insects, animals and humans. This expanding resource will enable the scientific community to gain valuable taxonomic and functional insights into diverse microbial populations. The mass spectrometry community will play a key role in the evolution of this tool in the future through the continued deposition of data associated with microbial monocultures and the expansion of spectral reference libraries. Moreover, microbeMASST holds potential for various applications ranging from aquaculture and agriculture to biotechnology and the study of microbe-mediated human health conditions. By harnessing the power of public data, we can unlock opportunities for advancements in multiple fields and deepen our understanding of the intricate relationships between microorganisms and their ecosystems.

Methods

Data collection and harmonization

Data deposited in GNPS/MassIVE were investigated manually and systematically using ReDU57 (https://redu.ucsd.edu/) to extract all publicly available MS/MS files (.mzML or .mzXML formats) acquired from monocultures of bacteria, fungi, archaea and human cell lines. Only monocultures were included in the reference database of this search tool to unequivocally associate the production of the detected metabolites to each specific taxon. A total of 60,781 files from 537 different GNPS/MassIVE datasets were selected to be used as the reference database of microbeMASST (Supplementary Table 2). These include files deposited in response to our call to the scientific community. Between May and July 2022, 25 different research groups deposited 65 distinct datasets in GNPS/MassIVE, comprising a total of 3,142 unique LC–MS/MS files. This represented a 5.45% increase in publicly available MS/MS data acquired from monocultures in just 2 months. To qualify as a contributor and be credited as one of the authors, researchers had to deposit high-resolution LC–MS/MS data acquired in either positive or negative ionization mode from monocultures of bacteria, fungi or archaea. Harmonization of the acquired data and metadata represented a challenge. The NCBI taxonomic database is constantly expanding and evolving, and the latest ReDU update (December 2021) does not accommodate the most recently deposited taxa. For this reason, an additional metadata file (microbeMASST_metadata_massiveID) was generated specifically for the microbeMASST project and uploaded to the respective GNPS/MassIVE datasets deposited by the collaborators if the ReDU workflow failed. 
All the collected information was finally aggregated in a single .csv file (microbe_masst_table.csv) that can be found on GitHub, which contains: (1) full MassIVE path of each sample, (2) file name of each sample reported as its MassIVE ID/file name to avoid the presence of duplicated names, (3) MassIVE ID, (4) taxonomic name of the isolate as reported by the author submitting the associated metadata, (5) alternative taxonomic name if the provided taxonomic name was incorrect or not present in NCBI, (6) NCBI ID associated with the taxonomic name or the alternative taxonomic name, when present, (7) whether the taxonomic ID was automatically assigned or manually curated, (8) whether ReDU metadata are available for that specific file and whether the file corresponds to a (9) blank or (10) QC rather than a unique biological sample.

Unique taxonomic names and NCBI IDs were extracted from the metadata associated with the selected samples. When metadata were not available and multiple species of bacteria or fungi were present in the same dataset, samples were generically classified as bacteria or fungi. Concordance between taxonomic names and NCBI IDs was checked by blasting taxonomic names against NCBI (https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi) to obtain respective NCBI IDs and updated taxonomic names. Results were manually investigated and missing IDs were recovered using the NCBI browser (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi). If the taxonomic name was not found in NCBI, most probably because it was not deposited yet, the NCBI ID of the closest taxon was retrieved and used. For example, the strain Staphylococcus aureus CM05 was unavailable in NCBI and was curated to the species Staphylococcus aureus instead.

Taxonomic tree generation

The microbeMASST taxonomic tree was generated using both R 4.2.2 and Python 3.10. In R, the microbeMASST table was filtered and only unique NCBI IDs were retained (n = 1,834). The classification function of the ‘taxize’ package (v.0.9.100) was used to retrieve the full lineage of each NCBI ID58. Main taxonomic ranks (kingdom to strain) plus subgenus, subspecies and varieties were kept to obtain taxonomic trees with a similar number of nodes per lineage. The list of NCBI IDs of all lineages was then imported to Python, where the ETE3 toolkit was used to generate a taxonomic tree on the basis of the provided NCBI IDs59. The generated Newick tree was then converted into JSON format and information such as taxonomic rank and number of available samples per taxon was added. In addition, children nodes for blanks and QCs were created to be visualized in the same tree.
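The conversion of lineages into a JSON tree annotated with rank and sample counts can be illustrated with the sketch below. It is a simplified stand-in that uses only the Python standard library rather than taxize or ETE3, and it is not the code used to build the microbeMASST tree. The function name `build_tree`, the dictionary keys, the abbreviated lineages, the rank labels and the sample counts are all illustrative assumptions.

```python
import json


def build_tree(records):
    """Merge root-to-leaf lineages into one nested dict with sample counts.

    records: iterable of (lineage, n_samples), where lineage is a list of
    (ncbi_id, rank, name) tuples ordered from the highest rank to the leaf.
    """
    root = {"name": "root", "rank": "no rank", "ncbi_id": 1,
            "n_samples": 0, "children": []}
    for lineage, n_samples in records:
        node = root
        node["n_samples"] += n_samples
        for ncbi_id, rank, name in lineage:
            # Reuse the child node if this taxon was already added.
            child = next((c for c in node["children"]
                          if c["ncbi_id"] == ncbi_id), None)
            if child is None:
                child = {"name": name, "rank": rank, "ncbi_id": ncbi_id,
                         "n_samples": 0, "children": []}
                node["children"].append(child)
            child["n_samples"] += n_samples
            node = child
    return root


# Abbreviated toy lineages; the sample counts are invented for illustration.
records = [
    ([(2, "domain", "Bacteria"), (561, "genus", "Escherichia"),
      (562, "species", "Escherichia coli")], 3),
    ([(2, "domain", "Bacteria"), (1279, "genus", "Staphylococcus"),
      (1280, "species", "Staphylococcus aureus")], 2),
]
tree = build_tree(records)
tree_json = json.dumps(tree, indent=2)
```

Storing cumulative sample counts on every node means that a later search only needs to attach match counts to the same structure before it is filtered and rendered.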

MASST query

The microbeMASST web application was built using Dash and Flask open-source libraries for Python (https://github.com/mwang87/GNPS_MASST/blob/master/dash_microbemasst.py). The web app can receive as inputs either a USI or an MS/MS spectrum (fragment ions and their intensities). In addition, batch searches can be performed using a customizable Python script that can read either a .tsv file containing a list of USIs or a single .mgf file (https://github.com/robinschmid/microbe_masst). In this manuscript, we showcase a batch search of more than 10,000 MS/MS spectra contained in a single .mgf file (~2 h run time). After receiving input information, microbeMASST leverages the Fast Search Tool (https://fasst.gnps2.org/fastsearch/) API and the sample-specific associated metadata to generate taxonomic trees. Fast searches are based on indexing all the MS/MS spectra present in GNPS/MassIVE according to the mass and intensity of their precursor ions and then restricting the search to only a set of relevant spectra that have a greater chance of achieving a high spectral similarity (modified cosine score) to the MS/MS of interest. Searches are customizable and default settings are the following: precursor and fragment ion mass tolerances, 0.05; minimum cosine score threshold, 0.7; minimum number of matching fragment ions, 3; and analogue search, False. Users can modify these parameters on the basis of their data and research questions. Once matches are obtained, it is good practice to inspect the associated mirror plots for confirmation. To create the final taxonomic tree, the JSON file of the complete microbeMASST taxonomic tree is filtered according to the results and converted into a D3 JavaScript object that can be visualized as an HTML file.
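To make the default thresholds above concrete, the following sketch scores two spectra with a simplified modified cosine and then applies the minimum score (0.7) and minimum matched fragment ions (3). It is an assumption-laden illustration, not the Fast Search Tool or GNPS implementation: peaks are paired greedily, intensities are used without rescaling, and the 0.05 tolerance is interpreted as an absolute m/z difference. The function names and the toy spectra are hypothetical.

```python
import math


def modified_cosine(peaks_a, precursor_a, peaks_b, precursor_b, tol=0.05):
    """Return (score, n_matched) for two peak lists of (m/z, intensity).

    A peak pair is a candidate if the m/z values agree within tol either
    directly or after shifting by the precursor mass difference.
    """
    shift = precursor_a - precursor_b
    candidates = []
    for i, (mz_a, int_a) in enumerate(peaks_a):
        for j, (mz_b, int_b) in enumerate(peaks_b):
            diff = mz_a - mz_b
            if abs(diff) <= tol or abs(diff - shift) <= tol:
                candidates.append((int_a * int_b, i, j))
    # Greedy one-to-one assignment, largest intensity product first.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    dot, n_matched = 0.0, 0
    for product, i, j in candidates:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        dot += product
        n_matched += 1
    norm_a = math.sqrt(sum(x * x for _, x in peaks_a))
    norm_b = math.sqrt(sum(x * x for _, x in peaks_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0, 0
    return dot / (norm_a * norm_b), n_matched


def passes_defaults(score, n_matched, min_cosine=0.7, min_matched=3):
    """Apply the default score and matched-fragment thresholds."""
    return score >= min_cosine and n_matched >= min_matched


# Toy spectra: the second is the first with two fragments shifted by 14.0.
query = [(100.0, 10.0), (150.0, 50.0), (200.0, 100.0)]
analogue = [(100.0, 10.0), (136.0, 50.0), (186.0, 100.0)]
score, n_matched = modified_cosine(query, 300.0, analogue, 286.0)
accepted = passes_defaults(score, n_matched)
```

With analogue search disabled, a search would additionally require the two precursor masses to agree within the precursor tolerance, so the shifted pairing in this toy example is relevant only when analogue search is enabled.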

Applications

We envision microbeMASST to have several applications. First, we showcase how researchers can investigate single MS/MS spectra using the web interface and obtain matching results if their known or unknown MS/MS spectrum was previously observed in any of the microbial monocultures present in the microbeMASST database. Nine small molecules of interest were investigated using MS/MS spectra already deposited in the GNPS reference library (see ‘Data availability’ and ‘Code availability’). Second, we show how microbeMASST can be leveraged to mine for known or unknown microbial metabolites in MS studies. To test this hypothesis, we reprocessed an LC–MS/MS dataset acquired from 29 different organs and biofluids of GF and SPF mice30. A comprehensive molecular network was generated (https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=893fd89b52dc4c07a292485404f97723). From the obtained job, the qiime2 artefact (qiime2_table.qza), the .mgf file (METABOLOMICS-SNETS-V2-893fd89b-download_clustered_spectra-main.mgf) containing all the captured MS/MS spectra, and the annotation table (METABOLOMICS-SNETS-V2-893fd89b-view_all_annotations_DB-main.tsv) were extracted. The .qza file was first converted into a .biom file and then a .tsv file using QIIME2 (ref. 60) to extract the feature table. This was then imported to R where only spectra present in tissues and biofluids of SPF animals were retained (n = 10,047). To add an extra layer of filtering, all MS/MS spectra that had an edge (cosine similarity >0.7) and a delta parent ion mass of ±0.02 Da with MS/MS spectra present in GF animals were removed. Spectral pairs information was contained in a networkedges_selfloop file. All the MS/MS spectra were then run in batch using a custom Python script of microbeMASST (processing time: ~2 h, Apple M2 Max, 64 GB RAM) to obtain microbial matches. 
Matching and filtered MS/MS spectra (n = 2,425) were aggregated into a single .mgf file that can be downloaded from GNPS (https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=aecd30b9febd43bd8f57b88598a05553). The compound class of each MS/MS spectrum with parent ion mass <850 Da was predicted using SIRIUS38 and CANOPUS39. The 2,425 MS/MS spectra were then searched against the MSV000080918 dataset containing mice treated or not with antibiotics40. Matching and filtered MS/MS spectra (n = 512) were aggregated into a single .mgf file that can be downloaded from GNPS (https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=c33855fc32c948049331e9730189d5c1). A list of the spectra with putative chemical class classification is available in Supplementary Table 1. Venn diagrams and UpSet plots were generated in R using VennDiagram 1.7.3, UpSetR 1.4.0 and ComplexUpset 1.3.3. Finally, the 512 MS/MS spectra were searched in batch against the GNPS public repository to observe whether they were detected in human datasets (Supplementary Table 3). ReDU metadata information associated with the human datasets was then used to observe the distribution of the MS/MS spectra across different diseases and body parts.
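The extra filtering step described above, in which SPF-only spectra connected to GF spectra by a network edge are discarded, can be sketched as follows. This is a minimal illustration and not the R code used in the study: the edge tuple layout, the function name `remove_gf_linked` and the toy identifiers are assumptions, while the thresholds (cosine >0.7 and a parent ion mass difference within ±0.02 Da) follow the text.

```python
def remove_gf_linked(spf_only_ids, gf_ids, edges,
                     min_cosine=0.7, max_delta_mz=0.02):
    """Drop SPF-only spectra that pair closely with any GF spectrum.

    edges: iterable of (id_a, id_b, cosine, delta_mz) spectral pairs.
    """
    gf_ids = set(gf_ids)
    linked = set()
    for id_a, id_b, cosine, delta_mz in edges:
        if cosine > min_cosine and abs(delta_mz) <= max_delta_mz:
            if id_a in gf_ids:
                linked.add(id_b)
            if id_b in gf_ids:
                linked.add(id_a)
    return [s for s in spf_only_ids if s not in linked]


# Toy network: spectra 1-4 are SPF-only, spectra 10 and 11 occur in GF mice.
spf_only = [1, 2, 3, 4]
gf = {10, 11}
edges = [
    (1, 10, 0.95, 0.001),   # close pair with a GF spectrum: spectrum 1 removed
    (2, 11, 0.65, 0.000),   # cosine too low: spectrum 2 kept
    (3, 10, 0.90, 14.016),  # mass difference too large: spectrum 3 kept
    (4, 2, 0.99, 0.000),    # pair between two SPF-only spectra: both kept
]
kept = remove_gf_linked(spf_only, gf, edges)
```

Only spectrum 1 is removed in this toy case, because it is the only SPF-only spectrum that is both highly similar to, and nearly identical in parent ion mass to, a spectrum seen in GF animals.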

Technical limitations

Analysis of the results should be considered with the following limitations in mind. Molecule detection in microbeMASST is dependent on the availability of specific substrates in the reference monocultures. If the culture lacks the necessary substrates (or any other culture condition) to produce a certain molecule, this molecule will not be detected. Nevertheless, if related substrates are present, then a different but related molecule may be produced instead, which can be detected using the analogue search function. To address this problem, it is crucial for the community to continue to gather data from as many diverse experimental conditions as possible to capture the full range of metabolites produced by different microorganisms. This will help in building the most comprehensive reference database that encompasses diverse microbial metabolic profiles. In addition, isomers and stereoisomers, which are molecules with the same molecular formula but different structural arrangements, often exhibit similar MS/MS spectra. This means that their fragmentation patterns may not provide enough information to distinguish them. Finally, differences in extraction conditions and instrument settings can lead to variations in the obtained MS/MS spectra. For example, the intensity of precursor ions used for fragmentation can impact the resulting spectra. If the precursor ion intensity is low, the fragmented spectrum may lack ions that are present in spectra obtained from high-intensity precursor ions. This may result in ‘data leakage’ as the MS/MS spectrum may be missing ions, leading to the two molecules not being recognized as the same molecule. To partially overcome this, more permissive settings can be created. The majority of the data deposited in public repositories, GNPS included, and used in microbeMASST were acquired using positive electrospray ionization mode, which limits the observation of molecules that cannot be ionized in positive mode. 
This means that certain metabolites may be underrepresented or not detected at all. The continuous curation of the microbeMASST reference database involves adding more diverse data in terms of ionization modes to improve the coverage of metabolites. The taxonomic tree was generated using associated NCBI IDs provided by the community. Specimen assignment before metabolomic analysis cannot be checked by microbeMASST. The majority of the deposited data do not have associated genetic information and even if available, it was not used to build the taxonomic tree. Thus, specimen misidentification cannot be excluded. By addressing these challenges and continuously curating the reference database with more comprehensive and diverse data, microbeMASST coverage can be expanded to provide valuable insights into the role of microbiota and to facilitate our understanding of microbial metabolism in diverse ecosystems.

Statistics and reproducibility

No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.