microbeMASST: a taxonomically informed mass spectrometry search tool for microbial metabolomics data

microbeMASST, a taxonomically informed mass spectrometry (MS) search tool, tackles limited microbial metabolite annotation in untargeted metabolomics experiments. Leveraging a curated database of >60,000 microbial monocultures, users can search known and unknown MS/MS spectra and link them to their respective microbial producers via MS/MS fragmentation patterns. Identification of microbe-derived metabolites and relative producers without a priori knowledge will vastly enhance the understanding of microorganisms’ role in ecology and human health.

public MS repository data to identify the microbial origin of known and unknown metabolites and map them to their microbial producers.
microbeMASST is a community-sourced tool that works within the GNPS ecosystem 17. Users can search tandem MS (MS/MS) spectra obtained from their experiments against the GNPS/MassIVE repository and retrieve matching samples exclusively acquired from extracts of bacterial, fungal or archaeal monocultures. No other available resource or tool allows linking uncharacterized MS/MS spectra to characterized microorganisms. The microbeMASST reference database of microbial monocultures has been generated through years of community contributions and metadata curation, and it contains microorganisms isolated from plants, soils, oceans, lakes, fish, terrestrial animals and humans (Fig. 1a). All available microorganisms have been categorized according to the NCBI taxonomy 18 at different taxonomic resolutions (that is, species, genus, family and so on) or mapped to the closest taxonomically accurate level, if no NCBI ID was available at the time of database creation. As of September 2023, microbeMASST includes 60,781 liquid chromatography (LC)-MS/MS files comprising >100 million MS/MS spectra mapped to 541 strains, 1,336 species, 539 genera, 264 families, 109 orders, 41 classes and 16 phyla from the three domains of life: Bacteria, Archaea and Eukaryota (Fig. 1b). Different from MASST 19, which uses a precomputed network of ~110 million MS/MS spectra to enable spectral searching, microbeMASST is based on the recently introduced Fast Search Tool (https://fasst.gnps2.org/fastsearch/) 20. This tool, originally designed for proteomics, drastically improves search speed by several orders of magnitude by indexing all the MS/MS spectra present in GNPS/MassIVE and restricting the search space to the user input parameters. Because of this, search results are returned within seconds, as opposed to 20 min per search or 24-48 h for modification tolerant searches in the original implementation of MASST. In addition, microbeMASST leverages pre-curated file-associated metadata to aggregate results into easy-to-interpret taxonomic trees. This represents a major enhancement over MASST, where users have to manually inspect results tables and contextualize them, making interpretations tedious. Finally, users can leverage microbeMASST Python code to perform batch searches of thousands of MS/MS spectra by providing either a formatted MS/MS file (.mgf) or a list of Universal Spectrum Identifiers (USIs) 21, which represent paths to spectra in public datasets 22. This is particularly useful for creating integrated data analysis pipelines using the standard outputs (.mgf) of already established data processing tools, such as MZmine 23.
In the microbeMASST web app (https://masst.gnps2.org/microbemasst/), users can search single MS/MS spectra and obtain matching results from the reference database of microbeMASST, providing either a USI or a precursor ion mass and its spectral fragmentation pattern (Supplementary Fig. 1). Analogue search can also be enabled to discover molecules related to the MS/MS spectrum of interest across the taxonomic tree 17,19,24. The microbeMASST web app displays query results in interactive taxonomic trees, which can be downloaded as HTML files. Nodes in the trees represent specific taxa and display rich information, such as taxon scientific name, NCBI taxonomic ID, number of deposited sample data files, number of sample data files containing a match to the queried spectrum within the user search criteria, and the proportion of sample data files matching the queried spectrum relative to the total number of sample data files available for that specific taxon in the reference database of microbeMASST. This proportion is also visualized through pie charts. Information for an MS/MS match in a particular taxon is propagated upstream through its lineage. The reactive interface of microbeMASST enables filtering of the tree to specific taxonomic levels or to a minimum number of matches observed per taxon. In addition, three data tables are generated, linking the search job to other resources in the GNPS/MassIVE ecosystem. For example, each MS/MS query is also searched against the public MS/MS reference library of GNPS (587,213 MS/MS spectra, September 2023) to provide spectra annotations when available. The annotations to reference compounds are listed under the 'Library matches' tab (Supplementary Fig. 2a). The 'Datasets matches' tab contains information on the matching scans, displaying scientific name, NCBI taxonomic ID and taxonomic rank, number of matching fragment ions and modified cosine score together with a link to a mirror plot visualization (Supplementary Fig. 2b). Finally, the 'Taxa matches' tab informs on how many matches were found per taxon and the number of samples available for that taxon (Supplementary Fig. 2c). Quality controls (QCs) and blank samples (n = 2,902) present in the reference datasets of microbeMASST have been retained to provide information on possible contaminants and media components. In addition, data from human cell line cultures (n = 1,199) have been included to enable assessment of whether molecules can be produced by both human hosts and microorganisms. It is important to point out that microbeMASST allows linking of both partly annotated (through MS/MS match to reference library spectra) and fully uncharacterized spectra to possible microbial producers, but that technical limitations inherent to mass spectrometry or the experiment itself are present. For example, the absence of a matching spectrum in a specific taxon does not necessarily indicate that it is not capable of producing the searched molecule but rather that the methodology used to acquire the data did not allow its detection. These and other limitations are described in Methods. Despite these limitations, microbeMASST can uniquely enable the discovery of links between uncharacterized MS/MS spectra and defined microorganisms, providing valuable information for future mechanistic studies.
Search results for lovastatin, salinosporamide A and commendamide MS/MS spectra highlight how microbeMASST can correctly connect microbial molecules to their known producers (Fig. 1c). In the case of lovastatin, a clinically used cholesterol-lowering drug originally isolated from Aspergillus terreus 25, spectral matches were unique to the genus Aspergillus (Fig. 1d). The MS/MS spectrum for salinosporamide A, a Phase III candidate to treat glioblastoma 26, only matched two strains of Salinispora tropica (Fig. 1d), the only known producer 27. Commendamide, first observed in cultures of Bacteroides vulgatus (recently reclassified as Phocaeicola vulgatus), is a G-protein-coupled receptor agonist 28. Surprisingly, it had many matches to several bacterial cultures, including in Flavobacteriaceae (Algibacter, Lutibacter, Maribacter, Polaribacter, Postechiella and Winogradskyella) and Bacteroides cultures (Fig. 1d). Additional examples include searches of mevastatin, arylomycin A4, yersiniabactin, promicroferrioxamine, and the microbial bile acid conjugates 29-31 glutamate-cholic acid (Glu-CA) and glutamate-deoxycholic acid (Glu-DCA) (Supplementary Fig. 3). Mevastatin, another cholesterol-lowering drug originally isolated from Penicillium citrinum 32, was only found in samples classified as fungi. The antibiotic arylomycin A4 was observed in different Streptomyces species, and it was originally isolated from Streptomyces sp. Tue 6075 in 2002 33. Yersiniabactin, a siderophore originally isolated from Yersinia pestis 34 whose monoculture is not yet present in the reference database of microbeMASST, was observed in Escherichia coli and Klebsiella species, consistent with previous observations 35,36. Promicroferrioxamine, another siderophore, was observed to match Micromonospora chokoriensis and Streptomyces species. This molecule was originally isolated from an uncharacterized Promicromonosporaceae isolate 37. The MS/MS spectrum of the gut microbiota-derived Glu-CA, an amidated tri-hydroxylated bile acid, was most frequently observed in cultures of Bifidobacterium species, while Glu-DCA was found only in one Bifidobacterium strain but also in two Enterococcus and Clostridium species. None of the molecules were found in cultured human cell lines, highlighting the ability of microbeMASST to distinguish MS/MS spectra of molecules that can be exclusively produced by either bacteria or fungi. It is important to acknowledge that MS/MS data generally do not differentiate stereoisomers, but they can nevertheless provide crucial information on molecular families.
microbeMASST can also be used to extract microbial information from mass spectrometry-based metabolomics studies without any a priori knowledge. To illustrate this, we reprocessed an untargeted metabolomics study with data acquired from 29 different organs and biofluids comprising tissues including brain, heart, liver, blood and stool of germ-free (GF) mice and mice harbouring microbial communities, also known as specific pathogen-free (SPF) mice 30 (Fig. 2a). We extracted 10,047 consensus MS/MS spectra uniquely present in SPF mice and queried them with microbeMASST. A total of 3,262 MS/MS spectra were found to have a microbial match to the microbeMASST reference database. Of these, 837 were also found in human cell lines and for this reason were removed from further analysis. Among the remaining 2,425 MS/MS spectra, 1,673 were exclusively found in bacteria, 95 in fungi and 657 in both (Supplementary Fig. 4). These MS/MS spectra were then processed with SIRIUS 38 and CANOPUS 39 to tentatively annotate the metabolites and identify their chemical classes. A file containing all these spectra of interest can be explored and downloaded in .mgf format from GNPS (see Methods). To further validate the microbial origin of these MS/MS spectra, we assessed their overlap with data acquired from a different study comparing SPF mice treated with a cocktail of antibiotics to untreated controls 40. Interestingly, 621 MS/MS spectra were also found in this second dataset, and 512 of them were only present in animals not treated with antibiotics (Fig. 2b). The distribution of these spectra and their putative classes across bacterial phyla was visualized using an UpSet plot 41 (Fig. 2c). Notably, most of the spectra classified as terpenoids were commonly observed across phyla, while amino acids and peptides appeared to be more phylum specific. Of these 512 spectra, 23% had a level 2 putative annotation according to the 2007 Metabolomics Standards Initiative 42, matching the GNPS reference library (Supplementary Table 1). A level 2 annotation within the user-specified search criteria might result in MS/MS matches between molecules belonging to related families as opposed to unique molecules. Annotations included the recently described amidated microbial bile acids 19,29-31,43-48, free bile acids originating from the hydrolysis of host-derived taurine bile acid conjugates 49, keto bile acids formed via microbial oxidation of alcohols 30, N-acyl lipids belonging to a similar class of metabolites as commendamide (a microbial N-acyl lipid), di- and tri-peptides seen in microbial digestion of proteins 50, and soyasapogenol, a by-product of the microbial digestion of complex saccharides from dietary soyasaponins 30. Part of the remaining unannotated spectra can be identified as chemical modifications of the above annotated microbial metabolites through spectral similarity obtained from molecular networking (Supplementary Fig. 5). This list of annotated MS/MS spectra included metabolites that are not yet widely considered to be of microbial origin, such as the di- and tri-hydroxylated bile acids and the glycine-conjugated bile acids 43. One interpretation of these findings is that microorganisms are capable of producing metabolites previously described to be only of mammalian origin. Notable examples of metabolites that have been established to be produced by both the mammalian host and bacteria include serotonin 51, γ-aminobutyric acid (GABA) 52 and, most recently, glycocholic acid 43,53-55. An alternative hypothesis is that microorganisms can also selectively stimulate the production of host metabolites. Other limitations regarding annotations are discussed in Methods.
To assess whether the observations from the mouse models translate to humans, we searched public human data and found that 455 out of the 512 MS/MS spectra of interest had a match (Fig. 2d). Interestingly, these spectra were found in both healthy individuals and individuals affected by different diseases, including type II diabetes, inflammatory bowel disease, Alzheimer's disease and other conditions. These spectra were most commonly found in stool samples (n = 110,973 MS/MS matches), followed by blood, breast milk and the oral cavity, as well as other organs including the brain, skin, vagina and biofluids (for example, cerebrospinal fluid and urine) (Fig. 2e). These findings support the concept that a substantial number of microbial metabolites reach and influence distant organs in the human body 56.

Brief Communication
https://doi.org/10.1038/s41564-023-01575-9
We anticipate that microbeMASST will be a key resource to enhance understanding of the role of microbial metabolites across a wide range of ecosystems, including oceans, plants, soils, insects, animals and humans. This expanding resource will enable the scientific community to gain valuable taxonomic and functional insights into diverse microbial populations. The mass spectrometry community will play a key role in the evolution of this tool in the future through the continued deposition of data associated with microbial monocultures and the expansion of spectral reference libraries. Moreover, microbeMASST holds potential for various applications ranging from aquaculture and agriculture to biotechnology and the study of microbe-mediated human health conditions. By harnessing the power of public data, we can unlock opportunities for advancements in multiple fields and deepen our understanding of the intricate relationships between microorganisms and their ecosystems.

Data collection and harmonization
Data deposited in GNPS/MassIVE were investigated manually and systematically using ReDU 57 (https://redu.ucsd.edu/) to extract all publicly available MS/MS files (.mzML or .mzXML formats) acquired from monocultures of bacteria, fungi, archaea and human cell lines. Only monocultures were included in the reference database of this search tool to unequivocally associate the production of the detected metabolites to each specific taxon. A total of 60,781 files from 537 different GNPS/MassIVE datasets were selected to be used as the reference database of microbeMASST (Supplementary Table 2). These include files deposited in response to our call to the scientific community. Between May and July 2022, 25 different research groups deposited 65 distinct datasets in GNPS/MassIVE, comprising a total of 3,142 unique LC-MS/MS files. This represented a 5.45% increase in publicly available MS/MS data acquired from monocultures in just 2 months. To qualify as a contributor and be credited as one of the authors, researchers had to deposit high-resolution LC-MS/MS data acquired either in positive or negative ionization modes from monocultures of either bacteria, fungi or archaea. Harmonization of the acquired data and metadata represented a challenge. The NCBI taxonomic database is constantly expanding and evolving, and the latest ReDU update (December 2021) does not accommodate the latest deposited taxa. For this reason, an additional metadata file (microbeMASST_metadata_massiveID) was generated specifically for the microbeMASST project and uploaded to the respective GNPS/MassIVE datasets deposited by the collaborators if the ReDU workflow failed. All the collected information was finally aggregated in a single .csv file (microbe_masst_table.csv) that can be found on GitHub, which contains: (1) full MassIVE path of each sample, (2) file name of each sample reported as its MassIVE ID/file name to avoid the presence of duplicated names, (3) MassIVE ID, (4) taxonomic name of the isolate as reported by the author submitting the associated metadata, (5) alternative taxonomic name if the provided taxonomic name was incorrect or not present in NCBI, (6) NCBI ID associated with the taxonomic name or the alternative taxonomic name, when present, (7) whether the taxonomic ID was automatically assigned or manually curated, (8) whether ReDU metadata are available for that specific file, and whether the file corresponds to a (9) blank or (10) QC rather than a unique biological sample.
Unique taxonomic names and NCBI IDs were extracted from the metadata associated with the selected samples. When metadata were not available and multiple species of bacteria or fungi were present in the same dataset, samples were generically classified as bacteria or fungi. Concordance between taxonomic names and NCBI IDs was checked by blasting taxonomic names against NCBI (https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi) to obtain respective NCBI IDs and updated taxonomic names. Results were manually investigated and missing IDs were recovered using the NCBI browser (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi). If the taxonomic name was not found in NCBI, most probably because it was not deposited yet, the NCBI ID of the closest taxon was retrieved and used. For example, the strain Staphylococcus aureus CM05 was unavailable in NCBI and was curated to the species Staphylococcus aureus instead.
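The fallback to the closest available taxon can be illustrated with a short Python sketch. This is only an illustration of the idea, not the curation procedure used for microbeMASST, which relied on the NCBI web tools and manual inspection. The function name closest_taxon and the small lookup table are hypothetical; the two IDs shown are the NCBI taxonomy IDs of the genus Staphylococcus and the species Staphylococcus aureus.

```python
from typing import Dict, Optional, Tuple

# Hypothetical local lookup of names known to the taxonomy (name -> NCBI ID).
KNOWN_TAXA: Dict[str, int] = {
    "Staphylococcus": 1279,
    "Staphylococcus aureus": 1280,
}


def closest_taxon(name: str, known: Dict[str, int]) -> Optional[Tuple[str, int]]:
    """Return (matched_name, taxon_id) for the longest leading part of `name`
    that is present in `known`, or None if nothing matches."""
    tokens = name.split()
    # Drop trailing tokens (strain label first, then species epithet)
    # until the remaining name is found in the lookup.
    while tokens:
        candidate = " ".join(tokens)
        if candidate in known:
            return candidate, known[candidate]
        tokens.pop()
    return None


print(closest_taxon("Staphylococcus aureus CM05", KNOWN_TAXA))
```

In this toy case the strain name is not in the lookup, so the function falls back to the species entry, mirroring the Staphylococcus aureus CM05 example above.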

Taxonomic tree generation
The microbeMASST taxonomic tree was generated using both R 4.2.2 and Python 3.10. In R, the microbeMASST table was filtered and only unique NCBI IDs were retained (n = 1,834). The classification function of the 'taxize' package (v.0.9.100) was used to retrieve the full lineage of each NCBI ID 58. Main taxonomic ranks (kingdom to strain) plus subgenus, subspecies and varieties were kept to obtain taxonomic trees with a similar number of nodes per lineage. The list of NCBI IDs of all lineages was then imported to Python, where the ETE3 toolkit was used to generate a taxonomic tree on the basis of the provided NCBI IDs 59. The generated Newick tree was then converted into JSON format and information such as taxonomic rank and number of available samples per taxon was added. In addition, child nodes for blanks and QCs were created to be visualized in the same tree.
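The per-node information shown in the tree (deposited files, matching files and their proportion, with matches propagated upstream through each lineage) can be sketched in a few lines of Python. The sketch below uses plain dictionaries instead of ETE3 or the microbeMASST JSON tree, and the function name aggregate_matches, the lineages and the file counts are all hypothetical values chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate_matches(
    lineages: Dict[str, List[str]],
    total_files: Dict[str, int],
    matched_files: Dict[str, int],
) -> Dict[str, Tuple[int, int, float]]:
    """Propagate file counts from each taxon to every node of its lineage.

    lineages maps a taxon to its lineage, listed from the root down to the
    taxon itself. Returns node -> (matched, total, matched / total).
    """
    totals: Dict[str, int] = defaultdict(int)
    matches: Dict[str, int] = defaultdict(int)
    for taxon, lineage in lineages.items():
        for node in lineage:
            totals[node] += total_files.get(taxon, 0)
            matches[node] += matched_files.get(taxon, 0)
    return {
        node: (matches[node], totals[node],
               matches[node] / totals[node] if totals[node] else 0.0)
        for node in totals
    }


# Hypothetical toy data: three fungal species with made-up file counts.
lineages = {
    "Aspergillus terreus": ["Fungi", "Aspergillus", "Aspergillus terreus"],
    "Aspergillus niger": ["Fungi", "Aspergillus", "Aspergillus niger"],
    "Penicillium citrinum": ["Fungi", "Penicillium", "Penicillium citrinum"],
}
total_files = {"Aspergillus terreus": 10, "Aspergillus niger": 10,
               "Penicillium citrinum": 20}
matched_files = {"Aspergillus terreus": 4}

for node, stats in sorted(aggregate_matches(lineages, total_files, matched_files).items()):
    print(node, stats)
```

Files mapped directly to a higher rank (for example, samples generically classified as fungi) could be handled in the same way by giving that rank its own lineage entry.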

MASST query
The microbeMASST web application was built using Dash and Flask open-source libraries for Python (https://github.com/mwang87/GNPS_MASST/blob/master/dash_microbemasst.py). The web app can receive as inputs either a USI or an MS/MS spectrum (fragment ions and their intensities). In addition, batch searches can be performed using a customizable Python script that can read either a .tsv file containing a list of USIs or a single .mgf file (https://github.com/robinschmid/microbe_masst). In this manuscript, we showcase how we were able to search for more than 10,000 MS/MS spectra contained in a single .mgf file (~2 h run time). After receiving input information, microbeMASST leverages the Fast Search Tool (https://fasst.gnps2.org/fastsearch/) API and the sample-specific associated metadata to generate taxonomic trees. Fast searches are based on indexing all the MS/MS spectra present in GNPS/MassIVE according to the mass and intensity of their precursor ions and then restricting the search to only a set of relevant spectra that have a greater chance of achieving a high spectral similarity (modified cosine score) to the MS/MS spectrum of interest. Searches are customizable and default settings are the following: precursor and fragment ion mass tolerances, 0.05; minimum cosine score threshold, 0.7; minimum number of matching fragment ions, 3; and analogue search, False. Users can modify these parameters on the basis of their data and research questions. Once matches are obtained, it is good practice to inspect the associated mirror plots for confirmation. To create the final taxonomic tree, the JSON file of the complete microbeMASST taxonomic tree is filtered according to the results and converted into a D3 JavaScript object that can be visualized as an HTML file.
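To make the scoring criteria concrete, the following Python sketch computes a simplified modified cosine score between two spectra and applies the default thresholds listed above. It is not the Fast Search Tool implementation: it uses greedy peak pairing rather than an optimal assignment, applies no intensity scaling, and the function names modified_cosine and passes_default_thresholds as well as the example peaks are hypothetical.

```python
import math
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def modified_cosine(prec_a: float, peaks_a: List[Peak],
                    prec_b: float, peaks_b: List[Peak],
                    tol: float = 0.05) -> Tuple[float, int]:
    """Greedy modified cosine: fragments may pair directly or after shifting
    by the precursor mass difference. Returns (score, matched_peaks)."""
    delta = prec_a - prec_b
    norm_a = math.sqrt(sum(i * i for _, i in peaks_a))
    norm_b = math.sqrt(sum(i * i for _, i in peaks_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0, 0
    # Collect every allowed pairing together with its intensity product.
    candidates = []
    for ia, (mz_a, int_a) in enumerate(peaks_a):
        for ib, (mz_b, int_b) in enumerate(peaks_b):
            direct = abs(mz_a - mz_b) <= tol
            shifted = abs(mz_a - delta - mz_b) <= tol
            if direct or shifted:
                candidates.append((int_a * int_b, ia, ib))
    # Greedily keep the strongest pairings, using each peak at most once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    dot, matched = 0.0, 0
    for product, ia, ib in candidates:
        if ia in used_a or ib in used_b:
            continue
        used_a.add(ia)
        used_b.add(ib)
        dot += product
        matched += 1
    return dot / (norm_a * norm_b), matched


def passes_default_thresholds(score: float, matched: int,
                              min_cosine: float = 0.7,
                              min_peaks: int = 3) -> bool:
    """Apply the default minimum cosine score and matching fragment count."""
    return score >= min_cosine and matched >= min_peaks


# Hypothetical analogue pair: the second spectrum carries a +14.0 modification
# on the precursor and on two of its three fragments.
query = [(100.0, 10.0), (150.0, 5.0), (200.0, 2.0)]
analogue = [(100.0, 10.0), (164.0, 5.0), (214.0, 2.0)]
score, matched = modified_cosine(300.0, query, 314.0, analogue)
print(round(score, 3), matched, passes_default_thresholds(score, matched))
```

With analogue search disabled, the precursor masses are required to agree within tolerance, so the shift is close to zero and the score reduces to an ordinary cosine over directly matching fragments.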

Applications
We envision microbeMASST to have several applications. First, we showcase how researchers can investigate single MS/MS spectra using the web interface and obtain matching results if their known or unknown MS/MS spectrum was previously observed in any of the microbial monocultures present in the microbeMASST database. Nine small molecules of interest were investigated using MS/MS spectra already deposited in the GNPS reference library (see 'Data availability' and 'Code availability'). Second, we show how microbeMASST can be leveraged to mine for known or unknown microbial metabolites in MS studies. To test this hypothesis, we reprocessed an LC-MS/MS dataset acquired from 29 different organs and biofluids of GF and SPF mice 30. A comprehensive molecular network was generated (https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=893fd89b52dc4c07a292485404f97723). From the obtained job, the qiime2 artefact (qiime2_table.qza), the .mgf file (METABOLOMICS-SNETS-V2-893fd89b-download_clustered_spectra-main.mgf) containing all the captured MS/MS spectra, and the annotation table (METABOLOMICS-SNETS-V2-893fd89b-view_all_annotations_DB-main.tsv) were extracted. The .qza file was first converted into a .biom file and then a .tsv file using QIIME2 (ref. 60) to extract the feature table. This was then imported to R where only spectra present in tissues and biofluids of SPF animals were retained (n = 10,047). To add an extra layer of filtering, all MS/MS spectra that had an edge (cosine similarity >0.7) and a delta parent ion mass of ±0.02 Da with MS/MS spectra present in GF animals were removed. Spectral pairs information was contained in a networkedges_selfloop file. All the MS/MS spectra were then run in batch using a custom Python script of microbeMASST (processing time: ~2 h, Apple M2 Max, 64 GB RAM) to obtain microbial matches. Matching and filtered MS/MS spectra (n = 2,425) were aggregated into a single .mgf file that can be downloaded from GNPS (https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=aecd30b9febd43bd8f57b88598a05553). The compound class of each MS/MS spectrum with parent ion mass <850 Da was predicted using SIRIUS 38 and CANOPUS 39. The 2,425 MS/MS spectra were then searched against the MSV000080918 dataset containing mice treated or not with antibiotics 40. Matching and filtered MS/MS spectra (n = 512) were aggregated into a single .mgf file that can be downloaded from GNPS (https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=c33855fc32c948049331e9730189d5c1). A list of the spectra with putative chemical class classification is available in Supplementary Table 1. Venn diagrams and UpSet plots were generated in R using VennDiagram 1.7.3, UpSetR 1.4.0 and ComplexUpset 1.3.3. Finally, the 512 MS/MS spectra were searched in batch against the GNPS public repository to observe whether they were detected in human datasets (Supplementary Table 3). ReDU metadata information associated with the human datasets was then used to observe the distribution of the MS/MS spectra across different diseases and body parts.
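The first steps of this filtering workflow amount to simple set operations, sketched below in Python with hypothetical spectrum identifiers. The function name putative_microbial is an assumption made for illustration; the counts in the second half are the values reported in the text, used only to check that they are mutually consistent.

```python
from typing import Set


def putative_microbial(spf_only: Set[str], microbe_hits: Set[str],
                       human_cell_hits: Set[str]) -> Set[str]:
    """Keep SPF-only spectra that match a microbial monoculture
    and do not match any human cell line."""
    return (spf_only & microbe_hits) - human_cell_hits


# Hypothetical spectrum IDs, only to show the set logic.
spf_only = {"s1", "s2", "s3", "s4", "s5"}
microbe_hits = {"s1", "s2", "s3", "x9"}
human_cell_hits = {"s3"}
kept = putative_microbial(spf_only, microbe_hits, human_cell_hits)
print(sorted(kept))

# Consistency check on the counts reported in the text.
matched_to_microbes = 3262
also_in_human_cells = 837
remaining = matched_to_microbes - also_in_human_cells
bacteria_only, fungi_only, both = 1673, 95, 657
print(remaining, bacteria_only + fungi_only + both)
```

Both printed totals equal 2,425, the number of putative microbial MS/MS spectra carried forward to the antibiotic-treatment comparison.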

Technical limitations
Analysis of the results should be considered with the following limitations in mind. Molecule detection in microbeMASST is dependent on the availability of specific substrates in the reference monocultures. If the culture lacks the necessary substrates (or any other culture condition) to produce a certain molecule, this molecule will not be detected. Nevertheless, if related substrates are present, then a different but related molecule may be produced instead, which can be detected using the analogue search function. To address this problem, it is crucial for the community to continue to gather data from as many diverse experimental conditions as possible to capture the full range of metabolites produced by different microorganisms. This will help in building the most comprehensive reference database that encompasses diverse microbial metabolic profiles. In addition, isomers and stereoisomers, which are molecules with the same molecular formula but different structural arrangements, often exhibit similar MS/MS spectra. This means that their fragmentation patterns may not provide enough information to distinguish them. Finally, differences in extraction conditions and instrument settings can lead to variations in the obtained MS/MS spectra. For example, the intensity of precursor ions used for fragmentation can impact the resulting spectra. If the precursor ion intensity is low, the fragmented spectrum may lack ions that are present in spectra obtained from high-intensity precursor ions. This may result in 'data leakage' as the MS/MS spectrum may be missing ions, leading to two spectra of the same molecule not being recognized as such. To partially overcome this, more permissive search settings can be used.
The majority of the data deposited in public repositories, GNPS included, and used in microbeMASST were acquired using positive electrospray ionization mode, which limits the observation of molecules that cannot be ionized in positive mode. This means that certain metabolites may be underrepresented or not detected at all. The continuous curation of the microbeMASST reference database involves adding more diverse data in terms of ionization modes to improve the coverage of metabolites.
The taxonomic tree was generated using associated NCBI IDs provided by the community. Specimen assignment before metabolomic analysis cannot be checked by microbeMASST. The majority of the deposited data do not have associated genetic information and, even if available, it was not used to build the taxonomic tree. Thus, specimen misidentification cannot be excluded. By addressing these challenges and continuously curating the reference database with more comprehensive and diverse data, microbeMASST coverage can be expanded to provide valuable insights into the role of microbiota and to facilitate our understanding of microbial metabolism in diverse ecosystems.

Statistics and reproducibility
No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
Data used to generate the reference database of microbeMASST are publicly available at GNPS/MassIVE (https://massive.ucsd.edu/). A list of all the accession numbers (MassIVE IDs) of the studies used to generate the reference database of this tool is available in Supplementary Table 2. Interactive examples of the MS/MS queries illustrated in Fig. 1d and Supplementary Fig. 3 can be visualized at the microbeMASST website (https://masst.gnps2.org/microbemasst/). A video tutorial on how to use microbeMASST is available on YouTube. Known molecules already present in the GNPS library (https://library.gnps2.org/) were used to facilitate interpretation and confirm that specific bacterial and fungal molecules of interest were exclusively observed in the respective monocultures:
Lovastatin - CCMSLIB00005435737
Salinosporamide A - CCMSLIB00010013003
Commendamide - CCMSLIB00004679239
Mevastatin - CCMSLIB00005435644
Arylomycin A4 - CCMSLIB00000075066
Yersiniabactin - CCMSLIB00005435750
Promicroferrioxamine - CCMSLIB00005716848
Glutamate-cholic acid (Glu-CA) - CCMSLIB00006582258
Glutamate-deoxycholic acid (Glu-DCA) - CCMSLIB00006582092
Data used to extract MS/MS spectra exclusively present in colonized (SPF) mice are publicly available in GNPS/MassIVE under the accession number MSV000079949. Data used to validate and assess the effect of antibiotics on microbial MS/MS spectra of interest are available under the accession number MSV000080918. A list of datasets with data acquired from human biosamples that presented matches to the putative microbial MS/MS spectra of interest is available in Supplementary Table 3.

Fig. 1 | The microbeMASST search tool and reference database. a, Community contributions of data and knowledge to GNPS 17, ReDU 57 and MassIVE from 2014 to 2022 were used to generate the microbeMASST reference database. In addition, a public invitation to deposit data in June 2022 resulted in the further deposition of LC-MS/MS files from 25 different laboratories from 15 different countries across the globe, leading to the curation of a total of 60,781 LC-MS/MS files of microbial monoculture extracts. b, microbeMASST comprises 1,858 unique lineages across three different domains of life mapped to 541 unique strains, 1,336 species, 539 genera, 264 families, 109 orders, 41 classes and 16 phyla. c, Examples of medically relevant small molecules known to be produced by bacteria or fungi. Lovastatin, a cholesterol-lowering drug originally isolated from Aspergillus

Fig. 2 | microbeMASST can identify microbial MS/MS spectra within mouse and human datasets. a, Workflow to extract microbial MS/MS spectra from biochemical profiles of 29 different tissues and biofluids of SPF mice that are not observed in GF mice 30. The MS/MS spectra unique to SPF mice (10,047) were searched with microbeMASST. A total of 3,262 MS/MS spectra had a match; those MS/MS spectra also matching human cell lines were removed, leaving a total of 2,425 putative microbial MS/MS spectra (see Methods to download .mgf file). b, The presence of the 2,425 MS/MS spectra was evaluated in an additional animal study looking at antibiotic usage 40. A total of 512 MS/MS spectra, out of the 621 overlapping, were exclusively found in animals not receiving antibiotics. c, UpSet plot of the distribution of the detected MS/MS spectra (512) across bacterial phyla. Terpenoids were more commonly observed across phyla, while amino acids and peptides appeared to be more phylum specific. d, The 512 MS/MS spectra were searched in human datasets and 455 were found to have a match. These MS/MS spectra were present in both healthy individuals and individuals affected by different diseases. e, Most of the MS/MS spectra (n = 411) matched faecal samples (n = 110,973 matches), followed by blood, oral cavity, breast milk, urine and several other organs. CSF, cerebral spinal fluid; COVID-19, coronavirus disease 2019; HIV, human immunodeficiency virus; PBI, primary bacterial infectious disease; SD, sleep disorder; AD, Alzheimer's disease; IS, ischaemic stroke; KD, Kawasaki disease; IBD, inflammatory bowel disease; T2D, type II diabetes. GNPS and microbeMASST logos reproduced under a Creative Commons license CC BY 4.0; SIRIUS logo reproduced under a Creative Commons license CC BY 4.0-ND.