Date: 2021-10-14

General considerations

Establishing the stereochemistry from fragmentation spectra is highly challenging and beyond the power of automated search engines; hence, only the two-dimensional structure is considered when evaluating a hit structure. We consider the identity and connectivity (with bond multiplicities) of the atoms but ignore the stereo-configuration for asymmetric centers and double bonds.

The term ‘novel compound’ has previously been used to describe conflicting and imprecisely defined concepts, such as an unexpected compound being detected in a sample or organism, or a compound that has not previously been described in the literature. Throughout this paper, a structure is considered ‘novel’ if no MS/MS data from a compound with the same structure are present in the training data; hence, the compound cannot be annotated through spectral library search. We noted above that stereoisomers (compounds with identical two-dimensional structure, such as L-threose, D-threose, L-erythrose and D-erythrose) show highly similar fragmentation. Hence, for L-threose to be novel, the training data must not contain MS/MS data for L-threose, D-threose or (L- or D-)erythrose. In our evaluations, we ensure that all compounds are novel using structure–disjoint cross-validation.

Similarly, a ‘truly novel’ compound refers to a compound structure absent from large public databases such as PubChem30 or ChemSpider32; quotation marks are in place, as the (non-public) database GDB-17 (ref. 47) contains 166 billion hypothetical structures of small molecules, and ‘truly novel’ compounds might already be in there. For CSI:FingerID and other in silico methods that do not rely on metascores, it makes no difference whether we search in a database of ‘truly novel’ hypothetical structures, in PubChem or in the biomolecule structure database. It is understood that correct annotation rates will deteriorate if the database we search in becomes too large.

COSMIC targets biomolecules—that is, products of nature as well as synthetic products with potential bioactivity, including drugs, toxins, food, cosmetics and other xenobiotics. This restriction of focus is due to the available MS/MS training data.

Regarding COSMIC and its annotations, it must be understood that COSMIC only proposes structure annotations; the user has to decide which of these putative annotations will be analyzed further and potentially verified using orthogonal data, such as retention time, comparison with synthetic standards, spike-in experiments, isolation or nuclear magnetic resonance (NMR) experiments.

FDRs

Given a list of hits, the FDR of this list is the number of incorrect hits in the list divided by the size of the list. Hence, to compute the FDR, we must know the exact number of correct and incorrect hits. Throughout this paper, evaluations were carried out using reference data so that the true structure underlying any query spectrum was unknown to the method but known to us. Hence, all reported FDRs are exact, unless indicated otherwise. At this point, there is no need to employ methods for FDR estimation12,34,48,49; such methods try to accurately estimate the FDR in applications, where we do not know which hits are correct and which are incorrect. Accurate FDR estimation remains a highly non-trivial problem in general statistics as well as many fields of application (see also below).
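The definition above translates directly into code. The following is a minimal sketch; the function name `exact_fdr` and the example hit list are hypothetical and serve only as illustration.

```python
from typing import Sequence


def exact_fdr(is_correct: Sequence[bool]) -> float:
    """Exact FDR of a hit list: incorrect hits divided by list size."""
    if not is_correct:
        raise ValueError("FDR is undefined for an empty hit list")
    incorrect = sum(1 for ok in is_correct if not ok)
    return incorrect / len(is_correct)


# Hypothetical list of five hits, one of which is incorrect.
hits = [True, True, False, True, True]
print(exact_fdr(hits))  # 0.2
```

Note that this requires the correctness of every hit to be known, which is the case only for reference data.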

In silico methods and related work

So-called ‘in silico methods’ allow us to search in a molecular structure database using MS/MS data as our query. Most methods follow one of three paradigms. (1) Combinatorial fragmenters13,50,51,52 try to explain the query spectrum using the candidate structure, combinatorially breaking bonds in the molecular structure graph. (2) Other methods try to predict the fragmentation spectrum of a given compound structure14,53,54; this allows us to search in the structure database by spectral matching. (3) Alternatively, we can transform the query spectrum into information about the query structure and then use this structure information to search in the structure database15,55,56. Later publications basically present minor modifications of these ideas; an exception is the Input Output Kernel Regression (IOKR) variants of CSI:FingerID17,57, which use molecular fingerprints but circumvent the prediction of individual molecular properties, instead predicting similarity of a candidate to the query by regression.

Some methods use so-called ‘metascores’ that integrate information about citation frequencies or production volume51,58,59. We stress that ‘metascores’ have nothing in common with ‘metadata’, except for the prefix; metadata are information about the experimental setup and the biological sample, whereas these metascores use side information unrelated to the actual experiment. These metascores usually perform well in evaluations but come with several severe restrictions; in the context discussed here, the most important restriction is that the above side information is not available, and metascores, therefore, are not applicable for any ‘truly novel’ structure, such as novel bile acid conjugates. Furthermore, metascores tend to prefer highly cited ‘blockbuster metabolite’ candidates; hence, evaluations, which are carried out using mainly such ‘blockbuster metabolites’, often yield exaggerated results. Similar limitations are associated with metascores based on taxonomy60 as, again, this information is not available for ‘truly novel’ structures. Thus, we ignored metascore methods in our evaluations.

Finally, some tools use networks for structure annotations; networks may be based on spectral similarity in the LC–MS/MS run or structural similarity in the metabolite database60,61,62,63.

Structure databases

Different from previous studies15,33 where structures were derived from International Chemical Identifier (InChI) strings, molecular structures were standardized using the PubChem standardization procedure30. In particular, a canonical tautomeric form was chosen, as solvent, temperature and pH in the sample influence the dominating tautomeric species. Standardization of compounds not in PubChem was carried out using the web service at https://pubchem.ncbi.nlm.nih.gov/rest/pug/. PubChem standardization has changed multiple times over the last years without users being notified; hence, it is possible that some non-PubChem compounds were standardized slightly differently than structures from the MS/MS training data.

We searched in the following structure databases with COSMIC:

For the CASMI 2016 evaluation18, we downloaded structures from the CASMI 2016 results web page (http://casmi-contest.org/2016/). Candidate structures were provided as part of the blinded contest and originally retrieved from ChemSpider32.

The biomolecule structure database is a union of several public structure databases, including HMDB43, ChEBI64, KEGG65,66 and UNPD67. The resulting database contains 391,855 unique structures of biomolecules and compounds that can be expected to be present in biological samples.

The HMDB structure database43 was downloaded on 8 August 2018 and contains 113,983 compounds and 95,980 unique structures with mass up to 2,000 Da.

The PubChem structure database30 was downloaded on 16 January 2019 and contains 97,168,905 compounds and 77,153,182 unique covalently bonded structures with mass up to 2,000 Da. We added all missing structures from the biomolecule structure database, which resulted in a total of 77,190,484 unique structures.

A combinatorial database of 28,630 bile acid conjugate structures was generated with SmiLib v2.0 (refs. 19,20), downloaded from http://melolab.org/smilib/. SmiLib generates chemical structures by combining scaffolds and building blocks provided as SMILES (Simplified Molecular Input Line Entry Specification). We curated a list of initial bile acid ‘scaffolds’ that represent common steroid cores (that is, cholic acid, deoxycholic acid, hyocholic acid and chenodeoxycholic acid). Initial scaffolds were modified manually with common phase 2 metabolism reactions (that is, glucuronidation, acetylation, sulphation and methylation) and resulted in 322 scaffolds. To generate bile acid conjugates, scaffolds were combined with 91 building blocks, including proteinogenic and non-proteinogenic amino acids, along with their N-hydroxylated and N-methylated versions, and acyl moieties. Stereochemical information was removed before the database generation with SmiLib. Notably, the bile acid conjugate structure database also contains unconjugated bile acids; for the sake of brevity, we will, nevertheless, refer to ‘bile acid conjugates’ without explicitly mentioning this fact.

MS/MS reference datasets and noise addition

For evaluations, we limited ourselves to MS/MS spectra recorded in positive ion mode, as there are generally more such spectra available. This is not a restriction of COSMIC, and the publicly available version can also process negative ion mode data. Evaluations were carried out using reference measurements, as we do not know the correct answers for biological datasets.

For the CASMI 2016 evaluation, MS/MS spectra were downloaded from the CASMI web page (http://casmi-contest.org/2016/). MS/MS spectra were measured on a Q Exactive Plus Orbitrap (Thermo Fisher Scientific) with 20/35/50 HCD nominal collision energies. Twenty-two mixes of synthetic standards were measured in one LC–MS run each, using data-dependent acquisition mode and inclusion lists. Each mix contained 10–94 compounds. A reversed-phase C18 column was used (see ref. 18 for details). In full, MS/MS data of 127 compounds measured in positive ion mode were provided as part of the contest. Fragmentation spectra from different collision energies were merged.

For the Sciex dataset, authentic standards from different compound libraries and single reference standards were used. Specifically, the Agilent LC/MS Pesticide Comprehensive mix, Sigma-Aldrich Bile Acid/Carnitine/Sterol Metabolite Library of Standards, Sigma-Aldrich Fatty Acid Metabolite Library of Standards and Sigma-Aldrich Acid Metabolite Library of Standards were used. Standards were dissolved in suitable solvents and combined into 43 mixtures so as to avoid overlap of isomeric and isobaric substances. Standard mixtures were analyzed using a Sciex Exion AD liquid chromatography system coupled to a Sciex X500R QTOF MS system. Separation was achieved on a Phenomenex Kinetex F5 column (150 mm × 2.1 mm ID, 2.6 μm particle size) with eluent A (100% H2O + 0.1% formic acid) and eluent B (100% acetonitrile + 0.1% formic acid) using the following gradient (A/B): 100/0 at 0 min, 100/0 at 2.1 min, 5/95 at 14 min, 5/95 at 16 min, 100/0 at 16.1 min and 100/0 at 20 min. Column temperature was set to 30 °C and flow rate to 200 μl min−1. Data were acquired by data-dependent acquisition of MS/MS spectra using a collision energy ramp from 20 eV to 50 eV. The MS was automatically recalibrated every five injections in MS1 and MS/MS mode. MS/MS spectra for the standards were extracted using the Sciex OS 2.0 software and stored as a .txt file. SIRIUS .ms files and MassBank records were generated using a custom R script.

To train CSI:FingerID, we used a combined dataset from MassBank68, GNPS37 and the NIST 2017 database. Reference MS/MS data were measured on different high-resolution instruments from multiple vendors. The CSI training dataset contains 16,703 structures with 23,965 independent MS/MS measurements. As an independent dataset, we used the commercial MassHunter Forensics/Toxicology PCDL library (Agilent Technologies) with 3,243 structures and 3,462 independent MS/MS measurements, all measured on an Agilent QTOF instrument. Unlike the commercially available library, these mass spectra were not curated. When discussing reference dataset evaluations, independent MS/MS measurements will be referred to as ‘compounds’ for the sake of brevity.

Previous evaluations of CSI:FingerID15,33 were carried out using fragmentation spectra that merged all available collision energies. Here, we also want to evaluate COSMIC’s power if query spectra are recorded at a single collision energy, because LC–MS/MS datasets are often recorded in this way. To this end, we compiled fragmentation spectra sets for both training and independent data using single collision energies—namely, 10 eV, 20 eV and 40 eV. To ensure that COSMIC results are comparable among different collision energies, we used only those compounds for which all three collision energies are available. In the independent data, this is the case for all compounds; but, in the training data, only NIST entries pass this criterion. Hence, the COSMIC training dataset exclusively contains spectra from NIST, all of which were measured on an Orbitrap instrument; and, consequently, all cross-validation results on this dataset exclusively use MS/MS data from Orbitrap instruments. In case the NIST library did not contain fragmentation spectra for the exact collision energies 10 eV, 20 eV and 40 eV, we allowed for a deviation of up to 4 eV; in case fragmentation spectra for more than one collision energy were present in this interval, we used the one with collision energy closest to the desired one. Finally, merged spectra were generated by combining these three spectra (pseudo-ramp spectra).
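The selection of single-collision-energy spectra described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in this study: the function names are hypothetical, and the handling of ties (two measured energies equally close to the target) is not specified in the text; here the first one listed is taken.

```python
from typing import Dict, List, Optional

TARGET_ENERGIES_EV = (10.0, 20.0, 40.0)
MAX_DEVIATION_EV = 4.0


def pick_energy(available: List[float], target: float,
                max_dev: float = MAX_DEVIATION_EV) -> Optional[float]:
    """Return the measured energy closest to `target`, or None if no
    measured energy deviates from it by at most `max_dev`."""
    in_range = [e for e in available if abs(e - target) <= max_dev]
    if not in_range:
        return None
    # Ties are resolved in favor of the first entry (assumption).
    return min(in_range, key=lambda e: abs(e - target))


def select_energies(available: List[float]) -> Optional[Dict[float, float]]:
    """Map each target energy to the chosen measured energy. Return None
    if any target energy is unavailable, so the compound is not used."""
    chosen: Dict[float, float] = {}
    for target in TARGET_ENERGIES_EV:
        energy = pick_energy(available, target)
        if energy is None:
            return None
        chosen[target] = energy
    return chosen


print(select_energies([9.0, 12.0, 22.0, 37.0, 45.0]))
# {10.0: 9.0, 20.0: 22.0, 40.0: 37.0}
```

A compound for which `select_energies` returns None would be excluded, in line with the requirement that all three collision energies be available.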

Fragmentation spectra in reference libraries often have much better quality (more signal peaks, fewer noise peaks and better signal-to-noise) than fragmentation spectra from a biological LC–MS/MS run. To simulate this effect in our reference datasets, we ‘added noise’ to each fragmentation spectrum. Distorting spectra followed similar principles as the generation of decoy spectra12: we distorted spectra similar to what we expect for experimental spectra. For example, adding noise peaks with (uniform) random mass will result in spectra that are notably different from experimental ones12. We simulated two noise models: medium noise and high noise.

We simulated a global mass shift (bias) by drawing a random number \(\delta^{*}\) from \(\mathcal{N}(0,\sigma_{\mathrm{mb}}^{2})\) and then shifting every peak mass m by \(\delta^{*}m\). The standard deviation \(\sigma_{\mathrm{mb}}\) was chosen as \(\sigma_{\mathrm{mb}}=(10/3)\times 10^{-6}\) (medium noise) or \(\sigma_{\mathrm{mb}}=(15/3)\times 10^{-6}\) (high noise), so that the \(3\sigma_{\mathrm{mb}}\) interval represents a 10-ppm shift for medium noise and a 15-ppm shift for high noise.

We simulated individual mass deviations by drawing, for each peak with mass m individually, a random number \(\delta\) from \(\mathcal{N}(0,\sigma_{\mathrm{md}}^{2})\) and shifting the peak by \(\delta m\). The standard deviation \(\sigma_{\mathrm{md}}\) was chosen so that the \(3\sigma_{\mathrm{md}}\) interval represents a 10-ppm shift for medium noise and a 20-ppm shift for high noise.

We simulated intensity variations in the spectrum: each peak intensity was multiplied by an individual random number \(\epsilon\) drawn from \(\mathcal{N}(1,\sigma_{\mathrm{id}}^{2})\). Variance was chosen as \(\sigma_{\mathrm{id}}^{2}=1\) for medium noise and \(\sigma_{\mathrm{id}}^{2}=2\) for high noise. Furthermore, 0.03 times the maximum peak intensity of the spectrum was subtracted from each peak intensity. If a peak intensity fell below the threshold of one thousandth of the maximum intensity in the spectrum, the peak was discarded.

Finally, we added ‘noise peaks’ to the spectrum. As uniformly choosing the mass of a noise peak would result in peaks that are too easy to spot and sort out by our subsequent analysis12, we, instead, used peaks that appeared in other measured spectra. In pre-processing, a pool of ‘noise peaks’ was gathered from the fragmentation spectra, using all peaks that did not have a molecular subformula decomposition of the known molecular formula of the precursor. For each spectrum, αn of these ‘noise peaks’ were added to the spectrum, where n is the number of peaks in the spectrum and α = 0.2 for medium noise and α = 0.4 for high noise. Intensities of ‘noise peaks’ were adjusted for maximum peak intensities in the contributing and receiving spectrum.
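The four distortion steps can be sketched for the medium noise model as follows. This is a simplified illustration, not the code used in this study. Several details are assumptions that the text does not specify: both mass shifts are applied relative to the original mass; the discard threshold and the number n refer to the original spectrum; αn is rounded to the nearest integer; noise peaks are drawn with replacement; and the pool stores intensities relative to the maximum of the contributing spectrum. The name `add_noise` and all constants' names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Peak = Tuple[float, float]  # (mass, intensity)

# Medium-noise parameters taken from the text.
SIGMA_MB = (10 / 3) * 1e-6  # global mass bias: 3 sigma = 10 ppm
SIGMA_MD = (10 / 3) * 1e-6  # per-peak mass deviation: 3 sigma = 10 ppm
SIGMA_ID = 1.0 ** 0.5       # intensity factor: variance 1
SUBTRACT = 0.03             # fraction of the maximum intensity subtracted
MIN_REL = 1e-3              # peaks below this fraction of the maximum are dropped
ALPHA = 0.2                 # noise peaks added per original peak


def add_noise(peaks: Sequence[Peak], noise_pool: Sequence[Peak],
              rng: random.Random) -> List[Peak]:
    """Return a distorted copy of `peaks`, sorted by mass.

    `noise_pool` holds (mass, relative intensity) pairs, where the relative
    intensity is the share of the maximum intensity in the contributing
    spectrum (an assumption about how intensities are adjusted).
    """
    if not peaks:
        return []
    max_int = max(inten for _, inten in peaks)
    bias = rng.gauss(0.0, SIGMA_MB)  # one draw per spectrum
    noisy: List[Peak] = []
    for mass, inten in peaks:
        # Both shifts are applied relative to the original mass (assumption).
        new_mass = mass + bias * mass + rng.gauss(0.0, SIGMA_MD) * mass
        new_int = inten * rng.gauss(1.0, SIGMA_ID) - SUBTRACT * max_int
        # Threshold relative to the original maximum intensity (assumption).
        if new_int >= MIN_REL * max_int:
            noisy.append((new_mass, new_int))
    # Number of noise peaks: alpha * n, rounded; n counts the original peaks.
    n_noise = round(ALPHA * len(peaks))
    if noise_pool:
        for mass, rel_int in rng.choices(noise_pool, k=n_noise):
            noisy.append((mass, rel_int * max_int))
    return sorted(noisy)
```

Passing a seeded `random.Random` instance makes the distortion reproducible; the high noise model would use the corresponding parameters given in the text.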

Parameters for medium noise and high noise were chosen in a way that the similarity between the original spectrum and the distorted spectrum reached a particular level, measured by the cosine score (dot product); for the cosine score, we allowed a mass deviation of 7 ppm when matching peaks. Precursor ion peaks were not considered for cosine score calculation, as their high intensities overshadow the lower-intensity peaks. For medium noise, the cosine score between the original and the distorted spectrum had a median value of 0.880. For high noise, the median cosine score was 0.714. Datasets with different noise levels were used for evaluations only but not to train CSI:FingerID or individual confidence score SVMs.
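A cosine score with a ppm tolerance and without the precursor ion peak can be sketched as follows. The text does not specify how peaks are matched or whether intensities are transformed; the sketch assumes greedy one-to-one matching by decreasing intensity product and raw intensities, and it removes any peak within the tolerance of the precursor mass. The function name `cosine_score` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Peak = Tuple[float, float]  # (mass, intensity)


def cosine_score(a: Sequence[Peak], b: Sequence[Peak],
                 precursor_mass: float, ppm: float = 7.0) -> float:
    """Cosine score between two spectra, ignoring precursor ion peaks."""
    def strip(peaks: Sequence[Peak]) -> List[Peak]:
        window = ppm * 1e-6 * precursor_mass
        return [(m, i) for m, i in peaks if abs(m - precursor_mass) > window]

    a, b = strip(a), strip(b)
    norm_a = math.sqrt(sum(i * i for _, i in a))
    norm_b = math.sqrt(sum(i * i for _, i in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    # Collect all peak pairs whose masses agree within the tolerance.
    pairs = []
    for ia, (ma, inta) in enumerate(a):
        for ib, (mb, intb) in enumerate(b):
            if abs(ma - mb) <= ppm * 1e-6 * max(ma, mb):
                pairs.append((inta * intb, ia, ib))
    # Greedy one-to-one matching, largest intensity product first (assumption).
    pairs.sort(reverse=True)
    used_a, used_b = set(), set()
    dot = 0.0
    for product, ia, ib in pairs:
        if ia in used_a or ib in used_b:
            continue
        used_a.add(ia)
        used_b.add(ib)
        dot += product
    return dot / (norm_a * norm_b)
```

Applied to an original spectrum and its distorted copy, the median of this score over all spectra is the quantity used above to characterize a noise level.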

Adding noise to the fragmentation spectra might result in an empty or almost empty spectrum, which would be regarded as insufficient for structure annotation in applications. Hence, we removed fragmentation spectra with, at most, one peak. To ensure that evaluation results are comparable between collision energies and noise levels, we discarded the compound from all libraries if a fragmentation spectrum with, at most, one peak resulted for at least one collision energy and noise level. Doing so, 3,314 compounds were removed from the COSMIC training dataset, and 171 compounds were removed from the independent dataset. Substantially more compounds were removed from the COSMIC training dataset because many training dataset spectra have only a few peaks, increasing chances that noisy spectra contain, at most, one peak. Here, 10-eV noisy spectra contain, at most, one peak for 75% of the 3,314 removed compounds; 20-eV noisy spectra for 27%; and 40-eV noisy spectra for 11% (a compound can exhibit sparse spectra for more than one collision energy).
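The removal rule can be sketched as follows: a compound is dropped from every library as soon as one of its spectra, for any collision energy and noise level, has at most one peak. The data layout (library name mapped to compound identifier mapped to peak list) and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Peak = Tuple[float, float]
Library = Dict[str, List[Peak]]  # compound id -> peak list


def sparse_compounds(libraries: Dict[str, Library], max_peaks: int = 1) -> Set[str]:
    """Compounds with at most `max_peaks` peaks in at least one library."""
    sparse: Set[str] = set()
    for spectra in libraries.values():
        for compound, peaks in spectra.items():
            if len(peaks) <= max_peaks:
                sparse.add(compound)
    return sparse


def filter_libraries(libraries: Dict[str, Library],
                     max_peaks: int = 1) -> Dict[str, Library]:
    """Remove every sparse compound from all libraries."""
    drop = sparse_compounds(libraries, max_peaks)
    return {
        name: {c: p for c, p in spectra.items() if c not in drop}
        for name, spectra in libraries.items()
    }
```

After filtering, all libraries derived from the same dataset contain the same set of compounds, which keeps results comparable across collision energies and noise levels.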

This resulted in eight libraries: four libraries with 4,046 compounds each for the COSMIC training dataset and four libraries with 3,291 compounds each for the independent dataset. Notably, the COSMIC training dataset is a proper subset of the CSI training dataset; if we simply refer to ‘training data’ throughout this manuscript, this refers to the full CSI training dataset and includes the COSMIC training dataset. Recall that the COSMIC training dataset contains Orbitrap MS/MS data only, whereas the independent dataset contains QTOF MS/MS data only.

Biological datasets and data processing

For the mice fecal dataset, we analyzed LC–MS/MS data of 278 samples from a public metabolomics dataset (MassIVE data repository, MSV000082973). This dataset comes from a previously published study69. LC–MS/MS experiments were conducted on a Q Exactive Orbitrap instrument (Thermo Fisher Scientific). In brief, the mouse fecal metabolome was analyzed by untargeted metabolomics of aqueous–methanol (1:1) extracts of fecal pellets from specimens of an atherosclerosis mouse model (Mus musculus atherosclerosis-ApoE−/−). Specimens were either exposed or not exposed to intermittent hypoxia and hypercapnia (IHH). In addition, two groups were fed with a high-fat diet (HFD) or a regular diet; each group consists of 28 specimens.

For the human dataset, we analyzed ten MassIVE datasets from the MassIVE data repository (MSV000083559, MSV000079651, MSV000080167, MSV000080469, MSV000080533, MSV000080627, MSV000081351, MSV000082261, MSV000082629 and MSV000082630). The dataset contains fecal, plasma, urine, lips, tongue and teeth samples from humans, all acquired on Q Exactive Orbitrap instruments (Thermo Fisher Scientific) in positive ion mode. Runs were acquired using C18 reversed-phase ultra-high-performance liquid chromatography. Only files with extensions ‘.mzML’ or ‘.mzXML’ were considered, and LC–MS runs containing spectra in profile mode were discarded. This resulted in 2,666 LC–MS/MS runs being processed.

For the Orbitrap dataset, we followed the idea of ‘flipping the workflow’ and reanalyzing public data at a repository scale. We restricted ourselves to MassIVE datasets measured on a Q Exactive Orbitrap instrument (Thermo Fisher Scientific), as this instrument had the largest number of MassIVE datasets. We applied no other constraints with regard to analyzed organism and LC setup, resulting in 264 public MassIVE datasets (downloaded on 20 February 2020). MassIVE datasets containing only spectra in profile or negative ion mode were discarded, leaving us with 123 MassIVE datasets. Sample types range from environmental to natural products and include biological samples from at least 30 different species, covering diverse genera and phyla. Only files with extensions ‘.mzML’ or ‘.mzXML’ were considered, and LC–MS/MS runs containing spectra in profile or negative ion mode were discarded, leading to 17,414 LC–MS/MS runs being processed. See Supplementary Table 4 for a list of all MassIVE datasets.

SIRIUS 4 was used to process LC–MS/MS runs and MassIVE datasets provided in mzML or mzXML format. Feature detection in SIRIUS 4 is similar in spirit to a targeted analysis. Instead of searching for all features in a run, SIRIUS first collects all fragmentation spectra and their precursor information and then searches for features that are associated with those fragmentation spectra (precursor ions, adduct ions and isotope peaks). Adducts and isotopes were detected using predefined lists of mass differences. Fragmentation spectra assigned to the same feature (precursor ion) are merged using an agglomerative clustering algorithm based on cosine distance. Compounds with mass beyond 700 Da were discarded to avoid high running time. MassIVE datasets that exceeded 600 LC–MS/MS runs were split to reduce memory consumption.

We used both isotope patterns and fragmentation patterns to determine the molecular formula de novo using SIRIUS 4 with default parameters and a mass accuracy of 10 ppm. CSI:FingerID with default parameters was used to rank structure candidates. We used the SIRIUS default soft thresholding of molecular formulas when querying CSI:FingerID structure candidates. For confidence score computation, we restricted the candidate list to those candidates with the same molecular formula as the highest-scoring candidate (hit). We used the highest-scoring structure candidate and the corresponding fragmentation tree, isotope pattern and structure candidate list features for COSMIC.
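Restricting the candidate list to the molecular formula of the hit can be sketched as follows. The tuple layout (structure identifier, molecular formula, score) and the function name are hypothetical; they do not reflect the data structures of SIRIUS or CSI:FingerID.

```python
from typing import List, Tuple

Candidate = Tuple[str, str, float]  # (structure id, molecular formula, score)


def restrict_to_hit_formula(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that share the molecular formula of the
    highest-scoring candidate, ordered by decreasing score."""
    if not candidates:
        return []
    hit = max(candidates, key=lambda c: c[2])
    same_formula = [c for c in candidates if c[1] == hit[1]]
    return sorted(same_formula, key=lambda c: c[2], reverse=True)


candidates = [
    ("s1", "C6H12O6", 0.7),
    ("s2", "C5H8O7", 0.9),
    ("s3", "C5H8O7", 0.4),
]
print([c[0] for c in restrict_to_hit_formula(candidates)])  # ['s2', 's3']
```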

For the mice fecal dataset, SIRIUS results were imported into GNPS, and data were further annotated and explored by performing feature-based molecular networking and spectral library search on GNPS. The statistical and fold change analysis was performed using MetaboAnalyst 4.0 (ref. 70) for samples from control mice (not exposed to IHH) that were fed either an HFD or a regular diet.

ROC characteristics and hop plots

We are given a list of hits, one for each query, ordered by score. Each hit can either be positive (correct annotation) or negative (incorrect annotation). Varying a score threshold, we can modify the number of hits reported to the user; our goal is to report all positives and to reject all negatives. True positives (TPs) and false negatives (FNs) are positives (correct hits) that pass or do not pass the threshold; similarly, false positives (FPs) and true negatives (TNs) are incorrect hits that pass or do not pass the threshold. For any score threshold, we plot the true positive rate TP/(TP + FN) (ratio of reported correct hits among all correct hits) against the false positive rate FP/(FP + TN) (ratio of reported incorrect hits among all incorrect hits), resulting in a ROC plot. The AUC of the ROC curve is the integral of the ROC curve; the random score, corresponding to a random ordering of hits, reaches AUC 0.5. A method may reach AUC below 0.5, meaning that the hit score performs worse than random. Different from binary classification, we must not invert ‘predictions’ to reach a better AUC. Logic dictates that the directionality of the hit score (such as, ‘high scores are good’) is fixed by the candidate identification task. The AUC measure makes no difference between the (highly relevant) lower-left and the (mostly irrelevant) upper-right of the ROC curve.
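The construction of the ROC curve and its AUC from a scored hit list can be sketched as follows. Function names are hypothetical; hits with identical scores are treated as one threshold step, and the area is computed with the trapezoidal rule.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (false positive rate, true positive rate)


def roc_points(scores: Sequence[float], correct: Sequence[bool]) -> List[Point]:
    """ROC curve obtained by lowering the score threshold step by step."""
    pos = sum(1 for ok in correct if ok)
    neg = len(correct) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one correct and one incorrect hit")
    order = sorted(zip(scores, correct), key=lambda t: -t[0])
    points: List[Point] = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        j = i
        # Hits with identical score pass the threshold together.
        while j < len(order) and order[j][0] == order[i][0]:
            if order[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points: Sequence[Point]) -> float:
    """Area under a piecewise linear curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


print(auc(roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False])))  # 0.75
```

Note that the hit score is never inverted: an ordering that ranks all incorrect hits first yields AUC 0.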

In contrast to binary classification, two methods can differ in the number of positives (correct hits, correct annotations) that they reach for the complete list of queries. This is a peculiarity of the identification task and has no equivalent in binary classifier evaluation, where the number of positives and negatives is determined by the dataset. ROC curves do not assess the number of positives; in particular, two methods can have identical ROC curves, although one method reaches twice as many correct hits. We introduce hop plots (inspired by the hop plant Humulus lupulus climbing up a supporting wire) to integrate this information. We again vary the score threshold but normalize reported correct hits and incorrect hits by the total number of hits (queries) N = TP + FN + TN + FP, plotting TP/N versus FP/N (Extended Data Fig. 1). The resulting curve starts in the origin (0,0) and ends in some point (x, y) ∈ [0,1]² with x + y = 1, where y is the ratio of correct hits for the complete list of queries. The hop curve lies in the lower-left triangle; random ordering of hits corresponds to a straight line from the origin to some point (x, y) with x + y = 1. For perfect results, the hop curve is a straight line between the origin and (0,1); in the worst case, it is a straight line from the origin to (1,0). Hop plots allow us to answer questions such as, ‘If I fix some FDR, how many true discoveries will a method return?’ We stress that, to draw a ROC curve or a hop plot, we must have complete information about true and false positives and negatives, so we can calculate the exact FDR as FP/(FP + TP). A zoom-in allows us to compare methods in the particularly interesting region close to the origin. Both ROC curves and hop plots allow us to visually compare the performance of a method for different datasets in one plot; here, the total number of hits N is different for each curve.
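The hop curve and the question ‘how many true discoveries at a fixed FDR’ can be sketched as follows. Function names are hypothetical, and the sketch assumes distinct scores, so that each hit corresponds to one threshold step.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (FP/N, TP/N)


def hop_points(scores: Sequence[float], correct: Sequence[bool]) -> List[Point]:
    """Hop curve: correct and incorrect hits normalized by the number of queries."""
    n = len(scores)
    if n == 0:
        raise ValueError("empty hit list")
    order = sorted(zip(scores, correct), key=lambda t: -t[0])
    points: List[Point] = [(0.0, 0.0)]
    tp = fp = 0
    for _, ok in order:
        if ok:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / n))
    return points


def discoveries_at_fdr(scores: Sequence[float], correct: Sequence[bool],
                       max_fdr: float) -> int:
    """Largest number of correct hits among the score-ordered prefixes
    whose exact FDR, FP/(FP + TP), does not exceed `max_fdr`."""
    order = sorted(zip(scores, correct), key=lambda t: -t[0])
    best = 0
    tp = fp = 0
    for _, ok in order:
        if ok:
            tp += 1
        else:
            fp += 1
        if fp / (tp + fp) <= max_fdr:
            best = tp
    return best


scores = [0.9, 0.8, 0.7, 0.6, 0.5]
correct = [True, True, False, True, False]
print(hop_points(scores, correct)[-1])            # (0.4, 0.6)
print(discoveries_at_fdr(scores, correct, 0.25))  # 3
```

The end point (0.4, 0.6) satisfies x + y = 1, and its y coordinate is the ratio of correct hits for the complete list of queries.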

Besides ROC curves, precision recall curves are frequently used to assess the performance of a binary classifier. Similarly to ROC curves, precision recall curves are not appropriate for the identification task, because ‘recall’ is normalized to the number of correct identifications, which is usually different for two methods. As ‘precision’ equals one minus FDR, ‘precision’ can directly be read from a hop plot, too.

We can calculate the AUC of a hop plot by mirroring the curve at the line x + y = 1 before taking the integral. A method with identification rate y ∈ [0,1] for the complete list of queries will have AUC between y² and y² + 2(1 − y)y = 1 − (1 − y)², with random ordering reaching area y² + (1 − y)y = y. But, much like the AUC of a ROC curve, this number does not tell us whether a method performs well at the (highly relevant) lower-left or the (mostly irrelevant) upper-right of the curve; hence, we refrain from reporting hop plot AUC.
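The mirroring construction can be sketched as follows: a point (a, b) is mapped to (1 − b, 1 − a), the mirrored curve is appended to the hop curve and the area under the combined curve is computed with the trapezoidal rule. Function names are hypothetical, and distinct scores are assumed.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (FP/N, TP/N)


def hop_points(scores: Sequence[float], correct: Sequence[bool]) -> List[Point]:
    """Hop curve from the origin to a point on the line x + y = 1."""
    n = len(scores)
    if n == 0:
        raise ValueError("empty hit list")
    order = sorted(zip(scores, correct), key=lambda t: -t[0])
    points: List[Point] = [(0.0, 0.0)]
    tp = fp = 0
    for _, ok in order:
        if ok:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / n))
    return points


def hop_auc(points: Sequence[Point]) -> float:
    """Area under the hop curve extended by its mirror image."""
    # Mirror (a, b) at the line x + y = 1, giving (1 - b, 1 - a).
    mirrored = [(1.0 - b, 1.0 - a) for a, b in points]
    # Walk the hop curve, then the mirrored curve towards (1, 1); the
    # mirror image of the end point is the end point itself and is skipped.
    full = list(points) + mirrored[-2::-1]
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(full, full[1:]))


# Two queries, identification rate y = 0.5.
print(hop_auc(hop_points([2.0, 1.0], [True, False])))  # 0.75, upper bound 1 - (1 - y)^2
print(hop_auc(hop_points([2.0, 1.0], [False, True])))  # 0.25, lower bound y^2
```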

Training CSI:FingerID and structure–disjoint evaluation

We trained an array of SVMs for fingerprint prediction from MS/MS data as described in refs. 15,33,56. Training of CSI:FingerID was carried out using merged spectra with all available collision energies from the CSI training dataset. In contrast, single collision energy and merged spectra libraries, as well as noisified spectra, were not used when training CSI:FingerID but only in validation of COSMIC. We used PubChem-standardized structures71 when computing the molecular fingerprint of a compound. In evaluations, we used the CSI:FingerID ‘covariance score’ from ref. 72 to rank candidates, comparing the probabilistic query fingerprint and each structure candidate fingerprint. A hit was regarded as correct if the PubChem-standardized structures of query and top rank were identical.

As noted above, all evaluations were carried out structure–disjoint. For the ten-fold cross-validation, we partitioned the training data into ten disjoint batches of almost identical size, ensuring that all fragmentation spectra from compounds with identical structure (such as L-threose and D-erythrose) end up in the same batch. Otherwise, L-threose could be part of the training data when evaluating on D-erythrose and vice versa. For each batch, we trained the fingerprint SVM array using the remaining nine batches; we evaluated on the tenth batch. In this way, we ensured that all compounds are novel for CSI:FingerID. For each query, MS/MS training data for the corresponding structure, including independent MS/MS measurements, were not available for CSI:FingerID.
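A structure–disjoint partition can be sketched as follows: spectra are grouped by a key for the two-dimensional structure, and each group is assigned to one batch as a whole. The greedy balancing (largest group first, into the currently smallest batch) is an assumption; the text only requires batches of almost identical size. The function name and the string-valued structure key are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def structure_disjoint_folds(structure_keys: Sequence[str],
                             n_folds: int = 10) -> List[int]:
    """Return one batch index per spectrum such that spectra with the
    same structure key always share a batch."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for index, key in enumerate(structure_keys):
        groups[key].append(index)
    folds = [0] * len(structure_keys)
    sizes = [0] * n_folds
    # Largest groups first; ties broken by key for reproducibility.
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        target = sizes.index(min(sizes))
        for index in groups[key]:
            folds[index] = target
        sizes[target] += len(groups[key])
    return folds


# Stereoisomers share one key, here the hypothetical key "tetrose".
keys = ["tetrose", "tetrose", "B", "C", "C", "C"]
print(structure_disjoint_folds(keys, n_folds=2))  # [1, 1, 1, 0, 0, 0]
```

Because both spectra with key "tetrose" land in the same batch, a model evaluated on one of them has never seen the other during training.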

CSI:FingerID evaluations on the independent dataset were again executed structure–disjoint. We additionally trained an SVM array using the complete CSI training dataset. Given an MS/MS query from the independent data, we checked if the structure of the query is also part of the training data. If so, we used the appropriate SVM array from cross-validation for fingerprint prediction; otherwise, we used the SVM array trained on the complete training data. Again, this ensured that all structures were novel in evaluation.
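The model choice for an independent query can be sketched as a simple lookup. The mapping from structure key to cross-validation batch and the returned labels are hypothetical; ‘appropriate SVM array’ is read here as the array trained without the batch that contains the query structure.

```python
from typing import Dict


def choose_svm_array(query_structure: str,
                     batch_of_structure: Dict[str, int]) -> str:
    """Pick the SVM array that has not seen the query structure."""
    batch = batch_of_structure.get(query_structure)
    if batch is None:
        # Structure absent from the training data: use the full model.
        return "full"
    # Structure in batch k: use the array trained without batch k.
    return f"cv-without-batch-{batch}"


batches = {"structA": 3, "structB": 7}
print(choose_svm_array("structA", batches))  # cv-without-batch-3
print(choose_svm_array("structZ", batches))  # full
```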

Score calibration and E-value estimation

The P value of a score is the probability that a score this high or higher would be expected by chance; the E value is the expected number of random hits with this score or higher. Kim et al.73 suggested using E values for peptide database searching; MS-GF E-value computation uses dynamic programming, based on the linear nature of peptides. Keich et al.29 calibrated peptide database search scores using decoys. Both approaches are conceptually hard to adapt for metabolite annotation. Metabolites have highly non-linear structure, and no methods have been suggested to generate reasonable decoy molecular structures for small molecules12.

We suggest using the distribution of scores of PubChem30 candidates as a proxy for the score distribution of incorrect hits. We empirically established that, for some MS/MS queries, the candidate scores roughly followed a log-normal distribution; for other queries, the score distribution was multimodal (Supplementary Fig. 3). In particular, a small fraction of candidates had a much higher score than expected from the single log-normal distribution; ignoring this would result in inflated calibrated scores.

The log-normal distribution is a reasonable proxy if there are only few samples available. To model multimodal distributions as well as distributions that deviate from the log-normal distribution, we suggest using a kernel density estimate of the probability density function. Clearly, we do not have to ‘compute’ the kernel density; instead, we want to know the E value under the resulting distribution. For the ease of presentation, we do not use log-normal kernel functions but, instead, model the log-transform of the scores by normal kernel functions, which is mathematically equivalent. Let \(y_{i}:=\ln x_{i}\) for i = 1, …, n be the log-scores of the PubChem ‘proxy decoys’ excluding the hit score, and let \(y:=\ln x\) be the log-score of the hit. We first determine the bandwidth of the kernel function; we use Silverman’s rule of thumb, first determining the standard deviation \(\hat{\sigma }\) of the sample \(y_{1},\ldots ,y_{n}\) and then setting

$$h:=1.059223841\times \hat{\sigma }{n}^{-1/5}.$$

We also tested other bandwidth estimation procedures but did not find a substantial difference (data not shown). For the Gaussian kernel \(K(u):=\frac{1}{\sqrt{2\pi }}\exp (-\frac{1}{2}{u}^{2})\), we reach

$$K\left(\frac{y-{y}_{i}}{h}\right)=\frac{1}{\sqrt{2\pi }}\exp \left(-\frac{{(y-{y}_{i})}^{2}}{2{h}^{2}}\right)$$

so this is just the usual probability density function of the normal distribution times h, which cancels out in the kernel estimator. We calculate

$$\,{E}\,=\frac{m}{n}\times \mathop{\sum}\limits_{i=1,\ldots ,n}\left[\frac{1}{2}-\frac{1}{2}{{\mathrm{erf}}}\,\left(\frac{y-{y}_{i}}{\sqrt{2}h}\right)\right]$$ (1)

where m is the number of candidates in the biomolecule structure database.
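The E value of equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, all scores are assumed to be strictly positive so that the logarithm is defined, and the sample standard deviation (n − 1 in the denominator) is an assumption, as the text does not say which estimator was used.

```python
import math
import statistics


def silverman_bandwidth(log_scores):
    """Bandwidth h = 1.059223841 * sigma * n^(-1/5) for a Gaussian kernel."""
    n = len(log_scores)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sigma = statistics.stdev(log_scores)
    return 1.059223841 * sigma * n ** (-1 / 5)


def kde_e_value(hit_score, decoy_scores, m):
    """E value of a hit under a Gaussian kernel density estimate, equation (1).

    hit_score    -- score x of the hit (assumed > 0)
    decoy_scores -- scores x_1..x_n of the proxy decoys (assumed > 0)
    m            -- number of candidates in the searched database
    """
    y = math.log(hit_score)
    ys = [math.log(x) for x in decoy_scores]
    h = silverman_bandwidth(ys)
    # Each summand is the upper tail of a normal distribution centered at y_i
    # with standard deviation h, evaluated at the log-score of the hit.
    tail = sum(0.5 - 0.5 * math.erf((y - yi) / (math.sqrt(2) * h)) for yi in ys)
    return m / len(ys) * tail
```

For decoy scores placed symmetrically around the hit on the log scale, the tail terms sum to n/2 and the E value is m/2; a hit far above all decoys receives an E value close to zero.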

Confidence score computation

Our method of confidence estimation is inspired by the Percolator method for peptide identification in shotgun proteomics74,75. Unlike Percolator but similar to refs. 76,77, we do not train a classifier for an individual LC–MS run to ‘boost’ annotations; instead, we train classifiers only once using the reference measurements, which are then applied to the biological data. As noted by Käll et al.74, this approach is highly prone to overfitting. Characteristics of correct and incorrect hits might vary among experiments, instrument types, compounds present in the sample and others. Here, we have taken extensive measures to counter overfitting, such as ‘noisifying’ spectra and the restriction to linear SVMs.

We repeated the following for each collision energy (10 eV, 20 eV, 40 eV and merged spectra) and trained individual SVMs using spectra without added noise from that energy as training data. Features of the linear SVMs are shown in Supplementary Table 1. All features were individually standardized. Parameter \(C\in \{10^{-5},10^{-4},\ldots ,10^{5}\}\) of each SVM was chosen by a nested cross-validation. We used quadratic hinge loss and \({\ell }_{2}\) regularization. SVMs were trained using LIBLINEAR78.

For each collision energy, we trained three classifiers. (1) When searching PubChem, we used all appropriate features (all but Features 20–22) from Supplementary Table 1. Searching the biomolecule structure database, not all queries result in two or more candidates; but some features from Supplementary Table 1 require a candidate list of at least size two, such as the difference between score of highest-scoring versus runner-up candidate. To this end, we trained two classifiers for the biomolecule structure database. (2) The regular SVM assumes that there are at least two candidates; it uses all features from Supplementary Table 1 but is trained only on the appropriate subset of the training data. (3) The single-candidate SVM uses only the appropriate sub-features (all but Features 1–4, 10 and 13) but can be trained using all training data. For instances with two or more candidates, we uniformly selected one candidate.

The resulting linear classifiers showed clear signs of overfitting. For example, some features received weights that were counterintuitive, such as negative weight for the quality of the SIRIUS fragmentation tree or the CSI:FingerID score. Recall that the actual hit was chosen by CSI:FingerID as the candidate with the highest score; hence, logic dictates that the CSI:FingerID score of the hit must not receive a negative weight when deciding whether a hit is correct or incorrect. The same is true for selecting the best fragmentation tree by SIRIUS. To this end, we enforced directionality of the features. For each feature, we decided manually whether a high value of the feature would increase or decrease our confidence in an annotation. For example, a high CSI:FingerID score should clearly increase our confidence and so should a small E value. See Supplementary Table 1 for enforced directions. Notably, enforcing directionality can be achieved by a regular SVM optimization without additional constraints, allowing us to use established SVM solvers. For each feature with enforced directionality, we added one training sample in which the corresponding feature was set to a large (positive or negative) value ±β, whereas all other features were kept at zero; the sample received a positive label (correct hit). If the absolute feature value β > 0 is large enough, then an optimal solution must use the feature in the desired direction; the actual value β is of minor importance due to the hinge loss of SVM optimization. To avoid potential numerical instabilities when finding the solution, β should not be chosen too large. Here, we used β = \(10^{7}\); using absolute feature values \(10^{8}\) and \(10^{9}\) resulted in basically identical models, and differences are of no practical consequence (data not shown).
Notably, some features received non-zero weights for the classifier with enforced directionality, despite the fact that these features received ‘counter-intuitive’ weights in the unrestricted optimization. For example, feature ‘FP Length Hit’ was repeatedly given negative weight in cross-validation but had high positive weight if we enforced directionality (unrestricted weight −0.00165, restricted weight 0.0568 in the same cross-validation fold).
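The pseudo-sample construction described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name and the `directions` mapping are hypothetical, and the resulting data would still have to be passed to a linear SVM solver, which is not shown here.

```python
def add_direction_samples(X, labels, directions, beta=1e7):
    """Append one pseudo-sample per feature with enforced directionality.

    X          -- list of feature rows (already standardized)
    labels     -- list of +1 (correct hit) / -1 (incorrect hit)
    directions -- hypothetical mapping: feature index -> +1 if a high value
                  should raise confidence, -1 if it should lower confidence
    beta       -- absolute value used for the constrained feature
    """
    n_features = len(X[0])
    X_aug = [list(row) for row in X]
    labels_aug = list(labels)
    for j, sign in sorted(directions.items()):
        row = [0.0] * n_features
        row[j] = sign * beta    # +beta or -beta in feature j, zero elsewhere
        X_aug.append(row)
        labels_aug.append(+1)   # every pseudo-sample is labeled as correct
    return X_aug, labels_aug
```

A weight with the wrong sign for feature j would place the corresponding pseudo-sample far on the wrong side of the margin and incur a very large hinge loss, which is why the optimizer settles on the desired sign.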

When training the COSMIC SVMs, all CSI:FingerID fingerprint predictions of training spectra were carried out structure–disjoint using CSI:FingerID cross-validation models. The COSMIC training dataset was then partitioned for ten-fold cross-validation in the same fashion as for CSI:FingerID training. Hence, cross-validation evaluation of COSMIC is again structure–disjoint, and all compounds are novel. Similarly to above, we also ensured structure–disjoint evaluations on the independent dataset by choosing the appropriate SVM from cross-validation for computing the confidence score. When applying the model to independent data, we capped feature values. For each feature from Supplementary Table 1, we record the minimum and maximum feature value in our training data. When applying the model, feature values exceeding these thresholds are set to the respective threshold value. We do so to prevent exaggerated decision values caused by unexpectedly high values of one or more features.
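The capping of feature values amounts to clamping each feature to the range observed during training. A minimal sketch, with hypothetical function names:

```python
def fit_caps(training_rows):
    """Record the (minimum, maximum) of every feature over the training data."""
    columns = list(zip(*training_rows))
    return [(min(col), max(col)) for col in columns]


def cap_features(row, caps):
    """Set values outside the training range to the respective threshold."""
    return [min(max(value, lo), hi) for value, (lo, hi) in zip(row, caps)]
```

Values inside the training range pass through unchanged, so capping only affects hits with feature values never seen in training.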

We map decision values to posterior probability estimates using Platt probabilities31. Platt31 proposed to use a sigmoid function as an approximation of posterior probabilities: \({\mathbb{P}}(y=\,{{\mbox{correct}}}\,| x)\approx {P}_{A,B}(f)\equiv \frac{1}{1+\exp (Af+B)}\), where \(f=f(x)\in {\mathbb{R}}\) is the decision value for hit x and y ∈ {correct, incorrect} is its label. We estimated parameters \(A,B\in {\mathbb{R}}\) using maximum likelihood31,79 as implemented in LIBSVM80.
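Once A and B have been fitted, applying the sigmoid is a one-line computation. The sketch below only evaluates the function; it does not perform the maximum likelihood fit, and the function name is hypothetical. The two branches avoid floating-point overflow for decision values of large magnitude.

```python
import math


def platt_probability(f, A, B):
    """Evaluate P(correct | f) ~ 1 / (1 + exp(A*f + B)) without overflow."""
    z = A * f + B
    if z >= 0:
        e = math.exp(-z)          # rewrite as exp(-z) / (1 + exp(-z))
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))
```

With a negative A, as is typical for a fit in which larger decision values indicate correct hits, the estimated probability increases with the decision value.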

Using a linear classifier enables explainable machine learning; see Supplementary Table 1 for feature weights after normalization of the three classifiers for merged spectra. We observe that certain features have weight close to zero; this might indicate that the feature is indeed uninformative, that the feature does not measure what we intended to measure or that our training data are insufficient to learn a reasonable weight.

Recall that confidence SVMs were trained exclusively on spectra without added noise. We also trained SVMs from a combined dataset with all noise levels but found that results were of identical quality when applied to the same evaluation dataset (data not shown).

Unlike Percolator74,75, we do not learn a confidence score for individual LC–MS datasets. We do so because it is non-trivial to generate reasonable decoys for small molecules and, more importantly, because incorrect hits in the target database are often not random (Fig. 4)35. This potentially explains why the calibrated E-value score presented here does not allow for a satisfactory separation. Also unlike Percolator, we do not use our scores to re-rank candidates74,75. All of our candidates share the same molecular formula, fragmentation tree and predicted fingerprint; these features are meaningless for re-ranking. To this end, curves of CSI:FingerID and COSMIC in hop plots (Figs. 2 and 5) always end in the same point (x,y) with x + y = 1.

In application, a model with the exact collision energy of the experimental measurement might not be available; in this case, the model with the smallest difference in collision energy (such as the 40-eV model for 35 eV collision energy) is chosen by COSMIC.
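Selecting the model by smallest difference in collision energy can be sketched as below. The function name is hypothetical, and the tie-breaking rule (prefer the higher energy when two models are equally close) is an assumption of this sketch that the text does not state.

```python
def closest_energy_model(measured_ev, available_ev=(10, 20, 40)):
    """Return the trained collision energy closest to the measured one.

    Assumption: ties are broken in favor of the higher energy.
    """
    return min(available_ev, key=lambda e: (abs(e - measured_ev), -e))
```

For example, a measurement at 35 eV is mapped to the 40-eV model.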

FDR estimation

Recall that the FDR equals FP/(FP + TP) where TP is the number of true positives (correct hits above some score threshold) and FP is the number of false positives (incorrect hits above the same score threshold). Also recall that, to compute this exact FDR, we must know the exact numbers FP and TP. However, in applications, we do not have this information; in this case, we need some method to estimate FDR values. Returning random numbers would be an admissible method for FDR estimation, albeit a useless one; to this end, a method for FDR estimation has to be validated against exact FDR values, to assess its accuracy. In application, a user selects an acceptable FDR level, and we want to return as many hits as possible so that the list of hits meets the pre-selected FDR. The q value of a hit is the smallest FDR at which this hit is part of the output list.
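On reference data, where each hit is known to be correct or incorrect, exact FDR and q values follow directly from these definitions. A minimal sketch with a hypothetical function name, assuming hits are already sorted by decreasing score and that score ties are not treated specially:

```python
def exact_q_values(is_correct):
    """Exact q values for hits sorted by decreasing score.

    is_correct -- list of booleans, True for a correct hit
    """
    fdr = []
    false_positives = 0
    for k, correct in enumerate(is_correct, start=1):
        if not correct:
            false_positives += 1
        fdr.append(false_positives / k)   # FP / (FP + TP) for the top k hits
    # q value: smallest FDR among all output lists that contain this hit,
    # that is, the running minimum of the FDR taken from the end of the list.
    q = fdr[:]
    for i in range(len(q) - 2, -1, -1):
        q[i] = min(q[i], q[i + 1])
    return q
```

The backward pass is needed because the exact FDR is not monotone in k: adding correct hits further down the list can lower the FDR again.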

We now show how to transform COSMIC confidence scores to FDR estimates. The confidence score is an estimated posterior probability of the hit being correct; to this end, it is one minus the posterior error probability for this hit. Hence, we can use the confidence score to estimate the FDR of the top k hits12,36. Let \({p}_{j}\) be the posterior error probability for hit j for j = 1, …, n and assume that the hits are ordered by confidence score, so \({p}_{j}\le {p}_{j+1}\). Viewing the annotations as (not necessarily independent) Bernoulli trials, the expected number of incorrect annotations for the top k hits is \(\mathop{\sum }\nolimits_{j = 1}^{k}{p}_{j}\), and the expected FDR is

$${\widehat{{FDR}}}_{k}=\frac{1}{k}\times \mathop{\sum }\limits_{j=1}^{k}{p}_{j}.$$ (2)

Because hits have been ordered by posterior error probability, FDR estimates \({\widehat{{FDR}}}_{k}\) are monotonically increasing, so \({\widehat{{FDR}}}_{k}\) is also the q value estimate for hit k.
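Equation (2) is a running mean of posterior error probabilities. The following sketch, with a hypothetical function name, takes confidence scores as input and returns the FDR estimate for every prefix of the ordered hit list:

```python
def estimated_fdr(confidences):
    """FDR estimates for the top k hits, k = 1..n, following equation (2).

    confidences -- estimated posterior probabilities of hits being correct
    """
    # Posterior error probabilities, ordered from most to least confident hit.
    peps = sorted(1.0 - c for c in confidences)
    estimates = []
    running_sum = 0.0
    for k, p in enumerate(peps, start=1):
        running_sum += p
        estimates.append(running_sum / k)
    return estimates
```

Because each newly added error probability is at least as large as the current mean, the returned list never decreases, so its k-th entry can be used directly as the q value estimate of hit k.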

We evaluate the accuracy of our FDR estimates by plotting exact q values against estimated q values in a Q–Q plot (Extended Data Fig. 5); this has to be carried out using reference data where exact FDR values can be calculated.

Comparing molecular structures

The Tanimoto coefficient measures the similarity of two molecular structures. Any Tanimoto coefficient is based on a particular set of molecular properties, constituting the fingerprint type. For consistency, we use the same fingerprint type (molecular properties) throughout this manuscript that we have trained SVMs for as part of CSI:FingerID. The Tanimoto coefficient is the Jaccard index of the two sets of molecular properties—that is, the cardinality of the intersection of the two sets divided by the cardinality of the union of the two sets. The advantage of the Tanimoto coefficient is that it can be quickly calculated, in particular if we have pre-computed the fingerprints of all molecular structures of interest.
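With fingerprints represented as sets of the molecular properties that are present, the Tanimoto coefficient is a short set computation. The function name and the convention for two empty fingerprints are assumptions of this sketch:

```python
def tanimoto(fp_a, fp_b):
    """Jaccard index of two fingerprints given as sets of property indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty fingerprints count as identical
    return len(a & b) / len(union)
```

For example, fingerprints {1, 2, 3} and {2, 3, 4} share two of four distinct properties, giving a coefficient of 0.5.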

For highly similar molecular structures, such as the pairs in Fig. 4, it is not advisable to employ the Tanimoto coefficient, as it is not apt to accurately measure such high similarity. Instead, we represent the two molecular structures as graphs and ask for a minimum number of edges that have to be removed from the graphs, such that the resulting graphs are isomorphic; naturally, hydrogen atoms are ignored in this computation. This is the maximum common edge subgraph (MCES) problem. Using the number of removed edges to estimate dissimilarity is an appropriate measure for highly similar molecules, as we explicitly do not demand that the resulting subgraph is connected. The MCES problem is NP-complete, as it generalizes subgraph isomorphism. See, for example, ref. 81 for a discussion of available methods for solving MCES exactly and heuristically.

For the molecular structures in Fig. 4, it is straightforward to manually find optimal solutions. The ‘top hit’ structure can be transformed into the ‘correct hit’ structure via two edge deletions for examples a–c and f–i, whereas examples d and e require four edge deletions. Because both graphs have the same number of edges, we require at least two edge deletions for non-isomorphic graphs.

CASMI 2016 re-evaluation

Scores of MetFrag, MAGMa+, CFM-ID, CSI:FingerID (original) and CSI:FingerID IOKR were downloaded from the CASMI 2016 results web page (http://casmi-contest.org/2016/, category 2, automated methods). We only consider tools that scored all candidates. CSI:FingerID (original) and CSI:FingerID IOKR were not executed structure–disjoint, as CASMI is a blinded competition. We computed scores for the structure–disjoint evaluation of CSI:FingerID using CSI:FingerID 1.2.0.

We used hit scores (score of the top-ranked candidate for each query) to order hits. For consistency, we restricted the set of candidate structures to those with the correct molecular formula for all tools. We performed evaluation either using all ChemSpider candidates or restricting the search to those ChemSpider candidates that are simultaneously found in our biomolecule structure database. In four cases, this resulted in an empty list of candidates, and these queries were excluded from evaluation. In 13 cases, the set of candidates no longer contained the correct structure; these queries were not excluded from evaluation. As expected82, MetFrag, MAGMa+ and CFM-ID profit more from restricting the set of candidates than CSI:FingerID15; hence, annotation rates varied less than those reported in the CASMI evaluation18. In fact, even randomly choosing one of the remaining candidates resulted in a decent annotation rate when searching the biomolecule structure database. In 38 cases, only a single candidate remained; and, in 33 cases, the candidate list contained two or three structures. Even if there is only a single candidate, the score that some in silico tool assigns to this candidate is important information, as we use it to order hits.

The fact that scores of in silico tools, including CSI:FingerID, cannot be used to decently separate correct and incorrect hits might be unexpected for users, but tools and scores were not developed with this application in mind. To this end, our findings must not be misunderstood as a critique of these tools or their developers.

COSMIC confidence scores were computed as described above, using the confidence score model for ‘merged spectra’. We ensured structure–disjoint evaluation (all compounds novel) for both CSI:FingerID and COSMIC, as detailed above. For both ChemSpider and the biomolecule structure database, we used the confidence score variant for searching the biomolecule structure database; this is reasonable as the number of ChemSpider candidates is often substantially smaller than the number of PubChem candidates.

For completeness, we also evaluated separation of the original submissions of CSI:FingerID and CSI:FingerID IOKR (Supplementary Fig. 2). As noted, these evaluations were not carried out structure–disjoint; hence, results mix de-replication (structures for which MS/MS data are available in the training data) and novel structure search. We cannot compute confidence scores for the original CSI:FingerID submission, as features required for its computation (Supplementary Table 1) were not recorded when submitting the CASMI entry.

In-depth method evaluation

For a query fragmentation spectrum, we again assume to know its molecular formula, and we obtained candidates from the structure databases using this molecular formula. In practice, molecular formulas can be established using SIRIUS 4 (ref. 33) or ZODIAC83. For 325 compounds in the COSMIC training dataset and 278 compounds in the independent data, this resulted in an empty candidate list when querying the biomolecule structure database; these compounds were excluded from evaluation, leaving us with 3,721 queries in cross-validation and 3,013 queries for independent data. For 845 compounds in the COSMIC training dataset and 521 compounds in the independent data, the correct structure is not present in the biomolecule structure database; these compounds were not excluded. We ensured structure–disjoint evaluation (all compounds novel) for both CSI:FingerID and COSMIC.

To evaluate against spectral library search, we generated two spectral libraries based on the CSI training dataset: one library with merged spectra and one library with spectra at individual collision energies as well as merged spectra. We searched merged query spectra in the first library and query spectra containing a single collision energy in the second library. Merged spectra are identical to those used for training CSI:FingerID (see above); this library contains 23,965 spectra. The second library contains all available fragmentation spectra at all available collision energies, plus the merged spectra, and contains 189,979 spectra. Notably, the spectral library contains MS/MS data from QTOF and Orbitrap instruments, whereas all query MS/MS spectra are QTOF data. We argue that this resembles how searching in a public or commercial spectral library is executed in practice. The situation is clearly different for an in-house spectral library, but such libraries are usually one to two orders of magnitude smaller. For 821 query compounds, the correct structure is not present in the spectral library; as for COSMIC, these compounds were not excluded from evaluation. To ensure a fair comparison with COSMIC, spectral library search candidates were restricted to those with the correct molecular formula for each query; in practice, this information is usually not available, and spectral library search might perform worse than reported here. In case the spectral library did not contain at least one candidate with the correct molecular formula of the query, a misannotation with score zero was assumed. We evaluated both the cosine score described above and a cosine score using the square root of intensities.

We also evaluated spectral library search when restricting library spectra to the ‘correct’ collision energy (closest energy from 15 eV to 25 eV for 20-eV queries) but found that both annotation rates and separation were substantially worse than for the combined library (data not shown).

The term ‘spectral library search’ refers to searching for a query fragmentation spectrum in a database of reference fragmentation spectra measured from (usually commercial) standards and then reporting the highest-scoring candidate (hit) under some scoring. Spectral library search must not be confused with the task of comparing mass spectra, manually or automatically, or with computing a measure of similarity between spectra, such as the cosine score. Comparison of mass spectra is in use for many research questions beyond spectral library search. This includes the manual confirmation of annotations, MASST28, as well as CSI:FingerID (and, hence, COSMIC), which uses the cosine score as part of its machine learning framework.

Sciex dataset evaluation

We queried the biomolecule structure database using the positive ionization mode data. As in the other evaluations, we assumed that the correct molecular formula of each query was known. For 13 queries, this resulted in an empty candidate list; these instances were excluded from our evaluation. For the remaining 301 queries, the correct answer was not present in the biomolecule structure database in four cases; these queries were not excluded. Because fragmentation spectra were recorded as ramp spectra, we used the ‘merged spectra’ confidence score model. We ensured structure–disjoint evaluation both for CSI:FingerID and COSMIC. Nineteen structures from the Sciex dataset were not present in the training data.

Annotation, manual confirmation and validation of novel bile acid conjugates

For the mouse fecal dataset, MS/MS measurements were taken with a collision energy of 30 eV; we used the COSMIC version trained on 40-eV spectra. The bile acid conjugates structure database was used for the annotation. No additional parameters have to be chosen in the COSMIC workflow.

The output of this workflow is an ordered list of 1,456 COSMIC structure annotations (‘MS features’; Supplementary Data 1). In case multiple compounds were annotated with the same structure (for example, compounds being present in multiple runs and different adducts of the same compound), entries in the COSMIC output were merged and represented by the hit with the highest confidence. This reduces the output to 626 unique structure annotations (Supplementary Table 2). Of these, 113 were present in PubChem. Here, we concentrated on the 513 ‘truly novel’ bile acid conjugates. The q value estimates reported in Fig. 6a were computed via eq. (2) using only the ‘truly novel’ bile acid conjugates.

The top 12 most confident bile acid conjugate annotations were manually inspected, and the fragmentation was interpreted to check consistency with the structure proposed by COSMIC (Supplementary Figs. 8–19). The fragmentation of bile acid conjugates is characterized by fragment ions and neutral losses from the conjugated amino acid moiety as well as the hydroxylation pattern of the bile acid core. Annotations of two ‘truly novel’ bile acid conjugates, the phenylalanine (Phe) and tryptophan (Trp) conjugates of chenodeoxycholic acid (CDCA), were verified by comparing their fragmentation spectra and retention times with those of synthetic standards. Phe-CDCA (7) and Trp-CDCA (12) were synthesized using a procedure adapted from a previous method by Ezawa et al.84. Chenodeoxycholic acid (98.1 mg, 0.25 mmol, 1 eq.) was dissolved in THF (4.9 ml, 0.05 M) and cooled to 0 °C with stirring. Ethyl chloroformate (28 μl, 1.2 eq.) was added, followed by triethylamine (41 μl, 1.2 eq.); then, the reaction was stirred for 2 h in an ice bath. After complete conversion of the starting material by TLC, a cold, aqueous solution (4.9 ml) of amino acid (0.37 mmol, 1.5 eq.) and NaOH (14.8 mg, 0.37 mmol, 1.5 eq.) was added in one portion. The reaction was then stirred for 2 h, gradually warming to room temperature. THF was removed in vacuo, and 2 M HCl was added to acidify to pH < 2, at which point a white precipitate appeared. The mixture was extracted with ethyl acetate (3 × 20 ml), and the combined organic layers were washed with brine (1 × 50 ml), dried over sodium sulfate and concentrated. The crude material was purified over silica gel by column chromatography eluting with 3–10% methanol in dichloromethane (plus 1% acetic acid, vol/vol) to yield the desired products as confirmed by NMR spectroscopy. NMR spectra were recorded on a Bruker Avance (600 MHz, CryoProbe) spectrometer in CD3OD. Signals are reported in ppm with the internal CD3OD signal at 3.31 ppm (1H) and 49.0 ppm (13C) as standard reference peak.

Phenylalanine-conjugated chenodeoxycholic acid (Phe-CDCA): Product was obtained as a white solid in 91% yield. 1H NMR (599 MHz, MeOD): 7.29–7.18 (m, 5H), 4.67–4.61 (m, 1H), 3.81–3.78 (m, 1H), 3.42–3.33 (m, 1H), 3.22 (dd, J = 14.4, 4.8 Hz, 1H), 2.93 (dd, J = 13.8, 9.0 Hz, 1H), 2.27 (q, J = 12.0 Hz, 1H), 2.22–2.17 (m, 1H), 2.10–2.03 (m, 1H), 2.01–1.94 (m, 2H), 1.90–1.81 (m, 3H), 1.77–1.57 (m, 4H), 1.54–1.45 (m, 4H), 1.41–1.26 (m, 5H), 1.24–1.04 (m, 5H), 1.03–0.95 (m, 1H), 0.93–0.86 (m, 7H). 13C (151 MHz, MeOD): 175.22, 137.28, 128.88, 128.00, 126.32, 71.47, 67.66, 55.94, 50.13, 42.26, 41.78, 39.65, 39.37, 39.08, 37.06, 35.44, 35.17, 34.83, 34.51, 32.65, 32.45, 31.82, 29.96, 27.84, 23.23, 22.02, 20.39, 17.48, 10.81.

Tryptophan-conjugated chenodeoxycholic acid (Trp-CDCA): Product was obtained as a white solid in 42% yield. 1H NMR (599 MHz, MeOD): 7.56 (d, J = 7.8 Hz, 1H), 7.33 (d, J = 7.8 Hz, 1H), 7.10–7.06 (m, 2H), 7.00 (t, J = 7.8 Hz, 1H), 4.73 (dd, J = 8.4, 4.8 Hz, 1H), 3.81–3.77 (m, 1H), 3.41–3.32 (m, 2H), 3.18–3.13 (m, 1H), 2.31–2.16 (m, 2H), 2.10–2.03 (m, 1H), 1.98–1.93 (m, 2H), 1.88–1.78 (m, 3H), 1.73–1.63 (m, 3H), 1.63–1.58 (m, 1H), 1.54–1.43 (m, 5H), 1.41–1.25 (m, 5H), 1.24–0.94 (m, 7H), 0.91 (s, 3H), 0.89 (d, J = 7.2 Hz, 3H), 0.63 (s, 3H). 13C (151 MHz, MeOD): 176.65, 175.39, 138.01, 128.87, 124.24, 122.37, 119.78, 112.28, 111.12, 72.84, 69.07, 57.22, 54.61, 51.49, 43.62, 43.14, 41.00, 40.72, 40.45, 36.78, 36.53, 36.20, 35.89, 34.02, 33.78, 32.99, 31.33, 29.16, 28.45, 24.60, 23.39, 21.76, 18.86, 12.15.

Reference standards were analyzed by LC–MS/MS using identical experimental conditions as used previously85. Retention times were 298.5 s for Phe-CDCA and 294.5 s for Trp-CDCA. Samples from the previous study were re-analyzed to ensure comparability of retention times: the putative Phe-CDCA and Trp-CDCA candidates had retention times of 300.5 s and 294.5 s, respectively. Considering the similarity of their fragmentation spectra (Extended Data Fig. 8) and retention times, these are MSI level 1 identifications. However, these identifications are not unambiguous: isomeric structures, such as Phe-deoxycholic acid, would show the same fragmentation spectrum and the same or very similar retention time. For a conclusive decision, a more detailed analysis method would be required, which is out of the scope of this paper.

Molecular networks were visualized in Cytoscape (v3.7.1)86. The MetaboAnalyst web server70 was used to process the feature quantification results and perform statistical analysis in Fig. 6c. Quantile normalization and auto-scaling were used. Results of the fold change analysis were mapped onto molecular networks using Cytoscape. Primary (cholic acid, ursocholic acid and muricholic acid) and secondary (deoxycholic acid, hyodeoxycholic acid and chenodeoxycholic acid) bile acids and historically known bile acid conjugates (glycocholic acid and taurocholic acid) were annotated by spectral library search. Similarly, the three bile acid conjugates from ref. 40 (tyrosocholic acid, phenylalanocholic acid and leucocholic acid) were annotated by spectral library search. For visualizing the relative feature intensity and ratio (Supplementary Figs. 22–28), box plots were generated directly from the feature quantification results (no normalization and scaling applied). We chose taurocholic acid, deoxycholic acid and hyodeoxycholic acid to compute ratios in Supplementary Figs. 24–26 as these are highly abundant in rat bile and rat feces87. MASST28 was used to search the annotated bile acid conjugates spectra in all public mass spectrometry datasets, including MassIVE-GNPS37, MetaboLights4 and Metabolomics Workbench5. Parameters and results for these jobs are part of Supplementary Table 2.

Repository-scale annotation of novel metabolites

To estimate a reasonable COSMIC confidence score cutoff, we made use of our reference data evaluation results. In our evaluation using independent data, collision energy of 20 eV and medium noise, a confidence score threshold of 0.64 corresponded to FDR 10%. Our implicit assumption is that, for the biological data, this threshold will correspond to a similar FDR. It must be understood that we cannot guarantee a similar FDR for structure annotations below, given our inability to accurately estimate FDR. Clearly, many hits with confidence below this threshold will nevertheless be correct.

We searched the human dataset against the biomolecule structure database; this resulted in 114,012 hits. Multiple hits can annotate the same structure; for example, these hits might originate from different LC–MS/MS runs or different adducts. Hence, we report unique structures instead, where the hit with the highest confidence is used as a representative for that structure. This resulted in 24,554 unique structures being annotated, of which 3,167 (12.9%) were present in the CSI training dataset. We now filter the 24,554 structure annotations for high confidence (score threshold 0.64), resulting in 911 structure annotations. Of these high-confidence annotations, 475 (52.1%) were present in the CSI training dataset, leaving us with 436 (47.9%) high-confidence novel structure annotations. Finally, we excluded all hits with structures in the HMDB structure database, resulting in 21,128 unique structure annotations, 436 high-confidence structure annotations and 315 high-confidence structure annotations without reference MS/MS data (Extended Data Fig. 10). Of the 315 novel structures, 48 were proteinogenic peptides.
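The reduction from hits to unique structures and the subsequent confidence filter can be sketched as follows. The function names and the representation of a hit as a (structure key, confidence) pair are hypothetical, and treating the threshold as inclusive is an assumption.

```python
def unique_structures(hits):
    """Keep, for every structure, the highest confidence among its hits.

    hits -- iterable of (structure_key, confidence) pairs
    """
    best = {}
    for structure, confidence in hits:
        if structure not in best or confidence > best[structure]:
            best[structure] = confidence
    return best


def high_confidence(best, threshold=0.64):
    """Select structures whose representative hit reaches the threshold."""
    # Assumption: a confidence equal to the threshold is kept.
    return {s: c for s, c in best.items() if c >= threshold}
```

Hits for the same structure from different runs or adducts thus collapse into a single entry before the threshold is applied.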

We searched the first 14 characters of the InChI keys (the connectivity block) of all 267 novel metabolite structures in the current version of the HMDB (February 2021) and found that at least 23 of these structures are present in the current HMDB version. The exact number might be slightly higher, as structures from the current HMDB version were not standardized using the PubChem standardization procedure. Notably, the recent inclusion of structures in the HMDB does not mean that reference MS/MS data are available for these structures.

High-confidence hits were manually evaluated by checking spectra for known neutral losses and fragments that can be explained. Furthermore, spectra were compared against reference spectra from similar structures. The following paragraphs discuss some high-confidence annotations where evaluation based on manual interpretation or newly generated reference spectra was possible. For none of the structures verified by spectral comparison ((2E)-octenoyl-carnitine, N-oleyl-leucine, phenazine-1,6-dicarboxylic acid) were reference spectra available in the training data of COSMIC or CSI:FingerID.

First, acyl-carnitine structures were evaluated by their typical fragmentation. Characteristic fragments are found at m/z 85 and m/z 144. These are derived from an ene-type loss of the neutral fatty acid yielding m/z 144, undergoing a further loss of trimethylamine to yield m/z 85. The same loss of trimethylamine can occur from the intact molecule, yielding a fragment found at a neutral loss of 59 Da. Based on this fragmentation pattern, eight high-confidence hits were ruled out and are presumably incorrect annotations. Of these eight bogus annotations, three potentially have the wrong adduct annotation. Based on our manual verification, 21 of 30 annotations of the acyl-carnitines are correct. Furthermore, the query spectrum annotated as (2E)-octenoyl-carnitine showed good agreement with a reference measurement (Supplementary Fig. 29).

Second, several N-acyl-amino acids were manually confirmed. Fragmentation of [M+H]+ adducts of N-acyl amino acids typically yields an ene-type loss of a neutral fatty amide or the neutral loss of a fatty acyl ketene structure. Additional fragmentation yields typical amino acid fragments, potentially allowing the amino acid to be identified in more detail. Within the human dataset, N-oleyl-leucine was annotated with a high confidence score. For this structure, reference spectra are now available in MassBank68. A high spectral similarity (cosine score 0.85) was found between the spectrum and the reference (Supplementary Fig. 29). Because MS cannot differentiate between isomeric species, the structure might also represent N-oleyl-isoleucine: spectra of N-oleyl-leucine and N-oleyl-isoleucine are both present in MassBank but are indistinguishable. Another example is N-palmitoyl-tryptophan. No reference spectrum is available for this substance, but the observed fragmentation pattern is in good agreement with expected fragmentation, showing m/z 205, which relates to tryptophan based on the loss of palmitic acid as ketene, and m/z 188, which is related to the loss of palmitic acid as neutral amide. Additional fragments are typically observed in the fragmentation of tryptophan. Using MASST, 12 additional human datasets containing a similar spectrum were identified.

Third, phenazine-1,6-dicarboxylic acid was annotated in a human urine dataset. This metabolite is produced by Streptomyces and Pseudomonas species88, hinting at a potential urinary tract infection. Again, the query spectrum showed good agreement with a reference measurement (Supplementary Fig. 29).

Compound classes in Extended Data Fig. 10 were assigned by NPClassifier89. Proteinogenic amino acids and peptides were selected manually. We installed a web interface allowing interested users to browse through structure annotations ordered by confidence, check spectra, access underlying datasets, leave comments and judge the overall quality of the annotation for the human dataset. The web interface is available at https://bio.informatik.uni-jena.de/cosmic.

To demonstrate that COSMIC can be applied at a repository scale, we searched the Orbitrap dataset with 17,414 LC–MS/MS runs against the biomolecule structure database; this resulted in 979,521 hits. Again, multiple hits can annotate the same structure; the above hits correspond to 77,932 unique annotated structures, of which 8,172 (10.5%) were present in the CSI training dataset. We now filter the 77,932 structure annotations for high confidence (score threshold 0.64), resulting in 3,530 structure annotations. Of these high-confidence structure annotations, 1,815 (51.4%) were present in the CSI training dataset, leaving 1,715 (48.6%) high-confidence novel structure annotations (Supplementary Fig. 30). Again, all hits of the Orbitrap dataset can be accessed via a web interface available at https://bio.informatik.uni-jena.de/cosmic.

The above computations were carried out on a compute node with 2 × 48 cores, AMD EPYC 7642 processors and 1 TB RAM. For running times on a laptop computer, we selected a ‘typical’ MassIVE dataset with 44 LC–MS/MS runs (MSV000080553, rosemary samples). We analyzed the data on a common laptop computer (Quad-Core Intel CPU i7-7700HQ, 16 GB RAM). SIRIUS default parameters were used; in particular, fragmentation tree computation (which is the most time-demanding step of the computational analysis) was done exactly for compounds below 350 Da and in hybrid mode above 350 Da. After feature alignment, 1,961 putative compounds with mass 125–968 Da were detected. We restricted the analysis to compounds below 700 Da, resulting in 1,854 compounds to be processed. Overall wall clock running time was 149 min, 59 s. We note that running time is dominated by the number of compounds in a dataset.

