Structural annotation of small molecules in biological samples is a key bottleneck in various research fields including biomedicine, biotechnology, drug discovery and environmental sciences. Samples in untargeted metabolomics studies typically contain thousands of different molecules, the vast majority of which remain unidentified1,2,3. Liquid chromatography–tandem mass spectrometry (LC-MS2) is one of the most widely used analysis platforms4, as it allows for high-throughput screening, is highly sensitive and is applicable to a wide range of molecules. In LC-MS2, molecules are first separated by their different physicochemical interactions between the mobile and stationary phase of the column in the liquid chromatographic system, resulting in retention time (RT) differences. Subsequently, they are separated according to their mass-to-charge ratio in a mass analyser (MS1). Finally, the molecular ions are isolated and fragmented in the tandem mass spectrometer (MS2).

For each ion, the recorded fragments and their intensities constitute the MS2 spectrum, which contains information about the substructures in the molecule and serves as a basis for annotation efforts. In typical untargeted LC-MS2 workflows, thousands of MS features (MS1, MS2, RT) arise from a single sample. The goal of structural annotation is to associate each feature with a candidate molecular structure, for further downstream interpretation.

In recent years, many powerful methods5,6 to predict structural annotations for MS2 spectra have been developed7,8,9,10,11,12,13,14,15,16,17,18. In general, these methods find candidate molecular structures potentially associated with the MS feature, for example, by querying molecules with a certain mass from a structure database such as Human Metabolome Database (HMDB)19 or PubChem20 and subsequently computing a match score between each candidate and the MS2 spectrum. The highest-scoring candidate is typically considered the structure annotation of a given MS2 spectrum. Currently, even the best-of-class methods only reach an annotation accuracy of around 40% (ref. 17) in evaluations when searching large candidate sets such as those retrieved from PubChem. Therefore, in practice, a ranked list of molecular structures is provided to the user (for example, the top-20 structures). This level of performance is still a considerable hindrance in metabolomics and other fields.

Interestingly, RT information remains underutilized in automated approaches for structure annotation based on MS2, despite RTs being readily available in all LC-MS2 pipelines and generally recognized as contributing valuable information21,22. An explanation is that a molecule generally has different RTs under different LC conditions (mobile phase, column composition and so on)23,24. Typically, the RT information is used for post-processing of candidate lists, for example, by comparing measured and reference standard RTs3,24. This approach, however, is limited by the availability of experimentally determined RTs of reference standards. In contrast, RT prediction models24,25 allow the prediction of RTs based solely on the molecular structure of the candidate, and have been successfully applied to aid structure annotation11,26,27,28,29. However, such prediction models generally have to be calibrated to the specific LC configuration3, requiring at least some amount of target LC reference standard RT data to be available21,29,30. Recently, the idea of predicting retention orders (ROs), that is, the order in which two molecules elute from the LC column, has been explored31,32,33,34. ROs are largely preserved within a family of LC systems (for example, reversed-phase or hydrophilic interaction LC systems). Therefore, RO predictors can be trained using a diverse set of RT reference data, and applied to out-of-dataset LC set-ups31. Integration of MS2- and RO-based scores using probabilistic graphical models improved the annotation performance in LC-MS2 experiments34.

Another somewhat neglected aspect in automated annotation pipelines is the treatment of stereochemistry, that is, the different three-dimensional (3D) variants of the molecules. The general assumption has been that LC-MS2 data do not contain sufficient information to separate stereoisomers in samples5,24. As a result, MS2 scorers typically disregard the stereochemical information in the candidate structures and often output the same matching for different stereoisomers (compare refs. 7,17). However, stereoisomers that vary in their double-bond orientation (for example, cis/trans or E/Z isomerism) may have different shapes and thus exhibit different fragmentation and/or interactions with the LC system. Thus, ignoring stereochemistry in candidate processing may disregard LC-relevant stereochemical information. Furthermore, it is known that certain stereochemical configurations occur more frequently than others in nature and hence in the reference databases. Making use of such information can potentially improve annotation performance.

In this Article, we set out to provide a new perspective on jointly using MS2 and RO combined with stereochemistry-aware molecular features for the structure annotation of LC-MS2 data. We present a novel machine learning framework called LC-MS2Struct, which learns to optimally combine the MS2 and RO information for the accurate annotation of a sequence of MS features. LC-MS2Struct relies on the structured support vector machine (SSVM)35 and max-margin Markov network36 frameworks. In contrast to the previous work of ref. 34, our framework does not require a separately learned RO prediction model. Instead, it optimizes the SSVM parameters such that the score margin between correct and any other sequence of annotations is maximized. This way, LC-MS2Struct learns to optimally use the RO information from a set of LC-MS2 experiments. We trained and evaluated LC-MS2Struct on all available reversed-phase LC data from MassBank37, including a combined total of 4,327 molecules from 18 different LC configurations, hence reaching a high level of measurement diversity in the model evaluation. Our framework is compared with three other approaches: RT filtering, logP predictions11 and RO predictions34. LC-MS2Struct can be combined with any MS2 scorer, and is demonstrated with the CFM-ID9,18, MetFrag7,11 and SIRIUS8,17 tools. The use of chirality-encoding circular molecular fingerprints38 in the predictive model allows LC-MS2Struct to distinguish and rank different stereoisomers based on the observed ROs.

Overview of LC-MS2Struct

Input and output

We consider a typical data setting in untargeted LC-MS2-based experiments, after pre-processing such as chromatographic peak picking and alignment (Fig. 1a). Such data comprise a sequence of MS features, here indexed by σ. Each feature consists of MS1 information (for example, mass, adduct and isotope pattern), LC retention time (RT) tσ and an MS2 spectrum xσ. We assume that a set of candidate molecules \({{{{\mathcal{C}}}}}_{\sigma }\) is associated with each MS feature σ. Such a set can be, for example, generated from a structure database (for example, PubChem20, ChemSpider39 or PubChemLite40) based on the ion’s mass, a suspect list or an in silico molecule generator (for example, SmiLib v2.041,42). We furthermore require that for MS2 spectrum xσ, a matching score θ(xσ, m) with its candidates \(m\in {{{{\mathcal{C}}}}}_{\sigma }\) is pre-computed using an in silico tool, such as CFM-ID9,18, MetFrag11 or SIRIUS8,17. LC-MS2Struct predicts a score for MS feature σ and each associated candidate \(m\in {{{{\mathcal{C}}}}}_{\sigma }\) based on a sequence of spectra \({{{\bf{x}}}}={({x}_{\sigma })}_{\sigma = 1}^{L}\), of length L, and the ROs derived from the observed RTs \({{{\bf{t}}}}={({t}_{\sigma })}_{\sigma = 1}^{L}\). These scores are used to rank the molecular candidates associated with the MS features (Fig. 1b).
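For concreteness, this input setting can be sketched in code. The names below (`MSFeature`, `retention_order`) are ours for illustration, not part of LC-MS2Struct; each feature carries its MS2 spectrum, its RT tσ and a candidate set with pre-computed matching scores θ(xσ, m):

```python
from dataclasses import dataclass, field

@dataclass
class MSFeature:
    """One MS feature: its MS2 spectrum, retention time and scored candidates."""
    spectrum_id: str   # identifier of the MS2 spectrum x_sigma
    rt: float          # retention time t_sigma
    # candidate set C_sigma: molecule id -> pre-computed MS2 score theta(x_sigma, m)
    candidates: dict = field(default_factory=dict)

# A (simulated) LC-MS2 experiment is a sequence of such features.
experiment = [
    MSFeature("x1", rt=120.5, candidates={"m_a": 0.9, "m_b": 0.4}),
    MSFeature("x2", rt=305.0, candidates={"m_c": 0.7, "m_d": 0.6, "m_e": 0.2}),
]

# The retention order (RO) of a feature pair follows directly from the observed RTs.
def retention_order(t_sigma, t_tau):
    """-1 if sigma elutes first, +1 if tau elutes first, 0 on a tie."""
    return (t_sigma > t_tau) - (t_sigma < t_tau)

ro_12 = retention_order(experiment[0].rt, experiment[1].rt)
```

The ROs derived this way, rather than the absolute RTs, are what LC-MS2Struct consumes.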

Fig. 1: Overview of the LC-MS2Struct workflow.
figure 1

a, Input to LC-MS2Struct during the application phase. The LC-MS2 experiment results in a set of (MS2, RT)-tuples. The MS information is used to generate a molecular candidate set for each MS feature. b, The output of LC-MS2Struct is the ranked molecular candidates for each MS feature. c, A fully connected graph G models the pairwise dependency between the MS features. Using a set of random spanning trees Tk and SSVM, we predict the max-marginal scores for each candidate used for the ranking. d, The MS2 and RO information is used to score the nodes and edges in the graph G. e, To train the SSVM models and evaluate LC-MS2Struct, we extract MS2 spectra and RTs from MassBank. We group the MassBank records such that their experimental set-ups are matching, simulating LC-MS2 experiments. f, Main objective optimized during the SSVM training, where yi ∈ Σi is the ground-truth label sequence of example i and \({{{\bf{y}}}},{{{\bf{y}}}}^{\prime} \in {{{\varSigma }}}_{i}\) are further possible label sequences.

Candidate ranking using max-marginals

We define a fully connected graph G = (V, E) capturing the MS features and modelling their dependencies (Fig. 1c), where V represents the set of nodes and E the set of edges. Each node σ ∈ V corresponds to an MS feature, and is associated with the pre-computed MS2 matching scores θ(xσ, m) between the MS2 spectrum xσ and all molecular candidates \(m\in {{{{\mathcal{C}}}}}_{\sigma }\). The graph G contains an edge (σ, τ) ∈ E for each MS feature pair. A scoring function F is defined predicting a compatibility score between a sequence of molecular structure assignments \({{{\bf{y}}}}={({y}_{\sigma })}_{\sigma = 1}^{L}\) in the label-space \({{\varSigma }}={{{{\mathcal{C}}}}}_{1}\times \ldots \times {{{{\mathcal{C}}}}}_{L}\) and the observed data:

$$F({\mathbf{y}}\,|\,{\mathbf{x}},{\mathbf{t}},{\mathbf{w}},G)=\underbrace{\frac{1}{|V|}\sum\limits_{\sigma\in V}\theta(x_\sigma,y_\sigma)}_{{\rm{Node}}\,{\rm{scores}}:\,{\rm{MS}}^2\,{\rm{information}}}+\underbrace{\frac{1}{|E|}\sum\limits_{(\sigma,\tau)\in E}f\left((t_\sigma,t_\tau), (y_\sigma,y_\tau)\,|\,{\mathbf{w}}\right)}_{{\rm{Edge}}\,{\rm{scores}}:\,{\rm{RO}}\,{\rm{information}}},$$

where the function f outputs an edge score (Fig. 1d) expressing the agreement between the observed and the predicted RO, for each candidate assignment pair (yσ, yτ) given the observed RTs (tσ, tτ). The function f is parameterized by the vector w, which is trained specifically for each MS2 scorer (see next section). Using the compatibility score function F (equation (1)), we compute the max-marginal scores43 for each candidate and MS feature, defined for a candidate \(m\in {{{{\mathcal{C}}}}}_{\sigma }\) and MS feature σ as the maximum compatibility score that a candidate assignment \(\bar{{{{\bf{y}}}}}\in {{\varSigma }}\) with \({\bar{y}}_{\sigma }=m\) can reach:
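A minimal sketch of the compatibility score F from equation (1), assuming the MS2 scores θ are pre-computed and the edge model f is passed in as a callable (all names are illustrative and the sequence must contain at least two features):

```python
import itertools

def score_assignment(theta, rts, edge_score, y):
    """Compatibility score F(y | x, t, w, G) on the fully connected graph G.

    theta:      list of dicts; theta[s][m] is the MS2 score of candidate m
                for MS feature s (node scores)
    rts:        observed retention times t_s
    edge_score: callable f((t_s, t_t), (y_s, y_t)) scoring RO agreement
    y:          candidate assignment, one molecule id per MS feature (L >= 2)
    """
    n = len(y)
    # Node term: average MS2 matching score over all features.
    node = sum(theta[s][y[s]] for s in range(n)) / n
    # Edge term: average RO agreement over all feature pairs (complete graph).
    pairs = list(itertools.combinations(range(n), 2))
    edge = sum(edge_score((rts[s], rts[t]), (y[s], y[t])) for s, t in pairs) / len(pairs)
    return node + edge
```

With a zero edge model this reduces to the mean MS2 score, which makes the two normalized terms of equation (1) easy to verify in isolation.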

$$\mu ({y}_{\sigma }=m\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G)=\mathop{\max }\limits_{\{\bar{{{{\bf{y}}}}}\in {{\varSigma }}\,:\,{\bar{y}}_{\sigma }=m\}}\,F(\bar{{{{\bf{y}}}}}\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G).$$

We use μ to rank the molecular candidates34. However, for general graphs G, the max-marginal inference problem (MMAP) is intractable. Therefore, we approximate the MMAP problem by performing the inference on tree-like graphs Tk randomly sampled from G (Fig. 1c), for which exact inference is feasible43,44. Here, k indexes the individual spanning trees. Subsequently, we average the max-marginal scores μ(yσ = m | xi, ti, wk, Tk) over a set of trees T, an approach that has performed well in practical applications34,45,46. Here, i indexes the individual training MS2 spectra and RT sequences. For each spanning tree Tk, we apply a separately trained SSVM model wk to increase the diversity of the predictions.
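The tree-based approximation can be sketched as follows. For clarity, the per-tree max-marginals are computed here by exhaustive enumeration over small candidate sets; in practice, exact max-product message passing on each spanning tree is what makes the computation tractable. All function names are ours:

```python
import itertools, random

def random_spanning_tree(n_nodes, rng):
    """Sample a random spanning tree of the complete graph on n_nodes vertices
    by attaching each node (in random order) to a random already-placed node."""
    order = list(range(n_nodes))
    rng.shuffle(order)
    return [(order[rng.randrange(i)], order[i]) for i in range(1, n_nodes)]

def max_marginals(theta, rts, edge_score, candidates, tree_edges):
    """Brute-force max-marginals mu(y_s = m) on one spanning tree T_k."""
    n = len(candidates)
    mu = [{m: float("-inf") for m in cs} for cs in candidates]
    for y in itertools.product(*candidates):
        f = sum(theta[s][y[s]] for s in range(n)) / n
        f += sum(edge_score((rts[s], rts[t]), (y[s], y[t]))
                 for s, t in tree_edges) / len(tree_edges)
        for s in range(n):
            mu[s][y[s]] = max(mu[s][y[s]], f)   # best score with y_s fixed to m
    return mu

def averaged_max_marginals(theta, rts, edge_score, candidates, n_trees=8, seed=0):
    """Average the per-tree max-marginals over a set of random spanning trees."""
    rng = random.Random(seed)
    acc = [{m: 0.0 for m in cs} for cs in candidates]
    for _ in range(n_trees):
        tree = random_spanning_tree(len(candidates), rng)
        mu = max_marginals(theta, rts, edge_score, candidates, tree)
        for s, cs in enumerate(candidates):
            for m in cs:
                acc[s][m] += mu[s][m] / n_trees
    return acc
```

The averaged scores `acc[s]` are then used to rank the candidate set of feature s.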

Joint annotation using SSVMs

We propose to tackle the joint assignment of candidate labels y ∈ Σ to the sequence of MS features of a LC-MS2 experiment through structured prediction, a family of machine learning methods generally used to annotate sequences or networks35,46,47. In our model, the structure is given by the observed RO of the MS feature pairs (yσ, yτ), which provides additional information on the correct candidate labels yσ and yτ. Given a set of annotated LC-MS2 experiments extracted from MassBank37 (Fig. 1e), we train an SSVM35 model w predicting the edge scores. SSVM models can be optimized using the max-margin principle35. In a nutshell, given a set of ground-truth-annotated MS feature sequences, the model parameters w are optimized such that the correct label sequence yi ∈ Σi, that is, the structure annotations for all MS features in an LC-MS2 experiment, scores higher than any other possible label sequence assignment y ∈ Σi (Fig. 1f).
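Written out, this max-margin principle corresponds to a standard margin-rescaling SSVM objective of the form below, where Δ is a label-sequence loss (for example, a Hamming-type loss counting wrong annotations). This is the generic formulation35; the exact loss and optimization used by LC-MS2Struct are given in Methods:

```latex
$$\min_{\mathbf{w},\,\boldsymbol{\xi}}\;\frac{1}{2}\lVert\mathbf{w}\rVert^{2}
  +\frac{C}{N}\sum_{i=1}^{N}\xi_{i}
\quad\text{s.t.}\quad
F(\mathbf{y}_{i}\,|\,\mathbf{x}_{i},\mathbf{t}_{i},\mathbf{w},G)
  \;\geq\;
F(\mathbf{y}\,|\,\mathbf{x}_{i},\mathbf{t}_{i},\mathbf{w},G)
  +\Delta(\mathbf{y}_{i},\mathbf{y})-\xi_{i}
\quad\forall i,\;\forall\,\mathbf{y}\in\Sigma_{i}\setminus\{\mathbf{y}_{i}\},\quad\xi_{i}\geq 0,$$
```

that is, the correct annotation sequence yi must outscore every alternative y by a margin growing with the loss Δ(yi, y), with slack ξi absorbing violations.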


Extracting training data from MassBank

Ground-truth-annotated MS2 spectra and RTs were extracted from MassBank37, a public online database for MS2 data. Each individual MassBank record typically provides a rich set of meta-information (Supplementary Table 1), such as the chromatographic and MS conditions as well as molecular structure annotations. For training the SSVM model of LC-MS2Struct, the MassBank data were processed such that the experimental conditions were consistent within each MS feature set, that is, with identical LC set-up and MS configuration as in a typical LC-MS2 experiment, to ensure comparable RT, RO and MS2 data. We developed a Python package ‘massbank2db’ that can process MassBank records and group them into consistent MS feature sets, which we denote as MassBank subsets (MB-subsets). For our experiments, we sampled sequences of MS features from the MB-subsets to simulate real LC-MS2 experiments where the signals of multiple unknown compounds are measured under consistent experimental set-ups. Figure 1e illustrates the grouping and LC-MS2 sampling process. Two collections of MassBank data were considered: ALLDATA and the ONLYSTEREO subset.
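The grouping step can be sketched as follows. The meta-data keys used here are illustrative; the actual 'massbank2db' package reads the records' chromatography and MS fields from the MassBank meta-information:

```python
from collections import defaultdict

def group_into_mb_subsets(records):
    """Group MassBank records by their experimental set-up so that RT, RO and
    MS2 data are comparable within each group (an 'MB-subset')."""
    subsets = defaultdict(list)
    for rec in records:
        # Records sharing LC set-up and MS configuration land in the same subset.
        key = (rec["instrument"], rec["column"], rec["eluent"], rec["gradient"])
        subsets[key].append(rec)
    return subsets
```

Sequences of MS features are then sampled from within a single MB-subset, so that each sampled sequence simulates one LC-MS2 experiment under a consistent set-up.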

Comparison of LC-MS2Struct with other approaches

In the first experiment, we compare LC-MS2Struct with previous approaches for candidate ranking either using only-MS2 or additionally using RT or RO information. Only-MS2 uses the MS2 spectrum information to rank the molecular candidates and serves as baseline; MS2 + RO (ref. 34) uses a ranking support vector machine (RankSVM)48,49 to predict the ROs of candidate pairs and a probabilistic inference model to combine the ROs with MS2 scores; MS2 + RT uses predicted RTs to remove false-positive molecule structures from the candidate set, ordered by their MS2 score, by comparing the predicted and observed RT; MS2 + logP is an approach introduced by ref. 11, which uses the observed RT to predict the XlogP3 value50 of the unknown compound and compares it with the candidates’ XlogP3 values extracted from PubChem to refine the initial ranking based on the MS2 scores. The RO-based methods (LC-MS2Struct and MS2 + RO) were trained using the RTs from all available MB-subsets, ensuring that no test molecular structure (based on InChIKey first block, that is, the structural skeleton) was used for the model training (structure disjoint). For the RT-based approaches (MS2 + RT and MS2 + logP), the respective predictors were trained in a structure disjoint fashion using only the RT data available for that MB-subset. For the experiment, all MB-subsets with more than 75 (MS2, RT)-tuples from the ALLDATA data set-up were used (Supplementary Table 2), as the RT-based approaches require LC-system-specific RT training data. The ranking performance was computed for each LC-MS2 experiment within a particular MB-subset. The candidate molecules are identified by their InChIKey first block (the structural skeleton); hence, no stereoisomers are in the candidate sets.
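Identifying candidates by their InChIKey first block amounts to collapsing each candidate set to one representative per 2D skeleton, which can be sketched as (function names are ours; InChIKeys shown in tests are synthetic):

```python
def first_block(inchikey):
    """Return the first (skeleton) block of a standard InChIKey,
    i.e. the 14 characters before the first hyphen."""
    return inchikey.split("-")[0]

def collapse_stereoisomers(candidates):
    """Keep one representative per 2D skeleton (InChIKey first block),
    so that no stereoisomers remain in the candidate set."""
    seen, kept = set(), []
    for ik in candidates:
        fb = first_block(ik)
        if fb not in seen:
            seen.add(fb)
            kept.append(ik)
    return kept
```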

Each candidate ranking approach was evaluated with three MS2 scorers: CFM-ID 4.018, MetFrag11 and SIRIUS17. For LC-MS2Struct, we use stereochemistry-aware molecular fingerprints (3D) to represent the candidates.

Figure 2a shows the average ranking performance (top-k accuracy) across 350 LC-MS2 experiments, each encompassing about 50 (MS2, RT)-tuples (Methods). LC-MS2Struct is the best-performing method combined with any of the three MS2 scorers. For CFM-ID and MetFrag, LC-MS2Struct provides 4.7 and 7.3 percentage unit increases over the only-MS2 for the top-1 accuracy, corresponding to 80.8% and 106% performance gain, respectively. In our setting, that translates to 2.4 and 3.7 additional annotations at the top rank, respectively (out of approximately 50). The performance improvement increases for larger k, reaching as far as 9.3 and 11.3 percentage units for the top-20, which means 4.7 and 5.7 additional correct structures, respectively, in the top-20. For SIRIUS, the improvements are more modest, on average around 2 percentage units for top-1 to top-20. This might be explained by the higher baseline performance of SIRIUS. Nevertheless, SIRIUS can be improved for particular MB-subsets (see Fig. 2b and the discussion in the next section).
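The top-k accuracy reported here can be computed as follows (a straightforward sketch; ties in the ranking are ignored for simplicity):

```python
def top_k_accuracy(rankings, k):
    """Percentage of MS features whose correct structure is ranked within
    the top k of its candidate list.

    rankings: list of (ordered_candidates, correct) pairs, with candidates
              ordered by decreasing score.
    """
    hits = sum(1 for ordered, correct in rankings if correct in ordered[:k])
    return 100.0 * hits / len(rankings)
```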

Fig. 2: Different approaches to combine MS2 and RT information.
figure 2

a, Comparison of the performance, measured by top-k accuracy, for the different ranking approaches combining MS2 and RT information. The results shown are averaged accuracies over 350 sample MS feature sequences (LC-MS2 experiments). b, Average top-k accuracies per MB-subset, rounded to full integers. The colour encodes the performance improvement in percentage units (%p) of each score integration method compared with only-MS2.

Source data

The runner-up score integration method is MS2 + RO, which also makes use of predicted ROs. For CFM-ID and MetFrag, it leads to about one-third to one-half of the performance gain of LC-MS2Struct. The approaches relying on RTs, either by candidate filtering (MS2 + RT) or through logP prediction (MS2 + logP), lead to only minor improvements for MetFrag and CFM-ID, but none for SIRIUS, for which MS2 + RT even leads to a decrease in ranking performance by about 2 percentage units. An explanation for this is that the filtering approach removes on average 4.7% of the correct candidates, which leads to false-negative predictions.
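The MS2 + RT filtering baseline can be sketched as follows (the absolute tolerance window is an assumption; implementations may use relative error windows instead). Note how a poor RT prediction for the true structure removes it from the candidate set, producing exactly the false negatives described above:

```python
def rt_filter(candidates, observed_rt, predicted_rt, tol):
    """Remove candidates whose predicted RT deviates from the observed RT
    by more than tol; the surviving candidates keep their MS2-score order."""
    return [m for m in candidates
            if abs(predicted_rt[m] - observed_rt) <= tol]
```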

The performance gain by using either RO or RT varies between the MB-subsets with differing LC-MS2 set-ups (Supplementary Table 3) and compound class compositions (Extended Data Fig. 1). We illustrate these differences in Fig. 2b. Applying LC-MS2Struct improves the ranking performance in almost all MB-subsets, including for the SIRIUS MS2 scorer (only very slight decreases were observed for a few SIRIUS-scored subsets). This is in stark contrast to the RT-based approaches (MS2 + RT and MS2 + logP), which often lead to less accurate rankings, especially for SIRIUS. Furthermore, as seen already in the average results (Fig. 2a), the benefit of LC-MS2Struct depends on the MS2 base scorer. For example, the top-1 accuracy of the subsets ‘AC_003’ and ‘NA_003’ can be greatly improved for MetFrag but shows little improvement for CFM-ID. Both subsets contain natural-product toxins, which are perhaps poorly explained by the bond-disconnection approach of MetFrag. In contrast, for ‘RP_001’ and ‘UF_003’, the largest improvements (top-1) can be reached for CFM-ID. The RT-filtering approach (MS2 + RT) performs particularly well for ‘LQB_000’ and ‘UT_000’. These subsets mostly contain lipids and lipid-like molecules (Extended Data Fig. 1).

Since the RT prediction models are trained using only data from the respective MB-subsets, more accurate models may be reached for less heterogeneous subsets of molecules. Hence, the RT filtering could work well in such cases26.

Performance for different compound classifications

Next we investigate how LC-MS2Struct can improve the annotation across different categories in two molecule classification systems, ClassyFire51 and PubChemLite40. Figure 3 shows the average top-1 and top-20 accuracy improvement of LC-MS2Struct over the only-MS2 baseline for each ClassyFire super-class and PubChemLite annotation category. For ClassyFire (Fig. 3a), the ranking performance improvement for the different super-classes depends on the MS2 scorer. For example, the top-1 accuracy of ‘Alkaloids and derivatives’ can be improved by 10.8 percentage units for MetFrag, but improves much less for CFM-ID and SIRIUS (1.9 and 3.5 percentage units, respectively). For ‘Organic oxygen compounds’, in contrast, the top-1 accuracy improves by about 10 percentage units when using both CFM-ID and MetFrag, whereas only half that improvement is observed for SIRIUS. This suggests that the CFM-ID results may be improved with the inclusion of more ‘Alkaloids and derivatives’. In addition, the ‘Alkaloids and derivatives’, ‘Organic acids and derivatives’ and ‘Organic nitrogen compounds’ appear less well explained by MetFrag (perhaps with more rearrangements, or less distinguishable spectra), such that the improvement from the RO approach is more apparent. For SIRIUS, ‘Lipids and lipid-like molecules’ as well as ‘Organic oxygen compounds’ benefit the most from LC-MS2Struct in top-1 (both improving by 5.7 percentage units) and top-20 (4.1 and 3.2 percentage units, respectively). In general, for ‘Lipids and lipid-like molecules’, LC-MS2Struct seems to achieve the best improvement (top-1 and top-20) over all MS2 scorers. However, depending on the MS2 scorer, this improvement is distributed differently across the lipid sub-classes (Extended Data Fig. 2), such as ‘Fatty acyls’, ‘Prenol lipids’ or ‘Sphingolipids’.

Fig. 3: Performance gain by LC-MS2Struct across molecular classes.
figure 3

The ranking performance (top-k) improvement of LC-MS2Struct compared with only-MS2 (baseline). The data are presented as mean values (50 samples) and the error bars show the 95% confidence interval of the mean estimate (1,000 bootstrapping samples). The top-k accuracies (%) under the bars show the only-MS2 performance. For each molecular class, the number of unique molecular structures in the class is denoted in the x-axis label (n). a, Molecular classification using the ClassyFire51 framework (class level). b, PubChemLite40 annotation classification system. Molecules not present in PubChemLite are summarized under the ‘noClassification’ category. Note that in PubChemLite, a molecule can belong to multiple categories.

Source data

For the PubChemLite classification (Fig. 3b), we also see that the MS2 scorers benefit differently from LC-MS2Struct. The improvement is generally close to the average improvement of the respective MS2 scorers and seems more equally distributed across the annotation categories.

For example, for CFM-ID, the biggest top-1 improvements are in the ‘foodRelated’ and ‘noClassification’ categories. On the other hand, for SIRIUS the ‘pharmacoInfo’ and ‘bioPathway’ categories improve the most. MetFrag shows the most consistent performance improvement across the categories. ‘agroChemInfo’ benefits the least from LC-MS2Struct (top-1 and top-20). A possible explanation could be that the molecules categorized as agrochemicals are mainly ‘Benzenoids’ (48.5%), ‘Organoheterocyclic compounds’ (25.9%) and ‘Organic acids and derivatives’ (11.6%). As shown in Fig. 3a, these three ClassyFire classes show low (CFM-ID and MetFrag) or practically no (SIRIUS) improvement when using ROs.

Annotation of stereoisomers

Finally, we study whether LC-MS2Struct can annotate stereoisomers more accurately than MS2 alone, considering differences between stereoisomers that vary in their double-bond orientation (for example, cis/trans or E/Z isomerism), which may potentially lead to differences in their LC behaviour (Fig. 4a). We consider candidate sets containing stereoisomers and evaluate LC-MS2Struct only using MassBank records where the ground-truth structure has stereochemistry information provided, that is, where the InChIKey second block is not ‘UHFFFAOYSA’ (ONLYSTEREO data set-up; Methods). The molecular candidates are represented using two different molecular fingerprints: one that includes stereochemistry information (3D); and one that omits it (2D) (Methods). This allows us to assess the importance of stereochemistry-aware features for the structure annotation.
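The distinction between 2D and 3D representations can be illustrated in miniature with InChIKey blocks; LC-MS2Struct itself uses chirality-encoding circular fingerprints38, so the keys below are only a stand-in showing why a stereo-agnostic representation cannot separate stereoisomers (function names and keys are ours):

```python
def key_2d(inchikey):
    """Stereochemistry-agnostic (2D) key: the skeleton block only, so two
    stereoisomers of the same skeleton map to the same value."""
    return inchikey.split("-")[0]

def key_3d(inchikey):
    """Stereochemistry-aware (3D) key: includes the second InChIKey block,
    which encodes stereochemistry ('UHFFFAOYSA' means none is given)."""
    first, second = inchikey.split("-")[:2]
    return (first, second)
```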

Figure 5a shows the ranking performance of LC-MS2Struct using 2D and 3D fingerprints.

Fig. 4: Application of LC-MS2Struct to annotate stereoisomers.
figure 4

Post-hoc analysis of the stereoisomer annotation using LC-MS2Struct for three (MS2, RT)-tuples from our MassBank data associated with the same 2D skeleton (InChIKey first block). In our evaluation, all three MS features were analysed multiple times in different contexts (BS02391126 in four, BS64681001 in eight and PR75447353 in two LC-MS2 experiments). a, MS features with their ground-truth annotations. Two of the spectra (starting with BS) were measured under the same LC condition (MB-subset ‘BS_000’), demonstrating the separation of E/Z isomers on LC columns. b, The candidate sets of the three features are identical (defined by the molecular formula C36H32O19) and contain only three structures. For 12 out of the 14 LC-MS2 experiments, LC-MS2Struct predicts the correct E/Z isomer.

Source data

When looking into the top-1 performance of LC-MS2Struct (3D) for the individual MS2 scorers, we observe an improvement by 2.6, 3.8 and 3.2 percentage units for CFM-ID, MetFrag and SIRIUS, respectively. This translates to performance gains of 87.3%, 95.9% and 44.3%, respectively.

In general, LC-MS2Struct improves the ranking for all three MS2 scorers. The improvement, however, is notably larger when using stereochemistry-aware (3D) candidate features. Interestingly, a similar behaviour could be observed in the ALLDATA setting (Extended Data Fig. 3), even though the absolute performance improvements were smaller. This experiment demonstrates that LC-MS2Struct can use RO information to improve the annotation of stereoisomers.


Discussion

LC-MS2Struct is a novel approach for the integration of MS2 and LC data for the structural annotation of small molecules. The method learns from the pairwise dependencies in the RO of MS features within similar LC configurations and can generalize across different, heterogeneous LC configurations. Furthermore, the use of stereochemistry-aware molecular fingerprints enables LC-MS2Struct to annotate stereoisomers in LC-MS2 experiments based on the observed ROs. In addition, our novel processing pipeline, which groups all (MS2, RT) data from MassBank into subsets of homogeneous LC-MS2 conditions and is implemented in the ‘massbank2db’52 Python package, will, we believe, make MassBank more accessible to other researchers and hence lower the bar of entry to computational metabolomics research.

Our experiments demonstrate that LC-MS2Struct annotates small molecules with an accuracy far superior to more traditional RT filtering and logP-based approaches, and also markedly better than previous methods that rely on ROs. In particular, compared with ref. 34, which used a graphical model as a post-hoc integration tool of MS2 scores and RO predictions, the benefits of learning the parameters of the graphical model are clear. All three studied MS2 scorers could be improved by LC-MS2Struct, including the best-of-class SIRIUS, for which improvements have generally been hard to come by due to its already high baseline accuracy. Our results show the superiority of stereochemistry-aware molecular features for the structure annotation of LC-MS2 data. Remarkably, this was the case not only for the annotation of stereoisomers but also for candidates distinguished by only their 2D structure. This result could be relevant for improving structural annotations in ion mobility separation–mass spectrometry with collision-cross-section measurements.

Our examples indicated that LC-MS2Struct separates candidates with varying double-bond stereochemistry, that is, E/Z and cis/trans isomers (see, for example, Fig. 4). However, there were very few examples of double-bond and/or chiral isomers measured on the same LC system in our dataset, which makes it difficult to quantify this effect or to investigate it further until more such data are publicly available. Furthermore, as non-chiral LC cannot distinguish stereoisomers that differ only in their chiral centres, the development of more selective stereochemistry-aware molecular features, ignoring the chiral annotations, might be beneficial. We also note that the direct modelling of a node score (MS2 information) predictor in the SSVM would be possible. However, as the MS2 scorers used here are already relatively mature and well known in the community, we have left this research line open for future efforts.

Fig. 5: Using LC-MS2Struct to identify stereoisomers.
figure 5

a, Comparison of the performance, measured by top-k accuracy, of LC-MS2Struct using either 2D (no stereochemistry) or 3D (with stereochemistry) molecular fingerprints in the ONLYSTEREO setting. The results shown are averaged accuracies over 94 sample MS feature sequences (LC-MS2 experiments). b, Average top-k accuracies per MB-subset rounded to full integers. The colour encodes the performance improvement in percentage units (%p) of each score integration method compared with only-MS2.

Source data



Methods

We use the following notation to describe LC-MS2Struct:

$$\begin{array}{lll}{{{\rm{Sequence}}}}\,{{{\rm{of}}}}\,{{{\rm{spectra}}}}&{{{\bf{x}}}}=({x}_{1},\ldots ,{x}_{L})&{{{\rm{with}}}}\,{x}_{\sigma }\in {{{\mathcal{X}}}}\\ {{{\rm{Sequence}}}}\,{{{\rm{of}}}}\,{{{\rm{retention}}}}\,{{{\rm{times}}}}&{{{\bf{t}}}}=({t}_{1},\ldots ,{t}_{L})&{{{\rm{with}}}}\,{t}_{\sigma }\in {{\mathbb{R}}}_{\ge 0}\\ {{{\rm{Sequence}}}}\,{{{\rm{of}}}}\,{{{\rm{candidate}}}}\,{{{\rm{sets}}}}&{{{\boldsymbol{{{{\mathcal{C}}}}}}}}=({{{{\mathcal{C}}}}}_{1},\ldots ,{{{{\mathcal{C}}}}}_{L})&{{{\rm{with}}}}\,{{{{\mathcal{C}}}}}_{\sigma }\subseteq {{{\mathcal{Y}}}}\\ {{{\rm{Sequence}}}}\,{{{\rm{of}}}}\,{{{\rm{labels}}}}&{{{\bf{y}}}}=({y}_{1},\ldots ,{y}_{L})\in {{\Sigma }}&{{{\rm{with}}}}\,{y}_{\sigma }\in {{{\mathcal{Y}}}}\\ {{{\rm{Candidate}}}}\,{{{\rm{assignment}}}}\,{{{\rm{space}}}}&{{\varSigma }}={{{{\mathcal{C}}}}}_{1}\times \ldots \times {{{{\mathcal{C}}}}}_{L},\end{array}$$

where \({{{\mathcal{X}}}}\) and \({{{\mathcal{Y}}}}\) denote the MS2 spectra and the molecular structure space, respectively, \({{{\mathcal{C}}}}\) denotes a candidate set that is a subset of all possible molecular structures, and A × B denotes the cross product of two sets A and B. For the purpose of model training and evaluation, we assume a dataset with ground-truth-labelled MS feature sequences: \({{{\mathcal{D}}}}={\{(({{{{\bf{x}}}}}_{i},{{{{\bf{t}}}}}_{i}),{{{{\boldsymbol{{{{\mathcal{C}}}}}}}}}_{i},{{{{\bf{y}}}}}_{i})\}}_{i = 1}^{N}\), where N denotes the total number of sequences. We use \(i,j\in {{\mathbb{N}}}_{\ge 0}\) to index MS feature sequences and \(\sigma ,\tau \in {{\mathbb{N}}}_{\ge 0}\) as indices for individual MS features within a sequence; for example, xiσ denotes the MS2 spectrum at index σ in sequence i. The length of a sequence of MS features is denoted with L. We denote the ground-truth labels (candidate assignment) of sequence i with yi and any labelling with y. Both yi and y lie in Σi. We use y to denote the candidate label variable, whereas m denotes a particular molecular structure. For example, yσ = m means that we assign the molecular structure m as the label to MS feature σ.

Graphical model for joint annotation of MS features

We consider the molecular annotation problem for the output of LC-MS2, which means assigning a molecular structure to each MS feature, as a structured prediction problem35,46,47, relying on a graphical model representation of the sets of MS features arising from an LC-MS2 experiment. For each MS feature σ, we want to predict a label yσ from a fixed and finite candidate (label) set \({{{{\mathcal{C}}}}}_{\sigma }\). We model the observed ROs between each MS feature pair (σ, τ) within an LC-MS2 experiment as pairwise dependencies of the features. We define an undirected graph G = (V, E) with the vertex set V containing a node σ for each MS feature and the edge set E containing an edge for each MS feature pair, E = {(σ, τ) | σ, τ ∈ V, σ ≠ τ} (compare Fig. 1a,c). The resulting graph is complete with an edge between all pairs of nodes. This allows us to make use of arbitrary pairwise dependencies, instead of limiting to, say, adjacent RTs. This modelling choice was previously shown to be beneficial by ref. 34. Here we extend that approach by learning from the pairwise dependencies to optimize joint annotation accuracy, which leads to markedly improved annotation accuracy.

For learning, we define a scoring function F that, given an input MS feature sequence (x, t) and its corresponding sequence of candidate sets \({{{\boldsymbol{{{{\mathcal{C}}}}}}}}\), computes a compatibility score between the measured data and any possible sequence of labels y ∈ Σ:

$$F({{{\bf{y}}}}\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G)=\frac{1}{| V| }\mathop{\sum}\limits_{\sigma \in V}\theta ({x}_{\sigma },{y}_{\sigma })+\frac{1}{| E| }\mathop{\sum}\limits_{(\sigma ,\tau )\in E}\langle {{{\bf{w}}}},{{\bf{{\Gamma }}}}({{{{\bf{t}}}}}^{\sigma \tau },{{{{\bf{y}}}}}^{\sigma \tau })\rangle ,$$

where \(\theta :{{{\mathcal{X}}}}\times {{{\mathcal{Y}}}}\to (0,1]\) is a function returning an MS2 matching score between the spectrum xσ and a candidate \({y}_{\sigma }\in {{{{\mathcal{C}}}}}_{\sigma }\), 〈·, ·〉 denotes the inner product, and w is a model weight vector to predict the RO matching score, based on the joint-feature vector \({{\bf{{\Gamma }}}}:{{\mathbb{R}}}_{\ge 0}\times {{\mathbb{R}}}_{\ge 0}\times {{{\mathcal{Y}}}}\times {{{\mathcal{Y}}}}\to {{{\mathcal{F}}}}\) between the observed RO derived from tστ = (tσ, tτ) and a pair of molecular candidates yστ = (yσ, yτ).

Equation (2) consists of two parts: (1) a score computed over the nodes in G capturing the MS2 information; and (2) a score expressing the agreement of observed and predicted RO computed over the edge set. We assume that the node scores are pre-computed by an MS2 scorer such as CFM-ID18, MetFrag11 or SIRIUS17. The node scores are normalized to (0, 1] within each candidate set \({{{{\mathcal{C}}}}}_{\sigma }\). The edge scores are predicted for each edge (σ, τ) using the model w and the joint-feature vector Γ:

$$\begin{array}{lll}f({{{{\bf{t}}}}}^{\sigma \tau },{{{{\bf{y}}}}}^{\sigma \tau }\,| \,{{{\bf{w}}}})&=&\langle {{{\bf{w}}}},{{\bf{{\Gamma }}}}({{{{\bf{t}}}}}^{\sigma \tau },{{{{\bf{y}}}}}^{\sigma \tau })\rangle \\ &=&\langle {{{\bf{w}}}},{{{\rm{sign}}}}({t}_{\sigma }-{t}_{\tau })\left(\phi ({y}_{\sigma })-\phi ({y}_{\tau })\right)\rangle \\ &=&{{{\rm{sign}}}}({t}_{\sigma }-{t}_{\tau })\langle {{{\bf{w}}}},\phi ({y}_{\sigma })-\phi ({y}_{\tau })\rangle ,\end{array}$$

with \(\phi :{{{\mathcal{Y}}}}\to {{{{\mathcal{F}}}}}_{{{{\mathcal{Y}}}}}\) being a function embedding a molecular structure into a feature space. The edge prediction function (3) produces a high edge score if the observed RO (that is, sign(tσ − tτ)) agrees with the predicted one.
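As an illustrative sketch (not the authors' implementation), the edge score of equation (3) can be computed directly from its last line; here the embedding φ is represented by a plain count-fingerprint vector, and all names are hypothetical:

```python
import numpy as np

def edge_score(w, t_sigma, t_tau, phi_sigma, phi_tau):
    """Sketch of equation (3): sign(t_sigma - t_tau) * <w, phi(y_sigma) - phi(y_tau)>.

    A positive value indicates that the predicted retention order
    <w, phi(y_sigma) - phi(y_tau)> agrees in sign with the observed one.
    """
    return np.sign(t_sigma - t_tau) * np.dot(w, phi_sigma - phi_tau)
```

Note that swapping σ and τ flips both signs and leaves the value unchanged, so the orientation of an edge does not matter.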

Using the compatibility score function (2), the predicted joint annotation for (x, t) corresponds to the highest-scoring label sequence \(\hat{{{{\bf{y}}}}}\in {{\varSigma }}:\hat{{{{\bf{y}}}}}=\arg \mathop{\max }\limits_{\bar{{{{\bf{y}}}}}\in {{\varSigma }}}\,F(\bar{{{{\bf{y}}}}}\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G)\). In practice, however, instead of predicting only the best label sequence, it can be useful to rank the molecular candidates \(m\in {{{{\mathcal{C}}}}}_{\sigma }\) for each MS feature σ. This is because, for state-of-the-art MS2 scorers, the annotation accuracy in the top-20 candidate list is typically much higher than for the highest-ranked candidate (top-1).

Our framework provides candidate rankings by solving the following problem for each MS feature σ and \(m\in {{{{\mathcal{C}}}}}_{\sigma }\):

$$\mu ({y}_{\sigma }=m\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G)=\mathop{\max }\limits_{\{\bar{{{{\bf{y}}}}}\in {{\varSigma }}\,:\,{\bar{y}}_{\sigma }=m\}}\,F(\bar{{{{\bf{y}}}}}\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G).$$

Problem (4) returns the max-marginal score μ for each candidate m, that is, the maximum compatibility score that any label sequence \(\bar{{{{\bf{y}}}}}\in {{\varSigma }}\) with \({\bar{y}}_{\sigma }=m\) can achieve. One can interpret equation (2) as the log-space representation of an unnormalized Markov random field probability distribution over y associated with an undirected graphical model G (ref. 44).

Feasible inference using random spanning trees

For general graphs G, the maximum a posteriori inference problem (that is, finding the highest-scoring label sequence y given an MS feature sequence) is \({{{\mathcal{N}}}}{{{\mathcal{P}}}}\)-hard53,54. The max-marginals inference (MMAP), needed for the candidate ranking, is an even harder problem, being \({{{\mathcal{N}}}}{{{\mathcal{P}}}}^{{{{\rm{PP}}}}}\)-complete54. However, efficient inference approaches have been developed. In particular, if G is tree-like, we can efficiently compute the max-marginals using dynamic programming and the max-product algorithm43,44. Such tree-based approximations have been shown to be successful in various practical applications34,45,46.
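To make the tree-structured inference concrete, the following sketch (our illustration, not the authors' code) computes exact max-marginals on a path graph, a special case of a spanning tree, via one forward and one backward max-product pass; the 1/|V| and 1/|E| normalization of equation (2) is omitted for brevity:

```python
import numpy as np

def chain_max_marginals(theta, g):
    """Exact max-marginals on a path graph via two max-product passes.

    theta: list of L 1-D arrays; theta[s][m] is the node score of
           candidate m at MS feature s.
    g:     list of L - 1 2-D arrays; g[s][m, mp] is the edge score between
           candidate m at feature s and candidate mp at feature s + 1.
    Returns a list mu with mu[s][m] = max over all label sequences y with
    y_s = m of the (unnormalized) compatibility score.
    """
    L = len(theta)
    # Forward pass: alpha[s][m] = best score of a prefix ending in m at s.
    alpha = [theta[0]]
    for s in range(L - 1):
        alpha.append(theta[s + 1] + np.max(alpha[s][:, None] + g[s], axis=0))
    # Backward pass: beta[s][m] = best score of the suffix after m at s.
    beta = [np.zeros_like(theta[L - 1])]
    for s in range(L - 2, -1, -1):
        beta.insert(0, np.max(g[s] + theta[s + 1][None, :] + beta[0][None, :], axis=1))
    return [a + b for a, b in zip(alpha, beta)]
```

A general spanning tree is handled analogously by passing messages from the leaves towards an arbitrarily chosen root and back.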

Here, we follow the work by ref. 34 and sample a set of random spanning trees (RST) \({{{\bf{T}}}}={\{{T}_{k}\}}_{k = 1}^{K}\) from G, whereby K denotes the size of the RST sample. Each tree Tk has the same node set V as G, but an edge set E(Tk) ⊆ E, with |E(Tk)| = L − 1, ensuring that Tk is a single connected component and cycle-free. We follow the sampling procedure used by ref. 34. Given the RST set T, we compute the averaged max-marginals to rank the molecular candidates34:

$$\bar{\mu }({y}_{\sigma }=m\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},{{{\bf{T}}}})=\frac{1}{K}\mathop{\sum }\limits_{k=1}^{K}\left(\mu ({y}_{\sigma }=m\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},{T}_{k})-\mathop{\max }\limits_{\bar{{{{\bf{y}}}}}\in {{\varSigma }}}\,F(\bar{{{{\bf{y}}}}}\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},{T}_{k})\right),$$

where we subtract the maximum compatibility score from the marginal values corresponding to the individual trees to normalize the marginals before averaging34. This normalization value can be efficiently computed given the max-marginals μ. In our experiments, we train K individual models (wk) and associate them with the trees Tk to increase the diversity. The influence of the number of SSVM models on the prediction performance is shown in Extended Data Fig. 4.
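Given per-tree max-marginal arrays, the averaging with normalization in equation (5) is a one-liner. The sketch below (our illustration) assumes equal candidate-set sizes, so that each tree's max-marginals fit in one (L × C) array, and uses the fact that the global maximum of the max-marginals equals the maximum compatibility score:

```python
import numpy as np

def averaged_max_marginals(mu_per_tree):
    """Equation (5): subtract each tree's maximum compatibility score
    (= the global maximum of its max-marginals) and average over trees."""
    return np.mean([mu_k - mu_k.max() for mu_k in mu_per_tree], axis=0)
```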

The SSVM model

To train the model parameters w (equation (2)), we implemented a variant of the SSVM35,36. Its primal optimization problem is given as55:

$$\begin{array}{ll}\mathop{\min }\limits_{{{{\bf{w}}}},{{{\mathbf{\xi }}}}}&\frac{1}{2}\parallel {{{\bf{w}}}}{\parallel }^{2}+\frac{C}{N}\mathop{\sum }\limits_{i=1}^{N}{\xi }_{i}\\ {{{\rm{st.}}}}&F({{{{\bf{y}}}}}_{i}\,| \,{{{{\bf{x}}}}}_{i},{{{{\bf{t}}}}}_{i},{{{\bf{w}}}},{G}_{i})-F({{{\bf{y}}}}\,| \,{{{{\bf{x}}}}}_{i},{{{{\bf{t}}}}}_{i},{{{\bf{w}}}},{G}_{i})\ge \ell ({{{{\bf{y}}}}}_{i},{{{\bf{y}}}})-{\xi }_{i}\\ &\forall i\in \{1,\ldots ,N\},\,\forall {{{\bf{y}}}}\in {{{\varSigma }}}_{i},\end{array}$$

where C > 0 is the regularization parameter, ξi ≥ 0 is the slack variable for example i and \(\ell :{{{\varSigma }}}_{i}\times {{{\varSigma }}}_{i}\to {{\mathbb{R}}}_{\ge 0}\) is a function capturing the loss between two label sequences. The constraint set definition (st.) of problem (6) leads to a parameter vector w that is trained according to the max-margin principle35,36,47, that is, the score F(yi) of the correct label sequence should be greater than the score F(y) of any other label sequence by at least the specified margin ℓ(yi, y). Note that in the SSVM problem (6), a different graph Gi = (Vi, Ei) can be associated with each training example i, allowing, for example, the processing of sequences of different lengths.

We solve (6) in its dual formulation and use the Frank–Wolfe algorithm56 following the recent work by ref. 55. In the Supplementary Information, we derive the dual problem and demonstrate how to solve it efficiently using the Frank–Wolfe algorithm and RST approximations for Gi. Optimizing the dual problem enables us to use non-linear kernel functions \(\lambda :{{{\mathcal{Y}}}}\times {{{\mathcal{Y}}}}\to {{\mathbb{R}}}_{\ge 0}\) measuring the similarity between the molecular structures associated with the label sequences.

The label loss function is defined as follows:

$$\ell ({{{{\bf{y}}}}}_{i},{{{\bf{y}}}})=\frac{1}{| {V}_{i}| }\mathop{\sum }\limits_{\sigma =1}^{L}\left(1-\lambda ({y}_{i\sigma },{y}_{\sigma })\right)$$

and satisfies ℓ(y, y) = 0 (a required property55) if λ is a normalized kernel, which holds true in our experiments (we used the MinMax kernel57).
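Written out, the label loss with a MinMax kernel over counting fingerprints looks as follows (an illustrative sketch; the fingerprint vectors stand in for the molecular structures):

```python
import numpy as np

def minmax_kernel(a, b):
    """MinMax kernel on counting fingerprints; normalized, so k(a, a) = 1."""
    return np.minimum(a, b).sum() / np.maximum(a, b).sum()

def label_loss(y_true, y_pred):
    """Label loss above: mean of (1 - lambda) over the MS features of a sequence."""
    return float(np.mean([1.0 - minmax_kernel(a, b) for a, b in zip(y_true, y_pred)]))
```

Because the kernel is normalized, the loss of a sequence against itself is exactly zero, as required.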

Pre-processing pipeline for raw MassBank records

Extended Data Fig. 5 illustrates our MassBank pre-processing pipeline implemented in the Python package ‘massbank2db’52. First, the MassBank record text files were parsed, and the MS2 spectrum, ground-truth annotation, RT and meta-information were extracted. Records with missing MS2, RT or annotation were discarded. We used the MassBank 2020.11 release for our experiments.

Subsequently, we grouped the MassBank records into subsets (denoted as MB-subsets) where the (MS2, RT)-tuples were measured under the same LC and MS conditions.

Supplementary Table 1 summarizes the grouping criteria. In the next step, we used the InChIKey58 identifier in MassBank to retrieve the SMILES59 representation from PubChem20 (1 February 2021), rather than using the contributor-supplied SMILES. This ensures a consistent SMILES source for the molecular candidates and ground-truth annotations.

Three more filtering steps were performed before creating the final database, to remove records: (1) if the ground-truth exact mass deviated too far (>20 ppm) from the calculated exact mass based on the precursor mass-per-charge and adduct type; (2) if the subset contained <50 unique molecular structures; (3) if they were potential isobars (see pull-request #152 in the MassBank GitHub repository).

Supplementary Table 3 summarizes the LC-MS2 meta-information for all generated MB-subsets.

Generating the molecular candidate sets

We used SIRIUS8,17 to generate the molecular candidate sets. For each MassBank record, the ground-truth molecular formula was used by SIRIUS to collect the candidate structures from PubChem20. The candidate sets generated by SIRIUS contain a single stereoisomer per candidate, identified by their InChIKey first block (structural skeleton). To study the ability of LC-MS2Struct to annotate the stereochemical variant of the molecules, we enriched the SIRIUS candidate sets with stereoisomers, using the InChIKey first block of each candidate to search PubChem (1 February 2021) for stereoisomers. The additional molecules were then added to the candidate sets.

Pre-computing the MS2 matching scores

For each MB-subset, MS2 spectra with identical adduct type (for example, [M + H]+) and ground-truth molecular structure were aggregated. Depending on the MS2 scorer, we either merged the MS2 spectra into a single spectrum (CFM-ID and MetFrag), following the strategy by ref. 11, or provided the MS2 spectra separately (SIRIUS). For the spectrum merging, we used the ‘mzClust_hclust’ function of the xcms package60, which first combines the peaks of all MS2 spectra into a single peak list and subsequently merges peaks based on a mass error threshold.

To compute the CFM-ID (v4.0.7) MS2 matching score, we first predicted the in silico MS2 spectra for all molecular candidate structures based on their isomeric SMILES representation using the pre-trained CFM-ID models (Metlin 2019 MSML) by ref. 18. We merged the three in silico spectra predicted by CFM-ID for different collision energies and compared them with the merged MassBank spectrum using the modified cosine similarity61 implemented in the matchms62 (v0.9.2) Python library. For MetFrag (v2.4.5), the MS2 matching scores were calculated using the FragmenterScore feature based on the isomeric SMILES representation of the candidates. For SIRIUS, the required fragmentation trees were computed using the ground-truth molecular formula of each MassBank spectrum. SIRIUS uses canonical SMILES, which do not encode stereochemical information. Therefore, we used the same SIRIUS MS2 matching score for all stereoisomers sharing the same InChIKey first block.

For all three MS2 scorers, we normalized the MS2 matching scores to the range [0, 1] separately for each candidate set. For the machine-learning-based scorers (CFM-ID and SIRIUS), the matching scores of the candidates associated with a particular MassBank record used in evaluation were predicted using models that did not include its ground-truth structure (determined by InChIKey first block).

If an MS2 scorer failed on a MassBank record, we assigned a constant MS2 score to each candidate.
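As a sketch of this per-candidate-set pre-processing (one simple choice, assuming non-negative raw scores; not necessarily the exact normalization of each scorer), max-normalization maps scores into (0, 1], and a failed scorer yields the same constant for every candidate:

```python
import numpy as np

def normalize_ms2_scores(scores):
    """Max-normalize raw MS2 matching scores within one candidate set.

    If the scorer failed (all scores zero), every candidate receives the
    same constant score instead.
    """
    s = np.asarray(scores, dtype=float)
    if s.size == 0 or s.max() <= 0:
        return np.ones_like(s) if s.size else s
    return s / s.max()
```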

Molecular feature representations

Extended connectivity fingerprints with functional classes (FCFP)38 were used to represent molecular structures in our experiments. We employed RDKit (v2021.03.1) to generate counting FCFP fingerprints. The fingerprints were computed based on the isomeric SMILES, using the parameter ‘useChirality’ to generate fingerprints that either encoded stereochemistry (3D) or not (2D). To define the set of substructures in the fingerprint vector, we first generated all possible substructures, using an FCFP radius of two, based on a set of 50,000 randomly sampled molecular candidates associated with our training data, and all the ground-truth training structures, resulting in 6,925 (3D) and 6,236 (2D) substructures. We used 3D FCFP fingerprints in our experiments, except for the experiments focusing on the annotation of stereoisomers, where we used both 2D and 3D fingerprints for comparison. We used the MinMax kernel57 to compute the similarity between the molecules.

Computing molecular categories

For the analysis of the ranking performance for different molecular categories, we used two classification systems, ClassyFire51, which classifies molecules according to their structure, and PubChemLite40, which classifies molecules according to information available for ten exposomics-relevant categories. For ClassyFire, we used the ‘classyfireR’ R package to retrieve the classification for each ground-truth molecular structure in our dataset. For PubChemLite, the classification categories were retrieved via InChIKey first block matching of each molecular structure; if it was not found in PubChemLite, the category ‘noClassification’ was assigned.

Training and evaluation data set-ups

We considered only MassBank data that have been analysed using an LC reversed-phase (RP) column. We removed molecules from the data if their measured RT was less than three times the estimated column dead-time63, as we considered such molecules to be non-retaining.

We considered two separate data set-ups. The first one, denoted by ALLDATA, used all available MassBank data to train and evaluate LC-MS2Struct. This set-up was used to compare the different candidate ranking approaches as well as to investigate the performance across various molecular classes. The second set-up, denoted by ONLYSTEREO, used MassBank records where the ground-truth molecular structure contains stereochemical information, that is, where the InChIKey second block is not ‘UHFFFAOYSA’. This set-up was used in the experiments regarding the ability of LC-MS2Struct to distinguish stereochemistry. In the training, we additionally used MassBank records that appear only without stereochemical information in our candidate sets, identified by the InChIKey second block equal to ‘UHFFFAOYSA’ in PubChem. The number of available training and evaluation (MS2, RT)-tuples per MB-subset are summarized in Supplementary Table 2.

For each MB-subset, we sampled a set of LC-MS2 experiments, that is, (MS2, RT)-tuple sequences, from the available evaluation data. The number of LC-MS2 experiments (n below) depended on the number of available (MS2, RT)-tuples (Supplementary Table 2) as follows:

$$n=\left\{\begin{array}{ll}0\quad &{{{\rm{if}}}}\,| {{{\mathcal{D}}}}| < 30\\ 1\quad &{{{\rm{else}}}}\,{{{\rm{if}}}}\,30\le | {{{\mathcal{D}}}}| \le 75\\ 15\quad &{{{\rm{else}}}}\,{{{\rm{if}}}}\,76\le | {{{\mathcal{D}}}}| \le 250\\ \left\lfloor \frac{| {{{\mathcal{D}}}}| }{50}\right\rfloor \quad &{{{\rm{else}}}}.\end{array}\right.$$

where \({{{\mathcal{D}}}}\) is a set of (MS2, RT)-tuples with ground-truth annotation and molecular candidate sets associated with an MB-subset. If there are fewer than 30 (MS2, RT)-tuples available, we do not generate an evaluation LC-MS2 experiment from the corresponding MB-subset. On the basis of this sampling scheme, we obtained 354 and 94 LC-MS2 experiments for ALLDATA and ONLYSTEREO, respectively, for our evaluation (Supplementary Table 2).
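The sampling rule above translates directly into code (hypothetical function name):

```python
def n_experiments(d):
    """Number of evaluation LC-MS2 experiments sampled from an MB-subset
    with d annotated (MS2, RT)-tuples, following the piecewise rule above."""
    if d < 30:
        return 0
    elif d <= 75:
        return 1
    elif d <= 250:
        return 15
    else:
        return d // 50
```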

We trained eight (K = 8) separate SSVM models wk for each evaluation LC-MS2 experiment. For each SSVM model, we first generated a set containing the (MS2, RT)-tuples from all MB-subsets. Then, we removed all tuples whose ground-truth molecular structure, determined by the InChIKey first block, was in the respective evaluation LC-MS2 experiment. Lastly, we randomly sampled LC-MS2 experiments from the training tuples, within their respective MB-subset, with a length randomly chosen from 4 to (at most) 32 (see also Fig. 1e) and an RST Tik assigned to each MS feature sequence i. In total, 768 LC-MS2 training experiments were generated for each SSVM model. To speed up the model training, we restricted the candidate set size \(| {{{{\mathcal{C}}}}}_{i\sigma }|\) of each training MS feature σ to at most 75 candidate structures by random subsampling, ensuring that the correct candidate was included in the subsample. Each SSVM model wk was applied to the evaluation LC-MS2 experiment, associated with different RSTs Tk, and the averaged max-marginal scores were used for the final candidate ranking (see equation (5) and Fig. 1c).

SSVM hyperparameter optimization

The SSVM regularization parameter C was optimized for each training set separately using grid search and evaluation on a random validation set sampled from the training data’s (MS2, RT)-tuples (33%). A set of LC-MS2 experiments was generated from the validation set and used to determine the normalized discounted cumulative gain (NDCG)64 for each C value. The regularization parameter with the highest NDCG value was chosen to train the final model. We used the scikit-learn65 (v0.24.1) Python package to compute the NDCG value, considering ranks up to 10 (NDCG@10), and defined the relevance of each candidate as 1 if it is the correct one and 0 otherwise. To reduce the training time, we searched for the optimal C* only for SSVM model k = 0 and used C* for the other models with k > 0.
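With binary relevance and a single relevant candidate, NDCG@k reduces to a simple closed form. The sketch below is a simplified stand-in for scikit-learn's ndcg_score for this special case (ties broken by candidate index here, which is an assumption of this sketch):

```python
import numpy as np

def ndcg_at_k(scores, correct_idx, k=10):
    """NDCG@k with binary relevance and a single relevant candidate.

    The ideal DCG is 1/log2(2) = 1, so the NDCG equals 1/log2(1 + rank)
    if the correct candidate is ranked within the top k, and 0 otherwise.
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    rank = int(np.where(order == correct_idx)[0][0]) + 1
    return 1.0 / np.log2(1 + rank) if rank <= k else 0.0
```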

Ranking performance evaluation

We computed the ranking performance (top-k accuracy) for a given LC-MS2 experiment using the tie-breaking strategy described in ref. 8: if a ranking method assigns an identical score to a set of n molecular candidates, then all accuracies at the ordinal ranks k at which one of these candidates is found are increased by 1/n. We computed a candidate score (that is, only-MS2, LC-MS2Struct and so on) for each molecular structure in the candidate set (identified by PubChem CID). Depending on the data set-up (Supplementary Table 4), we first collapsed the candidates by InChIKey first block (ALLDATA, method comparison and molecule category analysis) or full InChIKey (ONLYSTEREO, stereochemistry prediction), assigning the maximum candidate score to each InChIKey first block or InChIKey group, respectively. Subsequently, we computed the top-k accuracy based on the collapsed candidate sets.
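The tie-aware contribution of a single MS feature to the top-k accuracy can be sketched as follows: with b candidates scoring strictly higher than the correct one and n candidates tied with it, the ranks b + 1 to b + n each contribute 1/n (illustrative helper, not the authors' code):

```python
import numpy as np

def topk_contribution(scores, correct_idx, k):
    """Tie-aware top-k accuracy contribution of one MS feature (ref. 8 rule)."""
    s = np.asarray(scores, dtype=float)
    b = int(np.sum(s > s[correct_idx]))   # candidates strictly better than the correct one
    n = int(np.sum(s == s[correct_idx]))  # tied candidates, including the correct one
    return float(np.clip(k - b, 0, n)) / n
```

The overall top-k accuracy is then the average of these contributions over all MS features of an experiment.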

For the performance analysis of individual molecule categories, either ClassyFire51 or PubChemLite40 classes, we first computed the rank of the correct molecular structure for each (MS2, RT)-tuple of each LC-MS2 evaluation experiment based on the only-MS2 and LC-MS2Struct scores. Subsequently, we computed the top-k accuracy for each molecule category associated with at least 50 unique ground-truth molecular structures (based on the InChIKey first block). As a ground-truth structure can appear multiple times in our dataset, we generated 50 random samples, each containing only one example per unique structure, and computed the averaged top-k accuracy.

Comparison of LC-MS2Struct with other approaches

We compared LC-MS2Struct with three different approaches to integrate MS2 and RT information, namely RT filtering, logP prediction and RO prediction.

For RT filtering (MS2 + RT), we followed ref. 26, which used the relative error \(\epsilon =\frac{| \hat{t}-{t}_{\sigma }| }{{t}_{\sigma }}\) between the predicted (\(\hat{t}\)) and observed (tσ) RT. We set the filtering threshold to the 95% quantile of the relative RT prediction errors estimated from the RT model’s training data, following refs. 27,29. We used scikit-learn’s65 (v0.24.1) implementation of support vector regression66 with a radial basis function kernel for the RT prediction. For the support vector regression, we used the same 196 features, computed using RDKit (v2021.03.1), as in ref. 25.
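The RT filter can be sketched as follows (illustrative helper names; the threshold is the 95% quantile of the RT model's training-set relative errors, as described above):

```python
import numpy as np

def rt_filter_mask(pred_rts, observed_rt, train_rel_errors):
    """Boolean mask keeping candidates whose relative RT error
    eps = |t_hat - t_obs| / t_obs stays below the 95% quantile threshold."""
    threshold = np.quantile(np.asarray(train_rel_errors, dtype=float), 0.95)
    eps = np.abs(np.asarray(pred_rts, dtype=float) - observed_rt) / observed_rt
    return eps <= threshold
```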

For logP prediction (MS2 + logP), we followed ref. 11, which assigned a weighted sum of an MS2 and a logP score \(s=\beta{s}_{{{{\rm{MS{}}}^{{2}}}}}(m)+(1-\beta ){s}_{{{{\rm{log}}P}}}(m)\) to each candidate \(m\in {{{{\mathcal{C}}}}}_{\sigma }\) and used it to rank the set of molecular candidates. The logP score is given by \({s}_{{{{\rm{log}}P}}}(m)=\frac{1}{\delta \sqrt{2\uppi }}\exp \left(-\frac{{({{{{\rm{log}}P}}}_{m}-{{{{\rm{log}}P}}}_{\sigma })}^{2}}{2{\delta }^{2}}\right)\), where logPm is the predicted XLogP350 extracted from PubChem20 for candidate m, and logPσ = a · tσ + b is the XLogP3 value of the unknown compound, associated with MS feature σ, predicted based on its measured RT tσ. The parameters a and b of the linear regression model were determined using a set of RT and XLogP3 tuples associated with the LC system. As in ref. 11, we set δ = 1.5 and chose β to maximize the top-1 candidate ranking accuracy, calculated from a set of 25 randomly generated training LC-MS2 experiments.
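The Gaussian logP agreement score and the weighted candidate score translate directly into code (illustrative sketch; the regression coefficients a and b are assumed to be fitted beforehand, so the predicted value logp_sigma = a * t_sigma + b is passed in directly):

```python
import numpy as np

def logp_score(logp_m, logp_sigma, delta=1.5):
    """Gaussian agreement between a candidate's XLogP3 value and the value
    predicted from the measured RT."""
    return np.exp(-(logp_m - logp_sigma) ** 2 / (2 * delta ** 2)) / (delta * np.sqrt(2 * np.pi))

def combined_score(s_ms2, logp_m, logp_sigma, beta, delta=1.5):
    """Weighted candidate score s = beta * s_MS2 + (1 - beta) * s_logP."""
    return beta * s_ms2 + (1 - beta) * logp_score(logp_m, logp_sigma, delta)
```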

For RO prediction (MS2 + RO), we used the approach by ref. 34, which relies on a RankSVM implementation in the Python library ROSVM31,67 (v0.5.0). We used counting ‘substructure’ fingerprints calculated using the CDK (v2.5)68 and the MinMax kernel57. The MS2 matching scores and the predicted ROs were used to compute max-marginal ranking scores using the framework by ref. 34; we used the authors’ implementation in version 0.2.369. The hyper-parameters β and k of the model were optimized for each evaluation LC-MS2 experiment separately using the respective training data. To estimate β, we generated 25 LC-MS2 experiments from the training data and selected the β that maximized the Top20AUC34 ranking performance. The sigmoid parameter k was estimated using Platt’s method70, calibrated on RankSVM’s training data. We used 128 random spanning trees per evaluation LC-MS2 experiment to compute the averaged max-marginals.

For the experiments comparing the different methods, we used all LC-MS2 experiments generated, except the ones from the MB-subsets ‘CE_001’, ‘ET_002’, ‘KW_000’ and ‘RP_000’ (Supplementary Table 2). For those subsets, the evaluation LC-MS2 experiment contains all available (MS2, RT)-tuples, leaving no LC-system-specific data to train the RT (MS2 + RT) or logP (MS2 + logP) prediction models. The RT and logP prediction models were trained in a structure-disjoint fashion using the RT data of the particular MB-subset associated with the evaluation LC-MS2 experiment. The RO prediction model used by MS2 + RO was trained structure-disjoint as well, but using the RTs of all MB-subsets.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.