Abstract
Structural annotation of small molecules in biological samples remains a key bottleneck in untargeted metabolomics, despite rapid progress in predictive methods and tools during the past decade. Liquid chromatography–tandem mass spectrometry, one of the most widely used analysis platforms, can detect thousands of molecules in a sample, the vast majority of which remain unidentified even with best-of-class methods. Here we present LC-MS2Struct, a machine learning framework for structural annotation of small-molecule data arising from liquid chromatography–tandem mass spectrometry (LC-MS2) measurements. LC-MS2Struct jointly predicts the annotations for a set of mass spectrometry features in a sample, using a novel structured prediction model trained to optimally combine the output of state-of-the-art MS2 scorers and observed retention orders. We evaluate our method on a dataset covering all publicly available reversed-phase LC-MS2 data in the MassBank reference database, including 4,327 molecules measured using 18 different LC conditions from 16 contributors, greatly expanding the chemical analytical space covered in previous multi-MS2 scorer evaluations. LC-MS2Struct obtains significantly higher annotation accuracy than earlier methods and improves the annotation accuracy of state-of-the-art MS2 scorers by up to 106%. The use of stereochemistry-aware molecular fingerprints improves prediction performance, which highlights limitations in existing approaches and has strong implications for future computational LC-MS2 developments.
Main
Structural annotation of small molecules in biological samples is a key bottleneck in various research fields including biomedicine, biotechnology, drug discovery and environmental sciences. Samples in untargeted metabolomics studies typically contain thousands of different molecules, the vast majority of which remain unidentified1,2,3. Liquid chromatography–tandem mass spectrometry (LC-MS2) is one of the most widely used analysis platforms4, as it allows for high-throughput screening, is highly sensitive and is applicable to a wide range of molecules. In LC-MS2, molecules are first separated by their differing physicochemical interactions with the mobile and stationary phases of the column in the liquid chromatographic system, resulting in retention time (RT) differences. Subsequently, they are separated according to their mass-to-charge ratio in a mass analyser (MS1). Finally, the molecular ions are isolated and fragmented in the tandem mass spectrometer (MS2).
For each ion, the recorded fragments and their intensities constitute the MS2 spectrum, which contains information about the substructures in the molecule and serves as a basis for annotation efforts. In typical untargeted LC-MS2 workflows, thousands of MS features (MS1, MS2, RT) arise from a single sample. The goal of structural annotation is to associate each feature with a candidate molecular structure, for further downstream interpretation.
In recent years, many powerful methods5,6 to predict structural annotations for MS2 spectra have been developed7,8,9,10,11,12,13,14,15,16,17,18. In general, these methods find candidate molecular structures potentially associated with the MS feature, for example, by querying molecules with a certain mass from a structure database such as Human Metabolome Database (HMDB)19 or PubChem20 and subsequently computing a match score between each candidate and the MS2 spectrum. The highest-scoring candidate is typically considered as the structure annotation of a given MS2. Currently, even the best-of-class methods only reach an annotation accuracy of around 40% (ref. 17) in evaluations when searching large candidate sets such as those retrieved from PubChem. Therefore, in practice, a ranked list of molecular structures is provided to the user (for example, the top-20 structures). This level of performance is still a considerable hindrance in metabolomics and other fields.
Interestingly, RT information remains underutilized in automated approaches for structure annotation based on MS2, despite RTs being readily available in all LC-MS2 pipelines and generally recognized as contributing valuable information21,22. An explanation is that a molecule generally has different RTs under different LC conditions (mobile phase, column composition and so on)23,24. Typically, the RT information is used for post-processing of candidate lists, for example, by comparing measured and reference standard RTs3,24. This approach, however, is limited by the availability of experimentally determined RTs of reference standards. In contrast, RT prediction models24,25 allow RTs to be predicted based solely on the molecular structure of the candidate, and have been successfully applied to aid structure annotation11,26,27,28,29. However, such prediction models generally have to be calibrated to the specific LC configuration3, requiring at least some amount of target LC reference standard RT data to be available21,29,30. Recently, the idea of predicting retention orders (ROs), that is, the order in which two molecules elute from the LC column, has been explored31,32,33,34. ROs are largely preserved within a family of LC systems (for example, reversed-phase or hydrophilic interaction LC systems). Therefore, RO predictors can be trained using a diverse set of RT reference data, and applied to out-of-dataset LC set-ups31. Integration of MS2- and RO-based scores using probabilistic graphical models improved the annotation performance in LC-MS2 experiments34.
Another somewhat neglected aspect in automated annotation pipelines is the treatment of stereochemistry, that is, the different three-dimensional (3D) variants of the molecules. The general assumption has been that LC-MS2 data do not contain sufficient information to separate stereoisomers in samples5,24. As a result, MS2 scorers typically disregard the stereochemical information in the candidate structures and often output the same matching score for different stereoisomers (compare refs. 7,17). However, stereoisomers that vary in their double-bond orientation (for example, cis–trans or E–Z isomerism) may have different shapes and thus exhibit different fragmentation and/or interactions with the LC system. Thus, ignoring stereochemistry in candidate processing may disregard LC-relevant stereochemical information. Furthermore, it is known that certain stereochemical configurations occur more frequently than others in nature and hence in the reference databases. Making use of such information can potentially improve annotation performance.
In this Article, we set out to provide a new perspective on jointly using MS2 and RO combined with stereochemistry-aware molecular features for the structure annotation of LC-MS2 data. We present a novel machine learning framework called LC-MS2Struct, which learns to optimally combine the MS2 and RO information for the accurate annotation of a sequence of MS features. LC-MS2Struct relies on the structured support vector machine (SSVM)35 and max-margin Markov network36 frameworks. In contrast to the previous work of ref. 34, our framework does not require a separately learned RO prediction model. Instead, it optimizes the SSVM parameters such that the score margin between the correct and any other sequence of annotations is maximized. This way, LC-MS2Struct learns to optimally use the RO information from a set of LC-MS2 experiments. We trained and evaluated LC-MS2Struct on all available reversed-phase LC data from MassBank37, including a combined total of 4,327 molecules from 18 different LC configurations, hence reaching a high level of measurement diversity in the model evaluation. Our framework is compared with three other approaches: RT filtering, logP predictions11 and RO predictions34. LC-MS2Struct can be combined with any MS2 scorer, and is demonstrated with the CFM-ID9,18, MetFrag7,11 and SIRIUS8,17 tools. The use of chirality-encoding circular molecular fingerprints38 in the predictive model makes it possible to distinguish and rank different stereoisomers based on the observed ROs.
Overview of LC-MS2Struct
Input and output
We consider a typical data setting in untargeted LC-MS2-based experiments, after pre-processing such as chromatographic peak picking and alignment (Fig. 1a). Such data comprise a sequence of MS features, here indexed by σ. Each feature consists of MS1 information (for example, mass, adduct and isotope pattern), LC retention time (RT) tσ and an MS2 spectrum xσ. We assume that a set of candidate molecules \({{{{\mathcal{C}}}}}_{\sigma }\) is associated with each MS feature σ. Such a set can be, for example, generated from a structure database (for example, PubChem20, ChemSpider39 or PubChemLite40) based on the ion’s mass, a suspect list or an in silico molecule generator (for example, SmiLib v2.041,42). We furthermore require that for MS2 spectrum xσ, a matching score θ(xσ, m) with its candidates \(m\in {{{{\mathcal{C}}}}}_{\sigma }\) is pre-computed using an in silico tool, such as CFM-ID9,18, MetFrag11 or SIRIUS8,17. LC-MS2Struct predicts a score for MS feature σ and each associated candidate \(m\in {{{{\mathcal{C}}}}}_{\sigma }\) based on a sequence of spectra \({{{\bf{x}}}}={({x}_{\sigma })}_{\sigma = 1}^{L}\), of length L, and the ROs derived from the observed RTs \({{{\bf{t}}}}={({t}_{\sigma })}_{\sigma = 1}^{L}\). These scores are used to rank the molecular candidates associated with the MS features (Fig. 1b).
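To make the data setting concrete, the following minimal sketch (with hypothetical class and field names, not part of the released LC-MS2Struct code) outlines the information attached to each MS feature and to a full LC-MS2 experiment:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MSFeature:
    ms2_spectrum: List[Tuple[float, float]]  # (m/z, intensity) peaks, x_sigma
    retention_time: float                    # t_sigma
    candidates: List[str]                    # SMILES of the candidates m in C_sigma
    ms2_scores: Dict[str, float]             # pre-computed theta(x_sigma, m) from an MS2 scorer

# An LC-MS2 experiment is a sequence of such features; LC-MS2Struct outputs a
# ranking score for every candidate of every feature in the sequence.
Experiment = List[MSFeature]
```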
Candidate ranking using max-marginals
We define a fully connected graph G = (V, E) capturing the MS features and modelling their dependencies (Fig. 1c), where V represents the set of nodes and E the set of edges. Each node σ ∈ V corresponds to an MS feature, and is associated with the pre-computed MS2 matching scores θ(xσ, m) between the MS2 spectrum xσ and all molecular candidates \(m\in {{{{\mathcal{C}}}}}_{\sigma }\). The graph G contains an edge (σ, τ) ∈ E for each MS feature pair. A scoring function F is defined that predicts a compatibility score between a sequence of molecular structure assignments \({{{\bf{y}}}}={({y}_{\sigma })}_{\sigma = 1}^{L}\) in the label-space \({{\varSigma }}={{{{\mathcal{C}}}}}_{1}\times \ldots \times {{{{\mathcal{C}}}}}_{L}\) and the observed data:

\(F({{{\bf{y}}}}\,|\,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G)=\mathop{\sum}\limits_{\sigma \in V}\theta ({x}_{\sigma },{y}_{\sigma })+\mathop{\sum}\limits_{(\sigma ,\tau )\in E}f({y}_{\sigma },{y}_{\tau }\,|\,{t}_{\sigma },{t}_{\tau },{{{\bf{w}}}}),\) (1)
where the function f outputs an edge score (Fig. 1d) expressing the agreement between the observed and the predicted RO for each candidate assignment pair (yσ, yτ), given the observed RTs (tσ, tτ). The function f is parameterized by the vector w, which is trained specifically for each MS2 scorer (see next section). Using the compatibility score function F (equation (1)), we compute the max-marginal scores43 for each candidate and MS feature, defined for a candidate \(m\in {{{{\mathcal{C}}}}}_{\sigma }\) and MS feature σ as the maximum compatibility score that a candidate assignment \(\bar{{{{\bf{y}}}}}\in {{\varSigma }}\) with \({\bar{y}}_{\sigma }=m\) can reach:

\(\mu ({y}_{\sigma }=m\,|\,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G)=\mathop{\max }\limits_{\bar{{{{\bf{y}}}}}\in {{\varSigma }},\,{\bar{y}}_{\sigma }=m}F(\bar{{{{\bf{y}}}}}\,|\,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G).\)
We use μ to rank the molecular candidates34. However, for general graphs G, the max-marginal inference problem (MMAP) is intractable. Therefore, we approximate the MMAP problem by performing the inference on tree-like graphs Tk randomly sampled from G (Fig. 1c), for which exact inference is feasible43,44. Here, k indexes the individual spanning trees. Subsequently, we average the max-marginal scores μ(yσ = m | xi, ti, wk, Tk) over a set of trees T, an approach that has performed well in practical applications34,45,46. The index i refers to the individual sequences of MS2 spectra and RTs. For each spanning tree Tk, we apply a separately trained SSVM model wk to increase the diversity of the predictions.
Joint annotation using SSVMs
We propose to tackle the joint assignment of candidate labels y ∈ Σ to the sequence of MS features of a LC-MS2 experiment through structured prediction, a family of machine learning methods generally used to annotate sequences or networks35,46,47. In our model, the structure is given by the observed RO of the MS feature pairs (yσ, yτ), which provides additional information on the correct candidate labels yσ and yτ. Given a set of annotated LC-MS2 experiments extracted from MassBank37 (Fig. 1e), we train an SSVM35 model w predicting the edge scores. SSVM models can be optimized using the max-margin principle35. In a nutshell, given a set of ground-truth-annotated MS feature sequences, the model parameters w are optimized such that the correct label sequence yi ∈ Σi, that is, the structure annotations for all MS features in an LC-MS2 experiment, scores higher than any other possible label sequence assignment y ∈ Σi (Fig. 1f).
Results
Extracting training data from MassBank
Ground-truth-annotated MS2 spectra and RTs were extracted from MassBank37, a public online database for MS2 data. Each individual MassBank record typically provides a rich set of meta-information (Supplementary Table 1), such as the chromatographic and MS conditions as well as molecular structure annotations. For training the SSVM model of LC-MS2Struct, the MassBank data were processed such that the experimental conditions were consistent within each MS feature set, that is, with identical LC set-up and MS configuration as in a typical LC-MS2 experiment, to ensure comparable RT, RO and MS2 data. We developed a Python package ‘massbank2db’ that can process MassBank records and group them into consistent MS feature sets, which we denote as MassBank subsets (MB-subsets). For our experiments, we sampled sequences of MS features from the MB-subsets to simulate real LC-MS2 experiments where the signals of multiple unknown compounds are measured under consistent experimental set-ups. Figure 1e illustrates the grouping and LC-MS2 sampling process. Two collections of MassBank data were considered: ALLDATA and the ONLYSTEREO subset.
Comparison of LC-MS2Struct with other approaches
In the first experiment, we compare LC-MS2Struct with previous approaches for candidate ranking either using only-MS2 or additionally using RT or RO information. Only-MS2 uses the MS2 spectrum information to rank the molecular candidates and serves as the baseline; MS2 + RO (ref. 34) uses a ranking support vector machine (RankSVM)48,49 to predict the ROs of candidate pairs and a probabilistic inference model to combine the ROs with MS2 scores; MS2 + RT uses predicted RTs to remove false-positive molecular structures from the MS2-score-ordered candidate set by comparing the predicted and observed RTs; MS2 + logP is an approach introduced by ref. 11, which uses the observed RT to predict the XlogP3 value50 of the unknown compound and compares it with the candidates’ XlogP3 values extracted from PubChem to refine the initial ranking based on the MS2 scores. The RO-based methods (LC-MS2Struct and MS2 + RO) were trained using the RTs from all available MB-subsets, ensuring that no test molecular structure (based on InChIKey first block, that is, the structural skeleton) was used for the model training (structure disjoint). For the RT-based approaches (MS2 + RT and MS2 + logP), the respective predictors were trained in a structure disjoint fashion using only the RT data available for that MB-subset. For the experiment, all MB-subsets with more than 75 (MS2, RT)-tuples from the ALLDATA data set-up were used (Supplementary Table 2), as the RT-based approaches require LC-system-specific RT training data. The ranking performance was computed for each LC-MS2 experiment within a particular MB-subset. The candidate molecules are identified by their InChIKey first block (the structural skeleton); hence, no stereoisomers are in the candidate sets.
Each candidate ranking approach was evaluated with three MS2 scorers: CFM-ID 4.018, MetFrag11 and SIRIUS17. For LC-MS2Struct, we use stereochemistry-aware molecular fingerprints (3D) to represent the candidates.
Figure 2a shows the average ranking performance (top-k accuracy) across 350 LC-MS2 experiments, each encompassing about 50 (MS2, RT)-tuples (Methods). LC-MS2Struct is the best-performing method combined with any of the three MS2 scorers. For CFM-ID and MetFrag, LC-MS2Struct provides 4.7 and 7.3 percentage unit increases over the only-MS2 for the top-1 accuracy, corresponding to 80.8% and 106% performance gain, respectively. In our setting, that translates to 2.4 and 3.7 additional annotations at the top rank, respectively (out of approximately 50). The performance improvement increases for larger k, reaching as far as 9.3 and 11.3 percentage units for the top-20, which means 4.7 and 5.7 additional correct structures, respectively, in the top-20. For SIRIUS, the improvements are more modest, on average around 2 percentage units for top-1 to top-20. This might be explained by the higher baseline performance of SIRIUS. Nevertheless, SIRIUS can be improved for particular MB-subsets (see Fig. 2b and the discussion in the next section).
The runner-up score integration method is MS2 + RO, which also makes use of predicted ROs. For CFM-ID and MetFrag, it leads to about one-third to one-half of the performance gain of LC-MS2Struct. The approaches relying on RTs, either by candidate filtering (MS2 + RT) or through logP prediction (MS2 + logP), lead to only minor improvements for MetFrag and CFM-ID, but none for SIRIUS, for which MS2 + RT even leads to a decrease in ranking performance by about 2 percentage units. An explanation for this is that the filtering approach removes on average 4.7% of the correct candidates, which leads to false-negative predictions.
The performance gain by using either RO or RT varies between the MB-subsets with differing LC-MS2 set-ups (Supplementary Table 3) and compound class compositions (Extended Data Fig. 1). We illustrate these differences in Fig. 2b. Applying LC-MS2Struct improves the ranking performance in almost all MB-subsets, including for the SIRIUS MS2 scorer (although very slight decreases were observed in some SIRIUS-scored subsets). This is in stark contrast to the RT-based approaches (MS2 + RT and MS2 + logP), which often lead to less accurate rankings, especially for SIRIUS. Furthermore, as seen already in the average results (Fig. 2a), the benefit of LC-MS2Struct depends on the MS2 base scorer. For example, the top-1 accuracy of the subsets ‘AC_003’ and ‘NA_003’ can be greatly improved for MetFrag but shows little improvement for CFM-ID. Both subsets consist of natural-product toxins, which are perhaps poorly explained by the bond-disconnection approach of MetFrag. In contrast, for ‘RP_001’ and ‘UF_003’, the largest improvements (top-1) can be reached for CFM-ID. The RT-filtering approach (MS2 + RT) performs particularly well for ‘LQB_000’ and ‘UT_000’. These subsets mostly contain lipids and lipid-like molecules (Extended Data Fig. 1).
Since the RT prediction models are trained using only data from the respective MB-subsets, more accurate models may be reached for less heterogeneous subsets of molecules. Hence, the RT filtering could work well in such cases26.
Performance for different compound classifications
Next we investigate how LC-MS2Struct can improve the annotation across different categories in two molecule classification systems, ClassyFire51 and PubChemLite40. Figure 3 shows the average top-1 and top-20 accuracy improvement of LC-MS2Struct over the only-MS2 baseline for each ClassyFire super-class and PubChemLite annotation category. For ClassyFire (Fig. 3a), the ranking performance improvement for the different super-classes depends on the MS2 scorer. For example, the top-1 accuracy of ‘Alkaloids and derivatives’ can be improved by 10.8 percentage units for MetFrag, but improves much less for CFM-ID and SIRIUS (1.9 and 3.5 percentage units, respectively). For ‘Organic oxygen compounds’, in contrast, the top-1 accuracy improves by about 10 percentage units when using both CFM-ID and MetFrag, whereas only half that improvement is observed for SIRIUS. This suggests that the CFM-ID results may be improved with the inclusion of more ‘Alkaloids and derivatives’. In addition, the ‘Alkaloids and derivatives’, ‘Organic acids and derivatives’ and ‘Organic nitrogen compounds’ appear less well explained by MetFrag (perhaps with more rearrangements, or less distinguishable spectra), such that the improvement from the RO approach is more apparent. For SIRIUS, ‘Lipids and lipid-like molecules’ as well as ‘Organic oxygen compounds’ benefit the most from LC-MS2Struct in top-1 (both improving by 5.7 percentage units) and top-20 (4.1 and 3.2 percentage units, respectively). In general, for ‘Lipid and lipid-like molecules’, LC-MS2Struct seems to achieve the best improvement (top-1 and top-20) over all MS2 scorers. However, depending on the MS2 scorer, this improvement distributes differently across the lipid sub-classes (Extended Data Fig. 2), such as ‘Fatty acyls’, ‘Prenol lipids’ or ‘Sphingolipids’.
For the PubChemLite classification (Fig. 3b), we also see that the MS2 scorers benefit differently from LC-MS2Struct. The improvement is generally close to the average improvement of the respective MS2 scorers and seems more equally distributed across the annotation categories.
For example for CFM-ID, the biggest top-1 improvements are in the ‘foodRelated’ and ‘noClassification’ categories. On the other hand, for SIRIUS the ‘pharmacoInfo’ and ‘bioPathway’ categories improve the most. MetFrag shows the most consistent performance improvement across the categories. ‘agroChemInfo’ benefits the least from LC-MS2Struct (top-1 and top-20). A possible explanation could be that the molecules categorized as agrochemicals are mainly ‘Benzenoids’ (48.5%), ‘Organoheterocyclic compounds’ (25.9%) and ‘Organic acids and derivatives’ (11.6%). As shown in Fig. 3a, these three ClassyFire classes show low (CFM-ID and MetFrag) or practically no (SIRIUS) improvement when using ROs.
Annotation of stereoisomers
Finally, we study whether LC-MS2Struct can annotate stereoisomers more accurately than MS2 alone, considering differences between stereoisomers that vary in their double-bond orientation (for example, cis–trans or E–Z isomerism), which may potentially lead to differences in their LC behaviour (Fig. 4a). We consider candidate sets containing stereoisomers and evaluate LC-MS2Struct only using MassBank records where the ground-truth structure has stereochemistry information provided, that is, where the InChIKey second block is not ‘UHFFFAOYSA’ (ONLYSTEREO data set-up; Methods). The molecular candidates are represented using two different molecular fingerprints: one that includes stereochemistry information (3D); and one that omits it (2D) (Methods). This allows us to assess the importance of stereochemistry-aware features for the structure annotation.
Figure 5a shows the ranking performance of LC-MS2Struct using 2D and 3D fingerprints.
When looking into the top-1 performance of LC-MS2Struct (3D) for the individual MS2 scorers, we observe an improvement by 2.6, 3.8 and 3.2 percentage units for CFM-ID, MetFrag and SIRIUS, respectively. This translates to performance gains of 87.3%, 95.9% and 44.3%, respectively.
In general, LC-MS2Struct improves the ranking for all three MS2 scorers. The improvement, however, is notably larger when using stereochemistry-aware (3D) candidate features. Interestingly, a similar behaviour could be observed in the ALLDATA setting (Extended Data Fig. 3), even though the absolute performance improvements were smaller. This experiment demonstrates that LC-MS2Struct can use RO information to improve the annotation of stereoisomers.
Discussion
LC-MS2Struct is a novel approach for the integration of MS2 and LC data for the structural annotation of small molecules. The method learns from the pairwise dependencies in the RO of MS features within similar LC configurations and can generalize across different, heterogeneous LC configurations. Furthermore, the use of stereochemistry-aware molecular fingerprints enables LC-MS2Struct to annotate stereoisomers in LC-MS2 experiments based on the observed ROs. In addition, our processing pipeline for grouping all (MS2, RT) data from MassBank into subsets of homogeneous LC-MS2 conditions, implemented and made available in the ‘massbank2db’52 Python package, will, we believe, make MassBank more accessible to other researchers and hence lower the bar of entry to computational metabolomics research.
Our experiments demonstrate that LC-MS2Struct annotates small molecules with an accuracy far superior to more traditional RT filtering and logP-based approaches, and also markedly better than previous methods that rely on ROs. In particular, compared with ref. 34, which used a graphical model as a post-hoc integration tool of MS2 scores and RO predictions, the benefits of learning the parameters of the graphical model are clear. All three studied MS2 scorers could be improved by LC-MS2Struct, including the best-of-class SIRIUS, for which improvements have generally been hard to come by due to its already high baseline accuracy. Our results show the superiority of stereochemistry-aware molecular features for the structure annotation of LC-MS2 data. Remarkably, this was the case not only for the annotation of stereoisomers but also for candidates distinguished by only their 2D structure. This result could be relevant for improving structural annotations in ion mobility separation–mass spectrometry with collision-cross-section measurements.
Our examples indicated that LC-MS2Struct separates candidates with varying double-bond stereochemistry, that is, E/Z and cis/trans isomers (see, for example, Fig. 4). However, there were very few examples of double-bond and/or chiral isomers measured on the same LC system in our dataset, which makes it difficult to quantify this effect or to interrogate it further until more such data are publicly available. Furthermore, as non-chiral LC cannot distinguish stereoisomers that differ only in their chiral centres, the development of more selective stereochemistry-aware molecular features that ignore chiral-centre annotations might be beneficial. We also note that the direct modelling of a node score (MS2 information) predictor in the SSVM would be possible. However, as the MS2 scorers used here are already relatively mature and well known in the community, we have left this research line open for future efforts.
Methods
Notation
We use the following notation to describe LC-MS2Struct:
where \({{{\mathcal{X}}}}\) and \({{{\mathcal{Y}}}}\) denote the MS2 spectra and the molecular structure space, respectively, \({{{\mathcal{C}}}}\) denotes a candidate set that is a subset of all possible molecular structures, and A × B denotes the Cartesian product of two sets A and B. For the purpose of model training and evaluation, we assume a dataset with ground-truth-labelled MS feature sequences: \({{{\mathcal{D}}}}={\{(({{{{\bf{x}}}}}_{i},{{{{\bf{t}}}}}_{i}),{{{{\boldsymbol{{{{\mathcal{C}}}}}}}}}_{i},{{{{\bf{y}}}}}_{i})\}}_{i = 1}^{N}\), where N denotes the total number of sequences. We use \(i,j\in {{\mathbb{N}}}_{\ge 0}\) to index MS feature sequences and \(\sigma ,\tau \in {{\mathbb{N}}}_{\ge 0}\) as indices for individual MS features within a sequence; for example, xiσ denotes the MS2 spectrum at index σ in the sequence i. The length of a sequence of MS features is denoted with L. We denote the ground-truth labels (candidate assignment) of sequence i with yi and any labelling with y. Both yi and y are in Σi. We use y to denote the candidate label variable, whereas m denotes a particular molecular structure. For example, yσ = m means that we assign the molecular structure m as the label of MS feature σ.
Graphical model for joint annotation of MS features
We consider the molecular annotation problem for the output of LC-MS2, which means assigning a molecular structure to each MS feature, as a structured prediction problem35,46,47, relying on a graphical model representation of the sets of MS features arising from an LC-MS2 experiment. For each MS feature σ, we want to predict a label yσ from a fixed and finite candidate (label) set \({{{{\mathcal{C}}}}}_{\sigma }\). We model the observed ROs between each MS feature pair (σ, τ) within an LC-MS2 experiment, as pairwise dependencies of the features. We define an undirected graph G = (V, E) with the vertex set V containing a node σ for each MS feature and the edge set E containing an edge for each MS feature pair E = {(σ, τ) ∣ σ, τ ∈ V, σ ≠ τ} (compare Fig. 1a,c). The resulting graph is complete with an edge between all pairs of nodes. This allows us to make use of arbitrary pairwise dependencies, instead of limiting to, say, adjacent RTs. This modelling choice was previously shown to be beneficial by ref. 34. Here we extend that approach by learning from the pairwise dependencies to optimize joint annotation accuracy, which leads to markedly improved annotation accuracy.
For learning, we define a scoring function F that, given the input MS feature sequences (x, t) and their corresponding sequence of candidate sets \({{{\boldsymbol{{{{\mathcal{C}}}}}}}}\), computes a compatibility score between the measured data and any possible sequence of labels y ∈ Σ:

\(F({{{\bf{y}}}}\,|\,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G)=\mathop{\sum}\limits_{\sigma \in V}\theta ({x}_{\sigma },{y}_{\sigma })+\mathop{\sum}\limits_{(\sigma ,\tau )\in E}\langle {{{\bf{w}}}},{{\bf{{\Gamma }}}}({{{{\bf{t}}}}}_{\sigma \tau },{{{{\bf{y}}}}}_{\sigma \tau })\rangle ,\) (2)
where \(\theta :{{{\mathcal{X}}}}\times {{{\mathcal{Y}}}}\to (0,1]\) is a function returning an MS2 matching score between the spectrum xσ and a candidate \({y}_{\sigma }\in {{{{\mathcal{C}}}}}_{\sigma }\), 〈⋅, ⋅〉 denotes the inner product, and w is a model weight vector to predict the RO matching score, based on the joint-feature vector \({{\bf{{\Gamma }}}}:{{\mathbb{R}}}_{\ge 0}\times {{\mathbb{R}}}_{\ge 0}\times {{{\mathcal{Y}}}}\times {{{\mathcal{Y}}}}\to {{{\mathcal{F}}}}\) between the observed RO derived from tστ = (tσ, tτ) and a pair of molecular candidates yστ = (yσ, yτ).
Equation (2) consists of two parts: (1) a score computed over the nodes in G capturing the MS2 information; and (2) a score expressing the agreement of observed and predicted RO computed over the edge set. We assume that the node scores are pre-computed by an MS2 scorer such as CFM-ID18, MetFrag11 or SIRIUS17. The node scores are normalized to (0, 1] within each candidate set \({{{{\mathcal{C}}}}}_{\sigma }\). The edge scores are predicted for each edge (σ, τ) using the model w and the joint-feature vector Γ:
with \(\phi :{{{\mathcal{Y}}}}\to {{{{\mathcal{F}}}}}_{{{{\mathcal{Y}}}}}\) being a function embedding a molecular structure into a feature space. The edge prediction function (3) will produce a high edge score if the observed RO (that is, sign(tσ − tτ)) agrees with the predicted one.
Using the compatibility score function (2), the predicted joint annotation for (x, t) corresponds to the highest-scoring label sequence \(\hat{{{{\bf{y}}}}}\in {{\varSigma }}:\hat{{{{\bf{y}}}}}=\arg \mathop{\max }\limits_{\bar{{{{\bf{y}}}}}\in {{\varSigma }}}\,F(\bar{{{{\bf{y}}}}}\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G)\). In practice, however, instead of predicting only the best label sequence, it can be useful to rank the molecular candidates \(m\in {{{{\mathcal{C}}}}}_{\sigma }\) for each MS feature σ. This is because for state-of-the-art MS2 scorers, the annotation accuracy in the top-20 candidate list is typically much higher than for the highest-ranked candidate (top-1).
Our framework provides candidate rankings by solving the following problem for each MS feature σ and \(m\in {{{{\mathcal{C}}}}}_{\sigma }\):

\(\mu ({y}_{\sigma }=m\,|\,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G)=\mathop{\max }\limits_{\bar{{{{\bf{y}}}}}\in {{\varSigma }},\,{\bar{y}}_{\sigma }=m}F(\bar{{{{\bf{y}}}}}\,|\,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G).\) (4)
Problem (4) returns the max-marginal score μ for each candidate m, that is, the maximum compatibility score any label sequence \(\bar{{{{\bf{y}}}}}\in {{\varSigma }}\) with \({\bar{y}}_{\sigma }=m\) can achieve. One can interpret equation (2) as the log-space representation of an unnormalized Markov random field probability distribution over y associated with an undirected graphical model G (ref. 44).
Feasible inference using random spanning trees
For general graphs G, the maximum a posteriori (MAP) inference problem (that is, finding the highest-scoring label sequence y given an MS feature sequence) is \({{{\mathcal{N}}}}{{{\mathcal{P}}}}\)-hard53,54. The max-marginals inference problem (MMAP), needed for the candidate ranking, is even harder and is \({{{\mathcal{N}}}}{{{{\mathcal{P}}}}}^{{{{\rm{PP}}}}}\)-complete54. However, efficient inference approaches have been developed. In particular, if G is tree-like, we can efficiently compute the max-marginals using dynamic programming and the max-product algorithm43,44. Such tree-based approximations have been shown to be successful in various practical applications34,45,46.
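For illustration, the sketch below (a simplified stand-in, not the released implementation, which handles arbitrary spanning trees) computes exact max-marginals on a chain, the simplest tree-structured graph, with the standard forward–backward max-product recursions; node scores correspond to the MS2 matching scores and edge scores to the learned RO agreement scores:

```python
import numpy as np

def chain_max_marginals(node_scores, edge_scores):
    """Exact max-marginals on a chain of MS features sigma = 0, ..., L-1.

    node_scores[s] : array of length |C_s| with the MS2 scores theta(x_s, m).
    edge_scores[s] : matrix of shape (|C_s|, |C_{s+1}|) with the edge scores
                     between consecutive MS features s and s + 1.
    Returns a list of arrays; entry s holds mu(y_s = m) for every m in C_s.
    """
    node_scores = [np.asarray(v, dtype=float) for v in node_scores]
    edge_scores = [np.asarray(e, dtype=float) for e in edge_scores]
    L = len(node_scores)

    # forward pass: best score over assignments of features 0..s ending in (s, m)
    alpha = [node_scores[0]]
    for s in range(1, L):
        prev = alpha[-1][:, None] + edge_scores[s - 1]        # |C_{s-1}| x |C_s|
        alpha.append(prev.max(axis=0) + node_scores[s])

    # backward pass: best score over assignments of features s+1..L-1 given (s, m)
    beta = [np.zeros(len(v)) for v in node_scores]
    for s in range(L - 2, -1, -1):
        nxt = edge_scores[s] + (node_scores[s + 1] + beta[s + 1])[None, :]
        beta[s] = nxt.max(axis=1)

    # max-marginal of fixing feature s to candidate m
    return [a + b for a, b in zip(alpha, beta)]
```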
Here, we follow the work by ref. 34 and sample a set of random spanning trees (RST) \({{{\bf{T}}}}={\{{T}_{k}\}}_{k = 1}^{K}\) from G, where K denotes the size of the RST sample. Each tree Tk has the same node set V as G, but an edge set E(Tk) ⊆ E with |E(Tk)| = L − 1, ensuring that Tk is a single connected component and cycle free. We follow the sampling procedure used by ref. 34. Given the RST set T, we compute the averaged max-marginals to rank the molecular candidates34:

\(\bar{\mu }({y}_{\sigma }=m\,|\,{{{\bf{x}}}},{{{\bf{t}}}})=\frac{1}{K}\mathop{\sum }\limits_{k=1}^{K}\left(\mu ({y}_{\sigma }=m\,|\,{{{\bf{x}}}},{{{\bf{t}}}},{{{{\bf{w}}}}}_{k},{T}_{k})-\mathop{\max }\limits_{\bar{{{{\bf{y}}}}}\in {{\varSigma }}}F(\bar{{{{\bf{y}}}}}\,|\,{{{\bf{x}}}},{{{\bf{t}}}},{{{{\bf{w}}}}}_{k},{T}_{k})\right),\) (5)
where we subtract the maximum compatibility score from the marginal values corresponding to the individual trees to normalize the marginals before averaging34. This normalization value can be efficiently computed given the max-marginals μ. In our experiments, we train K individual models (wk) and associate them with the trees Tk to increase the diversity. The influence of the number of SSVM models on the prediction performance is shown in Extended Data Fig. 4.
The SSVM model
To train the model parameters w (equation (2)), we implemented a variant of the SSVM35,36. Its primal optimization problem is given as55:

\(\mathop{\min }\limits_{{{{\bf{w}}}},{{{\boldsymbol{\xi }}}}}\ \frac{1}{2}{\left\Vert {{{\bf{w}}}}\right\Vert }^{2}+\frac{C}{N}\mathop{\sum }\limits_{i=1}^{N}{\xi }_{i}\quad {{\mathrm{s.t.}}}\quad F({{{{\bf{y}}}}}_{i}\,|\,{{{{\bf{x}}}}}_{i},{{{{\bf{t}}}}}_{i},{{{\bf{w}}}},{G}_{i})-F({{{\bf{y}}}}\,|\,{{{{\bf{x}}}}}_{i},{{{{\bf{t}}}}}_{i},{{{\bf{w}}}},{G}_{i})\ge \ell ({{{{\bf{y}}}}}_{i},{{{\bf{y}}}})-{\xi }_{i}\ \ \forall {{{\bf{y}}}}\in {{{\varSigma }}}_{i},\ {\xi }_{i}\ge 0,\ \forall i,\) (6)
where C > 0 is the regularization parameter, ξi ≥ 0 is the slack variable for example i and \(\ell :{{{\varSigma }}}_{i}\times {{{\varSigma }}}_{i}\to {{\mathbb{R}}}_{\ge 0}\) is a function capturing the loss between two label sequences. The constraint set definition (s.t.) of problem (6) leads to a parameter vector w that is trained according to the max-margin principle35,36,47, that is, the score F(yi) of the correct label sequence should be greater than the score F(y) of any other label sequence by at least the specified margin ℓ(yi, y). Note that in the SSVM problem (6), a different graph Gi = (Vi, Ei) can be associated with each training example i, allowing, for example, sequences of different lengths to be processed.
We solve (6) in its dual formulation and use the Frank–Wolfe algorithm56 following the recent work by ref. 55. In the Supplementary Information, we derive the dual problem and demonstrate how to solve it efficiently using the Frank–Wolfe algorithm and RST approximations for Gi. Optimizing the dual problem enables us to use non-linear kernel functions \(\lambda :{{{\mathcal{Y}}}}\times {{{\mathcal{Y}}}}\to {{\mathbb{R}}}_{\ge 0}\) measuring the similarity between the molecular structures associated with the label sequences.
The label loss function ℓ is defined as follows:

\(\ell ({{{\bf{y}}}},\bar{{{{\bf{y}}}}})=\frac{1}{L}\mathop{\sum }\limits_{\sigma =1}^{L}\left(1-\lambda ({y}_{\sigma },{\bar{y}}_{\sigma })\right)\)
and satisfies ℓ(y, y) = 0 (a required property55), if λ is a normalized kernel, which holds true in our experiments (we used the MinMax kernel57).
Pre-processing pipeline for raw MassBank records
Extended Data Fig. 5 illustrates our MassBank pre-processing pipeline implemented in the Python package ‘massbank2db’52. First, the MassBank record text files were parsed and the MS2 spectrum, ground-truth annotation, RT and meta-information extracted. Records with missing MS2, RT or annotation were discarded. We used the MassBank 2020.11 release for our experiments.
Subsequently, we grouped the MassBank records into subsets (denoted as MB-subsets) where the (MS2, RT)-tuples were measured under the same LC and MS conditions.
Supplementary Table 1 summarizes the grouping criteria. In the next step, we used the InChIKey58 identifier in MassBank to retrieve the SMILES59 representation from PubChem20 (1 February 2021), rather than using the contributor-supplied SMILES. This ensures a consistent SMILES source for the molecular candidates and ground-truth annotations.
Three more filtering steps were performed before creating the final database, to remove records: (1) if the ground-truth exact mass deviated too far (>20 ppm) from the calculated exact mass based on the precursor mass-to-charge ratio and adduct type; (2) if the subset contained <50 unique molecular structures; (3) if they were potential isobars (see pull-request #152 in the MassBank GitHub repository, https://github.com/MassBank/MassBank-data/pull/152).
Supplementary Table 3 summarizes the LC-MS2 meta-information for all generated MB-subsets.
Generating the molecular candidate sets
We used SIRIUS8,17 to generate the molecular candidate sets. For each MassBank record, the ground-truth molecular formula was used by SIRIUS to collect the candidate structures from PubChem20. The candidate sets generated by SIRIUS contain a single stereoisomer per candidate, identified by their InChIKey first block (structural skeleton). To study the ability of LC-MS2Struct to annotate the stereochemical variant of the molecules, we enriched the SIRIUS candidate sets with stereoisomers, using the InChIKey first block of each candidate to search PubChem (1 February 2021) for stereoisomers. The additional molecules were then added to the candidate sets.
Pre-computing the MS2 matching scores
For each MB-subset, MS2 spectra with identical adduct type (for example, [M + H]+) and ground-truth molecular structure were aggregated. Depending on the MS2 scorer, we either merged the MS2 spectra into a single spectrum (CFM-ID and MetFrag) following the strategy by ref. 11 or provided the MS2 spectra separately (SIRIUS). For the spectra merging, we used the ‘mzClust_hclust’ function of the xcms package60, which first combines the peaks of all MS2 spectra into a single peak list and subsequently merges peaks based on a mass error threshold.
To compute the CFM-ID (v4.0.7) MS2 matching score, we first predicted the in silico MS2 spectra for all molecular candidate structures based on their isomeric SMILES representation using the pre-trained CFM-ID models (Metlin 2019 MSML) by ref. 18. We merged the three in silico spectra predicted by CFM-ID for different collision energies and compared them with the merged MassBank spectrum using the modified cosine similarity61 implemented in the matchms62 (v0.9.2) Python library. For MetFrag (v2.4.5), the MS2 matching scores were calculated using the FragmenterScore feature based on the isomeric SMILES representation of the candidates. For SIRIUS, the required fragmentation trees are computed using the ground-truth molecular formula of each MassBank spectrum. SIRIUS uses canonical SMILES and hence does not encode stereochemical information (which is absent in the canonical SMILES). Therefore, we used the same SIRIUS MS2 matching score for all stereoisomers sharing the same InChIKey first block.
For all three MS2 scorers, we normalized the MS2 matching scores to the range [0, 1] separately for each candidate set. For the machine-learning-based scorers (CFM-ID and SIRIUS), the matching scores of the candidates associated with a particular MassBank record used in the evaluation were predicted using models that did not include its ground-truth structure (determined by InChIKey first block).
If an MS2 scorer failed on a MassBank record, we assigned a constant MS2 score to each candidate.
Molecular feature representations
Extended connectivity fingerprints with function classes (FCFP)38 were used to represent molecular structures in our experiments. We employed RDKit (v2021.03.1) to generate counting FCFP fingerprints. The fingerprints were computed based on the isomeric SMILES, using the parameter ‘useChirality’ to generate fingerprints that either encoded stereochemistry (3D) or not (2D). To define the set of substructures in the fingerprint vector, we first generated all possible substructures, using a FCFP radius of two, based on a set of 50,000 randomly sampled molecular candidates associated with our training data, and all the ground-truth training structures, resulting in 6,925 (3D) and 6,236 (2D) substructures. We used 3D FCFP fingerprints in our experiments, except for the experiments focusing on the annotation of stereoisomers, where we used both 2D and 3D fingerprints for comparison. We used the MinMax kernel57 to compute the similarity between the molecules.
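A minimal sketch of this featurization, assuming RDKit's Morgan/FCFP implementation with the 'useChirality' flag (using hashed substructure identifiers rather than the fixed substructure set described above), together with the MinMax kernel on counting fingerprints:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def fcfp_counts(smiles, use_chirality):
    mol = Chem.MolFromSmiles(smiles)
    # radius-2, feature-based (FCFP-like) counting fingerprint
    fp = AllChem.GetMorganFingerprint(mol, 2, useFeatures=True,
                                      useChirality=use_chirality)
    return fp.GetNonzeroElements()           # {substructure id: count}

def minmax_kernel(fp_a, fp_b):
    keys = set(fp_a) | set(fp_b)
    num = sum(min(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    den = sum(max(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    return num / den if den else 0.0

# Two enantiomers of alanine: the 2D fingerprints coincide, while the
# chirality-aware (3D) fingerprints differ.
l_ala, d_ala = "C[C@@H](N)C(=O)O", "C[C@H](N)C(=O)O"
print(minmax_kernel(fcfp_counts(l_ala, False), fcfp_counts(d_ala, False)))  # 1.0
print(minmax_kernel(fcfp_counts(l_ala, True), fcfp_counts(d_ala, True)))    # < 1.0
```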
Computing molecular categories
For the analysis of the ranking performance for different molecular categories, we used two classification systems, ClassyFire51, which classifies molecules according to their structure, and PubChemLite40, which classifies molecules according to information available for ten exposomics-relevant categories. For ClassyFire, we used the ‘classyfireR’ R package to retrieve the classification for each ground-truth molecular structure in our dataset. For PubChemLite, the classification categories were retrieved via InChIKey first block matching of each molecular structure; if it was not found in PubChemLite, the category ‘noClassification’ was assigned.
Training and evaluation data set-ups
We considered only MassBank data that have been analysed using an LC reversed-phase (RP) column. We removed molecules from the data if their measured RT was less than three times the estimated column dead-time63, as we considered such molecules to be non-retaining.
We considered two separate data set-ups. The first one, denoted by ALLDATA, used all available MassBank data to train and evaluate LC-MS2Struct. This set-up was used to compare the different candidate ranking approaches as well as to investigate the performance across various molecular classes. The second set-up, denoted by ONLYSTEREO, used MassBank records where the ground-truth molecular structure contains stereochemical information, that is, where the InChIKey second block is not ‘UHFFFAOYSA’. This set-up was used in the experiments regarding the ability of LC-MS2Struct to distinguish stereochemistry. In the training, we additionally used MassBank records that appear only without stereochemical information in our candidate sets, identified by the InChIKey second block equal to ‘UHFFFAOYSA’ in PubChem. The number of available training and evaluation (MS2, RT)-tuples per MB-subset is summarized in Supplementary Table 2.
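The ONLYSTEREO selection rule can be expressed directly on the InChIKey; the helper below is an assumed illustration, not part of ‘massbank2db’:

```python
def has_stereo_annotation(inchikey: str) -> bool:
    # the second InChIKey block equals 'UHFFFAOYSA' when the standard InChI
    # carries no stereochemical information
    return inchikey.split("-")[1] != "UHFFFAOYSA"

print(has_stereo_annotation("QNAYBMKLOCPYGJ-REOHCLBHSA-N"))  # L-alanine -> True
print(has_stereo_annotation("QNAYBMKLOCPYGJ-UHFFFAOYSA-N"))  # no stereo -> False
```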
For each MB-subset, we sampled a set of LC-MS2 experiments, that is (MS2, RT)-tuple sequences, from the available evaluation data. The number of LC-MS2 experiments (n below) depended on the number of available (MS2, RT)-tuples (Supplementary Table 2) as follows:
where \({{{\mathcal{D}}}}\) is a set of (MS2, RT)-tuples with ground-truth annotation and molecular candidate sets associated with an MB-subset. If there are fewer than 30 (MS2, RT)-tuples available, we do not generate an evaluation LC-MS2 experiment from the corresponding MB-subset. On the basis of this sampling scheme, we obtained 354 and 94 LC-MS2 experiments for ALLDATA and ONLYSTEREO, respectively, for our evaluation (Supplementary Table 2).
We trained eight (K = 8) separate SSVM models wk for each evaluation LC-MS2 experiment. For each SSVM model, we first generated a set containing the (MS2, RT)-tuples from all MB-subsets. Then, we removed all tuples whose ground-truth molecular structure, determined by the InChIKey first block, was in the respective evaluation LC-MS2 experiment. Lastly, we randomly sampled LC-MS2 experiments from the training tuples, within their respective MB-subset, with a length randomly chosen between 4 and (at most) 32 (see also Fig. 1e) and an RST Tik assigned to each MS feature sequence i. In total, 768 LC-MS2 training experiments were generated for each SSVM model. To speed up the model training, we restricted the candidate set size \(| {{{{\mathcal{C}}}}}_{i\sigma }|\) of each training MS feature σ to a maximum of 75 candidate structures by random subsampling, ensuring that the correct candidate was included in the subsample. Each SSVM model wk was applied to the evaluation LC-MS2 experiment, each associated with a different RST Tk, and the averaged max-marginal scores were used for the final candidate ranking (see equation (5) and Fig. 1c).
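The candidate subsampling used during SSVM training can be sketched as follows (hypothetical helper, shown only to illustrate that the correct candidate is always retained):

```python
import random

def subsample_candidates(candidates, correct, max_size=75, seed=0):
    """Randomly keep at most `max_size` candidates, always including the correct one."""
    rng = random.Random(seed)
    others = [c for c in candidates if c != correct]
    kept = rng.sample(others, min(len(others), max_size - 1))
    return [correct] + kept
```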
SSVM hyperparameter optimization
The SSVM regularization parameter C was optimized for each training set separately using grid search and evaluation on a random validation set sampled from the training data’s (MS2, RT)-tuples (33%). A set of LC-MS2 experiments was generated from the validation set and used to determine the normalized discounted cumulative gain (NDCG)64 for each C value. The regularization parameter with the highest NDCG value was chosen to train the final model. We used the scikit-learn65 (v0.24.1) Python package to compute the NDCG value, considering ranks up to 10 (NDCG@10) and defining the relevance of each candidate to be 1 if it is the correct one and 0 otherwise. To reduce the training time, we searched for the optimal C* only for SSVM model k = 0 and used C* for the other models with k > 0.
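A minimal example of the NDCG@10 model-selection criterion, assuming scikit-learn's ndcg_score with the binary relevance labels described above:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# one row per MS feature: relevance is 1 for the correct candidate, 0 otherwise
y_true = np.array([[0, 0, 1, 0, 0]])
# candidate scores produced by the SSVM model for one value of C
y_score = np.array([[0.20, 0.10, 0.70, 0.40, 0.05]])

print(ndcg_score(y_true, y_score, k=10))  # NDCG@10 for this MS feature
```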
Ranking performance evaluation
We computed the ranking performance (top-k accuracy) for a given LC-MS2 experiment using the tie-breaking strategy described in ref. 8: if a ranking method assigns an identical score to a set of n molecular candidates, then all accuracies at the ordinal ranks k at which one of these candidates is found are increased by 1/n. We computed a candidate score (that is, only-MS2, LC-MS2Struct and so on) for each molecular structure in the candidate set (identified by PubChem CID). Depending on the data set-up (Supplementary Table 4), we first collapse the candidates by InChIKey first block (ALLDATA, method comparison and molecule category analysis) or full InChIKey (ONLYSTEREO stereochemistry prediction), assigning the maximum candidate score for each InChIKey first block or InChIKey group, respectively. Subsequently, we compute the top-k accuracy based on the collapsed candidate sets.
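The tie-breaking rule translates into the following per-feature contribution to the top-k accuracy (an assumed helper for illustration, not the published evaluation code):

```python
import numpy as np

def topk_contribution(scores, correct_idx, max_k=20):
    scores = np.asarray(scores, dtype=float)
    s_true = scores[correct_idx]
    better = int(np.sum(scores > s_true))   # candidates ranked strictly ahead
    n_tied = int(np.sum(scores == s_true))  # size of the tie group containing the correct candidate
    contrib = np.zeros(max_k)
    for k in range(1, max_k + 1):
        # number of tie-group ranks falling within the top k, each worth 1/n_tied
        in_top_k = max(0, min(k, better + n_tied) - better)
        contrib[k - 1] = in_top_k / n_tied
    return contrib  # contrib[k-1] is added to the top-k accuracy numerator

# the correct candidate (index 0) shares the best score with one other candidate
print(topk_contribution([0.9, 0.9, 0.3], correct_idx=0, max_k=3))  # [0.5, 1.0, 1.0]
```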
For the performance analysis of individual molecule categories, either ClassyFire51 or PubChemLite40 classes, we first computed the rank of the correct molecular structure for each (MS2, RT)-tuple of each LC-MS2 evaluation experiment based on only-MS2 and LC-MS2Struct scores. Subsequently, we computed the top-k accuracy for each molecule category, associated with at least 50 unique ground-truth molecular structures (based on InChIKey first block). As a ground-truth structure can appear multiple times in our dataset, we generate 50 random samples, each containing only one example per unique structure, and computed the averaged top-k accuracy.
Comparison of LC-MS2Struct with other approaches
We compared LC-MS2Struct with three different approaches to integrate MS2 and RT information, namely RT filtering, logP prediction and RO prediction.
For RT filtering (MS2 + RT), we followed ref. 26, which used the relative error \(\epsilon =\frac{| \hat{t}-{t}_{\sigma }| }{{t}_{\sigma }}\) between the predicted (\(\hat{t}\)) and observed (tσ) RT. We set the filtering threshold to the 95% quantile of the relative RT prediction errors estimated from the RT model’s training data, following refs. 27,29. We used scikit-learn’s65 (v0.24.1) implementation of support vector regression66 with a radial basis function kernel for the RT prediction. For the support vector regression, we used the same 196 features, computed using RDKit (v2021.03.1), as in ref. 25.
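A simplified sketch of this filtering baseline, assuming a generic molecular feature matrix in place of the 196 RDKit descriptors:

```python
import numpy as np
from sklearn.svm import SVR

def fit_rt_filter(X_train, t_train):
    """Fit the RT predictor and estimate the 95% quantile filtering threshold."""
    model = SVR(kernel="rbf").fit(X_train, t_train)
    rel_err = np.abs(model.predict(X_train) - t_train) / t_train
    return model, np.quantile(rel_err, 0.95)

def keep_candidate(model, threshold, x_candidate, t_observed):
    """Retain a candidate only if its relative RT error stays below the threshold."""
    t_hat = model.predict(x_candidate.reshape(1, -1))[0]
    return abs(t_hat - t_observed) / t_observed <= threshold
```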
For logP prediction (MS2 + logP), we followed ref. 11, which assigned a weighted sum of an MS2 and a logP score, \(s=\beta \,{s}_{{{\mathrm{MS}}}^{2}}(m)+(1-\beta )\,{s}_{\log P}(m)\), to each candidate \(m\in {{{{\mathcal{C}}}}}_{\sigma }\), and used it to rank the set of molecular candidates. The logP score is given by \({s}_{\log P}(m)=\frac{1}{\delta \sqrt{2\pi }}\exp \left(-\frac{{({\log P}_{m}-{\log P}_{\sigma })}^{2}}{2{\delta }^{2}}\right)\), where logPm is the predicted XlogP350 extracted from PubChem20 for candidate m, and logPσ = a tσ + b is the XlogP3 value of the unknown compound, associated with MS feature σ, predicted from its measured RT tσ. The parameters a and b of the linear regression model were determined using a set of RT and XlogP3 tuples associated with the LC system. As in ref. 11, we set δ = 1.5 and chose β such that it optimizes the top-1 candidate ranking accuracy, calculated from a set of 25 randomly generated training LC-MS2 experiments.
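The logP-based re-scoring follows directly from the formulas above (hypothetical variable names; a and b denote the fitted linear-regression coefficients):

```python
import numpy as np

def logp_score(logp_candidate, rt_observed, a, b, delta=1.5):
    logp_sigma = a * rt_observed + b  # XlogP3 of the unknown, predicted from its RT
    return (1.0 / (delta * np.sqrt(2 * np.pi))) * np.exp(
        -((logp_candidate - logp_sigma) ** 2) / (2 * delta ** 2))

def combined_score(ms2_score, logp_candidate, rt_observed, a, b, beta):
    # weighted sum s = beta * s_MS2 + (1 - beta) * s_logP used to re-rank candidates
    return beta * ms2_score + (1 - beta) * logp_score(logp_candidate, rt_observed, a, b)
```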
For RO prediction (MS2 + RO), we used the approach by ref. 34, which relies on a RankSVM implementation in the Python library ROSVM31,67 (v0.5.0). We used counting ‘substructure’ fingerprints calculated using CDK (v2.5)68 and the MinMax kernel57. The MS2 matching scores and predicted ROs were used to compute max-marginal ranking scores using the framework by ref. 34. We used the author’s implementation in version 0.2.369. The hyper-parameters β and k of the model were optimized for each evaluation LC-MS2 experiment separately using the respective training data. To estimate β, we generated 25 LC-MS2 experiments from the training data and selected the β that maximized the Top20AUC34 ranking performance. The sigmoid parameter k was estimated using Platt’s method70 calibrated using RankSVM’s training data. We used 128 random spanning trees per evaluation LC-MS2 experiment to compute the averaged max-marginals.
For the experiments comparing the different methods, we used all LC-MS2 experiments generated, except the ones from the MB-subsets ‘CE_001’, ‘ET_002’, ‘KW_000’ and ‘RP_000’ (Supplementary Table 2). For those subsets, the evaluation LC-MS2 experiment contains all available (MS2, RT)-tuples, leaving no LC-system-specific data to train the RT (MS2 + RT) or logP (MS2 + logP) prediction models. The RT and logP prediction models are trained in a structure disjoint fashion using the RT data of the particular MB-subset associated with the evaluation LC-MS2. The RO prediction model used by MS2 + RO is trained structure disjoint as well, but using the RTs of all MB-subsets.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All data used in our experiments are available online71 (https://zenodo.org/record/5854661). The candidate rankings of all LC-MS2 experiments are available online: ALLDATA72 (https://zenodo.org/record/6451016) and ONLYSTEREO73 (https://zenodo.org/record/6037629). Source data are provided with this paper.
Code availability
The source code developed for this study is available on GitHub, including the implementation of LC-MS2Struct74 (v2.13.0; https://github.com/aalto-ics-kepaco/msms_rt_ssvm); scripts to run the experiments75 (https://github.com/aalto-ics-kepaco/lcms2struct_exp); and the library implementing the MassBank pre-processing52 (v0.9.0; https://github.com/bachi55/massbank2db). The candidate fingerprints were computed by the ROSVM Python library67 (v0.5.0; https://github.com/bachi55/rosvm) using RDKit (2021.03.1). The SSVM library uses the max-marginal inference solver implemented by ref. 34 (v0.2.3; https://github.com/aalto-ics-kepaco/msms_rt_score_integration).
References
da Silva, R. R., Dorrestein, P. C. & Quinn, R. A. Illuminating the dark matter in metabolomics. Proc. Natl Acad. Sci. USA 112, 12549–12550 (2015).
Aksenov, A. A., da Silva, R., Knight, R., Lopes, N. P. & Dorrestein, P. C. Global chemical analysis of biology by mass spectrometry. Nat. Rev. Chem. 1, 0054 (2017).
Blaženović, I. et al. Structure annotation of all mass spectra in untargeted metabolomics. Anal. Chem. 91, 2155–2162 (2019).
Blaženović, I., Kind, T., Ji, J. & Fiehn, O. Software tools and approaches for compound identification of LC-MS/MS data in metabolomics. Metabolites 8, 31 (2018).
Schymanski, E. L. et al. Critical assessment of small molecule identification 2016: automated methods. J. Cheminform. 9, 22 (2017).
Nguyen, D. H., Nguyen, C. H. & Mamitsuka, H. Recent advances and prospects of computational methods for metabolite identification: a review with emphasis on machine learning approaches. Brief. Bioinform. 20, 2028–2043 (2019).
Wolf, S., Schmidt, S., Müller-Hannemann, M. & Neumann, S. In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinform. 11, 1–12 (2010).
Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).
Allen, F., Greiner, R. & Wishart, D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11, 98–110 (2015).
Brouard, C. et al. Fast metabolite identification with input output kernel regression. Bioinformatics 32, i28–i36 (2016).
Ruttkies, C., Schymanski, E. L., Wolf, S., Hollender, J. & Neumann, S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J. Cheminform. 8, 3 (2016).
Brouard, C., Bach, E., Böcker, S. & Rousu, J. Magnitude-preserving ranking for structured outputs. In Proc. Ninth Asian Conference on Machine Learning, Proc. Machine Learning Research Vol. 77 (eds Zhang, M.-L. & Noh, Y.-K.) 407–422 (PMLR, 2017); http://proceedings.mlr.press/v77/brouard17a.html
Nguyen, D. H., Nguyen, C. H. & Mamitsuka, H. Simple: sparse interaction model over peaks of molecules for fast, interpretable metabolite identification from tandem mass spectra. Bioinformatics 34, i323–i332 (2018).
Li, Y., Kuhn, M., Gavin, A.-C. & Bork, P. Identification of metabolites from tandem mass spectra with a machine learning approach utilizing structural features. Bioinformatics 36, 1213–1218 (2019).
Ruttkies, C., Neumann, S. & Posch, S. Improving MetFrag with statistical learning of fragment annotations. BMC Bioinform. 20, 376 (2019).
Nguyen, D. H., Nguyen, C. H. & Mamitsuka, H. ADAPTIVE: learning data-dependent, concIse molecular vectors for fast, accurate metabolite identification from tandem mass spectra. Bioinformatics 35, i164–i172 (2019).
Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods https://doi.org/10.1038/s41592-019-0344-8 (2019).
Wang, F. et al. CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification. Anal. Chem. https://doi.org/10.1021/acs.analchem.1c01465 (2021).
Wishart, D. S. et al. HMDB 4.0: the Human Metabolome Database for 2018. Nucleic Acids Res. 46, D608–D617 (2017).
Kim, S. et al. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res. 49, D1388–D1395 (2020).
Stanstrup, J., Neumann, S. & Vrhovšek, U. PredRet: prediction of retention time by direct mapping between multiple chromatographic systems. Anal. Chem. 87, 9421–9428 (2015).
Low, D. Y. et al. Data sharing in predret for accurate prediction of retention time: application to plant food bioactive compounds. Food Chem. 357, 129757 (2021).
Fanali, S., Haddad, P., Poole, C. & Lloyd, D. Liquid Chromatography: Fundamentals and Instrumentation (Handbooks in Separation Science, Elsevier Science, 2013).
Witting, M. & Böcker, S. Current status of retention time prediction in metabolite identification. J. Sep. Sci. 43, 1746–1754 (2020).
Bouwmeester, R., Martens, L. & Degroeve, S. Comprehensive and empirical evaluation of machine learning algorithms for small molecule LC retention time prediction. Anal. Chem. 91, 3694–3703 (2019).
Aicheler, F. et al. Retention time prediction improves identification in nontargeted lipidomics approaches. Anal. Chem. 87, 7698–7704 (2015).
Samaraweera, M. A., Hall, L. M., Hill, D. W. & Grant, D. F. Evaluation of an artificial neural network retention index model for chemical structure identification in nontargeted metabolomics. Anal. Chem. 90, 12752–12760 (2018).
Bonini, P., Kind, T., Tsugawa, H., Barupal, D. K. & Fiehn, O. Retip: retention time prediction for compound annotation in untargeted metabolomics. Anal. Chem. https://doi.org/10.1021/acs.analchem.9b05765 (2020).
Yang, Q., Ji, H., Lu, H. & Zhang, Z. Prediction of liquid chromatographic retention time with graph neural networks to assist in small molecule identification. Anal. Chem. https://doi.org/10.1021/acs.analchem.0c04071 (2021).
Bouwmeester, R., Martens, L. & Degroeve, S. Generalized calibration across liquid chromatography setups for generic prediction of small-molecule retention times. Anal. Chem. 92, 6571–6578 (2020).
Bach, E., Szedmak, S., Brouard, C., Böcker, S. & Rousu, J. Liquid-chromatography retention order prediction for metabolite identification. Bioinformatics 34, i875–i883 (2018).
Liu, J. J., Alipuly, A., Baczek, T., Wong, M. W. & Žuvela, P. Quantitative structure–retention relationships with non-linear programming for prediction of chromatographic elution order. Int. J. Mol. Sci. 20, 3443 (2019).
Žuvela, P., Liu, J. J., Wong, M. W. & Baczek, T. Prediction of chromatographic elution order of analytical mixtures based on quantitative structure–retention relationships and multi-objective optimization. Molecules 25, 3085 (2020).
Bach, E., Rogers, S., Williamson, J. & Rousu, J. Probabilistic framework for integration of mass spectrum and retention time information in small molecule identification. Bioinformatics 37, 1724–1731 (2021).
Tsochantaridis, I., Joachims, T., Hofmann, T. & Altun, Y. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res. 6, 1453–1484 (2005).
Taskar, B., Guestrin, C. & Koller, D. Max-margin Markov networks. Adv. Neural Inf. Process. Syst. 16, 25–32 (2004).
Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 45, 703–714 (2010).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Pence, H. & Williams, A. ChemSpider: an online chemical information resource. J. Chem. Educ. 87, 1123–1124 (2010).
Schymanski, E. L. et al. Empowering large chemical knowledge bases for exposomics: PubChemLite meets MetFrag. J. Cheminform. https://doi.org/10.21203/rs.3.rs-107432/v1 (2021).
Schüller, A., Schneider, G. & Byvatov, E. SmiLib: rapid assembly of combinatorial libraries in SMILES notation. QSAR Comb. Sci. 22, 719–721 (2003).
Schüller, A., Hähnke, V. & Schneider, G. SmiLib v2.0: a Java-based tool for rapid combinatorial library enumeration. QSAR Comb. Sci. 26, 407–410 (2007).
Wainwright, M., Jaakkola, T. & Willsky, A. Tree consistency and bounds on the performance of the max-product algorithm and its generalizations. Stat. Comput. 14, 143–166 (2004).
MacKay, D. J. Information Theory, Inference and Learning Algorithms (Cambridge Univ. Press, 2005).
Pletscher, P., Ong, C. S. & Buhmann, J. Spanning tree approximations for conditional random fields. In Proc. Twelfth International Conference on Artificial Intelligence and Statistics, Proc. Machine Learning Research Vol. 5 (eds van Dyk, D. & Welling, M.) 408–415 (PMLR, 2009); http://proceedings.mlr.press/v5/pletscher09a.html
Su, H. & Rousu, J. Multilabel classification through random graph ensembles. Mach. Learn. 99, 231–256 (2015).
Rousu, J., Saunders, C., Szedmak, S. & Shawe-Taylor, J. Kernel-based learning of hierarchical multilabel classification models. J. Mach. Learn. Res. 7, 1601–1626 (2006).
Elisseeff, A. & Weston, J. A kernel method for multi-labelled classification. Adv. Neural Inf. Process. Syst. 14, 681–687 (2002).
Joachims, T. Optimizing search engines using clickthrough data. In Proc. Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 133–142 (ACM, 2002); https://doi.org/10.1145/775047.775067
Cheng, T. et al. Computation of octanol-water partition coefficients by guiding an additive model with knowledge. J. Chem. Inf. Model. 47, 2140–2148 (2007).
Feunang, Y. D. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminform. 8, 61 (2016).
Bach, E. massbank2db: build a machine learning ready SQLite database from MassBank. GitHub https://github.com/bachi55/massbank2db (2022).
Gärtner, T. & Vembu, S. On structured output training: hard cases and an efficient alternative. Mach. Learn. 76, 227–242 (2009).
Xue, Y., Li, Z., Ermon, S., Gomes, C. P. & Selman, B. Solving marginal MAP problems with NP oracles and parity constraints. Adv. Neural Inf. Process. Syst. 29, 1135–1143 (2016).
Lacoste-Julien, S., Jaggi, M., Schmidt, M. & Pletscher, P. Block-coordinate Frank–Wolfe optimization for structural SVMs. In International Conference on Machine Learning 53–61 (PMLR, 2013).
Frank, M. & Wolfe, P. An algorithm for quadratic programming. Nav. Res. Logist. Q. 3, 95–110 (1956).
Ralaivola, L., Swamidass, S. J., Saigo, H. & Baldi, P. Graph kernels for chemical informatics. Neural Netw. 18, 1093–1110 (2005).
Heller, S. R., McNaught, A., Pletnev, I., Stein, S. & Tchekhovskoi, D. InChI, the IUPAC international chemical identifier. J. Cheminform. 7, 23 (2015).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Benton, H. P., Wong, D. M., Trauger, S. A. & Siuzdak, G. XCMS2: processing tandem mass spectrometry data for metabolite identification and structural characterization. Anal. Chem. 80, 6382–6389 (2008).
Watrous, J. et al. Mass spectral molecular networking of living microbial colonies. Proc. Natl Acad. Sci. USA 109, E1743–E1752 (2012).
Huber, F. et al. matchms—processing and similarity evaluation of mass spectrometry data. J. Open Source Softw. 5, 2411 (2020).
Dolan, J. W. Column dead time as a diagnostic tool. LCGC North Am. 32, 24–29 (2014).
Järvelin, K. & Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 422–446 (2002).
Pedregosa, F. et al. scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Drucker, H., Burges, C. J., Kaufman, L., Smola, A. J. & Vapnik, V. Support vector regression machines. Adv. Neural Inf. Process. Syst. 9, 155–161 (1997).
Bach, E. Retention order support vector machine (ROSVM). GitHub https://github.com/bachi55/rosvm (2022).
Willighagen, E. L. et al. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J. Cheminform. 9, 33 (2017).
Bach, E. msmsrt_scorer: probabilistic framework for integration of mass spectrum and retention order information. GitHub https://github.com/aalto-ics-kepaco/msms_rt_score_integration (2021).
Platt, J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv. Large Margin Classifiers 10, 61–74 (2000).
Bach, E. Dataset: ‘Joint structural annotation of small molecules using liquid chromatography retention order and tandem mass spectrometry data’. Zenodo https://doi.org/10.5281/zenodo.5854661 (2022).
Bach, E. Result files (ALLDATA): ‘Joint structural annotation of small molecules using liquid chromatography retention order and tandem mass spectrometry data with LC-MS2Struct’. Zenodo https://doi.org/10.5281/zenodo.6451016 (2022).
Bach, E. Result files (ONLYSTEREO): ‘Joint structural annotation of small molecules using liquid chromatography retention order and tandem mass spectrometry data’. Zenodo https://doi.org/10.5281/zenodo.6037629 (2022).
Bach, E. msms_rt_ssvm: implementation of the LC-MS2Struct algorithm. GitHub https://github.com/aalto-ics-kepaco/msms_rt_ssvm (2022).
Bach, E. Experiments and figure generation for the LC-MS2Struct evaluation. GitHub https://github.com/aalto-ics-kepaco/lcms2struct_exp (2022).
Acknowledgements
E.L.S. acknowledges discussions with G. Landrum (ETHZ) and E. Bolton (NCBI/NLM/NIH). We acknowledge CSC-IT Center for Science, Finland, and Aalto Science-IT infrastructure, Finland, for generous computational resources. E.B. thanks K. Dührkop for generating the SIRIUS candidate sets and predicting the SIRIUS MS2 scores.
Funding
Open Access funding provided by Aalto University. The work by E.B. and J.R. was partially supported by Academy of Finland grants 310107 (MACOME) and 334790 (MAGITICS). E.L.S. acknowledges funding support from the Luxembourg National Research Fund (FNR) for project A18/BM/12341006.
Author information
Authors and Affiliations
Contributions
E.B. and J.R. designed the research. E.B. implemented the MassBank pre-processing. E.B. developed, implemented and evaluated the computational method. E.B., E.L.S. and J.R. interpreted the results. E.B., E.L.S. and J.R. wrote the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Nicola Zamboni and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Distribution of molecule classes in the MassBank (MB) subsets.
ClassyFire SuperClass distribution (ref. 51) for each MB-subset studied in our experiments. Within each MB-subset, the label ‘Other’ is assigned to each SuperClass that makes up less than 2.5% of all molecules. The centre label gives the number of examples in the respective MB-subset.
Extended Data Fig. 2 Performance gain by LC-MS2Struct across ClassyFire Class annotations.
Ranking performance (top-k) improvement of LC-MS2Struct compared to Only-MS2 (baseline) across ClassyFire Class-level annotations. The Classes (shown in the bars) are colour coded by SuperClass (see legend). The data are presented as mean values (50 samples) and the error bars show the 95% confidence interval of the mean estimate (1,000 bootstrap samples). The top-k accuracies (%) under the bars give the Only-MS2 performance. For each molecule class, the number of unique molecular structures in the class is given in the x-axis label (n).
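For readers unfamiliar with the error-bar construction described above, the following is a minimal, illustrative sketch (not the authors' evaluation code) of a percentile-bootstrap 95% confidence interval for a mean, assuming NumPy is available; the function name and the synthetic accuracy values are placeholders.

```python
import numpy as np

def bootstrap_ci_of_mean(values, n_boot=1000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample with replacement, recompute the mean,
    # and take the empirical (alpha/2, 1 - alpha/2) quantiles of those means.
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    boot_means = np.array([
        rng.choice(values, size=values.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), (lower, upper)

# Synthetic example: 50 per-sample top-k accuracies (placeholder numbers only).
accuracies = np.random.default_rng(1).uniform(30, 50, size=50)
mean, (lo, hi) = bootstrap_ci_of_mean(accuracies, n_boot=1000)
print(f"mean = {mean:.1f}%, 95% CI = [{lo:.1f}%, {hi:.1f}%]")
```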
Extended Data Fig. 3 Performance comparisons using 3D and 2D fingerprints in the ALLDATA setting.
Using LC-MS2Struct with different molecular feature representations to identify the correct structure at the level of the first InChIKey block (InChIKey-1). a: Comparison of the performance, measured by top-k accuracy, of LC-MS2Struct using either 2D (no stereochemistry) or 3D (with stereochemistry) molecular fingerprints in the ALLDATA setting. The results shown are accuracies averaged over 354 sample MS feature sequences (LC-MS2 experiments). b: Average top-k accuracies per MassBank (MB) subset, rounded to integers. The colour encodes the performance improvement, in percentage points (%p), of each score integration method compared to Only-MS2.
Extended Data Fig. 4 Model performance for different numbers of SSVM models.
Performance comparison of LC-MS2Struct against using only MS2 information (Only-MS2) for different numbers of SSVM models. The performance curves for the three MS2 scorers are shown separately. The top-k accuracies shown are accuracies averaged over 354 sample MS feature sequences (LC-MS2 experiments) from the ALLDATA setting.
Extended Data Fig. 5 Processing pipeline of the MassBank data.
Illustration of the processing pipeline used to extract the training data from MassBank. The depicted workflow is implemented in the ‘massbank2db’ Python package (ref. 52).
Supplementary information
Supplementary Information
Supplementary Tables 1–5 and detailed derivations of the SSVM.
Source data
Source Data Fig. 2
Raw data to reproduce Fig. 2a,b.
Source Data Fig. 3
Raw data to reproduce Fig. 3a,b.
Source Data Fig. 4
Raw data to reproduce Fig. 4a,b.
Source Data Fig. 5
Raw data for the spectra shown in Fig. 5.
Source Data Extended Data Fig. 1
Raw data to reproduce Extended Data Fig. 1.
Source Data Extended Data Fig. 2
Raw data to reproduce Extended Data Fig. 2.
Source Data Extended Data Fig. 3
Raw data to reproduce Extended Data Fig. 3a,b.
Source Data Extended Data Fig. 4
Raw data to reproduce Extended Data Fig. 4.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bach, E., Schymanski, E.L. & Rousu, J. Joint structural annotation of small molecules using liquid chromatography retention order and tandem mass spectrometry data. Nat Mach Intell 4, 1224–1237 (2022). https://doi.org/10.1038/s42256-022-00577-2
This article is cited by
- Generic and accurate prediction of retention times in liquid chromatography by post-projection calibration. Communications Chemistry (2024)
- Complementary methods for structural assignment of isomeric candidate structures in non-target liquid chromatography ion mobility high-resolution mass spectrometric analysis. Analytical and Bioanalytical Chemistry (2023)
- NORMAN guidance on suspect and non-target screening in environmental monitoring. Environmental Sciences Europe (2023)