Structural annotation of small molecules in biological samples is a key bottleneck in various research fields including biomedicine, biotechnology, drug discovery and environmental sciences. Samples in untargeted metabolomics studies typically contain thousands of different molecules, the vast majority of which remain unidentified1,2,3. Liquid chromatography–tandem mass spectrometry (LC-MS2) is one of the most widely used analysis platforms4, as it allows for high-throughput screening, is highly sensitive and is applicable to a wide range of molecules. In LC-MS2, molecules are first separated by their different physicochemical interactions between the mobile and stationary phase of the column in the liquid chromatographic system, resulting in retention time (RT) differences. Subsequently, they are separated according to their mass-to-charge ratio in a mass analyser (MS1). Finally, the molecular ions are isolated and fragmented in the tandem mass spectrometer (MS2).

For each ion, the recorded fragments and their intensities constitute the MS2 spectrum, which contains information about the substructures in the molecule and serves as a basis for annotation efforts. In typical untargeted LC-MS2 workflows, thousands of MS features (MS1, MS2, RT) arise from a single sample. The goal of structural annotation is to associate each feature with a candidate molecular structure, for further downstream interpretation.

In recent years, many powerful methods5,6 to predict structural annotations for MS2 spectra have been developed7,8,9,10,11,12,13,14,15,16,17,18. In general, these methods find candidate molecular structures potentially associated with the MS feature, for example, by querying molecules with a certain mass from a structure database such as Human Metabolome Database (HMDB)19 or PubChem20 and subsequently computing a match score between each candidate and the MS2 spectrum. The highest-scoring candidate is typically considered the structure annotation of a given MS2 spectrum. Currently, even the best-of-class methods only reach an annotation accuracy of around 40% (ref. 17) in evaluations when searching large candidate sets such as those retrieved from PubChem. Therefore, in practice, a ranked list of molecular structures is provided to the user (for example, the top-20 structures). This level of performance is still a considerable hindrance in metabolomics and other fields.

Interestingly, RT information remains underutilized in automated approaches for structure annotation based on MS2, despite RTs being readily available in all LC-MS2 pipelines and generally recognized as contributing valuable information21,22. An explanation is that a molecule generally has different RTs under different LC conditions (mobile phase, column composition and so on)23,24. Typically, the RT information is used for post-processing of candidate lists, for example, by comparing measured and reference standard RTs3,24. This approach, however, is limited by the availability of experimentally determined RTs of reference standards. In contrast, RT prediction models24,25 allow the prediction of RTs based solely on the molecular structure of the candidate, and have been successfully applied to aid structure annotation11,26,27,28,29. However, such prediction models generally have to be calibrated to the specific LC configuration3, requiring at least some amount of target LC reference standard RT data to be available21,29,30. Recently, the idea of predicting retention orders (ROs), that is, the order in which two molecules elute from the LC column, has been explored31,32,33,34. ROs are largely preserved within a family of LC systems (for example, reversed-phase or hydrophilic interaction LC systems). Therefore, RO predictors can be trained using a diverse set of RT reference data, and applied to out-of-dataset LC set-ups31. Integration of MS2- and RO-based scores using probabilistic graphical models improved the annotation performance in LC-MS2 experiments34.

Another somewhat neglected aspect in automated annotation pipelines is the treatment of stereochemistry, that is, the different three-dimensional (3D) variants of the molecules. The general assumption has been that LC-MS2 data do not contain sufficient information to separate stereoisomers in samples5,24. As a result, MS2 scorers typically disregard the stereochemical information in the candidate structures and often output the same matching for different stereoisomers (compare refs. 7,17). However, stereoisomers that vary in their double-bond orientation (for example, cis/trans or E/Z isomerism) may have different shapes and thus exhibit different fragmentation and/or interactions with the LC system. Thus, ignoring stereochemistry in candidate processing may disregard LC-relevant stereochemical information. Furthermore, it is known that certain stereochemical configurations occur more frequently than others in nature and hence in the reference databases. Making use of such information can potentially improve annotation performance.

In this Article, we set out to provide a new perspective on jointly using MS2 and RO combined with stereochemistry-aware molecular features for the structure annotation of LC-MS2 data. We present a novel machine learning framework called LC-MS2Struct, which learns to optimally combine the MS2 and RO information for the accurate annotation of a sequence of MS features. LC-MS2Struct relies on the structured support vector machine (SSVM)35 and max-margin Markov network36 frameworks. In contrast to the previous work of ref. 34, our framework does not require a separately learned RO prediction model. Instead, it optimizes the SSVM parameters such that the score margin between correct and any other sequence of annotations is maximized. This way, LC-MS2Struct learns to optimally use the RO information from a set of LC-MS2 experiments. We trained and evaluated LC-MS2Struct on all available reversed-phase LC data from MassBank37, including a combined total of 4,327 molecules from 18 different LC configurations, hence reaching a high level of measurement diversity in the model evaluation. Our framework is compared with three other approaches: RT filtering, logP predictions11 and RO predictions34. LC-MS2Struct can be combined with any MS2 scorer, and is demonstrated with the CFM-ID9,18, MetFrag7,11 and SIRIUS8,17 tools. The use of chirality-encoding circular molecular fingerprints38 in the predictive model allows LC-MS2Struct to distinguish and rank different stereoisomers based on the observed ROs.

Overview of LC-MS2Struct

Input and output

We consider a typical data setting in untargeted LC-MS2-based experiments, after pre-processing such as chromatographic peak picking and alignment (Fig. 1a). Such data comprise a sequence of MS features, here indexed by σ. Each feature consists of MS1 information (for example, mass, adduct and isotope pattern), LC retention time (RT) tσ and an MS2 spectrum xσ. We assume that a set of candidate molecules \({{{{\mathcal{C}}}}}_{\sigma }\) is associated with each MS feature σ. Such a set can be, for example, generated from a structure database (for example, PubChem20, ChemSpider39 or PubChemLite40) based on the ion’s mass, a suspect list or an in silico molecule generator (for example, SmiLib v2.041,42). We furthermore require that for MS2 spectrum xσ, a matching score θ(xσ, m) with its candidates \(m\in {{{{\mathcal{C}}}}}_{\sigma }\) is pre-computed using an in silico tool, such as CFM-ID9,18, MetFrag11 or SIRIUS8,17. LC-MS2Struct predicts a score for MS feature σ and each associated candidate \(m\in {{{{\mathcal{C}}}}}_{\sigma }\) based on a sequence of spectra \({{{\bf{x}}}}={({x}_{\sigma })}_{\sigma = 1}^{L}\), of length L, and the ROs derived from the observed RTs \({{{\bf{t}}}}={({t}_{\sigma })}_{\sigma = 1}^{L}\). These scores are used to rank the molecular candidates associated with the MS features (Fig. 1b).
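For concreteness, this input setting can be sketched in code. The names below (`MSFeature`, `retention_order`) are ours for illustration, not part of LC-MS2Struct; each feature carries its MS2 spectrum, its RT tσ and a candidate set with pre-computed matching scores θ(xσ, m):

```python
from dataclasses import dataclass, field

@dataclass
class MSFeature:
    """One MS feature: its MS2 spectrum, retention time and scored candidates."""
    spectrum_id: str   # identifier of the MS2 spectrum x_sigma
    rt: float          # retention time t_sigma
    # candidate set C_sigma: molecule id -> pre-computed MS2 score theta(x_sigma, m)
    candidates: dict = field(default_factory=dict)

# A (simulated) LC-MS2 experiment is a sequence of such features.
experiment = [
    MSFeature("x1", rt=120.5, candidates={"m_a": 0.9, "m_b": 0.4}),
    MSFeature("x2", rt=305.0, candidates={"m_c": 0.7, "m_d": 0.6, "m_e": 0.2}),
]

# The retention order (RO) of a feature pair follows directly from the observed RTs.
def retention_order(t_sigma, t_tau):
    """-1 if sigma elutes first, +1 if tau elutes first, 0 on a tie."""
    return (t_sigma > t_tau) - (t_sigma < t_tau)

ro_12 = retention_order(experiment[0].rt, experiment[1].rt)
```

The ROs derived this way, rather than the absolute RTs, are what LC-MS2Struct consumes.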

Fig. 1: Overview of the LC-MS2Struct workflow.
figure 1

a, Input to LC-MS2Struct during the application phase. The LC-MS2 experiment results in a set of (MS2, RT)-tuples. The MS information is used to generate a molecular candidate set for each MS feature. b, The output of LC-MS2Struct is the ranked molecular candidates for each MS feature. c, A fully connected graph G models the pairwise dependency between the MS features. Using a set of random spanning trees Tk and SSVM, we predict the max-marginal scores for each candidate used for the ranking. d, The MS2 and RO information is used to score the nodes and edges in the graph G. e, To train the SSVM models and evaluate LC-MS2Struct, we extract MS2 spectra and RTs from MassBank. We group the MassBank records such that their experimental set-ups are matching, simulating LC-MS2 experiments. f, Main objective optimized during the SSVM training, where yi ∈ Σi is the ground-truth label sequence of example i and \({{{\bf{y}}}},{{{\bf{y}}}}^{\prime} \in {{{\varSigma }}}_{i}\) are further possible label sequences.

Candidate ranking using max-marginals

We define a fully connected graph G = (V, E) capturing the MS features and modelling their dependencies (Fig. 1c), where V represents the set of nodes and E the set of edges. Each node σ ∈ V corresponds to an MS feature, and is associated with the pre-computed MS2 matching scores θ(xσ, m) between the MS2 spectrum xσ and all molecular candidates \(m\in {{{{\mathcal{C}}}}}_{\sigma }\). The graph G contains an edge (σ, τ) ∈ E for each MS feature pair. A scoring function F is defined predicting a compatibility score between a sequence of molecular structure assignments \({{{\bf{y}}}}={({y}_{\sigma })}_{\sigma = 1}^{L}\) in the label-space \({{\varSigma }}={{{{\mathcal{C}}}}}_{1}\times \ldots \times {{{{\mathcal{C}}}}}_{L}\) and the observed data:

$$F({\mathbf{y}}\,|\,{\mathbf{x}},{\mathbf{t}},{\mathbf{w}},G)=\underbrace{\frac{1}{|V|}\sum\limits_{\sigma\in V}\theta(x_\sigma,y_\sigma)}_{{\rm{Node}}\,{\rm{scores}}:\,{\rm{MS}}^2\,{\rm{information}}}+\underbrace{\frac{1}{|E|}\sum\limits_{(\sigma,\tau)\in E}f\left((t_\sigma,t_\tau), (y_\sigma,y_\tau)\,|\,{\mathbf{w}}\right)}_{{\rm{Edge}}\,{\rm{scores}}:\,{\rm{RO}}\,{\rm{information}}},$$

where the function f outputs an edge score (Fig. 1d) expressing the agreement between the observed and the predicted RO, for each candidate assignment pair (yσ, yτ) given the observed RTs (tσ, tτ). The function f is parameterized by the vector w, which is trained specifically for each MS2 scorer (see next section). Using the compatibility score function F (equation (1)), we compute the max-marginal scores43 for each candidate and MS feature, defined for a candidate \(m\in {{{{\mathcal{C}}}}}_{\sigma }\) and MS feature σ as the maximum compatibility score that a candidate assignment \(\bar{{{{\bf{y}}}}}\in {{\varSigma }}\) with \({\bar{y}}_{\sigma }=m\) can reach:
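A minimal sketch of the compatibility score F from equation (1), assuming the MS2 scores θ are pre-computed and the edge model f is passed in as a callable (all names are illustrative and the sequence must contain at least two features):

```python
import itertools

def score_assignment(theta, rts, edge_score, y):
    """Compatibility score F(y | x, t, w, G) on the fully connected graph G.

    theta:      list of dicts; theta[s][m] is the MS2 score of candidate m
                for MS feature s (node scores)
    rts:        observed retention times t_s
    edge_score: callable f((t_s, t_t), (y_s, y_t)) scoring RO agreement
    y:          candidate assignment, one molecule id per MS feature (L >= 2)
    """
    n = len(y)
    # Node term: average MS2 matching score over all features.
    node = sum(theta[s][y[s]] for s in range(n)) / n
    # Edge term: average RO agreement over all feature pairs (complete graph).
    pairs = list(itertools.combinations(range(n), 2))
    edge = sum(edge_score((rts[s], rts[t]), (y[s], y[t])) for s, t in pairs) / len(pairs)
    return node + edge
```

With a zero edge model this reduces to the mean MS2 score, which makes the two normalized terms of equation (1) easy to verify in isolation.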

$$\mu ({y}_{\sigma }=m\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G)=\mathop{\max }\limits_{\{\bar{{{{\bf{y}}}}}\in {{\varSigma }}\,:\,{\bar{y}}_{\sigma }=m\}}\,F(\bar{{{{\bf{y}}}}}\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G).$$

We use μ to rank the molecular candidates34. However, for general graphs G, the max-marginal inference problem (MMAP) is intractable. Therefore, we approximate the MMAP problem by performing the inference on tree-like graphs Tk randomly sampled from G (Fig. 1c), for which exact inference is feasible43,44. Here, k indexes the individual spanning trees. Subsequently, we average the max-marginal scores μ(yσ = m | xi, ti, wk, Tk) over a set of trees T, an approach that has performed well in practical applications34,45,46. Here, i indexes the individual training MS2 spectra and RT sequences. For each spanning tree Tk, we apply a separately trained SSVM model wk to increase the diversity of the predictions.
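The tree-based approximation can be sketched as follows. For clarity, the per-tree max-marginals are computed here by exhaustive enumeration over small candidate sets; in practice, exact max-product message passing on each spanning tree is what makes the computation tractable. All function names are ours:

```python
import itertools, random

def random_spanning_tree(n_nodes, rng):
    """Sample a random spanning tree of the complete graph on n_nodes vertices
    by attaching each node (in random order) to a random already-placed node."""
    order = list(range(n_nodes))
    rng.shuffle(order)
    return [(order[rng.randrange(i)], order[i]) for i in range(1, n_nodes)]

def max_marginals(theta, rts, edge_score, candidates, tree_edges):
    """Brute-force max-marginals mu(y_s = m) on one spanning tree T_k."""
    n = len(candidates)
    mu = [{m: float("-inf") for m in cs} for cs in candidates]
    for y in itertools.product(*candidates):
        f = sum(theta[s][y[s]] for s in range(n)) / n
        f += sum(edge_score((rts[s], rts[t]), (y[s], y[t]))
                 for s, t in tree_edges) / len(tree_edges)
        for s in range(n):
            mu[s][y[s]] = max(mu[s][y[s]], f)   # best score with y_s fixed to m
    return mu

def averaged_max_marginals(theta, rts, edge_score, candidates, n_trees=8, seed=0):
    """Average the per-tree max-marginals over a set of random spanning trees."""
    rng = random.Random(seed)
    acc = [{m: 0.0 for m in cs} for cs in candidates]
    for _ in range(n_trees):
        tree = random_spanning_tree(len(candidates), rng)
        mu = max_marginals(theta, rts, edge_score, candidates, tree)
        for s, cs in enumerate(candidates):
            for m in cs:
                acc[s][m] += mu[s][m] / n_trees
    return acc
```

The averaged scores `acc[s]` are then used to rank the candidate set of feature s.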

Joint annotation using SSVMs

We propose to tackle the joint assignment of candidate labels y ∈ Σ to the sequence of MS features of a LC-MS2 experiment through structured prediction, a family of machine learning methods generally used to annotate sequences or networks35,46,47. In our model, the structure is given by the observed RO of the MS feature pairs (yσ, yτ), which provides additional information on the correct candidate labels yσ and yτ. Given a set of annotated LC-MS2 experiments extracted from MassBank37 (Fig. 1e), we train an SSVM35 model w predicting the edge scores. SSVM models can be optimized using the max-margin principle35. In a nutshell, given a set of ground-truth-annotated MS feature sequences, the model parameters w are optimized such that the correct label sequence yi ∈ Σi, that is, the structure annotations for all MS features in an LC-MS2 experiment, scores higher than any other possible label sequence assignment y ∈ Σi (Fig. 1f).
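Written out, this max-margin principle corresponds to a standard margin-rescaling SSVM objective of the form below, where Δ is a label-sequence loss (for example, a Hamming-type loss counting wrong annotations). This is the generic formulation35; the exact loss and optimization used by LC-MS2Struct are given in Methods:

```latex
$$\min_{\mathbf{w},\,\boldsymbol{\xi}}\;\frac{1}{2}\lVert\mathbf{w}\rVert^{2}
  +\frac{C}{N}\sum_{i=1}^{N}\xi_{i}
\quad\text{s.t.}\quad
F(\mathbf{y}_{i}\,|\,\mathbf{x}_{i},\mathbf{t}_{i},\mathbf{w},G)
  \;\geq\;
F(\mathbf{y}\,|\,\mathbf{x}_{i},\mathbf{t}_{i},\mathbf{w},G)
  +\Delta(\mathbf{y}_{i},\mathbf{y})-\xi_{i}
\quad\forall i,\;\forall\,\mathbf{y}\in\Sigma_{i}\setminus\{\mathbf{y}_{i}\},\quad\xi_{i}\geq 0,$$
```

that is, the correct annotation sequence yi must outscore every alternative y by a margin growing with the loss Δ(yi, y), with slack ξi absorbing violations.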


Extracting training data from MassBank

Ground-truth-annotated MS2 spectra and RTs were extracted from MassBank37, a public online database for MS2 data. Each individual MassBank record typically provides a rich set of meta-information (Supplementary Table 1), such as the chromatographic and MS conditions as well as molecular structure annotations. For training the SSVM model of LC-MS2Struct, the MassBank data were processed such that the experimental conditions were consistent within each MS feature set, that is, with identical LC set-up and MS configuration as in a typical LC-MS2 experiment, to ensure comparable RT, RO and MS2 data. We developed a Python package ‘massbank2db’ that can process MassBank records and group them into consistent MS feature sets, which we denote as MassBank subsets (MB-subsets). For our experiments, we sampled sequences of MS features from the MB-subsets to simulate real LC-MS2 experiments where the signals of multiple unknown compounds are measured under consistent experimental set-ups. Figure 1e illustrates the grouping and LC-MS2 sampling process. Two collections of MassBank data were considered: ALLDATA and the ONLYSTEREO subset.
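The grouping step can be sketched as follows. The meta-data keys used here are illustrative; the actual 'massbank2db' package reads the records' chromatography and MS fields from the MassBank meta-information:

```python
from collections import defaultdict

def group_into_mb_subsets(records):
    """Group MassBank records by their experimental set-up so that RT, RO and
    MS2 data are comparable within each group (an 'MB-subset')."""
    subsets = defaultdict(list)
    for rec in records:
        # Records sharing LC set-up and MS configuration land in the same subset.
        key = (rec["instrument"], rec["column"], rec["eluent"], rec["gradient"])
        subsets[key].append(rec)
    return subsets
```

Sequences of MS features are then sampled from within a single MB-subset, so that each sampled sequence simulates one LC-MS2 experiment under a consistent set-up.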

Comparison of LC-MS2Struct with other approaches

In the first experiment, we compare LC-MS2Struct with previous approaches for candidate ranking either using only-MS2 or additionally using RT or RO information. Only-MS2 uses the MS2 spectrum information to rank the molecular candidates and serves as baseline; MS2 + RO (ref. 34) uses a ranking support vector machine (RankSVM)48,49 to predict the ROs of candidate pairs and a probabilistic inference model to combine the ROs with MS2 scores; MS2 + RT uses predicted RTs to remove false-positive molecule structures from the candidate set, ordered by their MS2 score, by comparing the predicted and observed RT; MS2 + logP is an approach introduced by ref. 11, which uses the observed RT to predict the XlogP3 value50 of the unknown compound and compares it with the candidates’ XlogP3 values extracted from PubChem to refine the initial ranking based on the MS2 scores. The RO-based methods (LC-MS2Struct and MS2 + RO) were trained using the RTs from all available MB-subsets, ensuring that no test molecular structure (based on InChIKey first block, that is, the structural skeleton) was used for the model training (structure disjoint). For the RT-based approaches (MS2 + RT and MS2 + logP), the respective predictors were trained in a structure disjoint fashion using only the RT data available for that MB-subset. For the experiment, all MB-subsets with more than 75 (MS2, RT)-tuples from the ALLDATA data set-up were used (Supplementary Table 2), as the RT-based approaches require LC-system-specific RT training data. The ranking performance was computed for each LC-MS2 experiment within a particular MB-subset. The candidate molecules are identified by their InChIKey first block (the structural skeleton); hence, no stereoisomers are in the candidate sets.
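Identifying candidates by their InChIKey first block amounts to collapsing each candidate set to one representative per 2D skeleton, which can be sketched as (function names are ours; InChIKeys shown in tests are synthetic):

```python
def first_block(inchikey):
    """Return the first (skeleton) block of a standard InChIKey,
    i.e. the 14 characters before the first hyphen."""
    return inchikey.split("-")[0]

def collapse_stereoisomers(candidates):
    """Keep one representative per 2D skeleton (InChIKey first block),
    so that no stereoisomers remain in the candidate set."""
    seen, kept = set(), []
    for ik in candidates:
        fb = first_block(ik)
        if fb not in seen:
            seen.add(fb)
            kept.append(ik)
    return kept
```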

Each candidate ranking approach was evaluated with three MS2 scorers: CFM-ID 4.018, MetFrag11 and SIRIUS17. For LC-MS2Struct, we use stereochemistry-aware molecular fingerprints (3D) to represent the candidates.

Figure 2a shows the average ranking performance (top-k accuracy) across 350 LC-MS2 experiments, each encompassing about 50 (MS2, RT)-tuples (Methods). LC-MS2Struct is the best-performing method combined with any of the three MS2 scorers. For CFM-ID and MetFrag, LC-MS2Struct provides 4.7 and 7.3 percentage unit increases over the only-MS2 for the top-1 accuracy, corresponding to 80.8% and 106% performance gain, respectively. In our setting, that translates to 2.4 and 3.7 additional annotations at the top rank, respectively (out of approximately 50). The performance improvement increases for larger k, reaching as far as 9.3 and 11.3 percentage units for the top-20, which means 4.7 and 5.7 additional correct structures, respectively, in the top-20. For SIRIUS, the improvements are more modest, on average around 2 percentage units for top-1 to top-20. This might be explained by the higher baseline performance of SIRIUS. Nevertheless, SIRIUS can be improved for particular MB-subsets (see Fig. 2b and the discussion in the next section).
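The top-k accuracy reported here can be computed as follows (a straightforward sketch; ties in the ranking are ignored for simplicity):

```python
def top_k_accuracy(rankings, k):
    """Percentage of MS features whose correct structure is ranked within
    the top k of its candidate list.

    rankings: list of (ordered_candidates, correct) pairs, with candidates
              ordered by decreasing score.
    """
    hits = sum(1 for ordered, correct in rankings if correct in ordered[:k])
    return 100.0 * hits / len(rankings)
```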

Fig. 2: Different approaches to combine MS2 and RT information.
figure 2

a, Comparison of the performance, measured by top-k accuracy, for the different ranking approaches combining MS2 and RT information. The results shown are averaged accuracies over 350 sample MS feature sequences (LC-MS2 experiments). b, Average top-k accuracies per MB-subset, rounded to full integers. The colour encodes the performance improvement in percentage units (%p) of each score integration method compared with only-MS2.

Source data

The runner-up score integration method is MS2 + RO, which also makes use of predicted ROs. For CFM-ID and MetFrag, it leads to about one-third to one-half of the performance gain of LC-MS2Struct. The approaches relying on RTs, either by candidate filtering (MS2 + RT) or through logP prediction (MS2 + logP), lead to only minor improvements for MetFrag and CFM-ID, but none for SIRIUS, for which MS2 + RT even leads to a decrease in ranking performance by about 2 percentage units. An explanation for this is that the filtering approach removes on average 4.7% of the correct candidates, which leads to false-negative predictions.
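The MS2 + RT filtering baseline can be sketched as follows (the absolute tolerance window is an assumption; implementations may use relative error windows instead). Note how a poor RT prediction for the true structure removes it from the candidate set, producing exactly the false negatives described above:

```python
def rt_filter(candidates, observed_rt, predicted_rt, tol):
    """Remove candidates whose predicted RT deviates from the observed RT
    by more than tol; the surviving candidates keep their MS2-score order."""
    return [m for m in candidates
            if abs(predicted_rt[m] - observed_rt) <= tol]
```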

The performance gain by using either RO or RT varies between the MB-subsets with differing LC-MS2 set-ups (Supplementary Table 3) and compound class compositions (Extended Data Fig. 1). We illustrate these differences in Fig. 2b. Applying LC-MS2Struct improves the ranking performance in almost all MB-subsets, including for the SIRIUS MS2 scorer (only very slight decreases were observed for a few SIRIUS-scored subsets). This is in stark contrast to the RT-based approaches (MS2 + RT and MS2 + logP), which often lead to less accurate rankings, especially for SIRIUS. Furthermore, as seen already in the average results (Fig. 2a), the benefit of LC-MS2Struct depends on the MS2 base scorer. For example, the top-1 accuracy of the subsets ‘AC_003’ and ‘NA_003’ can be greatly improved for MetFrag but shows little improvement for CFM-ID. Both subsets contain natural-product toxins, which are perhaps poorly explained by the bond-disconnection approach of MetFrag. In contrast, for ‘RP_001’ and ‘UF_003’, the largest improvements (top-1) can be reached for CFM-ID. The RT-filtering approach (MS2 + RT) performs particularly well for ‘LQB_000’ and ‘UT_000’. These subsets mostly contain lipids and lipid-like molecules (Extended Data Fig. 1).

Since the RT prediction models are trained using only data from the respective MB-subsets, more accurate models may be reached for less heterogeneous subsets of molecules. Hence, the RT filtering could work well in such cases26.

Performance for different compound classifications

Next we investigate how LC-MS2Struct can improve the annotation across different categories in two molecule classification systems, ClassyFire51 and PubChemLite40. Figure 3 shows the average top-1 and top-20 accuracy improvement of LC-MS2Struct over the only-MS2 baseline for each ClassyFire super-class and PubChemLite annotation category. For ClassyFire (Fig. 3a), the ranking performance improvement for the different super-classes depends on the MS2 scorer. For example, the top-1 accuracy of ‘Alkaloids and derivatives’ can be improved by 10.8 percentage units for MetFrag, but improves much less for CFM-ID and SIRIUS (1.9 and 3.5 percentage units, respectively). For ‘Organic oxygen compounds’, in contrast, the top-1 accuracy improves by about 10 percentage units when using both CFM-ID and MetFrag, whereas only half that improvement is observed for SIRIUS. This suggests that the CFM-ID results may be improved with the inclusion of more ‘Alkaloids and derivatives’. In addition, the ‘Alkaloids and derivatives’, ‘Organic acids and derivatives’ and ‘Organic nitrogen compounds’ appear less well explained by MetFrag (perhaps with more rearrangements, or less distinguishable spectra), such that the improvement from the RO approach is more apparent. For SIRIUS, ‘Lipids and lipid-like molecules’ as well as ‘Organic oxygen compounds’ benefit the most from LC-MS2Struct in top-1 (both improving by 5.7 percentage units) and top-20 (4.1 and 3.2 percentage units, respectively). In general, for ‘Lipids and lipid-like molecules’, LC-MS2Struct seems to achieve the best improvement (top-1 and top-20) over all MS2 scorers. However, depending on the MS2 scorer, this improvement is distributed differently across the lipid sub-classes (Extended Data Fig. 2), such as ‘Fatty acyls’, ‘Prenol lipids’ or ‘Sphingolipids’.

Fig. 3: Performance gain by LC-MS2Struct across molecular classes.
figure 3

The ranking performance (top-k) improvement of LC-MS2Struct compared with only-MS2 (baseline). The data are presented as mean values (50 samples) and the error bars show the 95% confidence interval of the mean estimate (1,000 bootstrapping samples). The top-k accuracies (%) under the bars show the only-MS2 performance. For each molecular class, the number of unique molecular structures in the class is denoted in the x-axis label (n). a, Molecular classification using the ClassyFire51 framework (class level). b, PubChemLite40 annotation classification system. Molecules not present in PubChemLite are summarized under the ‘noClassification’ category. Note that in PubChemLite, a molecule can belong to multiple categories.

Source data

For the PubChemLite classification (Fig. 3b), we also see that the MS2 scorers benefit differently from LC-MS2Struct. The improvement is generally close to the average improvement of the respective MS2 scorers and seems more equally distributed across the annotation categories.

For example, for CFM-ID, the biggest top-1 improvements are in the ‘foodRelated’ and ‘noClassification’ categories. On the other hand, for SIRIUS the ‘pharmacoInfo’ and ‘bioPathway’ categories improve the most. MetFrag shows the most consistent performance improvement across the categories. ‘agroChemInfo’ benefits the least from LC-MS2Struct (top-1 and top-20). A possible explanation could be that the molecules categorized as agrochemicals are mainly ‘Benzenoids’ (48.5%), ‘Organoheterocyclic compounds’ (25.9%) and ‘Organic acids and derivatives’ (11.6%). As shown in Fig. 3a, these three ClassyFire classes show low (CFM-ID and MetFrag) or practically no (SIRIUS) improvement when using ROs.

Annotation of stereoisomers

Finally, we study whether LC-MS2Struct can annotate stereoisomers more accurately than MS2 alone, considering differences between stereoisomers that vary in their double-bond orientation (for example, cis/trans or E/Z isomerism), which may potentially lead to differences in their LC behaviour (Fig. 4a). We consider candidate sets containing stereoisomers and evaluate LC-MS2Struct only using MassBank records where the ground-truth structure has stereochemistry information provided, that is, where the InChIKey second block is not ‘UHFFFAOYSA’ (ONLYSTEREO data set-up; Methods). The molecular candidates are represented using two different molecular fingerprints: one that includes stereochemistry information (3D); and one that omits it (2D) (Methods). This allows us to assess the importance of stereochemistry-aware features for the structure annotation.
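The distinction between 2D and 3D representations can be illustrated in miniature with InChIKey blocks; LC-MS2Struct itself uses chirality-encoding circular fingerprints38, so the keys below are only a stand-in showing why a stereo-agnostic representation cannot separate stereoisomers (function names and keys are ours):

```python
def key_2d(inchikey):
    """Stereochemistry-agnostic (2D) key: the skeleton block only, so two
    stereoisomers of the same skeleton map to the same value."""
    return inchikey.split("-")[0]

def key_3d(inchikey):
    """Stereochemistry-aware (3D) key: includes the second InChIKey block,
    which encodes stereochemistry ('UHFFFAOYSA' means none is given)."""
    first, second = inchikey.split("-")[:2]
    return (first, second)
```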

Figure 5a shows the ranking performance of LC-MS2Struct using 2D and 3D fingerprints.

Fig. 4: Application of LC-MS2Struct to annotate stereoisomers.
figure 4

Post-hoc analysis of the stereoisomer annotation using LC-MS2Struct for three (MS2, RT)-tuples from our MassBank data associated with the same 2D skeleton (InChIKey first block). In our evaluation, all three MS features were analysed multiple times in different contexts (BS02391126 in four, BS64681001 in eight and PR75447353 in two LC-MS2 experiments). a, MS features with their ground-truth annotations. Two of the spectra (starting with BS) were measured under the same LC condition (MB-subset ‘BS_000’), demonstrating the separation of E/Z isomers on LC columns. b, The candidate sets of the three features are identical (defined by the molecular formula C36H32O19) and contain only three structures. For 12 out of the 14 LC-MS2 experiments, LC-MS2Struct predicts the correct E/Z isomer.

Source data

When looking into the top-1 performance of LC-MS2Struct (3D) for the individual MS2 scorers, we observe an improvement by 2.6, 3.8 and 3.2 percentage units for CFM-ID, MetFrag and SIRIUS, respectively. This translates to performance gains of 87.3%, 95.9% and 44.3%, respectively.

In general, LC-MS2Struct improves the ranking for all three MS2 scorers. The improvement, however, is notably larger when using stereochemistry-aware (3D) candidate features. Interestingly, a similar behaviour could be observed in the ALLDATA setting (Extended Data Fig. 3), even though the absolute performance improvements were smaller. This experiment demonstrates that LC-MS2Struct can use RO information to improve the annotation of stereoisomers.


Discussion

LC-MS2Struct is a novel approach for the integration of MS2 and LC data for the structural annotation of small molecules. The method learns from the pairwise dependencies in the RO of MS features within similar LC configurations and can generalize across different, heterogeneous LC configurations. Furthermore, the use of stereochemistry-aware molecular fingerprints enables LC-MS2Struct to annotate stereoisomers in LC-MS2 experiments based on the observed ROs. In addition, our novel processing pipeline, which groups all (MS2, RT) data from MassBank into subsets of homogeneous LC-MS2 conditions and is implemented in the ‘massbank2db’52 Python package, will, we believe, make MassBank more accessible to other researchers and hence lower the bar of entry to computational metabolomics research.

Our experiments demonstrate that LC-MS2Struct annotates small molecules with an accuracy far superior to more traditional RT filtering and logP-based approaches, and also markedly better than previous methods that rely on ROs. In particular, compared with ref. 34, which used a graphical model as a post-hoc integration tool of MS2 scores and RO predictions, the benefits of learning the parameters of the graphical model are clear. All three studied MS2 scorers could be improved by LC-MS2Struct, including the best-of-class SIRIUS, for which improvements have generally been hard to come by due to its already high baseline accuracy. Our results show the superiority of stereochemistry-aware molecular features for the structure annotation of LC-MS2 data. Remarkably, this was the case not only for the annotation of stereoisomers but also for candidates distinguished by only their 2D structure. This result could be relevant for improving structural annotations in ion mobility separation–mass spectrometry with collision-cross-section measurements.

Our examples indicated that LC-MS2Struct separates candidates with varying double-bond stereochemistry, that is, E/Z and cis/trans isomers (see, for example, Fig. 4). However, there were very few examples of double-bond and/or chiral isomers measured on the same LC system in our dataset, which makes it difficult to quantify this effect or to investigate it further until more such data are publicly available. Furthermore, as non-chiral LC cannot distinguish stereoisomers that differ only in their chiral centres, the development of more selective stereochemistry-aware molecular features, ignoring the chiral annotations, might be beneficial. We also note that the direct modelling of a node score (MS2 information) predictor in the SSVM would be possible. However, as the MS2 scorers used here are already relatively mature and well known in the community, we have left this research line open for future efforts.

Fig. 5: Using LC-MS2Struct to identify stereoisomers.
figure 5

a, Comparison of the performance, measured by top-k accuracy, of LC-MS2Struct using either 2D (no stereochemistry) or 3D (with stereochemistry) molecular fingerprints in the ONLYSTEREO setting. The results shown are averaged accuracies over 94 sample MS feature sequences (LC-MS2 experiments). b, Average top-k accuracies per MB-subset rounded to full integers. The colour encodes the performance improvement in percentage units (%p) of each score integration method compared with only-MS2.

Source data



Methods

We use the following notation to describe LC-MS2Struct:

$$\begin{array}{lll}{{{\rm{Sequence}}}}\,{{{\rm{of}}}}\,{{{\rm{spectra}}}}&{{{\bf{x}}}}=({x}_{1},\ldots ,{x}_{L})&{{{\rm{with}}}}\,{x}_{\sigma }\in {{{\mathcal{X}}}}\\ {{{\rm{Sequence}}}}\,{{{\rm{of}}}}\,{{{\rm{retention}}}}\,{{{\rm{times}}}}&{{{\bf{t}}}}=({t}_{1},\ldots ,{t}_{L})&{{{\rm{with}}}}\,{t}_{\sigma }\in {{\mathbb{R}}}_{\ge 0}\\ {{{\rm{Sequence}}}}\,{{{\rm{of}}}}\,{{{\rm{candidate}}}}\,{{{\rm{sets}}}}&{{{\boldsymbol{{{{\mathcal{C}}}}}}}}=({{{{\mathcal{C}}}}}_{1},\ldots ,{{{{\mathcal{C}}}}}_{L})&{{{\rm{with}}}}\,{{{{\mathcal{C}}}}}_{\sigma }\subseteq {{{\mathcal{Y}}}}\\ {{{\rm{Sequence}}}}\,{{{\rm{of}}}}\,{{{\rm{labels}}}}&{{{\bf{y}}}}=({y}_{1},\ldots ,{y}_{L})\in {{\Sigma }}&{{{\rm{with}}}}\,{y}_{\sigma }\in {{{\mathcal{Y}}}}\\ {{{\rm{Candidate}}}}\,{{{\rm{assignment}}}}\,{{{\rm{space}}}}&{{\varSigma }}={{{{\mathcal{C}}}}}_{1}\times \ldots \times {{{{\mathcal{C}}}}}_{L},\end{array}$$

where \({{{\mathcal{X}}}}\) and \({{{\mathcal{Y}}}}\) denote the MS2 spectra and the molecular structure space, respectively, \({{{\mathcal{C}}}}\) denotes a candidate set that is a subset of all possible molecular structures, and A × B denotes the cross product of two sets A and B. For the purpose of model training and evaluation, we assume a dataset with ground-truth-labelled MS feature sequences: \({{{\mathcal{D}}}}={\{(({{{{\bf{x}}}}}_{i},{{{{\bf{t}}}}}_{i}),{{{{\boldsymbol{{{{\mathcal{C}}}}}}}}}_{i},{{{{\bf{y}}}}}_{i})\}}_{i = 1}^{N}\), where N denotes the total number of sequences. We use \(i,j\in {{\mathbb{N}}}_{\ge 0}\) to index MS feature sequences and \(\sigma ,\tau \in {{\mathbb{N}}}_{\ge 0}\) as indices for individual MS features within a sequence; for example, xiσ denotes the MS2 spectrum at index σ in sequence i. The length of a sequence of MS features is denoted with L. We denote the ground-truth labels (candidate assignment) of sequence i with yi and any labelling with y. Both yi and y lie in Σi. We use y to denote the candidate label variable, whereas m denotes a particular molecular structure. For example, yσ = m means that we assign the molecular structure m as the label to MS feature σ.

Graphical model for joint annotation of MS features

We consider the molecular annotation problem for the output of LC-MS2, which means assigning a molecular structure to each MS feature, as a structured prediction problem35,46,47, relying on a graphical model representation of the sets of MS features arising from an LC-MS2 experiment. For each MS feature σ, we want to predict a label yσ from a fixed and finite candidate (label) set \({{{{\mathcal{C}}}}}_{\sigma }\). We model the observed ROs between each MS feature pair (σ, τ) within an LC-MS2 experiment as pairwise dependencies of the features. We define an undirected graph G = (V, E) with the vertex set V containing a node σ for each MS feature and the edge set E containing an edge for each MS feature pair, E = {(σ, τ) | σ, τ ∈ V, σ ≠ τ} (compare Fig. 1a,c). The resulting graph is complete with an edge between all pairs of nodes. This allows us to make use of arbitrary pairwise dependencies, instead of limiting to, say, adjacent RTs. This modelling choice was previously shown to be beneficial by ref. 34. Here we extend that approach by learning from the pairwise dependencies to optimize joint annotation accuracy, which leads to markedly improved annotation accuracy.

For learning, we define a scoring function F that, given an input MS feature sequence (x, t) and its corresponding sequence of candidate sets \({{{\boldsymbol{{{{\mathcal{C}}}}}}}}\), computes a compatibility score between the measured data and any possible sequence of labels y ∈ Σ:

$$F({{{\bf{y}}}}\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G)=\frac{1}{| V| }\mathop{\sum}\limits_{\sigma \in V}\theta ({x}_{\sigma },{y}_{\sigma })+\frac{1}{| E| }\mathop{\sum}\limits_{(\sigma ,\tau )\in E}\langle {{{\bf{w}}}},{{\bf{{\Gamma }}}}({{{{\bf{t}}}}}^{\sigma \tau },{{{{\bf{y}}}}}^{\sigma \tau })\rangle ,$$

where \(\theta :{{{\mathcal{X}}}}\times {{{\mathcal{Y}}}}\to (0,1]\) is a function returning an MS2 matching score between the spectrum xσ and a candidate \({y}_{\sigma }\in {{{{\mathcal{C}}}}}_{\sigma }\), 〈·, ·〉 denotes the inner product, and w is a model weight vector to predict the RO matching score, based on the joint-feature vector \({{\bf{{\Gamma }}}}:{{\mathbb{R}}}_{\ge 0}\times {{\mathbb{R}}}_{\ge 0}\times {{{\mathcal{Y}}}}\times {{{\mathcal{Y}}}}\to {{{\mathcal{F}}}}\) between the observed RO derived from tστ = (tσ, tτ) and a pair of molecular candidates yστ = (yσ, yτ).

Equation (2) consists of two parts: (1) a score computed over the nodes in G capturing the MS2 information; and (2) a score expressing the agreement of observed and predicted RO computed over the edge set. We assume that the node scores are pre-computed by an MS2 scorer such as CFM-ID18, MetFrag11 or SIRIUS17. The node scores are normalized to (0, 1] within each candidate set \({{{{\mathcal{C}}}}}_{\sigma }\). The edge scores are predicted for each edge (σ, τ) using the model w and the joint-feature vector Γ:

$$\begin{array}{lll}f({{{{\bf{t}}}}}^{\sigma \tau },{{{{\bf{y}}}}}^{\sigma \tau }\,| \,{{{\bf{w}}}})&=&\langle {{{\bf{w}}}},{{\bf{{\Gamma }}}}({{{{\bf{t}}}}}^{\sigma \tau },{{{{\bf{y}}}}}^{\sigma \tau })\rangle \\ &=&\langle {{{\bf{w}}}},{{{\rm{sign}}}}({t}_{\sigma }-{t}_{\tau })\left(\phi ({y}_{\sigma })-\phi ({y}_{\tau })\right)\rangle \\ &=&{{{\rm{sign}}}}({t}_{\sigma }-{t}_{\tau })\langle {{{\bf{w}}}},\phi ({y}_{\sigma })-\phi ({y}_{\tau })\rangle ,\end{array}$$

with \(\phi :{{{\mathcal{Y}}}}\to {{{{\mathcal{F}}}}}_{{{{\mathcal{Y}}}}}\) being a function embedding a molecular structure into a feature space. The edge prediction function (3) produces a high edge score if the observed RO (that is, sign(tσ − tτ)) agrees with the predicted one.
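As an illustrative sketch (not the authors' implementation), the edge score of equation (3) can be computed directly from its last line; here the embedding φ is represented by a plain count-fingerprint vector, and all names are hypothetical:

```python
import numpy as np

def edge_score(w, t_sigma, t_tau, phi_sigma, phi_tau):
    """Sketch of equation (3): sign(t_sigma - t_tau) * <w, phi(y_sigma) - phi(y_tau)>.

    A positive value indicates that the predicted retention order
    <w, phi(y_sigma) - phi(y_tau)> agrees in sign with the observed one.
    """
    return np.sign(t_sigma - t_tau) * np.dot(w, phi_sigma - phi_tau)
```

Note that swapping σ and τ flips both signs and leaves the value unchanged, so the orientation of an edge does not matter.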

Using the compatibility score function (2), the predicted joint annotation for (x, t) corresponds to the highest-scoring label sequence \(\hat{{{{\bf{y}}}}}\in {{\varSigma }}:\hat{{{{\bf{y}}}}}=\arg \mathop{\max }\limits_{\bar{{{{\bf{y}}}}}\in {{\varSigma }}}\,F(\bar{{{{\bf{y}}}}}\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G)\). In practice, however, instead of predicting only the best label sequence, it can be useful to rank the molecular candidates \(m\in {{{{\mathcal{C}}}}}_{\sigma }\) for each MS feature σ. This is because, for state-of-the-art MS2 scorers, the annotation accuracy in the top-20 candidate list is typically much higher than for the highest-ranked candidate (top-1).

Our framework provides candidate rankings by solving the following problem for each MS feature σ and \(m\in {{{{\mathcal{C}}}}}_{\sigma }\):

$$\mu ({y}_{\sigma }=m\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G)=\mathop{\max }\limits_{\{\bar{{{{\bf{y}}}}}\in {{\varSigma }}\,:\,{\bar{y}}_{\sigma }=m\}}\,F(\bar{{{{\bf{y}}}}}\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},G).$$

Problem (4) returns the max-marginal score μ for each candidate m, that is, the maximum compatibility score that any label sequence \(\bar{{{{\bf{y}}}}}\in {{\varSigma }}\) with \({\bar{y}}_{\sigma }=m\) can achieve. One can interpret equation (2) as the log-space representation of an unnormalized Markov random field probability distribution over y associated with an undirected graphical model G (ref. 44).

Feasible inference using random spanning trees

For general graphs G, the maximum a posteriori inference problem (that is, finding the highest-scoring label sequence y given an MS feature sequence) is \({{{\mathcal{N}}}}{{{\mathcal{P}}}}\)-hard53,54. The max-marginals inference (MMAP), needed for the candidate ranking, is an even harder problem, being \({{{\mathcal{N}}}}{{{\mathcal{P}}}}^{{{{\rm{PP}}}}}\)-complete54. However, efficient inference approaches have been developed. In particular, if G is tree-like, we can efficiently compute the max-marginals using dynamic programming and the max-product algorithm43,44. Such tree-based approximations have been shown to be successful in various practical applications34,45,46.
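To make the tree-structured inference concrete, the following sketch (our illustration, not the authors' code) computes exact max-marginals on a path graph, a special case of a spanning tree, via one forward and one backward max-product pass; the 1/|V| and 1/|E| normalization of equation (2) is omitted for brevity:

```python
import numpy as np

def chain_max_marginals(theta, g):
    """Exact max-marginals on a path graph via two max-product passes.

    theta: list of L 1-D arrays; theta[s][m] is the node score of
           candidate m at MS feature s.
    g:     list of L - 1 2-D arrays; g[s][m, mp] is the edge score between
           candidate m at feature s and candidate mp at feature s + 1.
    Returns a list mu with mu[s][m] = max over all label sequences y with
    y_s = m of the (unnormalized) compatibility score.
    """
    L = len(theta)
    # Forward pass: alpha[s][m] = best score of a prefix ending in m at s.
    alpha = [theta[0]]
    for s in range(L - 1):
        alpha.append(theta[s + 1] + np.max(alpha[s][:, None] + g[s], axis=0))
    # Backward pass: beta[s][m] = best score of the suffix after m at s.
    beta = [np.zeros_like(theta[L - 1])]
    for s in range(L - 2, -1, -1):
        beta.insert(0, np.max(g[s] + theta[s + 1][None, :] + beta[0][None, :], axis=1))
    return [a + b for a, b in zip(alpha, beta)]
```

A general spanning tree is handled analogously by passing messages from the leaves towards an arbitrarily chosen root and back.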

Here, we follow the work by ref. 34 and sample a set of random spanning trees (RST) \({{{\bf{T}}}}={\{{T}_{k}\}}_{k = 1}^{K}\) from G, whereby K denotes the size of the RST sample. Each tree Tk has the same node set V as G, but an edge set E(Tk) ⊆ E, with |E(Tk)| = L − 1, ensuring that Tk is a single connected component and cycle-free. We follow the sampling procedure used by ref. 34. Given the RST set T, we compute the averaged max-marginals to rank the molecular candidates34:

$$\bar{\mu }({y}_{\sigma }=m\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},{{{\bf{T}}}})=\frac{1}{K}\mathop{\sum }\limits_{k=1}^{K}\left(\mu ({y}_{\sigma }=m\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},{T}_{k})-\mathop{\max }\limits_{\bar{{{{\bf{y}}}}}\in {{\varSigma }}}\,F(\bar{{{{\bf{y}}}}}\,| \,{{{\bf{x}}}},{{{\bf{t}}}},{{{\bf{w}}}},{T}_{k})\right),$$

where we subtract the maximum compatibility score from the marginal values corresponding to the individual trees to normalize the marginals before averaging34. This normalization value can be efficiently computed given the max-marginals μ. In our experiments, we train K individual models (wk) and associate them with the trees Tk to increase the diversity. The influence of the number of SSVM models on the prediction performance is shown in Extended Data Fig. 4.
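Given per-tree max-marginal arrays, the averaging with normalization in equation (5) is a one-liner. The sketch below (our illustration) assumes equal candidate-set sizes, so that each tree's max-marginals fit in one (L × C) array, and uses the fact that the global maximum of the max-marginals equals the maximum compatibility score:

```python
import numpy as np

def averaged_max_marginals(mu_per_tree):
    """Equation (5): subtract each tree's maximum compatibility score
    (= the global maximum of its max-marginals) and average over trees."""
    return np.mean([mu_k - mu_k.max() for mu_k in mu_per_tree], axis=0)
```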

The SSVM model

To train the model parameters w (equation (2)), we implemented a variant of the SSVM35,36. Its primal optimization problem is given as55:

$$\begin{array}{ll}\mathop{\min }\limits_{{{{\bf{w}}}},{{{\mathbf{\xi }}}}}&\frac{1}{2}\parallel {{{\bf{w}}}}{\parallel }^{2}+\frac{C}{N}\mathop{\sum }\limits_{i=1}^{N}{\xi }_{i}\\ {{{\rm{st.}}}}&F({{{{\bf{y}}}}}_{i}\,| \,{{{{\bf{x}}}}}_{i},{{{{\bf{t}}}}}_{i},{{{\bf{w}}}},{G}_{i})-F({{{\bf{y}}}}\,| \,{{{{\bf{x}}}}}_{i},{{{{\bf{t}}}}}_{i},{{{\bf{w}}}},{G}_{i})\ge \ell ({{{{\bf{y}}}}}_{i},{{{\bf{y}}}})-{\xi }_{i}\\ &\forall i\in \{1,\ldots ,N\},\,\forall {{{\bf{y}}}}\in {{{\varSigma }}}_{i},\end{array}$$

where C > 0 is the regularization parameter, ξi ≥ 0 is the slack variable for example i and \(\ell :{{{\varSigma }}}_{i}\times {{{\varSigma }}}_{i}\to {{\mathbb{R}}}_{\ge 0}\) is a function capturing the loss between two label sequences. The constraint set definition (st.) of problem (6) leads to a parameter vector w that is trained according to the max-margin principle35,36,47, that is, the score F(yi) of the correct label sequence should be greater than the score F(y) of any other label sequence by at least the specified margin ℓ(yi, y). Note that in the SSVM problem (6), a different graph Gi = (Vi, Ei) can be associated with each training example i, allowing, for example, the processing of sequences of different lengths.

We solve (6) in its dual formulation and use the Frank–Wolfe algorithm56 following the recent work by ref. 55. In the Supplementary Information, we derive the dual problem and demonstrate how to solve it efficiently using the Frank–Wolfe algorithm and RST approximations for Gi. Optimizing the dual problem enables us to use non-linear kernel functions \(\lambda :{{{\mathcal{Y}}}}\times {{{\mathcal{Y}}}}\to {{\mathbb{R}}}_{\ge 0}\) measuring the similarity between the molecular structures associated with the label sequences.

The label loss function is defined as follows:

$$\ell ({{{{\bf{y}}}}}_{i},{{{\bf{y}}}})=\frac{1}{| {V}_{i}| }\mathop{\sum }\limits_{\sigma =1}^{L}\left(1-\lambda ({y}_{i\sigma },{y}_{\sigma })\right)$$

and satisfies ℓ(y, y) = 0 (a required property55) if λ is a normalized kernel, which holds true in our experiments (we used the MinMax kernel57).
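Written out, the label loss with a MinMax kernel over counting fingerprints looks as follows (an illustrative sketch; the fingerprint vectors stand in for the molecular structures):

```python
import numpy as np

def minmax_kernel(a, b):
    """MinMax kernel on counting fingerprints; normalized, so k(a, a) = 1."""
    return np.minimum(a, b).sum() / np.maximum(a, b).sum()

def label_loss(y_true, y_pred):
    """Label loss above: mean of (1 - lambda) over the MS features of a sequence."""
    return float(np.mean([1.0 - minmax_kernel(a, b) for a, b in zip(y_true, y_pred)]))
```

Because the kernel is normalized, the loss of a sequence against itself is exactly zero, as required.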

Pre-processing pipeline for raw MassBank records

Extended Data Fig. 5 illustrates our MassBank pre-processing pipeline implemented in the Python package ‘massbank2db’52. First, the MassBank record text files were parsed, and the MS2 spectrum, ground-truth annotation, RT and meta-information were extracted. Records with missing MS2, RT or annotation were discarded. We used the MassBank 2020.11 release for our experiments.

Subsequently, we grouped the MassBank records into subsets (denoted as MB-subsets) where the (MS2, RT)-tuples were measured under the same LC and MS conditions.

Supplementary Table 1 summarizes the grouping criteria. In the next step, we used the InChIKey58 identifier in MassBank to retrieve the SMILES59 representation from PubChem20 (1 February 2021), rather than using the contributor-supplied SMILES. This ensures a consistent SMILES source for the molecular candidates and ground-truth annotations.

Three more filtering steps were performed before creating the final database, to remove records: (1) if the ground-truth exact mass deviated too far (>20 ppm) from the calculated exact mass based on the precursor mass-per-charge and adduct type; (2) if the subset contained <50 unique molecular structures; (3) if they were potential isobars (see pull-request #152 in the MassBank GitHub repository).

Supplementary Table 3 summarizes the LC-MS2 meta-information for all generated MB-subsets.

Generating the molecular candidate sets

We used SIRIUS8,17 to generate the molecular candidate sets. For each MassBank record, the ground-truth molecular formula was used by SIRIUS to collect the candidate structures from PubChem20. The candidate sets generated by SIRIUS contain a single stereoisomer per candidate, identified by their InChIKey first block (structural skeleton). To study the ability of LC-MS2Struct to annotate the stereochemical variant of the molecules, we enriched the SIRIUS candidate sets with stereoisomers, using the InChIKey first block of each candidate to search PubChem (1 February 2021) for stereoisomers. The additional molecules were then added to the candidate sets.

Pre-computing the MS2 matching scores

For each MB-subset, MS2 spectra with identical adduct type (for example, [M + H]+) and ground-truth molecular structure were aggregated. Depending on the MS2 scorer, we either merged the MS2 spectra into a single spectrum (CFM-ID and MetFrag), following the strategy by ref. 11, or provided the MS2 spectra separately (SIRIUS). For the spectrum merging, we used the ‘mzClust_hclust’ function of the xcms package60, which first combines the peaks of all MS2 spectra into a single peak list and subsequently merges peaks based on a mass error threshold.

To compute the CFM-ID (v4.0.7) MS2 matching score, we first predicted the in silico MS2 spectra for all molecular candidate structures based on their isomeric SMILES representation using the pre-trained CFM-ID models (Metlin 2019 MSML) by ref. 18. We merged the three in silico spectra predicted by CFM-ID for different collision energies and compared them with the merged MassBank spectrum using the modified cosine similarity61 implemented in the matchms62 (v0.9.2) Python library. For MetFrag (v2.4.5), the MS2 matching scores were calculated using the FragmenterScore feature based on the isomeric SMILES representation of the candidates. For SIRIUS, the required fragmentation trees were computed using the ground-truth molecular formula of each MassBank spectrum. SIRIUS uses canonical SMILES, which do not encode stereochemical information. Therefore, we used the same SIRIUS MS2 matching score for all stereoisomers sharing the same InChIKey first block.

For all three MS2 scorers, we normalized the MS2 matching scores to the range [0, 1] separately for each candidate set. For the machine-learning-based scorers (CFM-ID and SIRIUS), the matching scores of the candidates associated with a particular MassBank record used in evaluation were predicted using models that did not include its ground-truth structure (determined by InChIKey first block).

If an MS2 scorer failed on a MassBank record, we assigned a constant MS2 score to each candidate.
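As a sketch of this per-candidate-set pre-processing (one simple choice, assuming non-negative raw scores; not necessarily the exact normalization of each scorer), max-normalization maps scores into (0, 1], and a failed scorer yields the same constant for every candidate:

```python
import numpy as np

def normalize_ms2_scores(scores):
    """Max-normalize raw MS2 matching scores within one candidate set.

    If the scorer failed (all scores zero), every candidate receives the
    same constant score instead.
    """
    s = np.asarray(scores, dtype=float)
    if s.size == 0 or s.max() <= 0:
        return np.ones_like(s) if s.size else s
    return s / s.max()
```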

Molecular feature representations

Extended connectivity fingerprints with functional classes (FCFP)38 were used to represent molecular structures in our experiments. We employed RDKit (v2021.03.1) to generate counting FCFP fingerprints. The fingerprints were computed based on the isomeric SMILES, using the parameter ‘useChirality’ to generate fingerprints that either encoded stereochemistry (3D) or not (2D). To define the set of substructures in the fingerprint vector, we first generated all possible substructures, using an FCFP radius of two, based on a set of 50,000 randomly sampled molecular candidates associated with our training data, and all the ground-truth training structures, resulting in 6,925 (3D) and 6,236 (2D) substructures. We used 3D FCFP fingerprints in our experiments, except for the experiments focusing on the annotation of stereoisomers, where we used both 2D and 3D fingerprints for comparison. We used the MinMax kernel57 to compute the similarity between the molecules.

Computing molecular categories

For the analysis of the ranking performance for different molecular categories, we used two classification systems, ClassyFire51, which classifies molecules according to their structure, and PubChemLite40, which classifies molecules according to information available for ten exposomics-relevant categories. For ClassyFire, we used the ‘classyfireR’ R package to retrieve the classification for each ground-truth molecular structure in our dataset. For PubChemLite, the classification categories were retrieved via InChIKey first block matching of each molecular structure; if it was not found in PubChemLite, the category ‘noClassification’ was assigned.

Training and evaluation data set-ups

We considered only MassBank data that have been analysed using an LC reversed-phase (RP) column. We removed molecules from the data if their measured RT was less than three times the estimated column dead-time63, as we considered such molecules to be non-retaining.

We considered two separate data set-ups. The first one, denoted by ALLDATA, used all available MassBank data to train and evaluate LC-MS2Struct. This set-up was used to compare the different candidate ranking approaches as well as to investigate the performance across various molecular classes. The second set-up, denoted by ONLYSTEREO, used MassBank records where the ground-truth molecular structure contains stereochemical information, that is, where the InChIKey second block is not ‘UHFFFAOYSA’. This set-up was used in the experiments regarding the ability of LC-MS2Struct to distinguish stereochemistry. In the training, we additionally used MassBank records that appear only without stereochemical information in our candidate sets, identified by the InChIKey second block equal to ‘UHFFFAOYSA’ in PubChem. The number of available training and evaluation (MS2, RT)-tuples per MB-subset are summarized in Supplementary Table 2.

For each MB-subset, we sampled a set of LC-MS2 experiments, that is, (MS2, RT)-tuple sequences, from the available evaluation data. The number of LC-MS2 experiments (n below) depended on the number of available (MS2, RT)-tuples (Supplementary Table 2) as follows:

$$n=\left\{\begin{array}{ll}0\quad &{{{\rm{if}}}}\,| {{{\mathcal{D}}}}| < 30\\ 1\quad &{{{\rm{else}}}}\,{{{\rm{if}}}}\,30\le | {{{\mathcal{D}}}}| \le 75\\ 15\quad &{{{\rm{else}}}}\,{{{\rm{if}}}}\,76\le | {{{\mathcal{D}}}}| \le 250\\ \left\lfloor \frac{| {{{\mathcal{D}}}}| }{50}\right\rfloor \quad &{{{\rm{else}}}}.\end{array}\right.$$

where \({{{\mathcal{D}}}}\) is a set of (MS2, RT)-tuples with ground-truth annotation and molecular candidate sets associated with an MB-subset. If there are fewer than 30 (MS2, RT)-tuples available, we do not generate an evaluation LC-MS2 experiment from the corresponding MB-subset. On the basis of this sampling scheme, we obtained 354 and 94 LC-MS2 experiments for ALLDATA and ONLYSTEREO, respectively, for our evaluation (Supplementary Table 2).
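The sampling rule above translates directly into code (hypothetical function name):

```python
def n_experiments(d):
    """Number of evaluation LC-MS2 experiments sampled from an MB-subset
    with d annotated (MS2, RT)-tuples, following the piecewise rule above."""
    if d < 30:
        return 0
    elif d <= 75:
        return 1
    elif d <= 250:
        return 15
    else:
        return d // 50
```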

We trained eight (K = 8) separate SSVM models wk for each evaluation LC-MS2 experiment. For each SSVM model, we first generated a set containing the (MS2, RT)-tuples from all MB-subsets. Then, we removed all tuples whose ground-truth molecular structure, determined by the InChIKey first block, was in the respective evaluation LC-MS2 experiment. Lastly, we randomly sampled LC-MS2 experiments from the training tuples, within their respective MB-subset, with a length randomly chosen from 4 to (at most) 32 (see also Fig. 1e) and an RST Tik assigned to each MS feature sequence i. In total, 768 LC-MS2 training experiments were generated for each SSVM model. To speed up the model training, we restricted the candidate set size \(| {{{{\mathcal{C}}}}}_{i\sigma }|\) of each training MS feature σ to at most 75 candidate structures by random subsampling, ensuring that the correct candidate was included in the subsample. Each SSVM model wk was applied to the evaluation LC-MS2 experiment, associated with different RSTs Tk, and the averaged max-marginal scores were used for the final candidate ranking (see equation (5) and Fig. 1c).

SSVM hyperparameter optimization

The SSVM regularization parameter C was optimized for each training set separately using grid search and evaluation on a random validation set sampled from the training data’s (MS2, RT)-tuples (33%). A set of LC-MS2 experiments was generated from the validation set and used to determine the normalized discounted cumulative gain (NDCG)64 for each C value. The regularization parameter with the highest NDCG value was chosen to train the final model. We used the scikit-learn65 (v0.24.1) Python package to compute the NDCG value, considering ranks up to 10 (NDCG@10), and defined the relevance of each candidate as 1 if it is the correct one and 0 otherwise. To reduce the training time, we searched for the optimal C* only for SSVM model k = 0 and used C* for the other models with k > 0.
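With binary relevance and a single relevant candidate, NDCG@k reduces to a simple closed form. The sketch below is a simplified stand-in for scikit-learn's ndcg_score for this special case (ties broken by candidate index here, which is an assumption of this sketch):

```python
import numpy as np

def ndcg_at_k(scores, correct_idx, k=10):
    """NDCG@k with binary relevance and a single relevant candidate.

    The ideal DCG is 1/log2(2) = 1, so the NDCG equals 1/log2(1 + rank)
    if the correct candidate is ranked within the top k, and 0 otherwise.
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    rank = int(np.where(order == correct_idx)[0][0]) + 1
    return 1.0 / np.log2(1 + rank) if rank <= k else 0.0
```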

Ranking performance evaluation

We computed the ranking performance (top-k accuracy) for a given LC-MS2 experiment using the tie-breaking strategy described in ref. 8: if a ranking method assigns an identical score to a set of n molecular candidates, then all accuracies at the ordinal ranks k at which one of these candidates is found are increased by 1/n. We computed a candidate score (that is, only-MS2, LC-MS2Struct and so on) for each molecular structure in the candidate set (identified by PubChem CID). Depending on the data set-up (Supplementary Table 4), we first collapsed the candidates by InChIKey first block (ALLDATA, method comparison and molecule category analysis) or full InChIKey (ONLYSTEREO, stereochemistry prediction), assigning the maximum candidate score to each InChIKey first block or InChIKey group, respectively. Subsequently, we computed the top-k accuracy based on the collapsed candidate sets.
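The tie-aware contribution of a single MS feature to the top-k accuracy can be sketched as follows: with b candidates scoring strictly higher than the correct one and n candidates tied with it, the ranks b + 1 to b + n each contribute 1/n (illustrative helper, not the authors' code):

```python
import numpy as np

def topk_contribution(scores, correct_idx, k):
    """Tie-aware top-k accuracy contribution of one MS feature (ref. 8 rule)."""
    s = np.asarray(scores, dtype=float)
    b = int(np.sum(s > s[correct_idx]))   # candidates strictly better than the correct one
    n = int(np.sum(s == s[correct_idx]))  # tied candidates, including the correct one
    return float(np.clip(k - b, 0, n)) / n
```

The overall top-k accuracy is then the average of these contributions over all MS features of an experiment.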

For the performance analysis of individual molecule categories, either ClassyFire51 or PubChemLite40 classes, we first computed the rank of the correct molecular structure for each (MS2, RT)-tuple of each LC-MS2 evaluation experiment based on the only-MS2 and LC-MS2Struct scores. Subsequently, we computed the top-k accuracy for each molecule category associated with at least 50 unique ground-truth molecular structures (based on the InChIKey first block). As a ground-truth structure can appear multiple times in our dataset, we generated 50 random samples, each containing only one example per unique structure, and computed the averaged top-k accuracy.

Comparison of LC-MS2Struct with other approaches

We compared LC-MS2Struct with three different approaches to integrate MS2 and RT information, namely RT filtering, logP prediction and RO prediction.

For RT filtering (MS2 + RT), we followed ref. 26, which used the relative error \(\epsilon =\frac{| \hat{t}-{t}_{\sigma }| }{{t}_{\sigma }}\) between the predicted (\(\hat{t}\)) and observed (tσ) RT. We set the filtering threshold to the 95% quantile of the relative RT prediction errors estimated from the RT model’s training data, following refs. 27,29. We used scikit-learn’s65 (v0.24.1) implementation of support vector regression66 with a radial basis function kernel for the RT prediction. For the support vector regression, we used the same 196 features, computed using RDKit (v2021.03.1), as in ref. 25.
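The RT filter can be sketched as follows (illustrative helper names; the threshold is the 95% quantile of the RT model's training-set relative errors, as described above):

```python
import numpy as np

def rt_filter_mask(pred_rts, observed_rt, train_rel_errors):
    """Boolean mask keeping candidates whose relative RT error
    eps = |t_hat - t_obs| / t_obs stays below the 95% quantile threshold."""
    threshold = np.quantile(np.asarray(train_rel_errors, dtype=float), 0.95)
    eps = np.abs(np.asarray(pred_rts, dtype=float) - observed_rt) / observed_rt
    return eps <= threshold
```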

For logP prediction (MS2 + logP), we followed ref. 11, which assigned a weighted sum of an MS2 and a logP score \(s=\beta{s}_{{{{\rm{MS{}}}^{{2}}}}}(m)+(1-\beta ){s}_{{{{\rm{log}}P}}}(m)\) to each candidate \(m\in {{{{\mathcal{C}}}}}_{\sigma }\) and used it to rank the set of molecular candidates. The logP score is given by \({s}_{{{{\rm{log}}P}}}(m)=\frac{1}{\delta \sqrt{2\uppi }}\exp \left(-\frac{{({{{{\rm{log}}P}}}_{m}-{{{{\rm{log}}P}}}_{\sigma })}^{2}}{2{\delta }^{2}}\right)\), where logPm is the predicted XLogP350 extracted from PubChem20 for candidate m, and logPσ = a · tσ + b is the XLogP3 value of the unknown compound, associated with MS feature σ, predicted based on its measured RT tσ. The parameters a and b of the linear regression model were determined using a set of RT and XLogP3 tuples associated with the LC system. As in ref. 11, we set δ = 1.5 and chose β to maximize the top-1 candidate ranking accuracy, calculated from a set of 25 randomly generated training LC-MS2 experiments.
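The Gaussian logP agreement score and the weighted candidate score translate directly into code (illustrative sketch; the regression coefficients a and b are assumed to be fitted beforehand, so the predicted value logp_sigma = a * t_sigma + b is passed in directly):

```python
import numpy as np

def logp_score(logp_m, logp_sigma, delta=1.5):
    """Gaussian agreement between a candidate's XLogP3 value and the value
    predicted from the measured RT."""
    return np.exp(-(logp_m - logp_sigma) ** 2 / (2 * delta ** 2)) / (delta * np.sqrt(2 * np.pi))

def combined_score(s_ms2, logp_m, logp_sigma, beta, delta=1.5):
    """Weighted candidate score s = beta * s_MS2 + (1 - beta) * s_logP."""
    return beta * s_ms2 + (1 - beta) * logp_score(logp_m, logp_sigma, delta)
```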

For RO prediction (MS2 + RO), we used the approach by ref. 34, which relies on a RankSVM implementation in the Python library ROSVM31,67 (v0.5.0). We used counting ‘substructure’ fingerprints calculated using the CDK (v2.5)68 and the MinMax kernel57. The MS2 matching scores and the predicted ROs were used to compute max-marginal ranking scores using the framework by ref. 34; we used the authors’ implementation in version 0.2.369. The hyper-parameters β and k of the model were optimized for each evaluation LC-MS2 experiment separately using the respective training data. To estimate β, we generated 25 LC-MS2 experiments from the training data and selected the β that maximized the Top20AUC34 ranking performance. The sigmoid parameter k was estimated using Platt’s method70, calibrated on RankSVM’s training data. We used 128 random spanning trees per evaluation LC-MS2 experiment to compute the averaged max-marginals.

For the experiments comparing the different methods, we used all LC-MS2 experiments generated, except the ones from the MB-subsets ‘CE_001’, ‘ET_002’, ‘KW_000’ and ‘RP_000’ (Supplementary Table 2). For those subsets, the evaluation LC-MS2 experiment contains all available (MS2, RT)-tuples, leaving no LC-system-specific data to train the RT (MS2 + RT) or logP (MS2 + logP) prediction models. The RT and logP prediction models were trained in a structure-disjoint fashion using the RT data of the particular MB-subset associated with the evaluation LC-MS2 experiment. The RO prediction model used by MS2 + RO was trained structure-disjoint as well, but using the RTs of all MB-subsets.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.