MS2Query: reliable and scalable MS2 mass spectra-based analogue search

Metabolomics-driven discoveries in biological samples remain hampered by the grand challenge of metabolite annotation and identification. Only a few metabolites have an annotated spectrum in spectral libraries; hence, searching only for exact library matches generally returns few hits. An attractive alternative is searching for so-called analogues as a starting point for structural annotations; analogues are library molecules which are not exact matches but display a high chemical similarity. However, current analogue search implementations are not yet very reliable and are relatively slow. Here, we present MS2Query, a machine learning-based tool that integrates mass spectral embedding-based chemical similarity predictors (Spec2Vec and MS2Deepscore) as well as detected precursor masses to rank potential analogues and exact matches. Benchmarking MS2Query on reference mass spectra and experimental case studies demonstrates improved reliability and scalability. Thereby, MS2Query offers exciting opportunities to further increase the annotation rate of metabolomics profiles of complex metabolite mixtures and to discover new biology.


Supplementary Note 1: Model used for Case studies
The model used for the optimization of the hyperparameters and for the case studies was trained on positive ionization mode mass spectra from GNPS (https://gnpsexternal.ucsd.edu/gnpslibrary/ALL_GNPS), downloaded on the 15th of November 2021, 20:00 CET. This model was trained with MS2Query version 0.3.2; the main difference from the current version is that the training of the models was done in notebooks, instead of through a fully automatic pipeline.
The same training workflow was used as for the k-fold cross-validation, except for the sizes of the test sets and the validation set split. The "analogue test set" for this model was generated by randomly selecting 250 unique 2D structures from the library and selecting all corresponding spectra. The "exact matches test set" for this model was generated by randomly selecting 3000 spectra from the library with at least one 2D structure match in the training spectra.
In addition, a validation set was used for optimizing the hyperparameters of MS2Query and for deciding which features to include in the random forest model. The validation set for MS2Query consisted of all spectra of 50 unique 2D structures and 600 spectra with at least one 2D structure match in the training spectra.
The model was trained in the same way as in the k-fold cross-validation, but the number of training pairs was slightly different. For generating the training spectrum pairs for the random forest model of MS2Query, a set of spectra was used containing all spectra of 200 unique 2D structures and 2400 spectra with at least one 2D structure match in the training spectra.
When aiming for a recall of 35%, the MS2Query threshold for this test set is 0.633, resulting in an average Tanimoto score of 0.67 (Supplementary Figure 3 shows a detailed Tanimoto score distribution). The feature importances of the random forest model are listed in Table S1. This method is based on an impurity-based feature importance, also known as the Gini importance 1. As an alternative method for assessing the impact of the features on model performance, random forest models were trained with one of the features left out, to determine the difference in performance for these models (Table S2). Replacing the average MS2Deepscore of multiple library structures with the MS2Deepscore between a single library spectrum and the query spectrum increased the MSE from 0.0255 to 0.0337 for the validation data set. This clearly demonstrates that using the average over multiple similar library spectra substantially improves the performance of the random forest model. Removing any of the other features also decreased the performance of the model.
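The two feature-assessment approaches described above can be sketched with scikit-learn. This is a minimal illustration, not the MS2Query training code; the feature names and the synthetic data are assumptions for the example.

```python
# Sketch of (1) impurity-based (Gini) feature importance and
# (2) leave-one-feature-out retraining, using synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
feature_names = ["precursor_mz_difference", "query_precursor_mz",
                 "spec2vec_score", "ms2deepscore", "average_ms2deepscore"]
X = rng.random((500, len(feature_names)))
# Target: Tanimoto score between query and library structure (synthetic here).
y = 0.6 * X[:, 4] + 0.3 * X[:, 3] + 0.1 * rng.random(500)

model = RandomForestRegressor(max_depth=5, random_state=0).fit(X, y)

# 1) Impurity-based importance is available directly after training.
for name, importance in zip(feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")

# 2) Leave-one-feature-out: retrain without each feature, compare MSE.
baseline_mse = mean_squared_error(y, model.predict(X))
for i, name in enumerate(feature_names):
    X_drop = np.delete(X, i, axis=1)
    reduced = RandomForestRegressor(max_depth=5, random_state=0).fit(X_drop, y)
    mse = mean_squared_error(y, reduced.predict(X_drop))
    print(f"without {name}: MSE {mse:.4f} (baseline {baseline_mse:.4f})")
```

A feature whose removal clearly increases the validation MSE (as seen for the average MS2Deepscore) contributes predictive power beyond what the impurity-based importance alone reveals.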

Rationale behind precursor m/z difference
One of the newly introduced features of MS2Query is using the precursor m/z difference as an input feature for the random forest model. Current implementations of an analogue search often start with a preselection on precursor m/z, followed by selecting spectra above a predefined threshold for the used similarity score 2. A predefined precursor m/z threshold does not differentiate between precursor m/z differences, as long as the mass difference falls within the set threshold. However, certain specific mass differences are more likely than others. By using the precursor m/z difference as an input feature, the random forest can learn specific mass differences that are more likely to correspond to a good analogue or exact match. This approach has similarities to the analogue search implementation by Stephen E. Stein and colleagues: in their approach, they limit the analogues to a predefined list of precursor m/z differences of commonly observed losses 3. This is an interesting approach; however, it limits the analogues to known losses and does not allow for analogues in which a substructure is substituted by another substructure. Our random forest model is more flexible and can learn the mass differences that are an indication of a good analogue, even if these are not known mass differences.

Modified cosine score and cosine score
Both the cosine score and the modified cosine score were added as features when training random forest models, to explore whether they would notably contribute to the model predictions. However, the feature importance was always 0 for these scores, indicating that the model does not use the (modified) cosine score for the prediction at the set tree depth of 5. This suggests that the (modified) cosine score does not provide additional predictive power when MS2Deepscore and Spec2Vec are part of the model as well. Therefore, it was decided not to incorporate the (modified) cosine score in the workflow of MS2Query.

Weighting of the average of multiple library structures
The multiple library spectra were selected from the 10 library structures that are chemically most similar to the structure of interest, based on the Tanimoto scores. For all spectra belonging to these 10 selected library structures, we compute the MS2Deepscore with the query spectrum. To find a good, reliable method for taking an average and for weighting each of these selected spectra, we tried different approaches and selected the method with the best performance. To assess which method works best, the MSE was compared for two models with different approaches for this feature, while the other 4 features stayed constant (precursor m/z difference, query precursor m/z, spec2vec_score, and MS2Deepscore).
There are two approaches that we assessed for calculating the average of the calculated MS2Deepscores.
The first method takes the average over all selected spectra, regardless of which library structure they belong to; the second method first calculates the average MS2Deepscore for each selected library structure, followed by taking the average of the 10 per-structure averages. Since the number of library spectra per InChIKey differs, the two methods weight the spectra differently.
The model using the feature that first calculates the average MS2Deepscore for each selected library structure, followed by taking the average of the 10 per-structure averages, performed best. This method had an MSE of 0.0283 for the training set and 0.0257 for the validation set. The model using the feature where the average is taken over all spectra at once had an MSE of 0.0301 for the training data and 0.0286 for the validation data. Therefore, the per-structure averaging method was selected.
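The difference between the two averaging strategies can be illustrated in a few lines. The MS2Deepscore values and structure labels below are hypothetical; only the averaging logic reflects the comparison described above.

```python
# Two ways to average MS2Deepscore values over spectra of the 10 most
# similar library structures (3 structures shown for brevity).
import numpy as np

# Hypothetical scores between the query spectrum and each library spectrum,
# grouped by library structure (InChIKey); spectrum counts differ per structure.
scores_per_structure = {
    "STRUCTURE_A": [0.90, 0.88, 0.86, 0.84],  # 4 spectra
    "STRUCTURE_B": [0.70],                    # 1 spectrum
    "STRUCTURE_C": [0.60, 0.62],              # 2 spectra
}

# Method 1: flat average over all spectra; structures with many spectra
# dominate the result.
all_scores = [s for scores in scores_per_structure.values() for s in scores]
flat_average = np.mean(all_scores)

# Method 2 (selected for MS2Query): average per structure first, then
# average the per-structure means, so every structure contributes equally.
per_structure_means = [np.mean(s) for s in scores_per_structure.values()]
structure_average = np.mean(per_structure_means)

print(f"flat average: {flat_average:.3f}")
print(f"per-structure average: {structure_average:.3f}")
```

Because STRUCTURE_A contributes four spectra, the flat average is pulled towards its high scores, while the per-structure average weights all three structures equally.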

Weighting based on Tanimoto score
In the comparison done in S2.2, each library structure is weighted equally. In practice, however, the 10 closest structures vary in how chemically similar they are to the structure of interest (measured using the Tanimoto score). To explore whether this equal weighting has any notable negative impact on model performance, we tested a weighted average, which gives more importance to library molecules that are more similar. We weighted each score by the Tanimoto score and also tested different powers of this weighting: a higher power gives more extreme weight to spectra of more similar library structures, while a lower power has less effect.
As weights, the Tanimoto score to the power of 1-3 was tested. Table S3 shows the performance of the different models. Weighting the different structures by the Tanimoto score did not improve the overall performance of the model; therefore, no weighting is used for calculating the feature based on multiple library structures.
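The weighting variants tested above can be sketched as follows. The score and Tanimoto values are illustrative placeholders, not values from the study.

```python
# Tanimoto-weighted average of per-structure MS2Deepscore values,
# tested with weight powers 1-3.
import numpy as np

# Hypothetical per-structure average MS2Deepscores (3 structures for brevity)
# and their Tanimoto similarity to the candidate library structure.
per_structure_ms2deepscore = np.array([0.85, 0.75, 0.60])
tanimoto_scores = np.array([0.9, 0.6, 0.3])

for power in (1, 2, 3):
    weights = tanimoto_scores ** power
    weighted = np.average(per_structure_ms2deepscore, weights=weights)
    print(f"power {power}: weighted average {weighted:.3f}")
```

Raising the power shifts the average towards the most similar structure; since none of these variants improved the MSE, the unweighted mean was kept.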

Mass spectrometer instrument type
The mass spectra in the GNPS library are measured on a wide variety of mass spectrometry instruments. We tested whether this information could be used to improve the performance of the random forest model, since it is conceivable that some features are influenced by the instrument type, or by whether the two spectra of interest were measured on different instruments. To this end, we used the spectrum metadata to classify the query and library spectra as measured on a "ToF", "Quadrupole", "Ion Trap", or "Orbitrap" setup. Spectra for which no instrument was given, or whose instrument could not be classified as one of these 4 types, were left unclassified. For both the query spectrum and the library spectrum, a binary feature was created for each of these instrument types, resulting in 4 features for the query spectrum and 4 for the library spectrum: 8 features in total. A feature was set to 1 if the spectrum was measured on that instrument type and 0 otherwise; if the instrument type could not be classified, all features were set to 0.
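The 8 binary features described above amount to a one-hot encoding per spectrum. A minimal sketch (function names are illustrative, not MS2Query API):

```python
# One-hot encoding of the instrument type for a query/library spectrum pair:
# 4 features per spectrum, 8 in total; unknown types give all zeros.
INSTRUMENT_TYPES = ["ToF", "Quadrupole", "Ion Trap", "Orbitrap"]

def instrument_features(instrument_type):
    """Binary features for one spectrum; unclassifiable input -> all zeros."""
    return [1 if instrument_type == t else 0 for t in INSTRUMENT_TYPES]

def pair_features(query_instrument, library_instrument):
    """Concatenate query and library instrument features (8 values)."""
    return instrument_features(query_instrument) + instrument_features(library_instrument)

print(pair_features("Orbitrap", "ToF"))
# 8 features: [0, 0, 0, 1, 1, 0, 0, 0]
```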
Training a random forest model including these features always resulted in a feature importance of 0 for all of them. This shows that, with this setup, the random forest model was not able to use this information to improve the predictions of MS2Query. Therefore, these features were left out of the current MS2Query workflow.

Supplementary Note 4: Performance of an analogue search for different mass ranges shows increased performance for larger metabolites
Many tools for MS2 spectrum annotation do not perform equally well for low and high query masses 4-6.
For MS2Query, the performance for different masses was tested by splitting the 'analogues test set' (spectra without an exact match in the library) into three mass ranges: 0-300 Da, 300-600 Da, and >600 Da. Figure 3 displays the Tanimoto score distributions of the suggested analogues for these three mass ranges. This analysis reveals that MS2Query performs best for large metabolites (>600 Da), for which it detected analogues with an average Tanimoto score of 0.85 at a recall of 63%. A better performance for larger metabolites is also observed when using MS2Deepscore, the cosine score, or the modified cosine score. The plots showing optimal and random results display the same trend, suggesting that at least part of this improved performance for larger masses is an artifact of the spectral libraries or of the Tanimoto score. However, the increase for large masses is more pronounced for the analogue search methods than for the random search. A possible explanation why analogue searching is more accurate for larger metabolites is that larger metabolites often produce a higher number of characteristic fragments. In practice, the observed high analogue similarity for larger molecules is interesting, since it is complementary to currently existing methods relying on fragmentation tree-based approaches. Fragmentation tree-based methods perform well for smaller metabolites (<500 Da), but perform less well for larger metabolites, both in terms of computational time and reliability 4,6,7. This shows the potential for combining the two approaches, using the best of both for optimal performance.
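The mass-range split described above is a simple binning step. The following sketch uses made-up precursor masses and Tanimoto scores purely to show the binning and per-range summary:

```python
# Split a test set into the three mass ranges and summarize the Tanimoto
# scores of the best suggested analogue per range (illustrative data).
import numpy as np

precursor_mz = np.array([150.0, 280.5, 420.3, 550.1, 710.9, 890.2])
tanimoto_of_best_analogue = np.array([0.40, 0.55, 0.60, 0.65, 0.85, 0.88])

mass_ranges = {"0-300 Da": (0.0, 300.0),
               "300-600 Da": (300.0, 600.0),
               ">600 Da": (600.0, np.inf)}

for label, (low, high) in mass_ranges.items():
    mask = (precursor_mz >= low) & (precursor_mz < high)
    mean_tanimoto = np.mean(tanimoto_of_best_analogue[mask])
    print(f"{label}: mean Tanimoto {mean_tanimoto:.3f} ({mask.sum()} spectra)")
```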

Supplementary Note 5: Justification for method of selecting training data
Pairs of spectra were used as the training data of the random forest. If the spectrum pairs had been picked at random, the resulting Tanimoto scores would be very low on average 8. To prevent overfitting to low Tanimoto scores, it is important that a more equal distribution of Tanimoto scores is used for training. To achieve this, the spectrum pairs were picked by selecting the 100 highest-scoring library spectra according to MS2Deepscore for a set of training spectra. This method for selecting spectrum pairs for training is similar to the workflow for running MS2Query and results in a relatively equal distribution of Tanimoto scores between these spectra, preventing a bias towards predicting lower Tanimoto scores.

The results of the analogue searches were merged with the annotated csv files. Despite our efforts, it is challenging to judge for a case study which method performs better, due to the relatively low number of (tentatively) validated spectra and the fact that judging the quality of analogues is somewhat subjective. However, for case study 1 a relatively high number of case study spectra could be validated: in total, for 67 spectra, the precursor m/z and retention time could be linked to an in-house reference. The results predicted by both GNPS and MS2Query were compared to the reference standards. The biggest difference was in the spectra that we were not able to validate, which makes it challenging to directly compare the performance of MS2Query with the GNPS analogue search for this case study.

Benchmarking with the cosine score, modified cosine score, and MS2Deepscore was done with a preselection on the mass difference between a query and library molecule of 100 Da. In Supplementary Figure 7, we compare the difference in performance with and without a preselection on mass difference. These results

show that performance is slightly better when using prefiltering on a mass difference of 100 Da for MS2Deepscore and the cosine score. For the modified cosine score, the results are slightly better at low recall settings when no preselection on mass difference is done; however, this comes at the cost of a large increase in runtime.
In addition, a new MS2Deepscore model was trained that includes the precursor m/z during training. The performance of this model is slightly better, but MS2Query still performs better.
Supplementary Figure 6: Comparison of performance for variants of the reference benchmarking used in Figure 2. The cosine score and modified cosine score are used as reference benchmarking. Both scores calculate spectral similarity by directly comparing spectral peaks: the score is calculated by finding the best possible matches between the peaks of two spectra and subsequently calculating a cosine score between the resulting vectors.
For the comparison of a spectrum S to another spectrum S', a peak p of S is considered an eligible match for a peak p' of S' for the cosine score if |mz(p) - mz(p')| < t, where t is the tolerance. The tolerance used was 0.05 Da.
For the modified cosine score, a peak is also considered an eligible match if |mz(p) - mz(p')| < t, but additionally if |mz(p) + M - mz(p')| < t, where M = PM(S') - PM(S). Here, M is the modification mass and PM is the precursor mass.
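The two eligibility rules above can be sketched directly. This is a minimal illustration of the matching conditions only; full implementations (e.g. matchms) additionally solve for the best global assignment of peaks before computing the cosine score.

```python
# Peak-match eligibility for the cosine and modified cosine scores,
# with tolerance t = 0.05 Da as used above.
def eligible_cosine(mz_p, mz_p_prime, tol=0.05):
    """Cosine score: peaks match if their m/z values are within tolerance."""
    return abs(mz_p - mz_p_prime) <= tol

def eligible_modified_cosine(mz_p, mz_p_prime, precursor_s, precursor_s_prime,
                             tol=0.05):
    """Modified cosine: also allow matches shifted by the modification mass
    M = PM(S') - PM(S)."""
    modification = precursor_s_prime - precursor_s
    return (abs(mz_p - mz_p_prime) <= tol
            or abs(mz_p + modification - mz_p_prime) <= tol)

print(eligible_cosine(100.00, 100.03))                       # True
print(eligible_modified_cosine(100.0, 114.0, 300.0, 314.0))  # True: shifted by M = 14
```

The shifted condition is what lets the modified cosine score match fragments that carry the structural modification, which is why it is commonly used for analogue comparisons.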
Supplementary Note 9: Method details listed in Tables
Supplementary Table 5