Molecular search by NMR spectrum based on evaluation of matching between spectrum and molecule

Inferring molecular structures from experimentally measured nuclear magnetic resonance (NMR) spectra is an important task in many chemistry applications. Herein, we present a novel method implementing an automated molecular search by NMR spectrum. Given a query spectrum and a pool of candidate molecules, the matching score of each candidate molecule with respect to the query spectrum is evaluated by introducing a molecule-to-spectrum estimation procedure. The candidate molecule with the highest matching score is selected. This procedure does not require any prior knowledge of the corresponding molecular structure nor laborious manual efforts by chemists. We demonstrate the effectiveness of the proposed method on molecular search using 13C NMR spectra.

The main advantages of the proposed method over conventional approaches are as follows. Compared to the chemical shift similarity approach in Fig. 1a, the proposed method does not require any prior knowledge or laborious manual efforts for peak picking and assignment from the query spectrum. Compared to the spectral similarity approach in Fig. 1b, the proposed method does not require measured/simulated spectra of candidate molecules, which are expensive or otherwise difficult to obtain. These make it beneficial for implementing automated molecular search in general situations where information is limited.

Methods
Problem definition. The problem of molecular search by NMR spectrum is formulated as follows. Suppose a spectrum experimentally measured by NMR spectroscopy is given as a query in the form of S = (x, y) , where x = [x 1 , . . . , x l ] and y = [y 1 , . . . , y l ] represent the x-axis (frequency) and y-axis (intensity) of the spectrum, respectively. The corresponding molecular structure of the query spectrum S is unknown. No prior information about the molecular structure, such as the chemical formula, is available for use. We are also given a pool of candidate molecules D = {G 1 , . . . , G N } for which experimentally measured spectra are not provided. From the candidate pool D with no further information, we wish to search for the best matching molecule G * that is expected to have an NMR spectrum that is the closest match to the query spectrum S.
To evaluate the matching between the query spectrum S and a candidate molecule G t , we introduce the score function that involves a molecule-to-spectrum estimation procedure. The best matching molecule G * with the highest matching score is obtained as: Molecule-to-spectrum estimation procedure. Given the query spectrum S and candidate molecule G , the molecule-to-spectrum estimation procedure is used to evaluate whether G has a spectrum similar to S . The procedure is composed of three sequential steps, as illustrated in Fig. 2. The first step is to predict the chemical   www.nature.com/scientificreports/ shifts of G . The second step is to align the chemical shifts with S . The third step is to construct a spectrum that estimates S . For these three steps, we introduce the predict, align, and optimize functions.
From molecule to predicted chemical shifts. Given a candidate molecule G , the first step is to predict the chemical shifts of its NMR-active atom, which we denote by , using the predict function: where m is the number of NMR-active atoms in the molecule G and the elements of are sorted in ascending order.
The success of molecular search primarily relies on the accuracy of the chemical shift prediction. Any method that provides an accurate prediction and is computationally efficient can be used in this step. In this study, we use two representative methods as the predict function: • Hierarchical Organization of Spherical Environments (HOSE) 7 A HOSE code encodes the neighborhood information around an NMR-active atom in a spherical radius. If two atoms have similar neighbors, they will have similar HOSE codes. To predict the chemical shifts of the candidate molecule G , we generate HOSE codes for the NMR-active atoms in the molecule. For each NMR-active atom, we search all atoms with the same HOSE code from the NMRShiftDB2 database 14 . We then take the average shift of the atoms as the predicted chemical shift. • Message Passing Neural Network (MPNN) 1 A molecule is represented as a graph, whose nodes and edges correspond to atoms and bonds, respectively. An MPNN 15 processes the graph representation of the molecule using multiple message passing steps to predict the chemical shifts of the NMR-active atoms in the molecule. We train the MPNN using the molecules and their annotated chemical shifts that are collected from the NMRShiftDB2 database 14 . The MPNN is then used to predict the chemical shifts of the candidate molecule G.
Alignment of predicted chemical shifts. Given the predicted chemical shifts and the query spectrum S , we align the chemical shifts to S to obtain the aligned chemical shifts ′ using the align function: The pseudocode of the align function is given below.
From the query spectrum S , we choose the frequency values whose intensity is above a threshold τ , denoted as x τ = {x j ∈ x|y j > τ} . The value of τ is the minimum intensity to identify a peak. It should be chosen adequately to distinguish between peaks and noise in the spectrum, thereby ensuring that τ is greater than the smallest peak intensity and x τ includes all the actual chemical shifts of the spectrum. Then, each element in is aligned as follows. The smallest chemical shift δ 1 is aligned to the minimum value of x τ . The largest chemical shift δ m is aligned to the maximum value of x τ . The other chemical shifts δ i , i = 2, . . . , m − 1 are aligned to their closest values in x τ . Regarding the comparison between the candidate molecule G and query spectrum S , this step allows some inaccuracies in the predicted chemical shifts from HOSE or MPNN as well as in the spectrum caused by such reasons as shielding, hydrogen-bonding, and solvent effects.
From aligned chemical shifts to estimated spectrum. Given the aligned chemical shifts ′ and query spectrum S , this step optimizes the chemical shifts to construct an estimated spectrum of the candidate molecule G with respect to S . Using the optimize function, the optimized chemical shifts * and estimated spectrum ŷ are obtained as follows: The pseudocode of the optimize function is given below. where the parameters µ and σ are associated with the chemical shifts and their peak intensities, respectively. For the kernel function k, we use the Gaussian-Lorentzian sum function k(z; ) = (1 − ) exp −(4 ln 2)z 2 + /(1 + 4z 2 ) to approximate the shape of a peak in the spectrum, where lies in the range of [0, 1]. The estimated spectrum ŷ is updated along with the parameters (µ, σ , ) to have a similar shape to that of the query spectrum S . We initialize µ to the values of the aligned chemical shifts ′ , σ to a certain initial value h, and to 0.5. Then, (µ, σ , ) are optimized by maximizing the objective function J as follows: The first term corresponds to the maximization of the cosine similarity between the two spectra y and ŷ , which is calculated as cossim(y,ŷ) = y ·ŷ/(�y� · �ŷ�) , to ensure they have peaks at similar frequencies. The second term is used to minimize the squared Euclidean distance between the two normalized spectra y/ y 1 and ŷ/�ŷ� 1 to ensure they have similar overall shapes. The third and fourth terms indicate the preferences for µ i , which should be close to its initial value, and σ i , which should have a small value. The last term indicates the penalty if µ i+1 is not greater than µ i by a certain margin ǫ , thereby encouraging the peaks to split in the estimated spectrum ŷ . In this study, we employ the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm 17 for optimization.
Matching scoring procedure. We calculate the matching score between the query spectrum S = (x, y) and candidate molecule G using the score function. This involves the calculation of the predict, align, and optimize functions for the molecule-to-estimation procedure. The pseudocode of the score function is as follows: In the molecule-to-estimation procedure, the major bottleneck is the optimize function owing to its high computational cost. We do not perform the optimization if the candidate molecule G is expected to have a spectrum that is significantly different from the query spectrum S . We introduce two filtering criteria. The first criterion is to abstain if the largest difference from to ′ is greater than the allowance of θ , formulated as The second criterion is to abstain if the aligned chemical shifts ′ fail to cover all peaks in S with a tolerance of θ , formulated as max If either condition is met, the score function returns a matching score of −C: where C is a large constant. Setting θ to a smaller value speeds up the molecular search by filtering out more candidate molecules. However, if θ is set too small, there is a risk of filtering out the actual matching molecule.
If the molecule G passes both filtering criteria, the optimized chemical shifts * and the estimated spectrum ŷ are obtained via optimization. The matching score for the query spectrum S is then calculated as follows: where the hyperparameter α controls the strength of the penalty for the magnitude of the alignment from to * . The matching score increases if the cosine similarity between the query spectrum y and the estimated spectra ŷ is higher and the difference between the original predicted chemical shifts and optimized chemical shifts * is lower. The second term prevents the score from becoming spuriously high when the molecule G has many chemical shifts. The score can be negatively valued if * is significantly different from .

Results and discussion
Dataset. We investigated the effectiveness of the proposed method on the problem of molecular search by 13 C NMR spectrum. Given a 13 C NMR spectrum as the query spectrum, we searched for the best matching molecule from a pool of candidate molecules.
For the query spectra, we used 36 spectra from our in-house database, which were experimentally measured using 13 C NMR spectroscopy. Each spectrum was transformed into a sequence of intensity-frequency pairs with a frequency interval of 0.05ppm.
The candidate pool for the search was composed of 36 molecules that corresponded to the query spectra, another 30 molecules from the in-house database that were collected for the same purpose, and 5,000 molecules that were randomly sampled from the NMRShiftDB2 database 14 . The summary statistics of the candidate molecules used are listed in Table 1. We do not report the detailed information and query spectra of the molecules from the in-house database to comply with the confidentiality policy. Implementation. For the experimental investigation, we implemented the proposed method with the following configurations. For the predict function, we predicted the chemical shifts of the candidate molecules from the in-house database using the HOSE and MPNN, resulting in two different predictions per molecule, and then, we chose the prediction with the better matching score for each molecule. We used the annotated chemical shifts provided by the database itself for the candidate molecules sampled from the NMRShiftDB2 database. For the optimize function, we implemented the L-BFGS algorithm for optimization using the SciPy library 18 in Python.
The proposed method requires the following five hyperparameters to be predetermined: τ , θ, ǫ, h , and α . We suggest the following guidelines for determining their values. The value of τ should be manually chosen between the highest noise and the lowest peak intensity, depending on the NMR instrument used to measure the spectrum. Choosing a proper value for h facilitates faster convergence of optimization. Setting the value of ǫ to be greater than 0 and smaller than the frequency interval of the query spectrum is sufficient. A smaller/larger value of θ allows more/less candidate molecules to be filtered out before optimization. The value of α should be chosen considering the overall scale of the chemical shifts. We note that the hyperparameters h, ǫ , and θ do not significantly affect the molecular search performance but are related to the efficiency of molecular search. For the molecular search with the query spectra measured using 13 C NMR spectroscopy, we used the hyperparameter settings listed in Table 2.
Molecular search performance was evaluated in terms of the top-K accuracy. For each query spectrum, we determined whether the matching molecule was retrieved from the best K candidate molecules with the highest scores. We computed the measure with varying values of K as 1, 2, 3, 5, and 10. molecular search by NMR spectrum. Table 3 shows the molecular search performance for the 36 query spectra in terms of the top-K accuracy on various numbers of candidate molecules. The numbers 66 and 5,066 indicate that only the in-house database and entire molecules were respectively used to constitute the candidate   www.nature.com/scientificreports/ pool. The proposed method achieved a considerably high top-K accuracy, indicating that it succeeded in retrieving the matching molecules from the pool for most query spectra. When the molecular search was conducted using only the in-house database as the pool, the top-1 accuracy and top-5 accuracy were 94.44% and 100%, respectively. Molecular search performance gradually decreased with the inclusion of more candidate molecules taken from NMRShiftDB2, because some of them coincidentally provided higher matching scores for some query spectra. When all 5,066 candidate molecules were considered for the search, the top-1 accuracy and top-5 accuracy decreased to 83.33% and 97.22%, respectively. Figure 3 shows an example of searching from three candidate molecules given a query spectrum. We calculated the matching score of each candidate molecule with respect to the query spectrum. For candidate molecule A, its predicted chemical shifts required little alignment to match the peaks in the query spectrum. The estimated spectrum was similar to the query spectrum, and thus, its final matching score was considerably high. On the other hand, candidate molecules B and C yielded lower matching scores with respect to the query spectrum. For molecule B, the magnitude of alignment was large. For molecule C, the estimated spectrum was dissimilar to the query spectrum. Consequently, among the three molecules, we chose molecule A as the best matching molecule for the query spectrum.
We investigated the relationship between the matching score and the success of the molecular search for each query spectrum. Figure 4 plots the rank among all 5,066 candidate molecules against the matching score for the actual matching molecules of the 36 query spectra. As demonstrated, when the actual matching molecule of a query spectrum yielded a high matching score, its rank among the candidate molecules was close to 1. The molecules for some query spectra yielded smaller matching scores and were subsequently ranked lower. We found that the molecular search failures were primarily caused by inaccuracies in the chemical shift prediction. Accordingly, other candidate molecules that provided moderate matching scores could take a higher rank, thereby degrading molecular search performance. We believe that molecular search performance can be improved further by enhancing the accuracy of the chemical shift prediction method used in the molecule-tospectrum estimation procedure.

Conclusion
In this paper, we presented a method for automated molecular search by NMR spectrum. Given a query spectrum and a pool of candidate molecules, the proposed method calculated the matching score of each candidate molecule with respect to the query spectrum by performing a molecule-to-spectrum estimation procedure. The candidate molecule with the highest matching score was retrieved by the molecular search. We demonstrated the effectiveness of the proposed method in identifying the molecules corresponding to 13 C NMR spectra.
Compared with conventional approaches, the proposed method is advantageous in that it does not require any prior knowledge of the corresponding molecular structure nor laborious manual efforts by chemists to implement the molecular search. Nevertheless, incorporating prior knowledge, such as the number of NMR-active atoms and chemical formula, would be beneficial for filtering out most non-matching candidate molecules in advance, thereby further improving molecular search performance. The proposed method is versatile for any type of spectrum by adjusting the hyperparameter settings. We expect that the proposed method will prove effective in the automatic identification of molecular structures from spectra in many chemistry applications.