High accuracy machine learning identification of fentanyl-relevant molecular compound classification via constituent functional group analysis

Fentanyl is an anesthetic with a high bioavailability and is the leading cause of drug overdose death in the U.S. Fentanyl and its derivatives have a low lethal dose and street drugs which contain such compounds may lead to death of the user and simultaneously pose hazards for first responders. Rapid identification methods of both known and emerging opioid fentanyl substances is crucial. In this effort, machine learning (ML) is applied in a systematic manner to identify fentanyl-related functional groups in such compounds based on their observed spectral properties. In our study, accurate infrared (IR) spectra of common organic molecules which contain functional groups that are constituents of fentanyl is determined by investigating the structure–property relationship. The average accuracy rate of correctly identifying the functional groups of interest is 92.5% on our testing data. All the IR spectra of 632 organic molecules are from National Institute of Standards and Technology (NIST) database as the training set and are assessed. Results from this work will provide Artificial Intelligence (AI) based tools and algorithms increased confidence, which serves as a basis to detect fentanyl and its derivatives.

www.nature.com/scientificreports/ A comprehensive review of fentanyl and its analogs has been done in by Armenian et al. 10 . Carfentanil 11 , sufentanil 12 , alfentanil 13 , and remifentanil 14 are fentanyl analogs with more potency than fentanyl for medicinal use. These and other fentanyls all exhibit some common key functional groups.
One promising method of identifying fentanyl and its analogues is through the fingerprinting of their unique infrared (IR) spectral properties (reflectivity, absorptivity, or transmissivity) as this technique can reveal characteristics of the numerous intramolecular bonds 15 . However, identifying fentanyls in practice is difficult. Clandestine production often exhibits variations in products between batches and as previously stated there are numerous formulations of fentanyl and carfentanil. As can be seen in Figs. 1 and 2, fentanyl, carfentanil, and their analogues exhibit varying functional groups and simultaneously may have conformers for the same composition which are denoted by a * at a particular bond. These structural variations can result in significantly different infrared spectral properties. Despite these differences, there are structural similarities enabling the compound to bind to receptors within the body and perform a similar function.
In analytical chemistry, IR spectroscopy tables 16 had been widely used to identify compounds based on the correlations between IR absorption and common types of molecular bonds and functional groups. In this work, we aimed to answer the following question: can we identify constituent functional groups of fentanyl and its analogues from the IR absorption data?
Here we use a database containing 632 molecules, 591 of which have at least one of the functional groups found in the parent compound of fentanyl. The rescaled IR absorption spectra of these molecules are shown in Fig. 3. From these data we construct machine learning algorithms to identify molecules which are compromised from at least one of these functional groups.
A main challenge of learning from the spectral data is that the number of variables is the number of wavenumbers where absorptive data is obtained. The analysis suffers from a curse of dimensionality. There has been a growing body of literature in chemometrics that study the identification of compounds from their spectral properties via data-driven algorithms. In most of the work, spectral properties measured at each wavenumber is treated as a predictor, and widely-used classification or clustering machine learning methods has been applied for analysis directly or after dimension reduction with methods such as principal component analysis, see Refs. [17][18][19][20][21][22][23][24] among others. In this work, we treat the underlying IR spectra as smooth functions and apply functional principal component analysis by approximating the data with a basis of a few orthogonal smoothed eigenfunction and then perform classification via functional generalized linear models. Comparing to the high-dimensional multivariate classification, the functional analysis is more interpretable. It associates the functional groups with features in the IR spectra in terms of the functional components along the whole range such as peaks and troughs. Results  www.nature.com/scientificreports/ from this work will provide Artificial Intelligence (AI) based tools and algorithms increased confidence, which serves as a basis to detect fentanyl and its derivatives.

Methodology
Let α 1 , α 2 , . . . , α p be the wavenumbers at which absorbance are recorded, ν i (α) be the underlying absorbance at any wavenumber α for the ith molecule and ν i,j = ν i (α j ) + ǫ i,j , i = 1, 2, . . . , n, j = 1, 2, . . . , p be the observed absorbance. The goal of this study is to distinguish if a given molecule contains a certain functional group such as amide. Let y i = 1(molecule i contains the functional group), i = 1, 2, . . . , n be the binary observations. We establish a classification model with {ν i,j } 1≤i≤n,1≤j≤p as the predictor and {y i } 1≤i≤n as the response, via a functional logistic regression with functional principal component basis.

Dataset.
The data set for both training and testing was obtained from the public database of National Institute of Standards and Technology (https ://www.nist.gov/), which provides spectral information of the selected 632 compound molecules considered in this work., Such compounds can be categorized into one of the following eight groups: (i) amide only, (ii) aniline only, (iii) benzene only, (iv) piperidine only, (v) amide and aniline simultaneously, (vi) amide and benzene simultaneously (vii) distinct aniline and benzene simultaneously, and (viii) none of the above constituent functional groups. The group information of the data is listed in Table 1 with the names of the compounds listed in the Supplementary Appendix A. For each molecule, the discretization of the IR spectra is recorded with unaligned maximum wavenumbers ranges of 243 cm −1 to 4,000.7 cm −1 (7.28-119.94 THz), with the majority of the wavenumbers ranging from 500 to 4,000.7 cm −1 (14.99-119.94 THz). Preprocessing of the discrete raw data consists of rescaling each absorbance function such that it has unit standard deviation and converting it to regularly spaced measurement via interpolation at the common wavenumbers 500-4,000 cm −1 (14.99-119.92 THz) with step size 4 cm −1 (0.12 THz), www.nature.com/scientificreports/ yielding a total of p = 876 discrete points. For compounds with a slope in the IR spectra, we apply a linear baseline correction to remove the linear trend between the lowest point and the right end in the IR spectra.
No subsequent smoothing of the data is performed in preprocessing. It is assumed that spectra obtained from the NIST database are close enough to standard temperature and pressure to neglect broadening and thermal effects. The preprocessed dataset is decomposed into distinct training and testing sets, consisting of 506 and 126 molecules, respectively, as shown in Table 1.

Functional principal component analysis (functional PCA).
Denote ν(α) = n −1 ν i (α) as the mean as the covariance function of the sample. Functional PCA decomposes the underlying absorbance functional ν i (α) as.
where φ k (α), k = 1, 2, . . . , K are the functional principal components (PCs), and f ik is the score of molecule i for principal component k. Cutting off at a finite integer K and estimate ν i (α) according to The functional PCs are the eigenfunctions of the covariance function, σ (s, t) , where φ k (α) corresponds to the kth largest eigenvalue k . The corresponding PC scores are The number of PCs K can be selected such that the majority, e.g. 90%, of the variance is explained by (2).
The functional PCs are orthonormal, i.e., ∫ φ j (α)φ k (α)dα = 0 if j = k and 1 of j = k , and they explains the most proportion of variance in the data in descending order. In other words, the first PC score f i1 , i = 1, 2, . . . n maximizes the sample variance among the inner products between the data and all functionals ξ 's subject to the constraint ∫ ξ(α) 2 dα = 1 ; the second PC score f i2 , i = 1, 2, . . . n maximizes the remaining variance among the inner products of the data and all functionals ξ 's subject to the constraint ∫ ξ(α) 2 dα = 1 and ∫ ξ(α)φ 1 (α)dα = 0 , and so on. The proportion of variance that is explained by f ik , 1 ≤ i ≤ n is k / ∞ k=1 k . In the estimation, we regularize the functional PCs and impose roughness penalty to the sample covariance function, measured by the norm of second-order derivative of the function. Details of functional PCA can be found in Ref. 25 . β k φ k (α) . According to (1), the above linear model can be written as where the unknown coefficients β 0 , β 1 , . . . , β K can be estimated by the maximum likelihood estimation (MLE) for the generalized linear model 26 . The above logistic regression can be extended to multiclass cases. Details are omitted and interested readers are referred to Ref. 27 .
For a new molecule with IR spectrum ν ⋆ (α) , the probability p ⋆ can be estimated by where β k is the MLE of β k , and f ⋆ k are the PC scores of the new molecule, k = 1, 2, . . . , K . If p ⋆ exceeds a given threshold such as 0.5, the molecule is classified as having the functional group.
Comparing to the high-dimensional multivariate classification with p = 876 predictors, the functional GLM represents the absorbance by a small number of orthogonal basis functions enabling statistical and computational efficiency. More importantly, the GLM associates the functional groups with features in the IR spectra in terms of the functional components along the whole range, while the multivariate classification treats absorbance at each wavenumber as a predictor and trains the model by assuming conditions such as sparsity. www.nature.com/scientificreports/ Classification performance metrics. In evaluating the performance of classification results we obtain the confusion matrix as shown in Table 2 by comparing the predicted labels against the ground truth from the test set. Let The Receiver Operating Characteristic Curve (ROC) is a curve plotting 1-FNR (a.k.a., sensitivity) against the FPR (a.k.a., 1-specificity). For each classification model we report the Area Under the Curve (AUC) and use AUC as an evaluation metric of the algorithm performance.

Results and discussion
In this study, we train functional GLM models and evaluate its capability to fingerprint the functional groups including amide, aniline, benzene and piperidine in the IR spectrum of a molecule. Parameters for the GLM are estimated from the IR spectral data and functional groups information of the training set to train the model. The constructed model is then used to predict the functional groups from only IR spectra data of a testing dataset. We evaluate the accuracy of the models by comparing the prediction with the ground truth in the testing set. The functional data analysis is implemented using the R package fda.usc 28 and the ROC curve and AUC are obtained from R package pROC 29 .
We investigate two scenarios. Scenario one considers potential interactions which may arise from the simultaneous existence of multiple functional groups reflected by distinct IR spectral patterns. We classify the testing molecules into 8 nonoverlapping groups. Whereas in scenario two, the model is trained to predict if a molecule contains the constituent functional groups, where compounds with simultaneous presence of multiple functional groups is also counted.
Feature representation: functional PCA. We first construct the regularized functional PCs basis from the training set. The mean function and the leading PCs are presented in Fig. 4.
From Fig. 4, one can see the following. First, as shown in Fig, 3, the IR spectral data in each group is very diverse and has various patterns. Accordingly, in panel (c) of Fig. 4, the proportion of variance explained by the functional PCs increases slowly as more PCs are considered, where the first functional PC explains nearly 20% of the overall variance, with the second describing 10%. Over 80% of the total variance may be considered with the first 22 PCs. In addition, in panel (d) to (f) we present the leading 4 PC scores of compounds with no more than one functional groups of interest. The first four PC scores alone does not partition the four groups; however, we can observe a general trend in which, compounds with amide tend to have small scores of PC1, PC2 and PC4, and high scores in PC 3; anilines tend to have high scores in PC2, while piperidine containing compounds are clustered within a narrow range of PC2 and PC4 with high PC1 scores.
Prediction for appearance of functional groups from IR spectra. Four functional GLMs are constructed to predict the following responses from the IR spectra data and are assigned as a specific model respectively.
Model 1 whether the molecule has an amide functional group. Model 2 whether the molecule has an aniline functional group. Model 3 whether the molecule has a benzene functional group. Model 4 whether the molecule has a piperidine functional group.
In constructing the functional GLM models, we take K = 22 , which is the minimum number of PCs such that at least 80% of variance is explained. In Model 4, because the response is imbalanced with only 10 piperidine out of 478 training molecules, we perform an oversampling adjustment before estimating the model. That is, the 10 observations in the training set are resampled and reused such that there are 100 molecules in the training set are piperidine. The ROC curves for Model 1 to Model 4 are presented in the area under the ROC curve on Fig. 5 are shown in the Table 5.
We present the confusion matrices from the testing set of Models 1-4 (AUCs presented in Table 3) in Tables 4,  5, 6, 7 respectively, with a threshold probability of 0.5. The rates of correctly identifying the corresponding functional groups are 96.03%, 90.48%, 87.3%, and 96.03% respectively, with an average accuracy rate of 92.5%. www.nature.com/scientificreports/  Table 3. AUC of models 1-4.    www.nature.com/scientificreports/ Joint classification into non-overlapping groups.. We combine the four binary classification results described in Tables 4, 5, 6, 7 and group the data as shown in Table 1. The confusion matrix obtained from prediction in the 126 testing molecules is summarized in Table 8. From Table 8 one can find that, out of the 126 predicted molecules, 100 are classified exactly correctly, with an accuracy rate of 79.37%. Among the misclassified 26 molecules, 12 are classified with partial mistakes by incorrectly including or excluding a functional group while correctly identify another, 4 are misclassified by missing the correct functional group, 3 is misclassified by identifying a wrong functional group and 7 are misclassified by both missing the truth and identify a false. The misclassified molecules are listed in Table 9. Table 9 shows a relatively high error rate in distinguishing simultaneously appeared functional groups from the cases where only one of the functional groups of interest is present. The algorithm also has a relatively high error rate in distinguishing the aniline group and benzene, due to fact that these functional groups share common structure.

Summary
Today, synthetic opioid analogues such as, fentanyl, have been the cause of many accidental deaths across the world, and thus the detection of low concentrations of these harmful substances at a distance via spectroscopic techniques is crucial for law enforcement. Unfortunately, new fentanyl related compounds are being created and distributed frequently with slight modifications to the functional groups present in them. Therefore, machine learning based intelligent detection schemes must be employed for intelligent detection of such molecules. Here, we applied a functional generalized linear model with smoothed functional principal component basis to classify functional groups of molecules from their IR absorption data. The result serves as a basis of the identification of fentanyl and its derivatives, which could be accomplished in future. This effort demonstrated the efficacy of a functional data analysis to identify molecules containing one or more specific functional groups from their infrared absorption spectra. The accuracy rate of classification into the 9 distinct classes is 79.4%. The average rate of accurately identifying the four functional groups is 92.5%. Continued efforts on this model will seek to also expand the utility from gas phase compound analysis to solid phase, with spectral properties of fentanyl and its analogues included. Such expansions will require additional consideration.    Table 9. Misclassified compounds and the classification errors.