Comparative study between deep learning and QSAR classifications for TNBC inhibitors and novel GPCR agonist discovery

Machine learning is a well-known approach for virtual screening. Recently, deep learning, a machine learning algorithm in artificial neural networks, has been applied to the advancement of precision medicine and drug discovery. In this study, we performed comparative studies between deep neural networks (DNN) and other ligand-based virtual screening (LBVS) methods to demonstrate that DNN and random forest (RF) were superior in hit prediction efficiency. By using DNN, several triple-negative breast cancer (TNBC) inhibitors were identified as potent hits from a screening of an in-house database of 165,000 compounds. In broadening the application of this method, we harnessed the predictive properties of trained model in the discovery of G protein-coupled receptor (GPCR) agonist, by which computational structure-based design of molecules could be greatly hindered by lack of structural information. Notably, a potent (~ 500 nM) mu-opioid receptor (MOR) agonist was identified as a hit from a small-size training set of 63 compounds. Our results show that DNN could be an efficient module in hit prediction and provide experimental evidence that machine learning could identify potent hits in silico from a limited training set.

Implementation of "big data" with deep learning has created a paradigm shift in many scientific disciplines [1][2][3] . From the perspective of medicinal chemistry, predicting particular functions or properties, e.g., absorption, distribution, metabolism, and excretion (ADME), of a molecular entity might greatly increase the quality of hit compounds and quicken the drug-discovery process. The use of artificial intelligence (AI) in drug design to generate a prediction model, conduct virtual screening, and predict compounds' activities has received much attention recently [4][5][6][7] . Traditionally, quantitative structure-activity relationship (QSAR) model was utilized by medicinal chemists and statisticians to associate bioactivities to particular functional group manipulations. In particular, a linear equation was generated to correlate the features and bioactivities for each compound, while different descriptors were employed to calculate the physical properties to merge with the 3D-structrual information and generate 2D or 3D-QSAR models [8][9][10] . Nowadays the development of QSAR have apply to multi-target and multi-objective QSAR approaches to assist drug design [11][12][13] . These QSAR approaches are able to integrate multiple diverse chemical and biological data, being therefore capable of jointly making predictions ranging from in vitro and in vivo activities to ADMET properties 14 . Nonetheless, these QSAR models were hard to generate from random and diverse databases. In addition, to properly separate the training set and the test set was time consuming. To provide an alternative strategy, as reported by Zhavoronkov et al., they have successfully used the deep learning method in the designs of more potent compounds 15 . The incorporation of machine learning method for the progressive analysis of the active compounds and concurrent generation of the prediction model should address such limitations.
Lavecchia et al. 16 summarized applications of machine learning algorithms, such as support vector machine (SVM) 17 for ADME evaluation and decision tree (DT) in the classification of compounds 18 . Moreover, a Naïve Bayesian classifier is frequently used in chemoinformatics for predicting biological properties, while k-Nearest neighbors (K-NN) is a simple and rough method to predict and rank the molecule 19,20 . Others like the artificial neural networks (ANNs), is the popular technique for compound classification, QSAR studies, and primary virtual screening (VS) of compounds 21 . All these machine learning algorithms were programmed to pick out and reclassify important features of the molecules as instructed, the limitations of these algorithms stemmed from the intrinsic inability to "self-taught" and prioritize the features in relation to the activities. Improper combining of the compounds' descriptors could increase the noise level in features learning that could result in swamping the classifier model and generate a misleading prediction 22 .
Herein, we employed deep learning algorithm to analyze the compound features, generate a first-hand model through 613 descriptors for training, and validated its findings through experimental confirmation. In addition, we compared its accuracy and efficiency with three other different virtual screening methods. After in silico screening of our in-house database of 165,000 compounds, by which different hit compounds were identified from 15,23-25 , 100 top-ranked newly identified TNBC inhibitors were subjected to the bioassay to cross-examine the model accuracy. Moreover, to extend the scope of this deep learning model in predicting meaningful hits, another case study for the search of novel G protein-coupled receptor (GPCR) agonist was carried out. By using a similar model, we only trained the model with a collection of 63 mu-opioid receptor (MOR) agonists to learn the importance of compound features for the given bioactivities. We then identified the nanomolar MOR agonist from the in-house compounds library. Our study suggested that deep learning method could generate potent hit compounds in different disease areas for the drug discovery process.

Results and discussion
Model generation and comparative studies in efficiency. An advancement in the virtual screening method was made to reduce the burden of the drug discovery/development processes in a cost-effective manner 26 . The virtual screening can be devised by using either structure-based virtual screening (SBVS) like docking screening methods 27 or LBVS like QSAR model screening 28 . To investigate the application and efficiency of the DNN approach in medicinal chemistry, we compared other contemporary QSAR method, such as RF approach 29 , with traditional QSAR methods, such as PLS and MLR. RF has been demonstrated to have high prediction accuracy and robustness with adjustable parameters. It has become a "gold standard" machine learning method. Meanwhile, partial least squares (PLS) and multiple linear regression (MLR) are methods used for large data manipulation and allow facile generation of the model unlike other 3D-QSAR methods. In the current study, the same data set and descriptors were systematically incorporated to generate the models. The traditional QSAR model helps to identify the relationship between activities and the descriptors' variables. In addition to the QSAR methods, RF and DNN from the machine learning approach were used to generate the prediction model. RF is an ensemble learning method to perform classification in a similar manner to that of the decision tree (DT). Yet, the major difference stems from the use of Bagging method (or Bootstrap Aggregating) to generate many individual trees 30 . Each tree could self-process samples from the training set data and provide a fixed number of random sampling data from the training set to generate a DT for voting. The final model was based on the highest score from individually developed trees in the forest. On the other hand, DNN are mathematical methods developed to mimic the neurons (nodes) of the human brain to recognize objects and analyze progressively, improving the efficiency of previously reported neural network algorithms 1,31 . Each neuron is treated as a particular feature to classify the complex factors. The system, in turn, learns from the training set and assigns different weights for each neuron as this model eventually facilitates a prediction following the different clusters. Taken together, DNN increase the hidden layer numbers by allowing each layer of the nodes to access different features based on the previous layer's output. Consequently, as more executed nodes are added in each layer, more features are recognized, enhancing the overall decision process.
To compare the different methods of virtual screening, a database of 7130 molecules with previously reported MDA-MB-231 inhibitory activities were collected from the ChEMBL web service. As the model prediction accuracy is highly depended on the quality of the database. In this study, these compounds were then randomly www.nature.com/scientificreports/ separated into 6069 compounds (the training set) and 1061 compounds (the test set) to evaluate which model can more efficiently analyze the database and generate more useful models (Fig. 1). We implemented the extended connectivity fingerprints (ECFPs), which are circular topological depictions of the molecules, as the major molecular descriptors. Specifically, ECFPs are generated in a molecule-directed manner by systematically recording the neighborhood of each non-hydrogen atom into multiple circular layers up to a given diameter of that molecule 32 . These atom-centered sub-structural features are then mapped into integer codes and the resulting identifiers shape the extended connectivity fingerprint. These identifiers capture the local information of the corresponding atom in such a way that various atom properties (e.g., atomic number, connection counts) are packed into a single integer value. The default identifier configuration of ECFP captures highly specific atomic information, enabling the representation of a large set of precisely defined structural features. In some applications, however, different kinds of abstraction may be desirable. For example, a chlorine or a bromine substituent on a ring may be functionally equivalent but would be redundantly distinguished by ECFP. Alternatively, functional-class fingerprints (FCFPs) 32 detail circular fingerprints via the pharmacophore identification of atoms, which reports topological pharmacophore fingerprints. To perform the classifications comparisons, the software devised a total of 613 descriptors from AlogP_count 33 , ECFP, and FCFP to generate the model (Fig. 1, and supplementary data Table S1).
Three distinct different numbers of training set (6069, 3035, and 303 compounds) were used to generate the models and their efficiencies were evaluated by the fixed test set (1061 compounds). R-square value (r 2 value) was used to quantify the differential efficiencies between the training set and test set prediction in machine learning methods (DNN and RF) and the QSAR methods (PLS and MLR) (Fig. 1). With training set compounds fixed at 6069, the machine learning methods (DNN or RF) exhibited higher predicted r 2 value near 90% than the traditional QSAR method (PLS or MLR) at 65%. In general, a good model was considered as having larger r 2 and R 2 pred (r 2 > 80, R 2 pred > 60 is an assessable model) [34][35][36] . With the decrease of training set numbers, the machine learning methods sustained the overall higher r 2 value. As the training set number decreases, the deviation only retained with DNN and RF at 0.84 to 0.94, while PLS and MLR dropped to 0.24 from 0.69. In particular, with significantly lower training set numbers, interestingly, the MLR method maintained a respectful r 2 value near 0.93, but when running against the test set, R 2 pred R 2 pred was calculated to be zero. This implies that MLR could be an over-fitting model with a high false-positive rate, especially when the numbers of learning compounds are very limited. These results showed that the PLS and MLS methods could not efficiently distinguish the descriptors and were problematic in generating meaningful fitting equations. On the other hand, the DNN method with lower number of training sets, the data still held a higher r 2 value of 0.94 than that of 0.84 by RF method (Fig. 1). Although the RF method could classify the features and select intrinsic feature for the analysis, DNN method was better in providing insights in weighting of important features. As a result, the DNN method held a higher r 2 value with lower numbers of training data sets. Of the machine learning methods, the R 2 pred R 2 pred significantly improved with the increase in training set numbers, which is vastly different than the QSAR models www.nature.com/scientificreports/ ( Fig. 1). With routine sampling of large amount of molecular features against a target from the public domain might be limiting, the large spread or deviation of PLS and MLR processes could greatly hinder the potential of identifying potent hits. Taken together, DNN and RF exhibited better accuracy and efficiency in the prediction of hit compounds. As shown in Fig. 1, the R 2 pred of DNN (0.26) and RF (0.24) were much lower, which implies that the database quality might not be sufficient for learning. We envision that more datasets might be needed or the quality of the datasets in terms of structural information and their activities should be more correlated for better learning by the algorithm.
Seminal work by Grisoni and coworkers 37,38 Table S2). In addition, Consonni et al. showed the calculation of Root-Mean-Square Error in prediction (RMSEP) and Root-Mean-Square Error in calculation (RMSEC) could quantify predictive abilities of QSAR model. The higher value of RMSEP led to higher chances of error. Our calculation results also showed that DNN method had the lowest value for RMSEC and RMSEP in comparison to those of other models (Supplementary data Table S2).
To further investigate the advantageous prediction ability of machine learning methods (DNN and RF) over the traditional QSAR methods (PLS and MLR), we analyzed the receiver operating characteristic (ROC) curve with the fix training set (6069 compounds) and fix test set (1061 compounds) 39,40 . ROC curve evaluates the performance of a binary classifier system and provides means in selecting optimal models. ROC curve was constructed by plotting a graph of sensitivity (Se, true positive rate) vs. 1-specificity (1-Sp, false positive rate). The measure of Se and Sp are defined as where TP is the number of correctly identified active ligands (true positives), TN is the number of correctly identified inactive ligands (true negatives), FP the number of incorrectly identified active ligands (false positives), and FN the number of incorrectly identified inactive ligands (false negatives). The area under the ROC curve (AUC) measures the performance of each virtual screening approaches. The ideal screening method results in an AUC value of 1, while a random screening method would lead to an AUC value of 0.5. As shown in Fig. 2A, the AUC calculated by the training set of the RF and DNN methods were 0.991 and 0.992, respectively. Interestingly, these values were higher than those of PLS and MLR methods with 0.907 and 0.922. To investigate the prediction ability of the test set, the respective AUC values of RF and DNN methods were 0.922 and 0.924. Also, they were expected to be superior than those of PLS and MLR methods with 0.870 and 0.865. These ROC curve analyses further potentiated the RF and DNN screening method might be more suitable than traditional QSAR methods (PLS and MLR).
Virtual screening and identification of TNBC inhibitors by DNN and RF models with experimental validation. Based on the above information, the DNN and RF models were chosen as the preferred means to perform virtual screening. The identified compounds were then assayed for their corresponding bioactivities. Herein, we demonstrated two different cases for evaluating these models' accuracy. First, we successfully identified active hits for TNBC inhibition. The DNN and RF models were used to screen the in-house database (165,000 compounds), and the selected hits were assayed against the anti-TNBC cellular assay (Fig. 3A). The top predicted 100 compounds were selected and tested at 10 μM concentration for MDA-MB-231 cell line inhibition (Supplementary data Table S3.1 and Table S3.2). Since the compound collection was acquired based on MDA-MB-231 inhibitory activities, other TNBC cell lines were also assayed to obtain selective TNBC inhibitors. Out of the multiple hits identified through both methods, six compounds from each classification (compounds 1-12) were assayed and showed low cytotoxicity to MCF10A, a nonmalignant mammary epithelial cell line (Fig. 3B,C). We then assayed these hits against other TNBC cell lines, BT-549 and MDA-MB-453 . Compounds 3, 7, 8, 10, which exhibited broader TNBC inhibitions, were then subjected to IC 50 determination (Supplementary data Figure S1). Notably, between RF and DNN, we obtained a thiazole core with selective anti-TNBC profiles over  www.nature.com/scientificreports/ the normal mammary cells (Fig. 3B,C). Synthesis of the thiazole-based inhibitors were carried out and several potent TNBC inhibitors were identified (Fig. 4). Compound 18, which showed good selectivity over nonmalignant mammary epithelial cell, had an IC 50 of 0.62 µM against MDA-MB-231. Interestingly, regioisomeric controls in compounds 22 and 23 were synthesized. Compound 22 did not show activities toward the TNBC and 23, although it possessed moderate micro molar activities and also exhibited cytotoxicity to MCF-10A. This study serves as a good example of hit generation from an unknown target with good cellular selectivity and functional manipulatable core.

Analysis between model-identified compounds and database compounds.
To address the availability of thiazole core in the original set of 7130 compounds, we devised a principle component analysis of the database with PIC50, AlogP, and polar surface versus total surface area (Fig. 5). These properties were chosen to fulfill the characteristics of a hit compound in a common drug-discovery campaign. The 7130 compounds were then mapped and showed that compounds consisting of the thiazole core are clustered in the quadrant with activities ranging from PIC50 4.8-6.5 (10 µM to 0.3 µM). Moreover, the AlogP and polar surface versus total surface area values were in the satisfactory range for a hit compound (Fig. 5). Gratifyingly, this finding correlates well to the experimental results from our SAR studies of the TNBC inhibitors (Fig. 4). Our findings suggest that both RF and DNN can be adapted to generate meaningful models and identify functional hits for the later optimization process.
Identification and experimental validation of novel GPCR agonists by the DNN and RF models. We envision that predicting new scaffold with the experimental validation should render the greatly www.nature.com/scientificreports/ expand the application of this deep learning approach. We then adapted this classification for GPCR agonist generation, where structure-based designs are limited without a known information of the core structure due to the membrane associated nature of many GPCRs (Fig. 6). To evaluate the scope of the model, the MOR agonist was also identified via virtual screening of the same in-house database with the DNN and RF models. In our previous studies on MOR agonist 41 , we synthesized 63 compounds and tested by FLIPR calcium assay (Supplementary data Table S4 and Figure S2). We used MOR as an example to demonstrate the predictive power of this approach. To train the learning system, we provided a small sample collection of 63 compounds 41 . The total 63 compounds, divided into series A-E clusters, were used as the training set to generate the DNN and RF models (Fig. 6A). We envision that incorporation of molecular diversity with large spread of bioactivities in series A-E should minimize deviation of the r 2 with DNN and RF and improve the learning process. Model generation was performed with the same 613 descriptors, and then new cores in the 165,000 in-house pool were processed. The top 40 compounds predicted by RF and another top 40 by DNN (Supplementary data Table S5.1 and Table S5.2) were subjected to the FLIPR calcium assay (Fig. 6B). The CHO-K1 cell line, stably expressing MOR and Gα15 (GenScript), was used to evaluate the selected compounds. In the FLIPR calcium assay of CHO-K1/MOR/Gα15 cells, activation of MOR elicits an intracellular calcium release, leading to an increase in the relative fluorescence units (RFU). Five compounds, 24-28, were identified as potential hits by these two different screening models. As shown in Fig. 6B, in addition to hit 26 identified from DNN method exhibited potent agonist activities (EC 50 = 560 nM), these models provided great molecular diversities over the training set of compounds. To the best of our knowledge, this is the first example correlating prediction and validation of a GPCR agonist discovery where structure-based design is limited. Notably, only a small training set of 63 compounds (Supplementary  data Table S4) was employed, and a set of five structurally distinct hits was identified. This result provided strong support in that DNN and RF methods could still sustained high predicted r 2 value in low numbers of training data set.  Fig. 6, the compounds 24-28 has no structural similarity to morphine or any other previously described opioid receptor agonist. In the receptor binding assay, membrane proteins from HEK-MOP were used for detecting the binding affinity of these compounds by comparing with the morphine (Table 1).  www.nature.com/scientificreports/ conclusion Hit identification is an important step in the early stages of drug discovery. Virtual screening is extensively used to identify suitable hits, and such methods to improve the hit rate are much sought after. In this study, we report comparative studies between traditional QSAR methods and machine learning methods applied in VS. The results showed that machine learning methods could achieve a higher predicted r 2 value with fewer compounds required in the training set. In our work, DNN and RF predicted the selective TNBC inhibitors from the our in-house database. In case of identifying novel MOR agonist, 5 hit compounds were readily found from only a 63-compound training set. The diversified chemical structures of the 5 hits identified by the DNN method showed good potency as a hit with an EC 50 = 560 nM. This is an interesting application of the deep learning classification as structure-based design of GPCR agonist are limited with limited information of the core structure due to the membrane associated nature of many GPCR. Taken 44 . The PLS regression is using the orthogonal matrices (T) to determine the fundamental relations between dependent variable Y and independent variables X. For example, Y = X × W × Q + E, T = X × W, where Y is a response matrix for the dependent variables like bioactivities result, T is a extraction matrix for the independent variables like descriptors, Q is a matrix of the regression coefficients, W are the factor score matrix and the weight matrix, and E is an error term for the model 45,46 .The PLS and MLR models were also conduct by pilot V18.1 platform with the default protocol and evaluate by fivefold cross-validated method.