Incorporating chemical sub-structures and protein evolutionary information for inferring drug-target interactions

Accumulating evidence has shown that drug-target interactions (DTIs) play a crucial role in genomic drug discovery. Although experimental techniques have made great progress, identifying DTIs in the laboratory remains time-consuming and expensive. Hence there is an urgent need for in silico models that complement biological experiments by predicting potential DTIs. In this work, a new model is designed to predict DTIs by incorporating chemical sub-structures and protein evolutionary information. Specifically, we first use the Position-Specific Scoring Matrix (PSSM) to convert each protein sequence into a numerical descriptor containing biological evolutionary information, then use the Discrete Cosine Transform (DCT) algorithm to extract hidden features and integrate them with the chemical sub-structure descriptor, and finally use the Rotation Forest (RF) classifier to predict whether a drug and a target protein interact. In 5-fold cross-validation (CV) experiments, the average accuracy of the proposed model on the benchmark datasets of Enzymes, Ion Channels, GPCRs and Nuclear Receptors reached 0.9140, 0.8919, 0.8724 and 0.8111, respectively. To fully evaluate its performance, we compare the proposed model with a different feature extraction model, a different classifier model, and other state-of-the-art models. Furthermore, we carried out case studies, in which 8 of the top 10 drug-target pairs with the highest prediction scores were confirmed by related databases. These results indicate that the proposed model performs well in predicting DTIs and can provide reliable candidates for biological experiments.

cluster membership for each vertex in the network [20]. Xia et al. designed a semi-supervised model called NetLapRLS, which combines the information of the known drug-protein interaction network with genomic sequence data and chemical structure. In this model, the final result is predicted by a combination of classifiers, and the method achieves good performance because it exploits integrated information and unlabeled data [21]. He et al. proposed an effective model, CCPMVFGC, to calculate the degree of contextual correlation between pairwise vertex features. This model can learn a shared latent space from multi-view features and use it to construct the interrelationship between pairs of vertices [22]. Hu et al. designed a novel method, GraphSE, to learn patterns among drug side-effects (SEs), among drug sub-structures, and between multiple drug sub-structures and the SEs. This method constructs an attribute graph for each SE, which can effectively predict whether a drug will lead to a certain SE [23]. Chen et al. classified current DTI prediction models into categories such as network-based methods and machine learning-based methods. In particular, they analyzed how supervised and semi-supervised machine learning-based methods handle negative samples [24]. Cao et al. proposed a new model for predicting DTIs which combines protein information encoded by physicochemical and biochemical properties with drug molecular structure information encoded by MACCS substructure fingerprints [25]. Chen et al. proposed the NRWRH model, which identifies DTIs within a Random Walk framework based on the assumption that similar drugs often target similar proteins [26].
Based on the theory that the interaction between a drug and a target protein depends largely on the chemical sub-structures of the drug compound and the structure of the target protein sequence [11, 27-29], we design a new in silico model to predict DTIs. Compared with existing methods, we introduce a protein sequence transformation method that carries biological evolutionary information. In this method, the frequency of amino acid occurrences at different positions in multiple sequence alignments is counted, and the conserved regions related to sequence evolution are found according to their probability distribution. Thus, similar parts between different sequences are found and used to infer their structural and functional similarities. The descriptors extracted by this method reflect not only the position information of amino acids in the sequence, but also the effects of mutations at amino acid sites during sequence evolution. Specifically, we first transform the protein sequence into a numerical matrix that carries biological evolutionary information. Second, we use the Discrete Cosine Transform algorithm to extract features from this matrix and combine them with the corresponding chemical sub-structures to form the feature vector. Finally, the Rotation Forest classifier is used to predict potential DTIs. We evaluate our model on the Enzymes, Ion Channels, GPCRs and Nuclear Receptors datasets using 5-fold CV. Moreover, we compare the proposed model with different feature extraction and classifier models on the benchmark datasets. In the case study, 8 of the top 10 drug-target pairs with the highest predicted scores were confirmed by the SuperTarget database. These results show that the proposed model can effectively predict the relationship between drugs and targets and can provide reliable candidates for biological experiments. The workflow of the proposed model is shown in Fig. 1.

Materials and methods
Benchmark datasets. In this work, the data for all DTIs were collected from DrugBank, SuperTarget, BRENDA, and KEGG BRITE by Yamanishi et al. [19]. These data are divided into four datasets: Enzymes, Ion Channels, GPCRs and Nuclear Receptors. The Enzymes dataset contains 445 drugs, 664 target proteins, and 2926 experimentally verified DTI pairs; the Ion Channels dataset contains 210 drugs, 204 target proteins, and 1476 experimentally verified DTI pairs; the GPCRs dataset contains 223 drugs, 95 target proteins, and 635 experimentally verified DTI pairs; and the Nuclear Receptors dataset contains 54 drugs, 26 target proteins, and 90 experimentally verified DTI pairs. We take these known drug-target interactions as benchmark data and carry out our experiments on this basis. The statistics of the drug-target interaction data are shown in Table 1.
If the drug molecules and target proteins are regarded as nodes and the interactions between them as edges, we can build a network representing DTIs. After connecting the nodes of the known interacting drug-target pairs, it can be seen that this network is sparse. In the experiments, all pairs with known drug-target interactions are considered positive samples, and all other pairs are considered negative samples. Taking the GPCRs dataset as an example, there are only 635 known DTIs among the 223 × 95 = 21185 possible drug-target pairs. The number of negative pairs (21185 - 635 = 20550) is therefore far larger than the number of positive ones and accounts for about 97% of the sample space. For this reason, we use a down-sampling algorithm to draw samples from the non-interacting drug-target pairs to construct the negative sample set, whose size is the same as that of the positive sample set. Theoretically, these negative samples may contain truly interacting pairs that have not yet been verified by experiments. However, from a probabilistic perspective, with such an imbalanced ratio the number of actual interaction pairs drawn as negative samples is small enough to be ignored.
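The down-sampling step described above can be illustrated with a minimal Python sketch. The function name `sample_negative_pairs`, the integer indexing of drugs and targets, and the fixed random seed are assumptions made for illustration; the text does not specify how the sampling was implemented.

```python
import random

def sample_negative_pairs(n_drugs, n_targets, positive_pairs, seed=0):
    """Draw as many non-interacting (drug, target) pairs as there are positives."""
    positives = set(positive_pairs)
    # Every pair without a known interaction is a candidate negative sample.
    candidates = [
        (d, t)
        for d in range(n_drugs)
        for t in range(n_targets)
        if (d, t) not in positives
    ]
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    # Sampling without replacement keeps the negative set free of duplicates.
    return rng.sample(candidates, len(positives))

# Toy example with made-up indices: 4 drugs, 3 targets, 3 known interactions.
known = [(0, 0), (1, 2), (3, 1)]
negatives = sample_negative_pairs(4, 3, known)
```

For the GPCRs dataset the same call would draw 635 pairs out of the 20550 candidates, giving a balanced set of 1270 samples.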
Drug molecular characterization. Studies show that molecular fingerprints based on chemical sub-structure information can effectively characterize drug molecules [30-32]. Therefore, molecular fingerprints are used to encode drug compounds in this paper. Specifically, this method encodes each molecular substructure as a fingerprint bit and maps the molecule to a corresponding Boolean vector. For a specific molecule, if it contains a given substructure, the corresponding bit of the vector is set to 1; otherwise it is set to 0. Although this method divides the molecule into individual fragments, it still retains the overall structural information of the drug molecule. The advantage of this design is that it does not require a reasonable 3D conformation of the molecule, so it does not accumulate errors from the description of molecular structure. In the experiments, we adopt the chemical structure fingerprint set derived from the PubChem system. This fingerprint covers 881 molecular substructures, so the drug molecular descriptor used in this paper is an 881-dimensional vector.
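As a small illustration of this encoding, the sketch below builds the 881-dimensional Boolean vector from a list of substructure indices. It assumes that substructure matching has already been done elsewhere (for example by reading a precomputed PubChem fingerprint); the function name `fingerprint_vector` and the example indices are hypothetical.

```python
N_SUBSTRUCTURES = 881  # number of substructures in the fingerprint, as stated in the text

def fingerprint_vector(present_indices, n_bits=N_SUBSTRUCTURES):
    """Return a 0/1 list with a 1 at every substructure index found in the molecule."""
    vector = [0] * n_bits
    for idx in present_indices:
        if not 0 <= idx < n_bits:
            raise ValueError(f"substructure index {idx} is outside 0..{n_bits - 1}")
        vector[idx] = 1
    return vector

# Hypothetical molecule that contains substructures 0, 9 and 880.
example = fingerprint_vector([0, 9, 880])
```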
Numerical characterization of protein sequences. Protein sequences are usually stored as strings over an alphabet of 20 letters, each representing one of the 20 amino acids. To facilitate processing by machine learning algorithms, we use the Position-Specific Scoring Matrix (PSSM) to transform each sequence into a numerical matrix [33, 34]. The advantage of this strategy is that it extracts the biological evolutionary information carried in the protein sequence, which is conducive to deep mining.
The PSSM of a protein sequence is a matrix of size L × 20, where L is the length of the protein sequence and 20 is the number of amino acid types. The PSSM can therefore be expressed as:
$$\mathrm{PSSM} = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,20} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ p_{L,1} & p_{L,2} & \cdots & p_{L,20} \end{bmatrix}$$
here $p_{i,j}$ denotes the probability that the ith residue is mutated into amino acid type j during the evolution of the protein, as derived from multiple sequence alignments.
In this work, Position-Specific Iterated BLAST (PSI-BLAST), one of the most effective and frequently used tools, was used to generate the PSSM. To obtain broad and highly homologous sequences, its e-value is set to 0.001 and the number of iterations is set to 3. Since all entries in the SwissProt database have been strictly reviewed by experts, we use it as the search database for generating the PSSM in this work.
Feature extraction. Feature extraction is one of the important steps in model construction. Effective feature descriptors not only extract important information, but also improve the performance of the predictive model in predicting DTIs [35]. In this work, the Discrete Cosine Transform (DCT) is introduced to extract features representing the protein sequence from the PSSM. Because it minimizes reconstruction error and packs most of the information into a small number of coefficients, the DCT loses very little information during processing. For an M × N input matrix, here the PSSM with M = L and N = 20, the two-dimensional DCT is defined as follows:
$$\mathrm{DCT}(i,j) = \alpha_i \alpha_j \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} \mathrm{PSSM}(m,n) \cos\frac{\pi(2m+1)i}{2M} \cos\frac{\pi(2n+1)j}{2N}$$
where $0 \le i \le M-1$, $0 \le j \le N-1$, and
$$\alpha_i = \begin{cases} \sqrt{1/M}, & i = 0 \\ \sqrt{2/M}, & 1 \le i \le M-1 \end{cases} \qquad \alpha_j = \begin{cases} \sqrt{1/N}, & j = 0 \\ \sqrt{2/N}, & 1 \le j \le N-1 \end{cases}$$
After optimization, we selected the first 400 coefficients as the final feature descriptor representing the protein sequence.
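The DCT-based feature extraction described above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the experiments. In particular, the text does not state how the "first 400 coefficients" are ordered; the sketch assumes they are the top-left 20 × 20 block of low-frequency coefficients, flattened row by row, which requires a sequence length of at least 20. The function names are hypothetical.

```python
import math

def dct_1d(values):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            v * math.cos(math.pi * (2 * idx + 1) * k / (2 * n))
            for idx, v in enumerate(values)
        )
        out.append(scale * total)
    return out

def dct_2d(matrix):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(col) for col in zip(*rows)]  # cols[j][i] holds coefficient (i, j)
    return [list(row) for row in zip(*cols)]    # transpose back to [i][j] layout

def pssm_descriptor(pssm, block=20):
    """Flatten the top-left block x block coefficients (assumed selection rule)."""
    if len(pssm) < block or len(pssm[0]) < block:
        raise ValueError("PSSM is smaller than the requested coefficient block")
    coeffs = dct_2d(pssm)
    return [coeffs[i][j] for i in range(block) for j in range(block)]

# Toy input: a constant 30 x 20 matrix standing in for a real PSSM.
toy_pssm = [[1.0] * 20 for _ in range(30)]
features = pssm_descriptor(toy_pssm)
```

Because the descriptor always has 400 entries, proteins of different lengths are mapped to vectors of the same dimension, which is what allows them to be concatenated with the 881-dimensional drug fingerprint.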
Classification prediction. In this work, we introduce Rotation Forest (RF) as the classifier for predicting DTIs. RF is an ensemble classifier proposed by Rodriguez et al. [36]. The basic idea of RF is to build base classifiers that are simultaneously accurate and diverse [37-39]. When the algorithm executes, RF first randomly divides the sample set into subsets, then applies a transformation to each subset to increase the diversity among them. Finally, the transformed subsets are used to train the different base classifiers.
Assume S denotes the sample set, with $X = (x_1, x_2, \ldots, x_n)$. The sparse rotation matrix $R_i$ can be expressed as follows:
$$R_i = \begin{bmatrix} a_{i,1}^{(1)}, \ldots, a_{i,1}^{(M_1)} & 0 & \cdots & 0 \\ 0 & a_{i,2}^{(1)}, \ldots, a_{i,2}^{(M_2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{i,K}^{(1)}, \ldots, a_{i,K}^{(M_K)} \end{bmatrix}$$
here $a_{i,j}^{(k)}$ represents the kth coefficient vector in the jth diagonal block of the matrix, $M_j$ is the number of coefficient vectors in that block, and $R_i^r$ represents the matrix obtained after reordering. To improve the performance of the model, we use the grid search method to optimize the two parameters of RF, the number of feature subsets K and the number of base classifiers L. The accuracy obtained by RF under different parameters is shown in Fig. 2. As can be seen from the figure, the accuracy gradually increases as K increases; as L increases, the accuracy first increases rapidly, then increases slowly, and finally decreases slightly. Considering both accuracy and time consumption, we finally chose K = 21 and L = 42 as the most suitable parameters for this experiment.
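To make the structure of the sparse rotation matrix more concrete, the following Python sketch shows how per-subset coefficient blocks can be placed on the diagonal and applied to a sample. It is only a partial illustration under stated assumptions: the step that computes the coefficients for each feature subset (principal component analysis in the original Rotation Forest algorithm) is omitted, identity blocks are used as stand-ins in the example, and the function names are hypothetical.

```python
import random

def split_features(n_features, n_subsets, seed=0):
    """Randomly partition the feature indices into n_subsets disjoint groups."""
    indices = list(range(n_features))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_subsets] for k in range(n_subsets)]

def assemble_rotation(subsets, blocks, n_features):
    """Place each subset's coefficient block on the diagonal of a sparse matrix.

    blocks[j][a][b] is the loading of the a-th feature of subset j on that
    subset's b-th component. Rows are indexed by original feature position,
    so the result is already arranged in the original feature order.
    """
    n_components = sum(len(block[0]) for block in blocks)
    rotation = [[0.0] * n_components for _ in range(n_features)]
    offset = 0
    for subset, block in zip(subsets, blocks):
        for a, feature in enumerate(subset):
            for b, loading in enumerate(block[a]):
                rotation[feature][offset + b] = loading
        offset += len(block[0])
    return rotation

def rotate(sample, rotation):
    """Project one feature vector with the rotation matrix (sample times matrix)."""
    n_components = len(rotation[0])
    return [
        sum(x * rotation[f][c] for f, x in enumerate(sample))
        for c in range(n_components)
    ]

# Toy example: 6 features, K = 3 subsets, identity blocks instead of real coefficients.
subsets = split_features(6, 3)
blocks = [[[1.0, 0.0], [0.0, 1.0]] for _ in subsets]
rotation = assemble_rotation(subsets, blocks, 6)
sample = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
rotated = rotate(sample, rotation)
```

With identity blocks the rotation only permutes the features; with real coefficient blocks, each base classifier would be trained on a differently rotated copy of the data, which is the source of the ensemble's diversity.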

Results and Discussion
Evaluation criteria. In this work, accuracy (Accu.), sensitivity (Sen.), precision (Prec.), and the Matthews correlation coefficient (MCC) are used to evaluate the performance of our model. They are defined as follows:
$$\mathrm{Accu.} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Sen.} = \frac{TP}{TP + FN}$$
$$\mathrm{Prec.} = \frac{TP}{TP + FP}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
here TP, FP, TN and FN represent the numbers of true positives, false positives, true negatives and false negatives, respectively. Furthermore, the Receiver Operating Characteristic (ROC) curve [40] and the area under the curve (AUC) were also used to evaluate the performance of the proposed model.
Assessment of prediction ability. For comparability, our classifier uses the same parameters on all four benchmark datasets. In the experiments, the performance of our model is verified using 5-fold cross-validation. This has the advantage of not only testing the model's stability, but also helping to guard against over-fitting. Specifically, the whole dataset is split into five independent and equal-sized subsets, one of which serves as the test set and the remaining four as the training set. This procedure is repeated five times, each time with a different subset as the test set.
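The evaluation criteria and the 5-fold split described above can be sketched as follows, using the standard definitions of accuracy, sensitivity, precision and MCC. The function names, the shuffling with a fixed seed, and the example counts are assumptions for illustration; the sketch also assumes that both classes occur among the samples and the predictions, so that no denominator is zero except possibly in the MCC.

```python
import math
import random

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, precision and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "mcc": mcc,
    }

def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# Made-up counts for one fold, and the split of a 10-sample toy dataset.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
splits = list(five_fold_indices(10))
```

In a full experiment, the metrics would be computed on the test fold of each split and then averaged over the five folds.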
The prediction results of our model on the benchmark datasets are summarized in Table 2. The average accuracy reached 0.9140, 0.8919, 0.8724 and 0.8111 on the Enzymes, Ion Channels, GPCRs and Nuclear Receptors datasets, respectively.
We further compare the proposed model with a different feature extraction model (Pseudo-AAC) and a different classifier model (SVM). The Pseudo-AAC algorithm can effectively extract the hydrophobic information in the protein sequence, but it does not retain the biological evolution information. Given a protein sequence S, the general form of Pseudo-AAC proposed by Chou et al. [41] is defined as:
$$P_u = \begin{cases} \dfrac{F_u}{\sum_{i=1}^{20} F_i + w \sum_{j=1}^{m} \lambda_j}, & 1 \le u \le 20 \\ \dfrac{w \lambda_{u-20}}{\sum_{i=1}^{20} F_i + w \sum_{j=1}^{m} \lambda_j}, & 20 < u \le 20 + m \end{cases}$$
$$\lambda_j = \frac{1}{L - j} \sum_{i=1}^{L-j} \Theta(R_i, R_{i+j}), \quad j = 1, \ldots, m$$
here L is the length of the protein sequence, $F_i$ is the normalized frequency of the ith amino acid type in the protein, w is the weighting factor, $\lambda_j$ is the j-tier sequence correlation factor, m is the number of tiers considered, $R_i$ is the ith residue of S, and $\Theta$ is the correlation function between two residues. The corresponding comparison results are reported in Table 3.
To facilitate the comparison, we present the results of the three models on the benchmark datasets as histograms. From Fig. 7 we can see that the proposed model achieved the best results on all four datasets. In terms of accuracy, the proposed model is 0.0690 and 0.0622 higher than the Pseudo-AAC model and the SVM model, respectively, on the Enzymes dataset, 0.0623 and 0.0427 higher on the Ion Channels dataset, 0.1299 and 0.0921 higher on the GPCRs dataset, and 0.1111 and 0.1333 higher on the Nuclear Receptors dataset. The results show that the proposed model can predict potential drug-target relationships more accurately than the other models.
The excellent performance of the proposed model is mainly attributed to the following three points: (a) the model uses protein sequence characterization with biological evolution information and drug molecular characterization with molecular fingerprint information, a strategy that enriches the representation of drug-target data; (b) the DCT algorithm used in the model can effectively extract the hidden features in the drug-target data while losing only a little information during processing; (c) the RF classifier used in the model can classify drug-target data accurately and quickly, thereby greatly improving the performance of the model.
Comparison with state-of-the-art models. So far, many state-of-the-art models have been proposed to predict drug-target interactions and have achieved good results. To fully evaluate the performance of the proposed model, we compare it with these state-of-the-art models on the benchmark datasets. Table 5 lists the AUC values achieved by the different models. It can be observed that our model improves on the compared models on all benchmark datasets except the Nuclear Receptors dataset. On the Enzymes, Ion Channels and GPCRs datasets, our model achieved the highest score, improving by 0.0458, 0.0935, and 0.0003, respectively, over the next highest model. On the Nuclear Receptors dataset, our model achieved the third highest score, only 0.0567 lower than the top-scoring SIMCOMP model.
To further compare the performance of the models, we evaluated the comparison results in Table 5 using a statistical test. The null hypothesis is that there is no significant difference between our model and the other models, tested at the 95% confidence level. If the P-value is lower than 0.05, we conclude that there is a significant difference between the proposed model and the comparison models. We obtained a P-value of 0.044. These results show that the proposed model is significantly more competitive than the other models and can effectively predict potential drug-target protein interactions.
Case studies. To further evaluate the ability of the proposed model to predict potential DTIs, we conducted case studies. We train the model with all the positive samples in the benchmark datasets as the training set, and predict scores for the drug-target pairs with no known association. The top 10 drug-target pairs with the highest predicted scores were then checked against the SuperTarget database. Table 6 summarizes the details of these top 10 pairs. It can be seen from the table that 8 of the new drug-target pairs have been confirmed by the SuperTarget database. The results of the case studies show that the proposed model can effectively predict unknown drug-target associations and provide reliable candidates for biological experiments. It is worth noting that although the remaining two drug-target pairs have not been confirmed so far, the possibility of an association between them cannot be ruled out.

Conclusion
In this work, based on the assumption that the relationship between drugs and targets is largely influenced by the drug molecular structure and the protein amino acid sequence, we proposed a novel model to predict DTIs by fusing protein sequence information and molecular fingerprint information. To improve the performance of the proposed model, we introduce biological evolution information when extracting protein features and adopt the Rotation Forest classifier for feature classification. In the experiments, the proposed model was validated on four benchmark datasets: Enzymes, Ion Channels, GPCRs and Nuclear Receptors. Furthermore, we compared it with a different feature extraction model, a different classifier model and other state-of-the-art models. In the case study, 8 of the top 10 drug-target pairs predicted by our model were confirmed by relevant databases. These results show that the proposed model is well suited to predicting DTIs and can be an effective tool for providing reliable candidates for biological experiments. In future work, we will focus on the feature extraction algorithm to further improve the performance of the model.