HIDTI: integration of heterogeneous information to predict drug-target interactions

Identification of drug-target interactions (DTIs) plays a crucial role in drug development. Traditional laboratory-based DTI discovery is generally costly and time-consuming. Therefore, computational approaches have been developed to predict interactions between drug candidates and disease-causing proteins. We designed a novel method, termed heterogeneous information integration for DTI prediction (HIDTI), based on the concept of predicting vectors for all unknown or unavailable heterogeneous drug- and protein-related information. We applied a residual network in HIDTI to extract features of such heterogeneous information for predicting DTIs, and tested the model using drug-based ten-fold cross-validation to examine the prediction performance for unseen drugs. HIDTI outperformed existing models that use heterogeneous information, demonstrating that our method predicts heterogeneous information for unseen data better than other models. In conclusion, our study suggests that HIDTI has the potential to advance the field of drug development by accurately predicting the targets of new drugs.


Data description
The datasets used in our experiments are summarized in Figures S1-S4. The distributions of SMILES string lengths for drugs and sequence lengths for proteins are shown in Figures S1 and S2. For the drugs, the minimum, maximum, median, and average lengths of the SMILES strings were 3, 416, 49, and 57.55, respectively. For proteins, the minimum, maximum, median, and average sequence lengths were 38, 3608, 291, and 371, respectively. Figure S3 shows the drug class information. Drugs were divided into 14 classes according to the first level of the Anatomical Therapeutic Chemical (ATC) classification system of the World Health Organization (WHO). The related information was extracted from the DrugBank database 1. Figure S4 shows the five protein categories in our data: Enzyme, Transporter, G protein-coupled receptor, Voltage-gated ion channel, and Transcription factor. These classifications were obtained from The Human Protein Atlas 2.
The positive drug-target interaction pairs were obtained from the DrugBank database 1, which contains information on pharmacological action and on interaction types annotated as antagonist, agonist, substrate, inhibitor, or inducer. The proteins in the positive interaction pairs can be further classified into drug targets, enzymes, carriers, and transporters.
Models for unseen drugs

Deep neural network (DNN) model for predicting feature vectors of heterogeneous information
To obtain heterogeneous information for unseen drugs, we adopted the DNN model proposed by Wang et al. 3. This DNN model was designed to predict adverse drug reactions using biological, biomedical, and drug chemical information. In that model, the features used for training and testing differed slightly: in the training step, biological, biomedical, and drug chemical information was used together with the adverse drug reaction features of the drugs, whereas the adverse drug reaction features were set to zero in the test step. The dimensions of the features were 400, 1048, 1048, 15, and 1325 for drug, biological, biomedical, drug chemical, and adverse drug reactions, respectively. The hidden layers consisted of [1024, 512, 128] nodes, and the output layer included 1325 nodes, which matched the size of the adverse drug reaction vector.
Based on the concept of the DNN model proposed by Wang et al. 3, we designed a separate DNN model for each item of heterogeneous information: drug-drug interaction (DDI), drug-side effect association (DSIE), and drug-disease association (DDIS). The drug vector, with a size of 300, was concatenated with the vector of each item of heterogeneous information, DDI (707), DSIE (4192), or DDIS (5603), and the concatenated vector was used as the input. The hidden layers consisted of [1024, 512, 128] nodes for the DDI vector prediction model and [4096, 2048, 1024] nodes for the DSIE and DDIS vector prediction models. The size of the output layer of each model was equal to the size of the corresponding heterogeneous vector.
The objective function of each model compares the actual binary values with the predicted values over the N entries of the output vector, where N is the output vector size. To avoid overfitting, we used unique drugs for training and added a dropout layer with a rate of 0.5.
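As a rough illustration of the DDI vector prediction model described above, the following NumPy sketch builds a network with the stated layer sizes (input 300 + 707, hidden layers [1024, 512, 128], output 707). Several details are assumptions and are not taken from the text: the ReLU and sigmoid activations, the placement of dropout after every hidden layer, the use of binary cross-entropy as the objective, and zeroing the DDI part of the input for an unseen drug (by analogy with the test step of Wang et al.). The function names `init_mlp`, `forward`, and `bce` are hypothetical.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Create (weight, bias) pairs for consecutive layer sizes."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x, rng=None, dropout=0.5):
    """ReLU hidden layers and a sigmoid output (both assumed).

    Dropout is applied only when an rng is passed (training mode).
    """
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
        if rng is not None:
            # Inverted dropout: drop units with probability `dropout`.
            mask = rng.random(h.shape) >= dropout
            h = h * mask / (1.0 - dropout)
    w, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

def bce(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the output entries (assumed objective)."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

DRUG_DIM, DDI_DIM = 300, 707  # sizes stated in the text
rng = np.random.default_rng(0)
params = init_mlp([DRUG_DIM + DDI_DIM, 1024, 512, 128, DDI_DIM], rng)

# Training input: drug vector concatenated with the known DDI vector (toy data).
drug_vec = rng.normal(size=(4, DRUG_DIM))
ddi_vec = (rng.random((4, DDI_DIM)) < 0.05).astype(float)
x_train = np.concatenate([drug_vec, ddi_vec], axis=1)
loss = bce(ddi_vec, forward(params, x_train, rng=rng))

# Unseen drug: the DDI part of the input is zeroed (assumption), and the
# network output serves as the predicted DDI vector.
x_test = np.concatenate([drug_vec[:1], np.zeros((1, DDI_DIM))], axis=1)
ddi_pred = forward(params, x_test)
```

The DSIE and DDIS models would differ only in the sizes passed to `init_mlp`, for example `[300 + 4192, 4096, 2048, 1024, 4192]` for DSIE.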

NeoDTI
To evaluate the performance of NeoDTI for unseen drugs, we modified the NeoDTI code available on GitHub (https://github.com/FangpingWan/NeoDTI). Because our folds were drug-based, we set the interacting edges of the network to zero for the test drugs during training. The interacting edges represent heterogeneous drug-related information, including the DDI, DSIE, and DDIS, in the network. We then used the predicted values between proteins and test drugs for performance evaluation. The modified NeoDTI code is available on GitHub (https://github.com/DMCB-GIST/HIDTI).
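The drug-based split and the masking of edges for test drugs can be sketched as follows. This is a minimal illustration on toy matrices, not the modified NeoDTI code; the helper names `drug_based_folds` and `mask_test_drugs`, the toy matrix sizes, and the choice to zero both rows and columns of the drug-drug matrix are assumptions.

```python
import numpy as np

def drug_based_folds(n_drugs, n_folds=10, seed=0):
    """Split drug indices into disjoint folds, so every test drug is unseen in training."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_drugs), n_folds)

def mask_test_drugs(matrix, test_drugs, symmetric=False):
    """Return a copy in which the rows of the test drugs are set to zero.

    For drug-drug matrices (symmetric=True) the matching columns are zeroed too.
    """
    masked = matrix.copy()
    masked[test_drugs, :] = 0
    if symmetric:
        masked[:, test_drugs] = 0
    return masked

# Toy heterogeneous information (sizes are illustrative only).
n_drugs, n_side_effects = 50, 8
rng = np.random.default_rng(1)
ddi = (rng.random((n_drugs, n_drugs)) < 0.1).astype(int)
ddi = np.maximum(ddi, ddi.T)  # make the drug-drug matrix symmetric
dsie = (rng.random((n_drugs, n_side_effects)) < 0.2).astype(int)

folds = drug_based_folds(n_drugs)
test_drugs = folds[0]

# Edges of the test drugs are hidden from the network during training.
ddi_train = mask_test_drugs(ddi, test_drugs, symmetric=True)
dsie_train = mask_test_drugs(dsie, test_drugs)
```

Splitting by drug rather than by interaction pair keeps every edge of a test drug out of training, which is what makes the drug "unseen" at evaluation time.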
The results of the grid-search algorithm are presented in Table S1. Finally, one residual block, one hidden layer for the classifier, and a learning rate of 1 10 were set for our model.

Performance of HIDTI based on drug categories for unseen drugs
We additionally examined the performance of HIDTI based on drug categories for unseen drugs (Table S8). The prediction performance was the highest for systemic hormonal preparations; however, this was because only 10 interactions were available for calculating the area under the receiver operating characteristic curve (AUC) values, so the highest AUC of a single interaction strongly affected the average AUC. Accordingly, drugs targeting the nervous system showed very high predictability, whereas drugs with anti-infective properties for systemic use showed the lowest predictability.

Table S4. Performance evaluation of the support vector machine when heterogeneous information was predicted for unseen drugs.