Abstract
Drug-target interaction (DTI) prediction plays a crucial role in drug repositioning and virtual drug screening. Most DTI prediction methods cast the problem either as a binary classification task that predicts whether an interaction exists or as a regression task that predicts continuous values indicating a drug's ability to bind to a specific target. The regression-based methods provide insight beyond the binary relationship. However, most of these methods require three-dimensional (3D) structural information for the targets, which is still not generally available. Despite this bottleneck, only a few methods address the drug-target binding affinity (DTBA) problem with a non-structure-based approach that avoids the 3D structure limitations. Here we propose Affinity2Vec, a novel regression-based method that formulates the entire task as a graph-based problem. To develop this method, we constructed a weighted heterogeneous graph that integrates data from several sources, including drug-drug similarity, target-target similarity, and drug-target binding affinities. Affinity2Vec further combines several computational techniques from feature representation learning, graph mining, and machine learning to generate or extract features, build the model, and predict the binding affinity between the drug and the target with no 3D structural data. We conducted extensive experiments to evaluate and demonstrate the robustness and efficiency of the proposed method on benchmark datasets used in state-of-the-art non-structure-based drug-target binding affinity studies. Affinity2Vec showed superior and competitive results compared to the state-of-the-art methods based on several evaluation metrics, including mean squared error, r_{m}^{2}, concordance index, and area under the precision-recall curve.
Introduction
Drug repositioning (DR) has gained significant attention due to its reduced time, lower investment, and higher success rate compared to traditional de novo drug development^{1}. This realization, together with large repositories of drug-related data being made accessible to researchers, spurred the development of a number of computational DR strategies^{2}. One of the significant strategies is computational drug-target interaction (DTI) prediction, as it narrows down the search space for candidate drugs that can treat targeted diseases^{3}. DTI prediction methods can be classified into two main categories based on the prediction task. The first is the binary classification category, in which the prediction task is to determine whether the drug interacts with the target protein. The second is the regression category, in which the prediction task is to determine continuous values that indicate the strength of the binding between the drug and the target (i.e., the binding affinity)^{4}.
The more commonly developed binary classification methods range from docking and chemogenomic-based approaches^{5,6,7} to network-based^{8,9} and machine learning (ML)-based approaches^{10,11,12,13,14}. The network-based approaches include knowledge graph-based^{14} and graph embedding-based approaches^{15,16}. These prediction types cannot differentiate between actual negative DTIs and interactions with missing information (unknown values), which negatively affects the model prediction performance. Moreover, the prediction does not reflect the binding strength between the drugs and proteins, which impacts the drug's potential efficacy. Thus, more current research focuses on drug-target binding affinity (DTBA) prediction, which builds regression models instead of classification models. A regression-based method can rank the therapeutic drugs, further limiting the scope of potential drugs (i.e., compounds) for drug discovery studies. Moreover, these DTBA regression models overcome the previous limitations by reflecting the binding strength through predicted affinity values using different measurements such as the inhibition constant (K_{i}), the dissociation constant (K_{d}), or the half-maximal inhibitory concentration (IC_{50})^{17}. However, several DTBA prediction methods developed to date require the three-dimensional (3D) structure of drugs or target proteins, which is mostly not available, and cannot be scaled to large-scale data due to the low quality of 3D structures^{18,19}. Moreover, these structure-based methods require a molecular docking step, which is a bottleneck for prediction efficiency.
The more recently developed ML- and deep learning (DL)-based methods avoid this limitation by using non-structure-based methods (i.e., sequence-based methods) that do not require docking or 3D structural data for DTBA predictions^{4}. Also, these ML/DL-based methods can aid and support the other types of methods in affinity prediction. Despite those methods' impressive performance, the DTBA regression task remains critical and challenging, and there is still room for algorithms that improve the prediction performance. The first attempt to predict DTBA without using 3D structural data is Kronecker-Regularized Least Squares (KronRLS)^{20}. KronRLS is an ML-based method that uses target-target similarity and drug-drug similarity matrices as features to predict DTBA values by finding a minimizer of the objective function (i.e., error or loss function). The second ML method, SimBoost^{21}, uses similarity matrices and constructed features obtained from target-target similarity, drug-drug similarity, and drug-target networks to predict DTBA using gradient boosting regression trees. SimBoost improved the prediction performance compared to the KronRLS method. More recently, however, the similarity-based DL method SimCNN-DTA^{22} achieved even better prediction performance. SimCNN-DTA predicts DTBA values by applying a two-dimensional convolutional neural network (CNN) on the outer product of the target similarity and drug similarity matrices. DL-based DTBA methods that are not similarity-based have also been developed^{23,24,25,26,27}. DeepDTA^{23} is one of the first DTBA prediction methods developed by constructing two CNNs applied on drug SMILES and target amino-acid sequences to learn feature representations for drugs and targets, respectively. These feature representations were subsequently combined and fed to fully connected layers for DTBA prediction.
Similar to DeepDTA, several DL-based methods have been established using similar concepts but differing in: using other SMILES and sequence embeddings^{26}, integrating different input representations such as gene ontology for proteins^{28} or molecular graphs for drugs^{29}, using different neural network (NN) architectures for feature representation learning^{30}, or utilizing different DL techniques such as long short-term memory (LSTM) or attention mechanisms^{27}.
Several network-based methods^{31,32,33} outperform other DTI prediction approaches. For example, AOPEDF^{34} created a heterogeneous biological network by integrating drugs, proteins, and diseases. AOPEDF then learns a low-dimensional vector representation of features that preserves arbitrary-order proximity from this rich constructed network covering chemical, genomic, phenotypic, and network profiles. Finally, a cascade deep forest classifier was built to identify novel DTIs. Another network-based approach, MultiDTI^{35}, uses a joint representation framework (i.e., multi-modal representation) based on heterogeneous networks to combine similarity-based methods with network-based methods. MultiDTI takes advantage of the comprehensive information and different perspectives for DTI prediction. First, MultiDTI combined the interaction/association information of the heterogeneous network and the drug/target sequence information and then mapped the drug, target, side effect, and disease nodes in the heterogeneous network into a shared space. Finally, it predicted new DTIs based on the distance between the drug and the target in this new shared space. The last, very recent network-based DTI prediction method, DTi2Vec^{36}, integrates the drug-drug similarity graph, the target-target similarity graph, and DTIs into one weighted heterogeneous graph. It then applied graph embeddings on this network to generate node embeddings and learned a feature representation for each edge (drug-target pair), which is used to predict DTIs. These three methods are formulated as supervised link prediction problems (i.e., binary classification) and proved their efficiency in performance evaluation in terms of AUPR and AUC by outperforming several state-of-the-art methods and predicting novel DTIs. Similar to these methods, in our work we leverage the power of network characteristics combined with other techniques to predict binding affinities for drug-target pairs.
Thus, we constructed a heterogeneous network by integrating several networks, but we did not include any disease-related network data. However, the objective of our method is to predict binding affinities (continuous bioactivity values) for drug-target pairs instead of on/off binary relationships, and we use the appropriate evaluation metrics.
To the best of our knowledge, this study is the first regression-based attempt to treat the drug-target prediction task as a whole weighted heterogeneous network. We constructed a weighted network using three graphs (i.e., binding affinity graph, target-target similarity graph, and drug-drug similarity graph) and calculated the similarities using several techniques. Here we present Affinity2Vec, a novel network-based method that accurately predicts the continuous binding affinity values between drugs (i.e., compounds or ligands) and target proteins (more specifically, kinases). Affinity2Vec integrates several techniques, including amino-acid sequence embeddings, SMILES embeddings, graph mining for feature extraction, and ML for prediction. Our method outperforms the baseline methods on the benchmark datasets based on multiple evaluation metrics and notably reduces prediction errors.
Materials
Benchmark datasets
Two biomedical datasets specialized for DTBA prediction, the Davis^{37} and KIBA^{38} datasets, have become benchmark datasets to train and evaluate non-structure-based methods. Table 1 provides a summary of those two datasets. The Davis dataset^{37} has selectivity assay data for the kinase protein family. It consists of 68 unique drugs, 442 unique target proteins, and 30,056 binding affinity values for all the drug-target pairs. The binding affinity of a drug-target pair is measured by the kinase dissociation constant (K_{d}). The higher the value of K_{d}, the weaker the binding between drug and target. Similar to the DL-baseline methods^{23}, the affinity values were transformed into logarithm space (pK_{d}) by applying the following equation:
$$pK_{d} = -\log_{10}\left(\frac{K_{d}}{10^{9}}\right) \quad (1)$$
The affinity values in the Davis dataset range from 5.0 to 10.8. The KIBA dataset^{38} provides Kinase Inhibitor BioActivity (KIBA) data and introduces KIBA scores that combine information from different sources, including K_{d}, K_{i} (inhibition constant), and IC_{50} (the concentration required to produce half-maximum inhibition), into a single bioactivity score. Following prior work^{39}, we filtered the KIBA dataset by keeping the drugs and proteins with 10 or more known interactions, resulting in 2,111 drugs and 229 targets with 118,254 interactions. KIBA affinity scores range from 0.0 to 17.2. The lower the KIBA score, the stronger the binding affinity between the drug and the target. The density indicates the percentage of known binding affinities.
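The lower end of the Davis range quoted above can be checked with a small sketch of the pK_{d} transform. It assumes that K_{d} is expressed in nM, so that dividing by 10^9 converts it to molar units; the function name `kd_to_pkd` and the example values are illustrative and are not taken from the dataset files.

```python
import math


def kd_to_pkd(kd_nm: float) -> float:
    """Convert a dissociation constant (assumed to be in nM) to pKd.

    pKd = -log10(Kd / 1e9); a larger Kd gives a smaller pKd.
    """
    if kd_nm <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_nm / 1e9)


# Illustrative Kd values in nM (hypothetical, not read from the Davis files).
example_kd_nm = [10000.0, 100.0, 1.0]
example_pkd = [kd_to_pkd(kd) for kd in example_kd_nm]
```

Under this assumption, a K_{d} of 10,000 nM maps to a pK_{d} of 5.0, which matches the minimum of the reported range.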
In addition to these two datasets, we included a dataset recently used for protein–ligand complex binding affinity prediction, named the PDBBind dataset^{40}, as a third benchmark dataset to test and assess our method. The PDBBind dataset is a comprehensive resource of experimentally measured binding affinity data, expressed as pK_{d} values, for protein–ligand complexes, provided with the 3D structural information of the molecular complexes in the form of SDF, PDB, and MOL2 files. PDBBind datasets are derived from the Protein Data Bank (PDB)^{40} and consist of three subsets: the general set, the refined set, and the core set. The general set contains lower-quality information on the complexes, and the refined set is a subset of the general set with better-quality information (i.e., high structural resolution, accurate binding measurements, and the nature of the complexes). The core set is the highest-quality benchmark subset. However, its size is very small compared to the other subsets. Therefore, we used the widely used PDBBind Refined v.2015 dataset to take advantage of the high-quality data with a reasonable size. We utilized the PDBbind Refined dataset v.2015 with 3437 protein–ligand complexes and their affinity values, similar to previously published works^{41,42,43}. Briefly, we filtered this number to 3045 pairs for which we managed to obtain amino-acid sequences for the proteins and SMILES for the ligands.
Input representation
Drug SMILES
Our study represents the drugs using the Simplified Molecular-Input Line-Entry System (SMILES)^{44}, a line notation that describes the structure of chemical compounds as text strings. We collected the drugs' SMILES for both the Davis and KIBA datasets from previous work^{23}. For drugs in the Davis dataset, the authors of that work extracted the SMILES from the PubChem compound database based on their PubChem CIDs^{45}. For drugs in the KIBA dataset, they used the PubChem CIDs to extract the SMILES. For the Davis dataset's drugs, the maximum and average lengths of SMILES are 103 and 64 tokens, respectively, while for the KIBA dataset's drugs, the maximum and average lengths of SMILES are 590 and 58 tokens, respectively. For the PDBbind Refined dataset, we first obtained SDF files for all ligands from the PDBBind web server for v.2015. We then converted these SDF files to SMILES using the Python application programming interface (API) of RDKit^{46}, an open-source cheminformatics software. For this dataset, the maximum and average lengths of SMILES are 126 and 46 tokens, respectively.
Protein sequences
We also acquired the proteins' amino-acid sequences from previous work^{23}. The authors of that work extracted protein sequences from the UniProt protein database using gene names/RefSeq accession numbers for the Davis dataset and UniProt IDs for the KIBA dataset^{47}. The maximum and average lengths of protein sequences are 2549 and 788 characters for the Davis dataset, respectively, and 4128 and 728 characters for the KIBA dataset, respectively. For the PDBbind Refined dataset, we retrieved the FASTA file for all the proteins from prior work^{42}, initially extracted from the PDBBind database server. The maximum and average lengths of protein sequences are 4638 and 494 characters, respectively.
Drug and target similarity
We retrieved drug similarity scores for all drug pairs and target similarity scores for all target pairs for both the Davis and KIBA datasets from previous works in an adjacency matrix format^{23,39}. The drug similarity scores were calculated using the SIMCOMP tool^{48}, which represents the drugs by their two-dimensional (2D) chemical structure (as a graph) and then calculates the drug-drug similarity score based on the size of the common substructures of the two graphs using the Tanimoto coefficient. For target similarity scores, the targets were represented by their amino-acid sequences, and the target similarity scores were calculated as normalized Smith-Waterman (SW) scores^{49} based on the alignment of the protein sequences. We refer to the drug similarity matrix and the target similarity matrix as DDsim1 and TTsim1, respectively, for later use.
Methods
Problem formulation
This work presents a regression-based approach for predicting binding affinity scores. The input data consists of: D = {d_{1}, d_{2}, …, d_{n}}, where D represents the drug space and n is the number of drugs; T = {t_{1}, t_{2}, …, t_{m}}, where T represents the target protein space and m is the number of target proteins; and Y = {y_{ij}^{N}}, which is the set of labels (i.e., continuous binding affinity values), where y_{ij} is the binding score of the drug_{i}-target_{j} pair and N is the number of observed affinities. For each drug-target pair, we extracted different features using several techniques, explained later in the feature extraction section. The feature vectors (FVs) are represented by X ⊆ {x_{1}, x_{2}, …, x_{n*m}} and their labels (i.e., continuous values) by Y ⊆ {y_{1}, y_{2}, …, y_{n*m}}, where n*m is the number of all possible (drug, target) pairs. It is necessary to mention that, in the Davis dataset, all drug-target pairs have affinity values (i.e., labels), while in the KIBA and PDBBind Refined datasets, many pairs have 'NaN' labels, which we excluded from X and Y. Most previous works generated each drug-target pair's features using different techniques applied to the SMILES and sequences of drugs and proteins, respectively. However, up to now, no published work has treated binding affinity prediction as a network-based problem, which is the approach we take in our study.
Overview of the Affinity2Vec model
Figure 1 illustrates the pipeline for Affinity2Vec, which consists of four main parts:

1. Input representations, data preprocessing, and constructing the heterogeneous network for the DTBA.
2. Applying embedding techniques on the drugs' SMILES and targets' sequences to generate feature representations for individual drugs and target proteins. Also, we integrated the binding affinity, drug similarity, and target similarity graphs to form the complete DTBA network.
3. Applying three feature extraction techniques to generate three different feature types: the embedding features, the meta-path score features, and the hybrid features that combine the previous two types.
4. Applying ML regressors to establish the regression models.
We explain each of these steps in more detail in the corresponding sections.
Preprocessing and constructing DTBA network
In our study, we constructed a weighted heterogeneous graph G(V, E) using the DTBA network integrated with the target-target similarity and drug-drug similarity graphs, where V is the set of vertices (i.e., drug and target nodes) and E is the set of edges, each of which is one of three types: (1) DT edges, which reflect the binding strength values between drugs and targets; and (2) DD edges and (3) TT edges, which represent the connection between two drugs or two target proteins, respectively, with a similarity score above a certain threshold. It is essential to mention that we used two pairs of similarity matrices for drugs and targets, (DDsim1, TTsim1) and (DDsim2, TTsim2), from different sources to construct heterogeneous graphs G1 and G2, respectively, which are used later in graph-based feature extraction. We explain the formulation of the graph edges in more detail in the subsections that follow.
Filtering similarity edges
First, we normalized each similarity graph separately so that the similarity scores range within [0,1]. Before integrating the target-target similarity and drug-drug similarity graphs, we applied a preprocessing step that filters each graph separately. If we included all similarity scores as edges connecting similar target proteins or similar drugs, the graph would be a very complex network, particularly when the number of targets or drugs is large. Therefore, we removed all the weak similarity scores from the target similarity graph by filtering out the edges below a specific threshold. We applied the same process to filter the drug similarity graph. We performed an empirical analysis of the target-target similarity and drug-drug similarity graphs separately, as explained and illustrated in the Supplementary Material, Fig. S1, to specify a different threshold value for each graph. For the Davis dataset, for example, the threshold for the drugs' 2D chemical structure similarity is set to 0.3, while the threshold for the targets' normalized Smith-Waterman alignment similarity is set to 0.04. Thus, each similarity edge below the threshold is removed, resulting in target-target similarity and drug-drug similarity subgraphs. The rationale for this process is to reduce the noise introduced when using all similarity scores, including low similarity scores that carry no informative meaning and can degrade the performance. In addition, decreasing the number of edges in each similarity graph reduces the model's running time.
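The normalization and thresholding step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the matrix is assumed to be symmetric, self-similarities on the diagonal are ignored, min–max scaling over all entries is assumed for the normalization, and the helper names and the toy matrix are hypothetical.

```python
def min_max_normalize(matrix):
    """Rescale all entries of a similarity matrix to the range [0, 1]."""
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]


def filter_similarity_edges(matrix, threshold):
    """Return weighted edges (i, j, weight) whose normalized score is >= threshold.

    Only the upper triangle is scanned, so each undirected edge appears once
    and self-loops are skipped.
    """
    norm = min_max_normalize(matrix)
    edges = []
    for i in range(len(norm)):
        for j in range(i + 1, len(norm)):
            if norm[i][j] >= threshold:
                edges.append((i, j, norm[i][j]))
    return edges


# Toy drug-drug similarity matrix (hypothetical values).
toy_dd_sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.35],
    [0.1, 0.35, 1.0],
]
kept_edges = filter_similarity_edges(toy_dd_sim, threshold=0.3)
```

In this toy case only the edge between drugs 0 and 1 survives, because the other two scores fall below 0.3 after normalization.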
Formulating binding affinity edges
In the Davis, KIBA, and PDBBind datasets, lower affinity values indicate stronger binding between the drug and the target. Thus, the binding affinity values behave like a distance (i.e., the smaller the distance between two entities, the stronger the relation between them), which is the opposite of the meaning of the similarity edges used in the target-target similarity and drug-drug similarity graphs. As a result, we need a consistent meaning for all edge types (the higher the edge value, the stronger the relationship) and a consistent value range (i.e., [0,1]) in this heterogeneous network. For this purpose, an exponential function is a good candidate to convert the affinity values (an element with a high value should be converted into a low value) and output them in the range [0,1]. Another function that fits our goal is the SoftMax function, a generalization of the logistic function (i.e., the normalized exponential function), which normalizes the input values into a vector that follows a probability distribution whose entries sum to one. We applied both functions to the binding affinity values of the three datasets separately and adopted the approach that gave the best result. The exponential and SoftMax \(\sigma\) functions^{50} are described in Eqs. (2) and (3), respectively.
where z denotes the data samples of affinity values, and the parameter \(\alpha\) in Eq. (2) can be tuned or possibly estimated from the data. After preprocessing the affinity values, we augmented the DTBA with the new affinity values and with the target-target similarity and drug-drug similarity subgraphs to construct the extensive DTBA heterogeneous network, which is used later in the graph-based feature extraction step.
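The two candidate transformations can be sketched as below. The sketch assumes the usual forms, a decaying exponential e^{-αz} and the standard SoftMax e^{z_i} / Σ_j e^{z_j}; the exact expressions, the value of α, and any sign change applied to z before the SoftMax are not recoverable from the text, so they are assumptions here, as are the function names and toy values.

```python
import math


def exponential_transform(values, alpha=1.0):
    """Map non-negative affinity values to (0, 1]; larger inputs give smaller outputs."""
    return [math.exp(-alpha * z) for z in values]


def softmax(values):
    """Standard SoftMax; the maximum is subtracted first for numerical stability."""
    shift = max(values)
    exps = [math.exp(z - shift) for z in values]
    total = sum(exps)
    return [e / total for e in exps]


# Toy affinity values (hypothetical), transformed both ways.
toy_affinities = [5.0, 7.5, 10.8]
exp_weights = exponential_transform(toy_affinities, alpha=0.1)
soft_weights = softmax(toy_affinities)
```

Note that the standard SoftMax preserves the ordering of its inputs, so inverting the meaning of the affinity values would require feeding it negated values; whether that was done is an open assumption of this sketch.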
Learning representation for drugs and proteins
In this step, the inputs are the SMILES of the drugs and the amino-acid sequences of the target proteins, to which different techniques are applied to generate embeddings for each drug and each target protein.
Drug encoders
To capture the critical properties and generate features for each drug, we applied the sequence-to-sequence learning model (seq2seq, also known as the encoder-decoder model)^{51}. Specifically, we adopted the seq2seq fingerprint model introduced in^{52}. This unsupervised, data-driven DL molecular embedding method applies several modifications to the original version of the seq2seq learning model to fit drug discovery tasks, which are in our scope. Our objective is to learn the vital features of the drugs. We adapted this version of the seq2seq model to encode drugs for the following reasons: (1) this method is data-driven, and there is no need for any expert knowledge; (2) the fingerprints (i.e., feature representations) generated by this method can reconstruct the original drug representations, which ensures that the information encoded in the fingerprint vector is sufficient; (3) this method uses unsupervised training on a substantial unlabeled dataset by applying the power of deep neural networks (DNN). The seq2seq model that we applied consists of a multi-layered Gated Recurrent Unit (GRU) network that maps the input into a fixed-dimension FV, and a deep GRU network that decodes the FV back to the original form. It is necessary to mention that this seq2seq model uses GRU instead of the long short-term memory (LSTM) usually implemented in encoder-decoder models; GRU achieves similar performance while accelerating the training process. A dropout layer is attached to avoid overfitting in the training phase. Lastly, an extra fingerprint extraction layer set is fed forward to only the perceiver network to extract the resulting fingerprint embeddings, which we use to obtain the embeddings for each drug.
Figure 2 shows an overview of the seq2seq model that we applied. The input and the output of the seq2seq model are the SMILES strings that serve as the text representation. That is, we fed the drugs' SMILES to the model so that the perceiver converts the SMILES strings into a fixed-sized FV, and then the interpreter translates it back to the original SMILES strings. The intermediate fixed-sized FV is extracted as the seq2seq fingerprint. The intermediate FV encodes all the essential information needed to recover the original drug representation. Hence, we expect the seq2seq fingerprint to capture the precise information we can use for downstream ML tasks. In our work, we trained the model using around 85,900 SMILES, which we split into the tokens that can appear in a SMILES string (i.e., the alphabet for the seq2seq model). We passed these tokens, together with the maximum SMILES length and other parameters, to the seq2seq model. Furthermore, we tuned several parameters of our seq2seq model using different sets of tested values (see Table 2), and we applied the Adam optimizer in the DL training process. Finally, we extracted the embeddings for each drug in the Davis, KIBA, and PDBBind Refined datasets by predicting each drug's SMILES features.
Protein encoder
To generate meaningful feature representations for the target proteins, we applied ProtVec^{53}, a feature extraction method for protein amino-acid sequences. ProtVec^{53} is an unsupervised, data-driven, continuous distributed representation of biological sequences that captures a diverse range of informative biophysical and biochemical characteristics. Similar to the training process of the Skip-gram word embedding model in natural language processing (NLP), a large corpus is needed to train a distributed representation of biological sequences. Thus, for ProtVec^{53}, 546,790 amino-acid sequences of different proteins were downloaded from the SwissProt UniProt server^{54}. Those sequences were then broken down into subsequences (i.e., biological words) using the n-gram model. However, instead of using an overlapping window of 3 to 6 residues, three lists of shifted non-overlapping words were generated (a 3-gram is a "biological" word consisting of 3 amino acids), as shown in Fig. 3. Thus, ProtVec is trained on 1,640,370 (546,790 × 3) sequences of 3-grams through a Skip-gram neural network model that maximizes the probability of observed word sequences. We used the pretrained model to generate the embeddings for each target protein in our datasets. We then represent each protein sequence by summing the feature representations of its overlapping 3-grams.
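The two sequence-handling steps described above, splitting a sequence into three shifted lists of non-overlapping 3-grams and summing the vectors of its overlapping 3-grams, can be sketched as follows. The lookup table `kmer_vectors` stands in for the pretrained ProtVec model and is hypothetical, as are the function names; skipping 3-grams that are missing from the table is an assumption of this sketch.

```python
def shifted_ngrams(sequence, n=3):
    """Return n lists of non-overlapping n-grams, one list per starting offset."""
    lists = []
    for offset in range(n):
        words = [sequence[i:i + n]
                 for i in range(offset, len(sequence) - n + 1, n)]
        lists.append(words)
    return lists


def embed_protein(sequence, kmer_vectors, dim, n=3):
    """Sum the vectors of all overlapping n-grams of a sequence.

    n-grams absent from the lookup table are skipped (an assumption).
    """
    total = [0.0] * dim
    for i in range(len(sequence) - n + 1):
        vec = kmer_vectors.get(sequence[i:i + n])
        if vec is None:
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total


# Toy 2-dimensional lookup table standing in for pretrained 3-gram vectors.
toy_vectors = {"MKV": [1.0, 0.0], "KVL": [0.0, 1.0], "VLA": [1.0, 1.0]}
toy_embedding = embed_protein("MKVLA", toy_vectors, dim=2)
toy_lists = shifted_ngrams("MKVLAAGI")
```

For the sequence "MKVLAAGI", the three offsets yield the lists ["MKV", "LAA"], ["KVL", "AAG"], and ["VLA", "AGI"].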
Feature extraction model
We developed three ways to extract the features for each drugtarget pair:

(1) We applied graph mining techniques to compute the drug-target meta-path score features.
(2) We used the embedding FV for each drug and target described in the previous section.
(3) We produced a hybrid model that combines the features from the first two ways.
These features capture the characteristics of the target proteins, drugs, and DTBA network.
Meta-path-based features
After learning the feature representation for each target and drug, we calculated the cosine similarity (CosSim) between the vector representations of two drugs (i.e., the embeddings of drug_{i} and drug_{j}, denoted v_{i} and v_{j}, respectively) for each pair of drugs as follows:
$$CosSim(v_{i}, v_{j}) = \frac{v_{i} \cdot v_{j}}{\Vert v_{i} \Vert \, \Vert v_{j} \Vert} \quad (4)$$
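The cosine similarity between two embedding vectors, and the pairwise similarity matrix built from it, can be sketched in plain Python. The function names and the toy embeddings are illustrative; returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings):
    """Square matrix of pairwise cosine similarities between embedding rows."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]


# Toy embeddings for three entities (hypothetical values).
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_sim = similarity_matrix(toy_embeddings)
```

The resulting matrix is symmetric with ones on the diagonal, so it can be used in the same way as a precomputed similarity matrix.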
Likewise, we applied the same process to the target proteins. We thereby constructed two new similarity matrices: TTsim2, a target-target similarity matrix of size m*m, where m is the number of targets, and DDsim2, a drug-drug similarity matrix of size n*n, where n is the number of drugs. Then, we applied min–max normalization to each similarity graph so that all similarity scores lie between 0 and 1. At this stage, we had created two different extensive weighted heterogeneous DTBA networks with the same binding affinity graph but with different drug-drug and target-target similarity graphs: G1(DTBA, TTsim1 subgraph, DDsim1 subgraph) and G2(DTBA, TTsim2 graph, DDsim2 graph). It is necessary to state that we excluded all test data from our model's training process, including the binding affinity edges that belong to the test set. Then, we applied graph mining techniques similar to^{55} to extract features from these two graphs G1 and G2, and either used the features from each graph individually or combined them. For each of the weighted heterogeneous graphs G1 and G2, we computed the path scores of each drug-target pair as described in Eq. (5):
P = {p_{1}, p_{2}, …, p_{q}, …, p_{l}} is the set of paths from drug_{i} to target_{j}. In our work, we limited the path length to three or fewer because of computational complexity. Thus, we obtained six path structures. Each path structure starts with a drug node and ends with a target node, and there is no cycle in the path (each node in the path appears only once). These path structures are denoted C_{h}, where h ∈ {1, 2, 3, 4, 5, 6}: C_{1}: (D-D-T), C_{2}: (D-T-T), C_{3}: (D-D-D-T), C_{4}: (D-T-T-T), C_{5}: (D-D-T-T), and C_{6}: (D-T-D-T). The meta-path score of each path p_{q} is the product of all weights w_{x} of the edges from d_{i} (drug node) to t_{j} (target node) in a path belonging to structure C_{h}, as follows:
$$score(p_{q}) = \prod_{w_{x} \in p_{q}} w_{x} \quad (6)$$
The weight of a similarity edge between two nodes of the same type indicates the transition probability between those nodes, and the weight of a preprocessed binding affinity edge indicates the probable strength with which a drug binds to a target protein (a larger probability indicates a more significant degree of binding affinity). Therefore, multiplying edge weights in the range [0,1] along a path penalizes longer paths, which fits our objective, whereas summing edge weights would favor longer paths, which conflicts with our goal. Following our previously published work^{55}, we obtained 12 meta-path-based scores by calculating the sum and max scores (i.e., features) under each path structure, as shown in Eqs. (7) and (8), respectively.
where R_{ijh} is the set of all paths under path structure C_{h} between drug_{i} and target_{j}. We applied the same procedure to both G1 and G2, which means our FV dimension is either 12 for a single graph or 24 when we combine the G1 and G2 FVs. It is worth mentioning that the meta-path score features were encoded in commuting matrices as introduced in^{56}, which we calculated by multiplying two or three adjacency matrices. The length of the meta-path equals the number of multiplied adjacency matrices, which are: DTBA, target-target similarity, and drug-drug similarity. Using 3D matrix multiplication accelerates the computation of the meta-path scores.
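The commuting-matrix idea can be sketched with small nested lists. A sum-product of adjacency matrices accumulates the scores of all paths of one structure, and a max-product keeps the best single path; chaining max-products is valid here because all weights are non-negative. This is a simplified illustration under stated assumptions, not the authors' implementation: it does not exclude paths that revisit a node, it does not remove test-set affinity edges, and all function names and toy weights are hypothetical.

```python
def sum_product(a, b):
    """Entry (i, j) adds the scores of all two-step paths i -> k -> j."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def max_product(a, b):
    """Entry (i, j) keeps the best-scoring two-step path i -> k -> j."""
    return [[max(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def chain(op, matrices):
    """Combine a sequence of adjacency matrices left to right with op."""
    result = matrices[0]
    for m in matrices[1:]:
        result = op(result, m)
    return result


def metapath_features(dd, tt, dt):
    """Return {(drug, target): 6 sum scores followed by 6 max scores}."""
    td = [list(col) for col in zip(*dt)]  # transpose: target-drug edges
    structures = [
        [dd, dt],        # C1: D-D-T
        [dt, tt],        # C2: D-T-T
        [dd, dd, dt],    # C3: D-D-D-T
        [dt, tt, tt],    # C4: D-T-T-T
        [dd, dt, tt],    # C5: D-D-T-T
        [dt, td, dt],    # C6: D-T-D-T
    ]
    sums = [chain(sum_product, s) for s in structures]
    maxes = [chain(max_product, s) for s in structures]
    return {(i, j): [s[i][j] for s in sums] + [m[i][j] for m in maxes]
            for i in range(len(dt)) for j in range(len(dt[0]))}


# Toy graph: 2 drugs, 2 targets, zero diagonals (no self-loops).
toy_dd = [[0.0, 0.5], [0.5, 0.0]]
toy_tt = [[0.0, 0.4], [0.4, 0.0]]
toy_dt = [[0.9, 0.0], [0.2, 0.6]]
toy_features = metapath_features(toy_dd, toy_tt, toy_dt)
```

Each drug-target pair receives a 12-dimensional vector from one graph; repeating the procedure on a second graph and concatenating gives 24 dimensions.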
Embedding-based features
After automatically generating the feature representations for each drug and each target using different DL techniques (the seq2seq model for drugs and the ProtVec model for proteins), we obtained two embedding matrices, f and Pv, for drugs and targets, respectively. The drug embedding matrix f has size n × d (n is the number of drugs and d is the feature dimension), and the target embedding matrix Pv has size m × k (m is the number of targets and k is the feature dimension, which is 100). Then, because our goal is to predict the binding values that link drugs and targets, we created an embedding FV for each drug-target pair by applying the concatenation operation, as shown in Fig. 1c. Thus, for any possible drug-target pair, we concatenated the drug FV with the target FV, resulting in a longer FV of dimension d + k (i.e., the number of features), and the number of samples is n × m (i.e., the number of drugs multiplied by the number of targets).
Hybrid model-based features
In this step, we concatenated the meta-path score features with the drug embedding features and target embedding features for all drug-target pairs.
Regression model
After extracting the features of all drug-target pairs, we normalized the features of the training and testing sets separately using min–max normalization^{57}. Then, the normalized FVs were fed with their labels Y (i.e., binding affinity values) into a supervised ML regression model to predict the binding affinity values. Any model's prediction performance relies on identifying the most critical features of the studied dataset. Thus, we developed three models using different sets of features (explained above) and tuned the models' parameters using these feature sets. We named these models Affinity2Vec_Pscore (when we use the graph-based features), Affinity2Vec_Embed (when we use the embedding-based features), and Affinity2Vec_Hybrid (when we combine both sets of features).
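Min–max normalization rescales each feature column to the range [0, 1]. The sketch below assumes a hypothetical helper `min_max_normalize` that is applied to one feature matrix at a time (for example, once to the training FVs and once to the testing FVs); mapping constant columns to zero is a choice made here to avoid division by zero.

```python
import numpy as np

def min_max_normalize(features):
    """Rescale each column of a feature matrix to [0, 1]."""
    features = np.asarray(features, dtype=float)
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns become all zeros
    return (features - col_min) / col_range

# Toy feature matrix: the second column is constant
normalized = min_max_normalize([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
```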
Unlike the well-known parametric linear regression model, which assumes that the target variable can be expressed as a linear combination of the independent variables, gradient boosted trees are nonparametric models that can approximate arbitrary functions from the data. They usually achieve better prediction performance than linear regressors. Hence, we utilized gradient boosting regression, particularly the Extreme Gradient Boosting (XGBoost) regressor^{58}. The XGBoost regressor is implemented using an optimized distributed gradient boosting library named XGBoost^{59} that is highly efficient, flexible, and portable. The objective of XGBoost in the prediction process is to minimize the loss function, which is, in our case, the mean square error (MSE).
We selected XGBoost because:
1. It is much faster than other ensemble regressors.
2. Its core algorithm is parallelizable, meaning that it exploits the power of the multicore machine and works on GPUs.
3. It can be optimized during the validation process to enhance the performance because it has a wide variety of tuning parameters.
In this study, we trained all the Affinity2Vec models with both an ML regressor (XGBoost) and a DL regressor to predict the affinity values. The DL model is a feedforward artificial neural network (ANN) that consists of three fully connected (FC) layers with the rectified linear unit (ReLU) activation function. XGBoost performed better than the DL model in our experiments, which may be a consequence of DL models working better with larger numbers of features. Thus, we report only the XGBoost results. The most critical parameters that we optimized include, but are not limited to: (1) the number of trees in the ensemble, which is usually tuned by adding trees until no further improvement is observed; (2) the maximum depth of each tree; (3) the learning rate (often set to small values such as 0.1 or 0.01); (4) subsample, the fraction of the data samples used to build each tree; and (5) colsample_bytree, the fraction of features used to build each tree, which is set to 1 when all features are used.
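The five parameters listed above can be organized as a search grid. The following configuration sketch uses the parameter names of the XGBoost scikit-learn interface (`n_estimators`, `max_depth`, `learning_rate`, `subsample`, `colsample_bytree`); the candidate values and the helper `iter_param_settings` are illustrative assumptions, not the values tuned in this study.

```python
from itertools import product

# Candidate values are placeholders for illustration only
param_grid = {
    "n_estimators": [200, 500, 1000],   # number of trees in the ensemble
    "max_depth": [4, 6, 8],             # maximum depth of each tree
    "learning_rate": [0.1, 0.01],       # shrinkage applied to each tree
    "subsample": [0.8, 1.0],            # fraction of samples per tree
    "colsample_bytree": [0.8, 1.0],     # fraction of features per tree
}

def iter_param_settings(grid):
    """Yield one dict per combination of the candidate values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

settings = list(iter_param_settings(param_grid))
```

Each resulting dictionary could then be passed as keyword arguments to a regressor such as `xgboost.XGBRegressor(**setting)` and scored on the validation folds.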
Evaluation metrics
To acquire a more realistic view of Affinity2Vec's prediction performance, we used the same evaluation metrics that are widely used to evaluate regression-based state-of-the-art methods. These metrics are: the mean squared error (MSE)^{60}, the concordance index (CI)^{61}, the regression toward the mean index (rm2)^{62}, and the area under the precision-recall curve (AUPR)^{63}.
Mean square error (MSE)
MSE^{60} is the most popular metric for evaluating regression-based models; it measures how close the fitted line is to the actual data points. Thus, MSE is used as the loss function that the training model tries to minimize. MSE is defined as follows:
\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (p_{i} - y_{i})^{2} \]
where p is the vector of predicted values, y is the vector of actual values, and n is the number of samples, so that MSE is the average of the squared differences between the predicted values and the actual values. Squaring ensures that negative differences do not cancel positive ones. The smaller the value of MSE, the better the regression model's performance. The root mean squared error (RMSE), the square root of MSE, is used as a substitute for MSE in several studies.
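As a small worked example of the definition above, the following sketch computes MSE and RMSE with NumPy; the function names and the toy vectors are chosen here for illustration.

```python
import numpy as np

def mse(p, y):
    """Mean of the squared differences between predictions p and actual values y."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((p - y) ** 2))

def rmse(p, y):
    """Square root of the MSE."""
    return float(np.sqrt(mse(p, y)))

# Toy example: only the last prediction is off, by 2
example_mse = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```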
Concordance index (CI)
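CI has been widely used to evaluate the ranking performance of regression models^{61}. The CI of a set of paired data is the probability that the predictions for two randomly drawn pairs with different label values are in the correct order. It is defined as follows:
\[ \mathrm{CI} = \frac{1}{Z} \sum_{d_{x} > d_{y}} h(b_{x} - b_{y}) \]
where b_{x} and b_{y} are the prediction values for the larger affinity d_{x} and the smaller affinity d_{y}, respectively, Z is the normalization constant, and h(m) is the Heaviside step function^{64}, defined as follows:
\[ h(m) = \begin{cases} 1, & \text{if } m > 0 \\ 0.5, & \text{if } m = 0 \\ 0, & \text{if } m < 0 \end{cases} \]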
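The CI definition can be sketched directly with pairwise comparisons. This minimal version assumes that the normalization constant Z is the number of pairs with different true affinities, and it uses O(n^2) memory, so it is only meant for small illustrative inputs; the function name is hypothetical.

```python
import numpy as np

def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predictions are correctly ordered."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # comparable[x, y] is True when the true affinity of x is larger than that of y
    comparable = y_true[:, None] > y_true[None, :]
    diff = y_pred[:, None] - y_pred[None, :]
    # Heaviside step: 1 for correct order, 0.5 for tied predictions, 0 otherwise
    h = np.where(diff > 0, 1.0, np.where(diff == 0, 0.5, 0.0))
    z = comparable.sum()
    if z == 0:
        raise ValueError("CI is undefined when all true values are equal")
    return float((h * comparable).sum() / z)

ci_perfect = concordance_index([1.0, 2.0, 3.0], [0.1, 0.2, 0.3])
ci_reversed = concordance_index([1.0, 2.0, 3.0], [0.3, 0.2, 0.1])
ci_tied = concordance_index([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```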
Regression toward the Mean (rm2 index)
The rm2^{65} has been extensively used to validate regression-based quantitative structure–activity relationship (QSAR) models. rm2 is a modified version of the squared correlation coefficient (r^{2}), also known as the coefficient of determination, and is used to evaluate the external predictive potential of the binding affinity model. An rm2 greater than 0.5 on the test data indicates acceptable performance. rm2 is defined using two other evaluation metrics: r^{2}^{66}, the proportion of variation in the outcome explained by the predictor variables, which corresponds to the squared correlation between the actual values and the predicted values in multiple regression models, and \({r}_{0}^{2}\), the squared correlation coefficient with zero intercept:
\[ r_{m}^{2} = r^{2} \times \left( 1 - \sqrt{r^{2} - r_{0}^{2}} \right) \]
More details of the formulation are explained in^{65,66}. The higher the r^{2} and rm2, the better the performance of the regressor. Note that we also used r^{2} in a nonparametric statistical approach called Y-Randomization (also known as Y-Scrambling)^{67}, which served as the final validation step.
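A minimal sketch of the rm2 index is given below. It assumes that r^{2} is the squared Pearson correlation, that \({r}_{0}^{2}\) is obtained from a regression line through the origin with slope k, and that the absolute difference is taken under the square root to guard against small negative values; these are common implementation choices, and the exact formulation should be checked against^{65,66}.

```python
import numpy as np

def r_squared(y_obs, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    return float(np.corrcoef(y_obs, y_pred)[0, 1] ** 2)

def r0_squared(y_obs, y_pred):
    """Squared correlation coefficient for a fit forced through the origin (assumed form)."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    k = np.sum(y_obs * y_pred) / np.sum(y_pred * y_pred)  # zero-intercept slope
    ss_res = np.sum((y_obs - k * y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rm2(y_obs, y_pred):
    """rm2 = r^2 * (1 - sqrt(|r^2 - r0^2|))."""
    r2 = r_squared(y_obs, y_pred)
    r02 = r0_squared(y_obs, y_pred)
    return float(r2 * (1.0 - np.sqrt(abs(r2 - r02))))

y_obs = np.array([5.0, 6.0, 7.0, 8.0])
y_pred = np.array([5.2, 5.9, 7.3, 7.8])
rm2_value = rm2(y_obs, y_pred)
```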
Area under precision-recall curve (AUPR)
The AUPR evaluation metric^{63} is frequently used for binary classification problems and is known to perform well with imbalanced data, since it differentiates the predicted scores of the positive data samples from the predicted scores of the negative data samples. We utilized AUPR to evaluate Affinity2Vec's performance in binary classification prediction. The closer the value of AUPR is to 1, the better the performance. We can easily convert our datasets into binary form by setting a specific binding affinity threshold for each dataset separately. Thus, if the binding affinity value is above the threshold, it is transformed to 1 (known interaction); otherwise, it is set to 0. Our study follows the thresholds suggested in prior work^{23,39,68}, where the threshold values are 7 and 12.1 for the Davis and KIBA datasets, respectively. For the PDBBind Refined dataset, we used 6, 7, and 10 as the thresholds, following previous work^{44}. We calculated the AUPR using each threshold and then calculated the average AUPR.
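The thresholding and averaging steps can be sketched as follows. This illustration approximates the area under the precision-recall curve by average precision (the mean of the precision values at the ranks of the positive samples), which is one common estimator and an assumption here; the function names and toy values are hypothetical.

```python
import numpy as np

def binarize(affinities, threshold):
    """Label 1 when the affinity is above the threshold, otherwise 0."""
    return (np.asarray(affinities, dtype=float) > threshold).astype(int)

def average_precision(labels, scores):
    """Average precision, used here as an estimate of AUPR."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest predicted score first
    sorted_labels = labels[order]
    n_pos = sorted_labels.sum()
    if n_pos == 0:
        raise ValueError("no positive samples at this threshold")
    precision_at_rank = np.cumsum(sorted_labels) / np.arange(1, len(labels) + 1)
    return float((precision_at_rank * sorted_labels).sum() / n_pos)

def mean_aupr(true_affinity, predicted_affinity, thresholds):
    """Average the AUPR estimate over several binarization thresholds."""
    values = [average_precision(binarize(true_affinity, t), predicted_affinity)
              for t in thresholds]
    return float(np.mean(values))

# Toy values; 7 is the Davis threshold mentioned in the text
true_aff = [5.0, 8.0, 6.0, 9.0]
pred_aff = [5.5, 7.5, 6.5, 8.5]
aupr_at_7 = average_precision(binarize(true_aff, 7), pred_aff)
```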
Experimental settings
We applied two settings to evaluate our models. For setting-1, like the state-of-the-art methods, we utilized the nested cross-validation (CV) approach to evaluate Affinity2Vec. The advantage of the nested CV approach over regular CV is that it attempts to reduce bias and overfitting to the training data, because the hyperparameters of the regressors are tuned within the training stage only. The nested CV procedure that we applied is as follows: we randomly partitioned the dataset into six equal sets, one of which was used as a hold-out set for testing, while the other five sets were used for training with fivefold CV. Hence, each of the four reported evaluation metrics is the average of the five models' results on the hold-out test data. In addition, as mentioned before, we eliminated all binding affinity edges that belong to the test data from the graph that we formulated. To make the performance comparison as fair and unbiased as possible, we used the same benchmark datasets (Davis and KIBA) and the same nested CV setting, applied the same splits of the training and testing data as the current state-of-the-art methods, and utilized the same four evaluation metrics.
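The partitioning described above can be sketched as follows. This is a minimal illustration with a hypothetical helper `nested_cv_splits` and a random permutation; the study itself reuses the published splits of the state-of-the-art methods rather than generating new ones.

```python
import numpy as np

def nested_cv_splits(n_samples, n_parts=6, seed=0):
    """Split sample indices into one hold-out part and (n_parts - 1) CV folds."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(n_samples), n_parts)
    test_idx = parts[0]   # hold-out set, never used for training or tuning
    cv_parts = parts[1:]  # remaining parts form the fivefold CV
    folds = []
    for i, val_idx in enumerate(cv_parts):
        train_idx = np.concatenate(
            [part for j, part in enumerate(cv_parts) if j != i]
        )
        folds.append((train_idx, val_idx))
    return test_idx, folds

test_idx, folds = nested_cv_splits(60)
```

Each of the five models trained on a fold's training indices would then be scored on `test_idx`, and the five scores averaged.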
For setting-2, we followed the same process as prior work^{42}, using a time-based splitting protocol, the same dataset (PDBBind Refined v.2015), and the same evaluation metrics. Thus, we split the dataset into three subsets (training, validation, and test sets) based on the publication years provided with the dataset. For the training data, we used all protein–ligand complexes from 2011 and earlier. For the validation data, we used the protein–ligand complexes from 2012, while for the test data, we used all complexes from 2013 and later. Therefore, we obtained 2188, 312, and 547 data samples (ligand–protein complexes) for the training, validation, and testing sets, respectively.
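The time-based protocol can be sketched in a few lines. The record layout (a dictionary with a `year` field and an `id` field) and the function name are assumptions made for this illustration; the identifiers below are placeholders, not entries from the dataset.

```python
def time_based_split(records, train_until=2011, validation_year=2012):
    """Split complexes by publication year into train, validation, and test lists."""
    train, validation, test = [], [], []
    for record in records:
        year = record["year"]
        if year <= train_until:
            train.append(record)        # 2011 and earlier
        elif year == validation_year:
            validation.append(record)   # exactly 2012
        else:
            test.append(record)         # 2013 and later
    return train, validation, test

# Placeholder records for illustration
records = [
    {"id": "c1", "year": 2009},
    {"id": "c2", "year": 2011},
    {"id": "c3", "year": 2012},
    {"id": "c4", "year": 2013},
    {"id": "c5", "year": 2015},
]
train_set, validation_set, test_set = time_based_split(records)
```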
We executed all our experiments on a workstation running Linux Ubuntu 18.04.5 LTS (64-bit) with Intel Xeon Platinum 8176 CPUs (112 processors) and 2 GPUs (Quadro and Titan) with CUDA version 11.0. For implementation, we used Python 3.8 and several libraries, including Keras^{69} for DL, DeepChem for the seq2seq fingerprint model^{51}, ProtVec^{53} for protein embeddings, and XGBoost^{59} for regression.
Results and discussion
Here we explain the results obtained for the three benchmark datasets under two settings using the three models we developed. We further compare the performance of our method, Affinity2Vec, with selected state-of-the-art methods and discuss the key strengths that might enhance its performance.
Affinity2Vec models’ performance
We conducted several experiments to predict DTBA. Table 3 reports the best results obtained among all variants of Affinity2Vec in terms of MSE and CI on the Davis and KIBA datasets, together with the number of features used in each model. We assessed three versions of the Affinity2Vec models using distinct sets of FVs for each dataset separately. This process allowed us to test our method's performance in multiple experiments and then select the model with the best results. Supplementary Table S1 provides the other evaluation metrics (rm2 and AUPR) for all the Affinity2Vec models.
As shown in Table 3, we obtained the worst MSE and CI for both datasets when we only concatenated the FVs of the drugs' SMILES embeddings and the targets' sequence embeddings and fed them to the XGBoost model. However, we observe a slight improvement in prediction performance when we include the embeddings in the hybrid models (compared to the Affinity2Vec_Pscore models). We further observe that the Affinity2Vec_Pscore models perform better on both datasets. The better results, however, are achieved with G2, which incorporates the embeddings into the meta-path score features in the form of cosine similarity for all drug pairs and all target pairs (specifically for the KIBA dataset). This demonstrates that processing the auto-generated features, instead of simply concatenating them, makes them more meaningful and informative and contributes positively to the prediction. Furthermore, it shows the importance of formulating DTBA prediction as a graph-based problem and the power of graph mining to obtain different meta-path scores.
The reason G2 does not perform well on the Davis dataset may be that DL works better with larger datasets such as the KIBA dataset used in our work. This suggests that both the seq2seq and ProtVec models capture the most significant features in the larger dataset (i.e., KIBA). Accordingly, all models that used DL in some steps showed better performance on the KIBA dataset.
Moreover, the Affinity2Vec_Pscore model showed different performance on the two datasets depending on whether we used the meta-path scores from G1 or G2. The only difference between G1 and G2 is the source of the similarity graphs for drugs and targets. Thus, the similarity types play an essential role in the graph-based features. Furthermore, the Davis dataset results show that using DDsim1 and TTsim1 (from the 2D chemical structure and normalized SW alignment scores for drugs and targets, respectively) enhances the performance when they are involved in the FVs in four versions of our model. In contrast, for the KIBA dataset, DDsim2 and TTsim2 (from SMILES embeddings cosine similarity and amino-acid sequence embeddings cosine similarity) improved the performance when they were involved in the FVs. As a result, choosing the most informative similarity to integrate with the DTBA weighted heterogeneous network is a fundamental and critical step in calculating the path score features. Besides, we observe that the Affinity2Vec_Pscore model outperformed the other model variants on the KIBA dataset. The importance of the meta-path score features is apparent from the performance results, which show that Affinity2Vec_Hybrid, which combines both types of features, performs very similarly to the Affinity2Vec_Pscore model. This is despite Affinity2Vec_Pscore only having 12/24 features and Affinity2Vec_Embed having 228/356 features. Thus, it seems the meta-path score features dominate the results. This suggests that the graph-based features, which were constructed based on the popular "guilt-by-association" assumption (i.e., similar drugs tend to interact with similar targets and vice versa), can exploit the graph structure to provide more meaningful information that extends beyond the number of features used. This characteristic significantly improved the DTBA prediction performance in our work.
As a final step, we further validated the robustness and stability of our ML models by demonstrating that the strong performance was not obtained by chance. That is, we implemented Y-Randomization^{67}, a test used to validate quantitative structure/property-activity relationship (QSPR/QSAR) models. We performed this test by comparing our model (trained on the original dataset) to several versions of our model trained on randomly shuffled datasets. Briefly, for the Davis and KIBA datasets separately, we first trained our model using the original data (i.e., features and labels) and obtained the results. Second, for 100 iterations, we fixed the FVs but shuffled the labels, then trained the model on the new feature-label pairs and obtained its performance. We performed the evaluations on the hold-out test data in terms of r^{2} (explained in “Evaluation metrics”), an evaluation metric commonly used to measure goodness of fit^{67}. With these data, we demonstrate that Affinity2Vec obtains statistically significant (P-values < 0.05) prediction performance in terms of r^{2} on the Davis dataset with P-value = 1.47e−18 (r^{2} = 0.71) and on the KIBA dataset with P-value = 8.57e−20 (r^{2} = 0.78), compared to the 100 randomized models for each dataset, which achieved nonsignificant results (P-values ≥ 0.1 and negative r^{2} values). A negative r^{2} indicates poor performance and that the model does not follow the data trend (i.e., its predictions are no better than chance). The very low P-values indicate a dependency between the features and the labels, and our ML regressor exploited these correlations to obtain good results.
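The Y-Randomization procedure can be sketched as follows. To keep the example self-contained, it uses ordinary least squares as a stand-in for the XGBoost regressor and computes r^{2} as the coefficient of determination; both choices, as well as the function names, are assumptions made for illustration, and the P-value computation is not shown.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination; negative when worse than predicting the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def fit_predict(x_train, y_train, x_test):
    """Stand-in regressor (least squares with intercept), not the model used in the study."""
    a_train = np.column_stack([x_train, np.ones(len(x_train))])
    coef, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
    a_test = np.column_stack([x_test, np.ones(len(x_test))])
    return a_test @ coef

def y_randomization(x_train, y_train, x_test, y_test, n_iter=100, seed=0):
    """Compare r2 of the original model with models trained on shuffled labels."""
    rng = np.random.default_rng(seed)
    original = r2_score(y_test, fit_predict(x_train, y_train, x_test))
    shuffled = np.array([
        r2_score(y_test, fit_predict(x_train, rng.permutation(y_train), x_test))
        for _ in range(n_iter)
    ])
    return original, shuffled
```

If the features carry real signal, the original r^{2} should lie far above the distribution of the shuffled-label r^{2} values.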
Performance comparison under setting-1
For the comparison with state-of-the-art methods, we considered non-structure-based methods. These include the first ML methods developed for DTBA prediction, KronRLS^{20} and SimBoost^{21}, and the state-of-the-art DL methods DeepDTA^{23}, GANsDTA^{68}, and DeepCDA^{27}. The results reported for the state-of-the-art methods were retrieved directly from the original publications, except for KronRLS and SimBoost, whose results were taken from^{23}, which used the same performance measurement settings and the same datasets (Davis and KIBA). Furthermore, we report only the best-performing version of each state-of-the-art method and do not include the results for all versions.
Figure 4 shows that Affinity2Vec (in brown) outperformed all the state-of-the-art methods by obtaining the best average for each evaluation metric across both datasets. Affinity2Vec achieved the highest average CI, rm2, and AUPR, which are around 1.6%, 6.6%, and 5.4% higher than the second-highest average CI, rm2, and AUPR, respectively. It also has the best average MSE across both datasets, which is 3.6% lower than that of the second-best method, DeepCDA. Averaging each evaluation metric across all datasets is important because it provides an assessment of overall method performance that is less dependent on any single dataset.
Tables 4 and 5 show the average MSE, CI, rm2, and AUPR for our method and five state-of-the-art methods on the Davis and KIBA datasets, obtained over fivefold CV on the hold-out test set. On the Davis dataset (see Table 4), our method, Affinity2Vec_Hybrid, achieved the best performance for two evaluation metrics, yielding the lowest MSE of 0.24 (considered the most important metric in regression-based problems) and the highest rm2 of 0.693. Affinity2Vec_Hybrid is also the second-best method in terms of CI and AUPR, with competitive values that are lower than those of the best-performing method (DeepCDA) by 0.005 and 0.04, respectively.
On the KIBA dataset (see Table 5), both Affinity2Vec_Pscore and Affinity2Vec_Hybrid outperformed all five state-of-the-art methods in all evaluation metrics. Thus, we here compare the first-ranked method to the third-ranked method, DeepCDA, which is the best of the competing methods. Specifically, the first-ranked version, Affinity2Vec_Pscore, improved the performance by 3.4%, 10%, and 11.4% for CI, rm2, and AUPR, respectively, compared to DeepCDA. The improvement in CI is good, while both rm2 and AUPR increased considerably compared with DeepCDA. Moreover, Affinity2Vec_Pscore produced the lowest error, decreasing the MSE to 0.112, a notable difference from DeepCDA, whose MSE is 6.5% higher.
Furthermore, we demonstrated the efficiency of our method by listing the top five drug-protein pairs with the strongest binding (i.e., lowest affinity values), as predicted by the best Affinity2Vec model on the hold-out test set. Table 6 shows these drug-protein pairs along with their predicted and true affinities and other information. The predicted values are very close to the actual values, which indicates the robustness of our method. Moreover, in the KIBA dataset, the actual and predicted values of the first-ranked predicted pair are not very close, but this is the drug-target pair with the strongest actual binding affinity, and it is still recognized as such based on the predicted value, which again reflects the robustness of our method.
To further provide confidence in the predicted values, we also plotted the predicted affinity values of the best Affinity2Vec model against the actual affinity values for both the Davis and KIBA datasets using the hold-out test set. Figure 5 shows how close the predicted values (in blue) are to the actual values (in red). For an ideal predictive model, the predictions (p) equal the actual values (y) (i.e., the points lie on or close to the x = y line). The results suggest that our model performs much better on the KIBA dataset, where the density is high around the x = y line.
Performance evaluation and comparison using PDBBind dataset under setting-2
In this section, we assessed our method's performance in predicting binding affinity values using the benchmark PDBBind Refined dataset. Although the PDBBind Refined dataset provides 3D structural information, we excluded this information to demonstrate that our method offers reasonable predictions on this experimentally validated dataset without using the 3D data (thereby eliminating the bottleneck associated with 3D structures). For a comprehensive evaluation, we assessed all three Affinity2Vec models (Pscore, Embed, and Hybrid) using the PDBBind Refined dataset under setting-2 (i.e., a time-based dataset split) and compared them with two baseline methods that used the same dataset and the same setting. It is important to note that we have one Affinity2Vec model for each method version because, for the PDBBind Refined dataset, we obtained only one graph G(V, E), based on the cosine similarity of the ligands' SMILES embeddings and the cosine similarity of the proteins' sequence embeddings.
Using the PDBBind Refined dataset and setting-2, we compared Affinity2Vec's performance with state-of-the-art methods that applied the same experimental setup: MoleculeNet^{7}, a benchmarking platform designed to evaluate computational methods for multiple molecular tasks, and MDeePred^{8}, a multi-channel protein featurization DL-based method. We included four comparison methods from MoleculeNet since it utilized random forest (RF) and deep neural networks (DNN) with grid featurization (GridF) and extended connectivity fingerprints (ECFP). The performance results of our models and these baseline models in terms of RMSE, CI, Spearman, and AUPR were based on the independent test set and are shown in Table 7. Note that the baseline methods used with the PDBBind dataset in Table 7 differ from the baseline methods (KronRLS, SimBoost, DeepDTA, GANsDTA, and DeepCDA) that we compared against on the Davis and KIBA datasets. All of the latter methods require several precalculations to train the models that were not available for the PDBBind Refined dataset. Furthermore, the predicted and the actual affinity values for the MoleculeNet methods were only available for the PDBBind Refined dataset. Therefore, all reported results for these baseline methods on the PDBBind Refined dataset in this setting were taken from previous work^{8}.
The results in Table 7 show that all versions of our method performed well on the independent PDBBind Refined test set under setting-2. Among our methods, the Affinity2Vec_Embed model performed the best in all evaluation metrics (RMSE, CI, and average AUPR). In contrast, the Affinity2Vec_Pscore model performed the worst, possibly because the graph we used to extract meta-path scores is very sparse due to the limited number of available affinity values in the PDBBind Refined dataset compared to the Davis and KIBA datasets. Moreover, Affinity2Vec offered the best and the second-best performances compared to the state-of-the-art methods, lowering the RMSE by 9.1% compared to the third-best method (MDeePred). In terms of CI, Affinity2Vec achieved competitive results, taking third place with a CI 0.002 lower than that of the second-best method (MoleculeNet GridRF). Finally, in terms of average AUPR, the Affinity2Vec Embed and Hybrid models achieved the best and second-best results, which indicates that our methods work well as classifiers after converting the continuous output to binary output.
In summary, based on all the reported results, Affinity2Vec showed superior performance on the larger dataset, KIBA, outperforming all state-of-the-art methods in all evaluation metrics, which indicates that Affinity2Vec works better with large-scale biomedical data. Affinity2Vec also performed on par with the best-performing models on a smaller dataset such as Davis. Furthermore, Affinity2Vec's performance was competitive with the state-of-the-art methods on the PDBBind Refined dataset (under a time-based split setting), where the test data was unseen during the training stage. Also, Affinity2Vec showed steady performance across various affinity thresholds in terms of average AUPR.
Key strength of Affinity2Vec
The reasons behind Affinity2Vec outperforming several of the state-of-the-art methods are manifold. Firstly, some methods treat the drug itself as a molecular graph but do not formulate the entire problem as graph-based or construct a heterogeneous graph using several data sources. Thus, as far as we are aware, this is the first network-based approach that predicts binding affinity, and no published work deals with the problem in a graph-based manner. It is essential to mention that most of the DL baseline methods used different DL representation learning techniques to auto-generate features (i.e., embeddings) for drugs and target proteins, as we also did. However, the additional steps that we applied gave our method significant advantages over the baseline DL methods. These advantages are:

We exploited the power of well-designed DL models developed in the biomedical domain specifically for drug embeddings or target embeddings by utilizing them to fit our goal of generating high-quality feature representations that help enhance the prediction performance.

The post-processing of auto-generated features gives them more meaning and more reliability. We post-processed these features by calculating the cosine similarity of the drugs' embeddings and of the targets' embeddings and then incorporating those similarities when formulating the new weighted heterogeneous graph. Integrating these new similarities into the graph provides more information for extracting meta-path score features after removing all low-score edges by setting cutoff thresholds. This whole process is the distinct characteristic of our work that markedly improved the performance.
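The post-processing described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper `cosine_similarity_graph`, a non-negative cutoff chosen by the user, and the removal of self-loops; none of these details are specified in the text.

```python
import numpy as np

def cosine_similarity_graph(embeddings, cutoff):
    """Pairwise cosine similarity with low-score edges removed.

    embeddings: (n, d) matrix of drug or target embeddings
    cutoff: similarities below this value are set to 0 (edge removed)
    """
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.where(norms == 0, 1.0, norms)  # guard against zero vectors
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)   # no self-loops (an assumption of this sketch)
    sim[sim < cutoff] = 0.0      # drop low-score edges
    return sim

# Toy embeddings: the first two are identical, the third is orthogonal to them
sim_graph = cosine_similarity_graph([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]], cutoff=0.5)
```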
Conclusion
It is more informative to predict the strength with which a drug binds to a target than to only indicate the binary relation between the drug and its target. Thus, we developed Affinity2Vec, a novel network-based method that treats DTBA prediction as a regression-based problem. We first built a weighted heterogeneous network by integrating the DTBA graph, the target-target similarity graph, and the drug-drug similarity graph. We applied several preprocessing steps to those graphs before the integration step. Our method, Affinity2Vec, combines several techniques from representation learning (i.e., embedding), DL, graph mining, and ML. DL was used in the seq2seq model and the ProtVec n-gram model to automatically generate feature representations for drugs and target proteins, respectively. We applied graph mining to obtain meta-path score features, and then we fed the FVs to the ML regression model (specifically ensemble learning) for prediction. We performed a comprehensive evaluation using three benchmark datasets (Davis, KIBA, and PDBBind Refined) under two different settings, and Affinity2Vec showed consistent performance compared to several state-of-the-art methods, which illustrates its robustness.
Despite our method's accurate prediction and high performance, Affinity2Vec still has some weaknesses that limit its capability and optimal performance. Our method is limited to predicting the binding affinity of random drug-target pairs. Therefore, it cannot identify interactions for new targets or new drugs, and we intend to handle this limitation in future work. Also, we tested Affinity2Vec using three benchmark datasets, and it is crucial to apply our method to more large-scale datasets that are closer to actual scenarios and better fit our objective of drug repositioning. Last, we represented the drugs as sequences (SMILES) instead of using the molecular graph representation, which is more informative.
For future work, we will work to enhance the prediction performance by utilizing different embedding techniques (e.g., graph embedding^{70}, knowledge graph embedding^{71}, and sequence embedding^{72}). We will explore our model's interpretability, which would allow us to highlight the most critical features that we obtain from DL methods. We intend to apply Affinity2Vec to a real-life case study of drug repurposing and then verify selected predictions of our model experimentally to confirm the clinical relevance of our results. The last notable point is that our network-based method can be a generic solution to any similar problem formulated as a graph-based problem in biomedical domains, such as drug-disease interaction networks and protein–protein interaction networks.
Data availability
The Davis dataset is publicly available at (accessed May 2020): http://staff.cs.utu.fi/~aatapa/data/DrugTarget/. Based on a previously published study, the KIBA dataset is publicly available at: https://github.com/hkmztrk/DeepDTA/tree/master/data/kiba. The PDBBind Refined dataset can be found at (accessed September 2021): http://www.pdbbind.org.cn (registration is required to download the data), and at https://github.com/cansyl/MDeePred/tree/master/training_files/PDBBind_Refined. The source code of the Affinity2Vec methods and all three datasets are available in a GitHub repository at: https://github.com/MahaThafar/Affinity2Vec.
Abbreviations
DTIs: Drug-target interactions
DTBA: Drug-target binding affinity
ML: Machine learning
DL: Deep learning
SW: Smith-Waterman
KIBA: Kinase Inhibitor BioActivity
SMILES: Simplified Molecular-Input Line-Entry System
aaseq: Amino-acid sequence
3D: Three-dimensional
NN: Neural network
NLP: Natural language processing
LSTM: Long short-term memory
GRU: Gated Recurrent Unit
FV: Feature vectors
TTsim: Target-target similarity matrix
DDsim: Drug-drug similarity matrix
DNN: Deep neural network
CNN: Convolutional neural network
seq2seq: Sequence-to-sequence
ANN: Artificial neural network
ReLU: Rectified linear activation
FC: Fully connected
XGBoost: EXtreme Gradient Boosting
MSE: Mean square error
CI: Concordance index
rm2: Regression toward the mean index
AUPR: The area under the precision-recall curve
CV: Cross-validation
References
Paul, S. M. et al. How to improve R&D productivity: The pharmaceutical industry’s grand challenge. Nat. Rev. Drug Discov. 9, 203–214. https://doi.org/10.1038/nrd3078 (2010).
Pushpakom, S. et al. Drug repurposing: Progress, challenges and recommendations. Nat. Rev. Drug Discov. 18, 41–58. https://doi.org/10.1038/nrd.2018.168 (2019).
Chen, X. et al. Drug–target interaction prediction: Databases, web servers and computational models. Brief. Bioinform. 17, 696–712. https://doi.org/10.1093/bib/bbv066 (2015).
Thafar, M., Raies, A. B., Albaradei, S., Essack, M. & Bajic, V. B. Comparison study of computational prediction tools for drugtarget binding affinities. Front. Chem. 7, 782. https://doi.org/10.3389/fchem.2019.00782 (2019).
Alonso, H., Bliznyuk, A. A. & Gready, J. E. Combining docking and molecular dynamic simulations in drug design. Med. Res. Rev. 26, 531–568. https://doi.org/10.1002/med.20067 (2006).
Kontoyianni, M. Docking and virtual screening in drug discovery. Methods Mol. Biol. 1647, 255–266. https://doi.org/10.1007/9781493972012_18 (2017).
Mousavian, Z. & MasoudiNejad, A. Drug–target interaction prediction via chemogenomic space: Learningbased methods. Expert Opin. Drug Metab. Toxicol. 10, 1273–1287. https://doi.org/10.1517/17425255.2014.950222 (2014).
Jung, L. S. & Cho, Y.R. Survey of networkbased approaches of drugtarget interaction prediction. 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). https://doi.org/10.1109/bibm49941.2020.9313222 (2020).
Wu, Z., Li, W., Liu, G. & Tang, Y. Networkbased methods for prediction of drugtarget interactions. Front. Pharmacol. 9, 1134. https://doi.org/10.3389/fphar.2018.01134 (2018).
Bagherian, M. et al. Machine learning approaches and databases for prediction of drug–target interaction: A survey paper. Brief. Bioinform. 22, 247–269. https://doi.org/10.1093/bib/bbz157 (2021).
Ezzat, A., Wu, M., Li, X.L. & Kwoh, C.K. Computational prediction of drugtarget interactions using chemogenomic approaches: An empirical survey. Brief. Bioinform. https://doi.org/10.1093/bib/bby002 (2018).
Nath, A., Kumari, P. & Chaube, R. Prediction of human drug targets and their interactions using machine learning methods: Current and future perspectives. Methods Mol. Biol. 1762, 21–30. https://doi.org/10.1007/9781493977567_2 (2018).
Sachdev, K. & Gupta, M. K. A comprehensive review of feature based methods for drug target interaction prediction. J. Biomed. Inform. 93, 103159. https://doi.org/10.1016/j.jbi.2019.103159 (2019).
Alshahrani, M. & Hoehndorf, R. Drug Repurposing Through Joint Learning on Knowledge Graphs and Literature. (University Library, 2018).
Thafar, M. A. et al. Computational Drug-target Interaction Prediction based on Graph Embedding and Graph Mining. In Proceedings of the 2020 10th International Conference on Bioscience, Biochemistry and Bioinformatics 14–21. https://doi.org/10.1145/3386052.3386062.
Mohamed, S. K., Nováček, V. & Nounu, A. Discovering protein drug targets using knowledge graph embeddings. Bioinformatics https://doi.org/10.1093/bioinformatics/btz600 (2019).
Kairys, V., Baranauskiene, L., Kazlauskiene, M., Matulis, D. & Kazlauskas, E. Binding affinity in drug design: Experimental and computational techniques. Expert Opin. Drug Discov. 14, 755–768. https://doi.org/10.1080/17460441.2019.1623202 (2019).
Sotriffer, C. & Matter, H. The challenge of affinity prediction: Scoring functions for structure-based virtual screening. Methods Princ. Med. Chem. https://doi.org/10.1002/9783527633326.ch7 (2011).
Wang, X., Kuwahara, H. & Gao, X. Modeling DNA affinity landscape through two-round support vector regression with weighted degree kernels. BMC Syst. Biol. 8(Suppl 5), S5. https://doi.org/10.1186/1752-0509-8-S5-S5 (2014).
Pahikkala, T. et al. Toward more realistic drug-target interaction predictions. Brief. Bioinform. 16, 325–337. https://doi.org/10.1093/bib/bbu010 (2015).
He, T., Heidemeyer, M., Ban, F., Cherkasov, A. & Ester, M. SimBoost: A read-across approach for predicting drug–target binding affinities using gradient boosting machines. J. Cheminform. https://doi.org/10.1186/s13321-017-0209-z (2017).
Shim, J., Hong, Z.-Y., Sohn, I. & Hwang, C. Prediction of drug–target binding affinity using similarity-based convolutional neural network. Sci. Rep. 11, 1–9. https://doi.org/10.1038/s41598-021-83679-y (2021).
Öztürk, H., Özgür, A. & Ozkirimli, E. DeepDTA: Deep drug-target binding affinity prediction. Bioinformatics 34, i821–i829. https://doi.org/10.1093/bioinformatics/bty593 (2018).
Karimi, M., Wu, D., Wang, Z. & Shen, Y. DeepAffinity: Interpretable deep learning of compound-protein affinity through unified recurrent and convolutional neural networks. Bioinformatics https://doi.org/10.1093/bioinformatics/btz111 (2019).
Zhao, Q., Xiao, F., Yang, M., Li, Y. & Wang, J. AttentionDTA: Prediction of drug–target binding affinity using attention model. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 64–69. https://doi.org/10.1109/BIBM47256.2019.8983125.
Huang, K. et al. DeepPurpose: A deep learning library for drug-target interaction prediction. Bioinformatics https://doi.org/10.1093/bioinformatics/btaa1005 (2020).
Abbasi, K. et al. DeepCDA: Deep cross-domain compound–protein affinity prediction through LSTM and convolutional neural networks. Bioinformatics 36, 4633–4642. https://doi.org/10.1093/bioinformatics/btaa544 (2020).
Zhao, L., Xie, P., Hao, L., Li, T. & Wang, C. Gene Ontology aided compound protein binding affinity prediction using BERT encoding. 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), https://doi.org/10.1109/bibm49941.2020.9312985 (2020).
Nguyen, T., Le, H. & Venkatesh, S. GraphDTA: Prediction of drug–target binding affinity using graph convolutional networks. Bioinformatics https://doi.org/10.1101/684662 (2019).
Tsubaki, M., Tomii, K. & Sese, J. Compound–protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics 35, 309–318. https://doi.org/10.1093/bioinformatics/bty535 (2019).
Alshahrani, M., Thafar, M. A. & Essack, M. Application and evaluation of knowledge graph embeddings in biomedical data. PeerJ Comput. Sci. 7, e341. https://doi.org/10.7717/peerj-cs.341 (2021).
Olayan, R. S., Ashoor, H. & Bajic, V. B. DDR: Efficient computational method to predict drug–target interactions using graph mining and machine learning approaches. Bioinformatics 34, 1164–1173. https://doi.org/10.1093/bioinformatics/btx731 (2018).
Cheng, F., Zhou, Y., Li, W., Liu, G. & Tang, Y. Prediction of chemical-protein interactions network with weighted network-based inference method. PLoS ONE 7, e41064. https://doi.org/10.1371/journal.pone.0041064 (2012).
Zeng, X. et al. Network-based prediction of drug–target interactions using an arbitrary-order proximity embedded deep forest. Bioinformatics 36, 2805–2812. https://doi.org/10.1093/bioinformatics/btaa010 (2020).
Zhou, D., Xu, Z., Li, W., Xie, X. & Peng, S. MultiDTI: Drug–target interaction prediction based on multimodal representation learning to bridge the gap between new chemical entities and known heterogeneous network. Bioinformatics https://doi.org/10.1093/bioinformatics/btab473 (2021).
Thafar, M. A. et al. DTi2Vec: Drug-target interaction prediction using network embedding and ensemble learning. J. Cheminform. 13, 71. https://doi.org/10.1186/s13321-021-00552-w (2021).
Davis, M. I. et al. Comprehensive analysis of kinase inhibitor selectivity. Nat. Biotechnol. 29, 1046–1051. https://doi.org/10.1038/nbt.1990 (2011).
Tang, J. et al. Making sense of large-scale kinase inhibitor bioactivity data sets: A comparative and integrative analysis. J. Chem. Inf. Model. 54, 735–743. https://doi.org/10.1021/ci400709d (2014).
He, T. et al. SimBoost: A read-across approach for predicting drug–target binding affinities using gradient boosting machines. J. Cheminform. https://doi.org/10.1186/s13321-017-0209-z (2017).
Liu, Z. et al. PDB-wide collection of binding data: Current status of the PDBbind database. Bioinformatics 31, 405–412. https://doi.org/10.1093/bioinformatics/btu626 (2015).
Wu, Z. et al. MoleculeNet: A benchmark for molecular machine learning. Chem. Sci. 9, 513–530. https://doi.org/10.1039/c7sc02664a (2018).
Rifaioglu, A. S. et al. MDeePred: Novel multi-channel protein featurization for deep learning-based binding affinity prediction in drug discovery. Bioinformatics 37, 693–704. https://doi.org/10.1093/bioinformatics/btaa858 (2021).
Zhou, J. et al. Distance-Aware Molecule Graph Attention Network for Drug-Target Binding Affinity Prediction. arXiv [q-bio.QM] (2020).
Leach, A. R. & Gillet, V. J. An Introduction to Chemoinformatics (Springer Science & Business Media, 2003).
Bolton, E. E., Wang, Y., Thiessen, P. A. & Bryant, S. H. In Annual Reports in Computational Chemistry Vol. 4 (eds Wheeler, R. A. & Spellmeyer, D. C.) 217–241 (Elsevier, 2008).
Lovrić, M., Molero, J. M. & Kern, R. PySpark and RDKit: Moving towards big data in cheminformatics. Mol. Inform. 38, e1800082. https://doi.org/10.1002/minf.201800082 (2019).
Apweiler, R. UniProt: The Universal Protein knowledgebase. Nucleic Acids Res. 32, 115D–119. https://doi.org/10.1093/nar/gkh131 (2004).
Hattori, M., Okuno, Y., Goto, S. & Kanehisa, M. Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. J. Am. Chem. Soc. 125, 11853–11865 (2003).
Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W. & Kanehisa, M. Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24, i232–i240. https://doi.org/10.1093/bioinformatics/btn162 (2008).
Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
Jang, M., Seo, S. & Kang, P. Recurrent neural network-based semantic variational autoencoder for Sequence-to-sequence learning. Inf. Sci. 490, 59–73. https://doi.org/10.1016/j.ins.2019.03.066 (2019).
Xu, Z., Wang, S., Zhu, F. & Huang, J. Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics 285–294 (Association for Computing Machinery). https://doi.org/10.1145/3107411.3107424.
Asgari, E. & Mofrad, M. R. K. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10, e0141287. https://doi.org/10.1371/journal.pone.0141287 (2015).
Boutet, E. et al. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: How to use the entry view. Methods Mol. Biol. 1374, 23–54. https://doi.org/10.1007/978-1-4939-3167-5_2 (2016).
Thafar, M. A. et al. DTiGEMS+: Drug–target interaction prediction using graph embedding, graph mining, and similarity-based techniques. J. Cheminform. 12, 44. https://doi.org/10.1186/s13321-020-00447-2 (2020).
Fu, G. et al. Predicting drug target interactions using meta-path-based semantic network analysis. BMC Bioinform. 17, 160. https://doi.org/10.1186/s12859-016-1005-x (2016).
Al Shalabi, L. & Shaaban, Z. Normalization as a preprocessing engine for data mining and the approach of preference matrix. In 2006 International Conference on Dependability of Computer Systems 207–214. https://doi.org/10.1109/DEPCOS-RELCOMEX.2006.38
Friedman, J. H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).
Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (ACM, 2016).
Wackerly, D., Mendenhall, W. & Scheaffer, R. L. Mathematical Statistics with Applications (Nelson Education, 2007).
Gönen, M. & Heller, G. Concordance probability and discriminatory power in proportional hazards regression. Biometrika 92, 965–970. https://doi.org/10.1093/biomet/92.4.965 (2005).
Roy, K. et al. Some case studies on application of “r(m)2” metrics for judging quality of quantitative structure-activity relationship predictions: Emphasis on scaling of response data. J. Comput. Chem. 34, 1071–1082. https://doi.org/10.1002/jcc.23231 (2013).
Davis, J. & Goadrich, M. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning 233–240 (ACM). https://doi.org/10.1145/1143844.1143874
Davies, B. Integral Transforms and Their Applications (Springer Science & Business Media, 2012).
Roy, K. Quantitative Structure-Activity Relationships in Drug Design, Predictive Toxicology, and Risk Assessment (IGI Global, 2015).
Roy, P. P. et al. On two novel parameters for validation of predictive QSAR models. Molecules, 2009, 14, 1660–1701. Molecules 15, 604–605. https://doi.org/10.3390/molecules15010604 (2010).
Rücker, C., Rücker, G. & Meringer, M. y-Randomization and its variants in QSPR/QSAR. J. Chem. Inf. Model. 47, 2345–2357 (2007).
Zhao, L., Wang, J., Pang, L., Liu, Y. & Zhang, J. GANsDTA: Predicting drug-target binding affinity using GANs. Front. Genet. 10, 1243. https://doi.org/10.3389/fgene.2019.01243 (2019).
Gulli, A. & Pal, S. Deep Learning with Keras (Packt Publishing Ltd, 2017).
Goyal, P. & Ferrara, E. Graph embedding techniques, applications, and performance: A survey. Knowl.-Based Syst. 151, 78–94. https://doi.org/10.1016/j.knosys.2018.03.022 (2018).
Alshahrani, M. et al. Neuro-symbolic representation learning on biological knowledge graphs. Bioinformatics 33, 2723–2730. https://doi.org/10.1093/bioinformatics/btx275 (2017).
Dai, H. et al. Sequence2Vec: A novel embedding approach for modeling transcription factor binding affinity landscape. Bioinformatics 33, 3575–3583. https://doi.org/10.1093/bioinformatics/btx480 (2017).
Acknowledgements
The research reported in this publication was supported by King Abdullah University of Science and Technology (KAUST) through grant awards Nos. BAS/1/1059-01-01, BAS/1/1624-01-01, FCC/1/1976-20-01, FCC/1/1976-26-01, URF/1/3450-01-01, REI/1/4216-01-01, REI/1/4437-01-01, REI/1/4473-01-01, and URF/1/4098-01-01.
Author information
Contributions
M.A.T., X.G., and M.E. conceptualized and designed the study. M.A.T. performed the experiments and wrote the first draft of the manuscript. M.A.T., M.A., and S.A. analyzed the data and the results. M.A.T., S.A., M.A., and M.E. contributed to the writing of the manuscript. M.E., X.G., and T.G. revised/edited the manuscript. All authors contributed to the discussion, read and approved the submitted version.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Thafar, M.A., Alshahrani, M., Albaradei, S. et al. Affinity2Vec: drug-target binding affinity prediction through representation learning, graph mining, and machine learning. Sci Rep 12, 4751 (2022). https://doi.org/10.1038/s41598-022-08787-9