Large-scale Direct Targeting for Drug Repositioning and Discovery

A system-level identification of drug-target direct interactions is vital to drug repositioning and discovery. However, the biological means on a large scale remains challenging and expensive even nowadays. The available computational models mainly focus on predicting indirect interactions or direct interactions on a small scale. To address these problems, in this work, a novel algorithm termed weighted ensemble similarity (WES) has been developed to identify drug direct targets based on a large-scale of 98,327 drug-target relationships. WES includes: (1) identifying the key ligand structural features that are highly-related to the pharmacological properties in a framework of ensemble; (2) determining a drug’s affiliation of a target by evaluation of the overall similarity (ensemble) rather than a single ligand judgment; and (3) integrating the standardized ensemble similarities (Z score) by Bayesian network and multi-variate kernel approach to make predictions. All these lead WES to predict drug direct targets with external and experimental test accuracies of 70% and 71%, respectively. This shows that the WES method provides a potential in silico model for drug repositioning and discovery.

In this work, we propose a novel weighted ensemble similarity (WES) algorithm, an extension of the SEA method, to predict the drug-target direct interactions. Here, the term ensemble is an extension concept derived from statistical physics. As we know, each protein (receptor) has several ligands, these ligands construct a set, and here, the set was treated as an ensemble. This concept is proposed based on the following considerations: (1) if the ligand set has structurally similar compounds, then the ensemble average will cover a narrow chemical space. Thus, to compare a compound with the ensemble average or any single compound in a set might be have similar results; (2) however, in most cases, the ligands are diverse for a receptor like P-glycoprotein 16 or COX2 3 , they might be divided into several smaller sub-clusters. If the prediction of a compound that is still made based on its similarity with a certain compound in the training set, it will not give reliable results. Thus, a more reasonable way is to compare a compound similarity with the whole feature of an ensemble (set).
Here, the WES model was built on a large data set involving 98,327 drug-target relations, which includes BindingDB 17

WES-an algorithm for predicting direct interactions of drugs and targets. The algorithm
works in three phases: (1) identifying the key ligand structural and physicochemical features (CDK and Dragon) that are highly-related to the pharmacological properties in a framework of ensemble. We assembled the feature matrix for the ligand set of each protein based on statistical tests (non-parametric Wilcoxon Sum Rank Test for Dragon feature; one-sided Fisher's exact test for CDK feature). (2) Determining a drug's affiliation of a target by evaluation of the overall similarity of an ensemble rather than a single ligand judgment. As the resulting score does not discriminate relevant similarities from random but depends on the number of ligands in each set, it is not a perfect assessment of the overall similarity of the ligand sets. Then the overall similarities were converted into the size-bias-free normalized values to eliminate the relevant similarities from random. (3) And finally, integrating the standardized ensemble similarities (Z score) by Bayesian network to make predictions.
Model performance. Feature analysis. To investigate the effects of different structural features of the ligands on the model performance, we have used the Chemical Development Kit (CDK), Dragon and the CDK-Dragon hybrid features for model construction, respectively (see Methods for details). Table 1 illustrates the results in terms of precision and recall rates. Clearly, the hybrid model outperforms both the CDK and Dragon ones in recovering the negative links. Notably, the hybrid model for the leave-one-out cross-validation (LOOCV) performs well in predicting the binding (sensitivity 85%, SEN) and the non-binding (specificity 71%, SPE) patterns, with the accuracy of 78%, the precision (PRE 74%) and the area under the receiver operating curves (AUC) of 0.85, respectively. It is noted that all the scores (Z score for CDK and Dragon model and likelihood for CDK-Dragon hybrid model), used to make prediction, in this work were selected when the models achieve the highest F1 score in cross-validation otherwise specified (see Methods for details). The ROC curves (Fig. 1) show that all the three models are capable of catching sufficient information related to detect interactions at high true-positive rates against low false-positive rates at any threshold. With the increase of the AUC in the complete dataset, the hybrid model improves the ability to identify those known drug-target links, demonstrating that more chemical and pharmacological information introduced to build models can achieve better predictive activity.
To investigate the influence of weighted features attributed to the WES performance, we tested the different inputs: weighted features vs. non-weighted features. Table S1 shows that the weighted hybrid feature-based WES outperforms the non-weighted feature-based model, with the ACC of 78%, PRE of 74% and AUC of 0.85, respectively. This reflects that WES algorithm weights and selects features to reduce dimensionality of the descriptor set, thus resulting in good performance. Also we have made a check of the effectiveness of integrating the standardized ensemble similarities (Z score) by Bayesian network. Notably, the integrated WES model also performs better than the non-integrated one in predicting the binding (SEN 85%) and the non-binding (SPE 71%) patterns (Table  S1). These results serve to highlight the fact that integration procedure of WES algorithm exhibits high prediction efficiency.

External data validation.
To ensure the reliability of the WES model, we further carried out an external validation. The dataset for external validation includes both the binding (positive sample) and non-binding data (negative sample) as following: 1) the positive samples were extracted from PDB for those ligand-protein pairs with the half-maximal inhibitory concentrations (IC 50 ) < 10 μ M. The interactions which overlap with the training set for model construction were manually deleted, and finally 649 interactions were obtained; 2) the negative samples were achieved from BindingDB with a filter criterion of IC 50 > 500 μ M. And finally, 3,172 ligand-target non-binding data was obtained as negative samples. The hybrid model shows the prediction ACC of 71% (458/649) for the positive samples and 70% (2,209/3,172) for the negative samples. All these demonstrate the weighted hybrid WES achieves excellent performance for different data sources.
Target class prediction. The performance of WES method was further tested on five pharmaceutical classes involving enzymes (n = 761), ion channels (n = 78), membrane proteins (n = 275), transporters (n = 50) and transcription factors (n = 39), respectively. Figure 1 and Table 1 show the AUC, SEN, SPE, PRE and ACC of the models. WES displays the highest prediction ability for the transcription factor (ACC = 0.80) and the membrane protein (ACC = 0.79), followed by the enzyme (ACC = 0.78), transporter (ACC = 0.79) and ion channels (ACC = 0.75), respectively. Also, we have compared the performance of WES optimal model for target class prediction with other published models (enzymes, 664; ion channels, 204; membrane proteins, 95; nuclear receptors, 26; respectively.), including the nearest profile, weighted profile, bipartite Graph learning methods and the same criteria 5 . Table 2 indicates that all the methods have quite high AUC and SPE but low SEN values. The WES and bipartite graph model outperform the other two models (nearest profile, weighted profile). However, it has to be noted that, the WES model was constructed with a lager dataset exhibiting more molecular and pharmacological diversities, thus it is believed that WES might have more generalization ability for making predictions.

Comparison of WES with 1NN.
In multi-objective pattern recognition, the k-Nearest Neighbors algorithm (k-NN) is a non-parametric and widely used method. The output depends on whether k-NN is used for classification by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). WES has been compared to a one nearest neighbor (1NN) model (Fig. 2), which judges the probability of a drug targeting to a protein based only on the maximum similarity to the reference ligands of the target. For close analogs, Tanimoto coefficients (Tc) > 0.65, the fraction of true positives was comparable between 1NN and WES (Fig. 2). Surprisingly, by across most similarity thresholds, WES substantially outperforms 1NN. Notably, among the correct drug-target predictions by WES, 4,319 of them show low similarity (Tc < 0.4) with the ligand sets of their respective targets. However, the proportion held by 1NN is zero. These results prove that WES is more capable of predicting drug targets for various structurally diverse chemicals.
Evaluation of ligand scaffold hopping. In order to further assess the ligand scaffold hopping (LSH) ability for WES model, we have compared the predicted ligands with those known ligands for the same targets. The results show a diversified structural scaffolds as shown in Table S2-3. This indicates that WES catches the relatively complete drug-binding features for a protein from the ensemble level not from its single ligand like 1NN method. For example, drug Hydrocortamate, which is predicted to modulate Enpp2 (Fig. 3), is only marginally similar to the known ligand sets (Tc value 0.47; Fig. 3). Clearly, those similar compounds are more easily identified by WES. For example, Saquinavir, closely resemble (Tc value 0.91; Fig. 3) to the ligand set of REN, is predicted to regulate REN (Fig. 3). The LSH analysis  confirms the specificity of prediction for WES, which is important for drug repositioning for those known drugs in pharmaceutical researches.

Experimental validation.
To validate the practicability of WES model, we randomly selected Enpp2, Faah, PTGS2, PPARG, and REN, the five inflammation-related targets, and predicted their direct ligand-target interactions. The 24 top-scoring (hybrid-WES) and commercially available drug-target interactions (Table 3) were tested by the ligand-binding assays.
Here, the ligand-target affinities are calculated by IC 50 values, and the ligands were then classified as strong (IC 50 < 1 μ M), moderate (1 μ M ≤ IC 50 < 10 μ M), weak (10 μ M ≤ IC 50 < 100 μ M), or non-binders (IC 50 ≥ 100 μ M) according to Regina S. Salvat et al. 20 . In this work, the IC 50 ≤ 10 μ M is defined for binders for building the training dataset. Clearly, this criteria is strict, as we believe that a more strict strategy will be helpful to reduce data noises, since which were collected from various resources. Here, both the weak and strong binders were counted, resulting in a prediction ACC of 71% (17/24) for the experimental interactions predicted by the hybrid WES.
Perhaps the most compelling results are the test of the drugs against those targets to which they were not previously known to bind, so called drug repositioning (Table 3). By direct binding assay, we find Desmopressin is a new 1 μ M antagonist of REN receptor, which was not reported previously. This is also consistent with the phenomenon for Treprostinil which is newly found to antagonize PPARG in a micromolar concentration range. Intriguingly, Esmolol is also observed to modulate PPARG, though it has been reported to act on ADRB1 21 .

Discussion
The decoding of drug direct targets is of great importance in drug repositioning and discovery, but it is laborious and costly. Hence, a reliable computational approach for drug direct target prediction would be of significant values. In this study, we propose a new WES algorithm which exhibits reasonable reliability in discriminating direct interactions and non-interactions with a well specificity and sensitivity (AUC = 0.85), internal, external and experimental test accuracies of 78%, 70% and 71%, respectively.
Attention needs to be particularly paid to two steps in construction of the WES algorithm. First, the bulk of features have little to do with the pharmacological properties of a ligand. In order to identify the pharmacology-related features, we weighted the structural features based on statistical tests and optimization analysis in a framework of ensemble. This step not only reduces dimensionality of the descriptor set, but also eliminate data noise.
Second, most ligands are dissimilar with each other even they target to the same protein. Thus traditional single molecule similarity-based methods may be insufficient to predict the complex drug-target interactions. Here, we introduced the ensemble concept to assure the model to predict a compound activity not because of its similarity with certain compound in the training set, but of its similarity with the whole feature of an ensemble. Compared with the 1NN model, which judges the probability of a drug targeting to a protein based only on the maximum similarity to a reference ligand, the WES algorithm has more generalization ability in predicting those scaffold-hopping ligands. values and protein sequences from the BindingDB database, including 5,311 proteins and 490,282 ligands, respectively. K i is the concentration of an inhibitor that is required to decrease the maximal rate of the reaction by half. IC 50 is a measure of the effectiveness of a substance in inhibiting a specific biological or biochemical function. To obtain a reliable data set, we filtered the PLPs with the following steps: (1) deleting the redundant PLPs based on the protein sequences and the ligand Inchkey; (2) removing the PLPs of which K i and IC 50 values are unavailable or the average value of them larger than 10 μ M; (3) expunging the smaller ligand-set sized protein that overlaps more than 60% ligands with another protein; (4) excluding those ligands whose Tanimoto similarity is larger than 0.75 in the ligand set of one protein; (5) deleting the proteins whose ligand number is less than 5. As a result, 1788 proteins and 68,777 ligands that constituted 98,327 PLPs were obtained as the positive set. The negative set was constructed by a random generation of the same number of relations that do not overlap with those positive interactions. The two datasets are then used for training the models. All the data can be download from our website related with this work (http://lsp.nwsuaf.edu.cn/tcmsp.php).
Construction of feature matrix. CDK Fingerprint matrix. Ligands were represented by 1,024-bit chemical hashed fingerprints, which were computed using the CDK with default 2D parameters. The CDK is a scientific, LGPL-ed library for bio-informatics and chemi-informatics and computational chemistry written in Java. Taking the ligand set of a protein j constituted by n j ligands, an initial matrix P = { F (j) } (n j × 1024) was generated to represent the protein, where ( ) ) 1024 is the binary fingerprint vector of ligand k. To investigate which feature fit of the fingerprint has a higher contribution rate in distinguishing one protein from the others, we weighted each feature based on the significance (by P-value using one-sided Fisher's exact test) of overrepresentation against the background incidence of the feature in respective protein. The P-values are adjusted to control for multiple hypothesis tests, yielding q-values. The weight for each feature was then computed using the following formula:   , N is the number of total proteins in the training set. We used q = 0.05, the generally considered statistically significant threshold, as it ensures a reasonable discrimination of the feature weights ( Figure S1).

Dragon Fingerprint matrix.
In addition, ligands were also represented by 1,664 Dragon descriptors (http://www.talete.mi.it/index.htm). As a professional software package, Dragon calculates molecular descriptors frequently used to evaluate the molecular structure-activity relationship. Taking the ligand set of a protein j constituted by n j ligands, an initial matrix P = {D (j) } (n j × 1664) is generated to represent the protein, where

Model building.
Firstly, for a protein j, we selected m j1 and m j2 highest weighted features from the CDK and Dragon descriptors, respectively; then the protein j was represented by the feature matrices P = { F (j) } (n j × m j1 ) and P = { D (j) } (n j × m j2 ); finally, the fingerprint-Dragon based weighted similarity scores between two ligand (l 1 , l 2 ) were expressed as where ∧ indicts the Boolean operator "AND", whereas ∨ represents the Boolean operator "OR", respectively. In order to obtain a good estimate of the overall similarity with the ligand set (ensemble), we first defined a raw score for this ligand by summing its weighted similarity relative to the ligand set of protein j with S i ≥ S cut . The threshold S cut was determined by retrospective cross-fold analysis. Unlike WES, SEA chooses S cut to meet that the random Z score is consistent and enriches for a BLAST-like background probability distribution. Actually, by sampling across the range of S cut choices, we chose the threshold that will lead to the highest ROC AUC, resulting in a similarity threshold. The scores below the threshold were discarded which do not contribute to the overall similarity. Then, a model of the distribution of random raw scores was developed and fitted. Random raw scores were calculated by comparing a randomly selected ligand set (size = 50) to the ligand set of each protein. Therefore, we can acquire the mean (μ) and standard deviation (σ) of the 50 random raw scores. And the normalized raw score, annotated as Z score, can be represented as equation (6) The calculation process of Z score is as follows: 1. For a protein j, choose 50 ligands at random from all ligands and calculate the mean and standard deviation values of raw scores at different similarity thresholds (S cut ) with step size 0.01, where 0 < S cut < 1. Store all calculated mean values (μ j = {μ j1 ,…, μ j100 }) and standard deviation values (σ j = {σ j1 ,…, σ j100 }), along with the set size of the protein j. 2. For each S cut , plot the set size of protein ligand vs all μj(S cut ) and σj(S cut ) scores, respectively; and then the linear regression was applied to determine the equations of μ j and σ j . Typically, equations y μ = α 1 x + β 1 and y σ = α 2 x + β 2 are appropriate for standardizing the Raw sores. Given the normalized equation (6), calculate the Z score. If a new drug-target interaction has a Z score above a threshold, it will be treated as a direct interaction. The threshold above which the highest F1 score was achieved in LOOCV was used to make predictions (equation 7).
where precision is the ratio of the number of true positives to the number of predicted positives and recall is the ratio of the true positives which are correctly identified.
Z score integration. To depict the likelihood of a ligand binds to a specific protein, we integrated the Z scores into a likelihood value by the Bayesian network method, so called the hybrid model in this work. The likelihood was defined as: where P(Z = z 1 ,z 2 |C = c) indicates the probability of Z score scored z 1 or z 2 in class c, and z 1 and z 2 represent the CDK and Dragon Z scores, respectively. In addition, we evaluated the conditional probability by the multivariate kernel density estimation approach, which is a nonparametric technique for density estimation through the following formula: is the Gaussian kernel, d is the dimensionality of vector X, (d = 2); n is the number of data samples in class c, H is the bandwidth (or smoothing) d × d matrix which is symmetric and positive definite. And a ligand is considered to incorporate into a protein when the L value is greater than threshold θ, which is the same as the threshold of Z score.
Performance evaluation. The WES model was evaluated and verified with LOOCV. In details, the WES algorithm is applied once for each interaction, using all other interactions as a training set and using the selected interaction as a single-item test set. Several parameters, ACC (equation 10 Comparison to a 1NN model. We evaluated two 1NN models, using either CDK or Dragon fingerprints. For a drug, it was compared to all known ligands of a target. The highest Tc value between the querying drug and known ligands was assigned to the drug-target pair. For each drug, we identified the lowest Tc value that yielded valid WES predictions using the respective fingerprint and collected all drug-target pairs with Tc scores above that threshold. We calculated an adjusted hit rate (equation 14): Adjusted hit rate  1  1  14 The additional count for both numerator and denominator distinguishes cases where no predictions were confirmed.

TP TP FP
External data validation for binding and non-binding data. To examine the generalization ability of WES, we manually collected the direct binding data in PDB and non-binding data in BindingDB (see details in Results).