Screening drug-target interactions with positive-unlabeled learning

Identifying drug-target interaction (DTI) candidates is crucial for drug repositioning. However, usually only positive DTIs are deposited in known databases, and the lack of negative samples makes it challenging for computational methods to predict novel DTIs. To overcome this dilemma, researchers usually select negative samples at random from unlabeled drug-target pairs, which introduces many false negatives. In this study, a negative sample extraction method named NDTISE is first developed to screen strong negative DTI examples based on positive-unlabeled learning. A novel DTI screening framework, PUDTI, is then designed to infer new drug repositioning candidates by integrating NDTISE, the probabilities that the remaining ambiguous samples belong to the positive and negative classes, and an SVM-based optimization model. We investigated the effectiveness of NDTISE on a DTI dataset provided by NCPIS. NDTISE is much better than random selection and slightly outperforms NCPIS. We then compared PUDTI with 6 state-of-the-art methods on 4 classes of DTI datasets from human enzymes, ion channels, GPCRs and nuclear receptors. PUDTI achieved the highest AUC among the 7 methods on all 4 datasets. Finally, we validated a few top predicted DTIs by mining independent drug databases and the literature. In conclusion, PUDTI provides an effective pre-filtering method for new drug design.


Results
Our goal is to (a) improve DTI predictive accuracy based on the PUDTI framework; (b) effectively identify drug repositioning candidates for existing drugs and targets; and (c) provide new clues for the treatment of Alzheimer's disease. The central idea is to extract NDTISs based on PU learning. Figures 1-5 illustrate the PUDTI framework. The framework consists of five main parts: representing each DTI as a vector based on various biological information, selecting feature subsets of DTIs, constructing strong NDTISs, computing the similarity weights of the ambiguous examples, and building an SVM-based optimization model.
We evaluated whether our proposed PUDTI framework can identify potential DTIs properly, and conducted extensive experiments under different experimental settings. (1) We compared the performance of our proposed NDTISE method with the random selection method and NCPIS on a DTI dataset provided by NCPIS 26. (2) We evaluated our proposed PUDTI framework on four classes of datasets from human enzymes, ion channels, GPCRs and nuclear receptors, respectively. (3) We compared the performance of 5 representative DTI prediction models, namely BLM, RLS-Avg, RLS-Kron, KBMF2K-classification and KBMF2K-regression, by applying the negative samples predicted by NDTISE, random selection and NCPIS, respectively, on the DrugBank data. (4) Some new drug repositioning candidates for existing drugs and targets were identified. (5) New clues for the treatment of Alzheimer's disease were inferred.
We executed the feature selection method and ranked the features by their discriminant capability scores in the constructed positive sample set P and unlabeled sample set U. We then kept the top 300 features for DTIs. Considering previous studies 25 and our own tests, we chose the radial basis kernel as the kernel function because of its good boundary response 24. The parameters $C_1$, $C_2$, $C_3$ and $C_4$ were set with a step size of $2^{-4}$ in the range $[2^{-5}, 2^{5}]$.

Performance Comparison of Different Negative Sample Selection Methods.
We compared three different negative sample selection methods, namely NDTISE, random selection and NCPIS, on the DTI data provided in ref. 26 using six classical classification models: naive Bayes (NB), k-nearest neighbor (kNN), L1-logistic regression (L1-R), L2-logistic regression (L2-R), random forest (RF) and SVM. The parameters of these classifiers were set to the default values provided by ref. 26. The negative ratio in NCPIS was chosen as 3. The k for the kNN algorithm was set to 1. The code for both the Spy and Rocchio classifiers 32,33 can be obtained from the LPU system 30 (http://www.cs.uic.edu/liub/LPU/LPU-download.html).
A total of 10 trials of pairwise 5-fold cross-validation 9,26 were used to compare the NDTISE method with the random selection method and NCPIS. (1) The drug-target pairs D (interacting or non-interacting) in the gold standard dataset were randomly partitioned into five mutually exclusive subsets of roughly equal size, $\{D_1, D_2, \ldots, D_5\}$. (2) In each round $t \in \{1, 2, \ldots, 5\}$, one drug-target pair set $D_t$ was regarded as the test set, and the entries in $D_t$ were masked. The remaining four subsets $D \setminus D_t$ were taken as the training set to recover the masked true labels in $D_t$. (3) The experiment was repeated 10 times to avoid sampling bias, and the average predictive performance over the 5 folds and 10 trials was used as the final result.
To extract sub-datasets for PU learning, we used the following setting: we randomly extracted r percent of the samples from the known DTI dataset in the training set to form a positive sample set P. The remaining samples from the known DTI dataset and the unknown drug-target pairs in the training set were used together to form an unlabeled dataset U. We first set r = 10 and evaluated the performance of the NDTISE method while increasing r. We observed that the NDTISE method is basically stable when r is no less than 30. Therefore, we set r to 30 in this study. The above six classifiers used P and the RN extracted by the three negative sample selection methods as positive and negative samples, respectively. In addition to P and RN, SVM-SW used the similarity weights of the ambiguous samples.
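The cross-validation and PU sub-dataset setting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `five_fold_partition` and `pu_split`, the `labels` dictionary and the fixed seeds are all hypothetical, and rounding r percent to the nearest integer is our choice.

```python
import random

def five_fold_partition(pairs, n_folds=5, seed=0):
    """Randomly split drug-target pairs into n_folds disjoint, near-equal subsets."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    # Striding over the shuffled list gives fold sizes that differ by at most 1.
    return [shuffled[i::n_folds] for i in range(n_folds)]

def pu_split(train_pairs, labels, r=30, seed=0):
    """Put r percent of the known positives into P; all other training pairs form U."""
    rng = random.Random(seed)
    positives = [p for p in train_pairs if labels[p] == 1]
    n_keep = round(len(positives) * r / 100)  # rounding rule is an assumption
    P = set(rng.sample(positives, n_keep))
    U = [p for p in train_pairs if p not in P]  # hidden positives + unknown pairs
    return sorted(P), U

# One round of one trial: fold 0 is the masked test set, the rest is training data.
pairs = list(range(50))                                # stand-ins for drug-target pairs
labels = {p: 1 if p % 5 == 0 else 0 for p in pairs}   # toy labels, not real data
folds = five_fold_partition(pairs, seed=1)
train = [p for fold in folds[1:] for p in fold]
P, U = pu_split(train, labels, r=30, seed=1)
```

Repeating the outer partition with 10 different seeds and averaging over all 50 rounds corresponds to the 10 trials of 5-fold cross-validation described in the text.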
We listed in Table 1 the performances of the three negative sample selection methods using respective classification models in terms of precision, recall, f-measure and AUC. NDTISE outperforms the other two methods in 4 classification methods and achieves comparable performances to NCPIS in the other two classification methods. Compared to random selection method, for instance, the average AUC values on NDTISE increased by 17 Although the performances of NDTISE were slightly lower than NCPIS in the RF and SVM, our proposed PUDTI framework based on the SVM-SW classifier was better than NCPIS, as shown in Table 2. The results indicated that considering the probabilities that the ambiguous samples belong to the positive and negative classes may help improve classification performance.  Table 3 described the details. To demonstrate the performance of our proposed PUDTI framework, we compared it with 6 state-of-the-art methods on these four datasets: DBSI 11 , NetLapRLS 34 , KBMF2K 21 , NetCBP 27 , WNN-GIP 35 and PUDT-Lan 28 . The six methods were used to predict potential DTIs from human nuclear receptors, GPCRs, ion channels and enzymes and the last method inferred possible DTIs based on PU learning.
We listed in Table 4 the average AUC values of these six methods and our proposed PUDTI framework. PU-based prediction methods clearly outperform the other methods on all four datasets, which suggests that extracting negative DTI samples from unlabeled drug-target pairs may help improve prediction performance. In addition, our proposed PUDTI framework is better than the PUDT-Lan method, which might be due to the fact that PUDTI considers the probabilities that the ambiguous samples belong to the positive and negative classes.

Comparison with Representative DTI Prediction Methods on the DrugBank data.
We compared the performance of 5 representative DTI prediction models, namely BLM, RLS-Avg, RLS-Kron, KBMF2K-classification and KBMF2K-regression, by applying the negative samples predicted by NDTISE, random selection and NCPIS, respectively, on the DrugBank data. These methods were originally used to identify potential DTIs from human enzymes, ion channels, GPCRs and nuclear receptors, which were provided by ref. 17. For RLS-Avg and RLS-Kron, we set the parameters to (0.5, 0.5) and (0.5, 0.5), with which the two classifiers obtained better classification performance than with (1, 1) and (1, 1) 26. We extracted strong NDTISs based on algorithm 1. The drug and protein similarity matrices can be calculated with the cosine formula from the feature vectors of drugs and proteins. We again used 10 trials of pairwise 5-fold cross-validation and conducted sub-dataset extraction for PU learning, as in the previous section.
The results are shown in Fig. 6. NDTISE significantly outperforms the random selection method with all 5 representative DTI prediction models. The recall values of NDTISE were lower than those of NCPIS in these models; that is, relatively fewer DTIs were successfully predicted. However, the precision values of NDTISE were better than those of NCPIS; that is, more of the predicted DTIs were correct. Moreover, NDTISE obtained a larger improvement than NCPIS in terms of F-measure and AUC. These results indicated that our NDTISE method can extract NDTISs properly.

Sensitivity Study on the Parameter.
The similarity weights of an ambiguous sample are used to measure the probabilities that the sample belongs to the positive and negative classes. The parameter α is used to balance the importance of local and global similarities. To measure the sensitivity of our proposed PUDTI framework to α, we conducted a series of experiments to investigate the performance under different settings.
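As an illustration of the cosine-based similarity matrices mentioned above, the following sketch computes pairwise cosine similarities from feature vectors. The function names are hypothetical, and returning 0 for an all-zero vector is an assumption of this sketch rather than something stated in the text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Square matrix of pairwise cosine similarities (drugs or proteins)."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy drug feature vectors; real vectors would be the descriptor vectors of the drugs.
drug_features = [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 0.0]]
S_drug = similarity_matrix(drug_features)
```

The same function can be applied to protein feature vectors to obtain the target similarity matrix required by kernel-based models such as RLS-Kron.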
As shown in Fig. 7, when r is 30, the performance increases gradually for α < 0.6 and decreases gradually for α > 0.6. We obtained similar results when r was varied from 40 to 70 with a step size of 10. Therefore, we set α to 0.6.

Drug Repositioning for Astemizole.
Astemizole is a long-acting and non-sedative antihistaminic. The drug has antiallergic properties and is used to treat allergic conjunctivitis, asthma, chronic idiopathic urticaria and seasonal allergic rhinitis 36. Recently, ref. 37 reported that astemizole was possibly a new anti-cancer drug. Therefore, identifying new drug repositioning candidates for the drug is significant. We intended to find new drug repositioning candidates for the drug. Astemizole interacts with eight proteins, namely, P24462, P08183, P35367, P51589, P20815, P10635, P08684 and Q12809, in the DrugBank database 38. We extracted twelve negative DTIs for the drug, namely, O75600, P07814, P21549, P23378, P23415, P28066, P30793, P34896, P34897, Q10588, Q53ET4 and Q8IWU9. Five of these extracted negative DTIs have been reported by ref. 26. We used Cytoscape 39 to draw DTI networks. Figure 8(a) shows the known DTIs in the DrugBank database 38 and the reliable NDTISs extracted by algorithm 1.
We predicted possible interaction partners for astemizole based on known DTIs and extracted NDTISs. The predicted results are shown in Fig. 8(b). These DTIs can be divided into four parts: the first part includes known DTIs in the DrugBank database 38 , wherein seven of eight known DTIs are identified by PUDTI. The second part includes DTI candidates that are unknown in the DrugBank database 38 but can be validated by retrieving the other databases. Among these DTIs, the interactions between astemizole and four proteins, namely, Q07973, O95259, P28223 and P41595, can be validated by searching the STITCH database 40 , and the interactions between astemizole and two proteins, namely, P35346 and P30874, can be substantiated by retrieving the SuperTarget database 41 .
P08183 is an energy-dependent efflux pump that decreases drug accumulation in cells 42. The protein interacts with astemizole in the DrugBank database 38. P21439 is an energy-dependent phospholipid efflux translocator that positively regulates biliary lipid secretion. It specifically translocates phosphatidylcholine from canalicular membrane bilayer into hepatocytes. The translocation enables biliary phospholipids to be extracted into the canaliculi lumen and thus protects hepatocytes from the detergent properties of bile salts 42. Both P08183 and P21439 are multidrug resistance proteins 38. The function of P21439 is similar to that of P08183 41. Moreover, the sequence similarity and sequence identity between these two proteins are 0.86 and 0.753, respectively, in the SuperTarget database 41. Therefore, we inferred that P21439 may be a new drug repositioning candidate for astemizole, based on the predictive accuracy of PUDTI and on its functional similarity, sequence similarity and sequence identity to a known target.

Drug Repositioning for DNA topoisomerase 2-alpha.
DNA topoisomerase 2-alpha (P11388), encoded by the TOP2A gene, controls the topological states of DNA. It is essential for segregating daughter chromosomes during mitosis and meiosis 38. We intended to find new drug repositioning candidates for the protein from the DrugBank database 38 by training the SVM-SW classification model after determining the performance of PUDTI. P11388 interacts with thirty-two drugs in the DrugBank database 38. Most of these drugs are used to interfere with the transcription process and prevent RNA synthesis 38. We extracted thirteen negative DTIs for the protein, eight of which have been reported by ref. 26. We used Cytoscape 39 to draw DTI networks. Figure 9(a) shows the known DTIs in the DrugBank database 38 and the reliable NDTISs extracted by algorithm 1.
We predicted possible interaction partners for P11388 based on known DTIs and extracted NDTISs. The predicted results are shown in Fig. 9(b). These DTIs can be divided into four parts. The first part includes known DTIs in the DrugBank database 38, of which twenty-seven of thirty-two are identified by our proposed PUDTI framework. The second part includes DTI candidates that are unknown in the DrugBank database 38 but can be validated by retrieving other databases. Among these DTIs, the interaction between dactinomycin and P11388 can be validated by searching the UniProt database 42, and the interaction between gatifloxacin and P11388 can be substantiated by retrieving the SuperTarget database 41. Dactinomycin binds to DNA and inhibits RNA synthesis. Protein synthesis declines after dactinomycin therapy as a result of impaired mRNA production 38. Gatifloxacin inhibits the bacterial enzyme DNA gyrase. The drug is available in aqueous solutions for intravenous therapy 38.
The third part includes the interactions between P11388 and dichlorophenamide and miconazole, which have been reported by ref. 26. The fourth part consists of the associations between P11388 and irinotecan and topotecan. P11388 interacts with camptothecin in the SuperTarget database 41. Both irinotecan and topotecan are derivatives of camptothecin 38. Topotecan is a drug used to treat ovarian cancer. It is used to regulate DNA topology and facilitate DNA recombination, replication and repair by inhibiting DNA topoisomerase I 38. The similarity between camptothecin and topotecan is 0.94 in the SuperTarget database 41. The association between P11388 and topotecan can be validated by retrieving refs 43-45. Therefore, we inferred that P11388 may interact with topotecan.

Find New Clues of Treatment for Alzheimer's Disease.
The above drug repositioning results imply that existing drugs and drug targets may help find new therapies for diseases. We investigated the complex associations between existing drugs and drug targets of Alzheimer's disease to infer new clues for the treatment of the disease. We retrieved six drugs for Alzheimer's disease based on their indications in the DrugBank database, namely, galantamine, olanzapine, quetiapine, risperidone, thioridazine and ziprasidone 38. The five drugs other than galantamine target seven proteins, namely, D(1A), D(2) and D3 dopamine receptors (P21728, P14416 and P35462), alpha-1A and alpha-1B adrenergic receptors (P35348 and P35368), a 5-hydroxytryptamine receptor (P28223) and potassium voltage-gated channel subfamily H member 2 (Q12809) 38.
We found some drugs targeting these seven proteins in the DrugBank database. However, we cannot infer new clues for the treatment of Alzheimer's disease from these seven target proteins alone. Therefore, we intended to predict the interactions between these six drugs and targets, as well as the associations between these drug targets and other drugs. The results are shown in Fig. 10. We can observe that the five drugs other than galantamine generally target subsets of the following proteins: adrenergic receptors (P35348, P35368, P08913, P18089 and P18825), dopamine receptors (P21728, P21917, P21918, P35462 and P14416), 5-hydroxytryptamine receptors (P28223, P34969 and P08908), muscarinic acetylcholine receptors (P08172, P08173, P08912 and P11229), histamine H1 receptor (P35367) and potassium voltage-gated channel subfamily H member 2 (Q12809). Therefore, we inferred that these target proteins may have a strong correlation with Alzheimer's disease.
We further considered the other drugs targeting these proteins in the DrugBank database and found that aripiprazole may have strong correlations with these target proteins. Aripiprazole is an atypical antipsychotic medication that is used to treat schizophrenia and mediates its antipsychotic effects primarily through P14416. Ref. 46 reported that aripiprazole may be in clinical trials for the treatment of Alzheimer's disease. Therefore, we inferred that aripiprazole may be a drug candidate for Alzheimer's disease.

Discussion
Supervised learning-based methods have demonstrated better classification performance for potential DTI identification than traditional computational methods. However, experimentally validated NDTISs are difficult to obtain or even unavailable. Therefore, screening negative training samples for DTI prediction models is a recurring problem. In this study, we designed the NDTISE method to extract reliable NDTISs based on PU learning and various biological information. A novel DTI screening framework, PUDTI, was then developed to find new drug repositioning candidates for existing drugs and targets. Experimental results from three different negative sample selection methods on the DTI data provided by NCPIS 26, from 6 state-of-the-art methods on 4 classes of DTI datasets from human nuclear receptors, GPCRs, ion channels and enzymes, and from 5 representative DTI prediction models on the DrugBank data demonstrated the generalization capability and competitiveness of our proposed PUDTI framework. The framework identified new drug repositioning candidates for the drug astemizole and the target DNA topoisomerase 2-alpha, and found new clues for the treatment of Alzheimer's disease.
The PUDTI framework can produce good results over all measures compared with different methods. This observation may be ascribed to the following advantages of the framework. (1) The framework can effectively extract those DTI candidates that are most likely to be negative samples. These NDTISs are applied to identify possible DTIs with the labeled DTIs. (2) The framework took advantage of multiple classifier combination and effectively integrated two types of PU learning models and various biological information related to drugs and targets. (3) In the DTI prediction problem, the noise in training samples was unavoidable. Different similarity weights were calculated to demonstrate different noise levels of the ambiguous samples. Therefore, the built SVM-SW was more tolerant to different noise levels of various DTI data types.
The PUDTI framework integrated the Spy and Rocchio classifiers 32,33 to extract reliable NDTISs. However, the predictive accuracy can be further improved by integrating multiple PU learning models. In subsequent investigations, we will consider an ensemble PU learning framework for DTI screening to minimize the possible bias and errors in these two types of PU learning methods.
Negative sample construction is a key issue in predicting associations between various biological entities, such as lncRNA-disease associations, miRNA-disease associations and drug-drug associations. The PUDTI framework may also benefit from the extraction of various negative samples, which will in turn assist in identifying underlying associations between these entities. In further experiments, we will consider building negative lncRNA-disease association and negative miRNA-disease association datasets based on PU learning to improve predictive performance.
Finding new therapies for existing drugs is significant for modern drug development. There are complex associations between diseases and their known drugs and drug targets. In the future, we will consider building a supervised learning model by constructing a disease-drug-target network to identify new clues for the treatment of existing diseases.

Materials and Methods
Materials.
Representing Drug Molecules. Different kinds of descriptors are used to describe various drug molecule properties in drug discovery. The PaDEL-Descriptor software 47 has been designed to represent drug molecules. We used the software and represented a drug molecule as $G = (g_1, g_2, \ldots, g_{1444})^T$ based on the preprocessing program provided by ref. 25.
Representing Target Proteins. Various types of protein descriptors were defined based on different properties of target proteins in proteomics. For representing target proteins, we used three types of protein properties, namely, protein domain 48 , pseudo amino acid composition (PAAC) 49 and position specific scores 50 .
Protein Domain: Domains of target proteins were retrieved from the PFAM database 48.
Position Specific Score Matrix (PSSM): The bi-gram feature extraction method (BiGFE) 51 was developed to describe the evolutionary information of target proteins based on their position specific scoring matrices (PSSMs) 50. References 12 and 52 used the method and obtained improved performance in predicting DTIs. We described each protein as a 400-dimensional feature vector based on the BiGFE method.
Combining domains, PAACs and PSSM, a target protein can be represented as a 1781-dimensional vector. Therefore, each DTI sample can be described as a 3225-dimensional vector based on the PaDEL descriptors of drugs and the domains, PAACs and PSSM of target proteins.

Methods.
The proposed PUDTI framework can be divided into five steps:
• Select the feature subsets of DTI samples.
• Screen the high-quality NDTISs.
• Calculate the representative positive and negative prototypes.
• Compute the similarity weights of the ambiguous samples.
• Construct the final classification model and identify DTI candidates.
In the following, we describe each step in detail.
Step 1: Feature Selection. Only some of the features in the DTI feature set are robust. Selecting a feature subset may help decrease the false positive and false negative ratios, thereby avoiding the overfitting problem. Reference 53 developed a feature selection method to distinguish disease genes from non-disease genes. We used this method to select feature subsets for each DTI in order to efficiently distinguish interacting drug-target pairs from noninteracting drug-target pairs. For each DTI feature f, we define its association score in P and U (as(f, P) and as(f, U)) as follows: where $DTP_i$ is the ith drug-target pair, $DTP_i \in P$ indicates that the ith DTP is positive and $DTP_i \in U$ indicates that the ith DTP is unlabeled. $asso(DTP_i, f)$ represents the association score between $DTP_i$ and the feature f, which can be computed as follows: We then compute the discriminant ability score of f in P and U as, By Eq. (7), we intend to screen those discriminative features which are either frequently present in P but seldom in U or frequently present in U but seldom in P. For a feature f, when as(f, P) in P is large but as(f, U) in U is small, or as(f, U) in U is large but as(f, P) in P is small, da(f) will be large because both af(f, P) + af(f, U) and log(|P|/af(f, P) + |U|/af(f, U)) are relatively large. On the contrary, the score will be relatively low when af(f, P) and af(f, U) are both small or both large. Thus, we can select representative feature subsets for each DTI.
Step 2: Screening Reliable NDTISs. Typically, supervised learning-based models require numerous labeled positive and negative samples to achieve good classification accuracy. However, known DTIs are rare, and NDTISs are difficult to obtain or even unavailable. Moreover, numerous DTI examples are unlabeled. To obtain good predictive performance, we intend to screen trustworthy NDTISs. We considered two classical PU learning models, namely, the Spy and Rocchio techniques 32,33. To reduce the expected error rates when screening NDTISs, we minimized the bias of each individual model by combining multiple classifiers. The details are described in algorithm 1.
In algorithm 1, RN and EP denote the reliable NDTISs and the positive samples extracted by algorithm 1, respectively. $C_{Spy}$ and $C_{Roc}$ represent the classification results from the Spy and Rocchio classifiers 32,33, respectively. Steps 1 and 2 initialize P, U, RN and EP. Steps 3-5 classify the unknown DTIs in U. Steps 6-9 screen RN by excluding positive DTIs as far as possible. For instance, a DTI is regarded as a reliable negative sample if both classifiers assign it to the negative class, that is, the DTI simultaneously satisfies $C_{Spy} = -1$ and $C_{Roc} = -1$. Steps 10-14 are used to add high-quality positive examples to P. The U in Step 15 denotes the remaining unlabeled DTIs after extracting the high-quality positive and negative examples. We considered these remaining DTIs as the ambiguous samples.
Step 3: Computing the Representative Positive and Negative DTI Prototypes. We obtained reliable NDTISs in the previous step. In theory, we can build a classifier and predict new DTIs using P and RN. However, the classification results may not be accurate enough because some ambiguous samples remain. For these ambiguous samples, we cannot determine whether they belong to the positive or negative class. Assigning these examples to either class will disturb the classification performance. As such, following the method provided by refs 29 and 31, we developed a similarity weight calculation method to measure the probabilities that the remaining ambiguous samples belong to the positive and negative classes.
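The combination rule described for algorithm 1 can be sketched as below. This is only an illustration under assumptions: the function name `partition_unlabeled` is hypothetical, the Spy and Rocchio predictions are taken as given lists of +1/-1 labels, and treating a pair as a high-quality positive when both classifiers output +1 is our reading of Steps 10-14, which the text does not spell out.

```python
def partition_unlabeled(c_spy, c_roc):
    """Split unlabeled samples by agreement of two PU classifiers.

    c_spy, c_roc: lists of +1/-1 predictions for the same unlabeled samples.
    Returns index lists (rn, ep, ambiguous).
    """
    rn, ep, ambiguous = [], [], []
    for idx, (spy_label, roc_label) in enumerate(zip(c_spy, c_roc)):
        if spy_label == -1 and roc_label == -1:
            rn.append(idx)          # reliable negative: both classifiers say -1
        elif spy_label == 1 and roc_label == 1:
            ep.append(idx)          # assumed rule for high-quality positives
        else:
            ambiguous.append(idx)   # classifiers disagree: keep as ambiguous
    return rn, ep, ambiguous

rn, ep, ambiguous = partition_unlabeled([-1, 1, -1, 1], [-1, 1, 1, -1])
# rn == [0], ep == [1], ambiguous == [2, 3]
```

Requiring agreement between the two classifiers trades coverage for reliability: fewer samples are labeled, but each label is supported by two different PU models.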

Algorithm 1. The NDTISE method.
To compute the similarity weights of these ambiguous samples, we partitioned the DTI samples in RN into $a$ modules using the k-means clustering algorithm and computed the representative positive and negative DTI prototypes. The details are described in algorithm 2.
The parameter $a$ was set as $a = t \cdot |RN| / (|U| + |RN|)$, where |RN| and |U| denote the numbers of samples in RN and U, respectively. t, α and β were set as 30, 16 and 4, respectively, as recommended by refs 29-31.
Step 4: Computing the Similarity Weights of the Ambiguous Samples. The similarity weights of the remaining ambiguous samples in U represent the probabilities that the samples belong to the positive and negative DTI classes. To compute the similarity weights, we defined the similarities of an ambiguous sample x to the ith representative positive and negative prototypes ($p_i$ and $n_i$) as follows: where n is set as $n = t \cdot |U| / (|U| + |RN|)$ and t is set as 30, as recommended by refs 29 and 31.
Steps 5-9 tag x with a temporary label. $|US_i|$ denotes the number of all samples in $US_i$. $|tempos_i|$ denotes the number of samples which are temporarily regarded as positive samples in $US_i$, and $|temneg_i|$ denotes the number of samples which are temporarily regarded as negative samples in $US_i$. The most similar positive and negative prototypes of x can be obtained by equation (8).
As illustrated in Fig. 11, H denotes the decision hyperplane in the process of classification and can be computed by the Rocchio classifier 33.
Computing Global Similarity Weights: The local similarity weights utilized the biological features shared by the ambiguous samples and computed the similarities between all samples in a cluster. However, the local similarity weights of samples in the same cluster are possibly different because of different physical locations. For example, assigning the same class weight to the ambiguous samples y and z in $M_2$ is inappropriate even though the two samples have the same local similarity weights. Therefore, we calculated the global similarity weights between x and all representative prototypes to measure the probabilities that x belongs to the positive and negative DTI classes from a global perspective.
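A minimal sketch of the cluster count and prototype computation follows. Algorithm 2 is not reproduced in the text, so the prototype formula here is an assumption: it uses the standard Rocchio form (α times the mean of unit-normalized vectors of one class minus β times that of the other), which is consistent with α = 16 and β = 4. The function names, the rounding of a, and the input of RN as already-clustered lists are all hypothetical choices; the k-means step itself is omitted.

```python
import math

def num_clusters(n_rn, n_u, t=30):
    """a = t * |RN| / (|U| + |RN|); rounding and the floor of 1 are assumptions."""
    return max(1, round(t * n_rn / (n_u + n_rn)))

def _unit(x):
    """Scale a vector to unit Euclidean length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else list(x)

def _mean_unit(samples):
    """Component-wise mean of the unit-normalized sample vectors."""
    units = [_unit(x) for x in samples]
    return [sum(col) / len(units) for col in zip(*units)]

def rocchio_prototypes(P, rn_clusters, alpha=16, beta=4):
    """One (positive, negative) prototype pair per cluster of RN (assumed Rocchio form)."""
    mean_p = _mean_unit(P)
    prototypes = []
    for cluster in rn_clusters:
        mean_n = _mean_unit(cluster)
        p_i = [alpha * mp - beta * mn for mp, mn in zip(mean_p, mean_n)]
        n_i = [alpha * mn - beta * mp for mp, mn in zip(mean_p, mean_n)]
        prototypes.append((p_i, n_i))
    return prototypes

a = num_clusters(n_rn=100, n_u=200)                      # 30 * 100 / 300 = 10
protos = rocchio_prototypes([[1.0, 0.0]], [[[0.0, 1.0]]])
```

With this formula each prototype is pushed toward its own class mean and away from the opposite class mean, which is why one pair is built per negative cluster.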
The global similarity weights of x can be measured as follows: where GloP(x) and GloN(x) represent the probabilities that x belongs to the positive and negative DTI classes from a global perspective. We obtain the final probabilities that x belongs to the positive and negative DTI classes based on its local and global similarity weights: where the parameter α is used to balance the importance between the global similarity and the local similarity.
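The final weighting step can be illustrated with the following sketch. The exact equations are not reproduced in the text, so this assumes a convex combination in which α multiplies the local weights and (1 − α) the global weights; the function name `final_weights` and the argument names are hypothetical, and the assignment of α to the local term is an assumption.

```python
def final_weights(loc_p, loc_n, glo_p, glo_n, alpha=0.6):
    """Blend local and global similarity weights of one ambiguous sample.

    Assumed form: W = alpha * local + (1 - alpha) * global, for each class.
    """
    w_p = alpha * loc_p + (1 - alpha) * glo_p   # weight toward the positive class
    w_n = alpha * loc_n + (1 - alpha) * glo_n   # weight toward the negative class
    return w_p, w_n

# A sample that looks positive locally but is undecided globally.
w_p, w_n = final_weights(loc_p=1.0, loc_n=0.0, glo_p=0.5, glo_n=0.5, alpha=0.6)
```

If the local and global weights each sum to 1 over the two classes, the blended weights also sum to 1, so they can still be read as class membership probabilities.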
Step 5: Constructing the Final Classification Model. Let $\{(x_i, y_i)\}$ be the training dataset, where $x_i$ denotes the ith DTI sample, represented as a feature vector after feature selection in Step 1, and $y_i \in \{+1, -1\}$. We can classify the unknown DTIs based on the standard SVM, in which $\varepsilon_i$ is a slack variable of $x_i$ that allows for misclassifications in the training examples, and C is used to balance the impact of $\varepsilon_i$. The test sample x is viewed as the positive class if $w \cdot \varphi(x) + b > 0$; otherwise, it is negative. Combining the standard SVM with the similarity weights of the ambiguous samples, we further introduced SVM-SW for finding DTI candidates, in which $\varepsilon_i$, $\varepsilon_j$, $\varepsilon_m$ and $\varepsilon_n$ are the error terms. $C_1$, $C_2$, $C_3$ and $C_4$ are penalty factors that are used to control the trade-off between margin and misclassification errors. $W_P(x_j)\varepsilon_j$ and $W_N(x_m)\varepsilon_m$ are errors with different weights. Different $W_P(x_j)$ and $W_N(x_m)$ reflect different effects of the parameters $\varepsilon_j$ and $\varepsilon_m$ on classification accuracy, respectively. A large value of $W_P(x_j)$ increases the effect of $\varepsilon_j$; therefore, the ambiguous example $x_j$ is more likely to belong to the positive class. Similarly, a smaller value of $W_N(x_m)$ reduces the effect of $\varepsilon_m$; therefore, $x_m$ is less significant toward the negative class.
Solving the Model: The model can be solved based on the method provided by refs 29 and 31. A test sample x is regarded as a positive DTI if $w \cdot \varphi(x) + b > 0$; otherwise, it is regarded as a negative DTI.
Scientific Reports | 7: 8087 | DOI:10.1038/s41598-017-08079-7

Experimental Setup and Evaluation Metrics.
Various performance measures have been proposed to evaluate DTI prediction models. Among these, precision, recall, AUC and F-measure are extensively used. Precision, recall and F-measure 26 are computed as in equations (13)-(15), where TP, FP, TN and FN represent true positives, false positives, true negatives and false negatives, respectively. Precision is the percentage of predicted DTIs that are correct and is used to measure the discriminative capability of a classifier. Recall is the percentage of true DTIs that are successfully predicted. F-measure is used to evaluate the average classification performance. A small value of either precision or recall will result in a low F-measure 30; therefore, F-measure is used to measure predictive models. AUC is the average area under the receiver operating characteristic curve. For these four metrics, higher values indicate better classification performance. We used these four metrics to evaluate our proposed PUDTI framework.
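For reference, the standard soft-margin SVM referred to above is conventionally written as in the first formula below. The second formula is only a sketch of a weighted objective that is consistent with the symbols defined in the text ($C_1$ to $C_4$, $W_P$, $W_N$ and the four error terms); the index sets P, U and RN attached to each sum, and the form of the constraints, are assumptions, since the original equations are not reproduced here.

```latex
% Standard soft-margin SVM
\min_{w,\,b,\,\varepsilon}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i}\varepsilon_i
\quad \text{s.t.}\quad y_i\,(w\cdot\varphi(x_i)+b)\ \ge\ 1-\varepsilon_i,\qquad \varepsilon_i \ge 0 .

% Assumed form of SVM-SW: ambiguous samples in U contribute to both classes,
% with their errors scaled by the similarity weights W_P and W_N
\min_{w,\,b,\,\varepsilon}\ \frac{1}{2}\lVert w\rVert^{2}
  + C_1\sum_{i\in P}\varepsilon_i
  + C_2\sum_{j\in U} W_P(x_j)\,\varepsilon_j
  + C_3\sum_{m\in U} W_N(x_m)\,\varepsilon_m
  + C_4\sum_{n\in RN}\varepsilon_n
\\
\text{s.t.}\quad
w\cdot\varphi(x_i)+b \ge 1-\varepsilon_i\ (i\in P),\quad
w\cdot\varphi(x_j)+b \ge 1-\varepsilon_j\ (j\in U),
\\
\phantom{\text{s.t.}\quad}
w\cdot\varphi(x_m)+b \le -1+\varepsilon_m\ (m\in U),\quad
w\cdot\varphi(x_n)+b \le -1+\varepsilon_n\ (n\in RN),\quad
\varepsilon \ge 0 .
```

Under this reading, setting all similarity weights to zero removes the ambiguous samples from the objective and recovers an ordinary SVM trained on P and RN with class-specific penalties $C_1$ and $C_4$.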
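The three count-based metrics can be computed as in the sketch below. It uses the usual definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), F-measure as their harmonic mean), which we assume match equations (13)-(15); the function name and the convention of returning 0 for an empty denominator are choices of this sketch.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure

# Toy counts: 8 correct predictions, 2 false alarms, 8 missed interactions.
precision, recall, f_measure = precision_recall_f(tp=8, fp=2, fn=8)
# precision = 0.8, recall = 0.5, F-measure = 0.8 / 1.3
```

Because the F-measure is a harmonic mean, it stays close to the smaller of precision and recall, which is the behaviour the text relies on when it says that a small value of either one yields a low F-measure.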