A novel one-class SVM based negative data sampling method for reconstructing proteome-wide HTLV-human protein interaction networks

Protein-protein interaction (PPI) prediction is generally treated as a problem of binary classification wherein negative data sampling is still an open problem to be addressed. The commonly used random sampling is prone to yield less representative negative data with considerable false negatives. Meanwhile rational constraints are seldom exerted on model selection to reduce the risk of false positive predictions for most of the existing computational methods. In this work, we propose a novel negative data sampling method based on one-class SVM (support vector machine, SVM) to predict proteome-wide protein interactions between HTLV retrovirus and Homo sapiens, wherein one-class SVM is used to choose reliable and representative negative data, and two-class SVM is used to yield proteome-wide outcomes as predictive feedback for rational model selection. Computational results suggest that one-class SVM is more suited to be used as negative data sampling method than two-class PPI predictor, and the predictive feedback constrained model selection helps to yield a rational predictive model that reduces the risk of false positive predictions. Some predictions have been validated by the recent literature. Lastly, gene ontology based clustering of the predicted PPI networks is conducted to provide valuable cues for the pathogenesis of HTLV retrovirus.

Protein-protein interaction (PPI) plays an important role in mediating biological processes, cellular signaling pathways and development of organismal systems. Accurate mapping of the proteome-wide interactome is a central problem of proteomics and systems biology. Although recent years have witnessed much progress in experimental identification and computational prediction of PPIs 1, a high false discovery rate is still a problem to be effectively addressed 1,2.
For instance, in vitro detection methods such as affinity purification are prone to capture false interactions, in vivo yeast two-hybrid (Y2H) is likely biased towards non-specific interactions 3, and gene co-expression that could induce synthetic lethality is not efficient for detecting pathogen-host protein interactions 4,5. Recent critical assessments of experimentally obtained PPI data suggest that these data exhibit an unacceptably high fraction of false positives and low agreement with each other [6][7][8]. Meanwhile, computational methods also run the risk of a high false discovery rate for the following reasons. Firstly, the experimentally identified PPI data are likely to contain a certain level of noise (false interactions). Secondly, the negative data needed for two-class PPI prediction are usually obtained by random sampling [9][10][11][12][13][14][15], which may introduce considerable false negatives. Thirdly, model selection is generally conducted by cross validation on the training PPI data, and the trained models, if used for proteome-wide predictions, are prone to overprediction. For pathogen-host PPI prediction, these issues become worse because the training data available are much smaller and less representative. Thus the inter-species models [12][13][14][15][16][17][18][19][20] are likely to yield more false positive predictions than the intra-species PPI prediction models [9][10][11].
At present the negative data required for computational reconstruction of PPI networks are in general not available. Recently some negative data from biological experiments have been collected into databases, e.g. the reference set of Negatome 21, but these negative data are not enough to train a two-class classifier. To meet the need of computational modeling, random sampling is often used to generate negative data [9][10][11][12][13][14][15]. The assumption behind random sampling is that the non-interactome space is much larger than the interactome space, so that random sampling hits the non-interactome space with a large probability and thus samples true negatives (non-interactions). However, random sampling, simple as it is, introduces uncertainty and complexity into the model behaviour. There are several major factors that affect model performance, such as the learning algorithm, the feature construction method and the data quality. The uncertainty introduced by random sampling makes it hard to tell which factor leads to poor model performance. For instance, Yu et al. 22 cast doubt on the PPI predictive ability of simple sequence k-mer feature construction, while Park et al. 23 argued that it was not the k-mer feature construction but the random sampling method that resulted in poor model performance. Whichever argument is right, the quality of negative data is undoubtedly critical to model performance. To obtain reliable negative data, Ben-Hur et al. 24 proposed to exclude subcellular co-localized proteins, and Mei 25 further showed that excluding subcellular co-localized proteins outperformed random sampling without introducing predictive bias. Intuitively, the negative data obtained by excluding subcellular co-localized proteins seem to be more reliable but less representative, because they do not represent the protein pairs that are subcellular co-localized but do not interact.
To make a detour around negative data sampling, one-class learning/clustering methods have been proposed for PPI prediction, e.g. association rule mining 17 , one-class SVM 26,27 , ensemble non-negative matrix factorization based clustering 28 , etc. These methods, though much simplified, are more likely to yield a large fraction of false positive predictions, because they do not learn the negative (non-interaction) patterns. A wise choice is not to evade negative data sampling but to properly ensure that the obtained negative data are reliable and representative.
The assumption behind selecting models by cross validation on the training PPI data is that a model optimally trained on the training data can generalize well to the gigantic unseen space of protein pairs. This assumption does not always hold true, especially when the training PPI data are rather small. To gain knowledge about the quality of model selection, one simple and natural method is to use the model to predict all possible (proteome-wide) protein pairs, or a large percentage of them, and then check the false positives. However, the lack of experimental evidence makes it hard to determine the false positive rate. Nevertheless, the rationality of the predictions can still be estimated through the predicted positive rate. Jansen et al. 2 have estimated that the expected number of negatives (non-interacting protein pairs) is several orders of magnitude higher than the number of positives (interacting protein pairs). This estimation can be used to check the quality of model selection. If the predicted positives account for a large percentage of the proteome-wide protein pairs (e.g. >50%), we can infer that the predictions go against the estimation in ref. 2 and thus contain a large fraction of false positives. Moreover, a large predicted positive rate contradicts the assumption of a large negative (small positive) space behind random sampling. If the model is trained on negative data obtained by random sampling (small positive space) and yet yields a large percentage of positives (large positive space), there is an obvious paradox between the assumption of random sampling and its outcome. After checking the outcomes of the random forest method 18, we find that the 25 Salmonella proteins are predicted to interact with 22,651 human proteins (nearly all known human proteins), indicating a certain degree of overprediction. It is therefore necessary to analyse the proteome-wide predictions and impose rational constraints on model selection.
For large-scale intra-species PPI prediction, the computation of model selection will be daunting, but the computation is acceptable for pathogen-host PPI prediction.
Feature construction is a third important concern of computational modeling for PPI prediction. As compared to intra-species PPI network reconstruction (e.g. yeast PPI network 9, Arabidopsis thaliana PPI network 10, human PPI network 11, etc.), inter-species pathogen-host PPI network reconstruction is more challenging in that the pathogen-host PPI data available are generally much smaller. To improve model performance, most of the existing methods leverage a catalog of biological feature information, e.g. binding motif, gene expression profile, gene co-expression, gene ontology, sequence k-mer, post-translational modification, protein structural information and PPI network topology [12][13][14]29,30, etc. Among these types of feature information, the sequence information of proteins achieves only moderate discriminative ability 22,23, though it is less expensive to obtain. Tastan et al. 12 have claimed that gene ontology (GO) is one of the strongest indicators for host-pathogen PPI prediction when combined with other feature information. Moreover, gene ontology alone has been reported to achieve satisfactory performance for pathogen-host PPI prediction 25 and intra-species PPI prediction 29. In spite of its strong discriminative ability, non-sequence information (e.g. gene ontology, spatial structural information, gene co-expression, etc.) has the drawback that the feature information is generally not complete. To overcome this drawback, proper substitution of incomplete feature information has been proposed 18,25.
In this work, we address the two concerns of negative data sampling and rational constraints on model selection to reliably reconstruct the proteome-wide protein interaction networks between HTLV retrovirus and Homo sapiens. We use one-class SVM to sample reliable and representative negative examples, and use two-class SVM proteome-wide predictive feedback as constraints on one-class SVM model selection. Reliability demands that the negative examples are distributed far away from the positive examples with low risk of false negatives, and representativeness demands that the negative examples supporting two-class decision boundary should be near to the positive examples so as to reduce the risk of false positives. The two seemingly opposite requirements suggest that a proper negative data sampling method should achieve good trade-off between reliability and representativeness. Here we propose two-class SVM proteome-wide predictive feedback to guide the search of one-class SVM hyperparameter space, such that the constrained model selection reduces the risk of false positive predictions. As for feature construction, we use gene ontology (GO) here to represent proteins in view of its strong discriminative ability of PPI prediction. To enrich GO feature information and make up for totally unannotated proteins, we conduct homolog knowledge transfer via independent homolog instances as reported in 31 . Lastly, we conduct gene ontology based clustering analysis of the predicted HTLV-human PPI networks to provide valuable cues for understanding the pathogenesis of HTLV retrovirus.

Data.
Human T-cell lymphotropic viruses (HTLV) belong to the family of retroviruses. The type 1 HTLV virus (HTLV-1) can induce Adult T-cell Leukemia/Lymphoma, whereas the type 2 HTLV virus (HTLV-2) does not show known pathogenesis, though it is closely related to HTLV-1 31. Simonis et al. 32 used high-throughput yeast two-hybrid (HT-Y2H) 33,34 to identify 166 interactions between HTLV and human proteins. Only three interactions related to HTLV-1 Tax (Nup62, MAD1L1, Cdc23) overlap with the 145 interactions from VirusMINT 35 and VirHostNet 36, accounting for a 2.1% recognition rate.
For convenience of reference, we call S1_pos the data from 32 and S2_pos the data from 35,36. Additionally, we call S3_pos the data from 37. We check the three datasets against the UniprotKB database (http://www.uniprot.org/uniprot/), and remove putative HTLV proteins and HTLV proteins that have no corresponding accessions in the Swissprot database (the manually annotated and reviewed part of UniprotKB). After filtering, S1_pos is reduced to 155 interactions, S2_pos is reduced to 144 interactions, and S3_pos contains the HTLV protein p30 only, with 42 interactions. We call S_pos (S_pos = S1_pos ∪ S2_pos ∪ S3_pos) the union of the three datasets, and thus S_pos contains 341 interactions. We sample an equal number of negative data for each HTLV protein in S_pos and thus obtain the corresponding negative data S_neg. The union of S_pos and S_neg, called S (S = S_pos ∪ S_neg), is used to train two-class SVM for proteome-wide HTLV-human PPI network reconstruction. To stringently demonstrate the model performance, we also use S1_pos and S2_pos as mutually independent test data and use S3_pos as literature validation.
www.nature.com/scientificreports SCIENTIFIC REPORTS | 5 : 8034 | DOI: 10.1038/srep08034
GO feature construction. Gene ontology (GO) is used as an indicator for HTLV-human PPI prediction, and GO feature construction is conducted as in 31. The homolog GO knowledge is treated as an independent instance (called the homolog instance) to augment the target instance (the GO information of the proteins themselves). The homologs are extracted from the SwissProt 57.3 database 38 using PSI-Blast 39 with default E-value = 10 against all species, and the GO terms are extracted from the GOA database 40. For each protein i, there are two sets of GO terms: the homolog set S_H^i contains the GO terms from the homologs, and the target set S_T^i contains the GO terms from the protein itself. Based on these notations, two feature vectors, the target instance B_T^(i1,i2) and the homolog instance B_H^(i1,i2), are formally defined for each protein pair (i1, i2) in formulas (1) and (2). The definition is symmetrical, so that protein pair (i1, i2) and protein pair (i2, i1) have identical feature representations. If either set of GO terms is empty, the feature vector is defined as null and should be removed.
One-class SVM based negative data sampling. One-class SVM was originally proposed for estimating the support of a high-dimensional distribution 41 and for detecting novelties/outliers 42. Unlike two-class classification, one-class SVM attempts to derive one decision boundary from the positive data alone, one side of which is positive and the other side of which is outlier. The decision boundary can be assumed to be a hyperplane 41,42 or a hypersphere 43. The hyperplane approach maps the data into a kernel space so as to construct a hyperplane that is maximally distant from the origin. Given the training vectors x_i ∈ R^n, i = 1, 2, …, l, that possess positive labels only, the primal problem of one-class SVM is formally defined as a quadratic program 42, in which ν ∈ (0,1) controls the upper bound on the fraction of outliers and the lower bound on the fraction of support vectors.
ξ_i is the slack variable, ρ denotes the offset, φ(x_i) is the mapping function and ω is the weight vector. The primal problem (2) corresponds to a dual problem 42 whose solution gives the coefficients α_i of the support vectors (α_i > 0), from which the decision function is defined. The kernel function k(x, y) is defined as the inner product of two mapping functions, i.e. k(x, y) = (φ(x)·φ(y)). For instance, the Gaussian kernel assumes the form

$$k(x, y) = \exp\left(-\gamma \|x - y\|^2\right)$$

where ||Δ|| denotes the 2-norm of vector Δ and the hyperparameter γ controls the flexibility of the kernel. One-class SVM was originally developed to learn the patterns inherent in the positive data and then use the patterns to discriminate outliers from the positive data 42. Recently, one-class SVM has been used for two-class classification 26,27 to avoid negative data sampling, the idea behind which is that the negative class is treated in the same way as the positive outliers. Unfortunately, the negative data generally do not share similar patterns with the positive outliers, and one-class SVM cannot properly define the two-class decision boundary without learning the negative patterns. Here we instead use one-class SVM to roughly confine the positive (+) region that contains the positive data and then sample negative data outside the region. The question is how large the positive (+) region should be. For convenience of description, we denote as the positive (+) region the opposite side of the hyperplane from the origin, and accordingly as the negative (-) region the other side of the hyperplane. The more distant the hyperplane is from the origin, the larger the positive (+) region will be. In this case, the negative (-) region is reduced and sampling in this space is supposed to be more reliable, but the positive (+) region is supposed to contain more errors (outliers and false positives).
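For reference, the primal problem, dual problem and decision function mentioned in this passage can be written in the standard ν-parameterized one-class SVM form. The block below is the textbook formulation, using the symbols defined in the text (ν, ξ_i, ρ, φ, ω, α_i); it is assumed, not confirmed, to coincide with the authors' typeset equations.

```latex
% Primal problem (standard one-class SVM; assumed to match the cited formulation)
\min_{\omega,\,\xi,\,\rho}\ \frac{1}{2}\|\omega\|^{2} - \rho + \frac{1}{\nu l}\sum_{i=1}^{l}\xi_i
\quad \text{s.t.}\quad \omega\cdot\phi(x_i) \ge \rho - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,l

% Dual problem
\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\,\alpha_j\,k(x_i, x_j)
\quad \text{s.t.}\quad 0 \le \alpha_i \le \frac{1}{\nu l},\quad \sum_{i=1}^{l}\alpha_i = 1

% Decision function: +1 on the positive (+) side, -1 on the negative (-) side
f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{l}\alpha_i\,k(x_i, x) - \rho\Big)
```

In this form, a protein pair with a negative value of the expression inside sgn falls on the origin side of the hyperplane, i.e. in the negative (-) region from which negative data are sampled.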
On the contrary, if the hyperplane is nearer to the origin, the positive (+) region is reduced and the negative (-) region is supposed to contain more false negatives. In a word, the dilemma is whether we should choose the hyperplane far away from the origin or near to the origin, or in other words, choose reliable negative data with a high false positive rate or reliable positive data with a high false negative rate. The dilemma, though theoretically unresolved 42, can be effectively solved by empirically tuning the parameter ν ∈ (0,1). One simple method is to define a series of values of ν to control the size of the positive (+) region. For each value of ν, together with the kernel parameter γ, we train a one-class SVM model to predict proteome-wide HTLV-human protein pairs and then choose a portion of reliable and representative negative data from the negative outcomes (predicted non-interactions). To achieve a proper trade-off between reliability and representativeness, we choose the predicted negatives that are centered around the negative outcomes; negatives that are too far or too near are discarded. Assuming there are n predicted negative data with outcomes R_i < 0, i = 1, …, n, we compute the mean and standard deviation of the outcomes and choose the negative data within the index set I defined by formula (7). To reduce the risk of model bias, the size of the chosen negative data is equal to the size of the positive data (assuming N). We further choose the negative data within the indices defined by formula (7) with large outcome values.
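The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `select_negatives`, the dictionary input and the `width` parameter are hypothetical, and the band of one standard deviation around the mean is an assumption, since the exact bounds of formula (7) are not available here. The final ranking by absolute outcome follows the ordering stated for the chosen indices.

```python
from statistics import mean, pstdev


def select_negatives(outcomes, n_select, width=1.0):
    """Pick negative examples from one-class SVM decision values.

    outcomes : dict mapping a protein-pair id to its decision value R_i
    n_select : number of negatives to keep (N, the size of the positive set)
    width    : half-width of the accepted band in standard deviations
               (ASSUMPTION: the exact bounds of formula (7) are unknown)
    """
    # Keep only predicted negatives, i.e. pairs with R_i < 0.
    negatives = {pair: r for pair, r in outcomes.items() if r < 0}
    if not negatives:
        return []

    values = list(negatives.values())
    mu = mean(values)
    sigma = pstdev(values)

    # Discard negatives that lie too far from or too near to the boundary.
    band = {pair: r for pair, r in negatives.items()
            if mu - width * sigma <= r <= mu + width * sigma}

    # Within the band, keep the N pairs with the largest |R_i|.
    ranked = sorted(band, key=lambda pair: abs(band[pair]), reverse=True)
    return ranked[:n_select]


if __name__ == "__main__":
    # Hypothetical decision values for six candidate protein pairs.
    scores = {"a": 0.5, "b": -0.1, "c": -1.0, "d": -1.1, "e": -0.9, "f": -3.0}
    print(select_negatives(scores, 2))
```

Choosing `n_select` equal to the number of known interactions keeps the two-class training data balanced, as the text requires.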

$$I_{neg} = \left\{ I_1, I_2, \ldots, I_N \;\middle|\; |R_{I_1}| > |R_{I_2}| > \cdots > |R_{I_N}| > \cdots > |R_{I_{|I|}}|,\ I_1, I_2, \ldots \in I \right\}$$

where |I| denotes the cardinality of set I. Using the negative sampling method described above, we obtain the corresponding negative data for S1_pos, S2_pos and S3_pos, denoted as S1_neg, S2_neg and S3_neg, respectively. Then the three datasets for two-class SVM training are defined as S1 = S1_pos ∪ S1_neg, S2 = S2_pos ∪ S2_neg and S3 = S3_pos ∪ S3_neg. The final training data for proteome-wide HTLV-human PPI prediction is defined as the union of the three datasets, S = S1 ∪ S2 ∪ S3.
Two-class SVM prediction. For each parameter pair (ν, γ), one-class SVM yields one negative dataset S_neg, based on which we train a two-class SVM for novel HTLV-human PPI prediction. Unlike one-class SVM, two-class SVM attempts to maximize the margin between the two classes. The primal problem of two-class SVM is defined in formula (11) 44, where y_i denotes the class label of data point x_i and the parameter ν controls the upper bound on the fraction of training errors and the lower bound on the fraction of support vectors. The parameter ν of one-class SVM affects the quality of the sampled negative data, while the parameter ν of two-class SVM affects the generalization ability of the two-class predictive model. Comparing formula (3) with formula (11), we can see that two-class SVM needs the information of data labels but one-class SVM does not. The primal problem of formula (11) corresponds to a dual problem; solving this optimization problem, we obtain the coefficients of the support vectors (α_i > 0), from which the decision function f is defined. Like one-class SVM, two-class SVM also has one parameter pair (ν, γ) to be empirically tuned on the training data (γ denotes the Gaussian kernel parameter). Here leave-one-out cross validation (LOOCV) is used to tune the parameter pair (ν, γ). After parameter tuning, the trained two-class SVM is used to predict proteome-wide HTLV-human protein pairs.
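For comparison with the one-class case, the ν-parameterized two-class SVM can be written in its standard form as follows. This is again the textbook formulation in the notation of the text (with an additional bias term b), offered as an assumption about what formula (11) and the equations derived from it look like rather than as a transcription of them.

```latex
% Primal problem of the two-class nu-SVM (standard form; assumed)
\min_{\omega,\,b,\,\xi,\,\rho}\ \frac{1}{2}\|\omega\|^{2} - \nu\rho + \frac{1}{l}\sum_{i=1}^{l}\xi_i
\quad \text{s.t.}\quad y_i\big(\omega\cdot\phi(x_i) + b\big) \ge \rho - \xi_i,\quad \xi_i \ge 0,\quad \rho \ge 0

% Dual problem
\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\,\alpha_j\,y_i\,y_j\,k(x_i, x_j)
\quad \text{s.t.}\quad 0 \le \alpha_i \le \frac{1}{l},\quad \sum_{i=1}^{l}\alpha_i \ge \nu,\quad \sum_{i=1}^{l}\alpha_i y_i = 0

% Decision value and predicted label
f(x) = \sum_{i=1}^{l}\alpha_i\,y_i\,k(x_i, x) + b, \qquad \hat{y}(x) = \operatorname{sgn}\big(f(x)\big)
```

The class labels y_i enter both the constraints and the decision value, which is the difference from the one-class problem noted in the text.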
As described in formula (1) and formula (2), each test protein pair (i1, i2) is represented by the target instance B_T^(i1,i2) and the homolog instance B_H^(i1,i2), thus the two-class SVM decision function f yields two outputs, f(B_T^(i1,i2)) and f(B_H^(i1,i2)), for the two instances. The final decision value for protein pair (i1, i2) is then defined from the two outputs, with |·| denoting the absolute value, and the final label for protein pair (i1, i2) is defined from the final decision value.
Proteome-wide predictive feedback constrained model selection. A series of one-class SVM parameter pair (ν, γ) values yields a series of candidate negative datasets S_neg. The question is how to determine the quality of the negative data. The common practice is to conduct model evaluation by k-fold cross validation or leave-one-out cross validation (LOOCV) on the training data S = S_pos ∪ S_neg, and then choose the negative data S_neg that achieves the best model performance. However, cross validation on the training data is not enough to demonstrate the true generalization ability. A model that behaves well on the training data is still likely to yield overpredictions, like the random forest method for pathogen-host PPI prediction 18. The rationality of the predictions therefore needs to be verified. Jansen et al. 2 have proposed the doctrine that the expected number of negatives (non-interacting protein pairs) is several orders of magnitude higher than the number of positives (interacting protein pairs). This doctrine can be used to check the rationality of proteome-wide predictions. Assuming there are p protein pairs to be predicted, of which p^+ pairs are predicted as positive (interactions) and p^- pairs are predicted as negative (non-interactions) (p = p^+ + p^-), the model can be accepted only if the following rule is observed:

$$K = \frac{p^-}{p^+} > 1$$

Otherwise, there is a high risk of false positive predictions. Here we use formula (16) as a constraint on the model selection of one-class SVM.
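The constraint of formula (16) is simple enough to check in code. The sketch below uses hypothetical function names and takes counts or percentages of predicted positives and negatives; the example values are the predicted positive rates reported in the Results (24.97%, 33.70% and 88.35%).

```python
def negative_to_positive_ratio(n_positive, n_negative):
    """Return K, the ratio of predicted negatives to predicted positives."""
    if n_positive == 0:
        # No predicted positives at all: K is unbounded.
        return float("inf")
    return n_negative / n_positive


def passes_feedback_constraint(n_positive, n_negative, threshold=1.0):
    """Accept a model only if K exceeds the threshold (K > 1 in formula (16))."""
    return negative_to_positive_ratio(n_positive, n_negative) > threshold


if __name__ == "__main__":
    # Percentages can be used in place of counts because K is a ratio.
    for positive_rate in (24.97, 33.70, 88.35):
        k = negative_to_positive_ratio(positive_rate, 100.0 - positive_rate)
        accepted = passes_feedback_constraint(positive_rate, 100.0 - positive_rate)
        print(f"positive rate {positive_rate:.2f}% -> K = {k:.2f}, accepted: {accepted}")
```

A model with 88.35% predicted positives gives K well below 1 and is rejected, whereas the two lower rates correspond to K values of about 3.00 and 1.97.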
The parameter pair (ν, γ) with larger K and good two-class SVM LOOCV performance is preferred. Two-class SVM LOOCV performance is estimated with multiple performance metrics, such as ROC-AUC (Receiver Operating Characteristic - Area Under Curve), PR-AUC (Precision-Recall curve AUC), SP (Specificity), SE (Sensitivity) and MCC (Matthews correlation coefficient). SP, SE and MCC can be derived from the confusion matrix M. Formula (17) defines several intermediate variables p_l, q_l, r_l and s_l, from which we can calculate SP_l, SE_l and MCC_l for each label as in formula (18), and calculate the overall MCC as in formula (19). For each label, MCC_l takes the form

$$MCC_l = \frac{p_l q_l - r_l s_l}{\sqrt{(p_l + r_l)(p_l + s_l)(q_l + r_l)(q_l + s_l)}}, \quad l = 1, 2, \ldots, L$$

where the confusion matrix entry M_{i,j} records the count of class i instances that are classified to class j, and L denotes the number of labels. AUC is calculated based on the decision values of two-class SVM.
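For the two-class case, the metrics can be computed directly from the four cells of the confusion matrix. The sketch below uses the standard binary definitions of sensitivity, specificity, accuracy and MCC; it assumes that p_l and q_l play the roles of true positives and true negatives, and it may differ in detail from the per-label formulas (17) to (19), which are not fully available here. The function name and the example counts are hypothetical.

```python
from math import sqrt


def binary_metrics(tp, fn, fp, tn):
    """Compute SE, SP, accuracy and MCC from a 2x2 confusion matrix.

    tp, fn : positives classified as positive / negative
    fp, tn : negatives classified as positive / negative
    """
    se = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    total = tp + fn + fp + tn
    acc = (tp + tn) / total if total else 0.0   # accuracy
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "Accuracy": acc, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical LOOCV counts, for illustration only.
    print(binary_metrics(tp=80, fn=20, fp=10, tn=90))
```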

Results
Proteome-wide negative data sampling. One-class SVM parameter pair (ν, γ) tuning. The search space of the one-class SVM parameter pair (ν, γ) is very large. To reduce the computational complexity, we narrow down the space of ν and γ to the set {2^m | −11 ≤ m ≤ −1, m ∈ Z}. For simplicity of notation, the set is sorted in descending order, and we use ν and γ to denote the index of the set elements: ν = i denotes that ν assumes the value 2^(−i), and γ = j denotes that γ assumes the value 2^(−j). The two parameters are empirically tuned by leave-one-out cross validation on the positive data S_pos (S_pos = S1_pos ∪ S2_pos ∪ S3_pos). Each parameter pair (ν, γ) value trains one one-class SVM model (denoted as OCSVM(ν, γ)), and OCSVM(ν, γ) yields the corresponding LOOCV performance, e.g. the recognition rate of the known PPIs. We split all the achieved LOOCV performances into eight ranges (Figure 1). The OCSVM(ν, γ) that achieves a higher recognition rate is supposed to yield a smaller negative (-) region, implying that sampling negative data in this region will be more reliable but less representative. After obtaining the eight trained OCSVM(ν, γ) models, we then use OCSVM(ν, γ) to conduct proteome-wide negative data sampling.
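The index notation for the parameter grid can be made concrete with a short sketch. The helper names below are hypothetical; the only facts taken from the text are the set {2^m | −11 ≤ m ≤ −1} and the convention that index i stands for the value 2^(−i).

```python
def index_to_value(index):
    """Map a grid index i (1..11) to the parameter value 2**(-i)."""
    if not 1 <= index <= 11:
        raise ValueError("index must be between 1 and 11")
    return 2.0 ** (-index)


def parameter_grid():
    """Enumerate all (nu index, gamma index, nu value, gamma value) tuples."""
    return [(i, j, index_to_value(i), index_to_value(j))
            for i in range(1, 12)
            for j in range(1, 12)]


if __name__ == "__main__":
    grid = parameter_grid()
    print(len(grid), "candidate parameter pairs")
    # The pair written as (nu = 3, gamma = 7) in the text:
    print(index_to_value(3), index_to_value(7))
```

Under this convention, the model written as OCSVM(ν = 3, γ = 7) uses ν = 0.125 and γ = 0.0078125.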
Negative data sampling from OCSVM(ν, γ) predicted negatives. Now we use the trained OCSVM(ν, γ) models to predict all unseen HTLV-human protein pairs, and then obtain eight negative datasets from the predicted negatives according to formulas (7)-(9). There are 10 HTLV proteins in the training data S_pos, and the human proteins are taken from the Swissprot database 38. After excluding the known HTLV-targeted human proteins in S_pos and the protein pairs (i1, i2) that satisfy B_T^(i1,i2) = null ∧ B_H^(i1,i2) = null, we obtain the whole search space for each HTLV protein as shown in Table 1. The predicted positive rates yielded by the eight trained OCSVM(ν, γ) models are illustrated with brown bars in Figure 1. From Figure 1, we can see that the better the LOOCV performance (recognition rate of positive data, bars in dark blue) OCSVM(ν, γ) achieves, the more protein pairs are predicted to be positive (bars in brown). Moreover, with the increase of LOOCV performance, the ratio of predicted negative rate to predicted positive rate decreases to less than 1 (see the latter four representative parameter pairs (ν, γ)), which violates the rule (K > 1) defined in formula (16). For instance, OCSVM(ν = 3, γ = 7) achieves an 88.35% predicted positive rate (bar in brown), which is far beyond rational scope. If we choose OCSVM(ν = 3, γ = 7) only because of its 90.94% LOOCV performance (bar in dark blue), we will run a high risk of false positive predictions. Thus one should be cautious about accepting a trained one-class SVM based only on its cross validation performance on the training data without examining the rationality of proteome-wide predictions.
Proteome-wide predictive feedback constrained model selection. Two-class SVM performance evaluation. For each representative parameter pair (ν, γ), OCSVM(ν, γ) yields one training dataset S(ν, γ), based on which we train one two-class SVM denoted as TCSVM(ν, γ). Like one-class SVM, two-class SVM also has two parameters to be empirically tuned, denoted as (ν′, γ′) to distinguish them from the one-class SVM parameters (ν, γ). Here (ν′, γ′) is tuned by leave-one-out cross validation within the parameter space {2^m | −5 ≤ m ≤ −1, m ∈ Z}. Since (ν′, γ′) is not of further interest here, it is not mentioned again. The LOOCV ROC curves of the eight TCSVM(ν, γ) models are illustrated in Figure 2. In terms of AUC scores, the eight TCSVM(ν, γ) models all achieve sound LOOCV performance with AUC score ≥ 0.8807. Other LOOCV performance metrics (Accuracy, MCC) are shown in Figure 3. The upper sub-part plots the bar chart of Accuracy and MCC for S(ν, γ) and the lower three sub-parts for S1(ν, γ), S2(ν, γ) and S3(ν, γ). Except for the second model, TCSVM(ν = 10, γ = 4), all the other TCSVM(ν, γ) models achieve >80% Accuracy and >0.68 MCC. Comparing Figure 1, Figure 2 and Figure 3, we can see that the higher the LOOCV performance OCSVM(ν, γ) achieves, the higher the LOOCV performance TCSVM(ν, γ) also achieves on the negative data yielded by OCSVM(ν, γ). The results are not surprising. Higher OCSVM(ν, γ) LOOCV performance suggests that OCSVM(ν, γ) achieves a larger positive (+) region and a smaller negative (-) region of the hyperplane. Thus the negative data predicted by OCSVM(ν, γ) are more reliable and more easily discriminated from the positive data by TCSVM(ν, γ). However, the negative data sampled in a smaller negative (-) region of the hyperplane are supposed to be less representative, so that many so-called unreliable negative data will be misclassified to the positive class, i.e. false positive predictions or overpredictions.
For this reason, the quality of the negative data yielded by OCSVM(ν, γ) should be subjected to further verification by proteome-wide TCSVM(ν, γ) predictive feedback.
TCSVM(ν, γ) outcomes constrained model selection. Similar to OCSVM(ν, γ), the proteome-wide prediction space for each HTLV protein is collected by excluding those human proteins in S(ν, γ) that the HTLV protein interacts with or does not interact with. For most HTLV proteins, the number of human proteins to be predicted is over 20,000, thus there are more than 200,000 protein pairs to be predicted. The predicted positive rates for the eight TCSVM(ν, γ) models are shown in Figure 4. Except for the first three TCSVM(ν, γ) models, the latter five TCSVM(ν, γ) from [ν = 1, γ = 5] to [ν = 3, γ = 7] all achieve > 50% predicted positive rate with K less than 1 (K is defined in formula (16)) and are thus excluded from our options. The first TCSVM(ν = 1, […] Proteome-wide predicted positive rate is an effective metric to validate the rationality of predictions. To choose a proper model from TCSVM(ν = 1, γ = 3), TCSVM(ν = 10, γ = 4) and TCSVM(ν = 1, γ = 3), we further propose the metric percentage of HTLV-targeted human proteins for the final model selection (see Figure 5). As shown in Figure 5, the latter six TCSVM(ν, γ) models all predict > 60% of human proteins to be targeted by HTLV proteins, the first TCSVM(ν = 1, γ = 3) predicts 51.25% interacting human partners and the second TCSVM(ν = 10, γ = 4) predicts 55.81% interacting human partners.
[Figure caption fragment: … OCSVM(ν, γ), whose predicted negatives will be sampled as negative data. The lower part shows the predicted positive rate achieved by TCSVM(ν = 1, γ = 3), which is used as constraint on OCSVM(ν, γ) model selection.]
The percentage of predicted human partners seems to be relatively high, partly because the known PPI dataset is small. To further choose the final model from TCSVM(ν = 1, γ = 3) and TCSVM(ν = 10, γ = 4), we provide in Figure 6 the details of the percentage of human partners predicted to be targeted by each HTLV protein.
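The two percentages used for the final model selection can be computed from the predicted interactions as sketched below. The exact denominators used for Figure 5 and Figure 6 are not stated here, so this sketch assumes that both are taken relative to the total number of candidate human proteins; the function name and the human protein identifiers are hypothetical.

```python
def targeted_percentages(predictions, n_human):
    """Summarise predicted HTLV-human interactions.

    predictions : dict mapping an HTLV protein to the set of human proteins
                  it is predicted to interact with
    n_human     : total number of candidate human proteins (ASSUMED denominator)

    Returns (overall, per_virus): the percentage of human proteins targeted by
    at least one HTLV protein, and the percentage targeted by each HTLV protein.
    """
    per_virus = {virus: 100.0 * len(partners) / n_human
                 for virus, partners in predictions.items()}
    # A human protein counts once even if several HTLV proteins target it.
    targeted = set().union(*predictions.values())
    overall = 100.0 * len(targeted) / n_human
    return overall, per_virus


if __name__ == "__main__":
    # Toy example with four hypothetical human proteins H1..H4.
    toy = {"Tax": {"H1", "H2"}, "p30": {"H2", "H3"}}
    print(targeted_percentages(toy, n_human=4))
```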
The test data S3_pos contains HTLV p30 only and the training data S1 ∪ S2 does not contain HTLV p30, so it is not surprising that TCSVM(ν = 1, γ = 3) trained on S1 ∪ S2 achieves a low recognition rate on S3_pos. But the result is still promising as compared to experimental siRNA screens (10% recognition rate) 13.
Comparison with random sampling. Random sampling is simple and unbiased, but is prone to yield less reliable and less representative negative data. For comparison, a negative dataset S random neg of equal size to the positive data S pos is obtained using random sampling. We then train a two-class SVM, denoted TCSVM random, on the data S random = S pos ∪ S random neg. The comparative LOOCV ROC curves of TCSVM (n = 1, c = 3) and TCSVM random are shown in Figure 7. TCSVM (n = 1, c = 3) performs better than TCSVM random, with AUC score 0.8917 versus 0.8304. TCSVM (n = 1, c = 3) also shows better LOOCV performance than TCSVM random, with (Accuracy = 0.8158, MCC = 0.6812) versus (Accuracy = 0.7778, MCC = 0.6239). In addition, we conduct proteome-wide predictions using TCSVM random. The computational results show that TCSVM random achieves a 24.97% proteome-wide predicted positive rate, lower than that of TCSVM (n = 1, c = 3) (33.70%), suggesting a relatively lower risk of false positive predictions. TCSVM random achieves a K value of 3.00, higher than the K value of TCSVM (n = 1, c = 3) (1.97). The K value defined in formula (16) is proposed to roughly estimate the rationality of predictions. In general, a low K value (≤ 1) suggests a high risk of false positive predictions, which can be used as a constraint on model selection. It is hard to accurately define the upper and lower bounds of the K value, and a high K value does not always imply a good model: too high a K value may suggest a high false negative rate and insufficient predictive ability. We should seek a proper trade-off between the proteome-wide prediction based K value, the training data based cross validation performance and the literature evidence based independent test performance.
Here we choose TCSVM (n = 1, c = 3) for the following reasons: (1) TCSVM (n = 1, c = 3) achieves better LOOCV performance; (2) TCSVM (n = 1, c = 3) confines the space of negative data sampling, thus the obtained negative data are more reliable and representative; (3) TCSVM (n = 1, c = 3) and OCSVM (n = 1, c = 3) attempt to achieve a proper trade-off between false positives and false negatives.
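The random sampling baseline and the two scalar metrics quoted in the comparison above (Accuracy and MCC) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the fixed random seed and the toy identifiers are hypothetical, and the confusion counts in the usage lines are invented for demonstration rather than taken from the paper.

```python
import math
import random
from itertools import product


def sample_random_negatives(viral, human, positives, seed=0):
    """Draw as many non-positive (viral, human) pairs as there are positives."""
    candidates = sorted(set(product(viral, human)) - set(positives))
    if len(candidates) < len(positives):
        raise ValueError("not enough candidate pairs to match the positive set")
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    return rng.sample(candidates, len(positives))


def accuracy_and_mcc(tp, tn, fp, fn):
    """Accuracy and Matthews correlation coefficient from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Toy data (illustrative only).
viral = ["tax", "rex"]
human = ["H1", "H2", "H3"]
positives = [("tax", "H1"), ("rex", "H2")]
negatives = sample_random_negatives(viral, human, positives)
accuracy, mcc = accuracy_and_mcc(tp=40, tn=40, fp=10, fn=10)
print(negatives, accuracy, mcc)
```

Because the candidates are drawn uniformly from all non-positive pairs, nothing prevents an unobserved true interaction from being sampled as a negative, which is the weakness of random sampling discussed above.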
Proteome-wide HTLV-human PPI networks reconstruction. As described above, TCSVM (n = 1, c = 3) that ... (14)), the predicted interactions and the predicted non-interactions will be more reliable with lower risk of false predictions. The rapidly reconstructed HTLV-human PPI networks provide valuable cues for further biomedical research. Gene ontology based clustering analysis of the predicted networks will be discussed in the next section.
Literature validation of PPI predictions. The K value is useful to check the rationality of proteome-wide predictions, and literature validation is further needed to check their reliability. However, the existing experimental evidence is sparsely scattered over hundreds of biomedical publications, which makes it hard to collect enough data to validate the predictions. Nevertheless, we manage to find 20 novel experimental PPIs that are correctly recognized by our proposed TCSVM (n = 1, c = 3) (see Table 3). The PPIs given in Table 3 have not been collected into the training data S (n, c), though some PPIs were found much earlier than 32. For instance, HTLV1 p30 is found to interact with Cyclin E and CDK2 to affect their complex formation and thus to delay S phase entry 45. Nakano et al. 46 proposed that HTLV1 p30 may interact with nucleoporin NUP62 and tumor suppressor LZTS2. HTLV1 tax has been found to interact with NEMO, OPTN, RELB and IKKE 47, and the interaction between HTLV1 tax and Mdm2 results in the degradation of FoxO4, a transcription factor and tumor suppressor of the Akt signaling pathway 48. In 49, HTLV1 hbz is reported to directly inhibit the acetyl transferase activity of p300/CBP. In 50, HTLV1 hbz is reported to interact with SMAD2/3/4. In 51, HTLV2 tax2 is reported to interact with BECN1, the key component of autophagy pathways, to connect the IKK complex to autophagy pathways. In 52, it is reported that the direct interaction between CIITA and Tax2 inhibits the oncogenic retrovirus replication in infected cells. It is hard to manually extract all the related experimental PPIs from such scattered literature, so we give only dozens of examples as shown in Table 2. The 20 pieces of experimental evidence help to validate the reliability of TCSVM (n = 1, c = 3) proteome-wide predictions.
The number of experimental direct PPIs is very limited, so we also find some indirect evidence to further validate the reliability of TCSVM (n = 1, c = 3) predictions. Taylor et al. 53 assessed the effect of p30 on cellular RNA transcript expression and nuclear export, and reported the genes down-regulated and up-regulated by HTLV1 protein p30. The alteration of host cellular transcript expression may indicate that there is a direct or functional (indirect) interaction between p30 and the up- or down-regulated genes. Hence we conduct overlap analysis between TCSVM ...
Discussion
Biological experiments generally focus on positive phenomena such as interaction, binding, modification, activation, expression, response, etc., whereas the corresponding negative phenomena attract less attention. Actually the negative phenomena also benefit our understanding of the positive patterns and especially facilitate computational modeling. Because experimental negative data are seldom available, a proper negative data sampling method is highly desired to sample reliable and representative negative data. In this work, we use one-class SVM to confine the space of negative data sampling for the sake of reliability, and sample the centred negatives (μ − σ, μ + σ) for the sake of representativeness. To validate the quality of the sampled negative data and to select a proper one-class SVM parameter pair (n, c), we calculate the K value and the predicted positive rate of the two-class SVM proteome-wide predictions, and use them as constraints on one-class SVM model selection. The computational results show that the final OCSVM (n = 1, c = 3) yields quality negative data to train the predictive model TCSVM (n = 1, c = 3). TCSVM (n = 1, c = 3) has been empirically demonstrated to show good LOOCV performance, good independent test performance and rational proteome-wide predictions.
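The idea of sampling centred negatives can be sketched as below. This is only one possible reading, stated as an assumption: candidate pairs with a negative one-class decision score are treated as predicted negatives, the mean μ and standard deviation σ are computed over those scores, and only the pairs whose score falls strictly inside (μ − σ, μ + σ) are kept. The sign convention, the use of the population standard deviation and all names and scores are hypothetical.

```python
from statistics import mean, pstdev


def centred_negatives(scores):
    """Keep predicted negatives whose score lies within one std of the mean.

    `scores` maps a candidate pair to a one-class decision score; a score
    below zero is assumed (for this sketch) to mean "predicted negative".
    """
    negatives = {pair: s for pair, s in scores.items() if s < 0}
    if not negatives:
        return []
    mu = mean(negatives.values())
    sigma = pstdev(negatives.values())
    if sigma == 0:  # all scores identical: every predicted negative is central
        return sorted(negatives)
    return sorted(pair for pair, s in negatives.items()
                  if mu - sigma < s < mu + sigma)


# Toy decision scores (illustrative only).
toy_scores = {
    ("tax", "H1"): 0.5,   # predicted positive, never sampled
    ("tax", "H2"): -0.1,  # close to the boundary, less reliable
    ("tax", "H3"): -1.0,
    ("rex", "H1"): -1.1,
    ("rex", "H2"): -0.9,
    ("rex", "H3"): -3.0,  # extreme outlier, less representative
}
selected = centred_negatives(toy_scores)
print(selected)
```

Dropping the pairs near the decision boundary serves reliability, while dropping the far outliers serves representativeness.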
Here we further conduct gene ontology based clustering analysis of the predicted HTLV-human PPI networks to gain insight into the general patterns by which HTLV viruses attack human proteins.
To further validate the sampled negative data, we conduct leave-one-out cross validation (LOOCV) and literature validation. The performance metrics ROC-AUC, SP, SE, Accuracy and MCC demonstrate that the two-class SVM TCSVM (n = 1, c = 3) trained on the obtained negative data achieves good LOOCV performance and a rational predicted positive rate, yielding a low risk of false positive predictions.
Lastly, gene ontology based clustering analysis of the predictions reveals some significant HTLV-targeted signaling pathways and human proteins that fulfil critical molecular functions, which provides much insight into the pathogenesis of HTLV retroviruses. To gain knowledge about how the HTLV proteins interfere with the host signaling pathways, which host cellular functions the HTLV proteins are prone to disrupt, and where the interactions occur, we cluster all the predicted interactions into three major classes according to GO terms, i.e. biological processes (P), molecular functions (F) and cellular compartments (C). Here we use the gene ontology term (GO term) as distance metric, i.e. the human partners that possess the same GO term are assigned to the same cluster. Thus each cluster of human proteins defines a biological module that reveals the general behaviour patterns of HTLV viruses. To distinguish the patterns that all the 10 HTLV proteins observe from the patterns that only several HTLV proteins observe, we further split each cluster into two sub-clusters: one sub-cluster embraces all the 10 HTLV proteins (denoted as P1, F1 and C1), and the other sub-cluster embraces only a part of the HTLV proteins (denoted as P2, F2 and C2). P1, F1 and C1 are given in Supplementary Section 7, Section 8 and Section 9, respectively. P2, F2 and C2 are given in Supplementary Section 10, Section 11 and Section 12, respectively. Because of the large number of biological modules (clusters), we demonstrate only several biological modules as examples; interested readers are referred to Supplementary Section 7 ~ Supplementary Section 12 for other biological cues.
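The clustering rule described above, in which human partners sharing a GO term fall into the same cluster and clusters are then separated by whether all viral proteins or only some of them are involved, can be sketched as follows. The data layout, function names and toy annotations are assumptions made for illustration; the split is interpreted here at the level of GO-term modules within one GO class.

```python
from collections import defaultdict


def cluster_by_go_term(interactions, go_annotations):
    """Group predicted (viral, human) interactions by the GO terms of the human partner."""
    clusters = defaultdict(lambda: {"human": set(), "viral": set()})
    for viral, human in interactions:
        for term in go_annotations.get(human, ()):
            clusters[term]["human"].add(human)
            clusters[term]["viral"].add(viral)
    return dict(clusters)


def split_by_viral_coverage(clusters, all_viral):
    """Separate modules targeted by every viral protein from those targeted by a subset."""
    everyone = set(all_viral)
    full = {t: c for t, c in clusters.items() if c["viral"] == everyone}
    partial = {t: c for t, c in clusters.items() if c["viral"] != everyone}
    return full, partial


# Toy data (illustrative only; the annotations are not real).
toy_interactions = [("tax", "H1"), ("rex", "H1"), ("tax", "H2")]
toy_go = {"H1": {"GO:0007219"}, "H2": {"GO:0046426"}}
modules = cluster_by_go_term(toy_interactions, toy_go)
full, partial = split_by_viral_coverage(modules, {"tax", "rex"})
print(sorted(full), sorted(partial))
```

Running the same two steps separately on biological process, molecular function and cellular compartment annotations would give the analogues of P1/P2, F1/F2 and C1/C2.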
PPI Sub-network GO:0007219 - Notch signaling pathway. The Notch signaling pathway plays an important role in cell proliferation, differentiation and apoptosis. Recent research has suggested that constitutive activation of the Notch signaling pathway is essential to the pathogenesis of HTLV-1 associated adult T-cell leukemia (ATL), and that the inhibition of Notch signaling by γ-secretase inhibitors reduces tumor cell proliferation and tumor formation in ATL-engrafted mice 54. In this work, TCSVM (n = 1, c = 3) predicts 545 interactions between the 10 HTLV proteins and 65 human proteins that are involved in the Notch signaling pathway. We use the biological process GO term GO:0007219 to denote the predicted PPI sub-network. The PPI sub-network GO:0007219 is extracted from Supplementary Section 7 and is illustrated in panel A of Figure 8. The HTLV proteins are denoted with diamonds and the human proteins with ellipses. From Figure 8, we can see that the 10 HTLV proteins are densely connected with 50 ~ 60 Notch signaling proteins. Interestingly, the 10 HTLV proteins are frequently predicted to target the same human protein simultaneously, i.e. the degree of the human protein is 10 in the PPI sub-network GO:0007219. In the predicted PPI sub-network, there are 40 human proteins with degree 10 and 10 human proteins with degree 9. In the experimental network S pos, we also find that more than one HTLV protein may target the same human protein. In S pos, there are 43 human proteins that interact with more than one HTLV protein, e.g. the human protein EWS is targeted by 5 HTLV proteins {HTLV1 rex; HTLV1 tax; HTLV2 gag; HTLV2 rex; HTLV2 tax2}. A human protein that is targeted by ...
... 57,58. In 57, it is reported that immune stimuli on the T cell receptor signaling pathway may activate HTLV-1 gene expression and cellular gene expression. In 58, it is stated that HTLV-1 dysregulates common T-cell activation pathways for the virus to establish persistent infection.
In this work, TCSVM (n = 1, c = 3) predicts 250 interactions between the 10 HTLV proteins and 33 human proteins that are involved in the T cell receptor signaling pathway. PPI Sub-network GO:0050852 is extracted from Supplementary Section 7 and is illustrated in panel B of Figure 8. The predicted PPI sub-network is less densely connected than PPI Sub-network GO:0007219. There are 10 human proteins with degree 10 and 14 human proteins with degree 9. The human proteins targeted by multiple HTLV proteins may also fulfil critical molecular functions. For example, the human protein THMS1 (Q8N1K5) is predicted to be targeted by all the 10 HTLV proteins. According to UniProt annotations, THMS1 plays a central role in late thymocyte development and regulates T-cell development through T-cell antigen receptor (TCR) signaling (http://www.uniprot.org/uniprot/Q8N1K5).
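The degree statistics reported for the sub-networks above (how many human proteins are targeted by 10, 9 or fewer HTLV proteins) amount to counting distinct viral neighbours per human protein. A minimal sketch follows, assuming the sub-network is available as a list of (viral, human) edges; the names and toy edges are hypothetical.

```python
from collections import Counter


def human_degrees(edges):
    """Number of distinct viral proteins predicted to target each human protein."""
    neighbours = {}
    for viral, human in edges:
        neighbours.setdefault(human, set()).add(viral)  # sets ignore duplicate edges
    return {human: len(partners) for human, partners in neighbours.items()}


def degree_histogram(degrees):
    """Map each degree value to the number of human proteins having that degree."""
    return Counter(degrees.values())


# Toy sub-network (illustrative only), including one duplicated edge.
toy_edges = [
    ("tax", "H1"), ("rex", "H1"), ("p30", "H1"),
    ("tax", "H2"), ("tax", "H2"),
]
degrees = human_degrees(toy_edges)
histogram = degree_histogram(degrees)
print(degrees, dict(histogram))
```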
PPI Sub-network GO:0046426 - negative regulation of JAK-STAT cascade. The JAK-STAT signaling pathway plays a critical role in the transduction of extracellular signals from cytokines and growth factors that are involved in hematopoiesis, immune regulation, fertility, lactation, growth and embryogenesis. Negative regulators of JAK-STAT pathways include tyrosine phosphatases, protein inhibitors of activated STATs, suppressors of cytokine signaling proteins, and cytokine-inducible SH2-containing protein 49. It has been reported that HTLV-1 Tax protein suppresses apoptosis through constitutive activation of the NF-κB pathway, which in turn activates the JAK3-STAT5 pathway to cause lymphocyte proliferation and adult T-cell lymphoma/leukemia 59. In this work, TCSVM (n = 1, c = 3) predicts 16 interactions between 8 HTLV proteins and 3 human proteins that are involved in negative regulation of the JAK-STAT cascade. It may be inferred that the 8 HTLV proteins repress the 3 negative regulators of the JAK-STAT cascade to keep the JAK-STAT signaling pathway constitutively activated. PPI Sub-network GO:0046426 is extracted from Supplementary Section 10 and is illustrated in panel C of Figure 8. The predicted sub-network is rather sparsely connected. There is one human protein with degree 8 and two human proteins with degree 4. Two HTLV proteins {HTLV1 tax, HTLV2 tax2} are predicted not to interact with the pathway related human proteins. We extract only three signaling pathways as illustrated in Figure 8; interested readers are referred to Supplementary Section 7 and Supplementary Section 10 for other signaling pathways or biological processes.
PPI Sub-network GO:0017124 - SH3 domain binding. It has been stated that HTLV pathogenesis is closely related to the interaction between HTLV proteins and SH3 domain containing proteins 60. In this work, TCSVM (n = 1, c = 3) predicts 343 interactions between the 10 HTLV proteins and 53 SH3 domain binding proteins. It may be inferred that HTLV proteins interrupt the normal functions of the SH3 domain containing proteins by interacting with the corresponding SH3 domain binding proteins. PPI Sub-network GO:0017124 is extracted from Supplementary Section 11 and is illustrated in panel A of Figure 9. In the predicted PPI sub-network, there are 15 human proteins with degree 10, 4 human proteins with degree 9 and 8 human proteins with degree 8. The human protein PTTG1 (O95997), predicted to be targeted by the 10 HTLV proteins, acts as a regulatory protein and plays a central role in chromosome stability, in the p53/TP53 pathway, and in DNA repair. During mitosis, PTTG1 blocks Separase/ESPL1 function, preventing the proteolysis of the cohesin complex and the subsequent segregation of the chromosomes (http://www.uniprot.org/uniprot/O95997).
PPI Sub-network GO:0002039 - p53 binding. In 61, the experimental results suggest that p53 function is inactivated by HTLV Tax protein to induce a statistically significant prevalence of tumorigenesis. In 62, the authors stated that HTLV Tax does not co-immunoprecipitate with p53 and that there may be an indirect mechanism to reduce the activity of p53. The assumption is validated in 63, where it is stated that HTLV-I Tax induces a novel interaction between p65/RelA and p53 to inhibit p53 transcriptional activity. In this work, TCSVM (n = 1, c = 3) predicts 238 interactions between the 10 HTLV proteins and 30 p53 binding proteins. The results suggest that interaction with p53 binding proteins is another indirect mechanism to inactivate p53 function.
PPI Sub-network GO:0002039 is extracted from Supplementary Section 11 and is illustrated in panel B of Figure 9. In the predicted sub-network, there are 17 human proteins with degree 10 and 3 human proteins with degree 8. p53 binding proteins may be indispensable for p53 to be co-complexed for proper transcription activity. For instance, the human protein BRD7 (Q9NPI1), predicted to interact with HTLV proteins, is a coactivator for TP53-mediated activation of transcription of a set of target genes, and BRD7 is required for TP53-mediated cell-cycle arrest in response to oncogene activation (http://www.uniprot.org/uniprot/Q9NPI1). If HTLV proteins interfere with BRD7 (Q9NPI1) function, there would be a considerable adverse effect on p53 transcription activity.
PPI Sub-network GO:0004553 - O-glycosyl hydrolase activity. TCSVM (n = 1, c = 3) predicts that some HTLV proteins interact with some human proteins that fulfil the function of O-glycosyl hydrolase activity. PPI Sub-network GO:0004553 is extracted from Supplementary Section 10 and is illustrated in panel C of Figure 9. In the PPI sub-network, there are 37 interactions between 8 HTLV proteins and 12 human proteins. There are 4 human proteins that are targeted by 5 HTLV proteins. For instance, GLB1 (P16278) cleaves beta-linked terminal galactosyl residues from gangliosides, glycoproteins and glycosaminoglycans (http://www.uniprot.org/uniprot/P16278).