Robust and accurate prediction of protein–protein interactions by exploiting evolutionary information

Various biochemical functions of organisms are performed through protein–protein interactions (PPIs). Recognizing PPIs is therefore essential for understanding most life activities, such as DNA replication and transcription, protein synthesis and secretion, signal transduction, and metabolism. Although high-throughput technologies make it possible to generate large-scale PPI data, they are expensive in both time and labor and carry a risk of high false positive rates. To formulate a more efficient solution, the biology community is turning to computational methods that can quickly and reliably screen massive protein interaction data. In this paper, we propose a computational method for predicting PPIs from protein sequence information, based on the idea of combining orthogonal locality preserving projections (OLPP) with the rotation forest (RoF) model. Specifically, each protein sequence is first converted into a position-specific scoring matrix (PSSM), which encodes protein evolutionary information, using the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST). We then characterize each protein as a fixed-length feature vector by applying OLPP to the PSSM. Finally, we train an RoF classifier to distinguish interacting from non-interacting protein pairs. The proposed method yielded significantly better results than existing methods, with 90.07% and 96.09% prediction accuracy on the Yeast and Human datasets, respectively. Our experiments show that the proposed method can serve as a useful tool to accelerate the solution of key problems in proteomics.

The accuracy is defined as

Accuracy = (TN + TP) / (TN + TP + FN + FP),   (1)

where TN is the number of true negatives, i.e., non-interacting protein pairs predicted correctly; TP is the number of true positives, i.e., interacting pairs predicted correctly; FN is the number of false negatives, i.e., interacting pairs predicted to be non-interacting; and FP is the number of false positives, i.e., non-interacting pairs predicted to interact. Additionally, the receiver operating characteristic (ROC)29 curves and the area under the ROC curve (AUC)30 were also calculated to further evaluate the discriminatory accuracy of the proposed model. The workflow of the proposed method is shown in Fig. 1.
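The four evaluation metrics can be computed directly from the confusion counts. A minimal sketch follows; the counts passed in are hypothetical and used only for illustration (the paper reports MCC as a percentage):

```python
import math

def evaluate(tp, tn, fp, fn):
    """Compute accuracy, precision, sensitivity, and MCC from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # also called recall
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, precision, sensitivity, mcc

acc, pre, sen, mcc = evaluate(tp=450, tn=440, fp=60, fn=50)
print(f"Acc={acc:.4f} Pre={pre:.4f} Sen={sen:.4f} MCC={mcc:.4f}")
```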

Assessment of prediction.
We applied the proposed method to two popular PPI datasets, Yeast and Human, to verify the performance of the model. In addition, to avoid over-fitting, we used five-fold cross-validation to evaluate prediction performance. Specifically, we divided the entire dataset into five parts, four of which were used for training and one for testing. In this way, we obtained five separate models from the Yeast and Human datasets and performed five independent experiments. To be fair, we set the same parameters for the rotation forest classifier on the different datasets. In this experiment, we used a grid search to optimize two important parameters of the RoF algorithm. Figure 2 presents the prediction results of the RoF algorithm under different parameters. Here, the parameter K (the number of feature subsets) is set to 10 and the parameter L (the number of decision trees) is set to 35. The prediction results obtained by combining the proposed model with five-fold cross-validation on the different datasets are shown in Table 1.
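The five-fold protocol above can be sketched as follows. The data, labels, and the stand-in nearest-centroid classifier are all illustrative placeholders; the paper trains a rotation forest here and grid-searches its parameters K and L instead:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data; in the paper X holds the OLPP feature vectors of
# protein pairs and y the interaction labels.
X = rng.normal(size=(200, 40))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def five_fold_indices(n, n_folds=5, seed=0):
    """Shuffle sample indices and split them into n_folds disjoint parts."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, n_folds)

def cv_accuracy(train_and_predict, X, y):
    """Train on four folds, test on the held-out fold, average the accuracies."""
    folds = five_fold_indices(len(y))
    accs = []
    for i in range(len(folds)):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = train_and_predict(X[train_idx], y[train_idx], X[test_idx])
        accs.append(np.mean(pred == y[test_idx]))
    return float(np.mean(accs)), float(np.std(accs))

# Hypothetical stand-in classifier (nearest centroid), for illustration only.
def nearest_centroid(X_tr, y_tr, X_te):
    c0, c1 = X_tr[y_tr == 0].mean(0), X_tr[y_tr == 1].mean(0)
    return (np.linalg.norm(X_te - c1, axis=1)
            < np.linalg.norm(X_te - c0, axis=1)).astype(int)

mean_acc, std_acc = cv_accuracy(nearest_centroid, X, y)
print(f"5-fold accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Reporting the mean and standard deviation over the five folds is what yields the "accuracy ± deviation" entries of Table 1.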
From Table 1, we can see that the proposed method for predicting PPIs performs well on the Yeast dataset: its average accuracy, precision, sensitivity, and MCC were 90.07%, 90.24%, 89.83%, and 82.10%, with standard deviations of 0.60%, 0.56%, 1.41%, and 0.97%, respectively. In addition, our method also achieved satisfactory results on the Human dataset, with an average accuracy of 96.09%.

Comparison with SVM-based method.
To compare classifiers, we also combined the OLPP features with a support vector machine (SVM). When detecting PPIs on the Yeast dataset, the parameters c and g are set to 4 and 1, respectively; when detecting PPIs on the Human dataset, they are set to 8 and 1. Furthermore, we chose the radial basis function as the kernel function in this experiment. From Table 2, we can observe that the SVM-based method achieves an average accuracy of 78.96%, an average precision of 79.08%, an average sensitivity of 78.76%, and an average MCC of 66.80% under five-fold cross-validation on the Yeast dataset. In contrast, the RoF-based method achieved an average accuracy of 90.07%, precision of 90.24%, sensitivity of 89.83%, and MCC of 82.10% on the same dataset, all clearly higher.

Comparison of time performance with SVM-based method.
In this section, we compare the training time required by the RoF and SVM algorithms on the two datasets, using the same OLPP feature extraction method on the same machine configuration. Table 3 gives the training times required by the different algorithms on the Yeast and Human datasets. On the Yeast dataset, the training time of the OLPP + RoF method is 401 s, higher than that of the OLPP + SVM method, while its accuracy is higher by about 10%. Similarly, on the Human dataset the training time of the OLPP + RoF method is 170 s and that of the OLPP + SVM method is 110 s; although the latter trains 60 s faster, its accuracy is lower by about 9%. As a result, the RoF algorithm is preferable to the SVM algorithm when both prediction accuracy and training time are taken into account.
Comparison with other methods.
Many computational methods have been developed to predict PPIs. In particular, machine learning algorithms have received widespread attention from researchers. In this section, we compare the proposed method with currently known methods to further evaluate the predictive power of the model. Tables 4 and 5 summarize the prediction results of other existing methods on the Yeast and Human datasets, respectively. From Table 4, we can see that on the Yeast dataset the accuracy of the proposed method is 90.07%, the sensitivity 89.83%, the precision 90.24%, and the MCC 82.10%, with corresponding standard deviations of 0.60, 1.41, 0.56, and 0.97, respectively. Similarly, the prediction results of the different methods on the Human dataset can be found in Table 5.

Performance on independent dataset.
Although the proposed model has achieved good performance on the Yeast and Human datasets, its suitability for other datasets still needs to be verified. Therefore, we performed additional experiments to further determine the predictive performance of the model for other species. It should be noted that these experiments rest on a biological hypothesis under which PPIs can be mapped from one species to another: many physically interacting proteins have co-evolved within a given organism, so their counterparts are also likely to interact in other organisms. In this experiment, we used all 11,188 protein pairs of the Yeast dataset to construct a training set with the previously described method. Then, we used four independent datasets as test sets to evaluate the final prediction model separately.
Among them, the four independent test sets are C. elegans, H. pylori, H. sapiens, and M. musculus, collected from the DIP database, with 4013, 1420, 1412, and 313 protein pairs, respectively. Table 6 shows the PPI prediction results of the five methods on the four species. The proposed model achieved over 90% prediction accuracy on all four independent datasets, namely 90.93% on C. elegans, 92.54% on H. pylori, 92.21% on H. sapiens, and 91.37% on M. musculus. These results not only indicate the outstanding performance of the proposed method in predicting interactions in other species but also show that the method generalizes well.

Conclusions
Machine learning algorithms play a crucial role in proteomics research, as they can predict PPIs quickly and accurately. In this work, we propose an ensemble learning approach to detect PPIs from protein sequences. Orthogonal locality preserving projections are used to extract discriminative features from the PSSM, which effectively preserves the evolutionary information of the protein sequence. Finally, we use a rotation forest model to predict PPIs. To evaluate the reliability of the proposed method for PPI prediction, we performed experiments on the Yeast and Human datasets. We also compared the proposed model with the SVM classifier and other existing models. The experimental results show that our method achieves good performance in predicting protein interactions and can be a useful tool for detecting PPIs.

Materials and methodology
Data sources. Previous studies have produced many databases of protein–protein interactions, such as the Biomolecular Interaction Network Database (BIND)43, the Molecular Interaction Database (MINT)44, and the Database of Interacting Proteins (DIP)45. To demonstrate the efficacy of the proposed method, we used two publicly available and highly reliable datasets, Yeast and Human, which were derived from DIP and collected by Guo et al.23 and Huang et al.19, respectively. To eliminate redundancy in the datasets and ensure the validity of the experiment, we screened out redundant protein pairs46. Specifically, protein pairs containing a sequence with fewer than fifty residues were removed, as such sequences may be mere fragments. Furthermore, to account for homologous sequences, protein pairs with more than 40% sequence identity were also removed. Finally, we retained the remaining 5594 protein pairs to construct the positive PPI dataset. At the same time, we constructed a negative dataset from an additional 5594 non-interacting protein pairs whose subcellular localizations differ. Thus, the final Yeast dataset in this experiment consisted of 11,188 protein pairs, of which 50% were positive and 50% negative. Analogously, we constructed 8161 protein pairs for the Human dataset, comprising 4262 non-interacting and 3899 interacting protein pairs.

Table 6. Performance comparisons on four species. N/A means not available.
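The screening procedure described above can be sketched as follows. The data structures are assumptions: `pairs` is a list of `(seq_a, seq_b)` sequence tuples, and `identity(a, b)` is a placeholder for a real sequence-identity computation (e.g., via pairwise alignment or a clustering tool), which is not implemented here:

```python
def screen_pairs(pairs, identity, min_len=50, max_identity=0.40):
    """Drop pairs with short (likely fragment) sequences, then drop pairs
    whose sequences are >40% identical to an already retained sequence."""
    kept = []
    kept_seqs = []
    for a, b in pairs:
        if len(a) < min_len or len(b) < min_len:
            continue                      # likely fragments
        if any(identity(a, s) > max_identity or identity(b, s) > max_identity
               for s in kept_seqs):
            continue                      # homologous to retained sequences
        kept.append((a, b))
        kept_seqs.extend([a, b])
    return kept

# Toy identity measure for demonstration only: fraction of matching positions.
def toy_identity(a, b):
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))
```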

Species | Test pairs | Our method | Huang et al.19 | Ding et al.40 | Wang et al.41

Position-specific scoring matrix. Gribskov et al.47 initially introduced the position-specific scoring matrix (PSSM) for searching for distantly related proteins. The PSSM is an evolutionary profile on which feature extraction methods are built, and it has been successfully used in various fields of bioinformatics, for instance protein secondary structure prediction48, prediction of membrane protein types49, prediction of disordered regions50, identification of DNA-binding proteins51, and protein binding site prediction52. To integrate the evolutionary information of proteins, we also used the PSSM to predict PPIs in this study. The PSSM is a matrix with T rows and 20 columns, which can be written as

P = {x_{i,j} : i = 1, ..., T; j = 1, ..., 20},

where the rows of the matrix correspond to the protein residues and the columns to the twenty native amino acids. Here T is the length of the protein sequence, and the element x_{i,j} of the PSSM is the score for the ith residue mutating to amino acid type j during biological evolution.
In this paper, we employed the Position-Specific Iterated BLAST (PSI-BLAST)53 program with the SwissProt database on a local machine to transform each protein sequence into a matrix of score values, from which the experimental datasets for predicting PPIs were constructed54. When running PSI-BLAST, highly homologous sequences are selected, and these aligned sequences are mainly used to construct a new scoring matrix. This matrix is called the position-specific scoring matrix (PSSM) and is weighted according to the highly homologous sequences found in the initial hit list. Using this matrix, BLAST is run again to pick up new homologous sequences, since the scoring scheme has changed. The process is repeated until no new sequences are found. PSI-BLAST is more sensitive than BLAST, especially at discovering new members of protein families. To generate highly homologous sequences, the e-value cutoff and the number of iterations of PSI-BLAST were set to 0.001 and 3, respectively, while the other parameters kept their default values. PSI-BLAST is publicly accessible at http://blast.ncbi.nlm.nih.gov/Blast.cgi.
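As a sketch of this step, the run described above could be invoked with the BLAST+ command-line suite and its ASCII PSSM output parsed as below. The database name and file paths are assumptions, and the exact header layout of the PSSM file may vary slightly between BLAST+ versions:

```python
# The PSI-BLAST run described above might be invoked like this (assumes the
# BLAST+ suite and a locally formatted SwissProt database are installed):
#   psiblast -query protein.fasta -db swissprot -num_iterations 3 \
#            -evalue 0.001 -out_ascii_pssm protein.pssm

def parse_ascii_pssm(text):
    """Extract the T x 20 score block from an ASCII PSSM file.

    Data rows look like "<pos> <aa> s1 ... s20 p1 ... p20 ...", so we keep
    lines that start with a residue index and take the first 20 numeric
    columns after the amino-acid letter.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 22 and fields[0].isdigit():
            rows.append([int(v) for v in fields[2:22]])
    return rows
```

The returned list of rows is the T × 20 matrix P to which OLPP is applied.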

Orthogonal locality preserving projections (OLPP). The orthogonal locality preserving projections (OLPP) algorithm is an effective manifold learning method. It was proposed by Deng Cai et al.55 and used early on for human face recognition. The algorithm extends locality preserving projections (LPP)56; the theoretical background and detailed derivation of the LPP method can be found in Ref.57. Suppose we represent a set of n D-dimensional data points x_1, x_2, ..., x_n by n d-dimensional vectors y_1, y_2, ..., y_n, respectively, with D > d. The objective function of LPP is

min Σ_{ij} (y_i − y_j)^2 S_{ij},

where S is a similarity matrix and y_i = w^T x_i is the one-dimensional representation of x_i under a projection vector w. Under this minimization, LPP incurs a severe penalty if neighboring points x_i and x_j are projected far apart. One possible way to define the similarity matrix S is

S_{ij} = exp(−||x_i − x_j||^2 / t) if ||x_i − x_j||^2 < ε, and S_{ij} = 0 otherwise,

where ε > 0 is extremely small and the parameter t acts as a regulator. Here ε specifies the radius of the local neighborhood; that is, ε defines the locality. The objective function is thus minimized so that y_i and y_j are close whenever x_i and x_j are close. Finally, the transformation vector w is given by solving the minimum-eigenvalue problem

XLX^T w = λ XDX^T w,

where X = [x_1, x_2, ..., x_n], λ is the eigenvalue, and w is the corresponding eigenvector. Here L = D − S is the Laplacian matrix and D is the diagonal matrix with D_{ii} = Σ_j S_{ji}. The OLPP algorithm then proceeds through the following steps.
1. PCA projection. Principal component analysis (PCA) is an effective tool for reducing the dimensionality of multivariate data via a covariance analysis between factors. PCA projects the input data into a subspace by discarding the components corresponding to zero eigenvalues. We write W_PCA for the PCA transformation matrix.
2. Adjacency graph construction. The OLPP algorithm can construct a K-nearest-neighbor (KNN) graph in supervised or unsupervised mode with good stability. Let G denote a KNN graph with n nodes, whose ith node corresponds to x_i. We put an edge between nodes i and j if x_i and x_j are close, i.e., if x_i is among the p nearest neighbors of x_j or x_j is among the p nearest neighbors of x_i. In the unsupervised setting, edges connect each sample to its K nearest neighbors, where K is a small integer, and the Euclidean distance is generally used to measure the closeness between data nodes. The resulting nearest-neighbor graph approximates the local manifold structure.
3. Selecting the weights. If nodes i and j are linked, the weight is W_{ij} = exp(−||x_i − x_j||^2 / t), where t is a suitable constant; if nodes i and j are not linked, W_{ij} = 0. The weight matrix W of graph G reflects the native structure of the feature space.
4. Computing the orthogonal basis functions. After finding the weight matrix W, we calculate the diagonal matrix D, defined by the column sums of W (equivalently the row sums, since W is symmetric), D_{ii} = Σ_j W_{ji}. We also calculate the Laplacian matrix, defined as L = D − W.

Rotation forest.
In recent years, many ensemble algorithms have been developed rapidly in machine learning, mainly because ensemble classification methods can greatly improve prediction accuracy. In particular, ensemble classifiers built with machine learning algorithms such as boosting and bagging usually achieve much better prediction accuracy than a single classifier. In this paper, we use the rotation forest (RoF) classifier to perform the classification of protein–protein interactions. Rotation forest is an ensemble classifier combining the decision tree algorithm with principal component analysis, proposed by Rodriguez et al.58. The main idea of the RoF classifier is to improve the diversity and prediction accuracy of the base classifiers by applying a feature-extraction transformation for each classifier59. In addition, each decision tree is trained individually in a rotated feature space, on a new dataset obtained by transforming the original dataset60. Other research literature suggests that the RoF algorithm can achieve better prediction accuracy than other ensemble methods in classification problems61,62. Let X be the original training dataset, represented as an N × n matrix, where N denotes the number of training samples and n the number of features. The corresponding feature set and class labels are denoted by S and Y, respectively, where Y = (y_1, y_2, ..., y_N)^T. Let L be the total number of decision tree classifiers in the RoF, with T_i (i = 1, 2, ..., L) the ith tree. More specifically, the feature set S is first randomly divided into K disjoint subsets in the rotation forest model, each containing C = n/K features. Here, K and L are two user-defined parameters.
Next, let S_{ij} denote the jth subset of features for the ith decision tree classifier, and X_{ij} the columns of the training data X corresponding to the features in S_{ij}. Using the bootstrap algorithm, we generate a new non-empty training set X'_{ij} containing 75% of the original training data. A linear transformation is then applied to X'_{ij} to generate the coefficient vectors a_{ij}^{(1)}, ..., a_{ij}^{(C_j)}, each of size C × 1. Subsequently, a sparse rotation transformation matrix G_i is constructed from these coefficients as the block-diagonal matrix

G_i = diag([a_{i1}^{(1)}, ..., a_{i1}^{(C_1)}], [a_{i2}^{(1)}, ..., a_{i2}^{(C_2)}], ..., [a_{iK}^{(1)}, ..., a_{iK}^{(C_K)}]),

whose columns are rearranged to match the original feature order, giving G_i^a. Then, for a given test sample x, d_{ij}(x G_i^a) denotes the confidence produced by the decision tree classifier T_i that sample x belongs to class y_j. In the next step, the average combination method is used to calculate the confidence for each class y_j:

μ_j(x) = (1/L) Σ_{i=1}^{L} d_{ij}(x G_i^a),

and x is assigned to the class with the largest confidence.
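The rotation forest procedure described above can be sketched as follows. The data, sizes, and use of scikit-learn's `DecisionTreeClassifier` as the base learner are illustrative assumptions; a faithful implementation would follow Rodriguez et al.'s algorithm in full:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
N, n = 300, 12                             # samples, features (toy sizes)
X = rng.normal(size=(N, n))
y = (X[:, 0] - X[:, 3] > 0).astype(int)    # synthetic labels
K, L_trees = 3, 10                         # K feature subsets, L trees
C = n // K                                 # features per subset

def pca_components(A):
    """Principal axes (as columns) via the covariance eigendecomposition."""
    A = A - A.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(A.T))
    return vecs[:, ::-1]                   # descending eigenvalue order

trees, rotations = [], []
for _ in range(L_trees):
    perm = rng.permutation(n)              # random split of the feature set
    G = np.zeros((n, n))                   # sparse rotation matrix G_i
    for j in range(K):
        subset = perm[j * C:(j + 1) * C]
        # PCA on a 75% bootstrap sample restricted to this feature subset.
        boot = rng.choice(N, size=int(0.75 * N), replace=True)
        G[np.ix_(subset, subset)] = pca_components(X[boot][:, subset])
    Xr = X @ G                             # rotate the training data
    trees.append(DecisionTreeClassifier(random_state=0).fit(Xr, y))
    rotations.append(G)

def predict_confidence(x):
    """Average the class confidences d_ij over the L trees."""
    probs = [t.predict_proba((x @ G).reshape(1, -1))[0]
             for t, G in zip(trees, rotations)]
    return np.mean(probs, axis=0)

mu = predict_confidence(X[0])
print(mu, mu.argmax())
```

Placing the PCA coefficients block-wise in G plays the role of the sparse rotation matrix G_i, and the averaged `predict_proba` outputs correspond to the confidences μ_j(x).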