Predicting Protein-Protein Interactions from Matrix-Based Protein Sequence Using Convolution Neural Network and Feature-Selective Rotation Forest

Wang, Lei; Wang, Hai-Feng; Liu, San-Rong; Yan, Xin; Song, Ke-Jian

doi:10.1038/s41598-019-46369-4

Article
Open access
Published: 08 July 2019

Lei Wang ORCID: orcid.org/0000-0003-0184-307X^1,2,
Hai-Feng Wang¹,
San-Rong Liu¹,
Xin Yan³ &
…
Ke-Jian Song⁴

Scientific Reports volume 9, Article number: 9848 (2019)

Abstract

Protein is an essential component of the living organism. The prediction of protein-protein interactions (PPIs) has important implications for understanding the behavioral processes of life, preventing diseases, and developing new drugs. Although the development of high-throughput technology makes it possible to identify PPIs in large-scale biological experiments, the extensive use of experimental methods is restricted by constraints of time, cost, false positive rate and other conditions. Therefore, there is an urgent need for computational methods as a supplement to experimental methods to predict PPIs rapidly and accurately. In this paper, we propose a novel approach, namely CNN-FSRF, for predicting PPIs based on protein sequence by combining deep learning Convolution Neural Network (CNN) with Feature-Selective Rotation Forest (FSRF). The proposed method first converts the protein sequence into the Position-Specific Scoring Matrix (PSSM) containing biological evolution information, then uses CNN to objectively and efficiently extract the deeply hidden features of the protein, and finally removes the redundant noise information with FSRF and gives the prediction results. When applied to the Yeast and Helicobacter pylori PPI datasets, CNN-FSRF achieved prediction accuracies of 97.75% and 88.96%, respectively. To further evaluate the prediction performance, we compared CNN-FSRF with SVM and other existing methods. In addition, we also verified the performance of CNN-FSRF on independent datasets. These experimental results indicate that CNN-FSRF can be used as a useful complement to biological experiments to identify protein interactions.

Introduction

Protein is an essential component of the living organism, and it participates in various life processes such as metabolism, signal transduction, hormone regulation, DNA transcription and replication^1,2. In general, proteins perform their functions in the form of complexes by interacting with other proteins. Studying protein-protein interactions (PPIs) not only helps to understand life processes, but also helps to explore the pathogenesis of disease and identify drug targets³. Over the past several decades, methods for detecting protein interactions based on biological experiments, such as tandem affinity purification (TAP)⁴, yeast two-hybrid (Y2H)^5,6 and mass spectrometric protein complex identification⁷, have gradually matured and achieved considerable research results.

However, because biological experimental methods are costly and time-consuming, the protein interactions detected by experimental methods account for only a small part of the complete PPI networks^8,9,10,11. In addition, the detection results are susceptible to the experimental environment and operational processes, resulting in some false positives and false negatives. Therefore, developing reliable computational methods to predict protein interactions accurately is of great practical significance.

In fact, many computational methods have been proposed as a complement to experimental methods for predicting protein-protein interactions^12,13,14,15. These methods typically use a binary classification model to distinguish protein pairs with and without interaction, and can be roughly divided into the following categories according to the information they rely on: protein domains, gene expression, gene neighborhood, protein structure information^16,17, literature mining knowledge¹⁸, and phylogenetic profiles^19,20. However, these methods cannot be applied when the corresponding prior knowledge is unavailable^21,22.

With the rapid development of sequencing technology, protein sequence information is collected and stored in large quantities. The protein sequence contains abundant useful information, and experimental results show that the amino acid sequence alone is sufficient to predict protein interactions accurately. Therefore, protein interaction prediction methods that directly extract information from amino acid sequences have aroused great interest in recent years^23,24,25. You et al. proposed a method of protein interaction prediction based on Support Vector Machine (SVM), considering the sequence order and the dipeptide information of the primary protein sequence. This method achieved 90.06% accuracy, 94.37% specificity and 85.74% sensitivity on the Yeast dataset²⁶. Hu et al. introduced a novel co-evolutionary feature extraction method, namely CoFex, to predict protein interactions. CoFex extracts feature vectors that express the protein properties according to the presence or absence of co-evolutionary features in the two protein sequences, thereby improving the performance of PPI prediction²⁷. Pan et al. proposed a new hierarchical LDA-RF model to predict protein-protein interactions directly from the primary protein sequences, which can mine hidden internal structures buried in the noisy amino acid sequences in a low-dimensional latent semantic space. The experimental results show that this model can effectively predict potential protein interactions⁹. Saha et al. constructed an ensemble model for protein interaction prediction based on a majority voting method. The model uses four well-established machine learning methods: support vector machines, random forests, decision trees, and naive Bayes. In the cross-validation experiment, the ensemble learning method achieved over 80% sensitivity and 90% prediction accuracy²⁸. Jeong et al. predicted protein interactions using algorithms that extract features only from protein sequences, combined with machine learning for computational function prediction. The experimental results show that features derived from the position-specific scoring matrix are very suitable for protein interaction prediction²⁹.

In this study, we propose a novel sequence-based approach, namely CNN-FSRF, to predict potential protein interactions using the deep learning Convolutional Neural Network (CNN) algorithm combined with the Feature-Selective Rotation Forest (FSRF) classifier. More specifically, we first use the position-specific scoring matrix to convert each protein letter sequence into a numerical, matrix-based protein descriptor that contains evolutionary information. Then we use the convolutional neural network to extract the high-level abstract features of the protein automatically and objectively. Finally, these features are fed into the feature-selective rotation forest classifier to obtain the final prediction results. To evaluate the predictive performance of CNN-FSRF, we performed verification on the Yeast and Helicobacter pylori PPI datasets. The experimental results show that CNN-FSRF achieves 97.75% and 88.96% accuracy, with 99.61% and 91.86% sensitivity at specificities of 95.89% and 86.11%, on these two datasets, respectively. These results indicate that CNN-FSRF can be a useful complement to biological experiments to identify potential protein-protein interactions.

Materials and Methodology

In this section, we outline the main idea behind the CNN-FSRF approach. Figure 1 gives a schematic diagram of how CNN-FSRF uses the convolution neural network and the feature-selective rotation forest classifier to predict protein-protein interactions. As can be seen from the figure, our model can be divided into three steps. The first is matrix-based protein numerical representation. Since the sequence of a given protein is usually represented by the letter symbols of the 20 kinds of amino acids, we use the Position-Specific Scoring Matrix (PSSM) method to convert the letter sequence of the protein into a numerical matrix that is easier for computer algorithms to process. The second is feature extraction based on the Convolutional Neural Network (CNN). Although the protein sequence contains abundant information, it is also mixed with a lot of noise. In order to get a more precise representation, we use the deep learning CNN algorithm to extract its features. CNN can automatically and objectively extract the advanced features of protein information in a layer-by-layer manner, thus effectively avoiding the interference of human factors. The third is PPI prediction based on the Feature-Selective Rotation Forest (FSRF) classifier. After obtaining the advanced features of the proteins, we use the FSRF classifier to predict the relationship between them. The FSRF classifier has the advantage of greatly improving the classification speed while maintaining accuracy, so that it can predict the interaction between proteins quickly and effectively.

Golden standard datasets

We evaluate the CNN-FSRF approach on two real PPI datasets. The Yeast dataset was collected from the Saccharomyces cerevisiae core subset of the Database of Interacting Proteins (DIP) by Guo et al.³⁰. The core subset contains a total of 5966 interacting protein pairs. After we remove protein pairs containing a protein with fewer than 50 residues or with more than 40% sequence identity, the remaining 5594 protein pairs constitute the golden standard positive dataset. The golden standard negative dataset was constructed based on the assumption of Guo et al.³⁰ that there is no interaction between proteins in different subcellular compartments. To avoid an imbalanced dataset, we selected the same number of protein pairs as in the positive dataset to construct the negative dataset. As a result, there is a total of 11188 protein pairs in the final Yeast dataset, with positive and negative samples each accounting for half. For the Helicobacter pylori PPI dataset from Martin et al.¹², we use the same processing method. The final Helicobacter pylori dataset consists of 2916 protein pairs, of which 1458 are interacting pairs and 1458 are non-interacting pairs.
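The construction of a balanced negative set described above can be illustrated with a minimal sketch. It assumes a dictionary mapping each protein to a single subcellular compartment; the function name, protein identifiers and compartment labels are hypothetical, and the actual procedure of Guo et al. may include further filtering steps not shown here.

```python
import random


def sample_negative_pairs(localization, positive_pairs, n_needed, seed=0):
    """Sample non-interacting pairs from proteins in different compartments.

    localization: dict protein -> compartment (hypothetical annotation source).
    positive_pairs: iterable of known interacting pairs, excluded from sampling.
    """
    known = {frozenset(pair) for pair in positive_pairs}
    proteins = sorted(localization)
    # Candidate negatives: different compartments and not a known interaction.
    candidates = [
        (a, b)
        for i, a in enumerate(proteins)
        for b in proteins[i + 1:]
        if localization[a] != localization[b] and frozenset((a, b)) not in known
    ]
    if n_needed > len(candidates):
        raise ValueError("not enough candidate negative pairs")
    return random.Random(seed).sample(candidates, n_needed)


# Toy example with hypothetical protein identifiers.
localization = {"P1": "nucleus", "P2": "nucleus",
                "P3": "cytoplasm", "P4": "mitochondrion"}
positives = [("P1", "P2"), ("P1", "P3")]
# Draw as many negatives as there are positives to keep the dataset balanced.
negatives = sample_negative_pairs(localization, positives, len(positives))
```

Enumerating all candidate pairs is quadratic in the number of proteins, which is acceptable for a sketch but would likely be replaced by rejection sampling on a proteome-scale dataset.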

Evaluation criteria

To evaluate the performance of CNN-FSRF, we use 5-fold cross-validation and several general evaluation criteria in our experiments. The 5-fold cross-validation randomly divides the whole dataset into five independent subsets of the same size. Each time, one subset is used as the test set and the remaining four subsets are used as the training set. This process is executed five times to ensure that each subset is used as the test set exactly once. Finally, the average and standard deviation over these five experiments are taken as the final experimental results. We follow the widely used evaluation criteria to evaluate the model, including accuracy (Accu.), sensitivity (Sen.), specificity (Spec.), precision (Prec.), F-Score (F_score), and Matthews Correlation Coefficient (MCC). They are defined as:

$$Accu.=\frac{TP+TN}{TP+TN+FP+FN}$$

(1)

$$Sen.=\frac{TP}{TP+FN}$$

(2)

$$Spec.=\frac{TN}{TN+FP}$$

(3)

$$Prec.=\frac{TP}{TP+FP}$$

(4)

$${F}_{score}=2\times \frac{Sen.\times Prec.}{Sen.+Prec.}$$

(5)

$$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

(6)

where TP indicates the number of positive samples that are correctly identified, TN indicates the number of negative samples that are correctly identified, FP indicates the number of negative samples that are incorrectly identified as positive, and FN indicates the number of positive samples that are incorrectly identified as negative.
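As a minimal sketch, Eqs. (1) to (6) can be computed directly from the four counts. The function name and the example counts below are hypothetical and chosen only for illustration; the sketch assumes every denominator is nonzero.

```python
import math


def evaluation_criteria(tp, tn, fp, fn):
    """Compute Eqs. (1)-(6) from the confusion-matrix counts."""
    accu = (tp + tn) / (tp + tn + fp + fn)          # Eq. (1)
    sen = tp / (tp + fn)                            # Eq. (2)
    spec = tn / (tn + fp)                           # Eq. (3)
    prec = tp / (tp + fp)                           # Eq. (4)
    f_score = 2 * sen * prec / (sen + prec)         # Eq. (5)
    mcc = (tp * tn - fp * fn) / math.sqrt(          # Eq. (6)
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"Accu": accu, "Sen": sen, "Spec": spec,
            "Prec": prec, "F_score": f_score, "MCC": mcc}


# Hypothetical counts for a 200-sample test fold.
metrics = evaluation_criteria(tp=90, tn=80, fp=20, fn=10)
```

With these counts the accuracy is 0.85, the sensitivity 0.90 and the specificity 0.80, which shows how a model can have high sensitivity while still admitting a noticeable share of false positives.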

Among these evaluation criteria, accuracy reflects the proportion of correct predictions made by the model. Sensitivity reflects the ability of the classification model to identify positive samples; a higher sensitivity indicates a stronger ability to identify positive samples. Precision reflects the proportion of samples predicted as positive that are truly positive; a higher precision indicates that fewer negative samples are mistaken for positive ones. F_score is the harmonic mean of sensitivity and precision; a higher F_score indicates that the model is more robust. The Matthews correlation coefficient (MCC) reflects the correlation between the prediction results and the observed results, and is an important indicator of the overall performance of the model; a larger MCC indicates better performance. In addition, Receiver Operating Characteristic (ROC) curves and Precision-Recall (P-R) curves are also drawn as evaluation criteria. To directly measure the quality of the results expressed by the ROC curve, the Area Under the Curve (AUC) is calculated at the same time. Its value ranges from 0 to 1, and the larger the value, the better the performance of the model.
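The 5-fold cross-validation protocol used throughout the experiments can be sketched as follows. This is an illustrative outline rather than the code used in the study: the function names are hypothetical, the folds differ in size by at most one sample when the dataset size is not divisible by five, and the use of the sample standard deviation is an assumption because the paper does not state which estimator was used.

```python
import random
import statistics


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) so each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield train, test


def summarize(per_fold_scores):
    """Mean and sample standard deviation over folds (estimator is an assumption)."""
    return statistics.mean(per_fold_scores), statistics.stdev(per_fold_scores)


splits = list(five_fold_splits(11188))  # size of the final Yeast dataset
fold_sizes = [len(test) for _, test in splits]
```

In practice a model would be trained on each `train` index list and scored on the matching `test` list, and `summarize` would then be applied to the five per-fold values of each criterion.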

Matrix-based protein numerical representation

Protein sequences are usually stored in the database in the form of letters. In order to facilitate the deep learning algorithm to extract its hidden features, the protein sequence must be encoded into the numerical form. In this study, we use the Position-Specific Scoring Matrix (PSSM) method that can contain biological evolution information to generate matrix-based numeric descriptors^31,32. When measuring the matching weights of amino acids, PSSM not only records the importance and relevance of matching, but also records the position of amino acid residues in the sequence. This matrix helps to reveal more evolutionary information of protein sequences and is therefore widely used in many fields of bioinformatics.

PSSM is a matrix of N rows and 20 columns, where N is the length of the protein sequence and the columns correspond to the 20 native amino acids. Assuming that P = {r_i,j: i = 1 … N and j = 1 … 20}, PSSM can be expressed as:

$$P=[\begin{array}{cccc}{r}_{1,1} & {r}_{1,2} & \cdots & {r}_{1,20}\\ {r}_{2,1} & {r}_{2,2} & \cdots & {r}_{2,20}\\ \vdots & \vdots & \vdots & \vdots \\ {r}_{N,1} & {r}_{N,2} & \cdots & {r}_{N,20}\end{array}]$$

(7)

where r_i,j, the entry in the ith row and jth column of the PSSM, represents the probability of the ith residue being mutated into the jth of the 20 native amino acids during the evolutionary process, as derived from multiple sequence alignments.

In the experiment, we use the sequence comparison tool Position-Specific Iterated BLAST (PSI-BLAST) to obtain the PSSM. BLAST is an effective tool for finding locally similar regions between sequences. It compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches, so as to infer the functional and evolutionary relationships between sequences and to help identify gene family members. PSI-BLAST is a more sensitive BLAST program that can effectively detect new members of protein families and similar proteins in distantly related species. Its distinguishing feature is that it uses a profile to search the database, reconstructs the profile from the search results, and then searches the database again with the new profile, repeating this process until no new results are produced. PSI-BLAST thus extends the BLAST method to find hidden patterns in protein sequences and to find related proteins that differ substantially in sequence but share similar structure and function. To maximize the effectiveness of the algorithm, we use the non-redundant SwissProt as the alignment database. All sequence entries in the SwissProt database are curated by experienced protein chemists and molecular biologists, who consult the relevant literature and check the entries carefully with computer tools. In addition, we set the expectation value threshold of the PSI-BLAST algorithm to 0.001 and the number of iterations to 3, and leave the rest of the parameters at their default values.
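The PSI-BLAST settings above can be illustrated with a minimal sketch that builds a BLAST+ `psiblast` command line and reads the resulting ASCII PSSM into an N × 20 matrix. Several points are assumptions: the paper does not say which PSI-BLAST release was used, whether the 0.001 threshold corresponds to `-evalue` or to `-inclusion_ethresh`, or how the output was parsed. The file names, the database name and the two example rows are hypothetical.

```python
def psiblast_command(fasta_path, db_name, pssm_path):
    """Build (but do not run) a BLAST+ psiblast call with the stated settings."""
    return [
        "psiblast",
        "-query", fasta_path,
        "-db", db_name,                 # hypothetical formatted SwissProt database
        "-evalue", "0.001",             # assumed mapping of the stated threshold
        "-num_iterations", "3",
        "-out_ascii_pssm", pssm_path,
    ]


def parse_ascii_pssm(lines):
    """Return the N x 20 score matrix from the text lines of an ASCII PSSM."""
    matrix = []
    for line in lines:
        parts = line.split()
        # Data rows start with the residue position and the residue letter,
        # followed by 20 integer scores; header and footer lines are skipped.
        if len(parts) >= 22 and parts[0].isdigit() and parts[1].isalpha():
            matrix.append([int(v) for v in parts[2:22]])
    return matrix


cmd = psiblast_command("protein.fasta", "swissprot", "protein.pssm")

# Two hypothetical rows in the layout of an ASCII PSSM file.
example = [
    "           A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V",
    "    1 M   -1 -2 -2 -3 -2 -1 -2 -3 -2  1  2 -2  6  0 -3 -2 -1 -2 -1  1",
    "    2 K   -1  2  0 -1 -3  1  1 -2 -1 -3 -3  5 -2 -3 -1  0 -1 -3 -2 -3",
]
pssm = parse_ascii_pssm(example)
```

The command list could be passed to `subprocess.run` on a machine where BLAST+ and a formatted database are installed; the sketch deliberately stops short of executing it.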

Convolutional neural network

Deep learning is a branch of machine learning. Its motivation lies in building neural networks that simulate the human brain for learning, and in interpreting data with a mechanism that imitates the human brain^33,34,35. Deep learning can form abstract high-level representations by combining low-level features, and thereby discover regularities in the data. Therefore, in this paper, we use the deep learning convolution neural network algorithm to extract the hidden useful information in proteins.

The convolution neural network is a feed-forward neural network. Its neurons can respond to the surrounding units in a part of the coverage and have excellent performance for data feature extraction³⁶. CNN uses forward propagation to calculate the output value and back propagation to adjust weights and biases. CNN is composed of the input layer, the convolution layer, subsampling layer, full connection layer and the output layer. Its structure diagram is shown in Fig. 2.

Assuming that L_i represents the feature map of the ith layer, it can be described as:

$${L}_{i}=h({L}_{i-1}\,\circ \,{W}_{i}+{b}_{i})$$

(8)

where W_i denotes the weight matrix of the convolution kernel of the ith layer, b_i denotes the offset vector, h(x) denotes the activation function, and the operator $\circ $ denotes the convolution operation. The subsampling layer usually follows the convolutional layer and samples the feature map according to given rules. Assuming that L_i is a subsampling layer, its sampling formula is:

$${L}_{i}=subsampling({L}_{i-1})$$

(9)
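Equations (8) and (9) can be illustrated with a small pure-Python sketch of one convolution layer followed by one subsampling layer. The paper does not specify the activation function or the subsampling rule, so ReLU and 2 × 2 max pooling are assumed here. As in most deep learning libraries, the kernel is applied without flipping. All names and numbers are hypothetical.

```python
def conv_layer(feature_map, kernel, bias):
    """Eq. (8): L_i = h(L_{i-1} o W_i + b_i), with h assumed to be ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            s = bias
            for i in range(kh):
                for j in range(kw):
                    s += feature_map[r + i][c + j] * kernel[i][j]
            row.append(max(0.0, s))  # ReLU activation (assumption)
        out.append(row)
    return out


def subsampling(feature_map, size=2):
    """Eq. (9): non-overlapping max pooling (the pooling rule is an assumption)."""
    out = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[r + i][c + j]
                           for i in range(size) for j in range(size)))
        out.append(row)
    return out


# Toy 5 x 5 input standing in for a (much larger) PSSM.
L0 = [[float(5 * r + c) for c in range(5)] for r in range(5)]
W1 = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical 2 x 2 kernel
L1 = conv_layer(L0, W1, bias=-10.0)   # 4 x 4 feature map
L2 = subsampling(L1)                  # 2 x 2 feature map
```

Here each convolution output equals 10r + 2c − 4 before the activation, so the pooled map is [[8, 12], [28, 32]]; the example is only meant to show how the spatial size shrinks from 5 × 5 to 4 × 4 to 2 × 2.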

Through multiple convolution and subsampling operations, CNN classifies the extracted features with the fully connected layer and obtains the probability distribution $ {\mathcal F} $ based on the input. The core mathematical idea of CNN is to map the input matrix L_0 to a new feature representation $ {\mathcal F} $ through multi-layer data transformation.

$$ {\mathcal F} (i)=Map(C={c}_{i}|{L}_{0};\,(W,b))$$

(10)

where c_i represents the ith label class, L_0 denotes the input matrix, and $ {\mathcal F} $ denotes the feature expression.

The goal of CNN training is to minimize the network loss function F(W, b). At the same time, to alleviate overfitting, the final loss function E(W, b) usually adds a norm penalty on the weights, and the strength of this regularization is controlled by the parameter ε.

$$E(W,b)=F(W,b)+\frac{{\rm{\varepsilon }}}{2}{W}^{T}W$$

(11)

When adjusting parameters, CNN usually uses the gradient descent method for optimization: the network parameters (W, b) are updated layer by layer from back to front, and the learning rate λ controls the strength of back propagation.

$${W}_{i}={W}_{i}-{\rm{\lambda }}\frac{\partial E(W,b)}{\partial {W}_{i}}$$

(12)

$${b}_{i}={b}_{i}-{\rm{\lambda }}\frac{\partial E(W,b)}{\partial {b}_{i}}$$

(13)
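The update rules of Eqs. (11) to (13) can be illustrated on a toy problem. The sketch below is not a CNN: it assumes a single linear unit with a squared-error loss standing in for F(W, b), so that the gradients can be written in closed form. The data, the learning rate λ and the regularization strength ε are hypothetical.

```python
def loss_and_gradients(w, b, xs, ys, eps):
    """E(W, b) = F(W, b) + (eps / 2) * W^T W and its gradients, Eq. (11).

    F is taken to be half the mean squared error of a linear unit (assumption).
    """
    n = len(xs)
    residuals = [sum(wi * xi for wi, xi in zip(w, x)) + b - y
                 for x, y in zip(xs, ys)]
    F = sum(r * r for r in residuals) / (2 * n)
    E = F + 0.5 * eps * sum(wi * wi for wi in w)
    grad_w = [sum(r * x[k] for r, x in zip(residuals, xs)) / n + eps * w[k]
              for k in range(len(w))]
    grad_b = sum(residuals) / n   # the bias is not regularized
    return E, grad_w, grad_b


def gradient_descent_step(w, b, grad_w, grad_b, lam):
    """Eqs. (12) and (13): move against the gradient, scaled by lam."""
    new_w = [wi - lam * g for wi, g in zip(w, grad_w)]
    new_b = b - lam * grad_b
    return new_w, new_b


# Toy data generated from y = 2x + 1.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = [0.0], 0.0
lam, eps = 0.1, 0.01
losses = []
for _ in range(500):
    E, gw, gb = loss_and_gradients(w, b, xs, ys, eps)
    losses.append(E)
    w, b = gradient_descent_step(w, b, gw, gb, lam)
```

Because of the penalty term, the learned weight settles slightly below 2 rather than exactly at it, which is the intended effect of the regularization: a small bias is traded for smaller weights.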

Feature-selective rotation forest

The Rotation Forest (RF) is an ensemble classifier which contains multiple decision trees. It can quickly be applied to many data science problems and can efficiently obtain accurate classification results³⁷. Therefore, it has received high attention and popularity from researchers. The main idea of RF is to randomly divide the data set into multiple subsets and implement the corresponding coordinate transformation, and transform the data from the original space to the new space to increase the difference between the data, so as to improve the diversity and accuracy of the classifier at the same time.

In this study, to address the high dimensionality and noise of the PPI data, we improved the RF and proposed the Feature-Selective Rotation Forest (FSRF) algorithm. The FSRF algorithm can effectively reduce the data dimension and remove the noise information in the data, thus improving the prediction accuracy and speed of the classifier. More specifically, we use the χ² statistic to calculate the weight of every feature, rank the features by weight, and delete those with little influence on the classification according to the given feature selection rate. The weight of a given feature P can be calculated according to the following formula.

$${\chi }^{2}=\sum _{i=1}^{l}\sum _{j=1}^{2}\frac{{({\rho }_{ij}-{\sigma }_{i,j})}^{2}}{{\sigma }_{i,j}}$$

(14)

where l is the number of distinct values of feature P, and ρ_ij is the number of samples whose value of feature P is β_i and which belong to class y_j, defined as:

$${\rho }_{ij}=count(P={\beta }_{i}\,and\,Y={y}_{j})$$

(15)

σ_i,j is the expected count for the combination of β_i and y_j, defined as:

$${\sigma }_{i,j}=\frac{count(P={\beta }_{i})\times count(Y={y}_{j})}{L}$$

(16)

where count(P = β_i) is the number of samples with the value β_i in the feature P, count(Y = y_j) is the number of samples with the value y_j in the class Y, and L is the total number of samples in the training set.
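A minimal sketch of the χ² weighting in Eqs. (14) to (16), followed by ranking and selection, is given below. It assumes that each feature takes discrete values; how the continuous CNN features are discretized is not described in the paper. It further assumes that the selection rate is the fraction of features kept. Function names and the toy data are hypothetical.

```python
from collections import Counter


def chi_square_weight(feature_values, labels):
    """Eq. (14): chi-square statistic between one feature and the class label."""
    total = len(labels)                           # L in Eq. (16)
    value_counts = Counter(feature_values)        # count(P = beta_i)
    class_counts = Counter(labels)                # count(Y = y_j)
    joint = Counter(zip(feature_values, labels))  # rho_ij, Eq. (15)
    chi2 = 0.0
    for beta, n_beta in value_counts.items():
        for y, n_y in class_counts.items():
            rho = joint[(beta, y)]
            sigma = n_beta * n_y / total          # Eq. (16)
            chi2 += (rho - sigma) ** 2 / sigma
    return chi2


def select_features(columns, labels, selection_rate):
    """Keep the top-weighted fraction of features (rate semantics assumed)."""
    weights = [chi_square_weight(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda k: weights[k], reverse=True)
    n_keep = max(1, round(selection_rate * len(columns)))
    return sorted(ranked[:n_keep]), weights


labels = [0, 0, 1, 1]
columns = [
    [0, 0, 1, 1],   # perfectly aligned with the labels
    [0, 1, 0, 1],   # independent of the labels
]
kept, weights = select_features(columns, labels, selection_rate=0.5)
```

In this toy case the informative feature receives a weight of 4 and the uninformative one a weight of 0, so only the first feature is retained.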

After calculating the weights of all the features by formula 14, we remove the features with small weights according to the given selection rate ε, and thus obtain a new feature set S. Let E = (e₁, e₂, …, e_n)^T be an n × L matrix composed of the feature vectors of the n training samples, and let C = (c₁, c₂, …, c_n)^T denote the corresponding labels. Each data sample can therefore be represented as {e_i, c_i}, where e_i = (e_i1, e_i2, …, e_iL) is an L-dimensional feature vector (here L denotes the feature dimension, not the number of samples as in formula 16). According to the given number K, the set S is randomly divided into subsets of the same size, and each subset is transformed by the principal component analysis (PCA) algorithm. All principal component coefficients are then rearranged and stored to form a rotation matrix, which is used to transform the original training set. The decision trees can be represented by T₁, T₂, …, T_k, and the training process of one decision tree T_i can be described as follows:

(a) The feature set S is randomly divided into K (a factor of n) disjoint subsets, so that each subset contains n/K features.
(b) The columns of the training dataset E that correspond to the features in the subset S_i,j are selected to form a new matrix E_i,j. A new training set ${E^{\prime} }_{i,j}$ is then drawn from E_i,j by bootstrap sampling of 3/4 of the dataset. This is repeated K times, so that each subset is converted.
(c) PCA is applied to the matrix ${E^{\prime} }_{i,j}$ to produce the coefficient matrix M_i,j, whose jth column holds the coefficients of the jth principal component.
(d) A sparse rotation matrix R_i is constructed from the coefficients obtained from the matrices M_i,j, expressed as follows:

$${R}_{i}=[\begin{array}{llll}{\mu }_{i,1}^{(1)},\,\ldots ,\,{\mu }_{i,1}^{({G}_{1})} & 0 & \cdots & 0\\ 0 & {\mu }_{i,2}^{(1)},\,\cdots ,\,{\mu }_{i,2}^{({G}_{2})} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & {\mu }_{i,k}^{(1)},\ldots ,{\mu }_{i,k}^{({G}_{k})}\end{array}]$$

(17)
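The structure of Eq. (17) can be illustrated with a short sketch that partitions feature indices into K disjoint subsets, as in step (a), and places per-subset coefficient blocks on the diagonal of a sparse rotation matrix, as in step (d). The bootstrap sampling and PCA of steps (b) and (c) are omitted, so the coefficient blocks below are hypothetical placeholders rather than real principal components, and the function names are illustrative.

```python
import random


def split_features(n_features, k, seed=0):
    """Step (a): random partition of feature indices into k equal-sized subsets."""
    idx = list(range(n_features))
    random.Random(seed).shuffle(idx)
    size = n_features // k   # assumes k divides n_features
    return [idx[i * size:(i + 1) * size] for i in range(k)]


def block_diagonal(blocks):
    """Step (d): place each coefficient block on the diagonal, zeros elsewhere."""
    n_rows = sum(len(b) for b in blocks)
    n_cols = sum(len(b[0]) for b in blocks)
    R = [[0.0] * n_cols for _ in range(n_rows)]
    r0 = c0 = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, value in enumerate(row):
                R[r0 + i][c0 + j] = value
        r0 += len(b)
        c0 += len(b[0])
    return R


subsets = split_features(n_features=6, k=3)

# Hypothetical PCA coefficient blocks, one per feature subset.
blocks = [[[1.0, 2.0], [3.0, 4.0]], [[5.0]]]
R = block_diagonal(blocks)
```

In the standard Rotation Forest formulation the columns of the assembled matrix are additionally reordered to match the original feature order before the training data are multiplied by it; that rearrangement is left out of this sketch.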

In the prediction period, given a test sample e, the classifier T_i produces ${d}_{i,j}(E{R}_{i}^{\mu })$, which expresses its support for e belonging to class j. The confidence of each class is then calculated by the average combination method, and the formula is as follows:

$${\theta }_{j}(e)=\frac{1}{k}\sum _{i=1}^{k}{d}_{i,j}(E{R}_{i}^{\mu })$$

(18)

Therefore, the test sample e is assigned to the class with the greatest confidence.
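The average combination of Eq. (18) and the final class assignment can be sketched as follows. The per-tree outputs are hypothetical class-support values for a single test sample, and the function names are illustrative.

```python
def average_confidence(per_tree_support):
    """Eq. (18): theta_j(e) = (1 / k) * sum_i d_{i,j} over the k trees."""
    k = len(per_tree_support)
    n_classes = len(per_tree_support[0])
    return [sum(tree[j] for tree in per_tree_support) / k
            for j in range(n_classes)]


def predict_class(per_tree_support):
    """Assign the sample to the class with the largest averaged confidence."""
    theta = average_confidence(per_tree_support)
    best = max(range(len(theta)), key=theta.__getitem__)
    return best, theta


# Hypothetical outputs of three trees for classes (non-interacting, interacting).
support = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]
predicted, theta = predict_class(support)
```

Although the second tree favors class 1, the averaged confidences are about 0.67 and 0.33, so the sample is assigned to class 0; averaging lets confident trees outweigh a single dissenting one.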

Results and Discussion

In this section, we summarize the experimental results of the CNN-FSRF method on the standard datasets. To comprehensively evaluate the performance of the model, we compare the proposed method with the state-of-the-art Support Vector Machine (SVM) classifier and other existing methods on the same datasets. In addition, we verified the proposed method on independent datasets. The sequence-based CNN-FSRF is implemented on the MATLAB platform. For the SVM classifier, we use the LIBSVM implementation designed by Lin et al., which can be downloaded at https://www.csie.ntu.edu.tw/~cjlin/libsvm/. The parameters of the FSRF and SVM algorithms have been optimized by the grid search method.

Prediction Performance of CNN-FSRF Model

We first performed experiments on the Yeast dataset, and Table 1 summarizes the results of the 5-fold cross-validation experiment. It can be seen that the accuracy of CNN-FSRF on the Yeast dataset was as high as 97.75%. To better investigate the predictive ability of the model, we also calculated the values of sensitivity, specificity, precision, F_score, Matthews correlation coefficient, and AUC. Among these evaluation criteria, the F_score, which reflects the stability of the model, was 97.79%, and the MCC and AUC, which reflect the overall performance of the model, were 95.57% and 97.54%; their standard deviations were 0.53%, 1.05% and 0.66%, respectively. Figure 3 shows the ROC curves and P-R curves obtained by CNN-FSRF on the Yeast dataset. It can be seen from the graph that the curves generated by the five experiments cover most of the coordinate space. The 5-fold cross-validation experimental results demonstrate that CNN-FSRF performs well on the Yeast dataset.

Table 1 The 5-fold cross-validation results were generated on the Yeast dataset by using the CNN-FSRF method.

We next applied the proposed method to the Helicobacter pylori dataset, and its 5-fold cross-validation experimental results are shown in Table 2. We can see from Table 2 that CNN-FSRF achieved an accuracy of 88.96% on the Helicobacter pylori dataset. For the F_score, MCC, and AUC, which comprehensively reflect model performance, the values obtained by CNN-FSRF were 89.26%, 78.09%, and 89.08%, and the standard deviations were 0.67%, 1.16%, and 0.79%, respectively. Figure 4 plots the ROC curves and P-R curves generated on the Helicobacter pylori dataset. It can be seen from the figure that although the performance of CNN-FSRF on the Helicobacter pylori dataset is not as good as on the Yeast dataset, it is still good. This may be due to the fact that the Helicobacter pylori dataset (2916 samples) is smaller than the Yeast dataset (11188 samples). In machine learning, the number of samples used to train the classifier is closely related to the final test result: in general, the more samples in the training set, the more fully the classifier is trained and the better the prediction result. The weaker results of the proposed model on the Helicobacter pylori dataset are consistent with this rule. In addition, this result suggests that the performance of CNN-FSRF will improve as the training set grows.

Table 2 The 5-fold cross-validation results were generated on the Helicobacter pylori dataset by using the CNN-FSRF method.

Comparison between the proposed model and SVM Model

SVM is a supervised learning model and one of the most robust and accurate methods among data mining algorithms³⁸. SVM can map the sample space into a high-dimensional feature space through a non-linear mapping, so that a problem that is not linearly separable in the original sample space is transformed into a linearly separable problem in the feature space. To demonstrate the performance of the proposed method, we compare CNN-FSRF with the SVM model (CNN-SVM) on the same dataset. For fairness, we optimized the parameters of the SVM using the grid search method and used the same numerical protein descriptors.

The 5-fold cross-validation results of the SVM classifier combined with the CNN-extracted feature descriptors are shown in Table 3. CNN-SVM achieved a 5-fold cross-validation accuracy of 88.92% with a standard deviation of 1.34% on the Yeast dataset. Its accuracy is 8.83 percentage points lower than that of CNN-FSRF, and its standard deviation is 0.80 percentage points higher. Although the sensitivity of CNN-SVM is 0.11 percentage points higher than that of CNN-FSRF, its specificity, precision, F_score, MCC and AUC are 17.76, 14.01, 7.79, 15.84 and 8.69 percentage points lower, respectively. Moreover, the standard deviations of CNN-SVM on sensitivity, specificity, precision, F_score, MCC and AUC are 0.03, 1.64, 0.95, 0.59, 1.17 and 0.63 percentage points higher than those of CNN-FSRF, respectively.

Table 3 Comparison of 5-fold cross-validation results of CNN-FSRF and CNN-SVM on Yeast dataset.

To facilitate observation, we present these evaluation criteria in the form of histogram. At the same time, we also plotted ROC curves and P-R curves of CNN-FSRF and CNN-SVM on the same coordinate axis. It can be clearly seen from Fig. 5 that CNN-FSRF performed better than CNN-SVM on accuracy and F_score, which reflects the prediction accuracy and the stability of the model. In addition, it can be clearly seen from Fig. 6 that the proposed CNN-FSRF also outperforms CNN-SVM on comprehensive evaluation criteria AUC reflecting the overall performance of the model. This indicates that the overall performance of CNN-FSRF is superior to that of CNN-SVM. Therefore, we have reason to believe that the proposed CNN-FSRF method can effectively predict the interaction between proteins.

Comparison with existing methods

To further evaluate the performance of CNN-FSRF, we collected the work of other researchers on the same Yeast and Helicobacter pylori datasets and used 5-fold cross-validation method to predict PPI. Since some works do not provide more evaluation criteria, we only list the common evaluation criteria of these works, including accuracy, sensitivity, precision and MCC.

Table 4 lists the performance of several previous works and our model on the Yeast dataset. From the table we can see that the proposed method achieves the best results in accuracy, sensitivity and MCC, but only the third best result in precision. Specifically, the proposed model achieved an accuracy of 97.75%, which is 1.15 percentage points higher than the second highest, Wang's work. The model has a clear advantage in sensitivity, achieving 99.61%, which is 4.49 percentage points higher than the second highest, Zhang's work. In precision, the proposed model achieved only the third highest result, 95.89%, which is 3.47 percentage points lower than the highest, Wang's work. The proposed model also has a clear advantage in MCC, achieving 96.04%, which is 2.63 percentage points higher than the second highest, Wang's work. Overall, the comprehensive performance of the proposed method is superior to that of the other methods in the table, and it is highly competitive in predicting PPIs. In addition, we can also see that Wang's work, Du's work, Zhang's work, Patel's work and the proposed model all use deep learning-based algorithms, and the results obtained by these methods are significantly better than those of the other methods in the table that do not use deep learning. This suggests that the use of deep learning algorithms can effectively improve the performance of the model.

Table 4 The performance comparison between different methods on the Yeast dataset.

We collected previous work on the Helicobacter pylori dataset and summarized the results in Table 5. We can see from the table that our model achieved the best results in terms of accuracy, sensitivity, and precision, and the second best result in MCC. Specifically, compared with the Ensemble ELM model, which ranks second on the first three criteria and first on MCC, CNN-FSRF is 1.46 percentage points higher in accuracy, 2.91 percentage points higher in sensitivity, 0.71 percentage points higher in precision, and 0.04 percentage points lower in MCC. Overall, our model achieved the highest prediction accuracy on the Helicobacter pylori dataset, and its MCC ranked second, only 0.04 percentage points below the best result.

Table 5 The performance comparison of different methods on the Helicobacter pylori dataset.

Tables 4 and 5 also show that the methods we collected generally perform worse on the Helicobacter pylori dataset than on the Yeast dataset, which is likely related to the number of samples in each dataset and is consistent with the conclusions of the previous section. In addition, a horizontal comparison shows that our model is only slightly better than the other methods on the Helicobacter pylori dataset but much better on the Yeast dataset. This suggests that, as the amount of data increases, our approach improves its overall performance quickly and is well suited to large datasets.

Performance on independent datasets

Although CNN-FSRF achieved high performance on the Yeast and Helicobacter pylori datasets, we further verified its performance on independent datasets. Specifically, we first trained CNN-FSRF on the entire Yeast dataset and then used the trained model to predict interactions among the proteins in the C. elegans, E. coli, H. sapiens and M. musculus datasets. In biological terms, this means using protein interactions identified in one organism to predict interactions in other organisms. This approach rests on the hypothesis that homologous proteins retain their ability to interact. The hypothesis, in turn, is based on the assumption that homologs across species have similar functional behaviors, so that they maintain the same PPIs³⁹.

The C. elegans, E. coli, H. sapiens and M. musculus datasets contain only pairs of interacting proteins, numbering 4013, 6954, 1412 and 313, respectively. Because these datasets contain no negative samples, we report only the metrics that remain meaningful: accuracy, sensitivity and F_score. Table 6 lists the experimental results on the independent datasets. CNN-FSRF achieved good results on all four datasets, with average accuracy, sensitivity and F_score of 95.95%, 95.95% and 97.92%, respectively. These results indicate that our method not only performs well but also generalizes well, and that it can be applied to different protein interaction prediction problems.
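
The identical accuracy and sensitivity values follow directly from the absence of negative samples. The sketch below illustrates this, assuming that F_score is the harmonic mean of precision and sensitivity; the function name `positive_only_metrics` and the count of 3800 correct predictions are hypothetical, and only the dataset size of 4013 comes from the text.

```python
def positive_only_metrics(n_pairs, n_predicted_interacting):
    """Metrics for a test set in which every pair is a true interaction."""
    tp = n_predicted_interacting
    fn = n_pairs - tp
    # With no negative samples there are no true negatives and no false
    # positives, so accuracy reduces to sensitivity and precision is 1.
    accuracy = tp / n_pairs
    sensitivity = tp / (tp + fn)
    precision = 1.0 if tp else 0.0
    if tp:
        f_score = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f_score = 0.0
    return accuracy, sensitivity, f_score


# 4013 is the C. elegans dataset size; 3800 is a made-up number of hits.
acc, sen, f1 = positive_only_metrics(4013, 3800)
print(f"accuracy={acc:.4f} sensitivity={sen:.4f} F_score={f1:.4f}")
```

Under these assumptions F_score equals 2s/(1+s) for sensitivity s, so it always exceeds the accuracy. For s = 95.95% this gives about 97.93%, close to the reported average of 97.92%; an average of per-dataset F_scores need not equal the F_score of the average sensitivity.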

Table 6 Prediction results of four species based on the proposed method.

Conclusions

In this study, we developed a novel sequence-based approach to predict potential protein-protein interactions accurately by combining a deep learning convolutional neural network with a feature-selective rotation forest. Extracting effective feature descriptors is well known to be the key to predicting PPIs, so the main advantage of our approach is that the convolutional neural network extracts protein feature information objectively and in depth. FSRF is then used to remove noise information and give accurate prediction results. The experimental results show that CNN-FSRF performs well in predicting PPIs: it obtained prediction accuracies of 97.75% and 88.96% under 5-fold cross-validation on the real PPI datasets Yeast and Helicobacter pylori, respectively. We also compared CNN-FSRF with the SVM model and other existing methods, and validated our approach on independent datasets. These results demonstrate that our approach can be an effective tool for accurately predicting potential protein interactions. In future research, we will continue to study the use of deep learning to extract effective protein features in the hope of achieving better results.

References

Zhang, Q. C. et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490, 556, https://doi.org/10.1038/nature11503 (2012).
Wang, L. et al. Advancing the prediction accuracy of protein-protein interactions by utilizing evolutionary information from position-specific scoring matrix and ensemble classifier. Journal of Theoretical Biology 418, 105–110, https://doi.org/10.1016/j.jtbi.2017.01.003 (2017).
You, Z. H., Lei, Y. K., Gui, J., Huang, D. S. & Zhou, X. B. Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data. Bioinformatics 26, 2744–2751, https://doi.org/10.1093/bioinformatics/btq510 (2010).
Gavin, A. C. et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 141–147, https://doi.org/10.1038/415141a (2002).
Ito, T. et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences of the United States of America 98, 4569–4574, https://doi.org/10.1073/pnas.061034498 (2001).
Krogan, N. J. et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440, 637–643, https://doi.org/10.1038/nature04670 (2006).
Ho, Y. et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415, 180–183, https://doi.org/10.1038/415180a (2002).
Yang, Y. D. & Zhou, Y. Q. Specific interactions for ab initio folding of protein terminal regions with secondary structures. Proteins-Structure Function and Bioinformatics 72, 793–803, https://doi.org/10.1002/prot.21968 (2008).
Pan, X.-Y., Zhang, Y.-N. & Shen, H.-B. Large-Scale Prediction of Human Protein-Protein Interactions from Amino Acid Sequence Based on Latent Topic Features. Journal of Proteome Research 9, 4992–5001, https://doi.org/10.1021/pr100618t (2010).
Katona, G. et al. Fast two-photon in vivo imaging with three-dimensional random-access scanning in large tissue volumes. Nature Methods 9, 201–208 (2012).
Katona, G., Garcia-Bonete, M. J. & Lundholm, I. V. Estimating the difference between structure-factor amplitudes using multivariate Bayesian inference. Acta Crystallographica 72, 406–411 (2016).
Martin, S., Roe, D. & Faulon, J. L. Predicting protein-protein interactions using signature products. Bioinformatics 21, 218–226, https://doi.org/10.1093/bioinformatics/bth483 (2005).
Jiao, Q. J., Zhang, Y. K., Li, L. N. & Shen, H. B. BinTree seeking: a novel approach to mine both bi-sparse and cohesive modules in protein interaction networks. PLoS ONE 6, e27646 (2011).
Luo, X. et al. A Highly Efficient Approach to Protein Interactome Mapping Based on Collaborative Filtering Framework. Scientific Reports 5, https://doi.org/10.1038/srep07702 (2015).
Urquiza, J. M. et al. Method for prediction of protein–protein interactions in yeast using genomics/proteomics information and feature selection. Neurocomputing 74, 2683–2690 (2011).
Zhang, Q. C. et al. Structure-based prediction of protein-protein interactions on a genome-wide scale (vol 490, pg 556, 2012). Nature 495, 127–127, https://doi.org/10.1038/nature11977 (2013).
Zhang, Q. C., Petrey, D., Norel, R. & Honig, B. H. Protein interface conservation across structure space. Proc Natl Acad Sci USA 107, 10896–10901 (2010).
Kafkas, Ş., Varoğlu, E., Rebholz-Schuhmann, D. & Taneri, B. Functional variation of alternative splice forms in their protein interaction networks: a literature mining approach. BMC Bioinformatics 11, P1 (2010).
Xu, J. et al. Refined phylogenetic profiles method for predicting protein-protein interactions. Bioinformatics 21, 3409 (2005).
Sun, J., Li, Y. & Zhao, Z. Phylogenetic profiles for the prediction of protein-protein interactions: how to select reference organisms? Biochem Biophys Res Commun 353, 985–991 (2007).
Autore, F. et al. Large-scale modelling of the divergent spectrin repeats in nesprins: giant modular proteins. PLoS ONE 8, e63633 (2013).
Zhang, J., Yang, J., Huang, T., Shu, Y. & Chen, L. Identification of novel proliferative diabetic retinopathy related genes on protein–protein interaction network. Neurocomputing 217, 63–72 (2016).
Zhang, Y.-N., Pan, X.-Y., Huang, Y. & Shen, H.-B. Adaptive compressive learning for prediction of protein-protein interactions from primary sequence. Journal of Theoretical Biology 283, 44–52, https://doi.org/10.1016/j.jtbi.2011.05.023 (2011).
Wang, D. D., Wang, R. & Yan, H. Fast prediction of protein–protein interaction sites based on Extreme Learning Machines. Neurocomputing 128, 258–266 (2014).
Zhu, L., You, Z. H. & Huang, D. S. Increasing the reliability of protein–protein interaction networks via non-convex semantic embedding. Neurocomputing 121, 99–107 (2013).
You, Z. H. et al. Detecting Protein-Protein Interactions with a Novel Matrix-Based Protein Sequence Representation and Support Vector Machines. BioMed Research International 2015, 1–9 (2015).
Hu, L. & Chan, K. C. Extracting Coevolutionary Features from Protein Sequences for Predicting Protein-Protein Interactions. IEEE/ACM Trans Comput Biol Bioinform 14, 155–166 (2017).
Saha, I. et al. Ensemble learning prediction of protein-protein interactions using proteins functional annotations. Molecular BioSystems 10, 820–830, https://doi.org/10.1039/c3mb70486f (2014).
Jeong, J. C., Lin, X. & Chen, X.-W. On Position-Specific Scoring Matrix for Protein Function Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8, 308–315, https://doi.org/10.1109/tcbb.2010.93 (2011).
Guo, Y., Yu, L., Wen, Z. & Li, M. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Research 36, 3025–3030, https://doi.org/10.1093/nar/gkn159 (2008).
Gao, Z. G. et al. Ens-PPI: A Novel Ensemble Classifier for Predicting the Interactions of Proteins Using Autocovariance Transformation from PSSM. BioMed Research International, 8, https://doi.org/10.1155/2016/4563524 (2016).
Wang, L. et al. A Computational-Based Method for Predicting Drug-Target Interactions by Using Stacked Autoencoder Deep Neural Network. Journal of Computational Biology 25, 361–373, https://doi.org/10.1089/cmb.2017.0135 (2018).
Ngiam, J. et al. In International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July. 689–696.
Zhou, S., Chen, Q. & Wang, X. Active deep learning method for semi-supervised sentiment classification. Neurocomputing 120, 536–546 (2013).
Wang, L. et al. RFDT: A Rotation Forest-based Predictor for Predicting Drug-Target Interactions Using Drug Structure and Protein Sequence Information. Current Protein & Peptide Science 19, 445–454, https://doi.org/10.2174/1389203718666161114111656 (2018).
Guo, X., Chen, L. & Shen, C. Hierarchical adaptive deep convolution neural network and its application to bearing fault diagnosis. Measurement 93, 490–502 (2016).
Rodriguez, J. J. & Kuncheva, L. I. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 1619–1630, https://doi.org/10.1109/tpami.2006.211 (2006).
Pal, M. & Foody, G. M. Feature Selection for Classification of Hyperspectral Data by SVM. IEEE Transactions on Geoscience & Remote Sensing 48, 2297–2307 (2010).
Shi, M.-G., Xia, J.-F., Li, X.-L. & Huang, D.-S. Predicting protein-protein interactions from sequence using correlation coefficient and high-quality interaction dataset. Amino Acids 38, 891–899, https://doi.org/10.1007/s00726-009-0295-y (2010).
Yang, L., Xia, J.-F. & Gui, J. Prediction of Protein-Protein Interactions from Protein Sequence Using Local Descriptors. Protein and Peptide Letters 17, 1085–1090 (2010).
Zhou, Y. Z., Gao, Y. & Zheng, Y. Y. Prediction of Protein-Protein Interactions Using Local Description of Amino Acid Sequence. Advances in Computer Science and Education Applications, Pt II 202, 254–262 (2011).
You, Z.-H., Lei, Y.-K., Zhu, L., Xia, J. & Wang, B. Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC Bioinformatics 14, https://doi.org/10.1186/1471-2105-14-s8-s10 (2013).
Wang, Y. B. et al. Predicting protein-protein interactions from protein sequences by a stacked sparse autoencoder deep neural network. Molecular BioSystems 13, 1336–1344 (2017).
Du, X. et al. DeepPPI: Boosting Prediction of Protein-Protein Interactions with Deep Neural Networks. Journal of Chemical Information & Modeling 57, 1499 (2017).
Long, Z., Yu, G., Xia, D. & Wang, J. Protein-Protein Interactions Prediction based on Ensemble Deep Neural Networks. Neurocomputing, S0925231218306337- (2018).
Tripathi, R. DeepInteract: Deep Neural Network based Protein-Protein Interaction prediction tool. Current Bioinformatics 11 (2017).
Liu, B. et al. QChIPat: a quantitative method to identify distinct binding patterns for two biological ChIP-seq samples in different experimental conditions. BMC Genomics 14, https://doi.org/10.1186/1471-2164-14-s8-s3 (2013).
Nanni, L. & Lumini, A. An ensemble of K-local hyperplanes for predicting protein-protein interactions. Bioinformatics 22, 1207–1210, https://doi.org/10.1093/bioinformatics/btl055 (2006).
Bock, J. R. & Gough, D. A. Whole-proteome interaction mining. Bioinformatics 19, 125–134, https://doi.org/10.1093/bioinformatics/19.1.125 (2003).

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant 61702444, in part by the West Light Foundation of the Chinese Academy of Sciences under Grant 2018-XBQNXZ-B-008, and in part by the Zaozhuang Science and Technology Development Plan under Grant 2018GX07. The authors would like to thank all anonymous reviewers for their constructive advice.

Author information

Authors and Affiliations

College of Information Science and Engineering, Zaozhuang University, Zaozhuang, Shandong, 277100, P.R. China
Lei Wang, Hai-Feng Wang & San-Rong Liu
Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011, P.R. China
Lei Wang
School of Foreign Languages, Zaozhuang University, Zaozhuang, Shandong, 277100, P.R. China
Xin Yan
School of Information Engineering, JiangXi University of Science and Technology, Ganzhou, Jiangxi, 341000, P.R. China
Ke-Jian Song

Contributions

L.W., H.W. and X.Y. conceived the algorithm, carried out the analyses, prepared the data sets, carried out experiments, and wrote the manuscript. S.L., K.S. and L.W. designed, performed and analyzed experiments and wrote the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Lei Wang or Xin Yan.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wang, L., Wang, HF., Liu, SR. et al. Predicting Protein-Protein Interactions from Matrix-Based Protein Sequence Using Convolution Neural Network and Feature-Selective Rotation Forest. Sci Rep 9, 9848 (2019). https://doi.org/10.1038/s41598-019-46369-4

Received: 05 March 2019
Accepted: 10 June 2019
Published: 08 July 2019
DOI: https://doi.org/10.1038/s41598-019-46369-4
