Introduction

Protein–protein interactions (PPIs) play an essential role in biological processes, such as cell metabolism, immune response1, and signal transduction2. Therefore, it is essential to develop effective strategies for correctly identifying potential PPIs to understand better protein functions and model complex protein structures. In recent years, some small-scale experimental methods (such as chromatography and biochemical assays) are always utilized to predict the potential PPIs. However, these methods are often inefficient, high time-consuming, and not suitable for large-scale prediction3. Hence, several high-throughput experimental methods have also been invented for identifying potential protein–protein interactions, including immune precipitation, yeast two-hybrid screens (Y2H)4, crystallography, and protein chips5. These methods have generated copious known protein–protein interaction pairs, which is of great importance for analyzing potential PPIs. Nevertheless, these high-throughput technologies still have obvious drawbacks, such as a high false-positive rate, small coverage, and time-intensive6,7. Accordingly, due to these limitations of traditional experimental methods, there is an urgent need to develop effective and accurate computational models to identify potential PPIs. In recent years, more and more computational methods8,9,10,11,12 have been developed as an aid to biological experiment methods with the aim of solving their high false-positive, small converge and time-intensive problems. More specifically, computational methods employ sophisticated algorithms and statistical models to analyze biological data, helping to minimize false-positive results8,9,10,11,12. They take advantage of the availability of vast amounts of biological data generated through high-throughput techniques. By analyzing large-scale datasets, these methods can identify patterns, trends, and associations that may be undetectable with traditional experimental approaches. Furthermore, biological experiments can be time-consuming and costly, requiring extensive sample preparation, data collection, and analysis. Computational methods provide a more efficient and cost-effective alternative. Once the necessary algorithms and models are developed, computational analyses can be performed relatively quickly on powerful computer systems. This saves time and resources, allowing researchers to explore a broader range of hypotheses or conduct large-scale investigations more feasibly.

Recently, several computational methods for potential protein–protein interaction prediction have been proposed. Of these, some methods take advantage of 3D structure13, gene ontology and annotations14, gene fusion, and co-expression15,16,17,18 technologies. However, these technologies usually require prior knowledge of the collected proteins, which dramatically limits their accuracy and reliability. For example, the 3D structure of many proteins is difficult to obtain, and the gene ontology annotation of proteins is incomplete19,20,21,22,23. In contrast, abundant sequence data of proteins from multiple sources is relatively easy to obtain. Thence, several computational methods based on sequence features of proteins have been developed to predict potential PPIs. For example, Shen et al.24 developed a novel model for protein–protein interaction prediction only utilizing protein sequence information. In their work, protein sequence information was first extracted based on amino acids' triad characteristics. Then the model was constructed by using support vector machines (SVM) combined with a kernel function. This experiment fully proves that the computational methods only using protein sequence features also have a good prediction ability of protein–protein interactions. Guo et al.25 constructed a new protein sequence feature representation method to predict potential PPIs. Specifically, they selected the auto covariance (AC) method to extract the characteristics of protein sequences based on seven physicochemical properties of amino acids. This method thoroughly considered the interactions between amino acids at different distances in the protein sequence and ultimately performed better than other sequence-based methods. Furthermore, their study demonstrated that extracting protein sequence features by the auto-covariance (AC) method is feasible and effective for potential protein–protein interactions prediction.

In addition, machine learning algorithms have also attracted the attention of many researchers in the field of potential protein–protein interactions prediction. For example, Wang et al.26 developed a feature-weighted Rotation Forest model for protein–protein interaction prediction by eliminating useless information to use the valuable features fully. In the results, their proposed method achieved excellent prediction performance under the cross-validation experiment. You et al.27 presented a new method to transform the protein sequence features into matrix representation and then utilized the support vector machine (SVM) for training and prediction. Their model finally achieved excellent prediction results in the yeast PPIs datasets. Finally, You et al.28 developed an ensemble weighted sparse representation model classifier and replaced the matrix representation with the integrated protein sequence-function to predict potential protein–protein interactions. Compared with many previous advanced methods, this model has better performance.

Human cells are part of a complex biomolecular network, involving interactions and associations among various biomolecules, such as proteins, miRNAs, and diseases. Proteins often interact with each other based on their shared relationships with other biomolecules. Leveraging this associated information can help predict potential protein–protein interactions (PPIs). In this study, we introduce a new computational model (called MultiPPIs) to predict PPIs. This model combines protein sequence physicochemical features with multi-source biomolecular association data (including drugs, miRNAs, lncRNAs, and diseases). First, we use the auto-covariance method to extract features from protein sequences based on amino acids' physicochemical properties. Second, we create a network that integrates known associations among various biomolecules, as depicted in Fig. 1. Using DeepWalk29, a graph representation method, we extract association information from this network. We then utilize 19,237 known PPI pairs from the STRING database (2017)30 as our positive dataset. A matching number of random non-interacting pairs form the negative dataset. These datasets are combined to create our final training set. The prediction model is constructed using a Random Forest (RF) classifier, optimized for best performance. The process flow of MultiPPIs is outlined in Fig. 2. In our study, the proposed model, under fivefold cross-validation, achieves an average accuracy of 0.8603 and an AUC of 0.9304. These results are better than many current computational methods. We also compared two feature combination strategies. Our method is more effective than using only protein sequence information by combining multiple types of data. Additionally, we test four popular classifiers and find the Random Forest classifier to be the most suitable for our model, offering superior prediction performance. These experiments demonstrate that our model is an efficient tool for predicting potential protein–protein interactions. Compared with previous computational methods8,9,10,11,12, our method mainly has the following specific advantages: (1) Considering the holistic nature of biomolecular networks, our method collects a large amount of association data to construct a multi-source molecular network, and extracts the higher-order network features of proteins based on the graph representation learning method to improve the accuracy of the prediction of PPIs. (2) Our method fully takes advantage of the local property of residues in protein sequences and describes the level of correlation between two protein sequences based on their specific physical and chemical properties. This not only improves the prediction performance of our method, but also solves the cold-start problem often encountered by graph neural network-based methods. (3) By conducting extensive experiments, including comparison of feature combinations, comparison of classification models, optimization and adjustment of model parameters, and comparison with previous experimental methods, our method has been confirmed to have excellent performance in predicting PPIs and is better than most previous computational methods.

Figure 1
figure 1

The multi-source molecular network.

Figure 2
figure 2

The flowchart of our proposed model.

Results and discussion

The five-fold cross-validation performance of our proposed model

Cross-validation is a standard method used in machine learning to construct and validate model parameters. In this work, fivefold cross-validation was adopted to evaluate the performance of our model. First, we equally divided the sample data into five parts. Second, we sequentially selected four parts as the training set and the remaining 1 part as the test set. The experiment repeated 5 times. Finally, six standard parameters were used as evaluation indicators for our experiments, including specificity (Spec.), Matthews's correlation coefficient (MCC), precision (Prec.), sensitivity (Sen.), accuracy (Acc.), and the areas under the ROC curve (AUC). Table 1 lists the detailed results of each validation. The last line shows the average value and the standard deviation of the results across five runs of the classifier. These experimental results demonstrated that our model could achieve good results and stability in the protein–protein interaction prediction.

Table 1 The fivefold cross-validation results of our proposed model.

The Receiver Operating Characteristic (ROC) curve is an essential and common statistical analysis tool widely used to judge the quality of classification and prediction results in medical research and machine learning. It first sorts the samples according to the prediction results of the classifier and then predicts the samples as positive samples one by one in this order. This way calculates two important values (True Positive Rate, False Positive Rate) each time and plots them as the horizontal and vertical coordinates, respectively. Besides, the AUC is defined as the areas under the ROC curve, and its value range is generally between 0.5 and 1. Generally, the ROC curve cannot indicate which classifier has better performance, so the AUC value is selected as the evaluation index. The classifier with a larger AUC has better performance. The Precision-Recall (PR) curve is another tool to evaluate the performance of a classifier. For the category imbalance problem, the PR curve is widely considered superior to the ROC curve. Similarly, the AUPR is defined as the areas under the PR curve. Figures 3 and 4 respectively show our method's ROC and PR curves under fivefold cross-validation. These results once again demonstrated our model's good effect and stability in predicting potential protein–protein interactions.

Figure 3
figure 3

The ROC curves and AUC values of our model under fivefold cross-validation.

Figure 4
figure 4

The PR curves and AUPR values of our model under fivefold cross-validation.

Compare the effect of our feature combination strategy

To further compare the effect of our feature combination strategy, a different feature combination was utilized to represent protein nodes. More specifically, we used the only protein sequence features (combination 1) and the combination of the protein sequence features and the multi-source associated information of proteins used by MultiPPIs (combination 2) to represent proteins before carrying out the fivefold cross-validation experiment. One important thing that must be mentioned is that the experimental environment of the two different combinations is the same to ensure the fairness of comparison. Table 2 lists the results of the experiment results of combination 1 under the fivefold cross-validation experiment. The experiment results of combination 1 is shown in Table 1. Figures 5 and 6, respectively, show the comparative experiment's ROC curves and PR curves. As the results show, our feature combination strategy performs better than most computational methods that only use protein sequence features. This once again proves that the multi-source association information with other biomolecules of proteins is helpful for protein–protein interaction prediction.

Table 2 The results of different feature combinations under fivefold cross-validation.
Figure 5
figure 5

The ROC curves and AUC values of two different feature combination strategies. (A) the ROC curves and AUC values of the only protein sequence features. (B) The ROC curves and AUC values of the combination of protein sequence features and the multi-source associated information of proteins. (C) Comparison of the ROC curves and AUC values of two different feature combination strategies.

Figure 6
figure 6

The PR curves and AUPR values of two different feature combination strategies. (A) The PR curves and AUPR values of the only protein sequence features. (B) The PR curves and AUPR values of the combination of protein sequence features and the multi-source associated information of proteins. (C) Comparison of the PR curves and AUPR values of two different feature combination strategies.

Compare the effect of different classifiers

To choose the most suitable classifier for our model, we conducted a comparison experiment with the four most commonly used classifiers, including Decision Tree, Naive Bayes, KNN, and Random Forest. We used these four classifiers with default training parameters to train and predict the protein–protein interactions and kept other experimental conditions consistent. Finally, the Random Forest classifier performed better by observing the prediction results. Table 3 lists the average parameter values of different classifiers under fivefold cross-validation. Figures 7 and 8, respectively, show the ROC and PR curves of the comparative experiment. The comparison experiment results proved that the Random Forest is more suitable for our model than other classifiers, especially in terms of the AUC and accuracy, which can represent the ability of a model.

Table 3 The average parameter values of different classifiers under fivefold cross-validation.
Figure 7
figure 7

The ROC curves and AUC values of different classifiers. (A) The ROC curves and AUC values of the Decision Tree classifier. (B) The ROC curves and AUC values of the KNN classifier. (C) The ROC curves and AUC values of the Naive Bayes classifier. (D) The ROC curves and AUC values of the random forest classifier. (E) Comparison of the ROC curves and AUC values of different classifiers.

Figure 8
figure 8

The PR curves and AUPR values of different classifiers. (A) The PR curves and AUPR values of the decision tree classifier. (B) The PR curves and AUPR values of the KNN classifier. (C) The PR curves and AUPR values of the Naive Bayes classifier. (D) The PR curves and AUPR values of the Random Forest classifier. (E) Comparison of the PR curves and AUPR values of different classifiers.

Compare the effect of random forest classifier parameter

Random Forest (RF) is a flexible and efficient supervised learning algorithm Breiman proposed in 2001. This algorithm has achieved good results in solving problems in many fields. It has the characteristics of preventing overfitting, strong model stability, and easy to deal with nonlinear regression problems. It is also a particular bootstrap aggregating (bagging) method which uses the decision tree as the training model. It first uses the bootstrap method to generate training sets and then constructs a decision tree for each training set. Finally, all these decision trees are combined to form the classifier to improve the overall effect. Additionally, when segmenting node features, the Random Forest method does not select all features that can maximize the index (such as information gain). Instead, it randomly extracts a subset of features and then finds the optimal solution within this subset. For the Random Forest model parameters, we need to set the regression tree number N. In detail, and we started to train the model at an interval of 20 from N = 180 and observed the relationship between the number of N and the final prediction accuracy. We terminated the model training if the prediction accuracy decreased with the increase of N. Table 4 lists the accuracy results of the Random Forest classifier with different N parameters under fivefold cross-validation. As a result, we can see that the Random Forest classifier has the best performance when the number of regression trees is 300.

Table 4 The accuracy results of the Random Forest classifier with different N parameters.

Performance comparison with the state-of-the-art methods

To further evaluate the effectiveness of MultiPPIs, we conduct a detailed comparative analysis between it and several existing protein–protein interaction prediction methods, including LR_PPI31, DPPI32, WSRC_GE33, LPPI34 and PIPR35. Our evaluation framework encompasses five distinct performance metrics, as detailed in Table 5. These metrics include specificity (Spec.), Matthews’s correlation coefficient (MCC), precision (Prec.), sensitivity (Sen.), accuracy (Acc.), and the areas under the ROC curve (AUC), providing a comprehensive view of each method's predictive capabilities. Our findings reveal a significant enhancement in performance with MultiPPIs. This substantial leap in accuracy underscores the effectiveness of MultiPPIs in identifying protein–protein interactions, marking a notable advancement in the field.

Table 5 Performance comparison of MultiPPIs with the state-of-the-art methods.

Materials and methods

Protein sequence features based on the physicochemical properties of amino acids

In this study, we downloaded the sequence information of proteins from the STRING: in 201730 database. Proteins are biopolymers composed of up to 20 different amino acids as basic units. The sequence of amino acid residues in the peptide chain is called the primary structure of proteins. Consequently, we selected the six physicochemical properties of amino acids to represent the protein sequence features in this work, including polarity (P1), hydrophobicity (H), net charge index of side chains (NCISC), volumes of side chains of amino acids (VSC), solvent-accessible surface area (SASA) and polarizability (P2). The original physicochemical values of these 20 amino acids are listed in Table 6.

Table 6 The original physicochemical values of 20 amino acids.

Performance evaluation criteria for our experiments

In order to verify the quality of our proposed method, six standard parameters were calculated as evaluation indicators for our experiments, including specificity (Spec.), Matthews's correlation coefficient (MCC), precision (Prec.), sensitivity (Sen.), accuracy (Acc.), and the areas under the ROC curve (AUC). The description of all computational formulas is as follows:

$$Spec =\frac{TN}{FP+TN}$$
(1)
$$MCC=\frac{TP\times TN-FP\times FN}{\surd (TP+FN)\times (TN+FP)\times (TP+FP)\times (TN+FN)}$$
(2)
$$Prec =\frac{TP}{FP+TP}$$
(3)
$$Sen =\frac{TP}{TP+FN}$$
(4)
$$Acc =\frac{TP+TN}{TP+FP+TN+FN}$$
(5)

where TN, FN, TP, and FP represent the total number of true negative, false negative, true positive, and false positive. Furthermore, the AUC (the area under the ROC curve) was also implemented to evaluate the performance of our model.

Auto covariance (AC) method

The extraction of protein sequence features using the auto covariance (AC) method was completely proposed by Guo et al.36. This method fully takes advantage of the local property of residues in protein sequences and describes the level of correlation between two protein sequences based on their specific physical and chemical properties37,38,39. First, we normalized the original physicochemical values of 20 amino acids to unit standard deviations (SD) and zero means according to Eq. (1):

$${{P}_{ij}}^{\mathrm{^{\prime}}}=\frac{{P}_{ij}-\overline{{P }_{j}}}{{S}_{j}}, (i=\mathrm{1,2},\dots ,6;j=\mathrm{1,2},\dots 20)$$
(6)

where \({P}_{ij}\) is the \({j}_{th}\) descriptor value for \({i}_{th}\) amino acid, \(\overline{{P }_{j}}\) is the mean of \({j}_{th}\) descriptor over the 20 amino acids and \({S}_{j}\) is the corresponding standard deviations, given by:

$$\overline{{P }_{j}}=\frac{{\sum }_{i=1}^{20}{P}_{ij}}{20}$$
(7)
$${S}_{j}= \sqrt{\frac{{\sum }_{i=1}^{20}{({P}_{ij}-\overline{{P }_{j}})}^{2}}{20}}$$
(8)

In this way, each amino acid in a protein sequence is converted to the corresponding standardized physicochemical value. Then, the AC method is used to encode the protein sequence into a feature vector:

$${\text{AC}}=\frac{1}{N-d}{\sum }_{j=1}^{N-d}({X}_{i,j}-\frac{1}{n}\sum_{i=1}^{n}{X}_{i,j})({X}_{i+d,j}-\frac{1}{n}\sum_{i=1}^{n}{X}_{i,j})$$
(9)

where \({X}_{i,j}\) is the \({j}_{th}\) descriptor value of the \({i}_{th}\) amino acid, N is the length of the protein sequence, d is the width of the sliding window. In this article, the parameters d and j are respectively set to 30 and 6. On this basis, a protein sequence is finally encoded as a 30*6 = 180-dimensional feature vector.

The multi-source molecular network construction

In order to utilize the associated information of proteins with other biomolecules, we systematically and comprehensively constructed the association information network by integrating the known associations among proteins, diseases, miRNAs, drugs, and lncRNAs, which were downloaded from multiple databases. The source and version of the raw data are shown in Table 7 below. In addition, we have done some operations with the raw data, such as removing some irrelevant items and unifying the identifiers. Besides, we also counted the number of nodes contained in the original association data, as shown in Table 8.

Table 7 The data information in the multi-source molecular network.
Table 8 The node information in the multi-source molecular network.

DeepWalk algorithms

In order to extract the associated information feature of proteins from the association information network we constructed, the graph embedding algorithms: DeepWalk29 was adopted in our work. The input of the DeepWalk method is a graph or network, and then the social representation of vertices in the network was learned through the truncated random walk and the SkipGram model. Finally, it outputs the potential relationship of vertices in the network. The basic idea of this algorithm is first to obtain the node sequence as a sentence through the random walk, and then to obtain the local information of the network from the truncated random walk sequence by maximizing the co-occurrence probability of vertex \({v}_{j}\) within a window size w to learn the potential representation of the node based on the local information, which is calculated as follows:

$$\Pr \left( {\left\{ {v_{j - w} , \ldots ,v_{j + w} } \right\}s\backslash v_{j} |\Phi \left( {v_{j} } \right)} \right) = \prod\nolimits_{i = j - w,i \ne j}^{j + w} {\Pr \left( {v_{i} |\Phi \left( {v_{j} } \right)} \right)}$$
(10)
$${\text{Pr}}({v}_{i}|\Phi \left({v}_{j}\right)=\prod_{k=1}^{\left[{\text{log}}\left|V\right|\right]}1/(1+{e}^{-\Phi \left({v}_{j}\right)\cdot \varphi \left({b}_{k}\right)})$$
(11)

where \(\Phi ({v}_{j})\) indicates that vertex \({v}_{j}\) is mapped to its representation space, \(\varphi ({b}_{k})\) means the parent node of the tree node \({b}_{k}\). More specifically, the entire DeepWalk method is mainly composed of two algorithms. Algorithm 1 of the DeepWalk model mainly includes 4 steps: (1) Generate γ random walks for each node in the input network structure. (2) Uniformly samples a point in the network as the root node in each random walk process. (3) Uniformly select the neighbor node as the next node from the current node. (4) Repeat the above steps until the walking path reaches the specified length. Algorithm 2 of the DeepWalk model is to perform the SkipGram model for training the sequence data to get the network feature vector of each node. The SkipGram model iters all possible matches within a window for the random walk sequence. It utilizes nodes to assume the context and discovers the representation of the vector by achieving the maximum co-occurrence probability of words in a window while neglecting the order in which the nodes occur in the sentence. According to the independent presumption, the probability of co-occurrence can be transferred into the conditional probability product. The detailed process of the algorithm is respectively shown in Tables 9 and 10. In this way, the associated information with other biomolecules of proteins in the association information network is converted to the feature vector, which can be used by the machine learning classifiers.

Table 9 Algorithm 1 of the DeepWalk model.
Table 10 Algorithm 2 of the DeepWalk model

The representation of protein nodes

In this study, the protein nodes were represented by the combination of the physicochemical features of protein sequences and multi-source association information with other biomolecules (drugs, miRNAs, lncRNAs, and diseases) of proteins in the association information network. The sequence feature of proteins was obtained by the auto-covariance (AC) method based on the six physicochemical properties of amino acids. Besides, the associated information with other nodes of proteins was obtained by the network representation method DeepWalk based on the association information network we constructed. Finally, we combined these two features to represent the protein–protein interaction pairs.

Conclusion

The protein–protein interactions (PPIs) play a vital role in the cell biochemical reaction network and are significant for regulating cells and their signals. However, the traditional biological experiment methods have the limitations of a high time-consuming and long period, which is not suitable for large-scale protein–protein interactions prediction. In this study, we proposed a novel computational method to predict potential PPIs by combining the sequence feature and associated information with other molecules of proteins. For the sequence feature of proteins, we utilized the auto covariance (AC) method to extract it based on the six physicochemical properties of amino acids. For the association information feature with other molecules of proteins, we utilized the DeepWalk network representation method to extract it based on the association information network we constructed. In this way, the proteins were represented by combining these two features. Finally, the Random Forest classifier and its corresponding optimal parameters were selected for training and prediction. As a result, our proposed method achieved average accuracy and AUC of 86.03% and 93.03% under fivefold cross-validation, which is superior to many existing computational models. Besides, to evaluate the effect of our feature combination, we further compared the performance of only the protein sequence feature and the combination of protein sequence and association feature. Furthermore, to select the most suitable classifier for our model, we also compared the ability of the four most commonly used classifiers. While overcoming many challenges, our current method still has its limitations. In our work, we collected 8 associations between 5 biological molecules to construct a multi-source molecular network. All the proteins in our dataset are distributed on this network. Therefore, we are able to utilize the relationships between different molecules to extract the network features of protein nodes. Note that we have removed known protein–protein interactions during training to avoid causing label leakage. An independent test set, completely independent of the existing dataset, would result in the inability to use molecular network relationships. We designed our model to address this limitation by considering both the physicochemical properties of the protein sequence. For new proteins that cannot be added to the network, we use this feature for interaction prediction. Our data and code is open source, easily available at https://github.com/jiboyalab/multiPPIs.