A multi-source molecular network representation model for protein–protein interactions prediction

Zou, Hai-Tao; Ji, Bo-Ya; Xie, Xiao-Lan

doi:10.1038/s41598-024-56286-w

Article
Open access
Published: 14 March 2024

Scientific Reports volume 14, Article number: 6184 (2024)

Abstract

The prediction of potential protein–protein interactions (PPIs) is a critical step in decoding diseases and understanding cellular mechanisms. Traditional biological experiments have identified many PPIs in recent years, but the problem is still far from solved. Hence, there is an urgent need to develop computational models with good performance and high efficiency to predict potential PPIs. In this study, we propose a multi-source molecular network representation learning model (called MultiPPIs) to predict potential protein–protein interactions. Specifically, we first extract protein sequence features according to the physicochemical properties of amino acids by using the auto covariance method. Second, a multi-source association network is constructed by integrating the known associations among miRNAs, proteins, lncRNAs, drugs, and diseases. The graph representation learning method DeepWalk is adopted to extract the multi-source association information of proteins with other biomolecules. In this way, the known protein–protein interaction pairs can be represented as a concatenation of the protein sequence features and the multi-source association features of proteins. Finally, the Random Forest classifier with its optimal parameters is used for training and prediction. MultiPPIs obtains an average prediction accuracy of 86.03% with 82.69% sensitivity and an AUC of 93.03% under five-fold cross-validation. The experimental results indicate that MultiPPIs has good prediction performance and provides valuable insights into the field of protein–protein interaction prediction. MultiPPIs is freely available at https://github.com/jiboyalab/multiPPIs.

Introduction

Protein–protein interactions (PPIs) play an essential role in biological processes such as cell metabolism, immune response¹, and signal transduction². It is therefore essential to develop effective strategies for correctly identifying potential PPIs, both to better understand protein functions and to model complex protein structures. Small-scale experimental methods (such as chromatography and biochemical assays) have long been used to identify PPIs. However, these methods are often inefficient, time-consuming, and unsuitable for large-scale studies³. Hence, several high-throughput experimental methods have also been developed for identifying potential protein–protein interactions, including immunoprecipitation, yeast two-hybrid screens (Y2H)⁴, crystallography, and protein chips⁵. These methods have generated a large number of known protein–protein interaction pairs, which are of great importance for analyzing potential PPIs. Nevertheless, these high-throughput technologies still have obvious drawbacks, such as a high false-positive rate, low coverage, and long turnaround times⁶,⁷. Given these limitations of traditional experimental methods, there is an urgent need to develop effective and accurate computational models to identify potential PPIs. In recent years, a growing number of computational methods⁸,⁹,¹⁰,¹¹,¹² have been developed as an aid to biological experiments, with the aim of mitigating their high false-positive rate, low coverage, and time cost. More specifically, computational methods employ algorithms and statistical models to analyze biological data, helping to reduce false-positive results⁸,⁹,¹⁰,¹¹,¹². They take advantage of the vast amounts of biological data generated through high-throughput techniques. By analyzing large-scale datasets, these methods can identify patterns, trends, and associations that may be undetectable with traditional experimental approaches.
Furthermore, biological experiments can be time-consuming and costly, requiring extensive sample preparation, data collection, and analysis. Computational methods provide a more efficient and cost-effective alternative. Once the necessary algorithms and models are developed, computational analyses can be performed relatively quickly on powerful computer systems. This saves time and resources, allowing researchers to explore a broader range of hypotheses or conduct large-scale investigations more feasibly.

Recently, several computational methods for protein–protein interaction prediction have been proposed. Some of them take advantage of 3D structure¹³, gene ontology and annotations¹⁴, gene fusion, and co-expression¹⁵,¹⁶,¹⁷,¹⁸. However, these approaches usually require prior knowledge about the proteins, which limits their applicability and reliability. For example, the 3D structure of many proteins is difficult to obtain, and the gene ontology annotation of proteins is incomplete¹⁹,²⁰,²¹,²²,²³. In contrast, abundant protein sequence data from multiple sources is relatively easy to obtain. Hence, several computational methods based on protein sequence features have been developed to predict potential PPIs. For example, Shen et al.²⁴ developed a model for protein–protein interaction prediction that uses only protein sequence information. In their work, protein sequence information was first extracted based on the triad characteristics of amino acids, and the model was then built using a support vector machine (SVM) combined with a kernel function. Their results showed that computational methods using only protein sequence features can also predict protein–protein interactions well. Guo et al.²⁵ constructed a new protein sequence feature representation method to predict potential PPIs. Specifically, they used the auto covariance (AC) method to extract the characteristics of protein sequences based on seven physicochemical properties of amino acids. This method accounts for the interactions between amino acids at different distances in the protein sequence and ultimately performed better than other sequence-based methods. Their study thus demonstrated that extracting protein sequence features with the AC method is feasible and effective for protein–protein interaction prediction.

In addition, machine learning algorithms have attracted the attention of many researchers in the field of protein–protein interaction prediction. For example, Wang et al.²⁶ developed a feature-weighted Rotation Forest model for protein–protein interaction prediction, which eliminates uninformative features so that the valuable ones are fully used. Their method achieved excellent prediction performance under cross-validation. You et al.²⁷ presented a method that transforms protein sequence features into a matrix representation and then uses a support vector machine (SVM) for training and prediction. Their model achieved excellent prediction results on the yeast PPI datasets. Finally, You et al.²⁸ developed an ensemble weighted sparse representation model classifier and replaced the matrix representation with the integrated protein sequence-function to predict potential protein–protein interactions. This model performed better than many previous advanced methods.

Human cells are part of a complex biomolecular network, involving interactions and associations among various biomolecules, such as proteins, miRNAs, and diseases. Proteins often interact with each other based on their shared relationships with other biomolecules. Leveraging this association information can help predict potential protein–protein interactions (PPIs). In this study, we introduce a new computational model (called MultiPPIs) to predict PPIs. This model combines protein sequence physicochemical features with multi-source biomolecular association data (including drugs, miRNAs, lncRNAs, and diseases). First, we use the auto-covariance method to extract features from protein sequences based on the physicochemical properties of amino acids. Second, we create a network that integrates known associations among various biomolecules, as depicted in Fig. 1. Using DeepWalk²⁹, a graph representation method, we extract association information from this network. We then use 19,237 known PPI pairs from the STRING database (2017)³⁰ as our positive dataset. A matching number of random non-interacting pairs form the negative dataset. These datasets are combined to create our final training set. The prediction model is constructed using a Random Forest (RF) classifier with tuned parameters. The process flow of MultiPPIs is outlined in Fig. 2. Under fivefold cross-validation, the proposed model achieves an average accuracy of 0.8603 and an AUC of 0.9304. These results are better than those of many current computational methods. We also compare two feature combination strategies: by combining multiple types of data, our method is more effective than using only protein sequence information. Additionally, we test four popular classifiers and find the Random Forest classifier to be the most suitable for our model, offering superior prediction performance.
These experiments demonstrate that our model is an efficient tool for predicting potential protein–protein interactions. Compared with previous computational methods⁸,⁹,¹⁰,¹¹,¹², our method has the following specific advantages: (1) Considering the holistic nature of biomolecular networks, our method collects a large amount of association data to construct a multi-source molecular network and extracts the higher-order network features of proteins with a graph representation learning method to improve the accuracy of PPI prediction. (2) Our method takes advantage of the local properties of residues in protein sequences and describes the level of correlation between two protein sequences based on their specific physical and chemical properties. This not only improves the prediction performance of our method, but also addresses the cold-start problem often encountered by graph neural network-based methods. (3) Extensive experiments, including a comparison of feature combinations, a comparison of classification models, tuning of model parameters, and a comparison with previous methods, confirm that our method performs well in predicting PPIs and is better than most previous computational methods.

Results and discussion

The five-fold cross-validation performance of our proposed model

Cross-validation is a standard method used in machine learning to evaluate models and select their parameters. In this work, fivefold cross-validation was adopted to evaluate the performance of our model. First, we divided the sample data into five equal parts. Second, we selected each part in turn as the test set and used the remaining four parts as the training set, so the experiment was repeated five times. Finally, six standard measures were used as evaluation indicators for our experiments: specificity (Spec.), Matthews's correlation coefficient (MCC), precision (Prec.), sensitivity (Sen.), accuracy (Acc.), and the area under the ROC curve (AUC). Table 1 lists the detailed results of each fold. The last line shows the average value and the standard deviation of the results across the five runs of the classifier. These experimental results demonstrate that our model achieves good and stable results in protein–protein interaction prediction.
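The fold procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `k_fold_splits`, the shuffling seed, and the interleaved way of forming folds are hypothetical choices.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are random
    # Fold f takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test_idx = folds[f]
        # The training set is the union of the other k - 1 folds.
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield train_idx, test_idx


# Toy usage: 10 samples split into 5 folds of 2 test samples each.
splits = list(k_fold_splits(10, k=5))
```

Each sample appears in exactly one test fold, so every interaction pair is predicted exactly once across the five runs.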

Table 1 The fivefold cross-validation results of our proposed model.

The Receiver Operating Characteristic (ROC) curve is a common statistical tool for judging the quality of classification and prediction results in medical research and machine learning. The samples are first sorted by the classifier's prediction scores and then, in that order, predicted as positive one by one. At each step two values are calculated, the False Positive Rate and the True Positive Rate, and plotted as the horizontal and vertical coordinates, respectively. The AUC is defined as the area under the ROC curve, and its value generally lies between 0.5 and 1. When the ROC curves of two classifiers cross, the curves alone cannot show which classifier performs better, so the AUC value is used as the evaluation index: the classifier with the larger AUC performs better. The Precision-Recall (PR) curve is another tool for evaluating the performance of a classifier. For class-imbalanced problems, the PR curve is widely considered more informative than the ROC curve. Similarly, the AUPR is defined as the area under the PR curve. Figures 3 and 4 show the ROC and PR curves of our method under fivefold cross-validation, respectively. These results again demonstrate the good performance and stability of our model in predicting potential protein–protein interactions.
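The construction of the ROC curve and its area can be illustrated with a short plain-Python sketch. The function names and the toy labels and scores below are invented for illustration and are not taken from the paper's data; tied scores are grouped into a single ROC point, which is one common convention.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR) obtained by sweeping the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for idx, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        is_last = idx + 1 == len(ranked)
        if is_last or ranked[idx + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: one negative (score 0.7) is ranked above one positive (score 0.6).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc(roc_points(labels, scores))  # 8 of the 9 positive/negative pairs are ordered correctly
```

The value equals the fraction of positive/negative pairs in which the positive sample receives the higher score, which is why a larger AUC indicates a better ranking.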

Compare the effect of our feature combination strategy

To further evaluate the effect of our feature combination strategy, two different feature combinations were used to represent protein nodes. More specifically, we used only the protein sequence features (combination 1) and the combination of the protein sequence features and the multi-source association information of proteins used by MultiPPIs (combination 2) to represent proteins before carrying out the fivefold cross-validation experiment. The experimental environment of the two combinations was kept the same to ensure a fair comparison. Table 2 lists the results of combination 1 under fivefold cross-validation. The results of combination 2 are shown in Table 1. Figures 5 and 6 show the ROC and PR curves of the comparative experiment, respectively. As the results show, our feature combination strategy performs better than using only protein sequence features. This again indicates that the multi-source association information of proteins with other biomolecules is helpful for protein–protein interaction prediction.

Table 2 The results of different feature combinations under fivefold cross-validation.

Compare the effect of different classifiers

To choose the most suitable classifier for our model, we conducted a comparison experiment with four commonly used classifiers: Decision Tree, Naive Bayes, KNN, and Random Forest. We used these four classifiers with default training parameters to train and predict protein–protein interactions and kept all other experimental conditions consistent. The prediction results show that the Random Forest classifier performed best. Table 3 lists the average values of the evaluation measures for the different classifiers under fivefold cross-validation. Figures 7 and 8 show the ROC and PR curves of the comparative experiment, respectively. The comparison shows that Random Forest is more suitable for our model than the other classifiers, especially in terms of AUC and accuracy, which reflect the overall ability of a model.

Table 3 The average parameter values of different classifiers under fivefold cross-validation.

Compare the effect of random forest classifier parameter

Random Forest (RF) is a flexible and efficient supervised learning algorithm proposed by Breiman in 2001. This algorithm has achieved good results in many fields. It is resistant to overfitting, stable, and well suited to nonlinear problems. It is a particular bootstrap aggregating (bagging) method that uses the decision tree as the base model. It first uses the bootstrap method to generate training sets and then constructs a decision tree for each training set. Finally, all these decision trees are combined to form the classifier and improve the overall effect. Additionally, when splitting a node, the Random Forest method does not search all features for the split that maximizes the criterion (such as information gain). Instead, it randomly draws a subset of features and then finds the optimal split within this subset. For the Random Forest model, we need to set the number of trees N. In detail, we started training the model at N = 180, increased N in steps of 20, and observed the relationship between N and the final prediction accuracy. We terminated the search when the prediction accuracy decreased as N increased. Table 4 lists the accuracy of the Random Forest classifier with different values of N under fivefold cross-validation. The Random Forest classifier performs best when the number of trees is 300.
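The search over the number of trees described above can be sketched as a simple loop. This is a hedged illustration only: `evaluate` is a hypothetical callback that would train a forest with the given number of trees and return its mean cross-validated accuracy, `max_trees` is an assumed safety cap, and the `toy_accuracy` function is a synthetic stand-in, not the paper's measurements.

```python
def search_tree_count(evaluate, start=180, step=20, max_trees=1000):
    """Increase the tree count from `start` in steps of `step` until accuracy drops.

    `evaluate(n)` is a hypothetical callback returning the cross-validated
    accuracy of a forest with n trees.
    """
    best_n, best_acc = start, evaluate(start)
    history = [(best_n, best_acc)]
    n = start + step
    while n <= max_trees:
        acc = evaluate(n)
        history.append((n, acc))
        if acc < best_acc:      # accuracy decreased: stop the search
            break
        if acc > best_acc:      # keep the smallest N on ties
            best_n, best_acc = n, acc
        n += step
    return best_n, history


def toy_accuracy(n):
    """Synthetic accuracy curve with a peak at 300 trees (illustration only)."""
    return 0.86 - abs(n - 300) / 10000


best_n, history = search_tree_count(toy_accuracy)
```

With the synthetic curve, the loop evaluates N = 180, 200, ..., 320 and stops at 320 because the accuracy falls below the value reached at 300.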

Table 4 The accuracy results of the Random Forest classifier with different N parameters.

Performance comparison with the state-of-the-art methods

To further evaluate the effectiveness of MultiPPIs, we conducted a detailed comparative analysis between it and several existing protein–protein interaction prediction methods, including LR_PPI³¹, DPPI³², WSRC_GE³³, LPPI³⁴ and PIPR³⁵. Our evaluation framework encompasses six performance metrics, as detailed in Table 5. These metrics are specificity (Spec.), Matthews's correlation coefficient (MCC), precision (Prec.), sensitivity (Sen.), accuracy (Acc.), and the area under the ROC curve (AUC), providing a comprehensive view of each method's predictive capabilities. As Table 5 shows, MultiPPIs compares favorably with these methods, which supports its effectiveness in identifying protein–protein interactions.

Table 5 Performance comparison of MultiPPIs with the state-of-the-art methods.

Materials and methods

Protein sequence features based on the physicochemical properties of amino acids

In this study, we downloaded the sequence information of proteins from the STRING database (2017)³⁰. Proteins are biopolymers built from 20 different amino acids as basic units. The sequence of amino acid residues in the peptide chain is called the primary structure of a protein. To represent the protein sequence features in this work, we selected six physicochemical properties of amino acids: polarity (P1), hydrophobicity (H), net charge index of side chains (NCISC), volumes of side chains of amino acids (VSC), solvent-accessible surface area (SASA) and polarizability (P2). The original physicochemical values of these 20 amino acids are listed in Table 6.

Table 6 The original physicochemical values of 20 amino acids.

Performance evaluation criteria for our experiments

In order to verify the quality of our proposed method, six standard parameters were calculated as evaluation indicators for our experiments, including specificity (Spec.), Matthews's correlation coefficient (MCC), precision (Prec.), sensitivity (Sen.), accuracy (Acc.), and the areas under the ROC curve (AUC). The description of all computational formulas is as follows:

$$Spec =\frac{TN}{FP+TN}$$

(1)

$$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FN)\times (TN+FP)\times (TP+FP)\times (TN+FN)}}$$

(2)

$$Prec =\frac{TP}{FP+TP}$$

(3)

$$Sen =\frac{TP}{TP+FN}$$

(4)

$$Acc =\frac{TP+TN}{TP+FP+TN+FN}$$

(5)

where TN, FN, TP, and FP represent the total number of true negative, false negative, true positive, and false positive. Furthermore, the AUC (the area under the ROC curve) was also implemented to evaluate the performance of our model.
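Equations (1) to (5) translate directly into code. The following is a minimal sketch in plain Python; the function name `binary_metrics` and the confusion-matrix counts in the usage line are invented for illustration and do not come from the paper's tables. The sketch assumes every denominator in Eqs. (1), (3), (4) and (5) is non-zero and returns an MCC of 0 when the square-root term vanishes.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute Spec., MCC, Prec., Sen. and Acc. from confusion-matrix counts."""
    spec = tn / (fp + tn)                      # Eq. (1)
    prec = tp / (fp + tp)                      # Eq. (3)
    sen = tp / (tp + fn)                       # Eq. (4)
    acc = (tp + tn) / (tp + fp + tn + fn)      # Eq. (5)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Eq. (2)
    return {"Spec": spec, "MCC": mcc, "Prec": prec, "Sen": sen, "Acc": acc}


# Toy counts (hypothetical): 100 positive and 100 negative pairs.
m = binary_metrics(tp=80, fp=10, tn=90, fn=20)
```

For these toy counts, specificity is 0.9, sensitivity is 0.8 and accuracy is 0.85, which shows how the measures can diverge even on a balanced dataset.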

Auto covariance (AC) method

The extraction of protein sequence features using the auto covariance (AC) method was originally proposed by Guo et al.³⁶. This method takes advantage of the local properties of residues in protein sequences and describes the level of correlation between two protein sequences based on their specific physical and chemical properties³⁷,³⁸,³⁹. First, we normalized the original physicochemical values of the 20 amino acids to zero mean and unit standard deviation (SD) according to Eq. (6):

$$P'_{ij}=\frac{P_{ij}-\overline{P_{j}}}{S_{j}}, \quad (i=1,2,\dots,20;\; j=1,2,\dots,6)$$

(6)

where $P_{ij}$ is the $j$-th descriptor value for the $i$-th amino acid, $\overline{P_{j}}$ is the mean of the $j$-th descriptor over the 20 amino acids and $S_{j}$ is the corresponding standard deviation, given by:

$$\overline{{P }_{j}}=\frac{{\sum }_{i=1}^{20}{P}_{ij}}{20}$$

(7)

$${S}_{j}= \sqrt{\frac{{\sum }_{i=1}^{20}{({P}_{ij}-\overline{{P }_{j}})}^{2}}{20}}$$

(8)

In this way, each amino acid in a protein sequence is converted to the corresponding standardized physicochemical value. Then, the AC method is used to encode the protein sequence into a feature vector:

$$\mathrm{AC}(d,j)=\frac{1}{N-d}\sum_{i=1}^{N-d}\left(X_{i,j}-\frac{1}{N}\sum_{k=1}^{N}X_{k,j}\right)\left(X_{i+d,j}-\frac{1}{N}\sum_{k=1}^{N}X_{k,j}\right)$$

(9)

where $X_{i,j}$ is the $j$-th descriptor value of the $i$-th amino acid in the sequence, N is the length of the protein sequence, and d is the lag, i.e. the distance between the two residues being compared. In this article, the maximum value of d and the number of descriptors j are set to 30 and 6, respectively. On this basis, a protein sequence is finally encoded as a 30 × 6 = 180-dimensional feature vector.
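The normalization and auto covariance steps can be sketched as follows. This is an illustrative reading of Eqs. (6) to (9), not the authors' code: the function names are hypothetical, the lag is assumed to run from 1 to `max_lag`, and the toy descriptor table uses three amino acids with two made-up descriptors rather than the values of Table 6.

```python
import math


def standardize(table):
    """Z-score each descriptor over the amino acids (Eqs. 6-8, population SD).

    `table` maps an amino-acid letter to its list of raw descriptor values.
    """
    rows = list(table.values())
    n_aa, n_desc = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n_aa for j in range(n_desc)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n_aa)
           for j in range(n_desc)]
    return {aa: [(v[j] - means[j]) / sds[j] for j in range(n_desc)]
            for aa, v in table.items()}


def auto_covariance(sequence, z_table, max_lag):
    """Return AC values for lags 1..max_lag and every descriptor (Eq. 9)."""
    X = [z_table[aa] for aa in sequence]   # one descriptor row per residue
    N = len(X)
    if N <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    n_desc = len(X[0])
    means = [sum(row[j] for row in X) / N for j in range(n_desc)]
    features = []
    for d in range(1, max_lag + 1):
        for j in range(n_desc):
            total = sum((X[i][j] - means[j]) * (X[i + d][j] - means[j])
                        for i in range(N - d))
            features.append(total / (N - d))
    return features


# Toy descriptor table (hypothetical values, NOT Table 6).
toy_table = {"A": [1.0, 0.0], "C": [2.0, 4.0], "D": [3.0, 2.0]}
z = standardize(toy_table)
features = auto_covariance("ACDACD", z, max_lag=2)  # 2 lags x 2 descriptors
```

With the six descriptors of the paper and `max_lag=30`, the same function would return the 180-dimensional vector described above, provided the sequence is longer than 30 residues.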

The multi-source molecular network construction

To make use of the association information of proteins with other biomolecules, we systematically constructed an association information network by integrating the known associations among proteins, diseases, miRNAs, drugs, and lncRNAs, which were downloaded from multiple databases. The source and version of the raw data are shown in Table 7. In addition, we preprocessed the raw data, for example by removing irrelevant items and unifying the identifiers. We also counted the number of nodes contained in the original association data, as shown in Table 8.

Table 7 The data information in the multi-source molecular network.

Table 8 The node information in the multi-source molecular network.

DeepWalk algorithms

To extract the association features of proteins from the association information network we constructed, the graph embedding algorithm DeepWalk²⁹ was adopted in our work. The input of the DeepWalk method is a graph or network, and the social representation of the vertices in the network is learned through truncated random walks and the SkipGram model. Finally, it outputs a latent representation of each vertex in the network. The basic idea of this algorithm is first to obtain node sequences through random walks and treat them as sentences, and then to learn the latent representation of each node from the local information in these truncated random walk sequences by maximizing the co-occurrence probability of the vertices that appear within a window of size w around vertex ${v}_{j}$, which is calculated as follows:

$$\Pr \left( \left\{ v_{j-w}, \ldots, v_{j+w} \right\} \setminus v_{j} \mid \Phi \left( v_{j} \right) \right) = \prod_{i=j-w,\, i \ne j}^{j+w} \Pr \left( v_{i} \mid \Phi \left( v_{j} \right) \right)$$

(10)

$$\Pr \left( v_{i} \mid \Phi \left( v_{j} \right) \right)=\prod_{k=1}^{\lceil \log \left|V\right| \rceil}\frac{1}{1+e^{-\Phi \left( v_{j} \right)\cdot \varphi \left( b_{k} \right)}}$$

(11)

where $\Phi ({v}_{j})$ is the representation to which vertex ${v}_{j}$ is mapped, and $\varphi ({b}_{k})$ is the representation associated with the parent of tree node ${b}_{k}$ in the binary tree used by hierarchical softmax. More specifically, the DeepWalk method is mainly composed of two algorithms. Algorithm 1 of the DeepWalk model mainly includes 4 steps: (1) Generate γ random walks for each node in the input network structure. (2) Uniformly sample a node in the network as the root node of each random walk. (3) Uniformly select a neighbor of the current node as the next node. (4) Repeat the above steps until the walking path reaches the specified length. Algorithm 2 of the DeepWalk model applies the SkipGram model to the sequence data to obtain the network feature vector of each node. The SkipGram model iterates over all possible pairings within a window of the random walk sequence. It uses each node to predict its context and finds the vector representation by maximizing the co-occurrence probability of the nodes in a window, while ignoring the order in which the nodes occur in the sequence. Under the independence assumption, the co-occurrence probability can be factorized into a product of conditional probabilities. The detailed steps of the two algorithms are shown in Tables 9 and 10, respectively. In this way, the association information of proteins with other biomolecules in the association information network is converted into feature vectors that can be used by machine learning classifiers.
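The random walk stage and the window pairing that SkipGram trains on can be sketched in plain Python. This is a simplified illustration, not the authors' or the original DeepWalk implementation: the function names, the seed, and the small toy graph with hypothetical node names are assumptions, and the actual embedding training (SkipGram with hierarchical softmax) is omitted.

```python
import random


def random_walks(adjacency, walks_per_node, walk_length, seed=0):
    """Generate truncated random walks, `walks_per_node` starting at every node."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)                 # visit the root nodes in random order
        for root in nodes:
            walk = [root]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:         # dead end: truncate the walk early
                    break
                walk.append(rng.choice(neighbours))  # uniform neighbour choice
            walks.append(walk)
    return walks


def context_pairs(walk, window):
    """List the (centre, context) pairs within `window` steps of each node."""
    pairs = []
    for j, centre in enumerate(walk):
        lo, hi = max(0, j - window), min(len(walk), j + window + 1)
        pairs.extend((centre, walk[i]) for i in range(lo, hi) if i != j)
    return pairs


# Toy heterogeneous graph with hypothetical node names (protein, miRNA, disease).
graph = {
    "P1": ["miR-a", "D1"],
    "miR-a": ["P1", "P2"],
    "P2": ["miR-a"],
    "D1": ["P1"],
}
walks = random_walks(graph, walks_per_node=2, walk_length=5)
pairs = context_pairs(["a", "b", "c"], window=1)
```

In the toy graph, proteins P1 and P2 never share an edge, yet they co-occur in walks through the shared miRNA node, which is the kind of indirect association the embedding is meant to capture.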

Table 9 Algorithm 1 of the DeepWalk model.

Table 10 Algorithm 2 of the DeepWalk model

The representation of protein nodes

In this study, the protein nodes were represented by the combination of the physicochemical features of protein sequences and multi-source association information with other biomolecules (drugs, miRNAs, lncRNAs, and diseases) of proteins in the association information network. The sequence feature of proteins was obtained by the auto-covariance (AC) method based on the six physicochemical properties of amino acids. Besides, the associated information with other nodes of proteins was obtained by the network representation method DeepWalk based on the association information network we constructed. Finally, we combined these two features to represent the protein–protein interaction pairs.
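The combination step can be illustrated with a small sketch. The ordering of the concatenation (sequence features followed by network features, first protein followed by second protein) and the toy feature sizes are assumptions made for illustration; the text only states that the two kinds of features are combined.

```python
def pair_features(protein_a, protein_b, seq_features, net_features):
    """Concatenate sequence and network features of both proteins in a pair.

    `seq_features` and `net_features` map a protein identifier to its
    auto covariance vector and its DeepWalk embedding, respectively.
    """
    vector = []
    for protein in (protein_a, protein_b):
        vector.extend(seq_features[protein])   # sequence (AC) part
        vector.extend(net_features[protein])   # network (DeepWalk) part
    return vector


# Toy features with hypothetical sizes: 3 sequence values, 2 network values.
seq = {"P1": [0.1, 0.2, 0.3], "P2": [0.4, 0.5, 0.6]}
net = {"P1": [1.0, 2.0], "P2": [3.0, 4.0]}
pair_vector = pair_features("P1", "P2", seq, net)
```

The resulting vector is what a classifier such as Random Forest would receive as the input for one candidate interaction pair.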

Conclusion

Protein–protein interactions (PPIs) play a vital role in the biochemical reaction network of the cell and are significant for regulating cells and their signals. However, traditional biological experimental methods are time-consuming and slow, which makes them unsuitable for large-scale protein–protein interaction prediction. In this study, we proposed a novel computational method to predict potential PPIs by combining the sequence features of proteins with their association information with other molecules. We extracted the sequence features of proteins with the auto covariance (AC) method based on six physicochemical properties of amino acids. We extracted the association features of proteins with other molecules with the DeepWalk network representation method, based on the association information network we constructed. The proteins were then represented by combining these two features. Finally, the Random Forest classifier with its optimal parameters was selected for training and prediction. Our proposed method achieved an average accuracy of 86.03% and an average AUC of 93.03% under fivefold cross-validation, which is superior to many existing computational models. In addition, to evaluate the effect of our feature combination, we compared the performance of the protein sequence features alone with that of the combination of protein sequence and association features. Furthermore, to select the most suitable classifier for our model, we compared four commonly used classifiers. Although it overcomes many challenges, our current method still has limitations. In our work, we collected 8 types of associations among 5 types of biomolecules to construct a multi-source molecular network. All the proteins in our dataset are distributed on this network.
Therefore, we are able to use the relationships between different molecules to extract the network features of protein nodes. Note that we removed known protein–protein interactions during training to avoid label leakage. For an independent test set whose proteins are absent from this network, the molecular network relationships cannot be used. We designed our model to address this limitation by also considering the physicochemical properties of the protein sequence: for new proteins that cannot be added to the network, we use this sequence feature for interaction prediction. Our data and code are open source and available at https://github.com/jiboyalab/multiPPIs.

Data availability

The data and source code are available in a public GitHub repository: https://github.com/jiboyalab/multiPPIs

References

Williams, N. E. Immunoprecipitation procedures. Methods Cell Biol. 2000, 449 (1999).
Article Google Scholar
Zhao, X.-M., Wang, R.-S., Chen, L. & Aihara, K. Uncovering signal transduction networks from high-throughput data by integer linear programming. Nucleic Acids Res. 36, e48–e48 (2008).
Article PubMed PubMed Central Google Scholar
Zhang, S.-W. & Wei, Z.-G. Some remarks on prediction of protein–protein interaction with machine learning. Med. Chem. 11, 254–264 (2015).
Article CAS PubMed Google Scholar
Fields, S. & Song, O.-K. A novel genetic system to detect protein–protein interactions. Nature 340, 245–246 (1989).
Article CAS PubMed ADS Google Scholar
Zhu, H. et al. Global analysis of protein activities using proteome chips. Science 293, 2101–2105 (2001).
Article CAS PubMed ADS Google Scholar
Uetz, P. et al. A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature 403, 623–627 (2000).
Article CAS PubMed ADS Google Scholar
Aumentado-Armstrong, T. T., Istrate, B. & Murgita, R. A. Algorithmic approaches to protein–protein interaction site prediction. Algorithms Mol. Biol. 10, 7 (2015).
Article PubMed PubMed Central Google Scholar
Jha, K., Saha, S. & Singh, H. Prediction of protein–protein interaction using graph neural networks. Sci. Rep. 12, 8360. https://doi.org/10.1038/s41598-022-12201-9 (2022).
Article CAS PubMed PubMed Central ADS Google Scholar
Hu, X., Feng, C., Ling, T. & Chen, M. Deep learning frameworks for protein–protein interaction prediction. Comput. Struct. Biotechnol. J. 20, 3223–3233. https://doi.org/10.1016/j.csbj.2022.06.025 (2022).
Article CAS PubMed PubMed Central Google Scholar
Li, X. et al. SDNN-PPI: Self-attention with deep neural network effect on protein-protein interaction prediction. BMC Genomics 23, 474. https://doi.org/10.1186/s12864-022-08687-2 (2022).
Article CAS PubMed PubMed Central Google Scholar
Jha, K., Karmakar, S. & Saha, S. Graph-BERT and language model-based framework for protein–protein interaction identification. Sci. Rep. 13, 5663. https://doi.org/10.1038/s41598-023-31612-w (2023).
Article CAS PubMed PubMed Central ADS Google Scholar
Will, I., Beckerson, W. C. & de Bekker, C. Using machine learning to predict protein–protein interactions between a zombie ant fungus and its carpenter ant host. Sci. Rep. 13, 13821. https://doi.org/10.1038/s41598-023-40764-8 (2023).
Article CAS PubMed PubMed Central ADS Google Scholar
Smith, G. R. & Sternberg, M. J. Prediction of protein–protein interactions by docking methods. Curr. Opin. Struct. Biol. 12, 28–35 (2002).
Lee, H., Deng, M., Sun, F. & Chen, T. An integrated approach to the prediction of domain–domain interactions. BMC Bioinform. 7, 269 (2006).
Marcotte, E. et al. Detecting protein function and protein-protein interactions from genome sequences. Science 285, 751–753 (1999).
Enright, A., Iliopoulos, I., Kyrpides, N. C. & Ouzounis, C. A. Protein interaction maps for complete genomes based on gene fusion events. Nature 402, 86–90 (1999).
Gao, J., Tung, W. & Hu, J. Quantifying dynamical predictability: The pseudo-ensemble approach. Chin. Ann. Math. Ser. B 30, 569–588 (2009).
Ji, B.-Y. et al. Predicting miRNA-disease association from heterogeneous information network with GraRep embedding model. Sci. Rep. 10, 6658 (2020).
Yu, G., Fu, G., Wang, J. & Zhao, Y. NewGOA: Predicting new GO annotations of proteins by bi-random walks on a hybrid graph. IEEE/ACM Trans. Comput. Biol. Bioinform. 15, 1390–1402 (2017).
Fu, G., Wang, J., Yang, B. & Yu, G. NegGOA: Negative GO annotations selection using ontology structure. Bioinformatics 32, 2996–3004 (2016).
Wilke, C. O. Bringing molecules back into molecular evolution. PLoS Comput. Biol. 8, e1002572 (2012).
Deng, S.-P., Zhu, L. & Huang, D.-S. Predicting hub genes associated with cervical cancer through gene co-expression networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 13, 27–35 (2015).
Zheng, C.-H., Huang, D.-S., Zhang, L. & Kong, X.-Z. Tumor clustering using nonnegative matrix factorization with gene selection. IEEE Trans. Inf. Technol. Biomed. 13, 599–607 (2009).
Shen, J. et al. Predicting protein–protein interactions based only on sequences information. Proc. Natl. Acad. Sci. 104, 4337–4341 (2007).
Guo, Y., Yu, L., Wen, Z. & Li, M. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res. 36, 3025–3030. https://doi.org/10.1093/nar/gkn159 (2008).
Wang, L. et al. An improved efficient rotation forest algorithm to predict the interactions among proteins. Soft Comput. 22, 3373–3381 (2018).
You, Z.-H. et al. Detecting protein–protein interactions with a novel matrix-based protein sequence representation and support vector machines. BioMed Res. Int. 2015, 1 (2015).
You, Z.-H. et al. An efficient ensemble learning approach for predicting protein–protein interactions by integrating protein primary sequence and evolutionary information. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 809 (2018).
Perozzi, B., Al-Rfou, R. & Skiena, S. DeepWalk: Online learning of social representations (2014).
Szklarczyk, D. et al. The STRING database in 2017: Quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Res. 45, gkw937 (2016).
Sun, T., Zhou, B., Lai, L. & Pei, J. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinform. 18, 277. https://doi.org/10.1186/s12859-017-1700-2 (2017).
Hashemifar, S., Neyshabur, B., Khan, A. A. & Xu, J. Predicting protein–protein interactions through sequence-based deep learning. Bioinformatics 34, i802–i810 (2018).
Huang, Y.-A., You, Z.-H., Chen, X., Chan, K. & Luo, X. Sequence-based prediction of protein-protein interactions using weighted sparse representation model combined with global encoding. BMC Bioinform. 17, 184. https://doi.org/10.1186/s12859-016-1035-4 (2016).
Su, X.-R. et al. An efficient computational model for large-scale prediction of protein–protein interactions based on accurate and scalable graph embedding. Front. Genet. 12, 635451. https://doi.org/10.3389/fgene.2021.635451 (2021).
Chen, M. et al. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 35, i305–i314. https://doi.org/10.1093/bioinformatics/btz328 (2019).
Guo, Y., Yu, L., Wen, Z. & Li, M. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res. 36, 3025–3030. https://doi.org/10.1093/nar/gkn159 (2008).
Broto, P., Moreau, G. & Vandycke, C. Molecular structures: perception, autocorrelation descriptor and SAR studies. Perception of molecules: Topological structure and 3-dimensional structure. Eur. J. Med. Chem. 19, 61–65 (1984).
Xia, J. F., Han, K. & Huang, D. S. Sequence-based prediction of protein–protein interactions by means of rotation forest and autocorrelation descriptor. Protein Peptide Lett. 17, 137–145 (2010).
Kawashima, S. et al. AAindex: Amino acid index database, progress report 2008. Nucleic Acids Res. 36, D202 (2008).
Huang, Z. et al. HMDD v3.0: A database for experimentally supported human microRNA–disease associations. Nucleic Acids Res. 47, D1013–D1017 (2018).
Wishart, D. S. et al. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res. 46, D1074–D1082 (2017).
Miao, Y.-R., Liu, W., Zhang, Q. & Guo, A.-Y. lncRNASNP2: An updated database of functional SNPs and mutations in human and mouse lncRNAs. Nucleic Acids Res. 46, D276–D280 (2017).
Chen, G. et al. LncRNADisease: A database for long-non-coding RNA-associated diseases. Nucleic Acids Res. 41, D983–D986 (2012).
Davis, A. P. et al. The comparative toxicogenomics database: Update 2019. Nucleic Acids Res. 47, D948–D954 (2018).
Piñero, J. et al. DisGeNET: A comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 45, gkw943 (2016).
Chou, C.-H. et al. miRTarBase update 2018: A resource for experimentally validated microRNA-target interactions. Nucleic Acids Res. 46, D296–D302 (2017).
Cheng, L. et al. LncRNA2Target v2.0: A comprehensive database for target genes of lncRNAs in human and mouse. Nucleic Acids Res. 47, D140–D144 (2018).

Funding

This work was supported by NSFC Grants (62262011, U19A2067); the Guangxi Key Research and Development Program (No. 2022AB43023, No. 2022AB05005); and the Graduate Research Innovation Project of Hunan Province (QL20230101, CX20230440).

Author information

Authors and Affiliations

College of Information Science and Engineering, Guilin University of Technology, Guilin, 541000, China
Hai-Tao Zou & Xiao-Lan Xie
College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410000, China
Bo-Ya Ji

Authors

Hai-Tao Zou
Bo-Ya Ji
Xiao-Lan Xie

Contributions

H.T.Z. and B.Y.J. conceived the experiment, prepared the data set and wrote the manuscript. X.L.X. performed and analyzed the experiment and checked the manuscript. All the authors approved the final manuscript.

Corresponding authors

Correspondence to Bo-Ya Ji or Xiao-Lan Xie.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zou, HT., Ji, BY. & Xie, XL. A multi-source molecular network representation model for protein–protein interactions prediction. Sci Rep 14, 6184 (2024). https://doi.org/10.1038/s41598-024-56286-w

Received: 07 November 2023
Accepted: 05 March 2024
Published: 14 March 2024
DOI: https://doi.org/10.1038/s41598-024-56286-w
