Abstract
The prediction of potential protein–protein interactions (PPIs) is a critical step in decoding diseases and understanding cellular mechanisms. Traditional biological experiments have identified many PPIs in recent years, but the problem is still far from solved. Hence, there is an urgent need to develop accurate and efficient computational models to predict potential PPIs. In this study, we propose a multi-source molecular network representation learning model (called MultiPPIs) to predict potential protein–protein interactions. Specifically, we first extract protein sequence features from the physicochemical properties of amino acids using the auto covariance method. Second, a multi-source association network is constructed by integrating the known associations among miRNAs, proteins, lncRNAs, drugs, and diseases. The graph representation learning method DeepWalk is adopted to extract the multi-source association information of proteins with other biomolecules. In this way, each known protein–protein interaction pair can be represented as a concatenation of the protein sequence features and the multi-source association features of the proteins. Finally, a Random Forest classifier with its optimal parameters is used for training and prediction. MultiPPIs obtains an average prediction accuracy of 86.03%, a sensitivity of 82.69%, and an AUC of 93.03% under five-fold cross-validation. The experimental results indicate that MultiPPIs has good prediction performance and provides valuable insights into the prediction of potential protein–protein interactions. MultiPPIs is freely available at https://github.com/jiboyalab/multiPPIs.
Introduction
Protein–protein interactions (PPIs) play an essential role in biological processes such as cell metabolism, immune response1, and signal transduction2. Correctly identifying potential PPIs is therefore essential for better understanding protein functions and modeling complex protein structures. Small-scale experimental methods (such as chromatography and biochemical assays) have long been used to identify PPIs. However, these methods are often inefficient, time-consuming, and unsuitable for large-scale prediction3. Hence, several high-throughput experimental methods have also been developed for identifying potential protein–protein interactions, including immunoprecipitation, yeast two-hybrid screens (Y2H)4, crystallography, and protein chips5. These methods have generated a large number of known protein–protein interaction pairs, which are of great importance for analyzing potential PPIs. Nevertheless, high-throughput technologies still have obvious drawbacks, such as a high false-positive rate, low coverage, and long processing times6,7. Because of these limitations of traditional experimental methods, there is an urgent need to develop effective and accurate computational models to identify potential PPIs. In recent years, a growing number of computational methods8,9,10,11,12 have been developed to complement biological experiments and to mitigate their high false-positive rates, low coverage, and time cost. More specifically, computational methods employ sophisticated algorithms and statistical models to analyze biological data, helping to minimize false-positive results8,9,10,11,12. They take advantage of the vast amounts of biological data generated through high-throughput techniques. By analyzing large-scale datasets, these methods can identify patterns, trends, and associations that may be undetectable with traditional experimental approaches.
Furthermore, biological experiments can be time-consuming and costly, requiring extensive sample preparation, data collection, and analysis. Computational methods provide a more efficient and cost-effective alternative. Once the necessary algorithms and models are developed, computational analyses can be performed relatively quickly on powerful computer systems. This saves time and resources, allowing researchers to explore a broader range of hypotheses or conduct large-scale investigations more feasibly.
Recently, several computational methods for predicting potential protein–protein interactions have been proposed. Some of these methods take advantage of 3D structure13, gene ontology and annotations14, gene fusion, and co-expression15,16,17,18. However, these approaches usually require prior knowledge of the proteins under study, which dramatically limits their accuracy and reliability. For example, the 3D structure of many proteins is difficult to obtain, and the gene ontology annotation of proteins is incomplete19,20,21,22,23. In contrast, abundant protein sequence data from multiple sources is relatively easy to obtain. Hence, several computational methods based on protein sequence features have been developed to predict potential PPIs. For example, Shen et al.24 developed a model for protein–protein interaction prediction that uses only protein sequence information. In their work, protein sequence information was first extracted based on the triad characteristics of amino acids. The model was then constructed using a support vector machine (SVM) combined with a kernel function. Their experiments showed that computational methods using only protein sequence features can also predict protein–protein interactions well. Guo et al.25 constructed a new protein sequence feature representation method to predict potential PPIs. Specifically, they used the auto covariance (AC) method to extract the characteristics of protein sequences based on seven physicochemical properties of amino acids. This method accounts for the interactions between amino acids at different distances in the protein sequence and performed better than other sequence-based methods. Their study thus demonstrated that extracting protein sequence features with the AC method is feasible and effective for predicting potential protein–protein interactions.
In addition, machine learning algorithms have also attracted the attention of many researchers in the field of potential protein–protein interactions prediction. For example, Wang et al.26 developed a feature-weighted Rotation Forest model for protein–protein interaction prediction by eliminating useless information to use the valuable features fully. In the results, their proposed method achieved excellent prediction performance under the cross-validation experiment. You et al.27 presented a new method to transform the protein sequence features into matrix representation and then utilized the support vector machine (SVM) for training and prediction. Their model finally achieved excellent prediction results in the yeast PPIs datasets. Finally, You et al.28 developed an ensemble weighted sparse representation model classifier and replaced the matrix representation with the integrated protein sequence-function to predict potential protein–protein interactions. Compared with many previous advanced methods, this model has better performance.
Human cells are part of a complex biomolecular network, involving interactions and associations among various biomolecules, such as proteins, miRNAs, and diseases. Proteins often interact with each other based on their shared relationships with other biomolecules. Leveraging this association information can help predict potential protein–protein interactions (PPIs). In this study, we introduce a new computational model (called MultiPPIs) to predict PPIs. This model combines protein sequence physicochemical features with multi-source biomolecular association data (including drugs, miRNAs, lncRNAs, and diseases). First, we use the auto covariance method to extract features from protein sequences based on the physicochemical properties of amino acids. Second, we create a network that integrates known associations among various biomolecules, as depicted in Fig. 1. Using DeepWalk29, a graph representation method, we extract association information from this network. We then use 19,237 known PPI pairs from the STRING database (2017)30 as our positive dataset. An equal number of randomly selected non-interacting pairs forms the negative dataset. These datasets are combined to create our final training set. The prediction model is constructed using a Random Forest (RF) classifier with tuned parameters. The process flow of MultiPPIs is outlined in Fig. 2. Under fivefold cross-validation, the proposed model achieves an average accuracy of 0.8603 and an AUC of 0.9304. These results are better than those of many current computational methods. We also compared two feature combination strategies and found that combining multiple types of data is more effective than using protein sequence information alone. Additionally, we tested four popular classifiers and found the Random Forest classifier to be the most suitable for our model, offering superior prediction performance.
These experiments demonstrate that our model is an efficient tool for predicting potential protein–protein interactions. Compared with previous computational methods8,9,10,11,12, our method mainly has the following specific advantages: (1) Considering the holistic nature of biomolecular networks, our method collects a large amount of association data to construct a multi-source molecular network, and extracts the higher-order network features of proteins based on the graph representation learning method to improve the accuracy of the prediction of PPIs. (2) Our method fully takes advantage of the local property of residues in protein sequences and describes the level of correlation between two protein sequences based on their specific physical and chemical properties. This not only improves the prediction performance of our method, but also solves the cold-start problem often encountered by graph neural network-based methods. (3) By conducting extensive experiments, including comparison of feature combinations, comparison of classification models, optimization and adjustment of model parameters, and comparison with previous experimental methods, our method has been confirmed to have excellent performance in predicting PPIs and is better than most previous computational methods.
Results and discussion
The five-fold cross-validation performance of our proposed model
Cross-validation is a standard method used in machine learning to tune and validate models. In this work, fivefold cross-validation was adopted to evaluate the performance of our model. First, we divided the sample data into five equal parts. Second, we selected four parts in turn as the training set and the remaining part as the test set, so the experiment was repeated five times. Finally, six standard measures were used as evaluation indicators for our experiments: specificity (Spec.), Matthews's correlation coefficient (MCC), precision (Prec.), sensitivity (Sen.), accuracy (Acc.), and the area under the ROC curve (AUC). Table 1 lists the detailed results of each fold. The last line shows the average value and the standard deviation of the results across the five runs of the classifier. These experimental results demonstrate that our model achieves good and stable results in protein–protein interaction prediction.
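The splitting procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function name `k_fold_indices`, the shuffling step, and the seed are assumptions introduced here.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly (assumption)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # The remaining k - 1 folds together form the training set.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# Toy usage: 10 samples, 5 folds.
splits = list(k_fold_indices(n_samples=10, k=5))
```

Each sample appears in exactly one test fold, so every sample is predicted once by a model that did not see it during training.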
The Receiver Operating Characteristic (ROC) curve is a common statistical tool for judging the quality of classification and prediction results in medical research and machine learning. The samples are first sorted by the classifier's prediction score and then, in this order, labeled as positive one at a time. At each step two values are calculated, the False Positive Rate and the True Positive Rate, and they are plotted as the horizontal and vertical coordinates, respectively. The AUC is defined as the area under the ROC curve, and its value generally lies between 0.5 and 1. Because it is often hard to tell from the ROC curves alone which classifier performs better, the AUC value is used as the evaluation index: the classifier with the larger AUC performs better. The Precision-Recall (PR) curve is another tool for evaluating the performance of a classifier. For class-imbalanced problems, the PR curve is widely considered superior to the ROC curve. Similarly, the AUPR is defined as the area under the PR curve. Figures 3 and 4 show the ROC and PR curves of our method under fivefold cross-validation, respectively. These results again demonstrate that our model predicts potential protein–protein interactions well and stably.
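As a minimal sketch of how a ROC curve and its AUC can be computed from classifier scores, consider the following plain-Python example. The function names and the toy labels and scores are illustrative assumptions and are not taken from the MultiPPIs code or results.

```python
def roc_curve_points(labels, scores):
    """Return ROC points (FPR, TPR) obtained by sweeping the decision threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sort samples by decreasing score, then mark them positive one by one.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy data (synthetic): 1 = interacting pair, 0 = non-interacting pair.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc_auc = auc(roc_curve_points(labels, scores))  # 8/9 for this toy data
```

The AUC equals the fraction of positive/negative pairs in which the positive sample receives the higher score; here 8 of the 9 such pairs are ordered correctly.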
Compare the effect of our feature combination strategy
To further assess the effect of our feature combination strategy, two different feature combinations were used to represent protein nodes. More specifically, we used only the protein sequence features (combination 1) and the combination of the protein sequence features and the multi-source association information of proteins used by MultiPPIs (combination 2) to represent proteins before carrying out the fivefold cross-validation experiment. To ensure a fair comparison, the experimental environment was the same for both combinations. Table 2 lists the results of combination 1 under fivefold cross-validation. The results of combination 2 are shown in Table 1. Figures 5 and 6 show the ROC and PR curves of the comparative experiment, respectively. As the results show, our feature combination strategy performs better than using only protein sequence features. This again indicates that the multi-source association information of proteins with other biomolecules is helpful for protein–protein interaction prediction.
Compare the effect of different classifiers
To choose the most suitable classifier for our model, we conducted a comparison experiment with four commonly used classifiers: Decision Tree, Naive Bayes, KNN, and Random Forest. We trained these four classifiers with default training parameters to predict protein–protein interactions and kept all other experimental conditions the same. The Random Forest classifier gave the best prediction results. Table 3 lists the average values of the evaluation measures for the different classifiers under fivefold cross-validation. Figures 7 and 8 show the ROC and PR curves of the comparative experiment, respectively. The comparison shows that Random Forest is more suitable for our model than the other classifiers, especially in terms of AUC and accuracy, which reflect the overall ability of a model.
Compare the effect of random forest classifier parameter
Random Forest (RF) is a flexible and efficient supervised learning algorithm proposed by Breiman in 2001. This algorithm has achieved good results in many fields. It resists overfitting, is stable, and handles nonlinear problems well. It is a particular bootstrap aggregating (bagging) method that uses the decision tree as the base model. It first uses the bootstrap method to generate training sets and then constructs a decision tree for each training set. Finally, all these decision trees are combined to form the classifier and improve the overall effect. Additionally, when splitting a node, the Random Forest method does not search all features for the one that maximizes the split criterion (such as information gain). Instead, it randomly draws a subset of features and then finds the optimal split within this subset. Among the Random Forest model parameters, we need to set the number of trees N. Starting from N = 180, we trained the model with N increased in steps of 20 and observed the relationship between N and the final prediction accuracy. We terminated the search once the prediction accuracy decreased as N increased. Table 4 lists the accuracy of the Random Forest classifier for different values of N under fivefold cross-validation. The Random Forest classifier performs best when the number of trees is 300.
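The search over the number of trees described above can be sketched as a simple loop. This is a hedged illustration rather than the authors' code: `search_tree_count`, the `evaluate` callback, the `max_trees` cap, and the synthetic `toy_accuracy` function are all assumptions introduced here. In practice `evaluate` would train a Random Forest with the given number of trees and return its mean cross-validated accuracy.

```python
def search_tree_count(evaluate, start=180, step=20, max_trees=1000):
    """Increase the number of trees until the accuracy drops.

    `evaluate` is a hypothetical callback: it takes a tree count and returns
    the mean accuracy of a Random Forest with that many trees.
    """
    best_n, best_acc = start, evaluate(start)
    history = [(best_n, best_acc)]
    for n in range(start + step, max_trees + 1, step):
        acc = evaluate(n)
        history.append((n, acc))
        if acc < best_acc:  # accuracy decreased: stop the search
            break
        if acc > best_acc:  # strictly better: remember this setting
            best_n, best_acc = n, acc
    return best_n, history

def toy_accuracy(n_trees):
    """Synthetic stand-in for a real evaluation; it peaks at 300 by construction."""
    return 1.0 - abs(n_trees - 300) / 1000

best_n, history = search_tree_count(toy_accuracy)
```

With the synthetic function the search visits N = 180, 200, ..., 320 and stops at 320, returning 300. Note that this early-stopping rule can miss a better value of N beyond the first drop in accuracy.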
Performance comparison with the state-of-the-art methods
To further evaluate the effectiveness of MultiPPIs, we conducted a detailed comparative analysis against several existing protein–protein interaction prediction methods: LR_PPI31, DPPI32, WSRC_GE33, LPPI34 and PIPR35. Our evaluation framework encompasses six performance metrics, as detailed in Table 5: specificity (Spec.), Matthews's correlation coefficient (MCC), precision (Prec.), sensitivity (Sen.), accuracy (Acc.), and the area under the ROC curve (AUC). Together they provide a comprehensive view of each method's predictive capabilities. Our findings show a clear improvement in performance with MultiPPIs over the compared methods (Table 5), which underscores its effectiveness in identifying protein–protein interactions.
Materials and methods
Protein sequence features based on the physicochemical properties of amino acids
In this study, we downloaded the protein sequence information from the STRING database (2017)30. Proteins are biopolymers composed of up to 20 different amino acids as basic units. The sequence of amino acid residues in the peptide chain is called the primary structure of the protein. Accordingly, we selected six physicochemical properties of amino acids to represent the protein sequence features in this work: polarity (P1), hydrophobicity (H), net charge index of side chains (NCISC), volumes of side chains of amino acids (VSC), solvent-accessible surface area (SASA) and polarizability (P2). The original physicochemical values of the 20 amino acids are listed in Table 6.
Performance evaluation criteria for our experiments
To verify the quality of our proposed method, six standard measures were calculated as evaluation indicators for our experiments: accuracy (Acc.), sensitivity (Sen.), specificity (Spec.), precision (Prec.), Matthews's correlation coefficient (MCC), and the area under the ROC curve (AUC). The first five are computed as follows:
\[\mathrm{Acc.}=\frac{TP+TN}{TP+TN+FP+FN}\]
\[\mathrm{Sen.}=\frac{TP}{TP+FN}\]
\[\mathrm{Spec.}=\frac{TN}{TN+FP}\]
\[\mathrm{Prec.}=\frac{TP}{TP+FP}\]
\[\mathrm{MCC}=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\]
where TN, FN, TP, and FP represent the numbers of true negatives, false negatives, true positives, and false positives, respectively. Furthermore, the AUC (the area under the ROC curve) was also used to evaluate the performance of our model.
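For concreteness, the five count-based measures can be computed from the confusion-matrix counts using their standard definitions, as in the sketch below. The function name and the toy counts are illustrative assumptions, not values from the experiments.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute Acc., Sen., Spec., Prec. and MCC from confusion-matrix counts.

    Assumes each denominator is non-zero, except for MCC, which is
    reported as 0.0 when its denominator vanishes.
    """
    total = tp + fp + tn + fn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": (tp + tn) / total,   # fraction of all pairs classified correctly
        "Sen": tp / (tp + fn),      # fraction of true interactions recovered
        "Spec": tn / (tn + fp),     # fraction of non-interactions rejected
        "Prec": tp / (tp + fp),     # fraction of predicted interactions that are real
        "MCC": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }

# Toy counts (synthetic): 100 positive and 100 negative pairs.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

For these toy counts the accuracy is 0.85, the sensitivity 0.80, and the specificity 0.90.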
Auto covariance (AC) method
The extraction of protein sequence features using the auto covariance (AC) method was proposed by Guo et al.36. This method takes advantage of the local properties of residues in protein sequences and describes the correlation between residues a given distance apart in the sequence, based on their physicochemical properties37,38,39. First, we normalized the original physicochemical values of the 20 amino acids to zero mean and unit standard deviation (SD) according to Eq. (1):
\[{P}_{ij}^{\prime}=\frac{{P}_{ij}-\overline{{P}_{j}}}{{S}_{j}}\tag{1}\]
where \({P}_{ij}\) is the \({j}_{th}\) descriptor value for the \({i}_{th}\) amino acid, \(\overline{{P }_{j}}\) is the mean of the \({j}_{th}\) descriptor over the 20 amino acids and \({S}_{j}\) is the corresponding standard deviation, given by:
\[{S}_{j}=\sqrt{\frac{1}{20}\sum_{i=1}^{20}{\left({P}_{ij}-\overline{{P}_{j}}\right)}^{2}}\]
In this way, each amino acid in a protein sequence is converted to the corresponding standardized physicochemical values. Then, the AC method is used to encode the protein sequence into a feature vector:
\[AC(d,j)=\frac{1}{N-d}\sum_{i=1}^{N-d}\left({X}_{i,j}-\frac{1}{N}\sum_{k=1}^{N}{X}_{k,j}\right)\left({X}_{i+d,j}-\frac{1}{N}\sum_{k=1}^{N}{X}_{k,j}\right)\]
where \({X}_{i,j}\) is the \({j}_{th}\) descriptor value of the \({i}_{th}\) amino acid in the sequence, N is the length of the protein sequence, and d is the width of the sliding window, that is, the distance between the two residues being compared. In this article, the maximum value of d is set to 30 and six descriptors are used (j = 1, …, 6). On this basis, a protein sequence is finally encoded as a 30 × 6 = 180-dimensional feature vector.
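The AC encoding can be illustrated with a short plain-Python sketch. It follows the standard auto covariance definition of Guo et al. with population standard deviation. The property table below is filled with placeholder numbers, not the real values of Table 6, and the function names, the toy sequence, and the ordering of the output vector are assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Placeholder property table: 6 made-up values per residue (NOT Table 6).
TOY_TABLE = {aa: [float((i * (j + 3)) % 11) for j in range(6)]
             for i, aa in enumerate(AMINO_ACIDS)}

def standardize(table):
    """Scale each property to zero mean and unit SD over the amino acids."""
    residues = sorted(table)
    n_props = len(table[residues[0]])
    out = {aa: [] for aa in residues}
    for j in range(n_props):
        column = [table[aa][j] for aa in residues]
        mean = sum(column) / len(column)
        sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
        for aa in residues:
            out[aa].append((table[aa][j] - mean) / sd)
    return out

def auto_covariance(sequence, table, max_lag=30):
    """Encode a sequence as max_lag * n_props auto covariance values."""
    norm = standardize(table)
    x = [norm[aa] for aa in sequence]
    n, n_props = len(x), len(x[0])
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    features = []
    for j in range(n_props):
        mean_j = sum(row[j] for row in x) / n  # mean of property j over the sequence
        for lag in range(1, max_lag + 1):
            s = sum((x[i][j] - mean_j) * (x[i + lag][j] - mean_j)
                    for i in range(n - lag))
            features.append(s / (n - lag))
    return features

sequence = AMINO_ACIDS * 3  # arbitrary 60-residue toy sequence
features = auto_covariance(sequence, TOY_TABLE, max_lag=30)
```

With six properties and a maximum lag of 30, the resulting vector has 180 entries regardless of sequence length, which is what allows proteins of different lengths to share one classifier. Sequences no longer than the maximum lag cannot be encoded this way and are rejected in the sketch.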
The multi-source molecular network construction
To utilize the association information of proteins with other biomolecules, we systematically constructed the association information network by integrating the known associations among proteins, diseases, miRNAs, drugs, and lncRNAs, which were downloaded from multiple databases. The source and version of the raw data are shown in Table 7. In addition, we preprocessed the raw data, for example by removing irrelevant items and unifying the identifiers. We also counted the number of nodes contained in the original association data, as shown in Table 8.
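One way to hold such a multi-source network in memory is an adjacency map keyed by type-prefixed identifiers, as in the following sketch. The record format, the `build_network` function, the `skip` argument, and all identifiers are hypothetical and do not reflect the authors' data files. The `skip` argument shows how protein–protein edges could be left out of the network, which the Conclusion states was done during training to avoid label leakage.

```python
from collections import defaultdict

def build_network(associations, skip=frozenset()):
    """Build an undirected adjacency map from typed association records.

    Each record is (type_a, id_a, type_b, id_b). Node keys are prefixed with
    their type so that identifiers from different databases cannot collide.
    Association types listed in `skip` (as sorted type pairs) are ignored.
    """
    adjacency = defaultdict(set)
    for type_a, id_a, type_b, id_b in associations:
        if tuple(sorted((type_a, type_b))) in skip:
            continue
        u = f"{type_a}:{id_a.strip()}"
        v = f"{type_b}:{id_b.strip()}"
        adjacency[u].add(v)
        adjacency[v].add(u)  # associations are treated as undirected
    return dict(adjacency)

# Hypothetical records; real ones would come from the databases in Table 7.
records = [
    ("protein", "P1", "disease", "D1"),
    ("protein", "P2", "disease", "D1"),
    ("mirna", "M1", "protein", "P1"),
    ("protein", "P1", "protein", "P2"),  # skipped below
]
network = build_network(records, skip={("protein", "protein")})
```

In this toy network P1 and P2 are not directly linked, but both are linked to disease D1, which is the kind of shared relationship a graph embedding can pick up.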
DeepWalk algorithms
To extract the association features of proteins from the association information network we constructed, the graph embedding algorithm DeepWalk29 was adopted in our work. The input of DeepWalk is a graph or network; the social representation of each vertex is learned through truncated random walks and the SkipGram model, and the output is a latent representation of each vertex in the network. The basic idea of the algorithm is first to obtain node sequences, treated as sentences, through random walks, and then to learn the latent representation of each node from the local information in these truncated walks by maximizing the co-occurrence probability of vertex \({v}_{j}\) with the vertices inside a window of size w. With hierarchical softmax, this probability is a product of terms of the following form, one for each node \({b}_{k}\) on the path through the binary tree:
\[\mathrm{Pr}\left({b}_{k}\mid \Phi ({v}_{j})\right)=\frac{1}{1+{e}^{-\Phi ({v}_{j})\cdot \varphi ({b}_{k})}}\]
where \(\Phi ({v}_{j})\) is the representation to which vertex \({v}_{j}\) is mapped, and \(\varphi ({b}_{k})\) is the representation assigned to the parent of the tree node \({b}_{k}\). More specifically, the DeepWalk method is mainly composed of two algorithms. Algorithm 1 of the DeepWalk model mainly includes 4 steps: (1) Generate γ random walks for each node in the input network. (2) Uniformly sample a node in the network as the root node of each random walk. (3) Uniformly select a neighbor of the current node as the next node. (4) Repeat the above steps until the walk reaches the specified length. Algorithm 2 of the DeepWalk model trains the SkipGram model on the walk sequences to obtain the network feature vector of each node. The SkipGram model iterates over all possible collocations within a window of the random walk sequence. It uses each node to predict its context and finds the vector representation that maximizes the co-occurrence probability of the nodes in a window, while ignoring the order in which the nodes occur in the sentence. Under the independence assumption, the co-occurrence probability factorizes into a product of conditional probabilities. The detailed steps of the two algorithms are shown in Tables 9 and 10, respectively. In this way, the association information of proteins with other biomolecules in the association information network is converted to a feature vector, which can be used by machine learning classifiers.
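The random-walk stage (Algorithm 1) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the function name, the default values of `walks_per_node` (γ) and `walk_length`, and the toy graph are not taken from the paper. As in the original DeepWalk formulation, each pass visits every node once as a root in shuffled order. The SkipGram training stage (Algorithm 2) is not shown; the resulting walks would be passed as sentences to a SkipGram implementation.

```python
import random

def deepwalk_corpus(adjacency, walks_per_node=10, walk_length=40, seed=0):
    """Generate truncated random walks; each walk is a list of node names."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    corpus = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit the roots in a new random order each pass
        for root in nodes:
            walk = [root]
            while len(walk) < walk_length:
                neighbours = sorted(adjacency[walk[-1]])
                if not neighbours:  # isolated node: the walk cannot continue
                    break
                walk.append(rng.choice(neighbours))  # uniform neighbour choice
            corpus.append(walk)
    return corpus

# Toy undirected graph (hypothetical): A-B, A-C, C-D.
toy = {"A": {"B", "C"}, "B": {"A"}, "C": {"A", "D"}, "D": {"C"}}
corpus = deepwalk_corpus(toy, walks_per_node=3, walk_length=5)
```

The corpus contains γ × |V| walks, here 3 × 4 = 12, and every consecutive pair of nodes in a walk is connected by an edge of the graph.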
The representation of protein nodes
In this study, the protein nodes were represented by combining the physicochemical features of protein sequences with the multi-source association information of proteins with other biomolecules (drugs, miRNAs, lncRNAs, and diseases) in the association information network. The sequence features of proteins were obtained by the auto covariance (AC) method based on the six physicochemical properties of amino acids. The association features of proteins with other nodes were obtained by the network representation method DeepWalk based on the association information network we constructed. Finally, we combined these two kinds of features to represent the protein–protein interaction pairs.
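A minimal sketch of how a protein pair could be turned into one classifier input is shown below. The concatenation order, the function name `pair_vector`, and the embedding size of 64 are assumptions for illustration; only the 180-dimensional sequence feature size is taken from the text.

```python
def pair_vector(protein_a, protein_b, seq_features, net_features):
    """Concatenate sequence and network features of both proteins in a pair."""
    return (seq_features[protein_a] + net_features[protein_a]
            + seq_features[protein_b] + net_features[protein_b])

SEQ_DIM, NET_DIM = 180, 64  # 180 follows the text; 64 is a hypothetical embedding size

# Constant placeholder vectors standing in for real AC and DeepWalk features.
seq_features = {"P1": [0.1] * SEQ_DIM, "P2": [0.2] * SEQ_DIM}
net_features = {"P1": [0.3] * NET_DIM, "P2": [0.4] * NET_DIM}

vector = pair_vector("P1", "P2", seq_features, net_features)
```

Under these assumptions each pair is described by 2 × (180 + 64) = 488 values. Because plain concatenation is order-dependent, the pairs (P1, P2) and (P2, P1) yield different vectors, so a consistent ordering convention would be needed.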
Conclusion
Protein–protein interactions (PPIs) play a vital role in the biochemical reaction network of the cell and are significant for regulating cells and their signals. However, traditional biological experiments are time-consuming and slow, which makes them unsuitable for large-scale protein–protein interaction prediction. In this study, we proposed a novel computational method to predict potential PPIs by combining the sequence features of proteins with their association information with other molecules. The sequence features were extracted with the auto covariance (AC) method based on six physicochemical properties of amino acids. The association features with other molecules were extracted with the DeepWalk network representation method from the association information network we constructed. The proteins were then represented by combining these two kinds of features. Finally, the Random Forest classifier and its corresponding optimal parameters were selected for training and prediction. As a result, our proposed method achieved an average accuracy of 86.03% and an AUC of 93.03% under fivefold cross-validation, which is superior to many existing computational models. To evaluate the effect of our feature combination, we further compared the performance of the protein sequence features alone with that of the combined sequence and association features. Furthermore, to select the most suitable classifier for our model, we also compared four commonly used classifiers.
While overcoming many challenges, our current method still has its limitations. In our work, we collected eight types of association among five types of biomolecule to construct a multi-source molecular network. All the proteins in our dataset are located in this network. Therefore, we are able to use the relationships between different molecules to extract the network features of protein nodes. Note that we removed known protein–protein interactions during training to avoid label leakage. For an independent test set that is completely separate from the existing dataset, however, the molecular network relationships could not be used. We designed our model to address this limitation by also considering the physicochemical properties of the protein sequence. For new proteins that cannot be added to the network, we use this sequence feature for interaction prediction. Our data and code are open source and available at https://github.com/jiboyalab/multiPPIs.
Data availability
The data and source code are available in a public GitHub repository: https://github.com/jiboyalab/multiPPIs.
References
Williams, N. E. Immunoprecipitation procedures. Methods Cell Biol. 2000, 449 (1999).
Zhao, X.-M., Wang, R.-S., Chen, L. & Aihara, K. Uncovering signal transduction networks from high-throughput data by integer linear programming. Nucleic Acids Res. 36, e48–e48 (2008).
Zhang, S.-W. & Wei, Z.-G. Some remarks on prediction of protein–protein interaction with machine learning. Med. Chem. 11, 254–264 (2015).
Fields, S. & Song, O.-K. A novel genetic system to detect protein–protein interactions. Nature 340, 245–246 (1989).
Zhu, H. et al. Global analysis of protein activities using proteome chips. Science 293, 2101–2105 (2001).
Uetz, P. et al. A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature 403, 623–627 (2000).
Aumentado-Armstrong, T. T., Istrate, B. & Murgita, R. A. Algorithmic approaches to protein–protein interaction site prediction. Algorithms Mol. Biol. 10, 7 (2015).
Jha, K., Saha, S. & Singh, H. Prediction of protein–protein interaction using graph neural networks. Sci. Rep. 12, 8360. https://doi.org/10.1038/s41598-022-12201-9 (2022).
Hu, X., Feng, C., Ling, T. & Chen, M. Deep learning frameworks for protein–protein interaction prediction. Comput. Struct. Biotechnol. J. 20, 3223–3233. https://doi.org/10.1016/j.csbj.2022.06.025 (2022).
Li, X. et al. SDNN-PPI: Self-attention with deep neural network effect on protein-protein interaction prediction. BMC Genomics 23, 474. https://doi.org/10.1186/s12864-022-08687-2 (2022).
Jha, K., Karmakar, S. & Saha, S. Graph-BERT and language model-based framework for protein–protein interaction identification. Sci. Rep. 13, 5663. https://doi.org/10.1038/s41598-023-31612-w (2023).
Will, I., Beckerson, W. C. & de Bekker, C. Using machine learning to predict protein–protein interactions between a zombie ant fungus and its carpenter ant host. Sci. Rep. 13, 13821. https://doi.org/10.1038/s41598-023-40764-8 (2023).
Smith, G. R. & Sternberg, M. J. Prediction of protein–protein interactions by docking methods. Curr. Opin. Struct. Biol. 12, 28–35 (2002).
Lee, H., Deng, M., Sun, F. & Chen, T. An integrated approach to the prediction of domain–domain interactions. BMC Bioinform. 7, 269 (2006).
Marcotte, E. et al. Detecting protein function and protein-protein interactions from genome sequences. Science 285, 751–753 (1999).
Enright, A., Iliopoulos, I., Kyrpides, N. C. & Ouzounis, C. A. Protein interaction maps for complete genomes based on gene fusion events. Nature 402, 86–90 (1999).
Gao, J., Tung, W. & Hu, J. Quantifying dynamical predictability: The pseudo-ensemble approach. Chin. Ann. Math. Ser. B 30, 569–588 (2009).
Ji, B.-Y. et al. Predicting miRNA-disease association from heterogeneous information network with GraRep embedding model. Sci. Rep. 10, 6658 (2020).
Yu, G., Fu, G., Wang, J. & Zhao, Y. NewGOA: Predicting new GO annotations of proteins by bi-random walks on a hybrid graph. IEEE/ACM Trans. Comput. Biol. Bioinform. 15, 1390–1402 (2017).
Fu, G., Wang, J., Yang, B. & Yu, G. NegGOA: Negative GO annotations selection using ontology structure. Bioinformatics 32, 2996–3004 (2016).
Wilke, C. O. Bringing molecules back into molecular evolution. PLoS Comput. Biol. 8, e1002572 (2012).
Deng, S.-P., Zhu, L. & Huang, D.-S. Predicting hub genes associated with cervical cancer through gene co-expression networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 13, 27–35 (2015).
Zheng, C.-H., Huang, D.-S., Zhang, L. & Kong, X.-Z. Tumor clustering using nonnegative matrix factorization with gene selection. IEEE Trans. Inf. Technol. Biomed. 13, 599–607 (2009).
Shen, J. et al. Predicting protein–protein interactions based only on sequences information. Proc. Natl. Acad. Sci. 104, 4337–4341 (2007).
Guo, Y., Yu, L., Wen, Z. & Li, M. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res. 36, 3025–3030. https://doi.org/10.1093/nar/gkn159 (2008).
Wang, L. et al. An improved efficient rotation forest algorithm to predict the interactions among proteins. Soft Comput. 22, 3373–3381 (2018).
You, Z.-H. et al. Detecting protein–protein interactions with a novel matrix-based protein sequence representation and support vector machines. BioMed Res. Int. 2015, 1 (2015).
You, Z.-H. et al. An efficient ensemble learning approach for predicting protein–protein interactions by integrating protein primary sequence and evolutionary information. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 809 (2018).
Perozzi, B., Al-Rfou, R. & Skiena, S. DeepWalk: Online learning of social representations (2014).
Szklarczyk, D. et al. The STRING database in 2017: Quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Res. 45, gkw937 (2016).
Sun, T., Zhou, B., Lai, L. & Pei, J. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinform. 18, 277. https://doi.org/10.1186/s12859-017-1700-2 (2017).
Hashemifar, S., Neyshabur, B., Khan, A. A. & Xu, J. Predicting protein–protein interactions through sequence-based deep learning. Bioinformatics 34, i802–i810 (2018).
Huang, Y.-A., You, Z.-H., Chen, X., Chan, K. & Luo, X. Sequence-based prediction of protein-protein interactions using weighted sparse representation model combined with global encoding. BMC Bioinform. 17, 184. https://doi.org/10.1186/s12859-016-1035-4 (2016).
Su, X.-R. et al. An efficient computational model for large-scale prediction of protein–protein interactions based on accurate and scalable graph embedding. Front. Genet. 12, 635451. https://doi.org/10.3389/fgene.2021.635451 (2021).
Chen, M. et al. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 35, i305–i314. https://doi.org/10.1093/bioinformatics/btz328 (2019).
Guo, Y., Yu, L., Wen, Z. & Li, M. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res. 36, 3025–3030. https://doi.org/10.1093/nar/gkn159 (2008).
Broto, P., Moreau, G. & Vandycke, C. Molecular structures: perception, autocorrelation descriptor and SAR studies. Perception of molecules: Topological structure and 3-dimensional structure. Eur. J. Med. Chem. 19, 61–65 (1984).
Xia, J. F., Han, K. & Huang, D. S. Sequence-based prediction of protein–protein interactions by means of rotation forest and autocorrelation descriptor. Protein Peptide Lett. 17, 137–145 (2010).
Kawashima, S. et al. AAindex: Amino acid index database, progress report 2008. Nucleic Acids Res. 36, D202 (2008).
Huang, Z. et al. HMDD v3.0: A database for experimentally supported human microRNA–disease associations. Nucleic Acids Res. 47, D1013–D1017 (2018).
Wishart, D. S. et al. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res. 46, D1074–D1082 (2017).
Miao, Y.-R., Liu, W., Zhang, Q. & Guo, A.-Y. lncRNASNP2: An updated database of functional SNPs and mutations in human and mouse lncRNAs. Nucleic Acids Res. 46, D276–D280 (2017).
Chen, G. et al. LncRNADisease: A database for long-non-coding RNA-associated diseases. Nucleic Acids Res. 41, D983–D986 (2012).
Davis, A. P. et al. The comparative toxicogenomics database: Update 2019. Nucleic Acids Res. 47, D948–D954 (2018).
Piñero, J. et al. DisGeNET: A comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 45, gkw943 (2016).
Chou, C.-H. et al. miRTarBase update 2018: A resource for experimentally validated microRNA-target interactions. Nucleic Acids Res. 46, D296–D302 (2017).
Cheng, L. et al. LncRNA2Target v2.0: A comprehensive database for target genes of lncRNAs in human and mouse. Nucleic Acids Res. 47, D140–D144 (2018).
Funding
This work was supported by NSFC Grants (62262011, U19A2067); the Guangxi Key Research and Development Program (No. 2022AB43023, No. 2022AB05005); and the Graduate Research Innovation Project of Hunan Province (QL20230101, CX20230440).
Author information
Contributions
H.T.Z. and B.Y.J. conceived the experiment, prepared the data set and wrote the manuscript. X.L.X. performed and analyzed the experiment and checked the manuscript. All the authors approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zou, HT., Ji, BY. & Xie, XL. A multi-source molecular network representation model for protein–protein interactions prediction. Sci Rep 14, 6184 (2024). https://doi.org/10.1038/s41598-024-56286-w