Protein–protein interaction and non-interaction predictions using gene sequence natural vector

Zhao, Nan; Zhuo, Maji; Tian, Kun; Gong, Xinqi

doi:10.1038/s42003-022-03617-0

Download PDF

Article
Open access
Published: 02 July 2022

Protein–protein interaction and non-interaction predictions using gene sequence natural vector

Nan Zhao¹,
Maji Zhuo¹,
Kun Tian¹ &
…
Xinqi Gong ORCID: orcid.org/0000-0003-2802-6176^1,2,3

Communications Biology volume 5, Article number: 652 (2022) Cite this article

5303 Accesses
5 Citations
23 Altmetric
Metrics details

Subjects

Abstract

Predicting protein–protein interaction and non-interaction are two important different aspects of multi-body structure predictions, which provide vital information about protein function. Some computational methods have recently been developed to complement experimental methods, but still cannot effectively detect real non-interacting protein pairs. We proposed a gene sequence-based method, named NVDT (Natural Vector combine with Dinucleotide and Triplet nucleotide), for the prediction of interaction and non-interaction. For protein–protein non-interactions (PPNIs), the proposed method obtained accuracies of 86.23% for Homo sapiens and 85.34% for Mus musculus, and it performed well on three types of non-interaction networks. For protein-protein interactions (PPIs), we obtained accuracies of 99.20, 94.94, 98.56, 95.41, and 94.83% for Saccharomyces cerevisiae, Drosophila melanogaster, Helicobacter pylori, Homo sapiens, and Mus musculus, respectively. Furthermore, NVDT outperformed established sequence-based methods and demonstrated high prediction results for cross-species interactions. NVDT is expected to be an effective approach for predicting PPIs and PPNIs.

Robust and accurate prediction of protein–protein interactions by exploiting evolutionary information

Article Open access 19 August 2021

Yang Li, Zheng Wang, … Yan-Bin Wang

Assessment of community efforts to advance network-based prediction of protein–protein interactions

Article Open access 22 March 2023

Xu-Wen Wang, Lorenzo Madeddu, … Yang-Yu Liu

A multi-source molecular network representation model for protein–protein interactions prediction

Article Open access 14 March 2024

Hai-Tao Zou, Bo-Ya Ji & Xiao-Lan Xie

Introduction

Most vital life activities are related to protein interactions, including physiological and pathological processes¹. Proteins with similar functions are more likely to interact with each other to form protein complexes that participate in complex and diverse biochemical activities. Detecting and characterizing PPIs provides insight into the functions of unannotated proteins and the mechanisms underlying cellular biochemical processes and complex diseases, and can provide a basis for protein engineering and drug design². The vast majority of proteins do not form interactions in physiological conditions. The number of non-interacting protein pairs vastly exceeds that of interacting pairs. Although some protein pairs have great structural similarities with available experimental complexes, interactions between them can be avoided if they are potentially hazardous to organism³. Identifying PPNIs is critical for understanding biological processes and reducing noise in datasets aimed at determining representative features of protein classification. Mastering the biological characteristics of non-interacting protein pairs helps to better study the three-dimensional structure and function of multi-body.

Rapid developments of high-throughput experimental technologies have enabled large-scale discovery and identification of PPIs, but they remain some disadvantages of time-consuming, labor-intensive, and high levels of false-positive and false-negative predictions^4,5. Therefore, many computational methods have emerged as an alternative for the experimental prediction of PPIs based on different data types, such as genomic information⁶, evolutionary information^7,8, structural information^9,10, network information¹¹, and sequence information^12,13. Since protein sequences are easily obtained and sequence-based schemes do not need prior knowledge, various sequence-based computational models are favored by researchers. For example, Bock and Gough utilized amino acid physicochemical properties to predict interactions based solely on primary sequences¹⁴. Shen et al. characterized protein sequence by conjoint triad (CT), where CT considers the properties of adjacent amino acids¹⁵. Guo et al. expressed protein sequence by auto covariance (AC), which considered interactions between residues separated by a certain distance¹⁶. Yang et al. employed local descriptors (LD, including composition, transition, and distribution) to evaluate the effect of discontinuous amino acids, but the extracted features ignore global information¹⁷. Yin et al. numerically represented protein sequences using biochemical properties of amino acids and conducted coevolution analysis based on Fourier transform to detect interacting protein pairs¹⁸. Due to the difficulty in accurately characterizing protein sequence information by these single feature extraction methods, researchers have proposed many computational methods integrating multiple features for prediction^19,20. For instance, Zhang et al. used AC, LD, and multi-scale continuous and discontinuous (MCD) local descriptor to extract features information of protein sequence²¹. Chen et al. exploited pseudo amino acid composition, Moreau-Broto, Moran and Geary autocorrelation descriptor, position-specific scoring matrix, and LD to encode biologically relevant features²². These feature vectors generally summarize physicochemical properties and position distribution of amino acids, but their large dimensions greatly increase the computational complexity.

The recent advancements in machine learning techniques have been greatly applied in the bioinformatics field, including RNA-binding site identification²³, drugs’ efficacy²⁴, drug–target binding recognition²⁵, medical diagnosis²⁶, protein complex structure prediction²⁷, and protein–protein/peptide/ligand predictions^28,29,30. Traditional machine learning algorithms are often combined with feature engineering to detect PPIs, such as amino acid physicochemical properties, co-occurrence frequency¹³, CT¹⁵, AC^16,31, LD¹⁷, signature product³², sequence order, and dipeptide information³³ were performed with support vector machine (SVM), Gabor feature³⁴, chaos game representation and wavelet transform³⁵ were performed with random forest (RF). In addition, deep learning algorithms are also widely used in PPIs detection, such as Siamese-like convolutional neural network (CNN) in DPPI³⁶ and deep residual recurrent CNN in PIPR³⁷. However, deep learning algorithms remain challenging as follows: (i) Lack of interpretability. The deep network is similar to a “black box” in which the physical meaning of features cannot be explained. But traditional machine learning algorithms involve feature engineering, which makes the model easy to interpret and understand. Previous successful methods show that the model generated by the combination of clear sequence features and traditional machine learning performs well. (ii) Complexity and time-consuming. The “inside” of deep learning is difficult to fully understand, which makes hyper-parametric and network design still a considerable challenge. But now that we have a more comprehensive understanding of traditional machine learning underlying algorithms, it is easier to adjust parameters and change model designs. Our lab had effectively captured amino acid sequence information using traditional machine learning algorithms to model and validate PPIs, and had many applications in other biological and medical problems. Yu et al. used SVM with heterogeneous kernels from a specific kernel set, including the Hadamard, RBF, and linear kernels, to perform the breast cancer outcome evaluation and the results proved to be effective³⁸. Lyu et al. developed a two-layer SVM ensemble-classifier to predict interface residue pairs of protein trimers and showed its effectiveness and reliability³⁹. Wang et al. employed linear SVM, RF, logistic regression with lasso penalty, and logistic regression with hierarchy interaction to predict interface residue pairs respectively, and showed that diverse machine learning methods tend to predict different protein–protein interface patterns⁴⁰. As a result, we select the appropriate feature engineering combined with traditional machine learning algorithms to predict protein–protein interaction and non-interaction.

Plenty of computational models are proposed under the paradigm of supervised learning, so the quality of training data is a key issue to determine the prediction performance. For the training of learning algorithms, interacting protein pairs (i.e., positive samples) and non-interacting protein pairs (i.e., negative samples) are equally important to computational biologists. However, biologists generally focus on interacting protein pairs, extensively collecting experimentally or computationally validated interacting protein pairs into public databases, while ignoring or discarding non-interacting protein pairs⁴¹. Negative samples in the computational methods are almost constructed by pairing proteins located in different subcellular positions, but these samples restrict the distribution of non-interacting protein pairs and lead to biased estimates of prediction accuracy⁴². Srivastava et al. introduced a triple-layer validation method to collect reliable non-interacting protein pairs and then characterized the most relevant protein correlation features to train the PPIs identification model, which showed excellent predictive capabilities⁴³. The training and performance evaluation of models are affected by strong bias in negative samples^44,45, so high-quality and practical negative samples are needed to train less biased models. Smialowski et al. developed the Negatome database, a set of protein pairs unlikely to show direct physical interactions, in 2009 by manually collating the literatures and analyzing three-dimensional protein complex structures⁴⁶. Blohm et al. proposed the second version of this database using a new advanced text-mining process to guide the manual annotation process⁴⁷. High-quality non-interacting samples are important to capture the interaction and non-interaction information from sequences, so Negatome samples are currently used in many studies as an alternative to pairing proteins located in different subcellular. Bryant et al. used negative samples collected in the Negatome database, combined with AlphaFold2, and optimized multi-sequence alignment to predict heterodimeric protein complexes⁴⁸. Das et al. used a negative dataset from the Negatome database when utilizing interface properties and SVM to classify and differentiate native and non-native complexes⁴⁹. Consequently, the non-interacting protein pairs in the Negatome database can facilitate the training of highly generalizing models to predict PPIs and PPNIs.

Feature selection is a major determinant of the generalizability of predictive models. Common sequence-based computational methods extract features from the amino acid sequence, but nucleotide-related information is missing in the amino acid sequence since multiple codons encode the same amino acid. Triplet nucleotides (i.e., codons) are associated with various diseases, and their repeats may lead to toxic proteins, alter RNA function, and control transcription and translation^50,51,52. Moratorio et al.⁵³ and Carrau et al.⁵⁴ found that codon usage may be selected to maintain mutational robustness. Some triple nucleotides often encode the same amino acid, which is called synonymous codons. Codon usage bias, where certain codons are used more frequently than their synonymous codons, is influenced by mutation, selection, and genetic drift. Zhuo et al. found that some interface codons had the obvious propensity to interface residues and the genetic codon affected the interaction interface between proteins⁵⁵. Because codons carry very important information about proteins, some methods consider extracting features from gene sequences. Zhou et al. used the codon pair frequency difference to predict protein interactions with comparable performance to those of other sequence-based methods⁵⁶. Najafabadi et al. utilized the relative codon frequency differences combined with a naive Bayesian classifier to validate PPIs, the approach showed good performance on Saccharomyces cerevisiae (S. cerevisiae), Escherichia coli, and Plasmodium falciparum datasets⁵⁷. Consequently, extracting additional biological information from gene sequences, like codon frequencies, may further improve the prediction ability. Deng et al. proposed natural vector based on the distributions of nucleotides in each DNA sequence to analyze the virus genome⁵⁸. Later, many methods improved on the natural vector. For example, Dong et al. proposed an improved natural vector method called Accumulated Natural Vector to analyze sequences, genomes, and their phylogenetic relationships⁵⁹. Zhao et al. added the covariances of amino acids to natural vector and used it to classify proteins and detect the evolutionary relationships among species⁶⁰. In addition, certain dinucleotides have connections with the regulation of metabolism, aging, and neurodegeneration⁶¹. Atkinson et al. provided that dinucleotide bias was related to evading cell defense mechanisms⁶². Takata et al. developed that codon and dinucleotide usage biases may be associated with the need to maintain the RNA secondary structure involved in splicing and gene expression⁶³. Kokate et al. studied codon and dinucleotide preferences of 29 Drosophila species and observed their association with speciation⁶⁴. Simón et al. analyzed dinucleotide and codon usage in all available non-redundant viral sequences and discovered that four dinucleotides (TG, GT, CA, and AC) were self-complementary and codon usage bias was mainly determined by the genomic composition⁶⁵. Therefore, we hypothesized that the distribution of contextual nucleotides in a gene sequence could be helpful for sequence classification prediction, so dinucleotide and triplet nucleotide information was added to the natural vector.

In this study, we developed a sequence-based approach for PPIs and PPNIs predictions called NVDT. NVDT firstly employed the distribution of nucleotides, dinucleotides, and triplet nucleotides to extract protein information from gene sequences, where the correspondence between a protein gene sequence and its feature vector was one-to-one. Second, we combined all pairs of corresponding natural vectors into a single feature vector describing a protein pair and then normalized feature vectors using the Z-score method. Finally, these features were fed into the classifier to obtain the final prediction results. NVDT not only combined the advantages of local and global protein information but also had high computational speed with low dimensions, making it a robust and efficient prediction method. We applied our approach to Homo sapiens (H. sapiens), Mus musculus (M. musculus), S. cerevisiae, Drosophila melanogaster (D. melanogaster), and Helicobacter pylori (H. pylori) datasets, and obtained high prediction results. We harnessed NVDT to produce network visualizations for three types of PPNI networks, including the one-core, multiple-core and crossing network, to further evaluate the capabilities of this approach. In addition, model evaluations indicated that NVDT improved PPIs prediction accuracies over state-of-the-art methods. Our PPNIs prediction results showed that our method was robust and would help to improve multi-body complex structure predictions.

Results

Application to H. sapiens and M. musculus datasets

To ensure the reliability of our approach, five-fold cross-validation was first used to evaluate model performance and select the optimal parameters (Supplementary Tables 1–4 and Supplementary Figs. 1–4). Then we further verified the performance of different classifiers on the test set. Table 1 showed that the accuracy of the real dataset could be improved substantially when using RF compared with SVM, where the accuracy for the H. sapiens increased from 80.92 to 86.23%, and that for M. musculus increased from 80.17 to 85.34%. But the differences in accuracy of the constructed dataset were relatively small.

Table 1 Prediction results by using two classifiers on H. sapiens and M. musculus datasets.

Full size table

Network prediction

We extended our proposed method to predict PPNI networks consisting of non-interaction pairs (NIPs). The knowledge of PPNI networks is helpful to overcome the noise in the datasets and find the relevant features that can best represent proteins, so as to better classify and design the three-dimensional structure of proteins. It can also be used to identify the proteins with the least interaction in a pathway. As a potential indispensable factor in the pathological process, these proteins are likely to be effective targets for drug design. At the same time, systematic analysis of the non-interaction relationship between a large number of proteins in biological systems is very helpful for multi-body complex structure predictions.

We predicted three types of PPNI networks using our approach. First, a one-core network is the simplest network in which only one core protein radially does not interact with other proteins. We found a guanine nucleotide-binding protein, P62879, which may be a modulator or transducer in various transmembrane signaling systems. Among 26 NIPs, 23 pairs were correctly predicted by our method (Fig. 1a and Supplementary Data 1), supporting the application of our method to predict PPNIs in one-core networks.

**Fig. 1: PPNI network prediction results.**

Second, a multiple-core network is essentially composed of several one-core networks, satisfying the corresponding non-interaction relationships among these nuclear proteins at the same time. We found a PPNI network with the pathway Q8TBX8-O75175-P31150-Q16828-Q8TAU0-Q9H6S3, involving 26 proteins. Of the 82 NIPs in this network, our method correctly predicted 66 pairs (Fig. 1b and Supplementary Data 1). To test whether the lack of core protein non-interaction information leads to low accuracy, we added existing non-interaction information for 10, 30, and 40% of core proteins. Supplementary Table 5 showed the frequency distribution of each core protein that was incorrectly predicted. Using this additional information, the accuracy could be increased from 80.49 to 85.37%, 92.68%, and 96.34%, respectively. With the increase in NIPs related to the six core proteins in the training set, the prediction accuracy of the test set also gradually improved. Therefore, additional experimental information could improve the ability of our method to predict NIPs in complex networks.

Third, in biology, most PPNI networks are crossing networks. We obtained a crossing network involving 73 human proteins from the Negatome database. Our method could correctly predict 58 pairs among 81 NIPs (Fig. 1c and Supplementary Data 1), indicating that our approach can be applied to general PPNI networks.

Five-fold cross-validation on the constructed dataset

As there was little difference in accuracy between SVM and RF when using constructed dataset, we used SVM for five-fold cross-validation. As shown in Table 2, the proposed method yielded a high average accuracy of 95.40% for H. sapien, 94.83% for M. musculus, 98.28% for S. cerevisiae, 93.22% for D. melanogaster, and 94.63% for H. pylori. These findings indicated that our model was effective and robust for PPIs prediction (Supplementary Tables 2, 6 and Supplementary Figs. 2, 5).

Table 2 Five-fold cross-validation results on the constructed dataset.

Full size table

Performance of different feature extraction and combination

For the same classifier, diverse feature extraction methods may yield different prediction results. To further determine the importance of dinucleotides and triplet nucleotides in predicting protein interactions, we separately predicted interactions combined SVM with NV (using only nucleotide information), NVD (using nucleotide and dinucleotide information), NVT (using nucleotide and triplet nucleotide information), and NVDT (using nucleotide, dinucleotide, and triplet nucleotide information). As shown in Fig. 2, SVM-NVDT showed the best accuracy, precision, MCC, and F-scores for S. cerevisiae, D. melanogaster and H. pylori constructed datasets and high sensitivity (Supplementary Table 7). For D. melanogaster, the accuracy of the proposed method was 7.02% higher than that of SVM-NV, 5.90% higher than that of SVM-NVD, and 1.12% higher than that of SVM-NVT. The polynucleotide module may therefore improve the prediction performance.

**Fig. 2: Prediction performance based on SVM with NV, NVD, and NVDT on *S. cerevisiae, D. melanogaster* and *H. pylori* constructed datasets.**

Similarly, distinct feature combination methods based on the same classifier will produce different prediction results. For one protein pair composed of protein i and protein j, the 92-dimensional feature vectors of two proteins can be obtained by the NVDT method. Suppose that the feature vector of protein i is A = (a₁, a₂, …, a₉₂), and the feature vector of protein j is B = (b₁, b₂, …, b₉₂). Now we encode the protein pairs in five different means, which are defined as follows.

$${{{{{\rm{Cod}}}}}}1:{abs}\left(A-B\right)=[\left|{a}_{1}-{b}_{1}\right|,\left|{a}_{2}-{b}_{2}\right|,\ldots ,\left|{a}_{92}-{b}_{92}\right|]$$

$${{{{{\rm{Cod}}}}}}2:A+B=[({a}_{1}+{b}_{1}),({a}_{2}+{b}_{2}),\ldots ,({a}_{92}+{b}_{92})]$$

$${{{{{\rm{Cod}}}}}}3:({abs}\left(A-B\right),A+B)=\; [\left|{a}_{1}-{b}_{1}\right|,\ldots ,\left|{a}_{92}-{b}_{92}\right|,\\ ({a}_{1}+{b}_{1}),\ldots ,({a}_{92}+{b}_{92})]$$

$${{{{{\rm{Cod}}}}}}4:A* B=[({a}_{1}\times {b}_{1}),({a}_{2}\times {b}_{2}),\ldots ,({a}_{92}\times {b}_{92})]$$

$${{{{{\rm{Cod}}}}}}5:(A,B)=[{a}_{1},{a}_{2},\ldots ,{a}_{92},{b}_{1},{b}_{2},\ldots ,{b}_{92}]$$

The accuracy results of Cod1–Cod5 on H. sapiens and M. musculus datasets were shown in Table 3. It can be seen that Cod5 (i.e., our method SVM-NVDT) achieved the highest accuracy in all four datasets.

Table 3 Prediction performance based on distinct feature combination methods on H. sapiens and M. musculus datasets.

Full size table

Comparison with other feature extraction methods

The prediction results for alternative methods based on gene sequence data using three datasets were shown in Table 4. Our method showed accuracies of 99.20% for S. cerevisiae, 94.94% for D. melanogaster, and 98.56% for H. pylori, which were better than the two other methods, indicating that NVDT was more suitable for predicting PPIs than other gene sequence-based methods.

Table 4 Prediction results of different methods on three constructed independent test datasets.

Full size table

We further compared our method with the codon pair-based method(CCPPI) proposed by Zhou et al.⁵⁶ on the S. cerevisiae dataset. We used the same ten-fold cross-validation as the CCPPI method and obtained an average accuracy of 98.05%, precision of 98.14%, sensitivity of 97.95%, and MCC of 96.10% (Supplementary Table 8). Compared with CCPPI, the average accuracy, precision, sensitivity, and MCC of our method are improved by 8.45, 10.04, 6.25, and 16.80%, respectively.

Comparison with other sequence-based methods

In order to make an effective comparison of our proposed SVM-NVDT model, and considering the limited literature based on the D. melanogaster dataset, we further compared the predictive performance of our method with other state-of-the-art sequence-based approaches using S. cerevisiae and H. pylori datasets. The comparison results on the S. cerevisiae dataset were presented in Table 5. The accuracy of our proposed method achieved an enhancement of 1.39% compared with the second TAGPPI, 2.11% to the third PIPR, 4.13%to the fourth LightGBM, and 4.56% to the fifth StackPPI. The comparison results on the H. pylori dataset were presented in Table 6. Our proposed method achieved an improvement in an accuracy of 7.31% compared with the second PCVMZM, 8.84% to the third StackPPI, 9.09% to the fourth RF+PR+LPQ, and 9.41% to the fifth Weighted Skip-sequential. Our model has shown superior results compared with other methods, further supporting the validity of our model.

Table 5 Comparison of established methods using the S. cerevisiae dataset.

Full size table

Table 6 Comparison of existing methods using the H. pylori dataset.

Full size table

Performance on independent cross-species datasets

When a large number of interacting proteins in an organism show correlated evolution, orthologs in other taxa will also interact. Therefore, we trained the model with SVM using all S. cerevisiae samples and used the other five species in the DIP database as independent test datasets. For these five test datasets, all samples were interacting protein pairs. As summarized in Table 7, the minimum prediction accuracy was 89.64%, indicating that the proposed method can be used to predict cross-species PPIs.

Table 7 Prediction results for independent datasets.

Full size table

Discussion

Our method calibrated on known PPNIs in H. sapiens and M. musculus addressed the limitations of established computational methods for PPIs prediction, including low accuracies and inefficient prediction of real non-interactions. At the same time, the validity of the model was verified on S. cerevisiae, D. melanogaster, H. pylori, H. sapiens, and M. musculus constructed datasets based on the commonly used protein pairs using different subcellular locations to construct negative samples. Several aspects of the proposed approach were worth highlighting. (1) We utilized information from gene sequence data, enabling the extraction of more extensive information not accessible in protein sequence data. The dinucleotides and triplet nucleotides are related to many diseases, metabolism, and mutations, and their information can be well obtained by NVDT. (2) NVDT had better performance than those of other gene sequence-based methods, with higher accuracy and better stability and this can be explained by the integration of polynucleotide information, which dramatically reduces the repeatability of different protein feature vectors. (3) NVDT juxtaposed the two feature vectors of a protein pair to form a new vector with a single dimension, retaining more information. Especially when combined with SVM, the prediction classifications of the two feature vectors (feature_A, feature_B) and (feature_B, feature_A) corresponding to each sample (assuming the protein pair composed of protein A and protein B) are always consistent. The reason may be that each dimension in SVM is independent, and the classification fundamentally depends on calculating the Euclidean distance of any two samples, so the order of features does not affect the classification results. (4) Ensemble classifiers typically show higher accuracy and robust performance than those of single classifiers. However, our model using a single classifier SVM showed better performance than those methods using ensemble classifiers on the H. pylori dataset. (5) The constructed negative datasets were restricted to protein pairs located in different cellular compartments, potentially leading to a functional bias in downstream analyses and predictions. The high accuracy of these datasets hardly reflected the true prediction effect, but the good performance on the real dataset can well reflect the feasibility of our model. Among them, results showed that the prediction accuracy of the real dataset was not sufficiently high, which may be explained by the limitation of real non-interacting protein pairs.

The proposed method utilized detailed gene sequence information, including the distribution information of three kinds of nucleotides, and showed better predictive performance than those of established protein sequence-based methods. Moreover, by comparing combined and single features, we found that various features may be complementary. And our model performed well on three types of networks. These results proved that gene sequence information can be used to distinguish interacting and non-interacting protein pairs and ultimately to establish a complete PPI and PPNI map to better understand biochemical and biological processes.

Methods

In this section, we elaborated on the proposed NVDT approach for predicting PPIs and PPNIs based on gene sequence. NVDT consisted of the following three steps: (1) Encode the protein pairs’ interaction and non-interaction information into natural vectors via the distribution of nucleotides, dinucleotides, and triplet nucleotides. (2) Generate and normalize feature vectors of protein pairs. (3) Train the model via different classifiers (SVM and RF) and test on an independent test set. Finally, we briefly described the common model performance evaluation metrics used in the study. The flowchart of NVDT was shown in Fig. 3.

**Fig. 3: Workflow of our computational pipeline to predict protein-protein interaction and non-interaction.**

Dataset collection

Seven datasets were obtained. The positive datasets consisted of interacting protein pairs collected from the public Database of Interacting Proteins (DIPs: https://dip.doe-mbi.ucla.edu/dip/)⁶⁶. To reduce fragments and sequence similarity, samples with fewer than 50 amino acids and >40% pairwise sequence identity to one another were excluded.

The negative datasets were composed of non-interacting protein pairs obtained in two ways. First, negative samples were derived from the Negatome Database 2.0 (http://mips.helmholtz-muenchen.de/proj/ppi/negatome)⁴⁷, which currently contained experimentally supported non-interacting protein pairs. In the Negatome database, the subcellular structure information for some proteins was unclear or even absent, and some proteins belonged to at least two or more subcellular localizations simultaneously, which was consistent with the actual phenomenon. After a selection of protein pairs from multiple species in the Negatome database, the majority belonged to H. sapiens (1217 pairs), followed by M. musculus (347 pairs) and Rattus norvegicus (33 pairs). Accordingly, we collected 2434 protein pairs for H. sapiens and 694 protein pairs for M. musculus, with interacting protein pairs and non-interacting protein pairs each accounting for half. The datasets obtained in this way were called “real dataset”.

The other negative samples in different subcellular compartments were obtained, based on the assumption that proteins within different subcellular localizations tend not to interact. Considering that the ratio of positive samples to negative samples used in previous literature research is mostly 1:1, we should not only verify the prediction accuracy of interacting protein pairs but also ensure the prediction accuracy of non-interacting protein pairs. Therefore, the balanced dataset was selected when constructing the dataset, that is, we randomly selected negative samples with the same number of positive samples. Here, the selection of negative samples was random without any cluster analysis. Four datasets were finally collected in this way, including 11188 protein pairs for S. cerevisiae, 2140 protein pairs for D. melanogaster, 1217 protein pairs for H. sapiens, and 694 protein pairs for M. musculus. We also performed on 2916 protein pairs for H. pylori described derived by Rain et al.⁶⁷, and a list of H. pylori protein interactions was given in its supplementary materials. These datasets were called “constructed dataset”.

The gene sequence of each protein was obtained from the NCBI database (https://www.ncbi.nlm.nih.gov/). To verify the performance of the proposed method, the positive samples of H. sapiens and M. musculus datasets in the real dataset and constructed dataset were consistent, only the negative samples were obtained differently (Supplementary Table 9). Finally, each of the seven datasets was divided into a training dataset and a test dataset in a ratio of 5:1.

Feature extraction

Deng et al. have defined natural vector and mathematically proved that there was a one-to-one correspondence between natural vector and gene sequence for a protein⁵⁸. The natural vector method to extract vital information from gene sequence was as follows.

Let $S={s}_{1}{s}_{2}\cdots {s}_{N}$ be a gene sequence of length N, where ${s}_{i}\in \left\{A,C,G,T\right\},i={{{{\mathrm{1,2}}}}},\ldots ,N$. For k representing one of the four nucleotides, define $\omega \left(\cdot \right)=\left\{A,C,G,T\right\}\to \{0,1\}$, where

$${\omega }_{k}\left({s}_{i}\right)=\left\{\begin{array}{c}0,{s}_{i}\;\ne\; k,\\ 1,{s}_{i}=k.\end{array}\right.$$

(1)

(1)
Denote ${n}_{k}=\mathop{\sum }\nolimits_{i=1}^{N}{\omega }_{k}\left({s}_{i}\right)$ as the number of nucleotide k in gene sequence S.
(2)
Taking the first nucleotide as the origin, ${{{{{{\rm{dis}}}}}}}_{k(i)}=i\cdot {\omega }_{k}\left({s}_{i}\right)$ is the distance from the first nucleotide to the ith nucleotide k, where $i={1,2},\ldots ,{n}_{k}$, and ${\mu }_{k}=\frac{\mathop{\sum }\nolimits_{i=1}^{{n}_{k}}{{{{{{\rm{dis}}}}}}}_{k(i)}}{{n}_{k}}$ represents the average position of nucleotide k in gene sequence S.
(3)
The normalized central moments are defined as ${D}_{j}^{k}=\mathop{\sum }\nolimits_{i=1}^{{n}_{k}}\frac{{(i-{\mu }_{k})}^{j}\cdot {\omega }_{k}\left({s}_{i}\right)}{{n}_{k}^{j-1}{N}^{j-1}},j={1,2},\ldots ,{n}_{k}$.

The first normalized central moments ${D}_{1}^{k}$ (j = 1 in ${D}_{j}^{k}$) could be ignored since its values were zero. Deng et al. and Yu et al. have demonstrated that the second normalized central moments ${D}_{2}^{k}$ (j = 2 in ${D}_{j}^{k}$) in the vector could obtain stable classification results; accordingly, the central moments (${D}_{j}^{k}$) higher than j = 2 do not need to be considered, and computed^58,68. Thus, the 12-dimensional natural vector N(S) of a gene sequence S was given as follows:

$$N\left(S\right)=\left({n}_{A},{n}_{C},{n}_{G},{n}_{T},{\mu }_{A},{\mu }_{C},{\mu }_{G},{\mu }_{T},{D}_{2}^{A},{D}_{2}^{C},{D}_{2}^{G},{D}_{2}^{T}\right)$$

(2)

In this work, we proposed an extended natural vector with the frequencies of dinucleotides and triplet nucleotides, which considered the properties of a single nucleotide and its vicinal nucleotides and regarded any two contiguous nucleotides or any three contiguous nucleotides as a unit. The size of type dinucleotide should be 4 × 4 = 16 and the size of type triplet nucleotide should be 4 × 4 × 4 = 64. Finally, we defined a 92-dimensional natural vector to make further investigation on gene sequence as the following:

$$N\left(S\right)=\left({n}_{A},{n}_{C},{n}_{G},{n}_{T},{\mu }_{A},{\mu }_{C},{\mu }_{G},{\mu }_{T},{D}_{2}^{A},{D}_{2}^{C},{D}_{2}^{G},{D}_{2}^{T},{n}_{{AA}},\ldots ,{n}_{{TT}},{n}_{{AAA}},\ldots ,{n}_{{TTT}}\right)$$

(3)

For each protein pair in the dataset, we could convert a gene sequence into a natural vector. Assume that the kth protein pair corresponds to protein i and protein j, we then juxtaposed the two natural vectors. The features of the protein pair can be described as two 184-dimensional feature vectors:

$${d}_{k}=({N}_{i},{N}_{j})\;{{{{{\rm{or}}}}}}\;({N}_{j},{N}_{i})$$

(4)

where N_i and N_j were feature vectors for proteins i and j.

However, protein feature vectors correlated to the length of protein (the number of nucleotides) complicate the comparison between two protein pairs. The feature vectors were standardized by the Z-score method⁶⁹, with the mean value of 0 and standard deviation of 1:

$${d}_{k}^{{\prime} }\left(l\right)=\frac{{d}_{k}\left(l\right)-\mu \left(l\right)}{\sigma \left(l\right)},k=1,2,\ldots ,n,{{{{{\rm{and}}}}}}\,l=1,2,\ldots ,184$$

(5)

where ${d}_{k}\left(l\right)$ was the lth feature of the kth protein pair, and n was the number of protein pairs. μ(l) and σ(l) were the mean value and standard deviation of all proteins for the lth feature, respectively:

$$\mu \left(l\right)=\frac{\mathop{\sum }\nolimits_{k=1}^{n}{d}_{k}\left(l\right)}{n},\sigma \left(l\right)=\sqrt{\frac{\mathop{\sum }\nolimits_{k=1}^{n}{\left({d}_{k}\left(l\right)-\mu \left(l\right)\right)}^{2}}{n}}$$

(6)

For data standardization, the training set was standardized in the above way, and the mean and standard deviation of the training set were used to standardize the test set. ${d}_{k}^{{\prime} }$, the input features of a classifier, were a 184-dimensional vector containing the statistical features of nucleotide, dinucleotide, and triplet nucleotide.

Support vector machine

SVM⁷⁰ is a supervised learning method that has been popularly utilized for classification and regression in computational biology. It constructs a hyperplane to maximize the margin. Specifically, SVM finds several samples as support vectors that minimize the distance of samples between different classes, and the minimum distance is called the margin. The optimal hyperplane is located in the center of the margin, where the larger the margin is, the smaller the generalization error will be. Therefore, this hyperplane can better divide the samples in the training set according to the labels given. In this study, the Gaussian radial basis function was chosen as the kernel function, and the regularization parameter c and the Gauss kernel function parameter γ were optimized by a grid search approach. Here, the parameters were the default of LIBSVM, and cross-validation was used to avoid over-fitting (Supplementary Table 10). And select a cut-off value of 0.5. The application of the SVM classifier under optimal parameter settings guarantees the reliability of classification prediction with the minimum error.

Random forest

Random forest (RF)⁷⁰ is another typical classification and regression method with biological applications. It uses multiple decision trees, which do not correlate them, to obtain the final result by voting or taking the mean value, generating an overall model with high accuracy and generalizability. For a new sample, each decision tree assigns the sample to a class, and then the voting method is used to determine which class is selected more frequently as the final classification result. Three parameters are usually adjusted when training an RF model, the number of trees to grow (denoted as n_Tree), the number of randomly selected features at each decision split (denoted as n_feature), and the minimum node size of a terminal node (denoted as min_leaf). In our study, n_Tree values under 200 were evaluated to find the optimal value concerning computational time, cost, and overfitting (Supplementary Table 10). All other parameters were set to default values.

Performance evaluation

Common performance evaluation metrics were used, including accuracy (Acc.), precision (Pre.), sensitivity (Sen.), specificity (Spe.), Matthews correlation coefficient (MCC), F1-score, and the area under a ROC curve (AUC) to evaluate the predictive performance of the proposed method. These metrics were defined as follows:

$${{{{{\rm{Accuracy}}}}}}=\frac{{TP}+{TN}}{{TP}+{TN}+{FP}+{FN}}$$

(7)

$${{{{{\rm{Precision}}}}}}=\frac{{TP}}{{TP}+{FP}}$$

(8)

$${{{{{\rm{Sensitivity}}}}}}=\frac{{TP}}{{TP}+{FN}}$$

(9)

$${{{{{\rm{Specificity}}}}}}=\frac{{TN}}{{TN}+{FP}}$$

(10)

$${{{{{\rm{MCC}}}}}}=\frac{{TP}\times {TN}-{FP}\times {FN}}{\sqrt{\left({TP}+{FN}\right)\times \left({TN}+{FP}\right)\times \left({TP}+{FP}\right)\times \left({TN}+{FN}\right)}}$$

(11)

$$F1\mbox{-}{{{{{\rm{score}}}}}}=\frac{2* {{{{{\rm{precision}}}}}}* {{{{{\rm{sensitivity}}}}}}}{{{{{{\rm{precision}}}}}}+{{{{{\rm{sensitivity}}}}}}}$$

(12)

where TP represents the number of positive samples that are correctly predicted; FN represents the number of positive samples that are incorrectly predicted; TN represents the number of negative samples that are correctly predicted; FP represents the number of negative samples that are incorrectly predicted. The receiver operating characteristic (ROC) curve was generated by plotting the TP rate against the FP rate at various thresholds; the abscissa of the ROC curve was 1-specificity and the longitudinal coordinate is sensitivity.

Statistics and reproducibility

All quantitative results of K-fold cross-validation were shown using mean ± standard deviation. Graphing of three types of PPNI networks was performed with Pajek. The computational experiments were repeated at least two additional times with a similar outcome.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The data used in this work are available at https://github.com/Zhaonan99/NVDT and Supplementary Data 1. Supplementary Data 1 is a .xlsx file that includes data for reproducing Fig. 1. Supplementary Table 7 includes data for reproducing Fig. 2. The interacting protein pairs discussed have been deposited in the DIP database, the real non-interacting protein pairs are accessible through the Negatome Database 2.0, and the gene sequence of each protein is available from the NCBI database. Other information is available from the corresponding author on reasonable request.

Code availability

The source code is available at https://github.com/Zhaonan99/NVDT.

References

Zhang, B. Z., Li, J. Y., Quan, L. J., Chen, Y. & Lu, Q. Sequence-based prediction of protein–protein interaction sites by simplified long short-term memory network. Neurocomputing 357, 86–100 (2019).
Article Google Scholar
Ni, D., Lu, S. & Zhang, J. Emerging roles of allosteric modulators in the regulation of protein–protein interactions (PPIs): A new paradigm for PPI drug discovery. Med. Res. Rev. 39, 2314–2342 (2019).
Article PubMed Google Scholar
Launay, G., Ceres, N. & Martin, J. Non-interacting proteins may resemble interacting proteins: Prevalence and implications. Sci. Rep. 7, 40419 (2017).
Article CAS PubMed PubMed Central Google Scholar
You, Z. H., Lei, Y. K., Gui, J., Huang, D. S. & Zhou, X. Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data. Bioinformatics 26, 2744–2751 (2010).
Article CAS PubMed PubMed Central Google Scholar
von Mering, C. et al. Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417, 399–403 (2002).
Article CAS Google Scholar
Lee, H., Deng, M., Sun, F. & Chen, T. An integrated approach to the prediction of domain–domain interactions. BMC Bioinform. 7, 269 (2006).
Article CAS Google Scholar
Zahiri, J., Yaghoubi, O., Mohammad-Noori, M., Ebrahimpour, R. & Masoudi-Nejad, A. PPIevo: Protein–protein interaction prediction from PSSM based evolutionary information. Genomics 102, 237–242 (2013).
Article CAS PubMed Google Scholar
Hsin Liu, C., Li, K. C. & Yuan, S. Human protein–protein interaction prediction by a novel sequence-based co-evolution method: Co-evolutionary divergence. Bioinformatics 29, 92–98 (2013).
Article CAS PubMed Google Scholar
Agrawal, N. J., Helk, B. & Trout, B. L. A computational tool to predict the evolutionarily conserved protein–protein interaction hot-spot residues from the structure of the unbound protein. FEBS Lett. 588, 326–333 (2014).
Article CAS PubMed Google Scholar
Zhang, Q. C. et al. Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (2012).
Article CAS PubMed PubMed Central Google Scholar
Kovács, I. A. et al. Network-based prediction of protein interactions. Nat. Commun. 10, 1240 (2019).
Article PubMed PubMed Central CAS Google Scholar
Chou, K. C. & Cai, Y. D. Predicting protein–protein interactions from sequences in a hybridization space. J. Proteome Res. 5, 316–322 (2006).
Article CAS PubMed Google Scholar
Hamp, T. & Rost, B. Evolutionary profiles improve protein–protein interaction prediction from sequence. Bioinformatics 31, 1945–1950 (2015).
Article CAS PubMed Google Scholar
Bock, J. R. & Gough, D. A. Predicting protein–protein interactions from primary structure. Bioinformatics 17, 455–460 (2001).
Article CAS PubMed Google Scholar
Shen, J. et al. Predicting protein–protein interactions based only on sequences information. Proc. Natl Acad. Sci. USA 104, 4337–4341 (2007).
Article CAS PubMed PubMed Central Google Scholar
Guo, Y., Yu, L., Wen, Z. & Li, M. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res. 36, 3025–3030 (2008).
Article CAS PubMed PubMed Central Google Scholar
Yang, L., Xia, J. F. & Gui, J. Prediction of protein–protein interactions from protein sequence using local descriptors. Protein Pept. Lett. 17, 1085–1090 (2010).
Article CAS PubMed Google Scholar
Yin, C. & Yau, S. S. A coevolution analysis for identifying protein–protein interactions by Fourier transform. PLoS ONE 12, e0174862 (2017).
Article PubMed PubMed Central CAS Google Scholar
Wang, J., Zhang, L., Jia, L., Ren, Y. & Yu, G. Protein–protein interactions prediction using a novel local conjoint triad descriptor of amino acid sequences. Int. J. Mol. Sci. https://doi.org/10.3390/ijms18112373 (2017).
Chen, C., Zhang, Q., Ma, Q. & Yu, B. LightGBM-PPI: Predicting protein–protein interactions through LightGBM with multi-information fusion. Chemometrics Intell. Lab. Syst. 191, 54–64 (2019).
Article CAS Google Scholar
Zhang, L., Yu, G., Xia, D. & Wang, J. Protein–protein interactions prediction based on ensemble deep neural networks. Neurocomputing 324, 10–19 (2019).
Article Google Scholar
Chen, C. et al. Improving protein–protein interactions prediction accuracy using XGBoost feature selection and stacked ensemble classifier. Comput. Biol. Med. 123, 103899 (2020).
Article CAS PubMed Google Scholar
Zhang, S. et al. A deep learning framework for modeling structural features of RNA-binding protein targets. Nucleic Acids Res. 44, e32 (2016).
Article PubMed Google Scholar
Gerdes, H. et al. Drug ranking using machine learning systematically predicts the efficacy of anti-cancer drugs. Nat. Commun. 12, 1850 (2021).
Article CAS PubMed PubMed Central Google Scholar
Zheng, S., Li, Y., Chen, S., Xu, J. & Yang, Y. Predicting drug–protein interaction using quasi-visual question answering system. Nat. Mach. Intell. 2, 134–140 (2020).
Article Google Scholar
Myszczynska, M. A. et al. Applications of machine learning to diagnosis and treatment of neurodegenerative diseases. Nat. Rev. Neurol. 16, 440–456 (2020).
Article PubMed Google Scholar
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
CAS PubMed PubMed Central Google Scholar
Hu, L., Wang, X., Huang, Y. A., Hu, P. & You, Z. H. A survey on computational models for predicting protein–protein interactions. Briefings Bioinform. https://doi.org/10.1093/bib/bbab036 (2021).
Cunningham, J. M., Koytiger, G., Sorger, P. K. & AlQuraishi, M. Biophysical prediction of protein–peptide interactions and signaling networks using machine learning. Nat. Methods 17, 175–183 (2020).
Article CAS PubMed PubMed Central Google Scholar
Sun, T., Chen, Y., Wen, Y., Zhu, Z. & Li, M. PremPLI: A machine learning model for predicting the effects of missense mutations on protein–ligand interactions. Commun. Biol. 4, 1311 (2021).
Article CAS PubMed PubMed Central Google Scholar
You, Z.-H., Yu, J.-Z., Zhu, L., Li, S. & Wen, Z.-K. A MapReduce-based parallel SVM for large-scale predicting protein–protein interactions. Neurocomputing 145, 37–43 (2014).
Article Google Scholar
Martin, S., Roe, D. & Faulon, J. L. Predicting protein–protein interactions using signature products. Bioinformatics 21, 218–226 (2005).
Article CAS PubMed Google Scholar
You, Z. H. et al. Detecting protein–protein interactions with a novel matrix-based protein sequence representation and support vector machines. BioMed. Res. Int. 2015, 867516 (2015).
Article PubMed PubMed Central CAS Google Scholar
Zhan, X. K. et al. Using random forest model combined with Gabor feature to predict protein–protein interaction from protein sequence. Evolut. Bioinform. Online 16, 1176934320934498 (2020).
Google Scholar
Jia, J. H., Liu, Z., Chen, X., Xiao, X. & Liu, B. X. Prediction of protein–protein interactions using chaos game representation and wavelet transform via the random forest algorithm. Genet. Mol. Res.: GMR 14, 11791–11805 (2015).
Article CAS PubMed Google Scholar
Hashemifar, S., Neyshabur, B., Khan, A. A. & Xu, J. Predicting protein–protein interactions through sequence-based deep learning. Bioinformatics 34, i802–i810 (2018).
Article CAS PubMed PubMed Central Google Scholar
Chen, M. et al. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 35, i305–i314 (2019).
Article CAS PubMed PubMed Central Google Scholar
Yu, X., Gong, X. & Jiang, H. Heterogeneous multiple kernel learning for breast cancer outcome evaluation. BMC Bioinformatics 21, 155 (2020).
Article PubMed PubMed Central Google Scholar
Lyu, Y. & Gong, X. A Two-layer SVM ensemble-classifier to predict interface residue pairs of protein trimers. Molecules https://doi.org/10.3390/molecules25194353 (2020).
Wang, W., Yang, Y., Yin, J. & Gong, X. Different protein–protein interface patterns predicted by different machine learning methods. Sci. Rep. https://doi.org/10.1038/s41598-017-16397-z (2017).
Mei, S. & Zhang, K. J. I. J. O. M. S. Neglog: Homology-based negative data sampling method for genome-scale reconstruction of human protein–protein interaction networks. Int. J. Mol. Sci. 20, 5075 (2019).
Article CAS PubMed Central Google Scholar
Ben-Hur, A. & Noble, W. S. Choosing negative examples for the prediction of protein–protein interactions. BMC Bioinform. 7, S2 (2006).
Article CAS Google Scholar
Srivastava, A., Mazzocco, G., Kel, A., Wyrwicz, L. S. & Plewczynski, D. Detecting reliable non interacting proteins (NIPs) significantly enhancing the computational prediction of protein-protein interactions using machine learning methods. Mol. Biosyst. 12, 778–785 (2016).
Article CAS PubMed Google Scholar
Nath, A. & Leier, A. Improved cytokine–receptor interaction prediction by exploiting the negative sample space. BMC Bioinform. 21, 493 (2020).
Article CAS Google Scholar
Park, Y. & Marcotte, E. M. Revisiting the negative example sampling problem for predicting protein-protein interactions. Bioinformatics 27, 3024–3028 (2011).
Article CAS PubMed PubMed Central Google Scholar
Smialowski, P. et al. The Negatome database: A reference set of non-interacting protein pairs. Nucleic Acids Res. 38, D540–D544 (2010).
Article CAS PubMed Google Scholar
Blohm, P. et al. Negatome 2.0: A database of non-interacting proteins derived by literature mining, manual annotation, and protein structure analysis. Nucleic Acids Res. 42, D396–D400 (2014).
Article CAS PubMed Google Scholar
Bryant, P., Pozzati, G. & Elofsson, A. Improved prediction of protein–protein interactions using AlphaFold2. Nat. Commun. 13, 1265 (2022).
Article CAS PubMed PubMed Central Google Scholar
Das, S. & Chakrabarti, S. Classification and prediction of protein–protein interaction interface using machine learning algorithm. Sci. Rep. 11, 1761 (2021).
Article CAS PubMed PubMed Central Google Scholar
Duret, L. Evolution of synonymous codon usage in metazoans. Curr. Opin. Genet. Dev. 12, 640–649 (2002).
Article CAS PubMed Google Scholar
Yu, C. H. et al. Codon usage influences the local rate of translation elongation to regulate co-translational protein folding. Mol. Cell 59, 744–754 (2015).
Article CAS PubMed PubMed Central Google Scholar
Zhao, F., Yu, C. H. & Liu, Y. Codon usage regulates protein structure and function by affecting translation elongation speed in Drosophila cells. Nucleic Acids Res. 45, 8484–8492 (2017).
Article CAS PubMed PubMed Central Google Scholar
Moratorio, G. et al. Attenuation of RNA viruses by redirecting their evolution in sequence space. Nat. Microbiol. 2, 17088 (2017).
Article CAS PubMed PubMed Central Google Scholar
Carrau, L. et al. Chikungunya virus vaccine candidates with decreased mutational robustness are attenuated in vivo and have compromised transmissibility. J. Virol. https://doi.org/10.1128/jvi.00775-19 (2019).
Zhuo, M. J. & Gong, X. Q. Natural distinct inter-preference between genetic codon and protein secondary structure combinations. Commun. Inf. Syst. 18, 331–347 (2018).
Article Google Scholar
Zhou, Y., Zhou, Y. S., He, F., Song, J. & Zhang, Z. Can simple codon pair usage predict protein–protein interaction? Mol. Biosyst. 8, 1396–1404 (2012).
Article CAS PubMed Google Scholar
Najafabadi, H. S. & Salavati, R. Sequence-based prediction of protein–protein interactions by means of codon usage. Genome Biol. 9, R87 (2008).
Article PubMed PubMed Central CAS Google Scholar
Deng, M., Yu, C., Liang, Q., He, R. L. & Yau, S. S. A novel method of characterizing genetic sequences: Genome space with biological distance and applications. PLoS One 6, e17293 (2011).
Article CAS PubMed PubMed Central Google Scholar
Dong, R., He, L., He, R. L. & Yau, S. S. A novel approach to clustering genome sequences using inter-nucleotide covariance. Front. Genet. 10, 234 (2019).
Article CAS PubMed PubMed Central Google Scholar
Zhao, X., Tian, K., He, R. L. & Yau, S. S. Establishing the phylogeny of Prochlorococcus with a new alignment-free method. Ecol. Evol. 7, 11057–11065 (2017).
Article PubMed PubMed Central Google Scholar
Soma, M. & Lalam, S. K. The role of nicotinamide mononucleotide (NMN) in anti-aging, longevity, and its potential for treating chronic conditions. Mol. Biol. Rep. https://doi.org/10.1007/s11033-022-07459-1 (2022).
Article PubMed Google Scholar
Atkinson, N. J., Witteveldt, J., Evans, D. J. & Simmonds, P. The influence of CpG and UpA dinucleotide frequencies on RNA virus replication and characterization of the innate cellular pathways underlying virus attenuation and enhanced replication. Nucleic Acids Res. 42, 4527–4545 (2014).
Article CAS PubMed PubMed Central Google Scholar
Takata, M. A. et al. Global synonymous mutagenesis identifies cis-acting RNA elements that regulate HIV-1 splicing and replication. PLoS Pathog. 14, e1006824 (2018).
Article PubMed PubMed Central CAS Google Scholar
Kokate, P. P., Techtmann, S. M. & Werner, T. Codon usage bias and dinucleotide preference in 29 Drosophila species. G3 https://doi.org/10.1093/g3journal/jkab191 (2021).
Simón, D., Cristina, J. & Musto, H. An overview of dinucleotide and codon usage in all viruses. Arch. Virol. https://doi.org/10.1007/s00705-022-05454-2 (2022).
Article PubMed PubMed Central Google Scholar
Xenarios, I. et al. DIP, the Database of Interacting Proteins: A research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 30, 303–305 (2002).
Article CAS PubMed PubMed Central Google Scholar
Rain, J. C. et al. The protein–protein interaction map of Helicobacter pylori. Nature 409, 211–215 (2001).
Article CAS PubMed Google Scholar
Yu, C. et al. Real time classification of viruses in 12 dimensions. PLoS One 8, e64328 (2013).
Article PubMed PubMed Central Google Scholar
Wylie, C. R. Jun. Advanced Engineering Mathematics. (McGraw-Hill Book Company, 1966).
Wei, Z.-S., Han, K., Yang, J.-Y., Shen, H.-B. & Yu, D.-J. Protein–protein interaction sites prediction by ensembling SVM and sample-weighted random forests. Neurocomputing 193, 201–212 (2016).
Article Google Scholar
Zhou, Y., Gao, Y. & Zheng, Y. Prediction of protein–protein interactions using local description of amino acid sequence. Adv. Comput. Sci. Educ. Appl., Pt Ii https://doi.org/10.1007/978-3-642-22456-0_37 (2011).
Wong, L. et al. Detection of interactions between proteins through rotation forest and local phase quantization descriptors. Int. J. Mol. Sci. https://doi.org/10.3390/ijms17010021 (2015).
Wang, Y. et al. PCVMZM: Using the probabilistic classification vector machines model combined with a Zernike moments descriptor to predict protein–protein interactions from protein sequences. Int. J. Mol. Sci. https://doi.org/10.3390/ijms18051029 (2017).
Du, X. et al. DeepPPI: Boosting prediction of protein–protein interactions with deep neural networks. J. Chem. Inf. Modeling 57, 1499–1510 (2017).
Article CAS Google Scholar
Song, B. et al. Learning spatial structures of proteins improves protein–protein interaction prediction. Brief. Bioinform. https://doi.org/10.1093/bib/bbab558 (2022).
Nanni, L. Hyperplanes for predicting protein–protein interactions. Neurocomputing 69, 257–263 (2005).
Article Google Scholar
Nanni, L. & Lumini, A. An ensemble of K-local hyperplanes for predicting protein–protein interactions. Bioinformatics 22, 1207–1210 (2006).
Article CAS PubMed Google Scholar
Shi, M. G., Xia, J. F., Li, X. L. & Huang, D. S. Predicting protein–protein interactions from sequence using correlation coefficient and high-quality interaction dataset. Amino Acids 38, 891–899 (2010).
Article CAS PubMed Google Scholar
You, Z. H., Lei, Y. K., Zhu, L., Xia, J. & Wang, B. Prediction of protein–protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC Bioinform. 14, S10 (2013).
Article Google Scholar
You, Z. H. et al. Prediction of protein–protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. BMC Bioinform. 15, S9 (2014).
Article Google Scholar
Goktepe, Y. E. & Kodaz, H. Prediction of protein–protein interactions using an effective sequence based combined method. Neurocomputing 303, 68–74 (2018).
Article Google Scholar

Download references

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 31670725), the Beijing Advanced Innovation Center for Structural Biology of Tsinghua University, and the Public Computing Cloud of Renmin University of China.

Author information

Authors and Affiliations

Institute for Mathematical Sciences, School of Mathematics, Renmin University of China, Beijing, China
Nan Zhao, Maji Zhuo, Kun Tian & Xinqi Gong
Beijing Academy of Artificial Intelligence, Beijing, China
Xinqi Gong
Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing, China
Xinqi Gong

Authors

Nan Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Maji Zhuo
View author publications
You can also search for this author in PubMed Google Scholar
Kun Tian
View author publications
You can also search for this author in PubMed Google Scholar
Xinqi Gong
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

N.Z. developed the method, collected the data, did the computation, and wrote the initial manuscript and later revised manuscripts. M.J.Z. proposed to draw on the natural vector and collected Saccharomyces cerevisiae dataset. K.T. helped to revise the manuscript. X.Q.G. designed the project and revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xinqi Gong.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks Jianzhao Gao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Yuedong Yang and Gene Chong. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Peer Review File

Supplementary Information

Description of Additional Supplementary Files

Supplementary Data 1

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Zhao, N., Zhuo, M., Tian, K. et al. Protein–protein interaction and non-interaction predictions using gene sequence natural vector. Commun Biol 5, 652 (2022). https://doi.org/10.1038/s42003-022-03617-0

Download citation

Received: 04 February 2022
Accepted: 21 June 2022
Published: 02 July 2022
DOI: https://doi.org/10.1038/s42003-022-03617-0

This article is cited by

Topological links in predicted protein complex structures reveal limitations of AlphaFold
- Yingnan Hou
- Tengyu Xie
- Jing Huang
Communications Biology (2023)
SVSBI: sequence-based virtual screening of biomolecular interactions
- Li Shen
- Hongsong Feng
- Guo-Wei Wei
Communications Biology (2023)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Introduction

Results

Application to H. sapiens and M. musculus datasets

Network prediction

Five-fold cross-validation on the constructed dataset

Performance of different feature extraction and combination

Comparison with other feature extraction methods

Comparison with other sequence-based methods

Performance on independent cross-species datasets

Discussion

Methods

Dataset collection

Feature extraction

Support vector machine

Random forest

Performance evaluation

Statistics and reproducibility

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Comments

Search

Quick links