Introduction

Most vital life activities are related to protein interactions, including physiological and pathological processes1. Proteins with similar functions are more likely to interact with each other to form protein complexes that participate in complex and diverse biochemical activities. Detecting and characterizing PPIs provides insight into the functions of unannotated proteins and the mechanisms underlying cellular biochemical processes and complex diseases, and can provide a basis for protein engineering and drug design2. The vast majority of proteins do not form interactions in physiological conditions. The number of non-interacting protein pairs vastly exceeds that of interacting pairs. Although some protein pairs have great structural similarities with available experimental complexes, interactions between them can be avoided if they are potentially hazardous to organism3. Identifying PPNIs is critical for understanding biological processes and reducing noise in datasets aimed at determining representative features of protein classification. Mastering the biological characteristics of non-interacting protein pairs helps to better study the three-dimensional structure and function of multi-body.

Rapid developments of high-throughput experimental technologies have enabled large-scale discovery and identification of PPIs, but they remain some disadvantages of time-consuming, labor-intensive, and high levels of false-positive and false-negative predictions4,5. Therefore, many computational methods have emerged as an alternative for the experimental prediction of PPIs based on different data types, such as genomic information6, evolutionary information7,8, structural information9,10, network information11, and sequence information12,13. Since protein sequences are easily obtained and sequence-based schemes do not need prior knowledge, various sequence-based computational models are favored by researchers. For example, Bock and Gough utilized amino acid physicochemical properties to predict interactions based solely on primary sequences14. Shen et al. characterized protein sequence by conjoint triad (CT), where CT considers the properties of adjacent amino acids15. Guo et al. expressed protein sequence by auto covariance (AC), which considered interactions between residues separated by a certain distance16. Yang et al. employed local descriptors (LD, including composition, transition, and distribution) to evaluate the effect of discontinuous amino acids, but the extracted features ignore global information17. Yin et al. numerically represented protein sequences using biochemical properties of amino acids and conducted coevolution analysis based on Fourier transform to detect interacting protein pairs18. Due to the difficulty in accurately characterizing protein sequence information by these single feature extraction methods, researchers have proposed many computational methods integrating multiple features for prediction19,20. For instance, Zhang et al. used AC, LD, and multi-scale continuous and discontinuous (MCD) local descriptor to extract features information of protein sequence21. Chen et al. exploited pseudo amino acid composition, Moreau-Broto, Moran and Geary autocorrelation descriptor, position-specific scoring matrix, and LD to encode biologically relevant features22. These feature vectors generally summarize physicochemical properties and position distribution of amino acids, but their large dimensions greatly increase the computational complexity.

The recent advancements in machine learning techniques have been greatly applied in the bioinformatics field, including RNA-binding site identification23, drugs’ efficacy24, drug–target binding recognition25, medical diagnosis26, protein complex structure prediction27, and protein–protein/peptide/ligand predictions28,29,30. Traditional machine learning algorithms are often combined with feature engineering to detect PPIs, such as amino acid physicochemical properties, co-occurrence frequency13, CT15, AC16,31, LD17, signature product32, sequence order, and dipeptide information33 were performed with support vector machine (SVM), Gabor feature34, chaos game representation and wavelet transform35 were performed with random forest (RF). In addition, deep learning algorithms are also widely used in PPIs detection, such as Siamese-like convolutional neural network (CNN) in DPPI36 and deep residual recurrent CNN in PIPR37. However, deep learning algorithms remain challenging as follows: (i) Lack of interpretability. The deep network is similar to a “black box” in which the physical meaning of features cannot be explained. But traditional machine learning algorithms involve feature engineering, which makes the model easy to interpret and understand. Previous successful methods show that the model generated by the combination of clear sequence features and traditional machine learning performs well. (ii) Complexity and time-consuming. The “inside” of deep learning is difficult to fully understand, which makes hyper-parametric and network design still a considerable challenge. But now that we have a more comprehensive understanding of traditional machine learning underlying algorithms, it is easier to adjust parameters and change model designs. Our lab had effectively captured amino acid sequence information using traditional machine learning algorithms to model and validate PPIs, and had many applications in other biological and medical problems. Yu et al. used SVM with heterogeneous kernels from a specific kernel set, including the Hadamard, RBF, and linear kernels, to perform the breast cancer outcome evaluation and the results proved to be effective38. Lyu et al. developed a two-layer SVM ensemble-classifier to predict interface residue pairs of protein trimers and showed its effectiveness and reliability39. Wang et al. employed linear SVM, RF, logistic regression with lasso penalty, and logistic regression with hierarchy interaction to predict interface residue pairs respectively, and showed that diverse machine learning methods tend to predict different protein–protein interface patterns40. As a result, we select the appropriate feature engineering combined with traditional machine learning algorithms to predict protein–protein interaction and non-interaction.

Plenty of computational models are proposed under the paradigm of supervised learning, so the quality of training data is a key issue to determine the prediction performance. For the training of learning algorithms, interacting protein pairs (i.e., positive samples) and non-interacting protein pairs (i.e., negative samples) are equally important to computational biologists. However, biologists generally focus on interacting protein pairs, extensively collecting experimentally or computationally validated interacting protein pairs into public databases, while ignoring or discarding non-interacting protein pairs41. Negative samples in the computational methods are almost constructed by pairing proteins located in different subcellular positions, but these samples restrict the distribution of non-interacting protein pairs and lead to biased estimates of prediction accuracy42. Srivastava et al. introduced a triple-layer validation method to collect reliable non-interacting protein pairs and then characterized the most relevant protein correlation features to train the PPIs identification model, which showed excellent predictive capabilities43. The training and performance evaluation of models are affected by strong bias in negative samples44,45, so high-quality and practical negative samples are needed to train less biased models. Smialowski et al. developed the Negatome database, a set of protein pairs unlikely to show direct physical interactions, in 2009 by manually collating the literatures and analyzing three-dimensional protein complex structures46. Blohm et al. proposed the second version of this database using a new advanced text-mining process to guide the manual annotation process47. High-quality non-interacting samples are important to capture the interaction and non-interaction information from sequences, so Negatome samples are currently used in many studies as an alternative to pairing proteins located in different subcellular. Bryant et al. used negative samples collected in the Negatome database, combined with AlphaFold2, and optimized multi-sequence alignment to predict heterodimeric protein complexes48. Das et al. used a negative dataset from the Negatome database when utilizing interface properties and SVM to classify and differentiate native and non-native complexes49. Consequently, the non-interacting protein pairs in the Negatome database can facilitate the training of highly generalizing models to predict PPIs and PPNIs.

Feature selection is a major determinant of the generalizability of predictive models. Common sequence-based computational methods extract features from the amino acid sequence, but nucleotide-related information is missing in the amino acid sequence since multiple codons encode the same amino acid. Triplet nucleotides (i.e., codons) are associated with various diseases, and their repeats may lead to toxic proteins, alter RNA function, and control transcription and translation50,51,52. Moratorio et al.53 and Carrau et al.54 found that codon usage may be selected to maintain mutational robustness. Some triple nucleotides often encode the same amino acid, which is called synonymous codons. Codon usage bias, where certain codons are used more frequently than their synonymous codons, is influenced by mutation, selection, and genetic drift. Zhuo et al. found that some interface codons had the obvious propensity to interface residues and the genetic codon affected the interaction interface between proteins55. Because codons carry very important information about proteins, some methods consider extracting features from gene sequences. Zhou et al. used the codon pair frequency difference to predict protein interactions with comparable performance to those of other sequence-based methods56. Najafabadi et al. utilized the relative codon frequency differences combined with a naive Bayesian classifier to validate PPIs, the approach showed good performance on Saccharomyces cerevisiae (S. cerevisiae), Escherichia coli, and Plasmodium falciparum datasets57. Consequently, extracting additional biological information from gene sequences, like codon frequencies, may further improve the prediction ability. Deng et al. proposed natural vector based on the distributions of nucleotides in each DNA sequence to analyze the virus genome58. Later, many methods improved on the natural vector. For example, Dong et al. proposed an improved natural vector method called Accumulated Natural Vector to analyze sequences, genomes, and their phylogenetic relationships59. Zhao et al. added the covariances of amino acids to natural vector and used it to classify proteins and detect the evolutionary relationships among species60. In addition, certain dinucleotides have connections with the regulation of metabolism, aging, and neurodegeneration61. Atkinson et al. provided that dinucleotide bias was related to evading cell defense mechanisms62. Takata et al. developed that codon and dinucleotide usage biases may be associated with the need to maintain the RNA secondary structure involved in splicing and gene expression63. Kokate et al. studied codon and dinucleotide preferences of 29 Drosophila species and observed their association with speciation64. Simón et al. analyzed dinucleotide and codon usage in all available non-redundant viral sequences and discovered that four dinucleotides (TG, GT, CA, and AC) were self-complementary and codon usage bias was mainly determined by the genomic composition65. Therefore, we hypothesized that the distribution of contextual nucleotides in a gene sequence could be helpful for sequence classification prediction, so dinucleotide and triplet nucleotide information was added to the natural vector.

In this study, we developed a sequence-based approach for PPIs and PPNIs predictions called NVDT. NVDT firstly employed the distribution of nucleotides, dinucleotides, and triplet nucleotides to extract protein information from gene sequences, where the correspondence between a protein gene sequence and its feature vector was one-to-one. Second, we combined all pairs of corresponding natural vectors into a single feature vector describing a protein pair and then normalized feature vectors using the Z-score method. Finally, these features were fed into the classifier to obtain the final prediction results. NVDT not only combined the advantages of local and global protein information but also had high computational speed with low dimensions, making it a robust and efficient prediction method. We applied our approach to Homo sapiens (H. sapiens), Mus musculus (M. musculus), S. cerevisiae, Drosophila melanogaster (D. melanogaster), and Helicobacter pylori (H. pylori) datasets, and obtained high prediction results. We harnessed NVDT to produce network visualizations for three types of PPNI networks, including the one-core, multiple-core and crossing network, to further evaluate the capabilities of this approach. In addition, model evaluations indicated that NVDT improved PPIs prediction accuracies over state-of-the-art methods. Our PPNIs prediction results showed that our method was robust and would help to improve multi-body complex structure predictions.

Results

Application to H. sapiens and M. musculus datasets

To ensure the reliability of our approach, five-fold cross-validation was first used to evaluate model performance and select the optimal parameters (Supplementary Tables 14 and Supplementary Figs. 14). Then we further verified the performance of different classifiers on the test set. Table 1 showed that the accuracy of the real dataset could be improved substantially when using RF compared with SVM, where the accuracy for the H. sapiens increased from 80.92 to 86.23%, and that for M. musculus increased from 80.17 to 85.34%. But the differences in accuracy of the constructed dataset were relatively small.

Table 1 Prediction results by using two classifiers on H. sapiens and M. musculus datasets.

Network prediction

We extended our proposed method to predict PPNI networks consisting of non-interaction pairs (NIPs). The knowledge of PPNI networks is helpful to overcome the noise in the datasets and find the relevant features that can best represent proteins, so as to better classify and design the three-dimensional structure of proteins. It can also be used to identify the proteins with the least interaction in a pathway. As a potential indispensable factor in the pathological process, these proteins are likely to be effective targets for drug design. At the same time, systematic analysis of the non-interaction relationship between a large number of proteins in biological systems is very helpful for multi-body complex structure predictions.

We predicted three types of PPNI networks using our approach. First, a one-core network is the simplest network in which only one core protein radially does not interact with other proteins. We found a guanine nucleotide-binding protein, P62879, which may be a modulator or transducer in various transmembrane signaling systems. Among 26 NIPs, 23 pairs were correctly predicted by our method (Fig. 1a and Supplementary Data 1), supporting the application of our method to predict PPNIs in one-core networks.

Fig. 1: PPNI network prediction results.
figure 1

a A one-core network involving P62879. b A multiple-core network involving the Q8TBX8-O75175-P31150-Q16828-Q8TAU0-Q9H6S3 pathway. c A crossing network. The core and satellite proteins are represented by indigo blue circles and light blue circles, respectively. Dotted lines connecting two proteins are divided into four classes: gray, predicted correctly; red, predicted falsely; green, re-predicted correctly after adding 40% non-interactions, blue, re-predicted falsely after adding 40% non-interactions.

Second, a multiple-core network is essentially composed of several one-core networks, satisfying the corresponding non-interaction relationships among these nuclear proteins at the same time. We found a PPNI network with the pathway Q8TBX8-O75175-P31150-Q16828-Q8TAU0-Q9H6S3, involving 26 proteins. Of the 82 NIPs in this network, our method correctly predicted 66 pairs (Fig. 1b and Supplementary Data 1). To test whether the lack of core protein non-interaction information leads to low accuracy, we added existing non-interaction information for 10, 30, and 40% of core proteins. Supplementary Table 5 showed the frequency distribution of each core protein that was incorrectly predicted. Using this additional information, the accuracy could be increased from 80.49 to 85.37%, 92.68%, and 96.34%, respectively. With the increase in NIPs related to the six core proteins in the training set, the prediction accuracy of the test set also gradually improved. Therefore, additional experimental information could improve the ability of our method to predict NIPs in complex networks.

Third, in biology, most PPNI networks are crossing networks. We obtained a crossing network involving 73 human proteins from the Negatome database. Our method could correctly predict 58 pairs among 81 NIPs (Fig. 1c and Supplementary Data 1), indicating that our approach can be applied to general PPNI networks.

Five-fold cross-validation on the constructed dataset

As there was little difference in accuracy between SVM and RF when using constructed dataset, we used SVM for five-fold cross-validation. As shown in Table 2, the proposed method yielded a high average accuracy of 95.40% for H. sapien, 94.83% for M. musculus, 98.28% for S. cerevisiae, 93.22% for D. melanogaster, and 94.63% for H. pylori. These findings indicated that our model was effective and robust for PPIs prediction (Supplementary Tables 2, 6 and Supplementary Figs. 2, 5).

Table 2 Five-fold cross-validation results on the constructed dataset.

Performance of different feature extraction and combination

For the same classifier, diverse feature extraction methods may yield different prediction results. To further determine the importance of dinucleotides and triplet nucleotides in predicting protein interactions, we separately predicted interactions combined SVM with NV (using only nucleotide information), NVD (using nucleotide and dinucleotide information), NVT (using nucleotide and triplet nucleotide information), and NVDT (using nucleotide, dinucleotide, and triplet nucleotide information). As shown in Fig. 2, SVM-NVDT showed the best accuracy, precision, MCC, and F-scores for S. cerevisiae, D. melanogaster and H. pylori constructed datasets and high sensitivity (Supplementary Table 7). For D. melanogaster, the accuracy of the proposed method was 7.02% higher than that of SVM-NV, 5.90% higher than that of SVM-NVD, and 1.12% higher than that of SVM-NVT. The polynucleotide module may therefore improve the prediction performance.

Fig. 2: Prediction performance based on SVM with NV, NVD, and NVDT on S. cerevisiae, D. melanogaster and H. pylori constructed datasets.
figure 2

The abscissa shows the prediction metrics and the ordinate shows the prediction performance.

Similarly, distinct feature combination methods based on the same classifier will produce different prediction results. For one protein pair composed of protein i and protein j, the 92-dimensional feature vectors of two proteins can be obtained by the NVDT method. Suppose that the feature vector of protein i is A = (a1, a2, …, a92), and the feature vector of protein j is B = (b1, b2, …, b92). Now we encode the protein pairs in five different means, which are defined as follows.

$${{{{{\rm{Cod}}}}}}1:{abs}\left(A-B\right)=[\left|{a}_{1}-{b}_{1}\right|,\left|{a}_{2}-{b}_{2}\right|,\ldots ,\left|{a}_{92}-{b}_{92}\right|]$$
$${{{{{\rm{Cod}}}}}}2:A+B=[({a}_{1}+{b}_{1}),({a}_{2}+{b}_{2}),\ldots ,({a}_{92}+{b}_{92})]$$
$${{{{{\rm{Cod}}}}}}3:({abs}\left(A-B\right),A+B)=\; [\left|{a}_{1}-{b}_{1}\right|,\ldots ,\left|{a}_{92}-{b}_{92}\right|,\\ ({a}_{1}+{b}_{1}),\ldots ,({a}_{92}+{b}_{92})]$$
$${{{{{\rm{Cod}}}}}}4:A* B=[({a}_{1}\times {b}_{1}),({a}_{2}\times {b}_{2}),\ldots ,({a}_{92}\times {b}_{92})]$$
$${{{{{\rm{Cod}}}}}}5:(A,B)=[{a}_{1},{a}_{2},\ldots ,{a}_{92},{b}_{1},{b}_{2},\ldots ,{b}_{92}]$$

The accuracy results of Cod1–Cod5 on H. sapiens and M. musculus datasets were shown in Table 3. It can be seen that Cod5 (i.e., our method SVM-NVDT) achieved the highest accuracy in all four datasets.

Table 3 Prediction performance based on distinct feature combination methods on H. sapiens and M. musculus datasets.

Comparison with other feature extraction methods

The prediction results for alternative methods based on gene sequence data using three datasets were shown in Table 4. Our method showed accuracies of 99.20% for S. cerevisiae, 94.94% for D. melanogaster, and 98.56% for H. pylori, which were better than the two other methods, indicating that NVDT was more suitable for predicting PPIs than other gene sequence-based methods.

Table 4 Prediction results of different methods on three constructed independent test datasets.

We further compared our method with the codon pair-based method(CCPPI) proposed by Zhou et al.56 on the S. cerevisiae dataset. We used the same ten-fold cross-validation as the CCPPI method and obtained an average accuracy of 98.05%, precision of 98.14%, sensitivity of 97.95%, and MCC of 96.10% (Supplementary Table 8). Compared with CCPPI, the average accuracy, precision, sensitivity, and MCC of our method are improved by 8.45, 10.04, 6.25, and 16.80%, respectively.

Comparison with other sequence-based methods

In order to make an effective comparison of our proposed SVM-NVDT model, and considering the limited literature based on the D. melanogaster dataset, we further compared the predictive performance of our method with other state-of-the-art sequence-based approaches using S. cerevisiae and H. pylori datasets. The comparison results on the S. cerevisiae dataset were presented in Table 5. The accuracy of our proposed method achieved an enhancement of 1.39% compared with the second TAGPPI, 2.11% to the third PIPR, 4.13%to the fourth LightGBM, and 4.56% to the fifth StackPPI. The comparison results on the H. pylori dataset were presented in Table 6. Our proposed method achieved an improvement in an accuracy of 7.31% compared with the second PCVMZM, 8.84% to the third StackPPI, 9.09% to the fourth RF+PR+LPQ, and 9.41% to the fifth Weighted Skip-sequential. Our model has shown superior results compared with other methods, further supporting the validity of our model.

Table 5 Comparison of established methods using the S. cerevisiae dataset.
Table 6 Comparison of existing methods using the H. pylori dataset.

Performance on independent cross-species datasets

When a large number of interacting proteins in an organism show correlated evolution, orthologs in other taxa will also interact. Therefore, we trained the model with SVM using all S. cerevisiae samples and used the other five species in the DIP database as independent test datasets. For these five test datasets, all samples were interacting protein pairs. As summarized in Table 7, the minimum prediction accuracy was 89.64%, indicating that the proposed method can be used to predict cross-species PPIs.

Table 7 Prediction results for independent datasets.

Discussion

Our method calibrated on known PPNIs in H. sapiens and M. musculus addressed the limitations of established computational methods for PPIs prediction, including low accuracies and inefficient prediction of real non-interactions. At the same time, the validity of the model was verified on S. cerevisiae, D. melanogaster, H. pylori, H. sapiens, and M. musculus constructed datasets based on the commonly used protein pairs using different subcellular locations to construct negative samples. Several aspects of the proposed approach were worth highlighting. (1) We utilized information from gene sequence data, enabling the extraction of more extensive information not accessible in protein sequence data. The dinucleotides and triplet nucleotides are related to many diseases, metabolism, and mutations, and their information can be well obtained by NVDT. (2) NVDT had better performance than those of other gene sequence-based methods, with higher accuracy and better stability and this can be explained by the integration of polynucleotide information, which dramatically reduces the repeatability of different protein feature vectors. (3) NVDT juxtaposed the two feature vectors of a protein pair to form a new vector with a single dimension, retaining more information. Especially when combined with SVM, the prediction classifications of the two feature vectors (feature_A, feature_B) and (feature_B, feature_A) corresponding to each sample (assuming the protein pair composed of protein A and protein B) are always consistent. The reason may be that each dimension in SVM is independent, and the classification fundamentally depends on calculating the Euclidean distance of any two samples, so the order of features does not affect the classification results. (4) Ensemble classifiers typically show higher accuracy and robust performance than those of single classifiers. However, our model using a single classifier SVM showed better performance than those methods using ensemble classifiers on the H. pylori dataset. (5) The constructed negative datasets were restricted to protein pairs located in different cellular compartments, potentially leading to a functional bias in downstream analyses and predictions. The high accuracy of these datasets hardly reflected the true prediction effect, but the good performance on the real dataset can well reflect the feasibility of our model. Among them, results showed that the prediction accuracy of the real dataset was not sufficiently high, which may be explained by the limitation of real non-interacting protein pairs.

The proposed method utilized detailed gene sequence information, including the distribution information of three kinds of nucleotides, and showed better predictive performance than those of established protein sequence-based methods. Moreover, by comparing combined and single features, we found that various features may be complementary. And our model performed well on three types of networks. These results proved that gene sequence information can be used to distinguish interacting and non-interacting protein pairs and ultimately to establish a complete PPI and PPNI map to better understand biochemical and biological processes.

Methods

In this section, we elaborated on the proposed NVDT approach for predicting PPIs and PPNIs based on gene sequence. NVDT consisted of the following three steps: (1) Encode the protein pairs’ interaction and non-interaction information into natural vectors via the distribution of nucleotides, dinucleotides, and triplet nucleotides. (2) Generate and normalize feature vectors of protein pairs. (3) Train the model via different classifiers (SVM and RF) and test on an independent test set. Finally, we briefly described the common model performance evaluation metrics used in the study. The flowchart of NVDT was shown in Fig. 3.

Fig. 3: Workflow of our computational pipeline to predict protein-protein interaction and non-interaction.
figure 3

It describes the whole research process, including dataset collection, feature extraction, feature-selective classifier and result analysis.

Dataset collection

Seven datasets were obtained. The positive datasets consisted of interacting protein pairs collected from the public Database of Interacting Proteins (DIPs: https://dip.doe-mbi.ucla.edu/dip/)66. To reduce fragments and sequence similarity, samples with fewer than 50 amino acids and >40% pairwise sequence identity to one another were excluded.

The negative datasets were composed of non-interacting protein pairs obtained in two ways. First, negative samples were derived from the Negatome Database 2.0 (http://mips.helmholtz-muenchen.de/proj/ppi/negatome)47, which currently contained experimentally supported non-interacting protein pairs. In the Negatome database, the subcellular structure information for some proteins was unclear or even absent, and some proteins belonged to at least two or more subcellular localizations simultaneously, which was consistent with the actual phenomenon. After a selection of protein pairs from multiple species in the Negatome database, the majority belonged to H. sapiens (1217 pairs), followed by M. musculus (347 pairs) and Rattus norvegicus (33 pairs). Accordingly, we collected 2434 protein pairs for H. sapiens and 694 protein pairs for M. musculus, with interacting protein pairs and non-interacting protein pairs each accounting for half. The datasets obtained in this way were called “real dataset”.

The other negative samples in different subcellular compartments were obtained, based on the assumption that proteins within different subcellular localizations tend not to interact. Considering that the ratio of positive samples to negative samples used in previous literature research is mostly 1:1, we should not only verify the prediction accuracy of interacting protein pairs but also ensure the prediction accuracy of non-interacting protein pairs. Therefore, the balanced dataset was selected when constructing the dataset, that is, we randomly selected negative samples with the same number of positive samples. Here, the selection of negative samples was random without any cluster analysis. Four datasets were finally collected in this way, including 11188 protein pairs for S. cerevisiae, 2140 protein pairs for D. melanogaster, 1217 protein pairs for H. sapiens, and 694 protein pairs for M. musculus. We also performed on 2916 protein pairs for H. pylori described derived by Rain et al.67, and a list of H. pylori protein interactions was given in its supplementary materials. These datasets were called “constructed dataset”.

The gene sequence of each protein was obtained from the NCBI database (https://www.ncbi.nlm.nih.gov/). To verify the performance of the proposed method, the positive samples of H. sapiens and M. musculus datasets in the real dataset and constructed dataset were consistent, only the negative samples were obtained differently (Supplementary Table 9). Finally, each of the seven datasets was divided into a training dataset and a test dataset in a ratio of 5:1.

Feature extraction

Deng et al. have defined natural vector and mathematically proved that there was a one-to-one correspondence between natural vector and gene sequence for a protein58. The natural vector method to extract vital information from gene sequence was as follows.

Let \(S={s}_{1}{s}_{2}\cdots {s}_{N}\) be a gene sequence of length N, where \({s}_{i}\in \left\{A,C,G,T\right\},i={{{{\mathrm{1,2}}}}},\ldots ,N\). For k representing one of the four nucleotides, define \(\omega \left(\cdot \right)=\left\{A,C,G,T\right\}\to \{0,1\}\), where

$${\omega }_{k}\left({s}_{i}\right)=\left\{\begin{array}{c}0,{s}_{i}\;\ne\; k,\\ 1,{s}_{i}=k.\end{array}\right.$$
(1)
  1. (1)

    Denote \({n}_{k}=\mathop{\sum }\nolimits_{i=1}^{N}{\omega }_{k}\left({s}_{i}\right)\) as the number of nucleotide k in gene sequence S.

  2. (2)

    Taking the first nucleotide as the origin, \({{{{{{\rm{dis}}}}}}}_{k(i)}=i\cdot {\omega }_{k}\left({s}_{i}\right)\) is the distance from the first nucleotide to the ith nucleotide k, where \(i={1,2},\ldots ,{n}_{k}\), and \({\mu }_{k}=\frac{\mathop{\sum }\nolimits_{i=1}^{{n}_{k}}{{{{{{\rm{dis}}}}}}}_{k(i)}}{{n}_{k}}\) represents the average position of nucleotide k in gene sequence S.

  3. (3)

    The normalized central moments are defined as \({D}_{j}^{k}=\mathop{\sum }\nolimits_{i=1}^{{n}_{k}}\frac{{(i-{\mu }_{k})}^{j}\cdot {\omega }_{k}\left({s}_{i}\right)}{{n}_{k}^{j-1}{N}^{j-1}},j={1,2},\ldots ,{n}_{k}\).

The first normalized central moments \({D}_{1}^{k}\) (j = 1 in \({D}_{j}^{k}\)) could be ignored since its values were zero. Deng et al. and Yu et al. have demonstrated that the second normalized central moments \({D}_{2}^{k}\) (j = 2 in \({D}_{j}^{k}\)) in the vector could obtain stable classification results; accordingly, the central moments (\({D}_{j}^{k}\)) higher than j = 2 do not need to be considered, and computed58,68. Thus, the 12-dimensional natural vector N(S) of a gene sequence S was given as follows:

$$N\left(S\right)=\left({n}_{A},{n}_{C},{n}_{G},{n}_{T},{\mu }_{A},{\mu }_{C},{\mu }_{G},{\mu }_{T},{D}_{2}^{A},{D}_{2}^{C},{D}_{2}^{G},{D}_{2}^{T}\right)$$
(2)

In this work, we proposed an extended natural vector with the frequencies of dinucleotides and triplet nucleotides, which considered the properties of a single nucleotide and its vicinal nucleotides and regarded any two contiguous nucleotides or any three contiguous nucleotides as a unit. The size of type dinucleotide should be 4 × 4 = 16 and the size of type triplet nucleotide should be 4 × 4 × 4 = 64. Finally, we defined a 92-dimensional natural vector to make further investigation on gene sequence as the following:

$$N\left(S\right)=\left({n}_{A},{n}_{C},{n}_{G},{n}_{T},{\mu }_{A},{\mu }_{C},{\mu }_{G},{\mu }_{T},{D}_{2}^{A},{D}_{2}^{C},{D}_{2}^{G},{D}_{2}^{T},{n}_{{AA}},\ldots ,{n}_{{TT}},{n}_{{AAA}},\ldots ,{n}_{{TTT}}\right)$$
(3)

For each protein pair in the dataset, we could convert a gene sequence into a natural vector. Assume that the kth protein pair corresponds to protein i and protein j, we then juxtaposed the two natural vectors. The features of the protein pair can be described as two 184-dimensional feature vectors:

$${d}_{k}=({N}_{i},{N}_{j})\;{{{{{\rm{or}}}}}}\;({N}_{j},{N}_{i})$$
(4)

where Ni and Nj were feature vectors for proteins i and j.

However, protein feature vectors correlated to the length of protein (the number of nucleotides) complicate the comparison between two protein pairs. The feature vectors were standardized by the Z-score method69, with the mean value of 0 and standard deviation of 1:

$${d}_{k}^{{\prime} }\left(l\right)=\frac{{d}_{k}\left(l\right)-\mu \left(l\right)}{\sigma \left(l\right)},k=1,2,\ldots ,n,{{{{{\rm{and}}}}}}\,l=1,2,\ldots ,184$$
(5)

where \({d}_{k}\left(l\right)\) was the lth feature of the kth protein pair, and n was the number of protein pairs. μ(l) and σ(l) were the mean value and standard deviation of all proteins for the lth feature, respectively:

$$\mu \left(l\right)=\frac{\mathop{\sum }\nolimits_{k=1}^{n}{d}_{k}\left(l\right)}{n},\sigma \left(l\right)=\sqrt{\frac{\mathop{\sum }\nolimits_{k=1}^{n}{\left({d}_{k}\left(l\right)-\mu \left(l\right)\right)}^{2}}{n}}$$
(6)

For data standardization, the training set was standardized in the above way, and the mean and standard deviation of the training set were used to standardize the test set. \({d}_{k}^{{\prime} }\), the input features of a classifier, were a 184-dimensional vector containing the statistical features of nucleotide, dinucleotide, and triplet nucleotide.

Support vector machine

SVM70 is a supervised learning method that has been popularly utilized for classification and regression in computational biology. It constructs a hyperplane to maximize the margin. Specifically, SVM finds several samples as support vectors that minimize the distance of samples between different classes, and the minimum distance is called the margin. The optimal hyperplane is located in the center of the margin, where the larger the margin is, the smaller the generalization error will be. Therefore, this hyperplane can better divide the samples in the training set according to the labels given. In this study, the Gaussian radial basis function was chosen as the kernel function, and the regularization parameter c and the Gauss kernel function parameter γ were optimized by a grid search approach. Here, the parameters were the default of LIBSVM, and cross-validation was used to avoid over-fitting (Supplementary Table 10). And select a cut-off value of 0.5. The application of the SVM classifier under optimal parameter settings guarantees the reliability of classification prediction with the minimum error.

Random forest

Random forest (RF)70 is another typical classification and regression method with biological applications. It uses multiple decision trees, which do not correlate them, to obtain the final result by voting or taking the mean value, generating an overall model with high accuracy and generalizability. For a new sample, each decision tree assigns the sample to a class, and then the voting method is used to determine which class is selected more frequently as the final classification result. Three parameters are usually adjusted when training an RF model, the number of trees to grow (denoted as n_Tree), the number of randomly selected features at each decision split (denoted as n_feature), and the minimum node size of a terminal node (denoted as min_leaf). In our study, n_Tree values under 200 were evaluated to find the optimal value concerning computational time, cost, and overfitting (Supplementary Table 10). All other parameters were set to default values.

Performance evaluation

Common performance evaluation metrics were used, including accuracy (Acc.), precision (Pre.), sensitivity (Sen.), specificity (Spe.), Matthews correlation coefficient (MCC), F1-score, and the area under a ROC curve (AUC) to evaluate the predictive performance of the proposed method. These metrics were defined as follows:

$${{{{{\rm{Accuracy}}}}}}=\frac{{TP}+{TN}}{{TP}+{TN}+{FP}+{FN}}$$
(7)
$${{{{{\rm{Precision}}}}}}=\frac{{TP}}{{TP}+{FP}}$$
(8)
$${{{{{\rm{Sensitivity}}}}}}=\frac{{TP}}{{TP}+{FN}}$$
(9)
$${{{{{\rm{Specificity}}}}}}=\frac{{TN}}{{TN}+{FP}}$$
(10)
$${{{{{\rm{MCC}}}}}}=\frac{{TP}\times {TN}-{FP}\times {FN}}{\sqrt{\left({TP}+{FN}\right)\times \left({TN}+{FP}\right)\times \left({TP}+{FP}\right)\times \left({TN}+{FN}\right)}}$$
(11)
$$F1\mbox{-}{{{{{\rm{score}}}}}}=\frac{2* {{{{{\rm{precision}}}}}}* {{{{{\rm{sensitivity}}}}}}}{{{{{{\rm{precision}}}}}}+{{{{{\rm{sensitivity}}}}}}}$$
(12)

where TP represents the number of positive samples that are correctly predicted; FN represents the number of positive samples that are incorrectly predicted; TN represents the number of negative samples that are correctly predicted; FP represents the number of negative samples that are incorrectly predicted. The receiver operating characteristic (ROC) curve was generated by plotting the TP rate against the FP rate at various thresholds; the abscissa of the ROC curve was 1-specificity and the longitudinal coordinate is sensitivity.

Statistics and reproducibility

All quantitative results of K-fold cross-validation were shown using mean ± standard deviation. Graphing of three types of PPNI networks was performed with Pajek. The computational experiments were repeated at least two additional times with a similar outcome.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.