Introduction

DNA-binding proteins are proteins that bind to and interact with DNA. They play important roles in the structural composition of DNA and in gene regulation. Non-specific structural proteins often help to organize and compact chromosomal DNA. Their other important role is to regulate and affect various cellular processes such as transcription, DNA replication, DNA recombination, repair and modification. These proteins contain at least one structural motif in their independently folded domains and have affinity for DNA1. DNA-binding proteins or ligands have many important applications as antibiotics, drugs and steroids for various biological effects, and in biophysical, biochemical and biological studies of DNA2.

Many experimental methods are used to identify DNA-binding proteins: filter binding assays3, genetic analysis4, X-ray crystallography5, chromatin immunoprecipitation on microarrays6, NMR7,8, etc. However, these experimental methods are costly and time consuming9. Therefore, there is growing interest in computational methods that can replace experimental methods for identifying DNA-binding proteins. Moreover, the number of newly discovered protein sequences has been increasing extremely fast due to the advent of modern protein sequencing technologies. For example, in 1986 the Swiss-Prot10 database contained only 3,939 protein sequence entries, but the number has now jumped to 88,032,926 according to release 2017_07 of July 5, 2017 of UniProtKB/Swiss-Prot (http://web.expasy.org/docs/relnotes/relstat.html). This means the number of protein sequence entries is now thousands of times larger than it was about 30 years ago. Facing the flood of new protein sequences generated in the post-genomic age, it is highly desirable to develop automated computational approaches for rapidly and effectively identifying and characterizing DNA-binding proteins.

Computational methods that have been used to predict DNA-binding proteins can be broadly categorized into two groups: structure based methods11,12 and sequence based methods13,14,15,16,17,18,19. In most cases, DNA-binding protein identification is formulated as a binary classification problem in the supervised learning setting. Sequence based methods depend only on sequence based information extracted from the training data, whereas structure based methods also exploit structure based features. In20, structural motifs and electrostatic potentials were used to predict DNA-binding proteins. DNA-binding domain hunter (DBD-Hunter)21 was proposed to identify DNA-binding proteins using structure comparison and the evaluation of a statistical potential derived from the interactions between DNA base pairs and protein residues. The iDBPs server, proposed in22, used global features such as the average surface electrostatic potential, the dipole moment and cluster-based amino acid conservation patterns. Low-resolution α-carbon-only models generated by TASSER23 were used to predict DNA-binding proteins in24. One of the major difficulties with structure based methods is that the structures of most proteins are unknown. However, structural information, such as the presence of motifs, is crucial to the DNA recognition of binding proteins. Therefore, we hypothesize that even partial information about a protein's structure could play a very important role in identifying its DNA-binding function.

Many machine learning algorithms have been applied to this problem in the literature. Among them are: Logistic Regression24, Hidden Markov Models20, Random Forest22,25,26, Artificial Neural Networks27, Support Vector Machines14,28, the Naive Bayes classifier15, etc. A number of software tools, web-servers and prediction methods are available in the literature for DNA-binding protein prediction. Among them are: DNABinder28, DNA-Prot25, iDNA-Prot26, iDNA-Prot|dis13, DBPPred15, iDNAPro-PseAAC14, PseDNA-Pro29, Kmer1 + ACC30, Local-DPP16, etc. Kumar et al.28 used evolutionary information from PSSM profiles with support vector machines and established a web-server called DNABinder. They compared the effectiveness of the PSSM based features with amino acid composition, di-peptide composition and 4-part amino acid compositions as features.

DNA-Prot is another tool, proposed in25. Its authors used amino acid composition, physicochemical properties and secondary structure information as features and trained their model using a Random Forest classifier. Lin et al.26 presented a web-server named iDNA-Prot where they used a grey model to incorporate amino acid sequence features into the general form of pseudo amino acid composition and trained their model using a Random Forest classifier. Amino acid distance-pair coupling information and the amino acid reduced alphabet profile were incorporated into the general form of pseudo amino acid composition31 by Liu et al.13. They also offered a freely available web-server called iDNA-Prot|dis. One of the most successful prediction methods so far is DBPPred, proposed in15. Its authors used a wrapper-based best-first feature selection technique to select an optimal set of features. They used features based on amino acid composition, PSSM scores, secondary structures and relative solvent accessibility, and trained their model using Random Forest and Gaussian Naive Bayes classifiers.

Liu et al.14 presented iDNAPro-PseAAC as a web server. They used evolutionary information as their input features: they used a profile-based protein representation and selected a set of 23 optimal features using Linear Discriminant Analysis (LDA). Their model was trained using a Support Vector Machine (SVM) classifier. Kmer composition and the auto-cross covariance transformation were used in30 in a subsequent work. Their method, trained with an SVM, is known as Kmer1 + ACC in the literature. They also developed another server called PseDNA-Pro29. PseDNA-Pro used amino acid composition, pseudo amino acid composition and physicochemical distance transformation based features to train its model. Wei et al. proposed Local-DPP16, using a Random Forest classifier on local pseudo position specific scoring matrix features. Other recent works include SVM-PSSM-DT32, PNImodeler33, CNNsite34, BindUP35, etc.

One of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector while still keeping considerable sequence-order information or key pattern characteristics. This is because all existing machine-learning algorithms can only handle vector samples, not sequence samples, as elucidated in a recent review36. However, a vector defined in a discrete model may completely lose all the sequence-pattern information. To avoid completely losing the sequence-pattern information of proteins, the pseudo amino acid composition, or PseAAC37, was proposed. Ever since then, the concept of PseAAC has rapidly and widely penetrated nearly all areas of computational proteomics38,39. Because it has been widely and increasingly used, three powerful open-access software tools, called ‘PseAAC-Builder’, ‘propy’, and ‘PseAAC-General’, were recently established: the former two are for generating various modes of Chou’s special PseAAC, while the third is for Chou’s general PseAAC, covering not only all the special modes of feature vectors for proteins but also higher-level feature vectors such as the “Functional Domain” mode, the “Gene Ontology” mode, and the “Sequential Evolution” or “PSSM” mode. Encouraged by the successes of using PseAAC to deal with protein or peptide sequences, four web-servers called ‘PseKNC’, ‘PseKNC-General’, ‘repDNA’, and ‘repRNA’ were developed for generating various feature vectors for DNA/RNA sequences as well. In particular, a very powerful web-server called Pse-in-One40 has recently been established that can generate any desired feature vector for protein or peptide and DNA or RNA sequences according to the needs of users’ studies. In the current study, we use 14 different modes of the general PseAAC, derived from evolutionary and structural information, to identify DNA-binding proteins.

As done in a series of recent publications41,42,43,44,45,46,47,48 in compliance with Chou’s 5-step rule, to establish a really useful sequence-based statistical predictor for a biological system, we should follow five guidelines: (a) construct or select a valid benchmark dataset to train and test the predictor; (b) formulate the biological sequence samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (c) introduce or develop a powerful algorithm to perform the prediction; (d) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; and (e) establish a user-friendly web-server for the predictor that is accessible to the public. In this paper, we propose iDNAProt-ES, identification of DNA-binding Proteins using Evolutionary and Structural features. In our proposed method, a number of novel features are derived from the sequence based evolutionary information and the structural information of a given protein to train an SVM classifier with a linear kernel. We used the recursive feature elimination technique to reduce the number of features and to derive an optimal set of features for DNA-binding protein prediction. We tested our method on standard benchmark datasets. Experimental results show that iDNAProt-ES significantly outperforms other state-of-the-art methods found in the literature and thus has the potential to be used as a DNA-binding protein prediction tool.

Results and Discussion

In this section, we present the results of the experiments carried out in this study. All methods were implemented in Python 3.4, and the Scikit-learn library49 was used for the implementation of the machine learning algorithms. All experiments were conducted on a computing machine provided by CITS, United International University. Each experiment was carried out 50 times and only the average is reported.

Comparison With Other Methods

To compare the performance of our predictor iDNAProt-ES with the state-of-the-art algorithms found in the literature, we first used the benchmark dataset. Using this dataset, we performed the jackknife test and report accuracy, sensitivity, specificity, MCC and auROC values in Table 1. We compare the results achieved by iDNAProt-ES with previous state-of-the-art methods found in the literature, including: DNABinder28, DNA-Prot25, iDNA-Prot26, iDNA-Prot|dis13, DBPPred15, iDNAPro-PseAAC14, PseDNA-Pro29, Kmer1 + ACC30 and Local-DPP16. The results reported in this paper for these methods are taken from14,16.

Table 1 Comparison of performance of the proposed method with other state-of-the-art predictors using jack knife test on the benchmark dataset.

The best values in Table 1 are shown in bold-faced font. For the benchmark dataset, our method iDNAProt-ES significantly outperforms the previous state-of-the-art in terms of all the evaluation metrics used. The accuracy of iDNAProt-ES is 90.18%, compared to the previous best of 79.20% by Local-DPP16. The higher MCC and auROC values also depict the effectiveness of our method.

To assess the performance and generality of iDNAProt-ES further, we applied it to the independent dataset introduced in15. Here, we used the same iDNAProt-ES model trained on the benchmark dataset and tested it on the independent dataset. We report the performance metrics for the independent dataset in Table 2. Here too, the best values are shown in bold-faced font. Our algorithm shows better performance in terms of accuracy and auROC compared to the other state-of-the-art algorithms. The sensitivity, specificity and MCC values, while not the best, are comparable to the other methods. Although we demonstrate consistent prediction performance enhancement on both the train and test benchmarks, the improvement achieved on the train set is larger than on the test set. The main reason for this phenomenon is that the feature selection and parameter tuning steps were conducted on the train set. Although we made sure to separate a validation set for those tasks, it is still possible that the tuned parameters are more homogeneous to samples in the train set. However, the fact that the enhancement is repeated on the independent test benchmark supports the generality of our proposed method.

Table 2 Comparison of performance of the proposed method with other state-of-the-art predictors on the independent dataset.

Effect of Feature Selection

In this section, we show the effect of the feature selection algorithm that we used. For this experiment we used 10-fold cross validation on both of the datasets to find the optimal set of features using the recursive feature elimination technique. We varied the number of features from 25 to 100 using the recursive feature elimination technique for two SVM kernels: sigmoid and linear. The highest accuracy was found when the number of reduced features was set to 86. Figure 1 shows the plot of accuracy against the number of reduced features for the two classifiers. The list of selected features is provided in Supplementary file 1.
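The sweep itself is straightforward to reproduce. Below is a minimal sketch, assuming the full feature matrix X (one 1548-dimensional row per protein) and the binary labels y are already loaded; the variable names are illustrative only.

    # Sketch of the feature-count sweep over RFE-reduced feature sets.
    # X (n_samples x 1548) and y are assumed to be loaded already.
    from sklearn.svm import SVC
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import cross_val_score

    accuracies = {}
    for n_features in range(25, 101):
        svm = SVC(kernel='linear', C=1000)
        selector = RFE(estimator=svm, n_features_to_select=n_features, step=1)
        X_reduced = selector.fit_transform(X, y)
        # 10-fold cross-validated accuracy on the reduced feature set
        scores = cross_val_score(svm, X_reduced, y, cv=10, scoring='accuracy')
        accuracies[n_features] = scores.mean()

    best_n = max(accuracies, key=accuracies.get)  # 86 in our experiments

Repeating the same loop with SVC(kernel='sigmoid') produces the second curve in Fig. 1.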

Figure 1

Effect of number of features selected on the accuracy on the benchmark dataset.

A color map of the rankings of the features, as ranked by the RFE algorithm, is given in Fig. 2. This color map depicts the distribution of the selected features over all the features. The selected features include the Dubchak features, PSSM bigram, PSSM auto-covariance, PSSM 1-lead bigram and PSSM segmented distribution from the evolutionary group of features extracted from the PSSM; the rest of the selected features were structural features generated by SPIDER2. This reveals the importance of both types of features: evolutionary and structural. A list of selected features is given in the supporting information.

Figure 2

Color map showing the importance or ranking of the features on the benchmark dataset.

We then compared the performance of this feature selection technique with other feature selection techniques, namely a tree-based method50 and randomized sparse elimination51,52, as well as with no feature elimination. We performed 10-fold cross validation for these experiments too, applied the different feature elimination techniques to the benchmark dataset, and report the results in Table 3.

Table 3 Comparison of performance of different feature selection methods on the benchmark dataset using 10-fold cross validation.

Here too, the best values achieved are shown in bold-faced font. It is easy to see that the recursive feature elimination technique was the best among the feature elimination techniques used in the experiments. We also show the Receiver Operating Characteristic (ROC) curve of each of these methods for the benchmark dataset in Fig. 3.

Figure 3

Receiver Operating Characteristic (ROC) curve of different feature selection methods on the benchmark dataset.

Effect of Classifier Selection

To justify the classifier selection for our algorithm, we ran another set of experiments on the benchmark dataset using 10-fold cross validation. Several classifiers were tested in the experiments: SVM with linear kernel, SVM with Radial Basis Function (RBF) kernel, SVM with sigmoid kernel, Random Forest Classifier, Naive Bayes Classifier and Logistic Regression Classifier. The results achieved in these experiments are shown in Table 4.

Table 4 Comparison of performance of different Classifiers on the benchmark dataset using 10-fold cross validation.

The best values in Table 4 are shown in bold-faced font. The SVM classifier with the linear kernel outperformed all other classifiers. The closest competitors to the linear kernel were the logistic regression classifier and the SVM with the RBF kernel. We also show the ROC curve for this experiment in Fig. 4.

Figure 4

Receiver Operating Characteristic (ROC) curve of different classifiers for the benchmark dataset.

Web Server Implementation

To make the predictor iDNAProt-ES freely available for use and testing, we implemented a web server. This web application is freely available at: http://brl.uiu.ac.bd/iDNAProt-ES/. The website is very easy to use, and the model behind it is trained on the benchmark dataset. To use this site for the identification of DNA-binding proteins, one has to provide two input files: a PSSM file generated by PSI-BLAST53 and an SPD file generated by SPIDER254. After these files are uploaded, iDNAProt-ES will extract features and follow a procedure similar to that shown in Fig. 5. A detailed guideline for using the predictor is provided on the website. A screenshot of the web application is given in Fig. 6. As pointed out in39 and demonstrated in a series of recent publications41,42,43,44,45,46,47,48,55, user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful prediction methods and for enhancing their impact39; we shall make efforts to ensure that the iDNAProt-ES server is always in a normal working state.

Figure 5

System flow diagram of iDNAProt-ES showing the training and prediction procedures as a flowchart.

Figure 6

Screenshot of the web-server homepage.

Materials and Methods

To establish a novel feature set and a good predictor, we first collected two benchmark datasets. We then extracted features from the datasets that are able to discriminate DNA-binding proteins, derived a list of reduced features from the global feature set that contribute to improved prediction accuracy, and selected and developed a powerful classification algorithm to perform the prediction. Finally, we performed cross-validation tests to evaluate the accuracy of the predictor.

The framework of our proposed method iDNAProt-ES is depicted in Fig. 5. There are two phases in the framework: a training phase and a prediction phase. In the training phase, a training dataset is first selected. Each protein sequence from the training dataset is then passed to the PSI-BLAST53 and SPIDER356 tools, which produce two output files, PSSM and SPD3, respectively. The PSSM file carries the evolutionary information and the SPD3 file carries the structural information. These two files are then passed to the iDNAProt-ES feature extractor, which extracts 14 sets of features containing 1548 sub-features in total. Note that tools and application servers are available in the literature that extract features from PSSM files57. All these extracted features (1548) are then passed to the iDNAProt-ES feature selector, which reduces the features to improve the prediction accuracy. The resulting list of reduced features is provided in Supplementary file 1. The reduced features are used to train a model using the SVM classifier, which is stored for later prediction.

In the prediction phase, iDNAProt-ES first takes a query protein sequence and passes it to PSI-BLAST and SPIDER3 to generate the two output files, PSSM and SPD3, respectively, just as in the training phase. These two files are then processed by the feature extractor and feature selector of iDNAProt-ES. The reduced features are passed to the model saved in the training phase to predict whether the protein is DNA-binding or not. This phase takes very little time compared to the training phase.
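For concreteness, the prediction phase can be sketched as follows, assuming the fitted feature selector and the trained model were serialized during training; extract_features() and the file names are hypothetical placeholders for the corresponding iDNAProt-ES components.

    # Sketch of the prediction phase. extract_features() and the file names
    # are hypothetical placeholders for the iDNAProt-ES components.
    import joblib

    def predict_protein(pssm_file, spd_file):
        features = extract_features(pssm_file, spd_file)  # 1548-dim vector
        selector = joblib.load('rfe_selector.pkl')        # fitted during training
        model = joblib.load('svm_model.pkl')              # trained linear SVM
        reduced = selector.transform(features.reshape(1, -1))
        label = model.predict(reduced)[0]
        return 'DNA-binding' if label == 1 else 'non-DNA-binding'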

Datasets

We require a set of reliable benchmark datasets in order to develop an effective predictor using a suitable classification algorithm and feature set. Any such dataset consists of positive and negative samples and can be formally denoted as follows:

$${\mathbb{S}}={{\mathbb{S}}}^{+}\cup {{\mathbb{S}}}^{-}$$
(1)

Here \({{\mathbb{S}}}^{+}\) represents the set of positive instances, or DNA-binding proteins, and \({{\mathbb{S}}}^{-}\) denotes the negative samples, or non-DNA-binding proteins. In this paper, we use two datasets that are extensively used in the literature on the DNA-binding protein prediction problem13,14,16,29,58. The first dataset, which we refer to as the benchmark dataset throughout this paper, was introduced in13. The DNA-binding proteins were extracted from the latest version of the Protein Data Bank (PDB)59 with the mmCIF keyword ‘DNA-binding protein’ using the advanced search interface. To build a high-quality, non-redundant benchmark, the authors first removed all sequences with length less than 50 and then removed all protein sequences with unknown amino acids (identified in the sequence with the non-standard symbols ‘X’ or ‘Z’). Finally, they removed all proteins with more than 25% sequence similarity using PISCES40. In this way, they guaranteed that there is no or very little structural overlap among the proteins in the benchmark13,14,16. The resulting benchmark dataset consists of 525 DNA-binding proteins and 550 non-DNA-binding proteins. DNA-binding and non-DNA-binding proteins were specified in the following manner: proteins from different domains were first identified, and those with DNA-binding sites were labeled as DNA-binding proteins while those without such sites were labeled as non-DNA-binding proteins13,14. Note that the input for this benchmark is a whole protein, not a binding domain, and the target is to determine whether a given protein has any binding site (a DNA-binding protein) or not (a non-DNA-binding protein). It is important to highlight that having proteins with very low sequence similarity (less than 25%), at least 50 amino acids and no unknown residues guarantees no or very low domain overlap13,14,16,29,58.

The second benchmark, used as the independent test dataset, was constructed by Lou et al.15. We use this dataset, which is referred to as PDB186, to be able to directly compare our results on an independent test set with previous studies in the literature. In this dataset, 93 proteins are DNA-binding proteins and 93 are non-DNA-binding proteins. Similarly strict criteria were used to extract this benchmark: proteins shorter than 60 amino acids were first removed, along with those containing unknown (‘X’ or ‘Z’) residues, and NCBI’s BLASTCLUST53 was then used to remove proteins with more than 25% sequence identity.
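The sequence-level filters used for both benchmarks are simple to state in code; the following sketch covers only the length and unknown-residue criteria, since the similarity reduction is performed with external tools (PISCES and BLASTCLUST, respectively).

    # Sketch of the sequence-level filters described above. min_length is 50
    # for the benchmark dataset and 60 for PDB186.
    def passes_filters(sequence, min_length=50):
        if len(sequence) < min_length:
            return False
        if 'X' in sequence or 'Z' in sequence:  # unknown residues
            return False
        return True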

Feature Extraction

Different types of feature extraction methods are used in the literature on DNA-binding protein prediction. These include: pseudo position specific scoring matrix based features16, the pseudo amino acid composition proposed by Chou together with physicochemical distance transformation29, etc. In this study, we explore evolutionary and structural information embedded in the protein sequences as features. Protein sequences are used to fetch evolutionary information in the form of PSSM (Position-Specific Scoring Matrix) files generated by PSI-BLAST53. In addition, structural information is extracted from the SPD files output by the SPIDER254 software. The following sections describe the feature extraction in detail.

PSSM based features

We used evolutionary information from PSSM files generated by three iterations of the PSI-BLAST algorithm53 against the non-redundant (nr) database provided by NCBI. The E-value cut-off threshold was set to 0.001. The PSSM file contains the log-odds of the substitution probabilities of a given protein at each position for all possible amino acid symbols after alignment60. This is an L × 20 matrix, which we refer to in this paper as the PSSM matrix. Consider a protein sequence P consisting of L amino acid residues:

$$P={R}_{1}{R}_{2}{R}_{3}\cdots {R}_{L}$$
(2)

The frequency profile of P generated by PSI-BLAST53 can be represented as the matrix \({\mathbb{M}}\):

$${\mathbb{M}}=\{\begin{array}{cccc}{m}_{1,1} & {m}_{1,2} & \cdots & {m}_{1,L}\\ {m}_{2,1} & {m}_{2,2} & \cdots & {m}_{2,L}\\ \vdots & \vdots & \ddots & \vdots \\ {m}_{20,1} & {m}_{20,2} & \cdots & {m}_{20,L}\end{array}\}$$
(3)

where 20 is the number of standard amino acids and \(m_{i,j}\) is the target frequency representing the probability of amino acid i (i = 1, 2, …, 20) appearing at sequence position j (j = 1, 2, …, L) of protein P during the evolutionary process. We first normalize the PSSM matrix using the procedure proposed in61 for protein sub-cellular localization. After normalization, we generated five groups of features from the normalized PSSM matrix. Throughout this section we denote the normalized matrix by N, a two-dimensional matrix of dimension L × 20 with rows indexed by sequence position. The features generated from the PSSM information are enumerated in the following; a code sketch follows the list:

  1.

    Amino acid composition: The PSSM file is used to generate a consensus sequence, which is built by taking the amino acid with the highest substitution probability or frequency in the PSSM matrix at each position. The amino acid composition then counts the occurrences of each amino acid residue and normalizes by the length of the protein sequence.

    $$AA{C}_{j}=\frac{1}{L}\,\sum _{i=1}^{L}\,aa(i,j),\,1\le j\le 20$$
    (4)

    Here,

    $$aa(i,j)=\{\begin{array}{ll}1, & {\rm{if}}\,{s}_{i}={a}_{j}\\ 0, & {\rm{else}}\end{array}$$

    where \(s_i\) is the amino acid at position i of the consensus sequence and \(a_j\) is one of the 20 different amino acid symbols62.

  2.

    Dubchak features: These features were previously used for protein fold recognition63 and protein subcellular localization61. They group the amino acid residues according to various physicochemical properties such as polarity, solubility and hydrophobicity, and calculate the composition, transition and distribution of these groupings. The size of the feature vector is 105.

  3.

    PSSM Bigram: The PSSM bigram represents the transition probabilities between two adjacent amino acid residue positions. These features were previously used in protein subcellular localization and protein fold recognition61,63 and are defined as follows:

    $$\mathrm{PSSM} \mbox{-} \mathrm{bigram}(k,l)=\frac{1}{L}\,\sum _{i=1}^{L-1}\,{N}_{i,k}{N}_{i+\mathrm{1,}l}\mathrm{(1}\le k\le 20,1\le l\le 20)$$
    (5)
  4.

    PSSM 1-lead Bigram: The PSSM 1-lead bigram is defined over the transition probabilities of amino acid residue positions separated by one intervening position. It can be formally defined as:

    $$\mathrm{PSSM} \mbox{-} 1 \mbox{-} \mathrm{lead} \mbox{-} \mathrm{bigram}(k,l)=\frac{1}{L}\,\sum _{i=1}^{L-2}\,{N}_{i,k}{N}_{i+\mathrm{2,}l}\mathrm{(1}\le k\le 20,1\le l\le \mathrm{20)}$$
    (6)
  5.

    PSSM Composition: The PSSM composition is created by taking the normalized sum of the values in each column of the normalized PSSM matrix61. Each column represents one of the 20 amino acid residues. It is defined as:

    $$\mathrm{PSSM} \mbox{-} \mathrm{Composition}(j)=\frac{1}{L}\,\sum _{i=1}^{L}\,{N}_{i,j}\,\mathrm{(1}\le j\le \mathrm{20)}$$
    (7)
  6.

    PSSM Auto-Covariance: The auto-covariance of the PSSM is a feature61,64 that depends on a distance factor DF as a parameter. In this study we used DF = 10. The feature is formally defined as:

    $$\mathrm{PSSM} \mbox{-} \mathrm{Auto} \mbox{-} \mathrm{Covariance}(k,j)=\frac{1}{L}\,\sum _{i=1}^{L-k}\,{N}_{i,j}{N}_{i+k,j}\mathrm{(1}\le j\le 20,1\le k\le DF)$$
    (8)
  7.

    PSSM Segmented Distribution: The segmented distribution of the PSSM matrix, proposed in65, was previously used as a feature for the sub-cellular localization of proteins in66. The idea is to capture the distribution of the values in each column of the PSSM matrix by computing partial sums column-wise, starting from the first row and from the last row, and iterating until the running partial sum reaches F_p% of the total sum. The details of this feature generation procedure can be found in65,66,67. In this paper, we used F_p = 5, 10, 25.
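To make the PSSM feature definitions concrete, the following sketch computes three of the groups above (Eqs. 5, 7 and 8) directly from the normalized L × 20 matrix N; the normalization step following61 is assumed to have been applied already.

    # Sketch of three PSSM feature groups from the normalized L x 20 matrix N.
    import numpy as np

    def pssm_composition(N):                      # Eq. (7): 20 values
        return N.sum(axis=0) / N.shape[0]

    def pssm_bigram(N):                           # Eq. (5): 20 x 20 = 400 values
        return (N[:-1].T @ N[1:]) / N.shape[0]    # sum_i N[i,k] * N[i+1,l]

    def pssm_auto_covariance(N, DF=10):           # Eq. (8): 20 x DF = 200 values
        L = N.shape[0]
        return np.array([[N[:L - k, j] @ N[k:, j] / L
                          for j in range(20)] for k in range(1, DF + 1)])

The 1-lead bigram (Eq. 6) is obtained in the same way as the bigram, with N[:-2] and N[2:] in place of N[:-1] and N[1:].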

SPIDER based features

We used SPIDER254, a freely available software tool that provides the accessible surface area, torsion angles and secondary structure motif at each amino acid residue position. We then extract a novel set of features from the information provided by SPIDER2 in the SPD file. The features are enumerated here in detail; a code sketch follows the list:

  1.

    Secondary Structure Occurrence: There are three types of structural motifs in proteins: α-helix (H), β-sheet (E) and random coil (C). The secondary structure occurrence is the count or frequency of each type over the amino acid residue positions.

    $$\mathrm{SS} \mbox{-} \mathrm{Occurrence}(i)=\sum _{j=1}^{L}\,s{m}_{ij},1\le i\le 3$$
    (9)

    Here, L is the length of the protein and

    $$s{m}_{ij}=\{\begin{array}{ll}1, & {\rm{if}}\,S{S}_{j}={\mu }_{i}\\ 0, & {\rm{else}}\end{array}$$

    where \(SS_j\) is the structural motif at position j of the protein sequence and \(\mu_i\) is one of the 3 different motif symbols.

  2.

    Secondary Structure Composition: This feature is the secondary structure motif occurrence normalized by the length of the protein. It is similar to the amino acid composition except that here we count motif symbols instead of amino acid symbols.

    $$\mathrm{SS} \mbox{-} \mathrm{Composition}(i)=\frac{1}{L}\,\sum _{j=1}^{L}\,s{m}_{ij},1\le i\le 3$$
    (10)

    Here, L is the length of the protein and

    $$s{m}_{ij}=\{\begin{array}{ll}1, & {\rm{if}}\,S{S}_{j}={\mu }_{i}\\ 0, & {\rm{else}}\end{array}$$

    where \(SS_j\) is the structural motif at position j of the protein sequence and \(\mu_i\) is one of the 3 different motif symbols.

  3.

    Accessible Surface Area Composition: The accessible surface area composition is the normalized sum of the accessible surface area values, defined by:

    $$\mathrm{ASA} \mbox{-} \mathrm{Composition}=\frac{1}{L}\,\sum _{i=1}^{L}\,ASA(i)$$
    (11)
  4.

    Torsional Angles Composition: For the four types of torsional angles, ϕ, ψ, τ and θ, we first convert each angle from degrees to radians and then take the sine and cosine of the angle at each residue position. Thus we get a matrix of dimension L × 8, which we denote by T in this section. The torsional angles composition is defined as:

    $$\mathrm{Torsional} \mbox{-} \mathrm{Angles} \mbox{-} \mathrm{Composition}({\rm{k}})=\frac{1}{L}\,\sum _{i=1}^{L}\,{T}_{i,k}\mathrm{(1}\le k\le \mathrm{8)}$$
    (12)
  5.

    Structural Probabilities Composition: The structural probabilities for each amino acid residue position are given in the SPD3 file as a matrix of dimension L × 3, which we denote by P. The structural probabilities composition is defined as:

    $$\mathrm{Structural} \mbox{-} \mathrm{Probabilities} \mbox{-} \mathrm{Composition}(k)=\frac{1}{L}\,\sum _{i=1}^{L}\,{P}_{i,k}\mathrm{(1}\le k\le \mathrm{3)}$$
    (13)
  6.

    Torsional Angles Bigram: The bigram of the torsional angles is analogous to that of the PSSM matrix and is defined as:

    $$\mathrm{Torsional} \mbox{-} \mathrm{angles} \mbox{-} \mathrm{bigram}(k,l)=\frac{1}{L}\,\sum _{i=1}^{L-1}\,{T}_{i,k}{T}_{i+\mathrm{1,}l}\mathrm{(1}\le k\le 8,1\le l\le \mathrm{8)}$$
    (14)
  7.

    Structural Probabilities Bigram: The bigram of the structural probabilities is analogous to that of the PSSM matrix and is defined as:

    $$\mathrm{Structural} \mbox{-} \mathrm{Probabilities} \mbox{-} \mathrm{bigram}(k,l)=\frac{1}{L}\,\sum _{i=1}^{L-1}\,{P}_{i,k}{P}_{i+\mathrm{1,}l}\mathrm{(1}\le k\le 3,1\le l\le \mathrm{3)}$$
    (15)
  8.

    Torsional Angles Auto-Covariance: This feature is also derived from the torsional angles and is defined as:

    $$\mathrm{Torsional} \mbox{-} \mathrm{Angles} \mbox{-} \mathrm{Auto} \mbox{-} \mathrm{Covariance}(k,j)=\frac{1}{L}\,\sum _{i=1}^{L-k}\,{T}_{i,j}{T}_{i+k,j}\mathrm{(1}\le j\le 8,1\le k\le DF)$$
    (16)
  9.

    Structural Probabilities Auto-Covariance: This feature is also derived from the structural probabilities and is defined as:

$$\mathrm{Structural} \mbox{-} \mathrm{Probabilities} \mbox{-} \mathrm{Auto} \mbox{-} \mathrm{Covariance}(k,j)=\frac{1}{L}\,\sum _{i=1}^{L-k}\,{P}_{i,j}{P}_{i+k,j}\mathrm{(1}\le j\le 3,1\le k\le DF)$$
(17)
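As a concrete illustration of the SPIDER-based features, the sketch below encodes the four torsion angles as sines and cosines and computes the composition and bigram features of Eqs. 12 and 14; the input angles are assumed to be read from the SPD file in degrees.

    # Sketch of the torsion-angle features. `angles_deg` is an L x 4 array of
    # (phi, psi, tau, theta) in degrees taken from the SPD file.
    import numpy as np

    def torsion_matrix(angles_deg):
        rad = np.deg2rad(angles_deg)                  # L x 4
        return np.hstack([np.sin(rad), np.cos(rad)])  # T: L x 8

    def torsion_composition(T):                       # Eq. (12): 8 values
        return T.mean(axis=0)

    def torsion_bigram(T):                            # Eq. (14): 8 x 8 = 64 values
        return (T[:-1].T @ T[1:]) / T.shape[0]

The structural-probability features (Eqs. 13, 15 and 17) follow the same pattern with the L × 3 probability matrix P in place of T.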

The features generated and used in this paper are summarized in Table 5.

Table 5 Summary of evolutionary and structural features used in this paper.

Feature Selection

As the number of extracted features is large, we apply feature reduction to derive an optimal set of features for DNA-binding protein prediction. Several feature elimination techniques have been used previously, such as the correlation-based feature subset selection method25, tree-based feature selection15 and best-first greedy feature selection15. In this paper, we have used Recursive Feature Elimination (RFE), first proposed in68. The algorithm is depicted as pseudo-code in Algorithm 1. It is a backward feature elimination technique that takes as parameters a dataset \({\mathbb{D}}\), a classifier \({\mathbb{C}}\) and k, the target number of reduced features. In each iteration, the dataset is used to train a model \({\mathbb{M}}\), and based on that model the lowest-ranked feature is removed. The dataset is then transformed to the remaining features. This process continues until the number of features equals k.

Algorithm 1

RecursiveFeatureElimination(\({\mathbb{D}}\), \({\mathbb{C}}\), k).
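Since Algorithm 1 is shown only as a figure, a minimal Python rendering of the same loop is given below, assuming a linear classifier that exposes its coefficient weights for ranking; scikit-learn's sklearn.feature_selection.RFE implements the same procedure.

    # Minimal rendering of Algorithm 1. clf is a linear classifier exposing
    # coef_ (e.g., a linear-kernel SVM); k is the target number of features.
    import numpy as np

    def recursive_feature_elimination(X, y, clf, k):
        remaining = list(range(X.shape[1]))
        while len(remaining) > k:
            clf.fit(X[:, remaining], y)
            weights = np.abs(clf.coef_).ravel()  # rank features by |weight|
            worst = int(np.argmin(weights))      # lowest-ranked feature
            del remaining[worst]                 # eliminate it and retrain
        return remaining                         # indices of the k kept features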

Description of the Classifier

We have used the Support Vector Machine (SVM) as the classifier in our method, iDNAProt-ES. An SVM69,70 constructs a separating hyper-plane that maximizes the margin between the positive and negative instances. The training points nearest to the hyper-plane are called support vectors. The SVM maps an input vector from the input space into a higher-dimensional space, where the mapping is determined by a kernel function, and constructs the separating hyper-plane in that space based on the training dataset. A trained SVM outputs a class label (in our case, DNA-binding protein or non-DNA-binding protein) for the mapped input vector. There are a number of popular kernels. In this paper we explore the three kernel functions described below:

  1.

    The Linear kernel function can be defined as

    $$K({X}_{i},{X}_{j})={X}_{i}\cdot {X}_{j}$$
    (18)
  2.

    The Gaussian or Radial Basis Function (RBF) kernel can be defined as

    $$K({X}_{i},{X}_{j})=exp(-\gamma {(\Vert {X}_{i}-{X}_{j}\Vert )}^{2})$$
    (19)
  3.

    The Sigmoid kernel function can be defined as

$$K({X}_{i},{X}_{j})=\,\tanh (\gamma \cdot {X}_{i}\cdot {X}_{j}+r)$$
(20)

Here γ and r are the kernel parameters; γ must be greater than 0. The best kernel in our experiments was the linear kernel, with parameters C = 1000 and γ = 0.01.
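A minimal sketch of the final classifier configuration with the reported parameters is given below; note that in scikit-learn the linear kernel ignores γ, which is listed here only because it was tuned alongside C.

    # Sketch of the final classifier with the reported parameters (C = 1000,
    # gamma = 0.01). The linear kernel in scikit-learn does not use gamma.
    from sklearn.svm import SVC

    clf = SVC(kernel='linear', C=1000, gamma=0.01, probability=True)
    clf.fit(X_train, y_train)   # X_train: reduced 86-feature training matrix
    y_pred = clf.predict(X_test)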

Performance Evaluation

Evaluating the performance of a new predictor is essential71. Various comparison metrics are used in the literature14,61,72 to evaluate the performance of a predictor. Two cross-validation methods are often used: the sub-sampling or K-fold (e.g., 5-fold, 10-fold) test and the jackknife test73. According to the penetrating analysis in31, the jackknife test is less arbitrary than the sub-sampling test. Therefore, the jackknife test has been widely recognized and increasingly adopted by researchers to examine the quality of various predictors74,75,76,77, including in the literature on DNA-binding protein prediction13,15,29,58. In this study, we used both K-fold cross-validation and the jackknife test.
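Both protocols are available off the shelf in scikit-learn; the jackknife test corresponds to leave-one-out cross-validation, in which each sample is held out exactly once. A minimal sketch, assuming the classifier clf and the data X, y from above:

    # K-fold and jackknife (leave-one-out) evaluation of the same classifier.
    from sklearn.model_selection import cross_val_score, LeaveOneOut

    kfold_acc = cross_val_score(clf, X, y, cv=10, scoring='accuracy').mean()
    jackknife_acc = cross_val_score(clf, X, y, cv=LeaveOneOut(),
                                    scoring='accuracy').mean()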

We use five performance metrics, i.e. sensitivity (Sn), specificity (Sp), accuracy (Acc), Matthews correlation coefficient (MCC) and the area under the ROC curve (AUC), to measure the prediction performance in comparison with the other methods in the literature. The first four metrics are defined as follows:

$$Sn=\frac{TP}{TP+FN}$$
(21)
$$Sp=\frac{TN}{TN+FP}$$
(22)
$$Acc=\frac{TP+TN}{TP+TN+FP+FN}$$
(23)
$$MCC=\frac{(TP\times TN)-(FP\times FN)}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}$$
(24)

where TP, FP, TN and FN represent the numbers of true positives, false positives, true negatives and false negatives, respectively. This set of metrics is valid only for single-label systems. For multi-label systems, whose existence has become more frequent in systems biology78,79 and systems medicine41,55, a completely different set of metrics, as defined in80, is needed. In this study, we also use the area under the receiver operating characteristic curve (auROC) to assess the prediction performance. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) at different threshold settings. A predictor with perfect classification has a ROC curve passing through the top left corner (100% sensitivity and 100% specificity); therefore, the closer the ROC curve is to the top left corner, the better the overall performance of the predictor. Thus, auROC is used as the primary measure of how well a predictor can distinguish between the two classes.
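All of the reported metrics can be computed directly from the confusion matrix and the classifier's decision scores; a minimal sketch, assuming test labels y_true, predictions y_pred and continuous scores y_score:

    # The five reported metrics (Eqs. 21-24 plus auROC).
    from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sn = tp / (tp + fn)                      # sensitivity, Eq. (21)
    sp = tn / (tn + fp)                      # specificity, Eq. (22)
    acc = (tp + tn) / (tp + tn + fp + fn)    # accuracy, Eq. (23)
    mcc = matthews_corrcoef(y_true, y_pred)  # Eq. (24)
    auroc = roc_auc_score(y_true, y_score)   # area under the ROC curve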

Data and Material Availability

All the data and materials used in this paper are available at: http://brl.uiu.ac.bd/iDNAProt-ES/.

Conclusion

In this paper, we presented iDNAProt-ES, a novel prediction method for the identification of DNA-binding proteins. We used evolutionary and structural features for classification, extracted from the PSSM and SPD files generated by PSI-BLAST and SPIDER2, respectively. We also used recursive feature elimination to select an optimal set of features. The final prediction model was built using a Support Vector Machine (SVM) with a linear kernel. iDNAProt-ES was tested on a standard benchmark dataset and an independent dataset and achieved significantly improved results on both. The method is freely available for use at: http://brl.uiu.ac.bd/iDNAProt-ES/.

The superiority of iDNAProt-ES was clearly noticeable in the experiments carried out in this study. In the future, we wish to update the prediction method by incorporating an enhanced dataset. For practical application, as pointed out previously21, a key issue is that the number of non-DNA-binding proteins is much higher than that of DNA-binding proteins. Therefore, an enhanced dataset combined with class-balancing methods could further improve the performance of the predictor.