PDNAsite: Identification of DNA-binding Site from Protein Sequence by Incorporating Spatial and Sequence Context

Protein-DNA interactions are involved in many fundamental biological processes essential for cellular function. Most of the existing computational approaches employed only the sequence context of the target residue for its prediction. In the present study, for each target residue, we applied both the spatial context and the sequence context to construct the feature space. Subsequently, Latent Semantic Analysis (LSA) was applied to remove the redundancies in the feature space. Finally, a predictor (PDNAsite) was developed through the integration of the support vector machines (SVM) classifier and ensemble learning. Results on the PDNA-62 and the PDNA-224 datasets demonstrate that features extracted from spatial context provide more information than those from sequence context and the combination of them gives more performance gain. An analysis of the number of binding sites in the spatial context of the target site indicates that the interactions between binding sites next to each other are important for protein-DNA recognition and their binding ability. The comparison between our proposed PDNAsite method and the existing methods indicate that PDNAsite outperforms most of the existing methods and is a useful tool for DNA-binding site identification. A web-server of our predictor (http://hlt.hitsz.edu.cn:8080/PDNAsite/) is made available for free public accessible to the biological research community.

Protein-DNA interactions play important roles in a wide range of fundamental biological processes such as gene regulation, transcription, DNA replication, DNA repair and DNA packaging [1][2][3][4][5] . The knowledge about DNA-binding residues, binding specificity and binding affinity helps to not only understand the recognition mechanism of protein-DNA complex, but also give clues for protein function annotation. For example, Ptashne 6 has reported that the interactions between DNA and transcription factors are essential for gene replication and transcription regulation; Kornberg 7 has presented that the interactions between DNA and histones are involved in chromosome packaging in the cell nucleus. Bullock and Fersht 8 have shown that mutations of DNA-binding residues, such as those on the tumor repressor protein P53, may predispose individuals to cancer. Therefore, a reliable identification of DNA-binding sites in DNA-binding protein is important for protein function annotation, in silico modeling of transcription regulation and site-directed mutagenesis. Several experimental techniques have been proposed to identify the DNA-binding sites and investigate the interaction modes between proteins and DNAs. For example, biophysical methods are used to uncover the molecular details of specific residue-residue contacts; alanine-scanning mutagenesis has been employed to identify the amino acids involved in target recognition 9 by the m5C methyltransferase and to distinguish specific amino acids important for DNA binding and transcription activation by SoxS 10 . However, traditional experimental techniques are very time-consuming and laborious to operate. There is an urgent need for computational tools that can rapidly and reliably identify DNA-binding sites in DNA-binding proteins.
Many machine learning based predictors have been developed for the aforementioned task. They are typically trained from a set of input features, which can be generally divided into three categories: protein sequence

Materials and Methods
Several recent publications [26][27][28] have demonstrated that a useful biological prediction model should consist of valid benchmark dataset(s), an effective feature extraction procedure, an efficient predicting algorithm, a fair evaluation criteria and a web-server. Following this criteria, we describe the PDNAsite in details.
The framework diagram of PDNAsite is shown in Fig. 1 where the training phrase is shown on the left panel while the test phase is shown on the right panel. The method contains three steps: feature extraction, LSA operation and ensemble learning. Feature extraction is used to extract seven types of features from the sequence context and the structure context by applying the sequence and spatial sliding windows, respectively. Then LSA is operated on the sub feature space spanned by PSSM features to remove redundancy and reduce dimensionality. Finally, ensemble learning integrates several SVM base classifiers for prediction.
Data sets. In this study, two protein sequence data sets used in previous studies are used to evaluate the performance of PDNAsite. They are PDNA-62 and PDNA-224, respectively.
Scientific RepoRts | 6:27653 | DOI: 10.1038/srep27653 PDNA-62. PDNA-62 was firstly constructed by Ahmad et al. 15 to train an ANN classifier to distinguish DNA-binding sites from non-binding residues. It was later employed to train different machine learning classifiers by many studies, including ANN, SVM, Random Forest and Naïve Bayes 12,14,17,21 . PDNA-62 was derived from the structure data of 62 protein-DNA complexes in the Protein Data Bank (PDB) 20 . The dataset contains 67 sequences and the sequence identity between any two sequences is less than 25%. As in most previous studies 15,17,29,30 , in the structure of the protein-DNA complexes, a residue in protein is regarded as interacting with DNA if the side chain or the backbone atoms of the residue falls within a cutoff distance of 3.5 Å from any atom of the partner DNA molecule in the complex. All the other residues were regarded as non-binding sites. As a result, this data set contains 1,215 DNA-binding residues and 6,948 non-binding residues. As this dataset has been used in many studies, it is convenient for comparing the predicting accuracy of PDNAsite with that of other existing methods.
PDNA-224. PDNA-224 was proposed recently by Li et al. 24 through extracting the structure data of 224 protein-DNA complexes from PDB. It includes 224 protein sequences and the redundancy between any two PDNAsite contains three steps: feature extraction, LSA operation and ensemble learning. Feature extraction is used to extracting features from the protein sequence and structure by applying the sequence and spatial sliding windows. LSA operation is applied to the sub feature space spanned by the PSSM features. Ensemble learning is applied to build several base classifiers and then integrates them into an ensemble classier. sequences has been removed by using 25% sequence identity as the cutoff similarity. By using the same criterion as in PDNA-62, there are 3,778 DNA-binding residues and 53,570 non DNA-binding residues. It has been applied to train a DNA-binding site predictor (PreDNA), which is a classifier built by integrating a SVM classifier based on sequence information and a classifier based on structural alignment algorithm. PreDNA has been reported as the best-performing predictor for DNA-binding site prediction up to now. So PDNA-224 was used to evaluate the performance of PDNAsite for comparison with the best-performing predictor. For more detailed information about PDNA-224, please refer to literature 24 . Evaluation metrics. In binary-labeled classification problems, five typical evaluation metrics are often used to evaluate the discriminative powers of DNA-binding predictors: Sensitivity (SN), Specificity (SP), Strength (ST), Accuracy (ACC), and Mathews Correlation Coefficient (MCC). The five metrics can be calculated by the following formulae where TP is the number of correctly predicted positive instances, TN the number of correctly predicted negative instances, FP the number of incorrectly predicted negative instances, and FN the number of incorrectly predicted positive instances, respectively.
Since the data sets used in this study are imbalanced, the strength(ST), taken as the average of sensitivity and specificity, is used to provide a fair measure of classifier performance 11,15,30,31 . Also, MCC can measure the matching degree between prediction results and real results. Therefore, in this paper, ST and MCC are used as the main metrics and the other three metrics are provided for reference only.
To further evaluate the discriminating power of classifiers on an imbalanced data set, the Receiver Operating Characteristic (ROC) curve 32 and the area under ROC curve (AUC) 33 are also used. The ROC curve is probably the most robust approach for classifier evaluation and comparison 33 . The ROC curve is drawn by plotting the true positive rate (i.e. sensitivity) against the false positive rate, which equals to (1-spicificity). In this study, the ROC curve is generated by varying the output threshold of a classifier and plotting the true positive rate against false positive rate for each threshold value. AUC is a reliable measure for classifier performance. An AUC of 1.0 indicates perfect classifier whereas an AUC of classifier no better than random is 0.5. Spatial and sequence context. In the study of DNA-binding site prediction, the residue-wise data instances derived from sequence were used as samples to train and evaluate classifiers. In order to make the full use of the sequence context for a target residue, a sliding window of size w(being an odd number) is used. Then, a residue-wise data instance was commonly defined as a fragment with w consecutive amino acids with the target residue positioned in the middle and (w-1)/2 neighboring residues on either side. The residues contained in the sequence sliding window provide the sequence context information for a target residue. However, research results in many literatures 22,25,34 have indicated that the spatial context can also contribute to the identification of DNA-binding site from non-binding sites. In order to extract the spatial context of a target site for its prediction, we also proposed a spatial sliding window with size m. The spatial sliding window is defined as a sphere with the target site positioned at the center and (m-1) sites with the shortest spatial distance to the target site contained in it. The distance between sites is calculated based on the coordinates of their C α atoms.
For a target site, the sites contained in the spatial sliding window are referred to as the spatial context, while the sites contained in the sequence window are referred to as the sequence context. As some residues in the sequence context may also be within the cutoff spatial distance from the target site, there may be sites contained simultaneously by both the sequence context and the spatial context, referred to as the overlapping sites. Since these sites are closed to the target site within both the sequence distance and the spatial distance, they can have greater effect on the function of the target site. So when the sequence context and the spatial context are combined to extract features, the overlapping sites should be used twice. Figure 2 shows the diagram of the sequence sliding window and the spatial sliding window of residue P of position 415 in chain N of 1A02. The red residue denotes the target site, the blue and cyan residues are the sites contained in the sequence sliding window and the magenta and cyan residues are the sites contained in the spatial sliding window, where the cyan residues are the overlapping sites between the sequence sliding window and the spatial sliding window.
Therefore, in this paper, a residue-wise data instance is defined as the combination of the sequence context and the spatial context. As a result, a residue-wise data instance should contain (m + w − 1) residues, since all the overlapping sites apart from the target site should be used twice. A residue-wise data instance is labeled with 1 (positive) if the target residue is binding or −1 (negative) if the target residue is non-binding. As SVM classifiers only take numerical values for classification, the residue-wise data instances need to be encoded into feature vectors. In this work, the feature space of residue-wise data instances is constructed by extracting the sequence information and structure information from the spatial context and the sequence context, including local amino acid composition, evolutionary information in terms of PSSM, B-factor, secondary structure, and Solvent accessible Scientific RepoRts | 6:27653 | DOI: 10.1038/srep27653 surface area. The spatial context and the sequence context are extracted by applying the spatial sliding window and the sequence sliding window, respectively. Details of the sequence characteristics and the structure characteristics used in this paper will be introduced in the following text.

Feature extraction. Evolutionary information. Evolutionary information in terms of PSSM of protein
sequences is obtained by running the PSI-BLAST 35 program to search against the non-redundant (NR) database through three iterations with 0.001 as the E-value cutoff for multiple sequence alignment. In PSSM, there are 20 values for each sequence position. In order to make full use of the evolutionary information, we process PSSM using the following two steps. First, all elements of a PSSM are scaled between 0 and 1 using the following equation.
Second, for a residue-wise data instance, the evolutionary information is incorporated into the feature vector by concatenating the corresponding PSSM columns for the sites in the spatial context and the sequence context. Furthermore, for a data instance, the sums in term of evolutionary information for each amino acid type are also calculated for the left sequence sliding window, the right sequence sliding window, the whole sequence sliding window, and the spatial sliding window, and are added into the feature vector.
Local amino acid composition. For a data instance, two local amino acid compositions are calculated. One is the amino acid composition for the residues in the sequence sliding window and the other is that for the residues in the spatial sliding window.
Identity vector. For a data instance, the identity of the target residue is incorporated by using a 20-feature vector with 1 occurring at the position corresponding to that residue type and 0 for the remaining residue types. For example, if the target residue is Arg, 1 is used at position four with 0 at all other 19 positions.
Solvent accessible surface area (ASA). In this paper, the ASA of every residue in protein is calculated from DSSP 36 . Before encoding the ASAs of the target residue and its neighboring residues, the ASA is divided by the maximum ASA of the corresponding residue type to calculate its relative ASA (RASA). Then, for a data instance, the RASA values of the residues in the spatial context and the sequence context are encoded and added into feature vector.
Secondary structure. Secondary structure assignments of all residues in the proteins are made with DSSP 36 , which classify every residue as one of the nine types: alpha helix (H), residue in isolated beta-bridge (B), extended strand participates in beta ladder (E), 3-helix (or 310 helix) (G), 5-helix (or pi-helix) (I), hydrogen-bonded turn (T), bend (S), loop (L) and irregular (no designation). In this paper, the 9 types of secondary structure are approximately combined into 3 types: helix (H), β -strand (E) and coil (C). The secondary structure of the target residue is encoded using mutually orthogonal binary vectors: (1,0,0) for helix, (0,1,0) for β -strand and (0,0,1) for coil. Additionally, the secondary structure compositions for the residues in the left sequence sliding window, the right sequence sliding window, the whole sequence sliding window and the spatial sliding window are added into the The diagram shows a part of the 3D structure of protein-DNA complex 1A02 and the structure is shown in form of cartoon. The red residue denotes the target site, the combination of the blue and the cyan residues is the sequence context of the target site and the combination of the magenta and the cyan residues is the spatial context of the target site, where the cyan residues are the overlapping sites between the sequence context and the spatial context. feature vector, respectively. The values in the structure composition denote the proportion of the number of residues with the corresponding secondary structure type over the total number.
Net charge of a residue. Due to the negative ambience around the DNA, the charge reciprocality of a residue may play an important role in its binding to the partner DNA. Therefore, the net charge of a residue is used as a feature for classification. A charge of +1 is ascribed to Arg and Lys and −1 to Asp and Glu. His is specified a charge of + 0.5 and all other residues are taken as neutral. The net charge of the sites in the sequence and spatial sliding windows are calculated.
B-factor of a residue. The B-factor of protein crystal structure reflects the fluctuation of atoms about their average positions and provides important information about protein dynamics. The thermal motion is useful for analyzing the dynamic properties of proteins 37 . Therefore, in this work, the B-factor of the C α and that of the C β of the residues in the sequence and spatial windows were encoded. In addition, the sum of the B-factor of the C α over the residues in spatial sliding window was also calculated.
Latent Semantic Analysis (LSA). LSA is a method for extracting and representing the contextual meaning of words by statistical computations. Latent Semantic Analysis (LSA) is suitable to remove redundancies in feature space. Recently, LSA has been successfully applied to many bioinformatics problems. For example, Dong and his coworkers 38 developed SVM classifiers for protein remote homology detection by applying the LSA operation. For this problem, the starting point of the LSA operation is the construction of a triplet-sequence matrix W with dimension (M* N) which denotes the co-occurrences between triplets and protein sequences. Triplets denote the combinations of three amino acid types. In the triplet-sequence matrix W, each sequence is expressed as a column vector. However, this representation does not recognize the triplets with similar function in the sequence and the dimension is too large. To resolve these problems, singular value decomposition is used to process the triplet-sequence matrix W. Let K be the rank of W, W can be decomposed into three matrices: . One can reduce the dimensions by deleting the smaller singular values in the diagonal matrix and ignore the corresponding columns of matrix U and rows of matrix V. Additionally, Liu et al. 39 further improved the prediction accuracy for protein remote homology detection by applying LSA on the top-n-gram-sequence matrix which denotes the co-occurrences between top-n-grams and protein sequences.
Through the analysis of the three matrices (word-document matrix, the triplet-sequence matrix, and the top-n-gram-sequence matrix), we discovered that all these three matrices are constructed by features of the same type, for example, words or biological sequences. We speculate that LSA could only be suitable for processing the feature space constructed by features of the same type, such as words in text, triplet or top-n-gram in protein sequence. In this paper, we construct a feature-instance matrix W which denotes the co-occurrences between features and protein sequences. Since there is much redundant information between the PSSM features, we need to apply LSA to decrease the redundant information. However, the features used to construct the matrix W do not belong to the same type. So matrix W cannot be processed by LSA directly. In this work, we take the sub space of W with dimension of (20* ( w + m − 1)) spanned by only the PSSM features, denoted by W′ . Since W′ contains features of the same, we can then use LSA. Support Vector Machine. SVM can be used to resolve both binary-labeled and multi-labeled classification problems. For a binary-labeled classification problem, SVM first maps the input feature space into a higher-dimensional space and then seeks an optimal hyperplane, which maximizes the separation margin between the two classes of training instances, to separate the positive instances from negative instances. As SVM can transform the input features of the instances from a low dimensional space to a higher dimensional space, it has superior generalization power for most classification problems. In this study, the LIBSVM software package available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/40 is used. The radial basis function (RBF) is taken as the kernel function. RBF is defined as where γ is a training parameter. A smallerγ value makes the decision boundary smoother. Another parameter for SVM training is the regularization factor C, which controls the trade-off between low training error and large margin. The optimal value of the parametersγ and C are obtained by five-fold cross-validation in this work. The data sets used in this study have many more negative instances than positive instances, which will have a great impact on the prediction performance of classifiers. In order to deal with imbalanced data sets, ensemble learning is used. Ensemble learning first divides the negative instances into n folds with non-overlapping instances, where the number of instances in each fold is approximately equal to that of the positive instances. Then the negative instances in each fold and the positive instances are combined to form a new data set. Thus n new data sets are constructed. Finally, the n new data sets are used as training data sets to train n base classifiers, which are subsequently combined as an ensemble classifier for prediction.
Five-fold cross-validation is a widely used validation method, where the data set is first divided into five folds with no overlapping instances, and each time one fold is used as the test set and the remaining four folds are taken as the training set. This process is repeated five times until all the instances in the original set are tested once. The average performance over five such runs is used as the final prediction performance. In this study, the performances of our method on the two data sets are evaluated by applying five-fold cross-validation.
Performance Results. Selection of window sizes w and m. To evaluate the performance of PDNAsite and compare it with other existing predictors, we first analyze the impacts of the sequential sliding window size w and the spatial sliding window size m on the prediction performance of PDNAsite. The impacts of w and m on the prediction performance of PDNAsite on PDNA-62 by five-fold cross-validation are shown in Fig. 3a,b, respectively. As can be seen from Performance comparison of sequence sliding window with spatial sliding window. In this study, the features from the sequence context and the spatial context are used to construct the feature vector for each target site. In order to find out the contributions of the spatial context and the sequence context to the identification of DNA-binding residue, we conduct performance evaluations using three sets of features: sequence context, spatial context and combined use of both. The performances of the predictors using different context on PDNA-62 and PDNA-224 are shown in Tables 1 and 2, respectively. As can be seen from Table 1, on PDNA-62, the predictor using spatial context achieved better performance than that using sequence context by 0.013 in terms of MCC, 0.96% in terms of ST and 0.007 in terms of AUC. The predictor using both of them achieved 0.563 MCC, 84.63% ST and 0.917 AUC, outperforming the one using sequence context alone by 0.036 MCC, 2.41% ST and 0.024 AUC (n = 5, p = 4.15E-4, 1.83E-4 and 3.15E-4, 1-tailed, paired t-test, for MCC, ST and AUC, respectively) indicating the improvement is quite significant. As can be seen from Table 2, on PDNA-224, the predictor using spatial context  The predictor using both the sequence and the spatial context.  achieved better performance than that using sequence context by 0.012 MCC, 1.38% ST and 0.012 AUC. The predictor using both of them outperformed the one using sequence context alone by 0.041 MCC, 2.96% ST and 0.026 AUC (n = 5, p = 9.7E-5, 2.54E-4 and 3.1E-5, 1-tailed, paired t-test, for MCC, ST and AUC, respectively) indicating the improvement is quite significant. The ROC curves of the predictors using different contexts on PDNA-62 and PDNA-224 are shown in Figs 4 and 5, respectively. The ROC curves of the predictors with different context also indicate that the spatial context gives more performance gain than the sequence context and the combination of them can further improve the prediction performance.

ACC (%) MCC SN (%) SP (%) ST (%) AUC
Application of LSA on feature-instance matrix W. LSA is an efficient feature extraction technique widely used to remove noise information for a feature space. In this paper, we applied LSA in two different ways: one is applying LSA on the whole feature space, and the other is employing LSA on the sub feature space spanned by PSSM features. The prediction performances of the two ways on PDNA-62 and PDNA-224 by five-fold cross-validation are shown in Tables 3 and 4   Comparison with existing methods. DNA-binding sites have been predicted successfully by many predictors.
To demonstrate the discriminating power of our proposed PDNAsite, its prediction performance is compared with other existing state-of-the-art methods. As a meaningful comparison must be made on the same data sets, the following predictors which used the either of the two datasets are used as comparison including Dps-pred 15 , Dbs-pssm 29 , BindN 12 , Dp-bind 21 , Dp-Bind 41 , BindN-RF 14 , BindN+ 17 which used the first dataset, and PreDNA 24 which used both datasets. PreDNA 24 is the best-performing predictor reported so far. It integrated a machine learning model and a structural alignment model for prediction where the structural alignment model used the amino acid-nucleotide pairs with distance less than 16 Å as the alignment units. In each alignment unit, the distance between the amino acid and the nucleotide is calculated based on their coordinates in the 3D structure of the protein-DNA complex. However, in most cases, the binding sites and the non-binding sites in the training dataset and the test dataset are defined based on the distances between the sites and its neighboring nucleotides. As such, the binding site can be distinguished from the non-binding site based on the distance information directly. We argue that for training classifiers for DNA-binding sites, the distance information between amino acid and nucleotide should not be used as features. Therefore, in order to fairly compare the performance of our proposed PDNAsite with the existing methods, we only consider the PreDNA without using the structural alignment model. The prediction accuracies of our method and the existing methods by five-fold cross-validation on PDNA-62 are shown in Table 5. As can be seen from the table, our method performs better than PreDNA by 0.082 MCC, 3.39% ST (n = 5, p = 2.3E-5, 3.36E-4 and 5.0E-5, 1-tailed, one-sample t-test, for MCC, ST and AUC, respectively), indicating that not only PDNAsite is the best performer, the improvement is significant on PDNA-62. The comparison between our predictor and PreDNA 24 on PDNA-224 by five-fold cross-validation is shown in Table 6. Our method outperforms PreDNA by 0.045 in MCC, 3.47% in ST and 0.010 in AUC (n = 5, p = 1.25E-4, 5.0E-4 and 1.5E-4, 1-tailed, one-sample t-test, for MCC, ST and AUC, respectively), indicating significant performance improvement.
DNABind 42 is a recently proposed predictor for DNA-binding site prediction, which also used some spatial context as classification features, including degree, closeness and betweenness 43 . These features are calculated from the graph structure formed by the target site and its spatial neighboring sites. The features used in this paper include the amino acid composition, secondary structure, evolutionary information and physiochemical information contained in spatial context. In order to demonstrate the effectiveness of the spatial context proposed in this paper for the prediction of DNA-binding site, we compared our predictor with DNABind 42 on DS123, HOLO83 and APO83. As our predictor is only trained by DS123 without using any information in the template library used by DNABind 42 , we just compared our predictor with the machine learning-based protocol in DNABind 42 (DNABind ML ). The results of the two methods are shown in Table 7. It can be observed that our method outperforms DNABind ML with 2.66% ST and 0.044 AUC (n = 5, p = 8.5E-5, 9.4E-5 and 5.0E-4, 1-tailed, one-sample t-test, for MCC, ST and AUC, respectively) on DS123 and with 3.98% ST and 0.009 AUC on HOLO83. On APO83, our predictor performs similar to that of DNABind ML .  Table 3. Performance comparison of predictor with different LSA on PDNA-62 by five-fold crossvalidation. a The predictor using both the sequence and spatial contexts and not applying LSA. b The predictor using both the sequence and spatial contexts and applying LSA on the whole feature space. c The predictor using both the sequence and the spatial context and applying LSA on the sub feature space spanned by the PSSM features, that is, our PDNAsite.  Table 4. Performance comparison of predictor with different LSA on PDNA-224 by five-fold crossvalidation. a The predictor using both the sequence and the spatial context. b The predictor using both the sequence and the spatial context and applying LSA on the whole feature space. c The predictor using both the sequence and the spatial context and applying LSA on the sub feature space spanned by the PSSM features, that is, our PDNAsite.

ACC (%) MCC SN (%) SP (%) ST (%) AUC
DNABR 19 is a sequence based DNA-binding site prediction method, which performs better than the three methods proposed by Wang et al., including BindN, BindN-RF and BindN+ . To compare with DNABR, an independent test dataset TS-72 with 72 protein chains is applied. TS-72 was first proposed for evaluating the performance of DNABR 19 by extracting proteins-DNA complexes from PDB 20 . On the dataset with 3.5 Å as the distance threshold, results show that the AUC values are 0.8783, 0.8669, 0.7488, 8257, and 0.8445 for our method, DNABR, BindN, BindN-RF, and BindN+ method, respectively. For this evaluation our method, BindN, BindN-RF, and BindN+ are trained on the PDNA62 whereas DNABR is trained on a much larger dataset TR265 with 265 protein chains and the AUC values for the other four methods are referenced from Ma et al.'s work 19 . It indicates that our method performs better than DNABR and other three methods on TS-72.
Analysis of number of binding sites in spatial context. Different target sites generally have different spatial context. For example, some sites may only contain either non-binding sites or binding sites in their spatial context while other sites may have both of them. Figure 6a,b show the sensitivity and specificity of the predictions for sites with different number of binding sites in their spatial context, respectively. In Fig. 6a, the x-axis represents the number of binding sites contained in the spatial context and the y-axis represents the predicting sensitivity for the sites with certain number of binding sites in their spatial context. In Fig. 6b, the x-axis has the same meaning as the one for Fig. 6a and the y-axis denotes the predicting specificity for the sites with certain number of binding sites in their spatial contexts.
From Fig. 6a, we can see that the predicting sensitivity increases as the number of binding sites in the spatial context increases to 10; and PDNAsite gets the maximal sensitivity when the number of binding sites in the spatial context equals to or greater than 10. From Fig. 6b, we can see that the predicting specificity decreases as the number of binding sites in the spatial context increases to 10; and PDNAsite gets the maximal specificity when the number of binding sites in the spatial context equals to 0. This phenomenon indicates that, as the number of   binding sites in the spatial context increases, the target site has more capacity to bind to its corresponding DNA molecule, meaning that the number of binding sites in the spatial context has a great impact on the prediction. Therefore, the spatial context extracted from the spatial sliding window can act as a very important discriminant feature for DNA-binding site identification. We can also conclude that the interactions between neighboring binding sites in their spatial structure are important for protein-DNA recognition and their binding ability.
Case study. Epstein-Barr nuclear antigen 1 (PDB 1B3T) activates the initiation of DNA replication once every cell cycle from the Epstein-Barr virus (EBV) latent origin of DNA replication, oriP 44 Fig. 7. Figure 7a,b denote the real sites and predicted sites of 1B3T, respectively. And Fig. 7c,d denote the real sites and predicted sites of 1KX5, respectively. From the figure, we can see that most of the real binding sites are covered by the predicted binding sites, indicating that most real binding sites were successfully predicted by PDNAsite. As there are much more non-binding sites than binding sites in a protein sequence, there are some false predicted non-binding sites shown in Fig. 7b,d.
Use of the Web-Server of PDNAsite. In this work, we also provides a user-friendly web-server of PDNAsite freely accessible to the public. This section provides a step-by-step guideline on how to use PDNAsite.
Step 1. use the URL (http://hlt.hitsz.edu.cn:8080/PDNAsite/) to get to the homepage of the web-server as shown as in Fig. 8. Click the Read Me button to see instructions on how to use the server.
Step 2. Click the Server button for protein sequence query. The input file must be in the PDB format, which contains all the 3D information of the target protein sequence. The input file should be named as the PDB id of the target protein sequence with '.pdb' as the suffix. For example, if the identifier of you target protein sequence is '1A02N' , where '1A02' is the PDB id of the protein entry to which the target sequence belongs and 'N' is the sequence id of the target sequence, its input PDB file should be name as '1A02.pdb' . If the PDB file contains two or more sequences, which is allowed in our system, our server returns the predicting results of all the sequences in the input PDB file.
Step 3. Once an input file is selected from your file system, click the Submit button to upload your selected file. As the PDNAsite need some time to call some external procedures, such as PSI-BLAST, you need to wait for several minutes until the predicting results are returned to you. Figure 9 demonstrates the predicting results of all the sequences contained in the PDB file of the target protein sequence with identifier as '1A02N' , where '+ ' and '− ' represent DNA-binding site and non DNA-binding site, respectively. The predicting specificity of the subset of sites with certain number of binding sites in their spatial context. The x-axis has the same meaning as the one of (a) and the y-axis denotes the predicting specificity on the sites with certain number of binding sites in their spatial context.   only sequence information have advantage of rapid predicting speed, but their prediction accuracy are lower. For the methods based on the combination of sequence information and structural information, they usually have higher prediction accuracy, but their predicting speed are slower. As DBindR and DNABINDPROT were evaluated on different datasets with PDNAsite, we cannot make quantitative comparisons between PDNAsite and these two predictors. DBindR does not use any protein structure information whereas PDNAsite uses many useful structure features in addition to the features used in DBindR. Thus, we can speculate that PDNAsite can obtain better predicting performance than DBindR. For DNABINDPROT, its SN value is very low. This means that DNABINDPROT will miss many actual DNA binding sites. In contrast, PDNAsite can predict more actual binding sites as candidate DNA-binding sites compared to that of DNABINDPROT. For other methods, the quantitative comparisons on the six datasets demonstrate that PDNAsite performs better than them indicating that our predictor can filter out non-binding residues more precisely and keep more actual DNA binding sites as candidate DNA-binding sites. Therefore, since the experimental determination of DNA-binding site is costly and time-consuming, our web-server can reduce the cost of the experimental methods for DNA-binding site identification by obtaining less false non-binding sites and more actual binding sites as candidate DNA-binding sites and further facilitate the improvement of other bioinformatics problems including DNA-binding protein prediction and the analysis of protein-DNA interactions. Our method is especially useful for biologists who make an attempt to identify the sites involving protein-DNA interaction in protein chains, because we can filter out non-binding residues more precisely and keep the candidate binding residues for further analysis by experiments.

Conclusion
In this work, spatial sliding window and sequence sliding window are proposed to extract spatial context and sequence context, respectively. Then the features in the spatial and sequence context are combined to construct the feature space. Subsequently, the LSA is applied to reduce the dimension of the feature space. Finally, a predictor (PDNAsite) for the identification of DNA-binding site is developed by integrating SVM classifier and ensemble learning. The prediction performance on the two datasets PDNA-62 and PDNA-224 by five-fold cross-validation demonstrates that: (1) the predictor employing spatial context outperforms the one using sequence context; (2) the predictor employing both context performs better than either of them individually with significant improvement. Consequently, for the identification of DNA-binding site, the spatial context is more significant than the sequence context, and at the same time, both contexts are complementary to each other. Moreover, when LSA is applied to reduce the redundancy in the sub feature space spanned by the PSSM features, the performance of PDNAsite can be further improved. Performance comparisons between PDNAsite and other existing state-of-the-art methods on the datasets demonstrate that our predictor gives the best performance. We have also applied our predictor to predict the binding sites in two typical protein-DNA complexes: 1B3T and 1KX5. The results show that it can predict most of the DNA-binding sites from the protein sequences successfully. An analysis of the number of binding sites in the spatial context for all sites indicates that the spatial context is useful for the identification of DNA-binding site and the interactions between binding sites next to each other are important for the protein-DNA recognition and their binding ability. A web-server of our predictor at http://hlt. hitsz.edu.cn:8080/PDNAsite/ is also made available for free access to the biology research community.