PreDBA: A heterogeneous ensemble approach for predicting protein-DNA binding affinity

Interactions between proteins and DNA play an essential role in many critical biological processes, such as DNA replication, transcription, splicing, and repair. Studying the binding affinity of proteins to DNA helps to clarify the recognition mechanism of protein-DNA complexes. Because experimentally measured protein-DNA binding affinity data remain limited, accurate and reliable computational methods are needed. In this paper we propose a computational approach, PreDBA, that predicts protein-DNA binding affinity using heterogeneous ensemble models. One hundred protein-DNA complexes were manually collected from the literature as a binding affinity data set, and 52 sequence and structural features were extracted. We then calculated the correlation between each of these 52 features and the binding affinity. We found that protein-DNA binding affinity is affected by the structure of the DNA molecule in the complex, so we divided the complexes into five classes according to the structure of the bound DNA. For each class, a stacked heterogeneous ensemble model was built on the selected features. Finally, we evaluated the method on the binding affinity data set using leave-one-out cross-validation. Across the five classes, the Pearson correlation coefficients of the proposed method range from 0.735 to 0.926. We also demonstrate its advantages over other machine learning methods and the existing protein-DNA binding affinity prediction approach.

The interaction between protein and DNA is one of the core problems in molecular biology and plays significant roles in several biological processes, such as DNA replication, repair, and alteration 1. Researchers have analyzed protein-DNA interactions [2][3][4] to understand the recognition mechanism of protein-DNA complexes. Over the past few years, many experimental techniques for investigating protein binding have been proposed. Electrophoretic mobility shift assays (EMSAs) 5,6, conventional chromatin immunoprecipitation (ChIP) 7, peptide nucleic acid (PNA)-assisted identification of RNA-binding proteins (RBPs) (PAIR) 8, X-ray crystallography 9 and nuclear magnetic resonance (NMR) spectroscopy 10 have been applied to reveal protein-DNA binding residues. However, these experimental methods are expensive and time-consuming. Low-cost, efficient computational methods are therefore particularly valuable for studying protein-DNA interactions.
Quantitative prediction of protein-DNA binding affinity is essential for understanding protein-DNA recognition. Many computational techniques, including empirical scoring functions [11][12][13][14][15], knowledge-based methods [16][17][18] and quantitative structure-activity relationships 19,20, have been proposed for predicting the binding affinity of protein-ligand and protein-protein complexes [21][22][23]. Many scoring functions have been developed for protein-ligand and protein-protein docking simulations, and most of them rely on binding affinity benchmarks 24,25. However, comparable resources for protein-DNA binding affinity still need to be developed and established.
In this paper, we propose a novel computational method, PreDBA, to predict protein-DNA binding affinity quantitatively. Figure 1 shows the flowchart of our method. According to the type of DNA that interacts with the protein 26, we classify the protein-DNA complexes into five groups. For each class, a heterogeneous ensemble model is constructed to predict the binding affinity, and we systematically analyze which features affect the predicted binding affinity. The

Datasets.
We manually curated a set of 201 protein-DNA complexes with experimentally determined binding affinities from the literature. We selected only protein-DNA crystal structures deposited in the PDB with a resolution better than 3 Å. Proteins with sequence similarity >40% were excluded using CD-HIT 28. This left 100 protein-DNA complexes, which form the binding affinity dataset (listed in the Supplementary Table) together with the experimental conditions (temperature). The dissociation Gibbs free energy (ΔG) is used to measure the binding affinity 21 and is calculated as follows:

$$\Delta G = -RT \ln K_d$$

where T is the temperature, R is the gas constant (1.987 × 10⁻³ kcal mol⁻¹ K⁻¹), and K_d is the dissociation constant.
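As a quick numerical illustration of the conversion from a dissociation constant to a free energy, the following minimal sketch assumes the relation ΔG = −RT ln K_d with K_d expressed in mol/L, so that tighter binding gives a larger positive ΔG. The function name and the 1 nM example are hypothetical and are not taken from the data set.

```python
import math

# Gas constant in kcal mol^-1 K^-1, the value quoted in the text.
R_KCAL = 1.987e-3


def dissociation_free_energy(kd_molar, temperature_k):
    """Return dG = -R * T * ln(Kd) in kcal/mol (assumed sign convention)."""
    if kd_molar <= 0 or temperature_k <= 0:
        raise ValueError("Kd and temperature must be positive")
    return -R_KCAL * temperature_k * math.log(kd_molar)


# Hypothetical example: a complex with Kd = 1 nM measured at 298.15 K.
dg = dissociation_free_energy(1e-9, 298.15)
print(f"dG = {dg:.2f} kcal/mol")
```

Because the temperature enters the formula directly, the experimental temperature recorded with each complex matters when converting K_d values measured under different conditions.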

Classification of complexes.
It is worth noting that previous studies have shown that the interaction between proteins and DNA 2 depends on the structure of the DNA molecule; that is, features related to DNA structure affect the binding affinity differently for different classes of DNA. Previous studies have also built predictive models 2 by classifying protein-DNA complexes according to the type of DNA. Therefore, following the rules of the Nucleic Acid Database (NDB) 26, we divided the protein-DNA complexes into three categories: I) complexes with single-stranded DNA (SS), II) complexes with duplex DNA, and III) miscellaneous complexes (MISC). Previous studies 29,30 have confirmed that binding site residues strongly influence the interaction between protein and DNA, and they are believed to play essential roles in determining the binding affinity. To balance the number of complexes in each class, we further divided the complexes with duplex DNA into three categories based on the percentage of binding site residues in the protein, following previous research 21: Double I, Double II and Double III (≤10%, 10-20% and ≥20% of binding site residues, respectively). Several criteria have been proposed to identify DNA-binding sites, such as the distance between contacting atoms in protein and DNA 31, the reduction in solvent accessibility on binding 32 and the interaction energy between protein and DNA 33. Most prediction studies use distance-based criteria to identify the binding sites of protein-DNA complexes. In our work, a residue in the DNA-binding protein is defined as a binding site if the distance between any of its atoms and any DNA atom is ≤5.0 Å.
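The distance criterion and the subdivision of duplex complexes can be sketched as follows. This is an illustrative sketch, not the authors' code: the input layout (residue identifier plus coordinate per protein atom), the function names, and the handling of the exact 10% and 20% boundaries are assumptions, since the text gives overlapping ranges at those points.

```python
import math


def binding_site_fraction(protein_atoms, dna_atoms, cutoff=5.0):
    """Fraction of protein residues with any atom within `cutoff` of any DNA atom.

    protein_atoms: list of (residue_id, (x, y, z)) tuples (assumed layout).
    dna_atoms: list of (x, y, z) tuples.
    """
    residues = set()
    binding = set()
    for res_id, coord in protein_atoms:
        residues.add(res_id)
        if res_id in binding:
            continue  # residue already marked as a binding site
        if any(math.dist(coord, d) <= cutoff for d in dna_atoms):
            binding.add(res_id)
    if not residues:
        raise ValueError("no protein atoms given")
    return len(binding) / len(residues)


def duplex_class(fraction):
    """Map a binding-site fraction to Double I/II/III (boundary rule assumed)."""
    if fraction <= 0.10:
        return "Double I"
    if fraction < 0.20:
        return "Double II"
    return "Double III"


# Toy example: four single-atom residues on a line, one DNA atom nearby.
protein = [(1, (0.0, 0.0, 0.0)), (2, (10.0, 0.0, 0.0)),
           (3, (20.0, 0.0, 0.0)), (4, (30.0, 0.0, 0.0))]
dna = [(3.0, 0.0, 0.0)]
frac = binding_site_fraction(protein, dna)
print(frac, duplex_class(frac))
```

For real structures the all-pairs loop would normally be replaced by a spatial index, but the brute-force version keeps the criterion easy to read.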
Regression models and performance evaluation. We trained a stacking heterogeneous ensemble model on the selected features for each class of protein-DNA complexes to predict binding affinities. First, three different regression methods generate predictions: Adaboost Regression (AdaR) 34, Gradient Boosted Regression Tree (GBRT) 35 and Bagging Regression (BagR) 36. Their outputs are then combined by XGBoost Regression (XGBR) 37 to produce the final prediction.
We used Pearson's correlation coefficient 38 to assess the correlation between the predicted and experimental values. The Pearson correlation coefficient r is defined as follows:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where x_i and y_i are the predicted and experimental binding affinities of the i-th complex, \bar{x} and \bar{y} are their means, and n is the number of complexes.
Significance of different classifications. To verify the significance of DNA type for classifying protein-DNA complexes, we performed the following experiment. Using one and two optimal features, respectively, we trained heterogeneous ensemble prediction models for each class rather than only for all complexes as a whole, and calculated the performance indicators separately. As can be seen from Table 1, prediction accuracy after classification is much better than before classification. In all five classes, the correlation coefficient of the predictive model based on one optimal feature is higher than 0.45, whereas for the entire set of complexes it is only 0.165. With two features, the correlation coefficient is > 0.5 in all classes. Scatter plots of experimental versus predicted binding affinity are shown in Figs. 2 and 3, which show the experimental and predicted ΔG of all protein-DNA complexes before and after classification, respectively. In Fig. 3, most points lie close to the diagonal, whereas most points in Fig. 2 are scattered randomly. This comparison illustrates that classifying the complexes before predicting binding affinity is effective. The difficulty in modeling all complexes together may be due to the weak correlation between different classes of complexes. We therefore stress the importance of classifying protein-DNA complexes before building a practical predictive model.
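For readers who want to reproduce the evaluation measure, a minimal sketch of Pearson's correlation coefficient in plain Python is given here. It uses the standard textbook definition; the function name and the toy values are illustrative only.

```python
import math


def pearson_r(predicted, experimental):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(predicted)
    if n != len(experimental) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_e = sum(experimental) / n
    cov = sum((p - mean_p) * (e - mean_e)
              for p, e in zip(predicted, experimental))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    sd_e = math.sqrt(sum((e - mean_e) ** 2 for e in experimental))
    if sd_p == 0 or sd_e == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_p * sd_e)


# Toy values (not from the paper): predicted vs. experimental dG in kcal/mol.
print(pearson_r([8.1, 9.4, 10.2, 11.9], [8.0, 9.0, 10.5, 12.3]))
```

Note that r measures linear association only: a model whose predictions are systematically shifted or scaled can still score close to 1, which is why it is usually read together with error-based measures.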

Prediction of binding affinity. We built a regression model for each class of protein-DNA complexes to predict the binding affinity. The performance of our method is shown in Table 2. The Pearson correlation coefficients for all classes are greater than 0.73, which means that the predicted binding affinity is closely related to the experimental value. The other evaluation criteria also support the effectiveness of our approach. Together, these results show that the method is useful and that classification effectively improves prediction accuracy.
Next, to further explore the features governing protein-DNA binding affinity prediction, we evaluated the prediction performance in each class. Figure 4 shows the predicted and experimental binding affinities for each of the five classes. Except for a few individual cases, most predicted values closely match the corresponding experimental values. The performance for each class is described in detail below.
Complexes with single-stranded DNA. In this class, the protein binds to single-stranded DNA. There are eight protein-DNA complexes in this group, with binding affinities ranging from 4.3 kcal mol⁻¹ to 12.3 kcal mol⁻¹, a span of 8 kcal mol⁻¹. Using 4 features, our model reached a Pearson correlation coefficient of 0.94 under leave-one-out cross-validation. The mass of the beta sheet was identified as the most important factor for predicting the binding affinity. The amount of beta sheet in the protein and the pairwise interactions GA/CT and GC/CG also play a vital role in protein binding to DNA. In the leave-one-out test, our approach predicted the binding affinity of 87.5% of the complexes within a deviation of 1 kcal mol⁻¹.
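The "within a deviation of 1 kcal mol⁻¹" figures reported per class can be computed with a small helper such as the one sketched here. The function name and the eight value pairs are hypothetical; they are chosen only so that seven of eight predictions fall within the tolerance, which corresponds to 87.5%.

```python
def fraction_within(predicted, experimental, tolerance):
    """Share of complexes whose absolute prediction error is <= tolerance."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for p, e in zip(predicted, experimental)
               if abs(p - e) <= tolerance)
    return hits / len(predicted)


# Hypothetical dG values in kcal/mol for eight complexes (not the paper's data).
experimental = [4.3, 6.0, 7.5, 8.2, 9.0, 10.1, 11.4, 12.3]
predicted = [4.9, 6.4, 7.1, 8.8, 9.3, 10.0, 11.0, 10.5]
print(fraction_within(predicted, experimental, 1.0))
```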
Complexes with duplex DNA. In this type of complex, the protein binds to double-stranded DNA. We divided these complexes into three categories: Double I, Double II, and Double III. The features that affect each category and the corresponding prediction results are described below.
The pairwise interactions AA/TT and CA/GT are essential for the prediction. The binding affinity of 28 of 33 complexes was predicted within a deviation of 2 kcal mol⁻¹ in the leave-one-out test.
3. Double III. Double III contains the complexes in which more than 20% of the protein residues are binding sites for double-stranded DNA, with 25 complexes. The average absolute value of ΔG is 9.7 kcal mol⁻¹. With three features, we obtained a correlation coefficient of 0.843. For this class, we found that the nearest-neighbor bases of DNA play a decisive role. The binding affinity of 20 and 17 of 24 complexes was predicted within deviations of 2 and 1 kcal mol⁻¹, respectively, in the leave-one-out test.
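The pairwise interactions named in the results (AA/TT, CA/GT, GA/CT, GC/CG) are nearest-neighbor base-pair steps. How the authors derived these features is not described in this section, so the following is only a sketch under one assumption: each feature is the count of a dinucleotide step along the sequence, with a step and its reverse complement merged into one label, which yields ten distinct labels for duplex DNA. The function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def step_label(step):
    """Label a dinucleotide step as 'XY/X'Y'' with X'Y' the paired bases.

    A step and its reverse complement describe the same base-pair step seen
    from the two strands, so both map to one canonical label.
    """
    partner = "".join(COMPLEMENT[b] for b in step)  # paired bases, read 3'->5'
    reverse_complement = partner[::-1]              # same step on other strand
    canonical = min(step, reverse_complement)
    canonical_partner = "".join(COMPLEMENT[b] for b in canonical)
    return f"{canonical}/{canonical_partner}"


def nearest_neighbor_counts(sequence):
    """Count canonical nearest-neighbor steps along one DNA strand (A/C/G/T)."""
    seq = sequence.upper()
    return Counter(step_label(seq[i:i + 2]) for i in range(len(seq) - 1))


# Toy example: the EcoRI recognition sequence.
print(nearest_neighbor_counts("GAATTC"))
```

The lookup raises `KeyError` for characters other than A, C, G and T, so ambiguous bases would need to be filtered out beforehand.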
Miscellaneous complexes. The miscellaneous class has twenty complexes, and the average absolute value of the binding free energy ΔG is 10.01 kcal mol⁻¹. For this class, we used four features to build the model for predicting the binding affinity and obtained a correlation coefficient of 0.834. We found that protein properties play a decisive role in the prediction. The molecular mass and the amount of alpha helix in the protein are two informative features. The number of aromatic and positively charged residues and the total number of hydrogen bonds in the protein are also important. The prediction results indicate that the selected features are beneficial for predicting the binding affinity of miscellaneous complexes. In the leave-one-out test, our approach predicted the binding affinity of 90% of the complexes within a deviation of 1 kcal mol⁻¹.
42, Adaboost Regression (AdaR), Bagging Regression (BagR), XGBoost Regression (XGBR), and Gradient Boosted Regression Tree (GBRT). As shown in Table 3, PreDBA performs significantly better than the other regression models for all classes of complexes. We also calculated the average of each performance indicator for the regression models, as shown in Fig. 5. The PreDBA model reached an average correlation coefficient of 0.84, an average MAE of 0.88 and an average R2 of 0.65, all better than the values of the other four methods. We conclude that the heterogeneous ensemble model makes our approach perform better than the other regression methods.
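The two additional indicators mentioned above, MAE and R2, can be sketched as follows. The sketch assumes that MAE is the mean absolute error in kcal mol⁻¹ and that R2 is the coefficient of determination, 1 − SS_res/SS_tot; the text does not define them in this section, so both readings are assumptions, as are the function names.

```python
def mean_absolute_error(predicted, experimental):
    """Average absolute difference between predicted and experimental values."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, experimental)) / len(predicted)


def r_squared(predicted, experimental):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    mean_e = sum(experimental) / len(experimental)
    ss_res = sum((e - p) ** 2 for p, e in zip(predicted, experimental))
    ss_tot = sum((e - mean_e) ** 2 for e in experimental)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for constant experimental values")
    return 1.0 - ss_res / ss_tot


# Toy values in kcal/mol (not from the paper).
exp_dg = [8.0, 9.0, 10.5, 12.3]
pred_dg = [8.1, 9.4, 10.2, 11.9]
print(mean_absolute_error(pred_dg, exp_dg), r_squared(pred_dg, exp_dg))
```

Unlike the correlation coefficient, a lower MAE indicates a better model, so the three indicators should not be compared on a single "higher is better" scale.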
To verify the validity of the machine learning algorithms used in our stacking model, we analyzed the effects of different algorithm combinations. Table 4 shows the correlation coefficients between predicted and experimental binding affinities for different algorithm combinations in the first layer of the stacking model. As Table 4 shows, different combinations affect the prediction results differently. PreDBA, which combines all three algorithms (GBRT+AdaR+BagR), performs better than using only one or two of them.
Comparison with state-of-the-art approach. As far as we know, DDNA3 43 is the only existing method for quantitative prediction of protein-DNA binding affinity. DDNA3 is an upgraded version of DDNA 44 and uses a knowledge-based energy function to predict the binding affinity of protein-DNA complexes. We applied DDNA3 to our data set and compared it with our method. Table 5 compares DDNA3 with our approach in terms of the correlation coefficient. PreDBA is significantly better than DDNA3 in every class.
Web server. We developed a web server for predicting protein-DNA binding affinity, which is freely accessible to the research community at http://predba.denglab.org/. The PreDBA web server is implemented in Perl, Python, JavaScript, jQuery (AJAX), and CSS. It accepts protein-DNA complex 3D structures in PDB format or PDB codes as input, then predicts and displays the binding affinities of the complexes.

Discussion
In this paper, we generated a non-redundant dataset containing the binding affinity values of one hundred protein-DNA complexes. Based on the structural classification, we developed a method termed PreDBA that uses heterogeneous ensemble models to predict protein-DNA binding affinities. Under leave-one-out cross-validation, the mean correlation coefficient is 0.82. To understand the importance of the selected features for binding affinity in each class, we systematically analyzed the features of all classes. We also compared our regression approach with several standard regression methods and showed that it performs best. Furthermore, we compared PreDBA with the pioneering protein-DNA binding affinity prediction method DDNA3, and the results confirm that PreDBA performs better. Finally, we developed a web server (http://predba.denglab.org/) that can be used freely to predict protein-DNA binding affinity. We hope that PreDBA will be helpful for studying the interaction between protein and DNA.

Methods
Features extraction. We extracted 52 features to predict the binding affinity of protein-DNA complexes. The features are derived mainly from the structural and sequence information of the proteins and DNA in the complexes. The specific features are listed below.
Table 6. Features selected in each class of complexes.
Protein sequential features. The sequence information of each protein was extracted from the PDB file. From the amino acids in the protein sequence, we calculated the molecular mass 45 of the protein. We also counted the total number of hydrogen bonds 46 in the protein. In addition, based on the sequence information, we calculated physicochemical properties, including the hydrophilic and hydrophobic residues 47, the aromatic and positively charged residues 48, the polar residues and the charged residues in the protein.
Protein structural features. We used the DSSP algorithm to obtain protein secondary structure information. The secondary structure features include the amount and proportion of alpha helix and beta sheet in the protein and the molecular mass of the alpha helix 49,50 and the beta sheet 51. The solvent-accessible surface area (SASA) 52 of the protein was also calculated.
DNA sequential features. Based on the DNA base sequence, we obtained two features for predicting binding affinity, as described below.
1. DNA molecular mass. We used the sequence of the DNA in the complex to obtain the molecular weight of the DNA. The molecular masses of single-stranded DNA (ssDNA) and double-stranded DNA (dsDNA) are calculated as follows:
Features selection. Since the binding affinities of different classes of complexes correlate strongly with the structures of the DNA and proteins, we performed feature selection for each class of protein-DNA complexes iteratively and independently. For each class, we used correlation coefficients to measure the relationship between each feature and the binding affinity. The correlation coefficients were then sorted in descending order, and the top 10 features were selected for each class. Finally, a greedy algorithm was used to select the appropriate feature set for each class until performance no longer improved. The selected features for each class are shown in Table 6. To avoid overfitting, the final feature set contains fewer than five features in each of the five classes.
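The two-stage selection described above (correlation ranking, then a greedy search) can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it assumes that ranking uses the absolute value of Pearson's r, that the greedy step is forward selection, and that `score_fn` stands for whatever performance measure is used (for example, the leave-one-out correlation of a model trained on the candidate set). All names and the toy score are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def rank_features(features, target, top_k=10):
    """Return up to top_k feature names sorted by |r| with the target."""
    strength = {name: abs(pearson_r(vals, target))
                for name, vals in features.items()}
    return sorted(strength, key=strength.get, reverse=True)[:top_k]


def greedy_select(candidates, score_fn, max_features=4):
    """Forward selection: add the best feature until the score stops improving."""
    selected, best = [], float("-inf")
    while len(selected) < max_features:
        trials = [(score_fn(selected + [f]), f)
                  for f in candidates if f not in selected]
        if not trials:
            break
        score, feature = max(trials)
        if score <= best:
            break  # no remaining feature improves the score
        selected.append(feature)
        best = score
    return selected, best


# Toy data: three hypothetical features for four complexes.
features = {"f1": [1, 2, 3, 4], "f2": [1, 0, 1, 0], "f3": [4, 3, 2, 1]}
target = [1, 2, 3, 4]
ranked = rank_features(features, target, top_k=3)

# Toy score: each feature adds a fixed gain, each extra feature costs 0.1.
gains = {"a": 0.5, "b": 0.3, "c": 0.05}
chosen, final_score = greedy_select(
    ["a", "b", "c"], lambda fs: sum(gains[f] for f in fs) - 0.1 * len(fs))
print(ranked, chosen, final_score)
```

The default `max_features=4` mirrors the statement that each final set contains fewer than five features; with so few complexes per class, such a cap limits the risk that the greedy search fits noise.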
The stacking heterogeneous ensemble method. Ensemble learning methods 56-62 perform very well among machine learning methods, so we used ensemble learning to predict the binding affinity of protein-DNA complexes. Stacking is an ensemble learning technique that combines heterogeneous models and performs particularly well. The flowchart of our method is shown in Fig. 6. The stacking heterogeneous ensemble model has two layers, each containing one or more machine learning models. As shown in Fig. 6, the first layer of PreDBA contains three conventional machine learning models: the Gradient Boosted Regression Tree model, the Adaboost Regression model, and the Bagging Regression model. The second layer contains a single machine learning model, the XGBoost Regression model.
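A two-layer stack of this shape can be sketched with scikit-learn's `StackingRegressor`, evaluated with leave-one-out cross-validation. This is a sketch under stated assumptions, not the PreDBA implementation: the hyperparameters, the internal 3-fold scheme used to train the second layer, and the synthetic data are all illustrative, and if the `xgboost` package is not installed the sketch falls back to a gradient boosting model as the second layer, which differs from the paper.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              GradientBoostingRegressor, StackingRegressor)
from sklearn.model_selection import LeaveOneOut, cross_val_predict

try:
    from xgboost import XGBRegressor

    def make_meta_model():
        # Second-layer model named in the text.
        return XGBRegressor(n_estimators=50, max_depth=2, random_state=0)
except ImportError:
    def make_meta_model():
        # Fallback only so the sketch runs without xgboost (not in the paper).
        return GradientBoostingRegressor(n_estimators=50, random_state=0)


def build_stack():
    """First layer: GBRT, AdaR, BagR. Second layer: XGBR (or the fallback)."""
    first_layer = [
        ("gbrt", GradientBoostingRegressor(n_estimators=50, random_state=0)),
        ("adar", AdaBoostRegressor(n_estimators=50, random_state=0)),
        ("bagr", BaggingRegressor(n_estimators=20, random_state=0)),
    ]
    # cv=3: the second layer is trained on out-of-fold first-layer predictions.
    return StackingRegressor(estimators=first_layer,
                             final_estimator=make_meta_model(), cv=3)


# Synthetic stand-in for one class: 20 complexes, 4 selected features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = 8.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=20)

# Leave-one-out: each complex is predicted by a stack trained on the others.
pred = cross_val_predict(build_stack(), X, y, cv=LeaveOneOut())
r = np.corrcoef(pred, y)[0, 1]
print(f"leave-one-out Pearson r on synthetic data: {r:.3f}")
```

Training the second layer on out-of-fold predictions keeps it from simply learning to trust whichever first-layer model overfits most, which matters when each class contains only a few dozen complexes.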