DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation

Liu, Bin; Wang, Shanyi; Wang, Xiaolong

doi:10.1038/srep15479

Article
Open access
Published: 20 October 2015

Bin Liu^1,2, Shanyi Wang^1 & Xiaolong Wang^1,2

Scientific Reports volume 5, Article number: 15479 (2015)

Abstract

DNA-binding proteins play an important role in most cellular processes. Therefore, it is necessary to develop an efficient predictor that identifies DNA-binding proteins based only on the sequence information of proteins. The bottleneck in constructing a useful predictor is finding suitable features that capture the characteristics of DNA-binding proteins. We applied pseudo amino acid composition (PseAAC) to DNA-binding protein identification, and further improved PseAAC by incorporating evolutionary information through profile-based protein representation. Combining these features with Support Vector Machines (SVMs), we propose a predictor called iDNAPro-PseAAC. Experimental results on an updated benchmark dataset showed that iDNAPro-PseAAC outperformed some state-of-the-art approaches and achieved stable performance on an independent dataset. By using an ensemble learning approach to incorporate more negative samples (non-DNA-binding proteins) in the training process, the performance of iDNAPro-PseAAC was further improved. The web server of iDNAPro-PseAAC is available at http://bioinformatics.hitsz.edu.cn/iDNAPro-PseAAC/.

Introduction

DNA-binding proteins have diverse functions in the cell and play vital roles in various cellular processes, such as gene regulation, DNA replication and repair¹. Identification of DNA-binding proteins is one of the most important tasks in the annotation of protein functions. DNA-binding proteins can be identified by several experimental techniques, including filter binding assays², X-ray crystallography³ and NMR⁴. However, identifying DNA-binding proteins experimentally is time-consuming and expensive. Facing the avalanche of new protein sequences generated in the post-genomic and big data age^5,6, it is highly desirable to develop automated methods that rapidly and effectively identify DNA-binding proteins based on protein sequence information alone.

The computational methods for DNA-binding protein identification can be grouped into two categories: (i) methods based on structures and (ii) methods based on sequences. The first type makes use of both the structural and the sequence information of target proteins (see, e.g.,^7,8,9,10). Although these methods show promising predictive performance, structural information is not always available, particularly for the huge number of proteins known only by sequence, which limits the application of these methods. In contrast, methods of the second type overcome this shortcoming by requiring only the sequence information as input for the prediction^{11,12,13,14,15,16,17,18,19,20}.

A key to improving the performance of sequence-based methods is to find suitable feature extraction algorithms that can capture the characteristics of DNA-binding proteins and non-DNA-binding proteins. Motivated by the successful application of Chou’s pseudo amino acid composition (PseAAC) to many important tasks in the field of computational proteomics, we propose a new approach for DNA-binding protein identification called iDNAPro-PseAAC, which extends the classic PseAAC approach by incorporating evolutionary information in the form of profile-based protein representation²¹. iDNAPro-PseAAC has the following advantages compared with other currently available approaches: (i) It is able to incorporate the global or long-range sequence-order effects by means of PseAAC. (ii) It employs the evolutionary information embedded in the profile-based protein representation. (iii) It considers various physical-chemical properties of amino acids.

To establish a really useful statistical predictor for a protein system, we need to consider the following procedures: (i) Construct or select a valid benchmark dataset to train and test the predictor. (ii) Formulate the protein samples with an effective mathematical expression that truly reflects their intrinsic correlation with the attribute to be predicted. (iii) Introduce or develop a powerful algorithm (or engine) to perform the prediction. (iv) Properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor. Below, we describe how the new predictor was built according to these four procedures.

Results

The influence of λ and ω on the performance of iDNAPro-PseAAC

Two parameters of iDNAPro-PseAAC, λ and ω, influence its performance (see Methods). λ can be any integer between 1 and L-1, where L is the length of the shortest sequence in the benchmark dataset. ω ranges from 0 to 1. The performance of iDNAPro-PseAAC with different combinations of λ and ω is shown in Fig. 1 and Supplementary S1, from which we can see that iDNAPro-PseAAC achieves the best performance with λ = 3 and ω = 0.7. These parameter values are used in the following experiments.

Discriminant Visualization

To further study the discriminant power of the features, we calculate the discriminant weight vector in the feature space. The SVM training process yields sequence-specific weights, which can be used to calculate the discriminant weight of each feature. The feature discriminant weight vector W can be calculated as follows:

$$
\mathbf{W} = \mathbf{A}\,\mathbf{M} = [a_1, a_2, \ldots, a_N]
\begin{bmatrix}
m_{1,1} & m_{1,2} & \cdots & m_{1,j} \\
m_{2,1} & m_{2,2} & \cdots & m_{2,j} \\
\vdots & \vdots & \ddots & \vdots \\
m_{N,1} & m_{N,2} & \cdots & m_{N,j}
\end{bmatrix}
\tag{1}
$$

where A is the weight vector of the training set with N samples, obtained from the kernel-based training; M is the matrix of sequence representatives; and j is the dimension of the feature vector. Each element of W represents the discriminative power of the corresponding feature.

The discriminative weights of all 23 features are shown in Fig. 2. We can see that 9 amino acids show positive values, while the other 11 amino acids show negative values. Interestingly, most of the amino acids with positive values, such as R and K, have been reported as important residues in DNA-binding proteins and are abundant in protein-DNA binding regions²². iDNAPro-PseAAC is able to capture this kind of feature of DNA-binding proteins, which could explain its better performance. Another interesting pattern is that all three features capturing the sequence-order effects (λ = 1, 2, 3) show negative values, indicating that features of this kind are useful for representing non-DNA-binding proteins.
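As an illustration of the weight calculation described above, the following sketch multiplies a vector of per-sequence weights by a matrix of feature vectors. The function name, the plain-list representation and the toy numbers are assumptions made for this example only; in practice the per-sequence weights would come from the trained SVM model.

```python
def discriminant_weights(a, m):
    """Return W = A . M, where a has N entries and m is an N x j matrix."""
    n_features = len(m[0])
    # Each feature weight sums (sample weight * feature value) over all samples.
    return [sum(a[i] * m[i][f] for i in range(len(a))) for f in range(n_features)]


# Toy example: 3 training sequences, 2 features (values are made up).
A = [0.5, -1.0, 0.25]
M = [[1.0, 0.0],
     [0.5, 2.0],
     [0.0, 4.0]]
W = discriminant_weights(A, M)
```

A positive entry in W suggests that the feature pushes the decision towards the positive class, and a negative entry towards the negative class.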

Results on the benchmark dataset

Table 1 shows the predictive results of iDNAPro-PseAAC on the benchmark dataset using the jackknife test. For comparison, the results of four state-of-the-art methods are also listed: DNAbinder (dimension 21)²³, DNAbinder (dimension 400)²³, DNA-Prot²⁴ and iDNA-Prot²⁵. We selected these four methods because they have publicly available software tools with reported optimized parameters, so their optimized results on the benchmark dataset can be easily obtained by using these tools and parameter settings.

Table 1 A comparison of the jackknife test results by iDNAPro-PseAAC with the other methods on the benchmark dataset of Eq. 2 (cf. Supporting Information S3).

From Table 1 we can see that iDNAPro-PseAAC achieves the best performance. To further study the performance of the proposed method, the ROC curve is employed to evaluate the different methods. The ROC curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The horizontal coordinate is the false-positive rate and the vertical coordinate is the true-positive rate. The true-positive rate is also known as sensitivity in biomedical informatics, or recall in machine learning^5,26. The false-positive rate is also known as the fall-out and can be calculated as 1 − specificity. The area under the curve (AUC) is the evaluation criterion for the classifier. Figure 3 shows the ROC curves of the five methods, from which we can see that iDNAPro-PseAAC outperforms the other four approaches in terms of AUC.

Performance comparison with other related computational predictors

To further evaluate the performance of iDNAPro-PseAAC and facilitate comparison with previous predictors, we use the independent test dataset PDB186 constructed by Lou et al.²⁷, which contains 93 DNA-binding proteins and 93 non-DNA-binding proteins. To avoid homology bias, we use NCBI’s BLASTCLUST²⁸ to remove from the benchmark dataset those proteins that have more than 25% sequence identity to any protein within the same subset of the PDB186 dataset. iDNAPro-PseAAC is re-trained on the resulting benchmark dataset, and this model is then used to predict the samples in the independent dataset. The results are shown in Table 2 and the ROC curves of the various methods are plotted in Fig. 4. From the results listed in Table 2, we can see that iDNAPro-PseAAC achieves stable performance on the independent dataset, indicating that the proposed method is a useful tool for DNA-binding protein identification. iDNAPro-PseAAC outperforms all the other approaches except DBPPred. However, our method is more efficient than DBPPred. DBPPred uses 1486 features derived from predicted secondary structure, predicted relative solvent accessibility and the position-specific scoring matrix. These features are calculated with the help of two software tools, SPINE-X and PSI-BLAST, both of which require a time-consuming multiple sequence alignment process²⁹. Furthermore, these features involve several parameters that must be optimized on a validation dataset. This requires additional running time for DBPPred and raises the risk of over-fitting caused by the parameter optimization process. In contrast, the 23 features used in iDNAPro-PseAAC can be easily generated based on the protein sequences alone. Therefore, iDNAPro-PseAAC is more efficient than DBPPred and avoids the risk of over-fitting.

Table 2 A comparison of the results^a obtained by iDNAPro-PseAAC and the other methods on the independent dataset PDB186.

Influence of Negative Samples on the Predictive Performance

In real-world applications, there are more non-DNA-binding proteins (negative samples) than DNA-binding proteins (positive samples)³⁰. However, to avoid classifier bias, a balanced benchmark dataset is used to construct iDNAPro-PseAAC. It is therefore interesting to explore the influence of different negative sets on the predictive performance of iDNAPro-PseAAC. In this regard, we conduct the following experiments. First, we extend the negative set of the benchmark dataset by selecting more non-DNA-binding proteins from the PDB³¹. After removing the redundant proteins sharing more than 25% similarity with the independent dataset, we obtain 2059 negative samples, which are listed in Supplementary S2. The extended negative set is then randomly divided into four subsets, each approximately equal in size to the positive set of the benchmark dataset. Each of the four subsets is combined with the positive set, generating four new datasets. The four iDNAPro-PseAAC predictors trained on these datasets are denoted iDNAPro-PseAAC-1, iDNAPro-PseAAC-2, iDNAPro-PseAAC-3 and iDNAPro-PseAAC-4, respectively. Their performance is then evaluated on the independent dataset. Table 3 shows the results of the four predictors and the corresponding ROC curves are plotted in Fig. 5, from which we can see that the four predictors show similar performance, indicating that different subsets of negative samples do not have a significant impact on the performance of iDNAPro-PseAAC.

Next, we investigate whether these four predictors can be combined to further improve the performance. In this regard, we employ a simple ensemble learning approach to combine them^32,33. Each test sample is predicted by the four predictors separately, and its final class label is assigned based on the average of the four probability values calculated by the four predictors. The results of iDNAPro-PseAAC-EL (iDNAPro-PseAAC with the ensemble learning approach) are shown in Table 3 and Fig. 5. A performance improvement can be observed. This is because the ensemble learning method allows more negative samples to be used in training iDNAPro-PseAAC, leading to a more accurate predictor.
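The negative-set split and the probability averaging described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names, the fixed random seed, the example probabilities and the 0.5 decision threshold are all assumptions, since the text only states that the label is assigned from the average of the four probabilities.

```python
import random


def split_negatives(negatives, n_subsets, seed=0):
    """Shuffle the negative samples and deal them into n_subsets near-equal groups."""
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def ensemble_predict(probabilities, threshold=0.5):
    """Average the per-model positive-class probabilities and threshold the mean."""
    mean_prob = sum(probabilities) / len(probabilities)
    label = 1 if mean_prob >= threshold else 0  # threshold is an assumed value
    return label, mean_prob


# 2059 negative samples (represented here by their indices) split into 4 subsets.
subsets = split_negatives(range(2059), 4)

# Hypothetical probabilities returned by the four predictors for one test protein.
label, mean_prob = ensemble_predict([0.62, 0.48, 0.55, 0.71])
```

Each subset would be paired with the full positive set to train one predictor; at test time only the four probability outputs are needed.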

Table 3 Results on independent dataset PDB186 achieved by iDNAPro-PseAAC trained with different datasets.

Discussion

Because of the importance of DNA-binding protein identification, computational predictors that use only sequence information are highly desirable. In this study, we proposed a method called iDNAPro-PseAAC for DNA-binding protein identification, which combines pseudo amino acid composition with profile-based protein representation. Experimental results show that it outperforms other approaches on both the benchmark dataset and the independent dataset. Furthermore, the discriminative model can be analyzed to reveal in-depth features of DNA-binding proteins, which would benefit researchers who want to investigate the characteristics of DNA-binding proteins. Some recent studies have shown that DNA-binding proteins also regulate microRNA targets and are involved in the noncoding RNA-protein-disease network^{34,35,36,37,38,39}. We believe that this predictor would be a high-throughput tool for DNA-binding protein investigation.

Methods

Benchmark Dataset

A reliable and stringent benchmark dataset is necessary to build and evaluate a statistical predictor. In this regard, an updated benchmark dataset for this study is constructed based on the latest version of the Protein Data Bank (PDB)³¹, which can be formulated as:

$$
S = S^{+} \cup S^{-} \tag{2}
$$

where S⁺ represents the subset containing DNA-binding proteins (positive samples), S⁻ represents the subset containing non-DNA-binding proteins (negative samples) and the symbol ∪ is the “union” in set theory. DNA-binding protein sequences are collected from the PDB by searching the mmCIF keywords ‘DNA binding protein’, ‘protein-DNA complex’ and other keywords with similar meaning. To construct a high-quality and non-redundant benchmark dataset, the protein sequences obtained were filtered by the following two criteria. (1) Proteins shorter than 50 amino acids were removed, because they might be fragments. (2) To reduce redundancy and homology bias, the sequences were culled with PISCES⁴⁰ at a 25% sequence similarity cutoff between any two proteins. Finally, we obtained 525 DNA-binding proteins for the subset S⁺. For the subset S⁻, 550 non-DNA-binding proteins were randomly selected from the PDB according to the above criteria. The accession codes and sequences of the 525 positive and 550 negative samples are given in Supplementary S3.

Profile-based protein representation

Profile-based protein representation²¹ is an efficient approach to extract the evolutionary information from frequency profiles. Its main steps are as follows.

Consider a protein sequence P consisting of L amino acids, formulated as:

$$
P = R_1 R_2 R_3 \cdots R_L \tag{3}
$$

where R₁ represents the 1st residue, R₂ represents the 2nd residue and so forth. The frequency profile of sequence P generated by PSI-BLAST²⁸ with default parameters can be represented as a matrix M:

$$
M =
\begin{bmatrix}
m_{1,1} & m_{1,2} & \cdots & m_{1,L} \\
m_{2,1} & m_{2,2} & \cdots & m_{2,L} \\
\vdots & \vdots & \ddots & \vdots \\
m_{20,1} & m_{20,2} & \cdots & m_{20,L}
\end{bmatrix}
\tag{4}
$$

where 20 is the number of standard amino acids and m_ij is the target frequency representing the probability of amino acid i (i = 1, 2, …, 20) appearing at sequence position j (j = 1, 2, …, L) of protein P during the evolutionary process. m_ij is calculated as:

$$
m_{ij} = \frac{\alpha f_{ij} + \beta g_{ij}}{\alpha + \beta} \tag{5}
$$

where f_ij represents the observed frequency of amino acid i in column j, α is the number of different amino acids in column j minus one, β is a free parameter set to a constant value of 10 (the value initially used by PSI-BLAST), and g_ij is the pseudo-count for standard amino acid i at position j. It is calculated as follows:

$$
g_{ij} = \sum_{k=1}^{20} f_{kj} \, \frac{q_{ik}}{p_k} \tag{6}
$$

where p_k is the background frequency of amino acid k and q_ik is the score of amino acid i being aligned to amino acid k in the BLOSUM62 substitution matrix, which is the default score matrix of PSI-BLAST.
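The pseudo-count and target-frequency calculations can be sketched in a few lines of Python. The sketch assumes the standard PSI-BLAST form, m_ij = (α f_ij + β g_ij)/(α + β) with g_ij = Σ_k f_kj q_ik / p_k; the function names and the two-letter toy alphabet are hypothetical and serve only to keep the arithmetic easy to follow.

```python
def pseudo_count(i, column_freqs, q, p):
    """g_ij = sum over k of f_kj * q_ik / p_k for amino acid i in one column j."""
    return sum(column_freqs[k] * q[i][k] / p[k] for k in range(len(p)))


def target_frequency(f_ij, g_ij, alpha, beta=10.0):
    """m_ij = (alpha * f_ij + beta * g_ij) / (alpha + beta)."""
    return (alpha * f_ij + beta * g_ij) / (alpha + beta)


# Toy two-letter alphabet (all values are made up for illustration).
column_freqs = [0.75, 0.25]      # observed frequencies f_kj in one column
p = [0.5, 0.5]                   # background frequencies p_k
q = [[0.4, 0.1],
     [0.1, 0.4]]                 # substitution values q_ik

g = pseudo_count(0, column_freqs, q, p)
m = target_frequency(column_freqs[0], g, alpha=1)  # alpha supplied by the caller
```

With few observed residue types in a column, α is small and the pseudo-count term dominates; as α grows, the observed frequency carries more weight.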

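For each column in M, the amino acids are sorted in descending order according to their frequency values. The sorted matrix can thus be represented as:

$$
\tilde{M} =
\begin{bmatrix}
\tilde{m}_{1,1} & \tilde{m}_{1,2} & \cdots & \tilde{m}_{1,L} \\
\tilde{m}_{2,1} & \tilde{m}_{2,2} & \cdots & \tilde{m}_{2,L} \\
\vdots & \vdots & \ddots & \vdots \\
\tilde{m}_{20,1} & \tilde{m}_{20,2} & \cdots & \tilde{m}_{20,L}
\end{bmatrix}
\tag{7}
$$

where

$$
\tilde{m}_{1,i} \ge \tilde{m}_{2,i} \ge \cdots \ge \tilde{m}_{20,i} \quad (i = 1, 2, \ldots, L) \tag{8}
$$

The profile-based protein representation P′ of protein P can be generated by combining the most frequent amino acids in all the columns of M and can be represented as:

$$
P' = R'_1 R'_2 \cdots R'_L \tag{9}
$$

where R′_i represents the most frequent amino acid in the i-th column of M̃, whose frequency value is m̃_1,i.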
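As a small illustration of this step, the sketch below builds the profile-based representation by picking the most frequent amino acid in each column of a 20 × L frequency matrix. The row order of the amino acids (alphabetical one-letter codes) and the toy profile are assumptions; a real profile would be parsed from PSI-BLAST output.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed row order of the profile


def profile_based_sequence(profile):
    """profile[i][j] is the frequency of amino acid i at position j (20 x L)."""
    length = len(profile[0])
    residues = []
    for j in range(length):
        # Pick the row index with the highest frequency in column j.
        best = max(range(len(AMINO_ACIDS)), key=lambda i: profile[i][j])
        residues.append(AMINO_ACIDS[best])
    return "".join(residues)


# Toy profile with three positions; each column sums to 1.
profile = [[0.01] * 3 for _ in AMINO_ACIDS]
profile[AMINO_ACIDS.index("K")][0] = 0.81
profile[AMINO_ACIDS.index("R")][1] = 0.81
profile[AMINO_ACIDS.index("A")][2] = 0.81
representation = profile_based_sequence(profile)
```

Taking the column maximum directly gives the same result as sorting each column and reading off its first entry.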

Pseudo amino acid composition (PseAAC)

One of the most important but also most difficult problems in computational biology and biomedicine is how to formulate a biological sequence as a discrete model or a vector while still retaining considerable sequence-order information. This is because the existing operation engines, such as SVM (Support Vector Machine) and NN (Neural Network), can only handle vectors, not sequence samples, as elaborated in^41,42. However, a vector defined in a discrete model may completely lose the sequence-order information. To avoid completely losing the sequence-order information for proteins, the pseudo amino acid composition (PseAAC) was proposed⁴³.

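The PseAAC approach is then applied to the profile-based protein representation P′ (cf. Eq. 9) to convert it into a fixed-length feature vector:

$$
P' = [x_1, x_2, \ldots, x_{20}, x_{20+1}, \ldots, x_{20+\lambda}]^{T} \tag{10}
$$

where T is the transpose operator and λ is the distance parameter reflecting the sequence-order effects of residues in proteins. x_u can be calculated by

$$
x_u =
\begin{cases}
\dfrac{f_u}{\sum_{i=1}^{20} f_i + \omega \sum_{k=1}^{\lambda} \theta_k}, & 1 \le u \le 20 \\[2ex]
\dfrac{\omega\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + \omega \sum_{k=1}^{\lambda} \theta_k}, & 20+1 \le u \le 20+\lambda
\end{cases}
\tag{11}
$$

where f_u is the occurrence frequency of the u-th of the 20 standard amino acids in the profile-based protein representation P′, ω is the weight factor for the sequence-order effect, and θ_k is the k-th sequence-order correlation factor, which can be calculated as:

$$
\theta_k = \frac{1}{L-k} \sum_{i=1}^{L-k} \Theta(R'_i, R'_{i+k}) \tag{12}
$$

where R′_i is the i-th amino acid in P′, L is the length of P′, and k is the distance between two amino acids along P′. Θ(R′_i, R′_{i+k}) represents the score calculated according to seven kinds of physical-chemical properties of amino acids (their values are listed in Supplementary S4), which can be calculated by:

$$
\Theta(R'_i, R'_{i+k}) = \frac{1}{7} \sum_{j=1}^{7} \left[ H_j(R'_i) - H_j(R'_{i+k}) \right]^2 \tag{13}
$$

where H_j(R′_i) and H_j(R′_{i+k}) are the normalized physicochemical property values of amino acids R′_i and R′_{i+k} in property j, which can be calculated by the following equation:

$$
H_j(R'_i) = \frac{H_j^{0}(R'_i) - \frac{1}{20}\sum_{k=1}^{20} H_j^{0}(a_k)}{\sqrt{\frac{1}{20}\sum_{k=1}^{20} \left[ H_j^{0}(a_k) - \frac{1}{20}\sum_{n=1}^{20} H_j^{0}(a_n) \right]^2}} \tag{14}
$$

where H_j^0(R′_i) represents the raw value of physicochemical property j for amino acid R′_i, and a_k (k = 1, 2, …, 20) represents the 20 standard amino acids.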
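The feature construction can be sketched as follows, assuming the classic PseAAC form: 20 normalized composition terms followed by λ sequence-order terms weighted by ω, with each property standardized to zero mean and unit variance over the 20 amino acids. The function names, the example sequence and the single made-up property scale are hypothetical; the study uses the seven properties listed in Supplementary S4.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalize_property(raw):
    """Standardize one property over the 20 amino acids (zero mean, unit variance)."""
    mean = sum(raw.values()) / len(raw)
    std = (sum((v - mean) ** 2 for v in raw.values()) / len(raw)) ** 0.5
    return {aa: (v - mean) / std for aa, v in raw.items()}


def pse_aac(seq, properties, lam=3, w=0.7):
    """Return the (20 + lam)-dimensional PseAAC vector for seq."""
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    thetas = []
    for k in range(1, lam + 1):
        # Average property-based distance between residues k positions apart.
        total = 0.0
        for i in range(len(seq) - k):
            a, b = seq[i], seq[i + k]
            total += sum((h[a] - h[b]) ** 2 for h in properties) / len(properties)
        thetas.append(total / (len(seq) - k))
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


# Hypothetical property scale, used only to make the example runnable.
toy_scale = normalize_property({aa: float(i) for i, aa in enumerate(AMINO_ACIDS)})
vector = pse_aac("MKRAKLLDEVRKA", [toy_scale])
```

With λ = 3 the vector has 23 components, matching the 23 features reported in the Results, and by construction its components sum to 1.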

Support Vector Machine

In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms⁴⁴. Given a set of training samples, the basic task of an SVM is to construct a separating hyperplane that maximizes the margin between the different classes in the training set. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap.

In this study, we adopt the LIBSVM package. The kernel function is set as the radial basis function (RBF), which can be defined as:

$$
K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right) \tag{15}
$$

The two parameters C and γ were optimized on the benchmark dataset by the grid tool in LIBSVM. After optimization, C is set to 8192 and γ is set to 8.0.
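For reference, the RBF kernel in its usual form, K(x, y) = exp(−γ‖x − y‖²), can be written directly in Python. This is only an illustrative sketch of the kernel value with the γ reported above; the function name is hypothetical, and LIBSVM computes the kernel internally.

```python
import math


def rbf_kernel(x, y, gamma=8.0):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


same = rbf_kernel([0.1, 0.2], [0.1, 0.2])  # identical vectors
near = rbf_kernel([0.0], [0.5])            # squared distance 0.25
```

With the LIBSVM command-line tool, the reported settings would correspond to the `svm-train` options `-t 2 -c 8192 -g 8`, where `-t 2` selects the RBF kernel.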

The flowchart of generating the feature vectors and constructing the SVM classifier for iDNAPro-PseAAC is shown in Fig. 6.

Evaluation methodology

How to evaluate the performance of a new predictor is a key question. Three cross-validation methods are often used: the independent dataset test, the subsampling or K-fold (such as 5-fold, 7-fold or 10-fold) test, and the jackknife test. However, considerable arbitrariness exists in the independent dataset test and the K-fold cross-validation. The jackknife test is the least arbitrary and has been widely used in computational genomics and proteomics. In the jackknife test, each protein sequence in the benchmark dataset is singled out in turn as an independent test sample, and the predictor is trained on all the remaining sequences.
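The jackknife (leave-one-out) procedure can be sketched generically as below. The `train` and `predict` callables are placeholders: here a trivial nearest-class-mean learner on one-dimensional toy data stands in for SVM training, purely so that the loop can be run end to end.

```python
def jackknife(samples, labels, train, predict):
    """Leave-one-out: train on all but one sample, then predict the held-out one."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions


def train_centroids(xs, ys):
    """Placeholder learner: the mean of each class (stands in for SVM training)."""
    means = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(members) / len(members)
    return means


def predict_centroid(means, x):
    """Assign the class whose mean is closest to x."""
    return min(means, key=lambda label: abs(means[label] - x))


samples = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
labels = [0, 0, 0, 1, 1, 1]
preds = jackknife(samples, labels, train_centroids, predict_centroid)
```

Because every sample is held out exactly once, the result does not depend on how the data are partitioned, which is the sense in which the jackknife test is the least arbitrary.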

In addition, four metrics, the sensitivity (Sn), specificity (Sp), accuracy (Acc) and Matthews correlation coefficient (MCC), are often used to measure the test quality of a predictor from different angles⁴⁵:

$$
\begin{aligned}
Sn &= \frac{TP}{TP + FN} \\
Sp &= \frac{TN}{TN + FP} \\
Acc &= \frac{TP + TN}{TP + TN + FP + FN} \\
MCC &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{aligned}
\tag{16}
$$

where TP represents the number of true positives; TN, the number of true negatives; FP, the number of false positives; FN, the number of false negatives; Sn, the sensitivity; Sp, the specificity; Acc, the accuracy; and MCC, the Matthews correlation coefficient.
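These four metrics follow directly from the confusion-matrix counts, using their standard definitions. The sketch below uses a hypothetical function name and made-up counts; returning 0 for MCC when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumed convention
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Made-up counts for illustration only.
metrics = binary_metrics(tp=80, tn=70, fp=30, fn=20)
```

Unlike accuracy, MCC takes all four counts into account and ranges from −1 to 1, which makes it informative when the classes are unbalanced.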

In this study, we also use the receiver operating characteristic (ROC) score as a metric. The ROC curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied⁴⁶. A score of 1 denotes perfect separation of positive samples from negative ones, whereas a score of 0 indicates that none of the sequences selected by the algorithm is positive.
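One simple way to obtain the area under the ROC curve, shown here as an illustrative sketch rather than the procedure used in the study, is the pairwise-ranking formulation: the AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function name and the scores are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


# Made-up classifier scores: one positive (0.4) is ranked below one negative (0.6).
auc = roc_auc([0.9, 0.8, 0.4, 0.6, 0.3, 0.2], [1, 1, 1, 0, 0, 0])
perfect = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

The quadratic loop is adequate for a test set of a few hundred proteins such as PDB186; a sort-based computation would be preferable for much larger sets.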

Additional Information

How to cite this article: Liu, B. et al. DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation. Sci. Rep. 5, 15479; doi: 10.1038/srep15479 (2015).

References

Jones, K. A., Kadonaga, J. T., Rosenfeld, P. J., Kelly, T. J. & Tjian, R. A cellular DNA-binding protein that activates eukaryotic transcription and DNA replication. Cell 48, 79–89, 10.1016/0092-8674(87)90358-8 (1987).
Helwa, R. & Hoheisel, J. Analysis of DNA–protein interactions: from nitrocellulose filter binding assays to microarray studies. Anal Bioanal Chem 398, 2551–2561, 10.1007/s00216-010-4096-7 (2010).
Jaiswal, R., Singh, S. K., Bastia, D. & Escalante, C. R. Crystallization and preliminary X-ray characterization of the eukaryotic replication terminator Reb1-Ter DNA complex. Acta Crystallographica Section F 71, 414–418, 10.1107/S2053230X15004112 (2015).
Omichinski, J. et al. NMR structure of a specific DNA complex of Zn-containing DNA binding domain of GATA-1. Science 261, 438–446, 10.1126/science.8332909 (1993).
Lin, C. et al. LibD3C: Ensemble Classifiers with a Clustering and Dynamic Selection Strategy. Neurocomputing 123, 424–435 (2014).
Li, P., Guo, M., Wang, C., Liu, X. & Zou, Q. An overview of SNP interactions in genome-wide association studies. Briefings in Functional Genomics 14, 143–155 (2015).
Bowie, J., Luthy, R. & Eisenberg, D. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253, 164–170, 10.1126/science.1853201 (1991).
Gao, M. & Skolnick, J. DBD-Hunter: a knowledge-based method for the prediction of DNA–protein interactions. Nucleic Acids Research 36, 3978–3992, 10.1093/nar/gkn332 (2008).
Ohlendorf, D. H., Anderson, W. F., Fisher, R. G., Takeda, Y. & Matthews, B. W. The molecular basis of DNA-protein recognition inferred from the structure of cro repressor. Nature 298, 718–723 (1982).
Stawiski, E. W., Gregoret, L. M. & Mandel-Gutfreund, Y. Annotating Nucleic Acid-Binding Function Based on Protein Structure. Journal of Molecular Biology 326, 1065–1079, 10.1016/S0022-2836(03)00031-7 (2003).
Liu, B. et al. PseDNA-Pro: DNA-Binding Protein Identification by Combining Chou’s PseAAC and Physicochemical Distance Transformation. Molecular Informatics 34, 8–17 (2015).
Wang, L. & Brown, S. J. BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Research 34, W243–W248, 10.1093/nar/gkl298 (2006).
Hwang, S., Gou, Z. & Kuznetsov, I. B. DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics 23, 634–636, 10.1093/bioinformatics/btl672 (2007).
Ofran, Y., Mysore, V. & Rost, B. Prediction of DNA-binding residues from sequence. Bioinformatics 23, i347–i353, 10.1093/bioinformatics/btm174 (2007).
Wu, J. et al. Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature. Bioinformatics 25, 30–35, 10.1093/bioinformatics/btn583 (2009).
Kern, S. et al. Identification of p53 as a sequence-specific DNA-binding protein. Science 252, 1708–1711, 10.1126/science.2047879 (1991).
Cai, Y.-d. & Lin, S. L. Support vector machines for predicting rRNA-, RNA- and DNA-binding proteins from amino acid sequence. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics 1648, 127–133, 10.1016/S1570-9639(03)00112-2 (2003).
Lin, C. et al. Hierarchical Classification of Protein Folds Using a Novel Ensemble Classifier. PLoS ONE 8, e56499 (2013).
Wei, L., Liao, M., Gao, X. & Zou, Q. An Improved Protein Structural Prediction Method by Incorporating Both Sequence and Structure Information. IEEE Transactions on Nanobioscience 14, 339–349 (2015).
Liu, B. et al. iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance-Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition. PLoS ONE 9, e106691 (2014).
Liu, B. et al. Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics 30, 472–479 (2014).
Andrea, S., Ondřej, K., Filip, Ž. & Jakub, T. Prediction of DNA-binding propensity of proteins by the ball-histogram method using automatic template search. BMC Bioinformatics 13, S3 (2012).
Kumar, M., Gromiha, M. & Raghava, G. Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinformatics 8, 463 (2007).
Kumar, K. K., Pugalenthi, G. & Suganthan, P. N. DNA-Prot: Identification of DNA Binding Proteins from Protein Sequence Information using Random Forest. Journal of Biomolecular Structure and Dynamics 26, 679–686, 10.1080/07391102.2009.10507281 (2009).
Lin, W.-Z., Fang, J.-A., Xiao, X. & Chou, K.-C. iDNA-Prot: Identification of DNA Binding Proteins Using Random Forest with Grey Model. PLoS ONE 6, e24756, 10.1371/journal.pone.0024756 (2011).
Wei, L. et al. Improved and Promising Identification of Human MicroRNAs by Incorporating a High-quality Negative Set. IEEE/ACM Transactions on Computational Biology and Bioinformatics 11, 192–201 (2014).
Lou, W. et al. Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes. PLoS ONE 9, e86703, 10.1371/journal.pone.0086703 (2014).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402, 10.1093/nar/25.17.3389 (1997).
Zou, Q., Hu, Q., Guo, M. & Wang, G. HAlign: Fast Multiple Similar DNA/RNA Sequence Alignment Based on the Centre Star Strategy. Bioinformatics, 10.1093/bioinformatics/btv177 (2015).
Song, L., Li, D., Zeng, X., Yunfeng Wu, L. G. & Zou, Q. nDNA-prot: Identification of DNA-binding Proteins Based on Unbalanced Classification. BMC Bioinformatics 15, 298 (2014).
Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Research 28, 235–242 (2000).
Wang, C., Hu, L., Guo, M., Liu, X. & Zou, Q. imDC: an ensemble learning method for imbalanced classification with miRNA data. Genetics and Molecular Research 14, 123–133 (2015).
Zhao, X., Zou, Q., Liu, B. & Liu, X. Exploratory predicting protein folding model with random forest and hybrid features. Current Proteomics 11, 289–299 (2014).
Zou, Q., Li, J., Song, L., Zeng, X. & Wang, G. Similarity computation strategies in the microRNA-disease network: A Survey. Briefings in Functional Genomics, 10.1093/bfgp/elv024 (2015).
Zeng, X., Zhang, X. & Zou, Q. Integrative approaches for predicting microRNA function and prioritizing disease-related microRNA using biological interaction networks. Briefings in Bioinformatics, 10.1093/bib/bbv033 (2015).
Zou, Q. et al. Prediction of microRNA-disease associations based on social network analysis methods. BioMed Research International 2015, 810514 (2015).
Shi, H., Wu, Y., Zeng, Z. & Zou, Q. A Discussion of MicroRNAs in Cancers. Current Bioinformatics 9, 453–462 (2014).
Zou, Q., Li, J., Wang, C. & Zeng, X. Approaches for recognition disease genes based on Network. BioMed Research International 2014, 416323 (2014).
Wang, Q. et al. Briefing in family characteristics of microRNAs and their applications in cancer research. BBA–Proteins and Proteomics 1844, 191–197 (2014).
Wang, G. & Dunbrack, R. L. PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Research 33, W94–W98, 10.1093/nar/gki402 (2005).
Liu, B. et al. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA and protein sequences. Nucleic Acids Research W1, W65–W71 (2015).
Liu, B., Liu, F., Fang, L., Wang, X. & Chou, K.-C. repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics 31, 1307–1309 (2015).
Chou, K.-C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Structure, Function and Bioinformatics 43, 246–255, 10.1002/prot.1035 (2001).
Suykens, J. A. K. & Vandewalle, J. Least Squares Support Vector Machine Classifiers. Neural Processing Letters 9, 293–300, 10.1023/A:1018628609742 (1999).
Liu, B. et al. Identification of real microRNA precursors with a pseudo structure status composition approach. PLoS ONE 10, e0121501 (2015).
Liu, B., Chen, J. & Wang, X. Application of Learning to Rank to protein remote homology detection. Bioinformatics, 10.1093/bioinformatics/btv413 (2015).
Szilágyi, A. & Skolnick, J. Efficient Prediction of Nucleic Acid Binding Function from Low-resolution Protein Structures. Journal of Molecular Biology 358, 922–933, 10.1016/j.jmb.2006.02.053 (2006).
Gao, M. & Skolnick, J. A Threading-Based Method for the Prediction of DNA-Binding Proteins with Application to the Human Genome. PLoS Computational Biology 5, e1000567, 10.1371/journal.pcbi.1000567 (2009).

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61300112 and 61272383), the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, the Natural Science Foundation of Guangdong Province (2014A030313695), Strategic Emerging Industry Development Special Funds of Shenzhen (JCYJ20140508161040764) and National High Technology Research and Development Program of China (863 Program) [2015AA015405].

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
Bin Liu, Shanyi Wang & Xiaolong Wang
Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
Bin Liu & Xiaolong Wang

Authors

Bin Liu
Shanyi Wang
Xiaolong Wang

Contributions

B.L. conceived the study, designed the experiments, and participated in drafting the manuscript and performing the statistical analysis. S.Y.W. participated in coding the experiments and drafting the manuscript. X.L.W. participated in performing the statistical analysis. All authors read and approved the final manuscript.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Electronic supplementary material

Supplementary Information

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

About this article

Cite this article

Liu, B., Wang, S. & Wang, X. DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation. Sci Rep 5, 15479 (2015). https://doi.org/10.1038/srep15479

Received: 29 May 2015
Accepted: 28 September 2015
Published: 20 October 2015
DOI: https://doi.org/10.1038/srep15479

This article is cited by

Deep-WET: a deep learning-based approach for predicting DNA-binding proteins using word embedding techniques with weighted features
- S. M. Hasan Mahmud
- Kah Ong Michael Goh
- Watshara Shoombuatong
Scientific Reports (2024)
FTWSVM-SR: DNA-Binding Proteins Identification via Fuzzy Twin Support Vector Machines on Self-Representation
- Yi Zou
- Yijie Ding
- Quan Zou
Interdisciplinary Sciences: Computational Life Sciences (2022)
A sequence-based multiple kernel model for identifying DNA-binding proteins
- Yuqing Qian
- Limin Jiang
- Fei Guo
BMC Bioinformatics (2021)
DP-BINDER: machine learning model for prediction of DNA-binding proteins by fusing evolutionary and physicochemical information
- Farman Ali
- Saeed Ahmed
- Shahid Akbar
Journal of Computer-Aided Molecular Design (2019)
SkipCPP-Pred: an improved and promising sequence-based predictor for predicting cell-penetrating peptides
- Leyi Wei
- Jijun Tang
- Quan Zou
BMC Genomics (2017)
