A machine learning technique for identifying DNA enhancer regions utilizing CIS-regulatory element patterns

Enhancers regulate gene expression and play a crucial role in the synthesis of RNAs and proteins, although they do not directly encode proteins or RNA molecules themselves. Predicting enhancers and their potency is therefore important for understanding the control of gene expression. Given their distance from the target gene, their lack of common motifs, and their tissue/cell specificity, enhancer regions are considered difficult to predict in DNA sequences. Recently, a number of bioinformatics tools have been created to distinguish enhancers from other regulatory elements and to determine their strength. However, the quality of their predictions, and hence their practical application value, still needs to be improved. Based on nucleotide composition and statistical moment-based features, the current study proposes a novel method for identifying enhancers and non-enhancers and evaluating their strength. The proposed method outperformed state-of-the-art techniques in accuracy under fivefold and tenfold cross-validation, achieving 86.5% and 72.3% accuracy in enhancer site and strength prediction respectively. These results indicate that statistical moment-based features can lead to more efficient and successful outcomes. The source code of the current study is available to the research community at https://github.com/csbioinfopk/enpred.


Materials and methods
Benchmark dataset. The benchmark dataset of DNA enhancer sites, originally constructed and used by iEnhancer-2L 26 , was re-used in the proposed method. The dataset draws on nine different cell lines (K562, H1ES, HepG2, GM12878, HSMM, HUVEC, NHEK, NHLF and HMEC) for the collection of enhancers, and 200 bp fragments were extracted from the DNA sequences. Chromatin state annotation was performed with ChromHMM, and the whole-genome profile included multiple histone marks such as H3K27ac, H3K4me1 and H3K4me3. To remove redundancy from the dataset, the CD-HIT 39 tool was used to discard sequences sharing more than 20% similarity. The benchmark dataset includes 2968 DNA sequences, of which 1484 are non-enhancer sequences and 1484 are enhancer sequences. Of the 1484 enhancer sequences, 742 are strong enhancers and 742 are weak enhancers, used for the second-layer classification. Furthermore, the independent dataset used by iEnhancer-5Step 29 was utilized to assess the effectiveness and performance of the proposed model. The independent dataset includes 400 DNA sequences, of which 200 are enhancers (100 strong and 100 weak) and 200 are non-enhancers. Table 1 gives the breakdown of the benchmark dataset. The details of the above-mentioned datasets are provided in the Supplementary Material (see Online Supporting Information S1, Online Supporting Information S2 and Online Supporting Information S3).
It is not always simple to understand the semantics of a piece of data, which in turn reflects the difficulty of developing biological data models. It can be difficult to reach a consensus about the data in a given domain, because different people emphasize different features, use different terminology, and hold different perspectives on how things should be seen. The fact that the biosciences are non-axiomatic, and that different though closely related communities have very different perspectives on the same or similar concepts, makes the situation even more difficult. Biological data models, however, can be useful for creating, making explicit, and communicating precise and in-depth descriptions of data that is already available or soon to be produced. It is hoped that the current study will increase the use of biological data models in bioinformatics, alleviating the management and sharing issues that are becoming more and more problematic.
In statistics-based prediction models, the benchmark dataset typically comprises a training dataset and a testing dataset. Using the benchmark dataset, results were computed from fivefold and tenfold cross-validations. The benchmark dataset is defined in Eq. (1):

D = D+ ∪ D−,  D+ = D+strong ∪ D+weak   (1)

where D+ contains 1484 enhancers, D− contains 1484 non-enhancers, D+strong contains 742 strong enhancers, D+weak contains 742 weak enhancers, and ∪ denotes the "union" symbol of set theory.
Feature extraction. Researchers in medicine and pharmacology need an effective bioinformatics predictor that can formulate a biological sequence as a vector or a discrete model without losing key sequence-order characteristics or sequence-pattern information. The reason, as explained in a comprehensive state-of-the-art review 40 , is that existing machine-learning algorithms cannot handle sequences directly, only vector formulations. However, there is a risk that all the sequence-pattern information might be lost in a discrete vector formulation. To overcome this loss of sequence-pattern information in proteins, Chou proposed the pseudo amino acid composition (PseAAC) 41 . Chou's PseAAC concept has been widely used in almost all areas of bioinformatics and computational proteomics 40 ever since it was proposed. In the recent past, three publicly accessible and powerful software packages, 'propy' 42 , 'PseAAC-Builder' 43 and 'PseAAC-General' 44 , were developed, and the importance and popularity of Chou's PseAAC in computational proteomics has increased further since. 'PseAAC-General' calculates Chou's general PseAAC 45 , while the other two generate Chou's special PseAAC in various modes 46 . Chou's general PseAAC includes not only the feature vectors of all the special modes, but also higher-level feature vectors such as the "Gene Ontology" mode 45 , the "Functional Domain" mode 45 and the "Sequential Evolution" or "PSSM" mode 45 . The success of PseAAC in solving various problems involving peptide/protein sequences encouraged the introduction of PseKNC (Pseudo K-tuple Nucleotide Composition) 47 for generating feature vectors for DNA/RNA sequences 48,49 , which also proved very effective and efficient.
In recent times, a useful, efficient and very powerful webserver called 'Pse-in-One' 50 and its updated version 'Pse-in-One 2.0' 51 were developed; they can generate any preferred feature vector of pseudo components for DNA/RNA and protein/peptide sequences.
In this study, we utilized the Kmer 52 approach to represent the DNA sequences. In the Kmer approach, a DNA sequence is represented by the occurrence frequencies of its 'n' neighboring nucleic acids. Using the sequential model, a DNA sample having 'w' nucleotides is expressed generally as Eq. (2):

S = Y1 Y2 Y3 … Yw   (2)

where Y1 is the first nucleotide of the DNA sample S, Y2 the nucleotide at the second position of occurrence, and so on and so forth until Yw, the last nucleotide of the sample; 'w' is the total length of the DNA sample in nucleotides. Each nucleotide Yv can be any one of the four nucleotides, as described by Eq. (3):

Yv ∈ {A, C, G, T}   (3)

where ∈ is the 'member of' symbol of set theory and 1 ≤ v ≤ w. The components defined by this discrete model capture the relevant, useful nucleotide features and expedite the extraction methods. These components are further used in the statistical moments-based feature extraction methods.
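As a concrete illustration of the Kmer representation described above (a minimal sketch, not the authors' exact implementation; the function name and the choice of overlapping windows with normalized frequencies are assumptions), the occurrence frequencies of all k-mers in a DNA sequence can be computed as follows:

```python
from itertools import product

def kmer_frequencies(seq, k=2):
    """Return the normalized occurrence frequency of every k-mer
    (k neighboring nucleotides) observed in a DNA sequence."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = len(seq) - k + 1          # number of overlapping windows
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return {kmer: c / total for kmer, c in counts.items()}

freqs = kmer_frequencies("ACGTACGT", k=2)
# "AC" occurs twice among the 7 overlapping 2-mers, so freqs["AC"] = 2/7
```

For k = 2 this yields a 16-dimensional feature vector; the frequencies over all k-mers sum to one by construction.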
Statistical moments. Statistical moments are quantitative measures used to study the concentration of key configurations in a collection of data in pattern-recognition problems 53 . Different orders of moments describe different properties of the data: some reveal its eccentricity and orientation, while others estimate its size [54-59] . Various mathematicians and statisticians have formed moments based on well-known distribution functions and polynomials [60-62] , and these moments were utilized to explicate the current problem 63 . The moments used to calculate the mean, variance and asymmetry of a probability distribution are known as raw moments; they are neither location-invariant nor scale-invariant. Central moments convey similar information but are calculated with respect to the centroid of the data; they are therefore location-invariant with respect to the centroid, although they remain scale-variant. The moments based on Hahn polynomials are known as Hahn moments; these are neither location-invariant nor scale-invariant [64-67] . Their sensitivity to the ordering of biological sequence information is precisely why they were chosen: it makes them significant for extracting obscure features from DNA sequences. Such features have been utilized in previous research studies 54,[59-61],[68-73] and have proved robust and effective in extracting core sequence characteristics. The use of scale-invariant moments was consequently avoided in the current study. The values quantified by each method enumerate the data on its own measures.
Furthermore, variations in the characteristics of the data source imply variations in the quantified moment values for arbitrary datasets. In the current study, the 2D version of the aforementioned moments is used, and hence the linear DNA sequence of Eq. (2) is transformed into a 2D notation. The 1D DNA sequence is mapped to a 2D structure in row-major order through Eq. (4):

d = ⌈√z⌉   (4)

where 'z' is the sample sequence length and 'd' is the dimension of the 2-dimensional square matrix. The ordering obtained from Eq. (4) is used to form the matrix M (Eq. 5) having 'm' rows and 'm' columns.
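The row-major 1D-to-2D transformation described above can be sketched as follows (an illustration only; the integer encoding of nucleotides and the zero-padding of trailing cells are assumptions not spelled out in the text):

```python
import math

def to_square_matrix(seq):
    """Map a 1-D DNA sequence into a d x d square matrix in row-major
    order, where d = ceil(sqrt(z)) for a sequence of length z.
    Nucleotides are encoded as integers 1-4 (assumption); trailing
    cells are zero-padded (also an assumption)."""
    code = {"A": 1, "C": 2, "G": 3, "T": 4}
    z = len(seq)
    d = math.ceil(math.sqrt(z))
    matrix = [[0] * d for _ in range(d)]
    for x, nucleotide in enumerate(seq):
        i, j = divmod(x, d)          # row-major: i = x // d, j = x mod d
        matrix[i][j] = code[nucleotide]
    return matrix

M = to_square_matrix("ACGTACGTA")    # z = 9 gives a 3 x 3 matrix
```

A 200 bp enhancer fragment would map to a 15 x 15 matrix under this scheme, since ⌈√200⌉ = 15.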
The transformation from the matrix M to the square matrix M′ is performed using the mapping function 'Ʀ', defined as Eq. (6):

Ʀ(Mx) = M′(i, j)   (6)

If M′ is populated in row-major order, then i = ⌊x/m⌋ + 1 and j = x mod m. Any vector or matrix that represents a pattern can be used to compute the different forms of moments; here the values of M′ are used to compute the raw moments. The raw moments of a 2D function to order (j + k) are calculated from Eq. (7):

Ujk = Σp Σq p^j q^k M′pq   (7)

The raw moments of the 2D matrix, of order (j + k) up to degree 3, yield ten features labeled U00, U01, U10, U11, U02, U20, U12, U21, U30 and U03. The centroid of any data is its center of gravity: the point about which the data is uniformly distributed in all directions in terms of its weighted average 74,75 . The central moments are also computed up to degree 3, using the centroid of the data as the reference point, from Eq. (8):

µjk = Σp Σq (p − a)^j (q − b)^k M′pq   (8)

The degree-3 central moments provide ten distinct features labeled µ00, µ01, µ10, µ11, µ02, µ20, µ12, µ21, µ30 and µ03. The centroid coordinates a and b are calculated from Eqs. (9) and (10):

a = U10 / U00   (9)
b = U01 / U00   (10)

The Hahn moments are computed by transforming the 1D notation into a square matrix, since discrete Hahn (orthogonal) moments are of 2D form and require a two-dimensional square matrix as input. Hahn moments are orthogonal, which implies that they possess reversible properties: using the inverse functions of the discrete Hahn moments, the original data can be reconstructed. This further indicates that the compositional and positional features of a DNA sequence are, to an extent, conserved within the calculated moments. The matrix M′ is used as the 2D input for computing the orthogonal Hahn moments. The Hahn polynomial of order 'n' can be computed from Eq. (11):

h_n^(a,b)(x, N) = (N + b − 1)_n (N − 1)_n Σ_{r=0..n} (−1)^r [(−n)_r (−x)_r (2N + a + b − n − 1)_r] / [(N + b − 1)_r (N − 1)_r r!]   (11)

The Pochhammer symbol (·)_r used above is defined as in Eq. (12):

(a)_r = a (a + 1) … (a + r − 1)   (12)

and is simplified further by the Gamma operator in Eq. (13):

(a)_r = Γ(a + r) / Γ(a)   (13)

The raw values of the Hahn moments are scaled using a weighting function ρ(x) and a square norm d_n², as given in Eq. (14):

ĥ_n^(a,b)(x, N) = h_n^(a,b)(x, N) √(ρ(x) / d_n²),  x = 0, 1, …, N − 1   (14)

The Hahn moments are computed up to degree 3 for the 2-D discrete data as in Eq. (16):

Hjk = Σ_{q=0..N−1} Σ_{p=0..N−1} M′pq ĥ_j^(a,b)(p, N) ĥ_k^(a,b)(q, N),  j + k ≤ 3   (16)

The ten key Hahn moment-based features are represented by H00, H01, H10, H11, H02, H20, H12, H21, H30 and H03. The matrix M′ was thus used to compute ten raw, ten central and ten Hahn moments, up to degree 3, for every DNA sample sequence; these were later unified into the miscellany super feature vector (SFV).
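The raw and central moments of Eqs. (7)-(10) can be sketched directly (a minimal illustration; the Hahn moments, which require the Hahn polynomial recurrence of Eq. (11), are omitted here for brevity):

```python
def raw_and_central_moments(M):
    """Compute the ten raw (U_jk) and ten central (mu_jk) moments of a
    2-D square matrix up to degree 3 (j + k <= 3), following the forms
    of Eqs. (7)-(10): U_jk = sum p^j q^k M[p][q], with the centroid
    (a, b) taken from the first-order raw moments."""
    n = len(M)
    orders = [(j, k) for j in range(4) for k in range(4) if j + k <= 3]
    raw = {(j, k): sum(M[p][q] * (p ** j) * (q ** k)
                       for p in range(n) for q in range(n))
           for (j, k) in orders}
    a = raw[(1, 0)] / raw[(0, 0)]    # centroid row coordinate, Eq. (9)
    b = raw[(0, 1)] / raw[(0, 0)]    # centroid column coordinate, Eq. (10)
    central = {(j, k): sum(M[p][q] * ((p - a) ** j) * ((q - b) ** k)
                           for p in range(n) for q in range(n))
               for (j, k) in orders}
    return raw, central

raw, central = raw_and_central_moments([[1, 2], [3, 4]])
# the first-order central moments vanish by construction
```

Note that by definition the central moments µ10 and µ01 are always zero, which is a quick sanity check on any implementation.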

DNA-position-relative-incident-matrix (D-PRIM).
The ordered locations of the nucleotides in a DNA sequence are of pivotal significance for identification. The relative positioning of nucleotides in a DNA sequence constitutes the core patterns governing its physical features. The DNA sequence is represented by D-PRIM, a matrix of order (4 × 4). The matrix in Eq. (17) is used to extract the position-relative attributes of every nucleotide in the given DNA sequence:

S_D-PRIM = | N_A→A  N_A→C  N_A→G  N_A→T |
           | N_C→A  N_C→C  N_C→G  N_C→T |
           | N_G→A  N_G→C  N_G→G  N_G→T |
           | N_T→A  N_T→C  N_T→G  N_T→T |   (17)
The position occurrence values of nucleotides are represented using the notation N_x→y : the indication score of the nucleotide at the y-th position is determined with respect to the first occurrence of the x-th nucleotide in the sequence, and the nucleotide type 'y' substitutes this score in the biological evolutionary process.

The DNA-reverse-position-relative-incident-matrix (D-RPRIM). It often happens in cellular biology that more than one DNA sequence evolves from the same ancestor, mostly resulting in homologous sequences. The performance of a classifier is heavily affected by such homologous sequences, so sequence-similarity searching is reliable and effective for producing accurate results. In machine learning, accuracy and efficiency depend heavily on how meticulously and thoroughly the algorithms extract the most pertinent features from the data; learning algorithms can uncover and adapt to the most obscure patterns embedded in the data during the learning phase. The procedure followed for computing D-PRIM was also used to compute D-RPRIM, but with the DNA sequence in reverse order, the occurrence positional values again representing the 4 native nucleotides in alphabetical order. This procedure further uncovers hidden patterns for prediction and alleviates ambiguities between similar DNA sequences. The 2D matrix D-RPRIM is likewise of order (4 × 4) with 16 unique coefficients and is defined by Eq. (18). Similarly, 30 moments (10 raw, 10 central and 10 Hahn), up to degree 3, were computed using the 2D S_D-RPRIM matrix, yielding 30 further features alongside its 16 unique coefficients, which were also incorporated into the miscellany Super Feature Vector (SFV).
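The text does not spell out the exact formula for N_x→y, so the sketch below rests on a loudly-flagged ASSUMPTION: that entry (x, y) accumulates the positions of nucleotide y measured relative to the first occurrence of nucleotide x, with D-RPRIM applying the same procedure to the reversed sequence. It illustrates the shape of the computation, not the authors' exact definition:

```python
def d_prim(seq, reverse=False):
    """Sketch of the 4 x 4 D-PRIM / D-RPRIM matrix.  ASSUMPTION: entry
    (x, y) sums the 0-based positions of nucleotide y relative to the
    first occurrence of nucleotide x; the paper's exact scoring may
    differ.  reverse=True gives the D-RPRIM variant."""
    if reverse:
        seq = seq[::-1]
    bases = "ACGT"
    first = {b: seq.find(b) for b in bases}   # -1 if the base is absent
    prim = [[0] * 4 for _ in range(4)]
    for x, bx in enumerate(bases):
        if first[bx] < 0:
            continue                           # base never occurs
        for pos, by in enumerate(seq):
            prim[x][bases.index(by)] += pos - first[bx]
    return prim

prim = d_prim("ACGTACGT")
rprim = d_prim("ACGTACGT", reverse=True)
```

Whatever the precise scoring, the result is a (4 × 4) matrix with 16 coefficients, which is then fed to the same moment computations as the sequence matrix M′.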
Frequency-distribution-vector (FDV). The frequency distribution vector is computed from the occurrence distribution of every nucleotide and is defined as in Eq. (19):

FDV = (ρ1, ρ2, ρ3, ρ4)   (19)

where ρi is the occurrence frequency of the i-th (1 ≤ i ≤ 4) nucleotide. Furthermore, these measures make extensive use of the relative positions of nucleotides in a sequence. The miscellany Super Feature Vector (SFV) includes these four FDV features as unique attributes. Violin plots of the nucleotide composition and overall frequency normalization are shown in Figs. 4a-d and 5.
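A minimal sketch of the FDV (whether the study normalizes the counts by sequence length is not stated; raw counts are shown here as an assumption):

```python
def frequency_distribution_vector(seq):
    """FDV: the occurrence count rho_i of each of the four native
    nucleotides, in alphabetical order (A, C, G, T)."""
    return [seq.count(base) for base in "ACGT"]

fdv = frequency_distribution_vector("ACGTAACG")   # -> [3, 2, 2, 1]
```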
D-AAPIV (DNA-accumulative-absolute-position-incidence-vector). While the frequency distribution vector stores the distributional information of nucleotides, D-AAPIV captures the accumulative positional patterns hidden in DNA sequences. Here αi is any element of D-AAPIV, from a DNA sequence Sj having 'n' total nucleotides, and can be calculated using Eq. (21):

αi = Σk pk   (21)

where the sum runs over all positions pk at which the i-th nucleotide occurs in Sj. D-RAAPIV (DNA-reverse-accumulative-absolute-position-incidence-vector). D-RAAPIV is calculated with the same method as D-AAPIV but using the reversed DNA sequence as input. This vector is calculated to find the deep, hidden features of every sample with respect to reverse relative positional information. D-RAAPIV is formed as in Eq. (24) from the reversed DNA sequence and generates four valuable features. These four critical features from D-RAAPIV are also added to the miscellany Super Feature Vector (SFV).
Here αi is any element of D-RAAPIV, from a DNA sequence Sj having 'n' total nucleotides. The combined features are more robust to noise and effective for the sensitive DNA enhancer sites, as shown in Fig. 6; together, they efficiently differentiate enhancers from non-enhancer sites.
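The accumulative position vectors above can be sketched as follows (the use of 1-based positions is an assumption, as is the function name):

```python
def d_aapiv(seq, reverse=False):
    """D-AAPIV: for each nucleotide (A, C, G, T), alpha_i accumulates
    the absolute positions at which it occurs in the sequence, as in
    Eq. (21).  reverse=True runs the same computation on the reversed
    sequence, giving D-RAAPIV.  1-based positions are an assumption."""
    if reverse:
        seq = seq[::-1]
    return [sum(pos for pos, base in enumerate(seq, start=1) if base == b)
            for b in "ACGT"]

aapiv = d_aapiv("ACGT")                   # -> [1, 2, 3, 4]
raapiv = d_aapiv("ACGT", reverse=True)    # reversed "TGCA" -> [4, 3, 2, 1]
```

Each call yields four features, so D-AAPIV and D-RAAPIV together contribute eight attributes to the SFV.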

Classification algorithm. Random forests.
In the past, ensemble learning methods have been applied in many bioinformatics-related research studies 76,77 and have produced highly efficient outcomes. Ensemble learning methods utilize many classifiers in a classification problem and aggregate their results. The two most commonly used methods are boosting 78,79 and bagging 80 , both of which perform classification using trees. In boosting, successive trees assign extra weight to points predicted incorrectly by the previous classifiers, and a weighted vote decides the final prediction. In bagging, successive trees do not rely on previous trees; each tree is constructed independently from a bootstrap sample of the data, and a simple majority vote decides the final prediction. In bioinformatics and related fields, random forests have grown in popularity as a classification tool and have performed admirably in extremely complex data environments. Each tree in a random forest is built from a random sample of the observations, typically a bootstrap sample or a subsample of the original data. Out-of-bag (OOB) observations are those not included in the bootstrap sample or subsample, respectively. The so-called OOB error can be produced, for instance, by using the OOB observations to estimate the random forest prediction error. The OOB error is frequently used to gauge how well the random forest classifier predicts outcomes, and it aids in identifying model uncertainties. The OOB error has the benefit of using the entire original sample both for building the random forest classifier and for estimating its error. To add more randomness to bagging, Leo Breiman 81 constructed random forests, which change the construction of the classification trees by building each tree from a different bootstrap sample of the data.
In standard classification trees, each node is split using the best split among all the variables. In random forests, however, each node is split using the best among a subset of predictors chosen randomly at that node (Fig. 7 shows the structure of the random forest classifier). Compared with many other classifiers, such as support vector machines, discriminant analysis and neural networks, this somewhat counterintuitive strategy performs very well and is robust against overfitting 76 .
Algorithm: supervised learning using random forest. The random forest classifier was implemented with the Scikit-Learn 82 library in Python for fitting the training and simulations of the proposed method. The number of trees was increased from the default parameter value of 10 to 100: the "n_estimators" parameter was optimized with hyperparameter tuning, searching the space (5-500) using the successive halving technique in scikit-learn 82 , which settled on an optimal value of 100. A key finding observed during experimentation was that forests with more than 100 trees contribute minimally to the accuracy of the classifier but can substantially increase the overall size of the model. Figure 8a illustrates a flowchart of the overall process of the proposed method.
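The tuning step described above can be sketched with scikit-learn's successive-halving search (a sketch only: the synthetic feature matrix stands in for the study's SFV features, and the grid values are assumptions within the stated 5-500 search space):

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for the moment-based SFV feature matrix (assumption).
X, y = make_classification(n_samples=300, n_features=34, random_state=1)

# Successive halving: candidate tree counts compete on growing resource
# budgets, and only the best-performing half survives each round.
search = HalvingGridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={"n_estimators": [5, 10, 50, 100, 500]},
    factor=2,
    random_state=1,
)
search.fit(X, y)
clf = search.best_estimator_   # the paper reports 100 trees as optimal
```

On the real SFV features this search selected n_estimators = 100; on other data the winning value will of course differ.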
Out-of-bag estimation. The OOB error is frequently asserted to be an unbiased estimator of the true error rate. Because each tree in a random forest is constructed from a different sample of the original data, every observation is "out-of-bag" for some of the trees. The prediction for an observation can then be derived using only those trees for which the observation was not used in the construction. Each observation thus receives a classification, and the error rate can be calculated from these predictions; the resulting error rate is referred to as the OOB error. Breiman 81 was the first to propose this process, and it has since gained widespread acceptance as a reliable technique for error estimation in random forests. When training the random forest classifier with bootstrap aggregation, each new tree is fitted from a bootstrap sample of the training observations zi = (xi, yi). The out-of-bag (OOB) error is the average error for each zi, calculated using predictions from the trees that do not contain zi in their respective bootstrap sample. This makes it possible to fit and validate the random forest classifier while it is being trained. The OOB error was calculated at the addition of each new tree during training; from the resulting Fig. 8b, a practitioner can roughly determine the value of n_estimators at which the error stabilizes. The scikit-learn 82 library was used to compute the out-of-bag error estimate.
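In scikit-learn the OOB estimate described above is a single flag away (again a sketch on synthetic stand-in data, not the study's features):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for the benchmark features (assumption).
X, y = make_classification(n_samples=300, n_features=34, random_state=1)

# oob_score=True scores each training sample using only the trees whose
# bootstrap sample did not contain it, exactly the OOB scheme above.
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=1)
forest.fit(X, y)
oob_error = 1.0 - forest.oob_score_   # the OOB error estimate
```

Refitting with increasing n_estimators and recording oob_error at each step reproduces the kind of stabilization curve shown in Fig. 8b.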

Ethical approval. This article does not contain any studies with human participants or animals performed by any of the authors.

Experiments and results
To assess and verify the model and analyze its performance, several evaluation methods are used. These methods evaluate the classifiers using inspection attributes based on the outcomes of the classification assessments and estimates.

Cross-validation. k-fold cross-validation. The k-fold cross-validation (KFCV) technique, also known as rotation estimation, is most commonly used by practitioners for estimating classification error. KFCV splits a dataset into 'K' randomly selected folds of approximately equal size. The prediction error of the fitted model is calculated by predicting the k-th part of the data from a model fitted on the other K − 1 parts; the K error estimates are combined by repeating the same procedure for each k = 1, 2, …, K. Although the generalization performance of a classifier can be estimated almost without bias using the jackknife test, that test has two drawbacks: first, the variance is high, since the estimates across the datasets are very similar to each other; second, it is computationally expensive, since n estimates must be computed, where n is the total number of observations in the dataset. The fivefold and tenfold cross-validation tests have proven to be a good compromise between computational requirements and impartiality.
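The fivefold and tenfold procedures can be sketched with scikit-learn (synthetic stand-in data; the real study runs this on the SFV features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the benchmark features (assumption).
X, y = make_classification(n_samples=300, n_features=34, random_state=1)
clf = RandomForestClassifier(n_estimators=100, random_state=1)

cv_acc = {}
for k in (5, 10):   # the fivefold and tenfold tests used in the study
    # each fold holds out 1/k of the data while the other k-1 parts fit
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=1)
    cv_acc[k] = cross_val_score(clf, X, y, cv=folds).mean()
```

StratifiedKFold keeps the enhancer/non-enhancer class balance in every fold, which matters for a dataset that is split 1484/1484.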
In KFCV tests, the selection of 'K' is a significant attribute. Cross-validation tests with K = 5 and K = 10 have been used to test prediction errors in many research studies. The 5-fold and 10-fold tests produced accurate results for our proposed model, proving much better than the state-of-the-art methods; these results are listed in Tables 4 and 5. Here TP (true positives), TN (true negatives), FP (false positives) and FN (false negatives) represent the outcomes of the cross-validation tests. Unfortunately, the conventional formulations of the metrics in Eq. (24) lack intuitiveness, and many scientists have faced difficulties understanding these measures, especially the MCC. To ease this difficulty, the conventional equations were converted by Xu 83 and Feng 84 into Chou's four intuitive equations, using the symbols introduced by Chou 85 : Y+, Y−, Y+− and Y−+. The description of these symbols is given in Table 2. From the correspondence in Table 2, we can define Eq. (25):

Y+− = FN,  Y−+ = FP,  Y+ = TP + FN,  Y− = TN + FP   (25)

and, from the same correspondence, Eq. (26):

Sn = 1 − Y+−/Y+
Sp = 1 − Y−+/Y−
Acc = 1 − (Y+− + Y−+)/(Y+ + Y−)
MCC = (1 − (Y+−/Y+ + Y−+/Y−)) / √((1 + (Y−+ − Y+−)/Y+)(1 + (Y+− − Y−+)/Y−))   (26)

Eq. (26) has the same meaning as Eq. (24) but is easier to understand and more intuitive. Table 3 gives a detailed description of these equations.
The metrics in Table 3 are not applicable to multi-label prediction models; they are only useful for single-label systems. A different set of metrics exists for multi-label systems and has been used by various researchers [86-88] . The comparison of existing classifiers with the proposed method is given in Tables 4, 5 and 6.
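As a worked illustration of Chou's intuitive formulation (the counts below are synthetic, not the study's results; Y+−/Y−+ follow the Table 2 definitions):

```python
import math

def metrics(y_plus, y_minus, y_plus_minus, y_minus_plus):
    """Sn, Sp, Acc and MCC in Chou's intuitive symbols:
    Y+  (y_plus):        total enhancers investigated
    Y-  (y_minus):       total non-enhancers investigated
    Y+- (y_plus_minus):  enhancers wrongly predicted as non-enhancers
    Y-+ (y_minus_plus):  non-enhancers wrongly predicted as enhancers"""
    sn = 1 - y_plus_minus / y_plus
    sp = 1 - y_minus_plus / y_minus
    acc = 1 - (y_plus_minus + y_minus_plus) / (y_plus + y_minus)
    mcc = (1 - (y_plus_minus / y_plus + y_minus_plus / y_minus)) / math.sqrt(
        (1 + (y_minus_plus - y_plus_minus) / y_plus)
        * (1 + (y_plus_minus - y_minus_plus) / y_minus))
    return sn, sp, acc, mcc

# Perfect prediction on the benchmark class sizes: Y+- = Y-+ = 0
sn, sp, acc, mcc = metrics(1484, 1484, 0, 0)   # all four equal 1
```

The intuition promised by the formulation is immediate: Sn falls linearly with the fraction of enhancers misclassified, and MCC hits 0 exactly when half of each class is wrong, i.e. when prediction is no better than chance.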

Results and discussions
The classification algorithms and their prediction results on the benchmark dataset are shown in Tables 4, 5 and 6. iEnhancer-EL 89 and iEnhancer-2L 26 produced better outcomes using ensemble classifiers, achieving accuracies of 78.03% and 76.89% respectively and predicting strong enhancers with accuracies of 65.03% and 61.93% respectively. EnhancerPred 27 achieved 80.82% accuracy using SVMs and produced slightly better results in predicting strong enhancers, with 62.06% accuracy. Similarly, iEnhancer-2L-Hybrid 90 and iEnhancer-5Step 29 improved on these results, acquiring 77.86% and 82.3% accuracy respectively and identifying strong enhancers with 65.83% and 68.1% accuracy respectively. In contrast, the currently proposed method achieved 91.68% and 84.53% accuracy in predicting enhancers and their strength respectively, after utilizing obscure features from statistical moments and random forest classification with fivefold cross-validation tests (see Table 4 and Fig. 9 for ROCs). Furthermore, a tenfold cross-validation test was also conducted with the random forest classifier on the benchmark dataset; the obtained accuracy results are listed in Table 5. The ROCs of the tenfold cross-validation tests are shown in Figs. 10 and 11, and the violin plots of the fivefold cross-validation tests in Fig. 12. In addition to the cross-validation tests, an independent test was performed using the independent dataset; the comparison of the proposed model with state-of-the-art methods on the independent dataset is listed in Table 6 and the ROC is shown in Fig. 13. Furthermore, a jackknife test was performed on these datasets; a detailed comparison of selected machine learning algorithms under the jackknife test is given in Table 7. The Precision-Recall (PR) curves for enhancer and strength prediction are shown in Figs. 14 and 15 respectively. The proposed method is based on feature sets evaluated using Hahn moments, which the random forest classifier can classify quickly; it is very efficient compared with previous methods, which could not produce better results for the computational cost of their training and testing.

Table 2. Description of the symbols used in Eqs. (25) and (26).

Symbol | Description
Y+     | Total number of enhancers investigated
Y+−    | Number of enhancers incorrectly predicted as non-enhancers
Y−     | Total number of non-enhancers investigated
Y−+    | Number of non-enhancers incorrectly predicted as enhancers

Table 3. Interpretation of the metrics defined in Eq. (26).

When                        | Then               | Description
Y+− = 0                     | Sn = 1             | None of the enhancers is predicted as a non-enhancer
Y+− = Y+                    | Sn = 0             | All of the enhancers are incorrectly predicted as non-enhancers
Y−+ = 0                     | Sp = 1             | None of the non-enhancers is incorrectly predicted as an enhancer
Y−+ = Y−                    | Sp = 0             | All of the non-enhancers are incorrectly predicted as enhancers
Y+− = 0 and Y−+ = 0         | ACC = 1, MCC = 1   | None of the enhancers and none of the non-enhancers is incorrectly predicted
Y+− = Y+ and Y−+ = Y−       | ACC = 0, MCC = −1  | All of the enhancers and all of the non-enhancers are incorrectly predicted
Y+− = Y+/2 and Y−+ = Y−/2   | MCC = 0            | The overall prediction is no better than a random prediction outcome

Web-server.
As observed in past studies [91-95] , the development of a web-server is highly significant and useful for building more useful prediction methodologies. Efforts for user-friendly webservers have therefore been made in the past 72,96-99 to provide ease to biologists and scientists in drug discovery. The software developed for the proposed method, built with Python, Scikit-Learn and Flask, is accessible at https://github.com/csbioinfopk/enpred. A webserver for the current study will be provided to the research community in the near future.

Conclusion
In the proposed research, an efficient model for predicting enhancers and their strength using statistical moments and a random forest classifier is developed. Many methods have been proposed in the recent past to predict enhancers, but our method has proved more accurate than the existing state-of-the-art methods. Our method achieved accuracies of 91.68% and 84.53% for enhancer and strong-enhancer classification using 5-fold tests on a benchmark dataset, which is currently the highest accuracy reported for the prediction of enhancers and their strength.