Analysis and prediction of super-enhancers using sequence and chromatin signatures

Background Super-enhancers are clusters of active enhancers densely occupied by the Mediators, transcription factors and chromatin regulators, control expression of cell identity and disease associated genes. Current studies demonstrated the possibility of multiple factors with important roles in super-enhancer formation; however, a systematic analysis to asses the relative contribution of chromatin and sequence features of super-enhancers and their constituents remain unclear. In addition, a predictive model that integrates various types of data to predict super-enhancers has not been established. Results Here, we integrated diverse types of genomic and epigenomic datasets to identify key signatures of super-enhancers and their constituents and to investigate their relative contribution. Through computational modelling, we found that Cdk8, Cdk9 and Smad3 as new key features of super-enhancers along with many known. Comprehensive analysis of these features in embryonic stem cells and pro-B cells revealed their role in the super-enhancer formation and cellular identity. Further, we observed significant correlation and combinatorial predictive ability among many cofactors at the constituents of super-enhancers. By utilizing these features, we developed computational models which can accurately predict super-enhancers and their constituents. We validated these models using cross-validation and also independent datasets in four human cell-types. Conclusions Our analysis of these features and prediction models can serve as a resource to further characterize and understand the formation of super-enhancers. Taken together, our results also suggest a possible cooperative and synergistic interactions of numerous factors at super-enhancers and their constituents. We have made available our analysis pipeline as an open-source tool with a command line interface at https://github.com/asntech/improse.

enhancers across 86 different human cell-types and tissues due to its availability [18]. Other studies used the coactivator protein P300 to define super-enhancers [28,29]. However, the knowledge about these factors' ability to define a set of super-enhancers in a particular cell-type and their relative and combinatorial importance remains limited. Master transcription factors which might form the superenhancer domains are largely unknown for most of the cell-types, while performing ChIP-seq for the Mediator complex is difficult and costly. Current studies demonstrated the possibility of multiple cofactors with important roles in super-enhancer formation; however, a predictive model that integrates various types of data to predict super-enhancers and their constituents (enhancers within a super-enhancer) has not been established. In addition, the degree to which the sequence-specific features of constituents by itself explains the differences between super-enhancers and typical enhancers remains unknown.
Herein, to identify key features of super-enhancers and to investigate their relative contribution to predict super-enhancers, we integrated diverse types of publicly available datasets, including ChIPseq data for histone modifications, chromatin regulators and transcription factors, DNase I hypersensitive sites and genomic data. Using correlation analysis and computational modelling, we found that Med1, Med12, H3K27ac, Brd4, Cdk8, Cdk9, p300 and Smad3 were significantly correlated and had a higher predictive importance. By utilizing these features, we developed imPROSE (integrated methods for prediction of super-enhancers) to predict super-enhancers and their constituents from a list of enhancers. We implemented and compared six different state-of-the art learning models and validated them using 10-fold stratified cross validation as well as by independent datasets in four human cell-types. imPROSE trained on Smad3 and H3K27ac data in mESC predicts more cell-type-specific super-enhancers in pro-B cells as compared with H3K27ac-based method (ROSE). We also performed a genome-wide analysis of Cdk8, Cdk9 and Smad3 binding in mESC and pro-B cells to analyse and assess their relative importance in defining super-enhancers. By using ChIP-seq data, RNA-seq based gene expression data, Gene Ontology (GO) and motif analyses we found that these factors differentiate super-enhancers from typical enhancers. Our prediction model and derived features can be further used as a platform to precisely define super-enhancers in other cell-types.

Analysis of features including chromatin and transcription factors
Studies have shown that super-enhancers are occupied by various cofactors, chromatin regulators, histone modifications, RNA polymerase II and transcriptions factors [16,18]. An understanding of the occupancy of these factors at the constituents of super-enhancers and typical enhancers is lacking.
We extensively analysed 32 publicly available ChIP-seq and DNase-seq datasets to unveil their association with the constituents of super-enhancers and typical enhancers in mouse embryonic stem cells (mESC). We found that most of these factors, which are enriched in super-enhancers, were also highly enriched in the constituents of super-enhancers relative to typical enhancers ( Fig. 1a; Figure   S1 in Additional file 1). It is understandable to see that Oct4, Sox2 and Nanog were nearly equally enriched across the constituents of super-enhancers and typical enhancers because these constituents are defined by intersecting OSN co-bound regions.
Through correlation analysis, we found that most of the factors were highly correlated at the constituents of super-enhancers compared to typical enhancers (Fig. 1b). This suggests a possible combinatorial interplay among these cofactors at super-enhancers and that could be the reason that super-enhancers are more active and sensitive to perturbation as compared to typical enhancers.

Analysis of sequence specific features
Super-enhancers differ from typical enhancers in terms of size, ChIP-seq density of various cofactors, TF content, ability to activate transcription, and sensitivity to perturbation [16][17][18]. But, to what extent constituents of super-enhancers differ from the constituents of typical enhancers in terms of sequence composition remains unknown. To gain insights into their biological functions, we sought to identify DNA sequence signatures of constituent enhancers. We tested GC-content, repeat fraction, size, and phastCons across the constituents of super-enhancers and typical enhancers in mouse ESC and pro-B cells.
Previous studies have shown that GC-rich regions have distinct features including frequent TF binding [32], active conformation [33] and nucleosome formation [34]. We found that constituents of super-enhancers are significantly more GC-rich than the constituents of typical enhancers (p-value < 2.2e-16, Wilcoxon rank sum test) (Fig. 1c). This suggests that GC content has an important role in super-enhancers formation and it can be a defining feature to distinguish them from typical enhancers.
Enhancers, larger than 3 kb have been shown to be cell-type-specific and are known as stretch enhancers [26]. We checked the size (bp) of constituents and found that constituents of superenhancers are significantly larger than the constituents of typical enhancers (p-value < 2.2e-16, Wilcoxon rank sum test) (Fig. 1d). A previous study showed that majority of super-enhancers do overlap with stretch enhancers [35]. Taken together, these results suggests that super-enhancers are actually clusters of stretch enhancers.
Enhancers are hardly conserved across mammalian genomes and evolved recently from ancestral DNA exaptation, rather than lineage-specific expansions of repeat elements [36]. We did not find any significant difference in conservation at constituents of super-enhancers and typical enhancers in mESC (p-value = 0.6285, Wilcoxon rank sum test), but in pro-B cells the conservation score was statistically significant (p-value < 1.7e-4, Wilcoxon rank sum test) (Fig. 1e). Similarly, there was no significant difference in repeat fraction at constituents of both super-enhancers and typical enhancers in mESC (p-value = 0.0202, Wilcoxon rank sum test) and pro-B cells (p-value = 0.8976, Wilcoxon rank sum test) (Fig. 1f).

Feature-ranking revealed previously known and new features of super-enhancers
With the increasing discovery of factors associated with super-enhancers, the determination of their relative importance in defining super-enhancers is important. Hence, we ranked chromatin and transcription factors to find a minimal optimal subset, which can be used to optimally distinguish super-enhancers from typical enhancers. We used a random-forest based approach, Boruta [37] to assess the importance of each feature by ranking them based on their predictive importance (Fig. 2a) (Methods). We also used an out-of-bag approach to calculate the relative importance of each feature and achieved almost identical results ( Figure S5 in Additional file 1).
After ranking chromatin features, we found Brd4, H3K27ac, Cdk8, Cdk9, Med12 and p300 as the six most important factors with Brd4 and H3K27ac as the top two most informative factors (Fig. 2a). It was particularly interesting to observe that Cdk8 and Cdk9 were ranked as the third and fourth most informative features, respectively. Cdk9, a subunit of the positive transcription elongation factor b (P-TEFb) has been found in enhancers and promoters of active genes along with the Mediator coactivator [17]. Cdk8, a subunit of Mediator complex, positively regulates precise steps in the assembly of transcriptional elongation, including the recruitment of P-TEFb and Brd4 [38].
Previous studies have shown that five ESC transcription factors (Sox2, Oct4, Nanog, Esrrb and Klf4) and other TFs (Smad3, Stat3, Tcf3, Nr5a2, Prdm14 and Tcfcp2l1) were enriched in superenhancers [16,18]. As these transcription factors are specific for ESC biology, we ranked them separately from other cofactors to find their relative importance in ESC. Surprisingly, Smad3 turn to be the most informative among other transcription factors including Klf4 and Esrrb which were previously described as key defining features of super-enhancers [16] (Fig. 2b). It will be interesting to further understand the importance of Cdk8, Cdk9 and Smad3 in the formation of super-enhancers.
We also ranked the chromatin and transcription factors together, not surprisingly Med1 was ranked as most informative feature followed by H3K27ac, Brd4, Cdk8, Cdk9. We found that Smad3 was ranked higher than p300 and also ESC specific master TFs (Fig. 2c). To test this in more differentiated cell -types, we have ranked the several factors, including H3K27ac, H3K4me1, H3K4me3, p300, Smad3, PU.1, Foxo1, Ebf1, Pol2 and DNaseI in pro-B cell. Interestingly, we found that Smad3 turn to be the most informative feature followed by PU.1, p300 and H3K27ac (Fig. 2d).
This shows that Smad3 might be more informative feature to distinguish super-enhancers in differentiated cells.

Comparison of six different state-of-the-art machine learning models
We compared six different state-of-the-art, supervised, machine learning models, including Random Forest, linear SVM, k-NN, AdaBoost, Naive Bayes and Decision Tree. We used chromatin, transcription factors and sequence-specific features to train these models individually and evaluated their performance using 10-fold cross validation. The parameters used for each model can be found in the methods section.
Using all chromatin, transcription factors and sequence-specific features, Random Forest performed well with AUC=0.98, while Naive Bayes performed poorly with AUC=0.85 (Fig. 3b, Figure   S2D in Additional file 1). Linear SVM and k-NN performed equally with AUC=0.95, while Decision Tree achieved AUC=0.91. We found that AdaBoost performed almost equally with AUC=0.96 as compared to Random Forest with AUC=0.98, but we achieved stable precision and recall for Random Forest.
We further compared these models, using individual features as well as different types of features and found feature-type-specifics for each model ( Figure S3; and Nanog) co-bound sites. To avoid over-fitting, we used a hybrid-sampling approach to balance the training and test data (Additional file 1). The prediction accuracy was assessed using 10-fold stratified cross-validation (Methods).
We investigated the individual predictive power of features using all the six models (Table S2 in Additional file 1). The AUC (Area Under the Curve) scores reported in this manuscript are based on the Random Forest model, unless stated otherwise. The features which were ranked as more important predictors in (Fig. 2a,  This shows that combinatorial information of cofactors is more effective at predicting superenhancers than the individual features alone. We performed the combinatorial analysis based on feature types in the following section. We also noticed that the top ranked features, including Cdk8, Cdk9, p300, CBP, Smad3 and Brd4 are highly correlated with Med1, at the constituents of superenhancers and typical enhancers (Fig. 1b).

Prediction using sequence-specific features
We used genomic features, including conservation score (phastCons), GC content and repeat fraction, and investigated their individual and combinatorial predictive power. The model achieved AUC=0.58 for phastCons, AUC=0.63 for GC content and AUC=0.64 for repeat fraction (Fig. 3e) We also tested prediction accuracy using a sequence specific k-mer based approach [39]. We

Combinatorial predictive ability of chromatin and genomic features
Previous studies have shown a combinatorial interplay between multiple histone modifications, transcriptions factors and DNA motifs, and this can be functionally informative [40,41]. We noticed significant correlation among many chromatin regulators and TFs at the constituents of superenhancers. This suggests the existence of a combinatorial relationship among these factors, which might dictate an accurate explanation for their role in super-enhancer formation. This combinatorial information could also be more predictive than the individual information of each factor.
We therefore investigated the combinatorial predictive power of chromatin, TFs and sequencespecific features. We tested 22 different combinations of these features and reported precision, recall, F1-score, AUC, and PRC (Fig. 3f) 3g). It was particularly interesting to see that a model trained on genomic features, including GC content, phastCons and repeat fraction, achieved AUC=0.81. We also examined the combinatorial importance of these grouped features based on their functionality and type ( Figure S2E in Additional file 1). The increase in the model AUC is statistically significant (p-value = 0.001, Wilcoxon rank sum statistic) ( Figure S6 in Additional file 1).
Our analysis shows that the combinatorial information greatly increased the predictive power of the models. Further, the sequence-specific features alone are reliable predictors, and the addition of sequence-specific features to other features greatly enhanced their predictive power.

Model validation using independent datasets
The above described models performed well on mESC data using 10-fold cross validation as a validation strategy. To further validate, we used independent datasets, which were not seen by the models during the training. We used publicly available data in four human tumor cell-types, including B-cell lymphoma (P493-6), multiple myeloma (MM1.S), small cell lung carcinoma (H2171) and glioblastoma (U87). We chose these cell-types because ChIP-seq data for MED1 and the two topranked chromatin features, BRD4 and H3K27ac, were publicly available (Table S2 in Additional file 1).
Initially, we used H3K27ac ChIP-seq peaks 2 kb upstream and downstream of the transcription start site (TSS), to define constituent enhancers and ranked those constituents based on MED1 ChIP-seq signal to define super-enhancers as described in [16][17][18].
First, we trained the model on ESC data and tested it on each of the four human cell-type data and achieved AUC=0.92, 0.90, 0.90 and 0.86 for P493-6, MM1.S and U87 cells, respectively (Fig. 3h). We also checked the classification accuracy after combining five cell-types data by training the model on four cell-types data and testing it on one of the remaining cell-type data and repeated this for all the combinations ( Figure S2H in Additional file 1). We achieved the highest AUC=0.95 for the model tested on P493-6 cell-type data and the lowest AUC=0.88 for the model tested U87 cell-type data ( Figure S2F in Additional file 1). The classification measures, including precision, recall, F1-score and AUC for each tested cell-type after training the model on remaining four cell-types can be found in (Table S3 in Additional file 1).
We next trained the model on one genome data and tested on another. We used four human celltypes (P493-6, Plasma cell, H2171 and U87), and one mouse cell-type (mESC) data. The model trained on mouse cell-type data tested it on human cell-type data, achieved AUC=0.90 (Fig. 3i). The model trained on human cell-type data and tested on mouse cell-type data, achieved AUC=0.85.
We also tested whether a model trained on constituent data can predict the stitched regions and vice versa. The model, trained with H3K27ac data, accurately predicted the stitched super-enhancers with AUC=0.92 (Fig. 3j). The same model when trained on stitched data and tested on constituent data, performed poorly with AUC=0.68.

Genome-wide profiles of Cdk8, Cdk9 and Smad3 at super-enhancers
Through the ranking of chromatin and transcription factors, we found that Cdk8, Cdk9 and Smad3 were important features along many known signatures of super-enhancers, including H3K27ac, Brd4, Med12 and p300 which are well characterized at super-enhancers [16][17][18]28]. However, the genome-wide profiles of Cdk8, Cdk9 and Smad3 are not well characterized at super-enhancers.
Hence, we investigated the genome-wide profiles of Cdk8, Cdk9 and Smad3 at super-enhancers, identified by using Med1 in mESC. We found that, ChIP-seq binding sites of Cdk8, Cdk9 and Smad3 were highly co-localized with Med1, Brd4, H3K27ac, p300 and DNaseI (Fig. 4a, b). Like Med1, the ChIP-seq density for Cdk8, Cdk9 and Smad3 is exceptionally higher at super-enhancers compared to typical enhancers (Fig. 4c). The ChIP-seq density of Med1, Cdk8, Cdk9 and Smad3 at superenhancers is significantly higher compared to typical enhancers (p-value < 2.2e-16, Wilcoxon rank sum test) (Fig. 4d). Similarly, Med1, Cdk8, Cdk9 and Smad3 are also enriched at ESC superenhancers identified using Med1 (Fig. 4f). The ChIP-seq binding for Med1 and master TFs, including Sox2, Oct4 and Nanog, is exceptionally higher and forms clusters at super-enhancers, which are associated with cell-type-specific genes [16]. We found that the ChIP-seq binding sites for Cdk8, Cdk9 and Smad3 also form clusters at super-enhancer regions and associated with cell-type-specific genes.
For example, the super-enhancer (mSE_00038) is associated with ESC pluripotency gene Sox2 (Fig.   3e). In another example, the super-enhancer (mSE_00085) is associated with the ESC pluripotency gene Nanog and the super-enhancer (mSE_00084) is associated with Dppa3 (developmental pluripotency associated 3) gene, which plays a key role in cell division and maintenance of cell pluripotency ( Figure S4A in Additional file 1).

Identification and characterization of super-enhancers by using Cdk8, Cdk9 and Smad3 in mESC
Since we found that Cdk8, Cdk9 and Smad3 are highly correlated with Med1 and co-occupy superenhancers genome-wide (Fig. 4), we investigated the importance of Cdk8, Cdk9 and Smad3 in superenhancer formation and also compared them with super-enhancers identified by Med1. We used ChIP-seq data and RNA-seq data to identify and characterize super-enhancers by using Cdk8, Cdk9 and Smad3 in mESC. We found 400, 494 and 435 super-enhancers by using Cdk8, Cdk9 and Smad3, respectively (Fig. 5a). A list of all the super-enhancers and typical enhancers can be found in (Additional file 2). Further, Cdk8, Cdk9 and Smad3 successfully identified 88%, 84% and 73% of the Med1 super-enhancers, respectively (Fig. 5a). After Med1, we can see more clear distinction of superenhancers and typical enhancers by using Cdk8 as compared with H3K27ac ( Figure 5b). The majority of super-enhancers identified using Cdk8, Cdk9 and Smad3 do overlap with super-enhancers identified using Med1, which is 66% of the super-enhancers identified using Med1 (Fig. 5c).
In ESC, the DNA motifs Klf4 and Esrrb were particularly enriched at the constituents of superenhancers, compared to typical enhancers [16]. Hence, we tested the frequency of these two motifs at the constituents of super-enhancers and typical enhancers; we defined using Cdk8, Cdk9 and Smad3. We found that the frequency of binding motifs Klf4 and Esrrb is significantly higher at constituents of super-enhancers than typical enhancers (p-value < 2.2e-16, Wilcoxon rank sum test) ( Figure S4G in Additional file 1). Further, when we compared the frequency of known ESC specific motifs (Oct4, Sox2, Nanog, Esrrb and Klf4) at the constituents of super-enhancers and typical enhancers defined by Med1, Cdk8, Cdk9 and Smad3. We found a higher frequency of these motifs at super-enhancers defined by Cdk9, Cdk8 and Smad3 as compared with Med1 (Fig. 5e). Further, the frequency of these motifs was slightly higher at the typical enhancers identified by Med1 as compared with Cdk8, Cdk9 and Smad3.
The genes associated with super-enhancers are significantly expressed, compared to genes associated with typical enhancers [16][17][18]. To test this we associated genes with the super-enhancers and typical enhancers as described in [16,18]. We found that 65% of the Med1 super-enhancers associated genes were also associated with super-enhancers identified by Cdk8, Cdk9 and Smad3 ( Fig. 5f). Further, these genes associated with super-enhancers were significantly expressed, compared to genes associated with typical enhancers (p-value < 2.2e-16, Wilcoxon rank sum test) (Fig. 5g).
Super-enhancers known to be highly enriched for cell-type-specific master regulators and these regulators should have a higher rank. Hence, we checked the rank of super-enhancers, associated with the key cell identity genes, including Oct4, Sox2, Nanog, Esrrb and Klf4 in ESC, and ranked the factors based on the average rank of the super-enhancers that are associated with these genes (Fig.   5h). These genes were selected due to their important roles in the pluripotency and reprogramming of ESC biology [42][43][44]. The rankings for Med1 super-enhancers were downloaded from [18]. We found that Smad3 achieved a highest rank followed by Med1, Cdk9 and Cdk8. The Smad3 achieved a higher rank for Oct4, compared with other genes. This might be due to the fact that Smad3 co-bonded with the master transcription factor Oct4 genome-wide in ESC [45]. Further, we found almost similar ChIP-seq patterns for factors including Med1, H3K27ac, Brd4, Cdk8, Cdk9 and Smad3 at superenhancers regions defined by all three factors (Cdk8, Cd9 and Smad3) and are associated with celltype-specific genes, including Nanog and Dppa3 (Fig. 5i). Like Med1, the genes associated with 1 super-enhancers ranked by Cdk8, Cdk9 and Smad3 were enriched with cell-type-specific GO terms, supporting the notion that super-enhancers regulate cellular identity genes ( Figure S7a in Additional file 1). Taken together, our results indicate the role of Cdk8, Cdk9 and Smad3 in defining and formation of super-enhancers.
In ESC, the highly ranked super-enhancers identified using Smad3 were associated with ES cellidentity genes, including Oct4, Sox2, Nanog, Klf4 and Esrrb compared to Med1 super-enhancers (Fig.   5f). A previous study showed that Smad3 co-occupies the master transcription factors genome-wide [45]. We also observed Smad3 turn to be the highly ranked feature when we ranked Smad3 and several other factors in pro-B cells (Fig. 2d) Hence, we argued that Smad3 could be used to define super-enhancers instead of Med1. We already showed the ability of Smad3 in defining super-enhancers in ESC. To test this in more differentiated cells, we identified and characterized super-enhancers in pro-B cells using Smad3. We compared the super-enhancers identified using Smad3 with previously identified super-enhancers that use Med1 in pro-B cells [16]. The ChIP-seq density of Smad3 super-enhancers is exceptionally higher compared to typical enhancers (Fig. 6a). Further, Smad3 have strong binding along with Med1 and PU.1 at a super-enhancer(mSE_00293) which is associated with Foxo1 gene (Fig. 6b).
By using Smad3, we identified 694 super-enhancers and among these, 65% were identified by Med1 ( Figure S4f in Additional file 1) and with Smad3 we can see a more clear distinction of superenhancers and typical enhancers as compared with H3K27ac (Fig. 6c). The ChIP-seq density at super-enhancers identified using Smad3 is significantly higher, compared to typical enhancers (pvalue < 2.2e-16, Wilcoxon rank sum test) (Fig. 6d). The genes associated with Smad3 superenhancers are significantly expressed, compared with typical enhancers (p-value < 2.2e-16, Wilcoxon rank sum test) (Fig. 6e). The GO terms for super-enhancers ranked by Smad3 are highly enriched and cell-type-specific, compared to Med1 ( Figure S7b in Additional file 1).
Further, to test the functional importance of super-enhancers identified only by Smad3 or by Med1, we performed GO analysis on these subset of super-enhancers. Interestingly, the super-enhancers identified by Smad3 but not by Med1 turn to be highly enriched for cell-type-specific GO terms such as immune cell development and immune system development. While super-enhancers identified by Med1 but not by Smad3 low enriched for cell-type-specific GO terms (Fig. 6f). These results, taken together with previous studies, demonstrate the importance of Smad3 in super-enhancer formation and Smad3 could be used to define super-enhancers.

imPROSE predicts cell-type-specific super-enhancers
To further test the ability of imPROSE, we compared it with the most commonly used H3K27ac based method ROSE [16]. We have demonstrated above that H3K27ac was ranked as the most informative feature in mESC and Smad3 was ranked as the most informative feature in pro-B cells. Hence, we trained imPROSE using H3K27ac and Smad3 data in mESC and predicted super-enhancers in pro-B

cells. Our model predicted ~2,000 super-enhancers in pro-B cells, but when we used ROSE with
H3K27ac data, we found 934 super-enhancers (Additional files 3 and 4). To assess the functional importance of super-enhancers identified by these two methods, we performed GO enrichment analysis using GREAT tool [46]. We have listed the top 20 GO terms enriched at super-enhancers Interestingly, we found several GO terms pertinent to the specific biological functions in pro-B cells are highly enriched at the super-enhancers identified by imPROSE and low enriched at superenhancers identified by ROSE. For example, immune system process, hematopoietic or lymphoid organ development and leukocyte differentiation are significantly enriched for genes associated with super-enhancers predicted by imPROSE (Fig. 7a). Further, cell-type-specific GO terms such as regulation of B cell activation and regulation of B cell proliferation were highly enriched for typical enhancers defined by ROSE as compared with imPROSE ( Figure S8 in Additional file 1). This shows that several super-enhancers were possibly labelled as typical enhancers by ROSE.
To further assess the ability of imPROSE, we carried out motif analysis at the constituents of super-enhancers identified by both ROSE and imPROSE methods. We used TRAP tool [47] to find motifs and reported the top ranked motifs. We found that most of the motifs were highly enriched at super-enhancers identified by imPROSE (Fig. 7b). More interestingly, motifs such as PU.1 and Ebf1 were significantly enriched at SEs predicted by imPROSE, which have previously been shown to play important role in the control of B cell identity [48]. Therefore, taken together these results suggest that the integration of multiple factors appears to be a better way to predict super-enhancers than the the current H3K27ac based approach (ROSE).

Discussion
Super-enhancers regulate expression of key genes that are critical for cellular identity, thus, alterations at these regions can lead to several disorders. Hence, exploring these cis-regulatory elements and their features to uncover their molecular mechanisms will help us in designing better, precise, and personalized drugs.
In this study, we first presented a systematic approach to rank and access the importance of Among the chromatin features we found that Brd4, H3K27ac, Cdk8, Cdk9, Med12 and p300 were the top six features and by assessing their individual predictive powers we achieved higher AUC for higher ranked features. Previous studies showed the importance of these highly ranked features and their role in transcriptional regulation. Through our ranking of features, H3K27ac achieved a higher ranking, compared to other histone modifications. H3K27ac was found as a mark to separate active enhancers from poised enhancers [11] -this shows that super-enhancers might be the clusters of active enhancers. The Mediator sub-units Med1 and Med12 has been known as master coordinators of cell lineage and development [16,49]. We did not include Mediator sub-unit Med1 in our training data because a Med1 signal was used to define super-enhancers [16]. Bromodomain-containing protein 4 (Brd4), a member of the BET protein family which functions as an epigenetic reader and transcriptional regulator that binds acetylated lysines in histones [50], was ranked as the second highest important feature. Brd4 has been associated with anti-pause enhancers (A-PEs) which regulate the RNA Polymerase II (Pol II) promoter-proximal pause release [51,52]. Brd4 regulates the positive transcription elongation factor b (P-TEFb) to allow Pol II phosphorylation and the subsequent elongation of target genes [52,53]. In ESC it specifically governs the transcriptional elongation by occupying super-enhancers and by recruiting Mediator and Cyclin dependent kinase 9 (Cdk9) to these super-enhancers [54]. Cdk9, a sub-unit of P-TEFb, has been found at enhancers and promoters of active genes along with the Mediator coactivator [17]. Cyclin-dependent kinase 8 (Cdk8), a subunit of Mediator complex, positively regulates precise steps in the assembly of transcriptional elongation, including the recruitment of P-TEFb and BRD4 [38]. During the preparing of this manuscript, a very recent study has demonstrated that Cdk8 regulates the key genes associated with super-enhancers in acute myeloid leukaemia (AML) cells [55]. We identified and characterized super-enhancers in mESC by using Cdk8 and Cdk9, to further validate their importance in super-enhancer formation and cell identity.
Among the transcriptions factors, we found that Smad3, Esrrb, Klf4, Tcfcp2l1, Nr5f2a and Stat3 were the top ranked features and by assessing their individual predictive powers we achieved higher AUC for higher ranked features. It was particularly interesting to see that Smad3 was ranked as the best feature among the transcription factors including Esrrb and Klf4. We know that Smad3 is a target of the TGF-β signaling pathway, and studies have shown that Smad3 is recruited to enhancers formed by master transcription factors [45]. We found significant correlation between Smad3 and coactivators p300/CBP at super-enhancers and previous studies have shown that p300/CBP interacts with Smad3 [30,31]. The evidence for the enrichment of Smad3 at super-enhancers shows how the transforming growth factor beta (TGF-β) signaling pathway can converge on key genes that control ES cell identity. A very recent study validates our findings by showing that super-enhancers provide a platform for signalling pathways, including TGF-β to regulate genes that control cell identity during development and tumorigenesis [56]. To validate further, we identified and characterized superenhancers using Smad3 in mESC and pro-B cells. By integrating ChIP-seq and RNA-seq data we showed the importance of Smad3 in super-enhancer formation and cell identity.
By investigating sequence-specific features, we found that the constituents of super-enhancers were significantly GC-rich. The GC-richness of a genomic region is associated with several distinctive features that can affect the cis-regulatory potential of a sequence [32,33]. GC-rich and AT-rich chromatin domains are marked by distinct patterns of histone modifications. GC-rich chromatin domains tend to occur in a more active conformation and histone deacetylase activity represses this propensity throughout the genome [33]. Also GC content and nucleosome occupancy are positively correlated [32] and GC-rich sequences promote nucleosome formation [34]. Transcription factors tend to bind GC-rich regions in the genome, regardless of the distance and orientation [32]. This suggests that there is a role for the GC content in the formation of super-enhancers, which control the cell-typespecific gene expression.
Enhancers function due to cooperative and synergistic interplay of different coactivators and transcription factors [57]. A recent study showed that multiple enhancer variants cooperatively contribute to altered expression of their gene targets [58]. It is not well understood whether constituents of super-enhancers work synergistically or additively. The constituents of superenhancers make frequent physical contacts with one another [59] and extensive cooperative binding of transcription factors have been found at super-enhancers [60]. A study in ESC demonstrated the functional importance of super-enhancer constituents [56]. Further, two recent studies have suggested additive and functional hierarchy among the constituents of α -globin and Wap superenhancer locus, respectively [61,62]. But, a very recent study argues that it is still need to be determined whether the constituents of a super-enhancer functions synergistically or additively [63].

Conclusions
We integrated diverse types of genomic and epigenomics datasets to predict super-enhancers and their constituents in a cell-type-specific manner. We investigated the relative importance of each feature in predicting super-enhancers, and also their combinatorial predictive power. We demonstrated that the model trained on one cell-type can be used to predict super-enhancers in other cell-types, and also performed better than the current H3K27ac-based approach. More importantly, we found Cdk8, Cdk9 and Smad3 as new signatures, which can be used to define super-enhancers where Mediator or master transcription data is not available. Taken together with previous studies, our results suggest a possible cooperative and synergistic interactions of numerous factors at super-enhancers. Our feature analysis and prediction models can serve as a resource to further characterize and understand the formation of super-enhancers.

Data description
We  (Table S5 in Additional file 1).
We also obtained processed RNA-seq based gene expression data (RPKM) from [64] and [16] for mESC and pro-B cells, respectively. We downloaded super-enhancer regions in mESC and pro-B cells identified using Med1 ChIP-seq occupancy from dbSUPER [65]. We also used other genomic features including GC content, conservation score (phastCons) and repeat fraction downloaded from the UCSC table browser [66].
Data pre-processing and feature extraction Initially, ChIP-seq reads were aligned to mouse genome-build mm9 using bowtie [67] (Version 0.12.9) with parameters (-k 1, -m 1, -n 2, -e 70, -best). We calculated read densities for 30 ChIP-seq datasets at the constituents of super-enhancers (646) and typical enhancers (9981) and normalized it as described in [16,68] . Briefly, for each constituent region, reads were extended by 200bp and the density of reads per base pair was calculated using bamToGFF (https://github.com/BradnerLab/pipeline). Next, these densities were normalized in units of reads per million mapped reads per base pair (rpm/bp) with background subtraction. We used the similar approach for DNase-seq data but without background subtraction.
The data for model validation was aligned to hg19 as described above. We used MACS (Modelbased Analysis of ChIP-Seq) [69] (Version 1.4.2) to perform the peak calling and to find ChIP-seqenriched regions over background. We used a p-value (10 -9 ) as the enrichment threshold. To generate wiggle files, we used MACS with parameter -w -S --space=50.
For DNA sequence motif data, we collected DNA binding motif information (PWM) from the transfac professional database version 2014 [70] for all the 11 transcription factors (Fig. 2a). We computed the binding affinity score for the constituents of super-enhancer and typical enhancer sequences using the Transcription factor Affinity Prediction (TRAP) [47] using individual TF's position weight matrix (PWM).

Data sampling
There are two commonly used sampling approaches, over-sampling and under-sampling. In oversampling we increased the size of the minority class while in under-sampling we through away the samples from the majority class to balance the data. The current data we are dealing with is highly imbalanced. We used Weka implementation of SMOTE [71] to perform over-sampling with parameters (nearest neighbours=5, random seed=1 and oversampling percentage=500). We used a hybrid approach by first applying over-sampling using SMOTE on the minority class and then undersampling on the majority class. We performed under-sampling by randomly selecting a subset of size similar to the minority class. In this study we used hybrid-sampling approach to perform analysis because it performed better (Additional file 1).

Feature-ranking
To find an optimal feature subset, we used two Random Forest based approaches. First, we used Boruta algorithm [37] to rank important features. Briefly, it finds important features by measuring the relevance of each original feature with respect to a reference attribute using Random Forest. Second, we used Random Forest's out-of-bag approach to calculate the relative importance of each feature.
Briefly, this approach takes one feature out and measures its relative importance and contribution to the model. other cell-types. We also ranked these features together in mESC and pro-B cells.

Training data
We downloaded 10627 loci of constituents of super-enhancers and typical enhancers in mESC defined based of Med1 ChIP-seq signal [16]. Among these 646 were constituents of super-enhancers and 9981 were typical enhancers. The median size of enhancer constituents is 703bp and superenhancer constituents is 862bp. After performing hybrid-sampling we have 10,336 instances of data and among these 50% (5,168) are constituents of super-enhancers, and 50% (5,168) are constituents of typical enhancers. We considered constituents of super-enhancers as positive class and constituents of typical enhancers as negative class. In total, we have 45 features, including 20 chromatin features, 11 transcription factors, 11 DNA motifs and three sequence-specific features. We excluded Med1 from our training data because super-enhancers were defined based on Med1 ChIPseq signal [16]. Not surprisingly, we achieved best classification results by using Med1 as a feature.

Prediction models
We investigated six state-of-art supervised approaches including: Random Forest [72], Support Vector Machines (SVM) [73], K-Nearest Neighbor (k-NN) [74], AdaBoost [75], Decision Tree and Naive Bayes. For all the analysis, we used scikit-learn (version 0.14.1), a Python library for machine learning [76]. We used LibSVM [77] with a linear kernel and regularization parameter C = 1.0. We used Random Forest with the number of trees=20. We calculated an out-of-bag error to find the optimal number of trees to use for Random Forest ( Figure S2C in Additional file 1). For other models we used default parameters set in the scikit-learn library.

K-mer based prediction
We used a sequence specific enhancer prediction method (Kmer-SVM) [39] to classify constituents of typical enhancers and super-enhancers. We used the default settings with k-mer size=5, spectrum kernel, regularization parameter (C) = 1.0. We used 5-fold cross validation for model validation.

Performance evaluation
We used 10-fold stratified cross-validation (CV) to validate models, which makes the folds by preserving the percentage of samples for each class. Stratified CV is generally consider a better scheme than standard CV in terms of bias and variance [78]. To evaluate the performance of the models, we reported precision, recall, F1-score, area under the ROC curve (AUC) and the precisionrecall curve (PRC). The receiver-operating characteristic (ROC) is a graphical representation of true positive rate (sensitivity) v/s false positive rate (1-sensitivity). The true positive rate is also known as sensitivity or recall. The false positive rate is also known as (1-sensitivity). F1 score is an accuracy measure, which considers both the precision and the recall of the test to compute the score. The mathematical representation of these measures is as follows: • Precision = true positive/(true positive + false positive) • Recall = true positive/(true positive + false negative) We also tested if the increase in model AUC is statistically significant by using permutation test (1000 runs). The p-value is calculated using Wilcoxon rank sum statistic.

Super-enhancer identification
We used ROSE (Rank Ordering of Super-Enhancers) with parameters (stitching distance=12.5 kb, TSS exclusive zone= +/-2 kb) to define super-enhancers as described in [16]. We used the ChIP-seq peaks for H3K27ac as enhancer constituents and MED1 signal to rank them.

Assigning genes to super-enhancers and typical enhancers
We assigned genes to super-enhancers and typical enhancers using a proximity rule as descried in [16,18]. It is known that enhancers tend to loop and communicate with target genes [7], and most of these enhancer-promoter interactions occur within a distance of ~50kb [79]. This approach identified a large proportion of true enhancer/promoter interactions in ESC [80]. Hence, we assigned all transcriptionally active genes to super-enhancers and typical enhancers within a 50kb window.

Motif analysis
We used FIMO (Find Individual Motif Occurrences) version 4.10.0 [81] for motif analysis with p-value < 10 -4 with a custom library of TRANSFAC motifs including (Oct4: M01124, Sox2: M01272, Nanog: M01123, Esrrb:M01589, Klf4: M01588). The number of occurrences of each motif were counted for the constituents of super-enhancer and typical enhancer regions. Motif analysis in Fig. 7b was performed with TRAP tool using TRANSFAC vertebrates motif library, mouse promoters as the background, and Benjamini-Hochberg as the correction [47]. A DNA sequence (FASTA format) was extracted from mm9 genome as input for FIMO and TRAP.

Gene ontology analysis
We performed the gene ontology analysis using Genomic Regions Enrichment of Annotations Tool (GREAT) web tool (version 3.0.0) [46] with whole-genome as background and default parameters. We reported the top Gene Ontology (GO) terms with the lowest p-value.

Visualization and statistical analysis
We generated box plots using R programming language by extended the whiskers to 1.5x the interquartile range. The P-values were calculated based on Wilcoxon signed-rank test for box plots, by using wilcox.test function in R. We used ngs.plot [82], to generate heat maps and normalized binding profiles at the constituents of super-enhancers and typical enhancers and their flanking 3kb regions (for example, Fig. 1a