ENCODE discovers many new transcription-factor-binding-site motifs and explores their properties
To directly identify regulatory regions, we mapped the binding locations of 119 different DNA-binding proteins and a number of RNA polymerase components in 72 cell types using ChIP-seq (Table 1, Supplementary Table N1, ref 19); 87 (73%) were sequence-specific TFs (TFSS). Overall, 636,336 binding regions covering 231Mb (8.1%) of the genome are enriched for regions bound by DNA-binding proteins across all cell types. We assessed each protein-binding site for enrichment of known DNA-binding motifs and the presence of novel motifs. Overall, 86% of the DNA segments occupied by TFSS contained a strong DNA-binding motif and in most (55%) cases, the known motif was most enriched (Pouya Kheradpour and Manolis Kellis, personal communication).
We organized all the information associated with each TF, including the ChIP-seq peaks, discovered motifs, and associated histone modification patterns, in FactorBook26, a public resource which will be updated as the project proceeds.
We observed reduced levels of individual variation at functional binding sites compared to reshuffled motif matches and flanking regions for other Drosophila factors as well as human TFs (Figure 2A). Notably, the significance of this effect was similarly high in Drosophila and humans, despite the fact that the SNP frequency differed approximately 11-fold (2.9% vs 0.25%, respectively), as closely reflected by the 7.5-fold difference in the number of varying TFBS. This is consistent with the overall differences in the total number of SNPs detected in these two species, likely resulting from their different ancestral effective population sizes 39. We also observed a significant anti-correlation between variation frequency at motif positions and their information content in both species (Figure 2B).
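The anti-correlation between per-position variation and information content can be illustrated with a minimal sketch. It assumes a uniform background of 0.25 per base and uses a toy three-column motif with hypothetical variant frequencies; none of these numbers come from the study.

```python
import math

def information_content(column, background=0.25):
    """Information content (bits) of one PWM column of base frequencies,
    relative to a uniform background (an assumption of this sketch)."""
    return sum(p * math.log2(p / background) for p in column if p > 0)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Toy motif: columns are frequencies of A, C, G, T (hypothetical values).
toy_pwm = [
    [0.97, 0.01, 0.01, 0.01],  # highly informative position
    [0.70, 0.10, 0.10, 0.10],  # moderately informative position
    [0.25, 0.25, 0.25, 0.25],  # uninformative position
]
# Hypothetical fraction of sites carrying a variant at each position.
toy_variant_fraction = [0.001, 0.004, 0.010]

toy_ic = [information_content(col) for col in toy_pwm]
toy_r = pearson(toy_ic, toy_variant_fraction)
print(toy_ic, toy_r)
```

A negative correlation in such a table would correspond to the pattern described for Figure 2B: positions carrying more information tolerate less variation.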
We proposed to express the deleterious effect of TFBS mutations in terms of mutational load, a known population genetics metric that combines the frequency of mutation with predicted phenotypic consequences that it causes 31,32 (see Materials and methods for details). We adapted this metric to use the reduction in PWM score associated with a mutation as a crude but computable measure of such phenotypic consequences.
We do not assume that TFBS load at a given site reduces an individual's biological fitness. Rather, we argue that binding sites that tolerate a higher load are less functionally constrained. This approach, although undoubtedly a crude one, makes it possible to consistently estimate TFBS constraints for different TFs and even different organisms and ask why TFBS mutations are tolerated differently in different contexts.
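As a rough illustration of the idea of weighting each variant's frequency by the PWM score it removes, the sketch below computes a load-like quantity for a single site. The exact formula is given in the Materials and methods, which are not reproduced here, so the normalization (score drop divided by the major-allele score) and the names `pwm_score` and `tfbs_load` are assumptions made for this example only.

```python
def pwm_score(site, log_odds):
    """Sum of per-position log-odds scores for a site string."""
    return sum(log_odds[i][base] for i, base in enumerate(site))

def tfbs_load(major_site, variants, log_odds):
    """Hypothetical load-like metric for one binding site.

    variants: list of (position, alt_base, allele_frequency).
    Each variant contributes its frequency times the relative drop in
    PWM score it causes; score gains are ignored (an assumption).
    """
    ref_score = pwm_score(major_site, log_odds)
    if ref_score <= 0:
        raise ValueError("reference site must have a positive PWM score")
    load = 0.0
    for pos, alt, freq in variants:
        mutated = major_site[:pos] + alt + major_site[pos + 1:]
        drop = ref_score - pwm_score(mutated, log_odds)
        load += freq * max(drop, 0.0) / ref_score
    return load

# Toy two-position log-odds matrix (hypothetical values).
toy_log_odds = [
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": 0.0, "T": -1.0},
]
print(tfbs_load("AC", [(0, "G", 0.1), (1, "G", 0.2)], toy_log_odds))
```

Under this definition a site with no segregating variants has zero load, and a frequent variant that abolishes the motif match contributes the most.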
Among the TF binding sites that were ubiquitously functional, we compared the genomic footprints of sites where binding activated or repressed transcription in all four cell lines. Among the transcription factors we examined (see Table 1), YY1 had the most examples of each case (9 ubiquitously activating and 16 ubiquitously repressing sites). Fig. 2 shows the motifs derived from this analysis for YY1. The most striking difference between the YY1 motif for sites where binding is associated with activation (Fig. 2 (b)) and those where binding is associated with repression (Fig. 2 (c)) occurs at position 4, where the G has greater information content for repressing cases (p < 0.012 using a permutation test, see Additional file 1, Fig. S7). The repressive YY1 binding sites are closer to translational start sites than are the activating YY1 binding sites (p = 7.7 × 10^-4). Indeed, 12 of the repressing YY1 binding sites are located directly over the translational start site, whereas only a single activating YY1 binding site is. The mutagenesis experiments reported here elucidate the functional distinction between the different classes of YY1 binding sites that were noted in a previous analysis of DNA binding (ChIP-chip) 75: the class of YY1 binding sites localized around the translational start site is strongly associated with transcriptional repression, while the class localized closer to the TSS is associated with activation.
In Fig. 2 (d), we report the vertebrate phyloP score 87 for each nucleotide, averaged over sites where YY1 binding results in activation or repression of transcription, respectively. Error bars indicate the standard error of the mean. Conservation is generally high for YY1, relative to that for the other transcription factors in our study. At position 4 of the YY1 motif, we observe that mean conservation is lower among the activating sites compared to the repressing sites (p < 0.06 using a Wilcoxon rank sum test). We also note that, while both activation- and repression-associated classes of YY1 binding sites show greater conservation over the binding site, relative to flanking regions, the conservation of the repression-associated class is greater than that of the activation-associated class, even beyond the 5' and 3' ends of the YY1 motif.
We also found that the recognition sequences for a small number of factors were consistently linked with elevated chromatin accessibility across all classes of sites and all cell types (Supplementary Fig. 6c), indicating that regulators acting through these sequences are key drivers of the accessibility landscape.
Many eukaryotic genes are co-regulated by multiple TFs in a cell type-specific manner (Maston et al. 2006). For 70 of the 87 sequence-specific TFs, we discovered the canonical motifs as well as significant secondary motifs that were distinct from the canonical motifs of the TFs in question and that correspond to the canonical motifs of other TFs. Two scenarios may result in secondary motifs: two TFs bind to neighboring sites (co-binding), or one TF protein binds to another that in turn binds to DNA (tethered binding). To distinguish between these scenarios, we computed the percentages of peaks in a ChIP-seq dataset that contain sites for the canonical TF only, a non-canonical TF only, or both, and then we sorted the datasets by the percentages of peaks with only non-canonical motif sites (Fig. 2a; see Table S3 for the underlying data). We reasoned that if sites of a non-canonical motif were frequently found to be in the same ChIP-seq peaks as canonical motif sites (hence adjacent to them), the two TFs are likely to interact at the protein level and influence each other in binding to their DNA sites. Conversely, if the majority of the peaks contain only sites for non-canonical motifs, then tethered binding is a more plausible model. In this fashion, we identified 151 potential tethered binding and 104 co-binding sequence-specific TF pairs (255 in total). We then compared the pairs we discovered with experimentally detected pairs reported in a mammalian-two-hybrid study (Ravasi et al. 2010; Matys et al. 2003) and in the BIOGRID database (Badis et al. 2009; Stark et al. 2006) and found evidence for physical interaction for 27 (10.6%) of the pairs. Eighteen of the 151 tethered binding predictions were validated in the mammalian-two-hybrid data. We randomly picked 151 TF pairs for 5,000 trials and on average 4.19 pairs were validated in the mammalian-two-hybrid experiments (maximum 13 pairs), indicating that our predicted TF pairs were highly significant (p-value<2e-4). 
Thus our results both recapitulated previously reported observations and revealed novel potential interactions that can be tested by experimentation (see Table S3 for summary of all pairs).
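The random-sampling comparison described above amounts to an empirical p-value: when none of 5,000 random draws reaches the observed count, the estimate is bounded by 1/5,000 = 2e-4. The following is a minimal sketch of that procedure; the universe of candidate pairs, the validated set, and the function name `empirical_p_value` are placeholders for illustration and are not the study's data.

```python
import random

def empirical_p_value(observed, universe, validated, sample_size,
                      trials=5000, seed=0):
    """Fraction of random samples with at least `observed` validated pairs.

    universe: iterable of candidate TF pairs; validated: set of pairs with
    experimental support. A result of 0.0 should be read as p < 1/trials.
    """
    rng = random.Random(seed)
    universe = list(universe)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(universe, sample_size)
        n_valid = sum(1 for pair in sample if pair in validated)
        if n_valid >= observed:
            hits += 1
    return hits / trials

# Hypothetical universe of 1,000 pair IDs, 30 of them "validated".
toy_universe = range(1000)
toy_validated = set(range(30))
p = empirical_p_value(18, toy_universe, toy_validated, sample_size=151)
print("p <", 1 / 5000 if p == 0 else p)
```

Fixing the seed keeps the sketch reproducible; in practice the bound 1/trials is only as tight as the number of trials allows.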
Co-binding TFs bind to neighboring sites in the genome. For some TFs, multiple molecules of the same TF also can occupy neighboring sites. We asked whether these neighboring sites prefer to be on the same strand or opposite strands, and whether they prefer to be in a specific range of distances. In addition to the analysis presented in the previous section, which compared the canonical motif with each non-canonical motif discovered in the same dataset, we also compared motifs discovered in different datasets collected using the same cell line. In Fig. 2b,c we summarize the heterotypic and homotypic TF pairs that show statistically significant orientation or distance preferences, separately in non-repetitive and repetitive regions of the genome (the underlying data are in Table S4). Out of the 78 motifs discovered from ChIP-seq datasets, 36 motifs (92 pairs; 62 heterotypic pairs and 30 homotypic pairs) are included in Fig. 2b, suggesting that preferred arrangements of nearby TF binding sites are a common phenomenon. The neighboring sites for many heterotypic TF pairs (e.g., CTCF-NF-Y, ELF1-GABP, and FOXA-HNF4) as well as the neighboring homotypic sites of many TFs (e.g., AP-1, CTCF, and USF) show a strong preference for an edge-to-edge distance of less than 30 bp and varying degrees of preference for one orientation over the other. For example, neighboring NF-Y sites prefer to be in the same orientation. NF-Y also prefers one orientation to the other when co-binding with SP1, PBX3 (its motif is UA2), and USF. We hypothesized that these 92 TF pairs are more likely to represent protein-protein interactions than the TF pairs we identified in the previous section without testing for position or orientation preferences. Indeed, 14 heterotypic pairs and 17 homotypic pairs (33.7%) were detected in the aforementioned mammalian-two-hybrid study (Ravasi et al. 2010) or in the BIOGRID database (Stark et al. 2006).
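To make the two quantities in this analysis concrete, the sketch below computes the edge-to-edge distance and the relative orientation of neighboring site pairs and tallies orientations among pairs closer than 30 bp. The half-open `(start, end, strand)` representation and the function names are assumptions of this example; the statistical test for significance is not shown.

```python
from collections import Counter

def edge_to_edge(a, b):
    """Gap in bp between two sites given as (start, end, strand) with
    half-open coordinates; returns None if the sites overlap."""
    (s1, e1, _), (s2, e2, _) = a, b
    if e1 <= s2:
        return s2 - e1
    if e2 <= s1:
        return s1 - e2
    return None

def orientation(a, b):
    """'same' if both sites lie on the same strand, else 'opposite'."""
    return "same" if a[2] == b[2] else "opposite"

def orientation_counts(pairs, max_gap=30):
    """Tally orientations among non-overlapping pairs with gap < max_gap."""
    counts = Counter()
    for a, b in pairs:
        gap = edge_to_edge(a, b)
        if gap is not None and gap < max_gap:
            counts[orientation(a, b)] += 1
    return counts

# Hypothetical site pairs on one chromosome.
toy_pairs = [
    ((100, 110, "+"), (115, 125, "+")),  # gap 5, same strand
    ((200, 210, "+"), (219, 229, "-")),  # gap 9, opposite strands
    ((300, 310, "-"), (400, 410, "-")),  # gap 90, ignored
]
print(orientation_counts(toy_pairs))
```

A strong skew in such counts, or a sharp peak in the gap distribution, is the kind of signal that would then be tested for significance.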
TFs tend to bind gene-rich regions of the genome due to their role in regulating target gene expression (Carroll et al. 2006). Nonetheless, repetitive elements are known to harbor functional TF binding sites, especially when such elements occur near genes. We systematically compared our compilation of TF binding sites with all repeats annotated in the human genome and the results are summarized in Fig. 3a. We confirmed the previously reported enrichment of STAT1, NF-Y, and CTCF binding sites in various repetitive elements (Bourque et al. 2008; Schmid and Bucher 2010), and we uncovered many more TFs whose binding sites are enriched in certain repetitive elements, e.g., UA1 sites in THE1B and THE1D retrotransposons. It was shown that a long terminal repeat (LTR) region of the THE1D retrotransposon was recruited as an alternative promoter for the human IL2RB gene and that the activity of this alternative promoter is regulated by DNA methylation (Cohen et al. 2011). The UA1 motif we identified in ZBTB33 peaks contains a prominent CGCG center (Fig. 1c) and ZBTB33 is known to bind methylated CpG dinucleotides (Yoon et al. 2003), raising the interesting possibility that the THE1B/D retrotransposons spread ZBTB33 binding sites across the genome, and that the regulation of the newly recruited target genes can be modulated by the DNA methylation mechanism. Fig. 2c and 3b summarize all motif pairs that show statistically significant distance or orientation preference in repetitive regions of the genome. The NF-Y-USF site pairs that typically have an end-to-end distance of 5-6 bp are nearly all located in the MLT1 family of retrotransposons. Similarly, the NF-Y-NF-Y site pairs at a 9-bp distance are found most often in LTR12 retrotransposons. There are 181 copies of the MLT1J transposon in the genome that contain sites for the NF-Y, USF, and ZNF143 motifs simultaneously, bound directly by NF-Y, USF, and ZNF143 TFs, respectively.
The relative distances among the sites are nearly invariant (Fig. 2c), indicating recent duplications of MLT1J. Our results suggest a mechanism whereby retrotransposons amplify functional TF site pairs across the genome through transposition, potentially bringing new genes under the regulation of those TFs.
The majority of the ENCODE ChIP-seq data were produced using five cell lines: K562, GM12878, HepG2, H1-hESC, and HeLa. Integrating ChIP-seq data with RNA-seq data for these five cell lines, we asked whether genes that are preferentially expressed in a given cell line (defined by the average expression level in one cell line being more than 10-fold higher than that in any of the remaining four cell lines) show enriched TF binding sites in the corresponding cell line. This is indeed the case for a large fraction of genes and Fig. 4a shows five examples, one per cell line.
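The 10-fold criterion for preferential expression can be written as a short check. This is a minimal sketch with hypothetical expression values; the function name `preferential_cell_line` and the units are assumptions of the example.

```python
def preferential_cell_line(expression, fold=10.0):
    """Return the cell line whose average expression exceeds `fold` times
    that of every other cell line, or None if no cell line qualifies.

    expression: dict mapping cell line name -> average expression level.
    """
    ranked = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    (top_line, top_value), (_, second_value) = ranked[0], ranked[1]
    # Exceeding the runner-up by the fold threshold implies exceeding all.
    return top_line if top_value > fold * second_value else None

# Hypothetical average expression levels for one gene.
toy_gene = {"K562": 120.0, "GM12878": 5.0, "HepG2": 3.0,
            "H1-hESC": 8.0, "HeLa": 1.0}
print(preferential_cell_line(toy_gene))
```

Comparing against the second-highest value is sufficient because the definition requires the threshold to hold against each of the remaining cell lines.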
We then asked whether the non-canonical motifs we discovered also reflect cell type specificity. Fig. 4b plots the non-canonical motifs (circles) detected in the ChIP-seq datasets of sequence-specific TFs for each of the five cell lines (squares) with the most ENCODE ChIP-seq datasets. Cell line-specific non-canonical motifs are placed close to their respective cell lines in Fig. 4b. We defined cell line-specific motifs as those that were discovered three times more often in one cell line than in any other cell line. The remaining non-canonical motifs are placed in the center of the figure, and these motifs correspond to TFs that cooperate with other sequence-specific TFs across multiple cell lines. The thickness of the solid line connecting a non-canonical motif to a cell line indicates the proportion of datasets in that cell line that revealed the motif as a non-canonical motif.
In Fig. 4b, we also included all non-sequence-specific TFs (diamonds) for which there are ChIP-seq data in these cell lines. Dashed lines connect non-sequence-specific TFs to the motifs discovered in their ChIP-seq peaks. Two non-sequence-specific TFs show cell line-specific enrichment in motifs: the enhancer-binding protein EP300 and the histone deacetylase HDAC2. There are seven datasets for EP300 in seven different cell lines, and three datasets for HDAC2 in three different cell lines. Distinct motifs were found in different cell lines: SPI1 for EP300 in GM12878 cells; GATA1 (and GATA1-ext) for both EP300 and HDAC2 in K562 cells; FOXA and HNF4 for HDAC2, and FOXA and TCF7L2 for EP300 in HepG2 cells; SOX2-OCT4 and UA9 for HDAC2, and TEAD1 for EP300 in H1-hESC cells; and CEBPB, AP-1, and CREB for EP300 in HeLa cells. As described in the previous section, many of these motifs were most frequently and specifically observed as secondary motifs for sequence-specific TFs in the respective cell lines. Because non-sequence-specific TFs do not bind DNA directly, they tether onto sequence-specific TFs to bind target DNA. EP300 is known to interact with AP-1 and CEBPB (Chi-Chung Wang et al. 2007; Mink et al. 1997), and HDAC2 with TAL1-GATA (the motif is GATA1-ext) (Hu et al. 2009). Our results highlight that the interactions of EP300 and HDAC2 with sequence-specific TFs are highly cell type dependent.
We detect systematic relationships between specific combinations of regulatory factors at paired distal and promoter DNaseI hypersensitive sites (DHSs). For example, KLF4, SOX2, OCT4 (also called POU5F1) and NANOG are known to form a well-characterized transcriptional network controlling the pluripotent state of embryonic stem cells33. We found significant enrichment (P < 0.05) of the KLF4, SOX2 and OCT4 motifs within distal DHSs correlated with promoter DHSs containing the NANOG motif; enrichment of NANOG, SOX2 and OCT4 distal motifs co-occurring with promoter motif OCT4; and enrichment of distal SOX2 and OCT4 motifs with promoter SOX2 motifs (Supplementary Fig. 15a).
We also find significant co-associations between promoter types (defined by the presence of cognate motif classes; see Supplementary Methods) and motifs in paired distal DHSs. For example, when a member of the ETS domain family (motifs ETS1, ETS2, ELF1, ELK1, NERF (also called ELF2), SPIB, and others) is present within a promoter DHS, motif PU.1 (also called SPI1) is significantly more likely to be observed in a correlated distal DHS (P < 10^-5).
Comprehensive scans of DNaseI hypersensitive regions for high-confidence matches to all recognized transcription factor motifs in the TRANSFAC10 and JASPAR11 databases revealed striking enrichment of motifs within footprints (P ~ 0, Z-score = 204.22 for TRANSFAC; Z-score = 169.88 for JASPAR; Fig. 1b and Supplementary Fig. 3).
Given the enrichment of known sequence motifs within footprints, we sought to identify novel motifs within this genomic compartment. We performed de novo motif discovery within the ~45 million footprints identified in each of the 41 cell types resulting in 683 unique motif models (Fig. 6a). A total of 394 of the 683 (58%) de novo motifs matched distinct experimentally grounded motif models, accounting collectively for 90% of all unique entries across the three databases (Fig. 6b and Supplementary Fig. 14a-c). Notably, 289 of the footprint-derived motifs were absent from major databases (Fig. 6b and Supplementary Fig. 14d). These novel motifs populate millions of DNaseI footprints (Fig. 6c), and show features of in vivo occupancy and evolutionary constraint similar to motifs for known regulators, including marked anti-correlation with nucleotide-level vertebrate conservation (Figs 3b, 6e and Supplementary Figs. 8, 15a).
Cell-selective gene regulation is mediated by the differential occupancy of transcriptional regulatory factors at their cognate cis-acting elements. Figure 7b shows a heat-map representation of cell-selective occupancy at motifs for 60 known transcriptional regulators and for 29 novel motifs. This approach appropriately identified a number of known cell-selective transcriptional regulators including: (1) the pluripotency factors OCT4 (also called POU5F1), SOX2, KLF4 and NANOG in human embryonic stem cells37; (2) the myogenic factors MEF2A and MYF6 in skeletal myocytes38; and (3) the erythrogenic regulators GATA1, STAT1 and STAT5A in erythroid cells39-41 (Fig. 7b).
Many of the novel, footprint-derived motifs displayed markedly cell-selective occupancy patterns highly similar to those of the aforementioned well-established regulators. This suggests that many novel motifs correspond to recognition sequences for important but uncharacterized regulators of fundamental biological processes. Notably, both known and novel motifs with high cell-selective occupancy predominantly localized to distal regulatory regions (Fig. 7c), further highlighting the role of distal regulation in developmental and cell-selective processes 42,43.
Using the whole set of binding peaks of all TRFs in each cell line as background, we found that motifless binding peaks have very significant overlaps with our HOT regions (Table 5). This is true whether we consider all TRF peaks in the whole genome or only those in intergenic regions. In all cases, the z-score is more than 25, which corresponds to a p-value of less than 3 × 10^-138. A substantial portion of binding at HOT regions is thus attributed to non-sequence-specific binding. In our separate study, we found that motifless binding peaks have stronger DNase I hypersensitivity signals 20, which is also a signature of our HOT regions (Figure 4).
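The correspondence between a z-score of 25 and a p-value of about 3 × 10^-138 can be checked with the upper tail of the standard normal distribution. The sketch below assumes a one-sided test under a normal approximation, which is an assumption of this example rather than a statement of the exact procedure used.

```python
import math

def upper_tail_p(z):
    """One-sided p-value P(Z >= z) for a standard normal variable,
    computed with the complementary error function for numerical
    stability in the far tail."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

print(upper_tail_p(25.0))
```

Using `math.erfc` rather than `1 - cdf` avoids the loss of precision that would otherwise round such a small tail probability to zero.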
Our analysis also highlights the need for a more comprehensive catalog of sequence motifs of DNA binding proteins. If we instead define a TRF binding peak as motifless as long as it lacks either a previously-known motif or a newly discovered one, i.e., it could still have a motif from the other source, the overlap of the resulting "motifless" peaks with our HOT regions becomes statistically insignificant. Requiring a motifless binding peak to lack both types of motifs is likely more reliable.
The performance of the classifier using only proximal promoter information is close to that of a random classifier, across all tasks. All the classifiers using DHS sequences display strong improvements in performance over this baseline in discriminating genes that are up-regulated in different cell types (UR vs. UR-Other, Figure 5A), with a greater improvement in performance coming from the Split DHS approach with separate features for the TSS and Distal DHSs (median AuROC ~0.73). Similar results were obtained when training classifiers to distinguish between specifically up- and down-regulated genes from the same cell types (UR vs. DR, Figure 5B), and to distinguish up-regulated from constitutively expressed genes (UR vs. Const., Figure 5C). Discriminating down-regulated genes from different cell types (DR vs. DR-Other), and down-regulated from constitutively expressed genes (DR vs. Const.), resulted in lower accuracies but still showed the trend of better performance with DHS compared to proximal promoter sequence (Supplemental Figure 2A-B). All results clearly indicate that strong performance improvement is achieved by scanning for TFBS matches in open chromatin regions.
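For readers less familiar with the AuROC figures quoted above, the following minimal sketch computes the area under the ROC curve as the probability that a randomly chosen positive example receives a higher classifier score than a randomly chosen negative one (ties counted as one half). The scores and labels are hypothetical, and the function name `auroc` is an assumption of the example.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive
    (label 1) and negative (label 0) scores; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for two UR genes and two UR-Other genes.
print(auroc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

A value of 0.5 corresponds to a random classifier, which is the reference point for the proximal-promoter baseline described above.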
Identifying candidate regulators
In addition to classifying genes belonging to different groups, we inspected the classifiers to identify motifs that were most informative in the classification task, i.e., those PWMs that had large regression coefficients (Supplemental Table 4). This identified several TFs with known impact on transcriptional output in the cell line of interest. For example, YY1, SPI1 and IRF8 are crucial in the specification of B-cells (GM12878 cell line) (Lu et al. 2003; Liu et al. 2007; Sokalski et al. 2011). We also identified the REST motif as a positive regulator of UR genes in a medulloblastoma cell line that is of neural origin (Supplemental Table 6). REST specifically down-regulates neuron-specific genes in many non-neuronal cell lines, and its expression is suppressed in neurons (Schoenherr and Anderson 1995). As a result, the model identified the cis-elements that are present in the DHS associated with neuron-specific genes as the factor that separates these genes from the genes up-regulated elsewhere. This example illustrates that the inactivation of a repressor can also explain up-regulation of genes. Other well characterized factors included ETS1 in HUVEC cells and HNF4A for HepG2 cells (Cereghini 1996; Oda et al. 1999; Yordy et al. 2005).
For HNF4A in HepG2 and GATA1 in K562 cells, ChIP data are available from the ENCODE project. To validate the predictions made by our model, we looked for overlap of these ChIP sites with DHS sites associated with different sets of genes. In HepG2 cells, 19% of all genes with an associated DHS overlapped an HNF4A binding site. Strikingly, 64.5% of the UR genes had a DHS overlapping an HNF4A ChIP peak (p-value<1e-12, binomial test). Conversely, only 10.5% of DR genes had a DHS that overlapped an HNF4A site (p-value<1e-3). In K562 cells, we found that 6% of all genes had an associated DHS with a GATA1 ChIP peak. However, 31.5% (p-value<1e-12) of UR genes and only 3.5% (p-value<0.1) of DR genes had a DHS with a GATA1 ChIP peak. The ChIP binding data provided strong and independent evidence that our models identify relevant factors that regulate the transcriptional program in these cells.
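The binomial test used here compares the fraction of UR genes with an overlapping ChIP peak against the background fraction among all genes. The sketch below computes a one-sided exact binomial tail; the number of UR genes is not given in the text, so `n = 200` (and hence `k = 129`, i.e. 64.5%) is a hypothetical value chosen only for illustration, as is the one-sided form of the test.

```python
import math

def binom_upper_tail(k, n, p0):
    """Exact one-sided binomial p-value P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(math.comb(n, i) * p0 ** i * (1.0 - p0) ** (n - i)
               for i in range(k, n + 1))

# Background rate from the text: 19% of genes overlap an HNF4A site.
# n and k are hypothetical; only the proportion 64.5% comes from the text.
n_ur_genes = 200
k_with_peak = 129
print(binom_upper_tail(k_with_peak, n_ur_genes, 0.19))
```

With any gene set of a few hundred genes, an observed rate of 64.5% against a 19% background yields a tail probability far below the reported 1e-12 bound.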
To assess the presence of additional sequence motifs not accounted for by the sets of known PWMs, we used the discriminative version of MEME to perform motif finding (Bailey et al. 2010), identifying motifs differentially enriched between UR genes and either UR-other or DR genes (Supplemental Table 8). While some of the identified motifs corroborated the importance of features from the set of top 10 TFs (FOXA2 [formerly HNF3B] in HepG2), others corresponded to TFs that were not in this list. These are candidate TFs that are not among the most differentially expressed, but still might be involved in the transcriptional program, potentially through other steps of activation. We note that we largely did not recover the motifs recently identified in a subset of 7 of the 19 cell lines (Song et al. 2011). In contrast to this study, which used the sequences from cell-type specific DHS as foreground and the subsets of cell-type specific DHS in other cell types as background, we analyzed the sequences from all DHS associated with a gene, and defined the background according to the classification tasks.
For several factors, we observed indicative footprints in the region of the motif (Figure 6). For example, CRX was predictive of UR genes in the medulloblastoma cell line, and it exhibited a protected region at the motif (Figure 6A). Importantly, in other cell lines such as GM12878, LnCAP and MCF7, the CRX motif did not display a similar level of protection. While CRX has been shown to be expressed in certain medulloblastoma sub-types (Kool et al. 2008), other factors such as OTX2 have nearly identical PWMs and are known to be important for transcriptional regulation in medulloblastomas (Bunt et al. 2011). This highlights a caveat in predicting expression from motifs: while we can identify biologically relevant motifs, this type of analysis only suggests a subset of factors that likely bind to a specific motif.
Motif analysis of genomic regions bound by TCF7L2
To investigate the predominant motifs enriched in TCF7L2 binding sites, we applied a de novo motif discovery program, ChIPMotifs 28, 29, to the sets of TCF7L2 peaks in each cell type. We retrieved 300 bp for each locus from the top 1,000 binding sites in each set of TCF7L2 peaks and identified the top represented 6-mer and 8-mer (Additional file 13). For all cell lines, the same 6-mer (CTTTGA) and 8-mer (CTTTGATC) motifs were identified (except for HCT116 cells, for which the 8-mer was CCTTTGAT). These sites are almost identical to the Transfac binding motifs for TCF7L2 (TCF4-Q5:SCTTTGAW) and for the highly related family member LEF1 (LEF1-Q2:CTTTGA) and to experimentally discovered motifs in previous TCF7L2 ChIP-chip and ChIP-seq data 11, 30. These motifs are present in a large percentage of the TCF7L2 binding sites. For example, more than 80% of the top 1,000 peaks in each dataset from each cell type contain the core TCF7L2 6-mer W1 motif, with the percentage gradually dropping to approximately 50% of all peaks (Additional file 14).
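The two summary statistics in this paragraph, the most represented k-mer and the fraction of peaks containing a given motif, can be illustrated with a simple counting sketch. This is plain k-mer counting and is not the ChIPMotifs algorithm; the toy sequences and the function names are assumptions of the example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def top_kmer(sequences, k):
    """Most widely represented k-mer, counting each k-mer at most once
    per sequence. Returns (kmer, number_of_sequences_containing_it)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts.most_common(1)[0]

def fraction_with_motif(sequences, motif):
    """Fraction of sequences containing the motif on either strand."""
    motif = motif.upper()
    rc = reverse_complement(motif)
    hits = sum(1 for s in sequences if motif in s.upper() or rc in s.upper())
    return hits / len(sequences)

# Hypothetical peak sequences (real inputs would be 300 bp per locus).
toy_peaks = ["AACTTTGATT", "GGCTTTGACC", "TCAAAGTT", "ACGTACGTAC"]
print(fraction_with_motif(toy_peaks, "CTTTGA"))
```

Scanning both strands matters because a peak sequence may carry the motif as its reverse complement, as in the third toy sequence.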
The TCF7L2 motif is present in all the cell lines at the same genomic locations, but TCF7L2 binds to different subsets of these motifs in the different cell lines, which suggests that a cell type-specific factor may help to recruit and/or stabilize TCF7L2 binding to specific sites in different cells. Also, as shown above, TCF7L2 binds to enhancer regions, which are typified by having binding sites for multiple factors. To test the hypothesis that TCF7L2 associates with different transcription factor partners in different cell types, we identified motifs for other known transcription factors using the program HOMER 31. For these analyses, we used the subset of TCF7L2 binding sites that were specific to each of the 6 different cell types. The top 4 significantly enriched non-TCF7L2 motifs for each dataset are shown in Table 3; many of these motifs correspond to binding sites for factors that are expressed in a cell type-enriched pattern. To assess the specificity of the identified motifs with respect to TCF7L2 binding, we chose one motif specific to HepG2 TCF7L2 binding sites (HNF4α) and one motif specific to MCF7 TCF7L2 binding sites (GATA3) and plotted motif densities in the HepG2 cell type-specific TCF7L2 peaks (Figure 4A) and the MCF7 cell type-specific TCF7L2 peaks (Figure 4B). In HepG2 cells, the HNF4α motif, but not the GATA3 motif, is highly enriched at the center of TCF7L2 binding regions. In contrast, in MCF7 cells the GATA3 motif, but not the HNF4α motif, is highly enriched at the center of TCF7L2 binding regions.
We find that transcription factors can have cell type-specific primary motifs. This finding is in addition to the finding that TFs bind loci with cell type-specific cofactor motifs. We specifically show that YY1 and JunD can bind cell type-specific primary motifs that may correspond to cell type-specific oligomerization partners. This is particularly relevant for JunD and YY1 binding sites that are accessible in multiple cell types, but only bound in one, where the specificity seems to be provided solely by the differential primary motif. This is in contrast to factors that are differentially recruited by cofactors, where differential DNase accessibility has equal or greater capacity for predicting cell type-specific binding.