Transcription factor motifs associated with modular chromatin state dynamics of high-resolution regulatory elements

Integrative analysis of 111 reference human epigenomes.

Roadmap Epigenomics Consortium et al. Nature 10.1038/nature14248

We next exploited the dynamics of epigenomic modifications at cis-regulatory elements to gain insights into gene regulation. We focused on 2.3M regions (12.6% of the genome) showing DNA accessibility in any reference epigenome and regulatory (promoter or enhancer) chromatin states, considering enhancer-only, promoter-only, or enhancer-promoter alternating states separately (Fig. S11). We clustered enhancer-only elements (Enh, EnhBiv, EnhG) into 226 enhancer modules of coordinated activity (Fig. 7a), promoter-only elements into 82 promoter modules (Fig. S11a) and promoter/enhancer ‘dyadic’ elements into 129 modules (Fig. S11b), enabling us to distinguish ubiquitously-active, lineage-restricted, and tissue-specific modules for each group. Focusing on the enhancer-only clusters, we found that the neighboring genes of enhancers in the same module showed significant enrichment for common functions (Ashburner, 2000) (Fig. 7b, Fig. S11c,d), common genotype-phenotype associations65 (Fig. 7c), and common expression in their mouse orthologs (Fig. S12), each annotation type showing strong consistency with the known biology of the corresponding tissues. For example, stem-cell enhancers are enriched near developmental patterning genes, immune cell enhancers near immune response genes, and brain enhancers near learning and memory genes (Fig. 7b). Sub-clustering of individual modules continued to reveal distinct enrichment patterns of individual sub-modules (Fig. S11e), suggesting increased diversity of regulatory processes beyond the 226 modules used here.

The genome sequence of enhancers in the same module showed substantial enrichment for sequence motifs67 associated with diverse transcription factors (Fig. S13a). We found 84 significantly enriched motifs in 101 modules (Extended Data 8), indicating that enhancer modules likely represent co-regulated sets, and suggesting candidate upstream regulators for nearly half of all modules. Direct application of the same approach and thresholds to the putative regulatory regions annotated in each of the 111 reference epigenomes yielded only 10 significantly enriched motifs in 15 reference epigenomes (Fig. S13b,c), of which 8 are blood samples, and focusing on the regions unique to each of the 17 tissue groups (Fig. 2b) yielded only 19 enriched motifs in 10 tissue groups (Fig. S13d,e), emphasizing the importance of studying regulatory motif enrichments at the level of enhancer modules.
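As a rough illustration of the module-level enrichment described above, the following Python sketch computes a log2 ratio of motif hits in a module's enhancers to hits of shuffled control motifs, and keeps modules that pass a fold threshold. The function names, the pseudocount and the input layout are assumptions made for illustration. The 1.5 default mirrors the 2^1.5-fold display threshold quoted in the caption of Figure 4; the statistic actually used in the study may differ.

```python
import math


def log2_enrichment(true_hits, control_hits, pseudocount=1.0):
    """log2 ratio of real-motif hits to shuffled-control hits (pseudocount is an assumption)."""
    return math.log2((true_hits + pseudocount) / (control_hits + pseudocount))


def enriched_modules(counts, threshold=1.5):
    """Keep modules whose |log2 enrichment| reaches the threshold.

    counts: hypothetical mapping {module_id: (true_hits, control_hits)}.
    Positive values indicate enrichment, negative values depletion.
    """
    selected = {}
    for module_id, (true_hits, control_hits) in counts.items():
        score = log2_enrichment(true_hits, control_hits)
        if abs(score) >= threshold:
            selected[module_id] = score
    return selected


# Toy counts, not taken from the study.
toy_counts = {"m1": (31, 7), "m2": (10, 9), "m3": (7, 31)}
selected = enriched_modules(toy_counts)
```

With these toy counts, m1 is retained as enriched and m3 as depleted, while m2 falls below the threshold.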

We next sought to distinguish likely activator and repressor motifs, by identifying regulators whose expression pattern across cell/tissue types shows a strong (positive or negative) correlation with the activity of enhancers in the enriched module9. We focused on the 40 most strongly expression-correlated regulators (Extended Data 9a), and used the module-level motif enrichments to link each regulator to the cell/tissue types that define each module (Fig. 8). We found that many of the inferred links correspond to known regulatory relationships, including: OCT4 (also known as POU5F1) in pluripotent cells, HNF1B and HNF4A1 in liver and other digestive tissues, RFX4 in neurosphere and neuronal cells, and MEF2D in muscle. The most enriched regulators showed primarily positive correlations, suggesting they function as transcriptional activators, while a subset of factors showed a negative correlation, with the factor expressed in the lineages where its motif showed enhancer depletion, suggesting a repressive role. For example, REST (also known as NRSF), a known repressor of neuronal lineages, was least expressed in neuronal tissues, where its motif was most enriched in enhancers, and a similar signature was found for ZBTB1B, a known repressor of myogenesis and brain development.
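The activator/repressor distinction above rests on the sign of a correlation between regulator expression and module activity across cell/tissue types. The sketch below shows that idea in plain Python using a Pearson correlation. The 0.5 cutoff, the function names and the toy vectors are assumptions for illustration, not values from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def classify_regulator(expression, module_activity, cutoff=0.5):
    """Label a regulator by the sign of its expression/activity correlation.

    expression and module_activity are values across the same ordered set of
    cell/tissue types; the cutoff is a hypothetical choice.
    """
    r = pearson(expression, module_activity)
    if r >= cutoff:
        return "candidate activator", r
    if r <= -cutoff:
        return "candidate repressor", r
    return "unclassified", r


# Toy data: expression tracks activity in one case and opposes it in the other.
label_pos, r_pos = classify_regulator([1, 2, 3, 4], [2, 4, 6, 8])
label_neg, r_neg = classify_regulator([1, 2, 3, 4], [8, 6, 4, 2])
```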

Regulatory motifs predicted to be drivers of enhancer activity patterns showed significant enrichment in tissue-specific high-resolution (6bp-40bp) DNase digital genomic footprints (DGF)68 in matching cell types (Extended Data 9b, Table S5b), providing DNA accessibility evidence that the motifs are indeed bound in these cell types. In addition, they showed positional bias relative to both the center of DGF locations, and relative to their boundaries (Extended Data 10), a property not found for shuffled motifs69. These positional biases were highly tissue- and cell type-specific for most activating factors (Extended Data 9c), including POU5F in iPSCs, MEF2D in heart, HNF1B in GI tissues, BHLH in brain, SPI1 in immune cells, and MEF2 in heart and muscle, in each case matching the tissues that showed the highest enrichment. In contrast, for repressive factors and CTCF, positional biases were found in large numbers of tissues, even when the motifs were not enriched in active enhancers. For example, REST (NRSF) was positionally biased in DGF sites in nearly all tissues except brain (Extended Data 9c), even though it was only enriched in active enhancers in brain (Extended Data 9a), consistent with widespread repressive binding in non-brain tissues.

Disease variants map to transcription factor binding sites but rarely disrupt consensus motifs

Genetic and epigenetic fine mapping of causal autoimmune disease variants.

Farh, K. K.-H. et al. Nature 10.1038/nature13835

The enrichment of candidate causal variants within enhancers suggests that they affect disease risk by altering gene regulation, but does not distinguish the underlying mechanisms. Enhancer activity is dependent on complex interplay between transcription factors, chromatin, non-coding RNAs and tertiary interactions of DNA loci27. A straightforward hypothesis is that disease SNPs alter transcription factor binding. Indeed, PICS SNPs tend to coincide with nucleosome-depleted regions, characterized by DNase hypersensitivity and localized (~150 bp) dips in H3K27ac signal26, which are indicative of transcription factor occupancy (Fig. 5a).

We therefore overlapped PICS SNPs with 31 transcription factor binding maps generated by ENCODE26 (Fig. 5b). Candidate causal SNPs are strongly enriched within binding sites for immune-related transcription factors, including NFkB, PU1 (also known as SPI1), IRF4, and BATF. Variants associated with different diseases correlate to different combinations of transcription factors that control immune cell identity and response to stimulation. For example, multiple sclerosis SNPs preferentially coincide with NFkB, EBF1 and MEF2A-bound regions, whereas rheumatoid arthritis and coeliac disease SNPs preferentially coincide with IRF4 regions.

Next, we examined whether causal variants disrupt or create cognate sequence motifs recognized by these transcription factors. We focused on 823 of the highest-likelihood non-coding PICS SNPs, an estimated 30% of which represent true causal variants. We identified PICS SNPs that alter motifs for NFkB (n = 2), AP-1 (n = 8), or ETS/ELF1 (n = 5). Overall, we identified 7 known transcription factor motifs and 6 conserved sequence motifs28,29 with a significant tendency to overlap causal variants likely to alter binding affinity. Of the highest-likelihood SNPs, 7% affected one of these over-represented motifs, with a roughly equal distribution between motif creation and disruption (Extended Data Fig. 9).

A notable motif-disrupting PICS SNP is the Crohn’s disease-associated variant rs17293632 (C>T, minor allele increases disease risk; PICS probability ~54%), which resides in an intron of SMAD3 (Fig. 5c). SMAD3 encodes a transcription factor downstream of transforming growth factor β (TGF-β) with pleiotropic roles in immune homeostasis30. The SNP disrupts a conserved AP-1 consensus site. ChIP-seq data for AP-1 factors (Jun, Fos) in a heterozygous cell line reveal robust binding to the reference sequence, but not to the variant sequence created by the SNP. As described above, a prominent AP-1 signature is associated with enhancers activated upon immune stimulation (Fig. 2a). This suggests that rs17293632 may increase Crohn’s disease risk by directly disrupting AP-1 regulation of the TGF-β-SMAD3 pathway.

Despite this and other compelling examples, only ~7% of the highest-likelihood non-coding PICS SNPs alter an over-represented TF motif. Scanning a large database of transcription factor motifs, we found that ~13% of high-likelihood causal SNPs create or disrupt some known consensus sequence derived by in vitro selection28, whereas ~27% create or disrupt a putative consensus sequence derived from phylogenetic analysis29. However, these proportions are similar to the rate for background SNPs (Fig. 5d). Even extrapolating for uncertainty in causal SNP assignments, our data suggest that at most 10–20% of non-coding GWAS hits act by altering a recognizable transcription factor motif.

Notwithstanding their infrequent coincidence with the precise transcription factor motifs, non-coding PICS SNPs have a strong tendency to reside in close proximity to such sequences. Candidate causal variants are most significantly enriched in the vicinity of NFkB, RUNX1, AP-1, ELF1, and PU1 motifs (Extended Data Fig. 9), with 26% residing within 100 bp of such a motif. These findings parallel recent studies of genetic variation in mice, where DNA variants affecting NFkB binding are dispersed in the vicinity of the actual binding sites31. Our results suggest that many causal non-coding SNPs modulate transcription factor-dependent enhancer activity (and confer disease risk) by altering adjacent DNA bases whose mechanistic roles are not readily explained by existing gene regulatory models.
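The proximity statistic quoted above (the share of SNPs lying within 100 bp of a motif) can be sketched as a simple distance calculation. The coordinate convention (half-open motif intervals), the function names and the toy coordinates are all assumptions for illustration, and the quadratic scan is only suitable for small inputs.

```python
def distance_to_interval(pos, start, end):
    """Distance in bp from a position to a half-open interval [start, end); 0 if inside."""
    if pos < start:
        return start - pos
    if pos >= end:
        return pos - end + 1
    return 0


def fraction_near_motif(snps, motifs, window=100):
    """Fraction of SNPs within `window` bp of any motif instance.

    snps:   list of (chrom, pos)
    motifs: list of (chrom, start, end), half-open (an assumed convention)
    """
    if not snps:
        return 0.0
    near = 0
    for chrom, pos in snps:
        if any(
            motif_chrom == chrom and distance_to_interval(pos, start, end) <= window
            for motif_chrom, start, end in motifs
        ):
            near += 1
    return near / len(snps)


# Toy coordinates, not real variants.
toy_snps = [("chr1", 150), ("chr1", 500), ("chr2", 150), ("chr1", 105)]
toy_motifs = [("chr1", 200, 210)]
near_fraction = fraction_near_motif(toy_snps, toy_motifs)
```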

Elucidating possible mechanisms by which allelically biased enhancer activities arise

Integrative analysis of haplotype-resolved epigenomes across human tissues.

Leung, D. et al. Nature 10.1038/nature14217

To further elucidate the mechanism by which allelically biased enhancer activities arise, we examined SNPs that potentially disrupt or weaken TF binding motifs. We calculated changes in motif score between alleles (motif disruption score) at allelic enhancers and discovered 133 TF motifs showing significant concordance between allelic reduction of enhancer activities and TF motif disruption (Fig. 5a and b; FDR = 10%, Supplementary Table 9; see Supplementary Information). Moreover, genes with allelically biased expression were concordant with enhancer motif disruptions within close proximity (<20 kb) or displaying strong Hi-C interactions at longer distances (>20 kb) (Fig. 5c; see Supplementary Information). Our results therefore suggest that genetic variations are likely responsible for allelic enhancer activities and consequently allelically biased gene expression.
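A motif disruption score of the kind mentioned above can be sketched as the difference between the best position weight matrix (PWM) score of each allele over windows that overlap the SNP. The code below is a minimal sketch under stated assumptions: the PWM is a list of per-position log-odds dictionaries, only the forward strand is scanned, and the toy matrix and sequence are invented. The scoring details used in the study are given in its Supplementary Information and may differ.

```python
def score_window(pwm, window):
    """Sum of per-position log-odds scores for a sequence window of motif length."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_overlapping_score(pwm, seq, snp_pos):
    """Best PWM score over all forward-strand windows that cover snp_pos (0-based)."""
    width = len(pwm)
    first = max(0, snp_pos - width + 1)
    last = min(snp_pos, len(seq) - width)
    scores = [score_window(pwm, seq[s:s + width]) for s in range(first, last + 1)]
    if not scores:
        raise ValueError("sequence is shorter than the motif")
    return max(scores)


def motif_disruption_score(pwm, seq, snp_pos, allele1, allele2):
    """Allele-1 score minus allele-2 score; positive means allele 2 weakens the motif."""
    seq1 = seq[:snp_pos] + allele1 + seq[snp_pos + 1:]
    seq2 = seq[:snp_pos] + allele2 + seq[snp_pos + 1:]
    return (best_overlapping_score(pwm, seq1, snp_pos)
            - best_overlapping_score(pwm, seq2, snp_pos))


# Toy PWM for consensus "GAT": +1 for the consensus base, -1 otherwise (invented values).
toy_pwm = [{b: (1.0 if b == c else -1.0) for b in "ACGT"} for c in "GAT"]
delta = motif_disruption_score(toy_pwm, "CCGATCC", 3, "A", "C")
```

Here allele A preserves the toy motif and allele C breaks its central position, so the score is positive. This matches the sign convention in the Figure 9 caption, where positive P1 − P2 values indicate disruption on P2.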

Epigenome-derived surface ectoderm regulatory network

Regulatory network decoded from epigenomes of surface ectoderm-derived cell types.

Lowdon, R. F. et al. Nature Communications 10.1038/ncomms6442

Given their regulatory element signatures, overlap with DNase I-hypersensitive sites and enrichment for relevant transcription factor binding site (TFBS) motifs, we hypothesized that hypomethylated SE-DMRs may be regulatory elements that coordinate expression of genes essential for function of SE-derived cells. To test this, we sought to connect these putative regulatory elements to genes in a SE gene network. We associated DMRs with nearby putative target genes and queried databases of TF-target genes and gene–gene interactions to construct regulatory relationships among these genes. The result is a highly connected network with a statistically significant number of connections (1458 edges, 278 nodes; P-value = 1.25e−4), whose degree distribution follows a power law (R² = 0.89).
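One common, if crude, way to check whether a network's degree distribution resembles a power law is a straight-line fit on log-log axes, reporting the slope and R². The sketch below illustrates that approach on an undirected edge list. It is an assumption that a fit of this kind underlies the reported R²; the text does not state the fitting procedure, and the function names and toy values are illustrative.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {degree: number of nodes with that degree} for an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def fit_power_law(distribution):
    """Least-squares fit of log(count) = intercept + slope * log(degree).

    Returns (slope, intercept, r_squared). Needs at least two distinct degrees.
    """
    points = [(math.log(k), math.log(n)) for k, n in distribution.items() if k > 0 and n > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct degrees")
    count = len(points)
    mean_x = sum(x for x, _ in points) / count
    mean_y = sum(y for _, y in points) / count
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Toy distribution that follows count = 64 * degree^-2 exactly.
slope, intercept, r_squared = fit_power_law({1: 64, 2: 16, 4: 4, 8: 1})
star = degree_distribution([("a", "b"), ("a", "c"), ("a", "d")])
```

For real networks, binning choices and low-count degrees can strongly influence such a fit, so the R² should be read as descriptive rather than as a formal test.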

Strikingly, the transcription factors near the top of the inferred SE network were those whose motifs were enriched in the hypomethylated SE-DMRs. This observation, along with the network connectivity data, suggested that TFAP2a, TFAP2c and KLF4 may regulate many of the downstream genes in this network. To identify biological processes associated with each set of hypomethylated DMRs containing either TFAP2 or KLF4 TFBSs, we performed GREAT analysis14. The network was characterized by two partially overlapping major branches (summarized data in Fig. 5a). The first branch included the transcription factors TFAP2a and TFAP2c, and connected to genes associated with SE-relevant GO terms, for example, ‘hemidesmosome assembly’, which is a structural complex critical for epithelial cells21, and ‘Notch signalling’, which functions in mammary cell fate commitment22 and keratinocyte homeostasis23 (Fig. 5b). The second branch was characterized by KLF4 and was associated with ‘mammary gland development’ and ‘Wnt signalling’, which influences both breast and keratinocyte cell fate decisions24, 25 (Fig. 5c). Thus, we observed a highly structured set of connections between regulatory elements and putative target genes that underlie and integrate signalling pathways vital for both keratinocyte and mammary gland epithelial cell function.

Positional properties of regulatory sequence motifs associated with chromatin marks

Predicting the human epigenome from DNA motifs.

Whitaker, J. W., Chen, Z. & Wang, W. Nature Methods 10.1038/nmeth.3065

The identified cis elements may play various roles in shaping the epigenome, such as setting the boundary of a histone modification domain or opening chromatin to allow remodeling enzymes to bind DNA. These roles may restrain the relative location (edge or center) of a motif within the modified regions (Fig. 4a). Although the majority of the motifs fell into the ‘neutral’ category (no location preference), numerous motifs showed biased location distributions (Fig. 4b and Supplementary Fig. 7). The heterochromatin mark H3K9me3 was associated with edge and neutral motifs but not with any central motifs, suggesting that the edge motifs may help set the boundary of the large H3K9me3 domain. Concentrated marks of H3K4me1, H3K4me3 and H3K27ac were associated with central motifs, which may guide the recruitment of the chromatin-modifying enzymes to initiate, or other factors to maintain, the modifications29,30. Interestingly, whereas the enhancer marks H3K4me1 and H3K27ac had no edge motifs, the promoter marker H3K4me3 was associated with several edge motifs, which may help define the promoter boundary. In contrast to H3K9me3, the widespread histone mark H3K27me3 and the DMV were largely associated with central motifs, which suggests different regulatory mechanisms. The transcriptional activity mark H3K36me3 almost exclusively associated with neutral motifs.
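The edge/central/neutral classification can be illustrated with a small sketch based on the procedure summarized in the Figure 11 caption: motif scores are summed in five bins across each peak, and edge and center bins are compared with a χ2 test at a P-value cutoff of 10^-10. The null expectation used here (score proportional to the number of bins, two edge bins versus one center bin), the function names and the toy bin scores are assumptions; the study's Online Methods define the actual test.

```python
import math


def chi2_p_value_df1(chi2):
    """Upper-tail P-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def classify_location(bin_scores, p_cutoff=1e-10):
    """Classify a motif as 'edge', 'central' or 'neutral' from five bin scores.

    bin_scores: [edge, shoulder, center, shoulder, edge] summed motif scores.
    Assumed null: score splits 2:1 between the two edge bins and the center bin.
    """
    if len(bin_scores) != 5:
        raise ValueError("expected five bin scores")
    edge = bin_scores[0] + bin_scores[4]
    center = bin_scores[2]
    total = edge + center
    if total <= 0:
        return "neutral"
    expected_edge = total * 2.0 / 3.0
    expected_center = total / 3.0
    chi2 = ((edge - expected_edge) ** 2 / expected_edge
            + (center - expected_center) ** 2 / expected_center)
    if chi2_p_value_df1(chi2) >= p_cutoff:
        return "neutral"
    return "edge" if edge > expected_edge else "central"


# Toy bin scores, not taken from the study.
edge_like = classify_location([500, 100, 20, 100, 500])
central_like = classify_location([20, 100, 800, 100, 20])
neutral_like = classify_location([100, 100, 50, 100, 100])
```

The stringent cutoff means that modest imbalances, as in the third toy example, remain in the neutral category.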

The majority (81%) of the H3K9me3 edge motifs were found in H1, and these matched the known motifs of KLF12, Rel homology domain and YY1 (Fig. 4c,d). Multiple lines of evidence support these associations. KLF12 mediates transcriptional repression through interaction with phosphoprotein CtBP31, which forms a complex with histone methyltransferase and DNA-binding proteins to target H3K9 for methylation32. NFKB1, a member of the Rel homology domain TFs, is known to function with deacetylase SIRT6 to repress gene expression via H3K9 deacetylation33, which clears the site for methylation. YY1 is a transcriptional regulator that directs localization of histone acetyltransferases, deacetylases and members of the PRC2 complex34, which directs the placement of H3K9me3 and H3K27me3 (ref. 35). Furthermore, YY1 knockdown during mouse spermatogenesis results in global decrease of H3K9me3 (ref. 36). Further analysis of the H1 H3K9me3 edge motifs suggested that they may represent a regulatory system present in human embryonic stem cells for establishing regions of heterochromatin and repressing gene expression (Supplementary Note and Supplementary Data 1). In light of findings that show H3K9me3 as a primary epigenomic determinant during induced pluripotent stem cell reprogramming37, we speculate that these interactions may be important in establishing and maintaining the pluripotent state.

DNA methylation defines a breast cell regulatory network

Epigenetic and transcriptional determinants of the human breast.

Gascard, P. et al. Nature Communications 10.1038/ncomms7351

To discover putative regulatory elements including actively bound enhancers20 we identified 26,601 and 53,751 DNA unmethylated regions (UMR) in myoepithelial and luminal epithelial cells, respectively, from whole genome bisulfite datasets21 (Supplementary Figure 15 and Supplementary Table 8). We validated these UMRs against orthogonal MeDIP-seq and MRE-seq datasets (Supplementary Figure 16). Nearest gene-based analysis22 of the UMRs revealed highly significant enrichment for expected pathways, including smooth muscle contraction (binomial FDR Q-value = 10^-71) for myoepithelial cells and mammary gland epithelium development (binomial FDR Q-value = 10^-72) for luminal epithelial cells (Supplementary Figure 17). The UMRs overlapped almost exclusively with enhancer chromatin states (Supplementary Figure 18), displayed significant overlap with ENCODE transcription factor (TF) ChIP-seq regions (luminal epithelial: 58%, myoepithelial: 60%, with ~20% expected by chance), and intersection with GWAS alleles revealed a direct overlap with 10 SNPs recently associated with breast cancer risk loci (Supplementary Figure 19, Supplementary Table 9), suggesting that these UMR regions are highly enriched in regulatory elements that define breast cell types.

Strikingly, we observed an asymmetry in the number of regulatory UMRs between breast cell types with 51 transcription factors having at least 2 times more sites in luminal than myoepithelial UMRs, while no TFs were more abundant in myoepithelial UMRs (Supplementary Table 10). The top three transcription factors (FOXA1, GATA-3 and ZNF217) ranked by abundance of UMR-defined regulatory elements, are critical regulators of luminal cell biology and were at least 8 times more abundant in luminal vs. myoepithelial cells (Figure 3a). To explore how these regulatory elements could influence the breast cell transcriptional program, we associated them with genes and found a highly significant overlap (81%) between cell type-specific UMRs and proximal upregulated DE genes (Figure 3b). Interestingly, this association was highly directional, with the majority (90.3%) of proximal UMRs associated with increased transcription in their respective cell types (Figures 3b and d). This directionality was reduced for distal UMRs. Only 6% of commonly expressed genes were found to be associated with cell type-specific proximal UMRs, highlighting their importance in defining a cell type-specific transcriptional program (Figure 3c).

Figure 1

Enhancer/promoter module clustering and gene set enrichment analysis. a. Promoter modules. Clustering of 81,232 DNaseI-accessible promoter-marked regions into 82 promoter modules (1.44% of the genome). b. Promoter/enhancer ‘dyadic’ modules. Clustering of 129,960 DNaseI-accessible promoter/enhancer regions into 129 dyadic modules (0.99% of the genome). c. Enhancer modules. Clustering of 2,328,936 DNaseI-accessible enhancer-marked regions into 226 enhancer modules (12.64% of the genome). Same as Fig. 7a, with cluster names shown. d. Gene Ontology Biological Processes gene set enrichments (y-axis) for genes proximal to enhancer regions in each of the 226 activity-based clusters (x-axis). Colors indicate level of statistical significance (white: p > 0.01, yellow: 0.001 < p < 0.01, orange: 0.0001 < p < 0.001, red: p < 0.0001). Cluster sizes in terms of number of enhancers are shown at the top, along with cluster reference numbers. e. Subclustering of cluster c98 (indicated by an arrow and asterisk in (a)) into 9 subclusters (top), each having a distinct Gene Ontology Biological Processes gene set enrichment pattern (bottom).

Figure 2: Regulatory modules from epigenome dynamics.

a. Enhancer modules by activity-based clustering of 2.3 million DNase-accessible regions classified as Enh, EnhG or EnhBiv (color) across 111 reference epigenomes. Vertical lines separate 226 modules. Broadly-active enhancers shown first. Module IDs shown in Fig. S11c. b-c. Proximal gene enrichments54 for each module using gene ontology (GO) biological process (panel b) and human phenotypes (panel c). Rectangles pinpoint enrichments for selected modules. Representative gene set names (left) selected using bag-of-words enrichment.

Figure 3

Additional gene set enrichment analysis results. Cluster sizes in terms of number of enhancers are shown at the top, along with cluster reference numbers as in Fig. S11. a. Disease Ontology Database120 results. Rectangles indicate areas of interest, pointing to groups of cell type-restricted clusters clearly enriched for specific gene sets. b. Mouse Genome Informatics121 database enrichments, indicating clusters of enhancer regions associated with orthologous mouse genes with particular anatomical regional expression patterns.

Figure 4: Regulatory motifs enriched in clusters.

a. Enrichment (red) or depletion (blue) of regulatory motifs (rows) in the enhancer modules (columns) relative to shuffled control motifs. For each motif, the motif name, consensus logo, and correlation between regulator expression and module activity are shown: positive correlation (orange) is indicative of activators, and negative correlation (purple) indicates a repressive role for the factor. Only clusters with enrichment or depletion of at least 2^1.5-fold for one motif are shown. b. Average activity level of enhancers of each module in each reference epigenome (black=high, white=low). Bottom: Total size of each enhancer module showing enrichment (in kb).

Figure 5: Linking regulators to their target enhancers

Module-level regulatory motif enrichment (Fig. S11) and correlation between regulator expression and module activity patterns (Extended Data 8a) are used to link regulators (boxes) to their likely target tissue and cell types (circles). Edge weight represents motif enrichment in the reference epigenomes of highest module activity.

Figure 6

Heatmaps showing tissue/cell type similarity measured using different epigenomic marks. a-c. Pearson correlation values calculated between epigenomes for a variety of marks, assessed within relevant chromatin states of the 15-state core model (see Methods). a. Five core marks in their corresponding relevant chromatin states. b. H3K27ac in Enh and TssA state regions. c. H3K9ac in Enh and TssA state regions. d-e. Similarly, for DNase (d) in DNase regions, RNA-seq (e) using RPKM values across genes.

Figure 7

Multidimensional scaling (MDS) plots showing tissue/cell type similarity using different epigenomic marks. a-i. Multi-Dimensional Scaling (MDS) analysis results, showing reference epigenomes using their group coloring defined in Fig. 2. Thin lines connect same-group reference epigenomes. The first 5 axes of variation are shown in pairs. Marks are assessed in the same regions as used for Figures S1 and S9. a. H3K4me1, b. H3K4me3, c. H3K27me3, d. H3K27ac, e. H3K9ac, f. DNase, g. H3K36me3, h. RNA-seq RPKM, i. H3K9me3

Figure 8: Causal variants map to regions of TF binding.

a, Plot depicts composite H3K27ac and DNase signals26 in immune cells over PICS autoimmunity SNPs. PICS SNPs overall coincide with nucleosome-depleted, hypersensitive sites, indicative of TF binding. b, Bar plot indicates TFs whose binding is enriched near PICS SNPs for all 21 autoimmune diseases26. Heatmap depicts enrichment of these TFs near variants associated with specific diseases (red:high; blue:low). c, H3K27ac, DNaseI26 and conservation signals, and selected TF binding intervals are shown in a SMAD3 intronic locus. rs17293632, a noncoding candidate causal SNP for Crohn’s disease, disrupts a conserved AP-1 binding motif in an enhancer marked by H3K27ac in CD14+ monocytes. Summing of ChIP-seq reads overlapping the SNP in the heterozygous HeLa cell line shows that only the intact motif binds AP-1 TFs, Jun and Fos. d, Bar graph shows the fraction of PICS SNPs (black) versus random SNPs from the same locus (white) that create or disrupt one of the significantly enriched motifs, any Selex motif, or any conserved K-mer. Error bars indicate standard deviation from 1000 iterations using locus-matched control SNPs.

Figure 9: Motif disruption by genetic variants is concordant with allelic H3K27ac biases at enhancers.

a) Differential GABPA binding motif scores between two alleles (P1 − P2 motif scores) in LV are correlated with the proportion of H3K27ac reads corresponding to the P1 allele (top). Values range from negative to positive, indicating P1 and P2 motif disruption, respectively. An example on chromosome 12 illustrates that P1, with a motif-preserving C allele, has higher H3K27ac enrichment, whereas P2, with the motif-disrupting T allele, has little H3K27ac enrichment (bottom).

Figure 10: SE-DMRs are regulatory elements in a gene network.

(a) Summary of the TF-target gene regulatory network derived from SE-DMR analyses. The categories at the bottom of the panel represent enriched biological processes or pathways for genes associated with DMRs containing TFAP2 or KLF4 motifs. TFAP2-associated TFs/pathways highlighted in blue; KLF4-associated pathways in grey. (b) Functional enrichment for TFAP2 motif containing hypomethylated SE-DMRs. (c) Functional enrichment for KLF4 motif containing hypomethylated SE-DMRs. (d) RNA expression values for SE-DMR associated hemidesmosome/basement membrane genes for SE and non-SE cell types. Skin cell-type values are averages (mean) of three biological replicates. Error bars are s.e.m. (e) WashU Epigenome Browser screenshot of hemidesmosome/basement membrane genes. MeDIP-seq tracks depicted in green, yellow and blue; all track y axes heights are 60 RPKM. DNase-seq track is shown in light blue. Genes depicted as black lines. SE-DMRs depicted as red boxes and TFAP2 motifs as maroon lines.

Figure 11: Predictive motifs have location preferences.

(a) Hierarchical clustering of 812 motifs showing positive interplay in the 'single-mark' analysis by their location preferences. The motifs were scanned against their corresponding modification peaks. The scores were then summed in five bins that represent different regions of the peaks (Supplementary Fig. 8). The bin scores for each motif were then hierarchically clustered. Motifs with edge or central preferences were classified by comparing edge and center bin scores and by using a χ2 test P-value cutoff of 10^-10 (Online Methods). (b) Summary of the location preference of motifs by mark specificity. (c) Motifs that associate with H3K9me3 edges in H1. NFKB1 is given as an example of a Rel homology domain. (d) Four screen shots49 showing YY1 ChIP-seq reads at the edge of a region of H3K9me3. Clockwise from the top left, the four YY1 sites start at chromosome 2 (chr2) 17515620, chr6 16069456, chr12 14424514 and chr2 626745 (genome version: hg18). (e) De novo (G+C)-(A+T) hybrid motif aligned to the NR4A2 monomeric and dimeric motifs. (f) Average profile of two (G+C)-(A+T) hybrid motifs (red and blue lines) and H3K4me3 (black line) at 13,962 TSSs (Online Methods).