Integrative analyses of reference epigenomes reveal context-specific regulatory motifs, factors, modules, pathways and networks
Transcription factor motifs associated with modular chromatin state dynamics of high-resolution regulatory elements
Integrative analysis of 111 reference human epigenomes.
Roadmap Epigenomics Consortium et al. Nature 10.1038/nature14248
We next exploited the dynamics of epigenomic modifications at cis-regulatory elements to gain insights into gene regulation. We focused on 2.3M regions (12.6% of the genome) showing DNA accessibility in any reference epigenome and regulatory (promoter or enhancer) chromatin states, considering enhancer-only, promoter-only, or enhancer-promoter alternating states separately (Fig. S11). We clustered enhancer-only elements (Enh, EnhBiv, EnhG) into 226 enhancer modules of coordinated activity (Fig. 7a), promoter-only elements into 82 promoter modules (Fig. S11a) and promoter/enhancer ‘dyadic’ elements into 129 modules (Fig. S11b), enabling us to distinguish ubiquitously active, lineage-restricted, and tissue-specific modules for each group. Focusing on the enhancer-only clusters, we found that the neighboring genes of enhancers in the same module showed significant enrichment for common functions (Ashburner, 2000; Fig. 7b, Fig. S11c,d), common genotype-phenotype associations65 (Fig. 7c), and common expression in their mouse orthologs (Fig. S12), each annotation type showing strong consistency with the known biology of the corresponding tissues. For example, stem-cell enhancers are enriched near developmental patterning genes, immune cell enhancers near immune response genes, and brain enhancers near learning and memory genes (Fig. 7b). Sub-clustering of individual modules continued to reveal distinct enrichment patterns of individual sub-modules (Fig. S11e), suggesting increased diversity of regulatory processes beyond the 226 modules used here.
The genome sequence of enhancers in the same module showed substantial enrichment for sequence motifs67 associated with diverse transcription factors (Fig. S13a). We found 84 significantly enriched motifs in 101 modules (Extended Data 8), indicating that enhancer modules likely represent co-regulated sets, and proposing candidate upstream regulators for nearly half of all modules. Direct application of the same approach and thresholds to the putative regulatory regions annotated in each of the 111 reference epigenomes yielded only 10 significantly enriched motifs in 15 reference epigenomes (Fig. S13b,c), of which 8 are blood samples, and focusing on the regions unique to each of the 17 tissue groups (Fig. 2b) yielded only 19 enriched motifs in 10 tissue groups (Fig. S13d,e), emphasizing the importance of studying regulatory motif enrichments at the level of enhancer modules.
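As an illustration of what a module-level motif enrichment can look like, the sketch below applies an upper-tail hypergeometric test to one motif in one module. The choice of test, the function name `hypergeom_enrichment_p` and all counts are assumptions made for this example; the statistic and thresholds actually used in the study are not given in this excerpt.

```python
from math import comb


def hypergeom_enrichment_p(total, total_with_motif, module_size, module_with_motif):
    """Upper-tail hypergeometric P value: the chance of seeing at least
    module_with_motif motif-bearing enhancers in a module of module_size
    enhancers drawn without replacement from all enhancers."""
    upper = min(total_with_motif, module_size)
    favourable = sum(
        comb(total_with_motif, k) * comb(total - total_with_motif, module_size - k)
        for k in range(module_with_motif, upper + 1)
    )
    return favourable / comb(total, module_size)


# Made-up counts (not from the paper): 1,000 enhancers, 100 carry the motif,
# and a 50-enhancer module contains 15 of them (5 would be expected by chance).
p_value = hypergeom_enrichment_p(1000, 100, 50, 15)
print(f"expected hits: {50 * 100 / 1000:.1f}, observed: 15, P = {p_value:.2e}")
```

In a real analysis the test would be repeated for every motif and module, followed by a multiple-testing correction.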
We next sought to distinguish likely activator and repressor motifs, by identifying regulators whose expression pattern across cell/tissue types shows a strong (positive or negative) correlation with the activity of enhancers in the enriched module9. We focused on the 40 most strongly expression-correlated regulators (Extended Data 9a), and used the module-level motif enrichments to link each regulator to the cell/tissue types that define each module (Fig. 8). We found that many of the inferred links correspond to known regulatory relationships, including: OCT4 (also known as POU5F1) in pluripotent cells, HNF1B and HNF4A1 in liver and other digestive tissues, RFX4 in neurosphere and neuronal cells, and MEF2D in muscle. The most enriched regulators showed primarily positive correlations, suggesting they function as transcriptional activators, while a subset of factors showed a negative correlation, with the factor expressed in the lineages where its motif showed enhancer depletion, suggesting a repressive role. For example, REST (also known as NRSF), a known repressor of neuronal lineages, was least expressed in neuronal tissues, where its motif was most enriched in enhancers, and a similar signature was found for ZBTB1B, a known repressor of myogenesis and brain development.
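The activator/repressor distinction described above can be sketched as a sign test on a correlation coefficient. The example below is a minimal illustration that assumes Pearson correlation and an arbitrary cutoff of 0.5; the function names, the cutoff and the toy vectors are hypothetical and are not taken from the study.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (var_x * var_y) ** 0.5


def classify_regulator(expression, module_activity, cutoff=0.5):
    """Label a regulator from the sign and strength of the correlation between
    its expression and the activity of its motif-enriched module across tissues.
    The cutoff is an assumption for this sketch."""
    r = pearson(expression, module_activity)
    if r >= cutoff:
        return "candidate activator"
    if r <= -cutoff:
        return "candidate repressor"
    return "unclassified"


# Toy profiles across four tissues (invented values).
activator_like = classify_regulator([1, 2, 3, 4], [0.1, 0.2, 0.35, 0.4])
# REST-like pattern: lowest expression where the module is most active.
repressor_like = classify_regulator([9, 8, 7, 1], [0.1, 0.2, 0.1, 2.0])
print(activator_like, "|", repressor_like)
```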
Regulatory motifs predicted to be drivers of enhancer activity patterns showed significant enrichment in tissue-specific high-resolution (6bp-40bp) DNase digital genomic footprints (DGF)68 in matching cell types (Extended Data 9b, Table S5b), providing DNA accessibility evidence that the motifs are indeed bound in these cell types. In addition, they showed positional bias relative both to the center of DGF locations and to their boundaries (Extended Data 10), a property not found for shuffled motifs69. These positional biases were highly tissue- and cell type-specific for most activating factors (Extended Data 9c), including POU5F in iPSCs, MEF2D in heart, HNF1B in GI tissues, BHLH in brain, SPI1 in immune cells, and MEF2 in heart and muscle, in each case matching the tissues that showed the highest enrichment. In contrast, for repressive factors and CTCF, positional biases were found in large numbers of tissues, even when the motifs were not enriched in active enhancers. For example, REST (NRSF) was positionally biased in DGF sites in nearly all tissues except brain (Extended Data 9c), even though it was only enriched in active enhancers in brain (Extended Data 9a), consistent with widespread repressive binding in non-brain tissues.
Disease variants map to transcription factor binding sites but rarely disrupt consensus motifs
Genetic and epigenetic fine mapping of causal autoimmune disease variants.
Farh, K. K.-H. et al. Nature 10.1038/nature13835
The enrichment of candidate causal variants within enhancers suggests that they affect disease risk by altering gene regulation, but does not distinguish the underlying mechanisms. Enhancer activity is dependent on a complex interplay between transcription factors, chromatin, non-coding RNAs and tertiary interactions of DNA loci27. A straightforward hypothesis is that disease SNPs alter transcription factor binding. Indeed, PICS SNPs tend to coincide with nucleosome-depleted regions, characterized by DNase hypersensitivity and localized (~150 bp) dips in H3K27ac signal26, which are indicative of transcription factor occupancy (Fig. 5a).
We therefore overlapped PICS SNPs with 31 transcription factor binding maps generated by ENCODE26 (Fig. 5b). Candidate causal SNPs are strongly enriched within binding sites for immune-related transcription factors, including NFkB, PU1 (also known as SPI1), IRF4, and BATF. Variants associated with different diseases correlate to different combinations of transcription factors that control immune cell identity and response to stimulation. For example, multiple sclerosis SNPs preferentially coincide with NFkB, EBF1 and MEF2A-bound regions, whereas rheumatoid arthritis and coeliac disease SNPs preferentially coincide with IRF4 regions.
Next, we examined whether causal variants disrupt or create cognate sequence motifs recognized by these transcription factors. We focused on 823 of the highest-likelihood non-coding PICS SNPs, an estimated 30% of which represent true causal variants. We identified PICS SNPs that alter motifs for NFkB (n = 2), AP-1 (n = 8), or ETS/ELF1 (n = 5). Overall, we identified 7 known transcription factor motifs and 6 conserved sequence motifs28,29 with a significant tendency to overlap causal variants likely to alter binding affinity. Of the highest-likelihood SNPs, 7% affected one of these over-represented motifs, with a roughly equal distribution between motif creation and disruption (Extended Data Fig. 9).
A notable motif-disrupting PICS SNP is the Crohn’s disease-associated variant rs17293632 (C>T, minor allele increases disease risk; PICS probability ~54%), which resides in an intron of SMAD3 (Fig. 5c). SMAD3 encodes a transcription factor downstream of transforming growth factor β (TGF-β) with pleiotropic roles in immune homeostasis30. The SNP disrupts a conserved AP-1 consensus site. ChIP-seq data for AP-1 factors (Jun, Fos) in a heterozygous cell line reveal robust binding to the reference sequence, but not to the variant sequence created by the SNP. As described above, a prominent AP-1 signature is associated with enhancers activated upon immune stimulation (Fig. 2a). This suggests that rs17293632 may increase Crohn’s disease risk by directly disrupting AP-1 regulation of the TGF-β-SMAD3 pathway.
Despite this and other compelling examples, only ~7% of the highest-likelihood non-coding PICS SNPs alter an over-represented TF motif. Scanning a large database of transcription factor motifs, we found that ~13% of high-likelihood causal SNPs create or disrupt some known consensus sequence derived by in vitro selection28, whereas ~27% create or disrupt a putative consensus sequence derived from phylogenetic analysis29. However, these proportions are similar to the rate for background SNPs (Fig. 5d). Even extrapolating for uncertainty in causal SNP assignments, our data suggest that at most 10–20% of non-coding GWAS hits act by altering a recognizable transcription factor motif.
Notwithstanding their infrequent coincidence with the precise transcription factor motifs, non-coding PICS SNPs have a strong tendency to reside in close proximity to such sequences. Candidate causal variants are most significantly enriched in the vicinity of NFkB, RUNX1, AP-1, ELF1, and PU1 motifs (Extended Data Fig. 9), with 26% residing within 100 bp of such a motif. These findings parallel recent studies of genetic variation in mice, where DNA variants affecting NFkB binding are dispersed in the vicinity of the actual binding sites31. Our results suggest that many causal non-coding SNPs modulate transcription factor-dependent enhancer activity (and confer disease risk) by altering adjacent DNA bases whose mechanistic roles are not readily explained by existing gene regulatory models.
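A proximity statistic such as "26% residing within 100 bp of such a motif" can be computed with a sorted list and binary search. The sketch below is a simplified illustration: it assumes a single chromosome, treats each motif as a single coordinate rather than an interval, and uses invented positions; `fraction_near_motif` is a hypothetical helper, not a function from the study.

```python
from bisect import bisect_left


def fraction_near_motif(snp_positions, motif_positions, max_distance=100):
    """Fraction of SNPs lying within max_distance bp of the nearest motif.
    Simplifications for this sketch: one chromosome, motifs as points."""
    if not snp_positions:
        return 0.0
    motifs = sorted(motif_positions)
    near = 0
    for pos in snp_positions:
        i = bisect_left(motifs, pos)
        # Only the motifs just before and just after pos can be the nearest.
        neighbours = motifs[max(0, i - 1): i + 1]
        if any(abs(pos - m) <= max_distance for m in neighbours):
            near += 1
    return near / len(snp_positions)


# Invented coordinates: three of the four SNPs fall within 100 bp of a motif.
toy_fraction = fraction_near_motif([150, 1000, 5000, 5090], [100, 5100])
print(f"{toy_fraction:.0%} of SNPs are within 100 bp of a motif")
```

Comparing this fraction with the same statistic for matched background SNPs is what turns it into an enrichment.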
Elucidating possible mechanisms by which allelically biased enhancer activities arise
Integrative analysis of haplotype-resolved epigenomes across human tissues.
Leung, D. et al. Nature 10.1038/nature14217
To further elucidate the mechanism by which allelically biased enhancer activities arise, we examined SNPs that potentially disrupt or weaken TF binding motifs. We calculated changes in motif score between alleles (motif disruption score) at allelic enhancers and discovered 133 TF motifs showing significant concordance between allelic reduction of enhancer activities and TF motif disruption (Fig. 5a,b; FDR = 10%; Supplementary Table 9; see Supplementary Information). Moreover, genes with allelically biased expression were concordant with enhancer motif disruptions within close proximity (<20 kb) or displaying strong Hi-C interactions at longer distances (>20 kb) (Fig. 5c; see Supplementary Information). Our results therefore suggest that genetic variations are likely responsible for allelic enhancer activities and consequently allelically biased gene expression.
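A motif disruption score of the kind described above can be sketched as the difference in best motif match between the two alleles. The example below is a generic log-odds illustration, not the scoring defined in the study's Supplementary Information: the 4-bp matrix `TOY_PPM`, the uniform background, the forward-strand-only scan and the sequences are all assumptions.

```python
from math import log2

# Hypothetical position probability matrix for a 4-bp motif with consensus TGAC.
TOY_PPM = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
]
BACKGROUND = 0.25  # assumed uniform base composition


def window_score(ppm, window):
    """Log-odds score of one window against the motif."""
    return sum(log2(column[base] / BACKGROUND) for column, base in zip(ppm, window))


def best_score(ppm, seq, snp_index):
    """Best score over all windows that cover snp_index (forward strand only)."""
    width = len(ppm)
    first = max(0, snp_index - width + 1)
    last = min(snp_index, len(seq) - width)
    return max(window_score(ppm, seq[s:s + width]) for s in range(first, last + 1))


def motif_disruption_score(ppm, ref_seq, snp_index, alt_base):
    """Reference minus alternate best score; positive means the alternate
    allele weakens the best motif match."""
    alt_seq = ref_seq[:snp_index] + alt_base + ref_seq[snp_index + 1:]
    return best_score(ppm, ref_seq, snp_index) - best_score(ppm, alt_seq, snp_index)


# Invented example: an A>G change at index 4 breaks the TGAC match in CCTGACGG.
disruption = motif_disruption_score(TOY_PPM, "CCTGACGG", 4, "G")
print(f"motif disruption score: {disruption:.2f} bits")
```

Concordance would then be assessed by asking whether the allele with the lower motif score is also the allele with lower enhancer activity more often than expected by chance.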
Epigenome-derived surface ectoderm regulatory network
Regulatory network decoded from epigenomes of surface ectoderm-derived cell types.
Lowdon, R. F. et al. Nature Communications 10.1038/ncomms6442
Given their regulatory element signatures, overlap with DNase I-hypersensitive sites and enrichment for relevant transcription factor binding site (TFBS) motifs, we hypothesized that hypomethylated SE-DMRs may be regulatory elements that coordinate expression of genes essential for function of SE-derived cells. To test this, we sought to connect these putative regulatory elements to genes in a SE gene network. We associated DMRs with nearby putative target genes and queried databases of TF-target genes and gene–gene interactions to construct regulatory relationships among these genes. The result is a highly connected network with a statistically significant number of connections (1,458 edges, 278 nodes; P-value = 1.25e−4), whose distribution follows a power law (R² = 0.89).
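One simple way to check for power-law behaviour in a network is a least-squares line through the log-log degree distribution, reporting R². The sketch below assumes that approach and that the distribution in question is the node degree distribution; the excerpt does not state how the fit was performed, and the edge list and counts here are invented.

```python
from collections import Counter
from math import log10


def degree_distribution(edges):
    """Map each degree to the number of nodes having that degree."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


def loglog_fit(distribution):
    """Least-squares fit of log10(node count) against log10(degree).
    Returns (slope, r_squared); needs at least two distinct degrees
    and node counts that are not all equal."""
    xs = [log10(k) for k in distribution]
    ys = [log10(n) for n in distribution.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


# Toy star network: one hub of degree 3 and three leaves of degree 1.
toy_distribution = degree_distribution([("hub", "g1"), ("hub", "g2"), ("hub", "g3")])

# Synthetic, exactly power-law counts (degree -> nodes) give slope -1 and R^2 of 1.
slope, r_squared = loglog_fit({1: 100, 10: 10, 100: 1})
print(dict(toy_distribution), round(slope, 3), round(r_squared, 3))
```

A straight-line fit on log-log axes is a coarse diagnostic; maximum-likelihood fitting is generally more reliable for real degree data.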
Strikingly, the transcription factors near the top of the inferred SE network were those whose motifs were enriched in the hypomethylated SE-DMRs. This observation, along with the network connectivity data, suggested that TFAP2a, TFAP2c and KLF4 may regulate many of the downstream genes in this network. To identify biological processes associated with each set of hypomethylated DMRs containing either TFAP2 or KLF4 TFBSs, we performed GREAT analysis14. The network was characterized by two partially overlapping major branches (summarized data in Fig. 5a). The first branch included the transcription factors TFAP2a and TFAP2c, and connected to genes associated with SE-relevant GO terms, for example, ‘hemidesmosome assembly’, which is a structural complex critical for epithelial cells21, and ‘Notch signalling’, which functions in mammary cell fate commitment22 and keratinocyte homeostasis23 (Fig. 5b). The second branch was characterized by KLF4 and was associated with ‘mammary gland development’ and ‘Wnt signalling’, which influences both breast and keratinocyte cell fate decisions24, 25 (Fig. 5c). Thus, we observed a highly structured set of connections between regulatory elements and putative target genes that underlie and integrate signalling pathways vital for both keratinocyte and mammary gland epithelial cell function.
Positional properties of regulatory sequence motifs associated with chromatin marks
Predicting the human epigenome from DNA motifs.
Whitaker, J. W., Chen, Z. & Wang, W. Nature Methods 10.1038/nmeth.3065
The identified cis elements may play various roles in shaping the epigenome, such as setting the boundary of a histone modification domain or opening chromatin to allow remodeling enzymes to bind DNA. These roles may restrain the relative location (edge or center) of a motif within the modified regions (Fig. 4a). Although the majority of the motifs fell into the ‘neutral’ category (no location preference), numerous motifs showed biased location distributions (Fig. 4b and Supplementary Fig. 7). The heterochromatin mark H3K9me3 was associated with edge and neutral motifs but not with any central motifs, suggesting that the edge motifs may help set the boundary of the large H3K9me3 domain. Concentrated marks of H3K4me1, H3K4me3 and H3K27ac were associated with central motifs, which may guide the recruitment of the chromatin-modifying enzymes to initiate, or other factors to maintain, the modifications29,30. Interestingly, whereas the enhancer marks H3K4me1 and H3K27ac had no edge motifs, the promoter marker H3K4me3 was associated with several edge motifs, which may help define the promoter boundary. In contrast to H3K9me3, the widespread histone mark H3K27me3 and the DMV were largely associated with central motifs, which suggests different regulatory mechanisms. The transcriptional activity mark H3K36me3 almost exclusively associated with neutral motifs.
The majority (81%) of the H3K9me3 edge motifs were found in H1, and these matched the known motifs of KLF12, Rel homology domain and YY1 (Fig. 4c,d). Multiple lines of evidence support these associations. KLF12 mediates transcriptional repression through interaction with phosphoprotein CtBP31, which forms a complex with histone methyltransferase and DNA-binding proteins to target H3K9 for methylation32. NFKB1, a member of the Rel homology domain TFs, is known to function with deacetylase SIRT6 to repress gene expression via H3K9 deacetylation33, which clears the site for methylation. YY1 is a transcriptional regulator that directs localization of histone acetyltransferases, deacetylases and members of the PRC2 complex34, which directs the placement of H3K9me3 and H3K27me3 (ref. 35). Furthermore, YY1 knockdown during mouse spermatogenesis results in a global decrease of H3K9me3 (ref. 36). Further analysis of the H1 H3K9me3 edge motifs suggested that they may represent a regulatory system present in human embryonic stem cells for establishing regions of heterochromatin and repressing gene expression (Supplementary Note and Supplementary Data 1). In light of findings that show H3K9me3 as a primary epigenomic determinant during induced pluripotent stem cell reprogramming37, we speculate that these interactions may be important in establishing and maintaining the pluripotent state.
DNA methylation defines a breast cell regulatory network
Epigenetic and transcriptional determinants of the human breast.
Gascard, P. et al. Nature Communications 10.1038/ncomms7351
To discover putative regulatory elements including actively bound enhancers20 we identified 26,601 and 53,751 DNA unmethylated regions (UMRs) in myoepithelial and luminal epithelial cells, respectively, from whole genome bisulfite datasets21 (Supplementary Figure 15 and Supplementary Table 8). We validated these UMRs against orthogonal MeDIP-seq and MRE-seq datasets (Supplementary Figure 16). Nearest gene-based analysis22 of the UMRs revealed highly significant enrichment for expected pathways, including smooth muscle contraction (binomial FDR Q-value = 10^-71) for myoepithelial cells and mammary gland epithelium development (binomial FDR Q-value = 10^-72) for luminal epithelial cells (Supplementary Figure 17). The UMRs overlapped almost exclusively with enhancer chromatin states (Supplementary Figure 18), displayed significant overlap with ENCODE transcription factor (TF) ChIP-seq regions (luminal epithelial: 58%, myoepithelial: 60%, with ~20% expected by chance), and intersection with GWAS alleles revealed a direct overlap with 10 SNPs recently associated with breast cancer risk loci (Supplementary Figure 19, Supplementary Table 9), suggesting that these UMRs are highly enriched in regulatory elements that define breast cell types.
Strikingly, we observed an asymmetry in the number of regulatory UMRs between breast cell types, with 51 transcription factors having at least 2 times more sites in luminal than myoepithelial UMRs, while no TFs were more abundant in myoepithelial UMRs (Supplementary Table 10). The top three transcription factors (FOXA1, GATA-3 and ZNF217), ranked by abundance of UMR-defined regulatory elements, are critical regulators of luminal cell biology and were at least 8 times more abundant in luminal vs. myoepithelial cells (Figure 3a). To explore how these regulatory elements could influence the breast cell transcriptional program, we associated them with genes and found a highly significant overlap (81%) between cell type-specific UMRs and proximal upregulated DE genes (Figure 3b). Interestingly, this association was highly directional, with the majority (90.3%) of proximal UMRs associated with increased transcription in their respective cell types (Figures 3b and d). This directionality was reduced for distal UMRs. Only 6% of commonly expressed genes were found to be associated with cell type-specific proximal UMRs, highlighting their importance in defining a cell type-specific transcriptional program (Figure 3c).
4. Regulatory models: networks, motifs, modules, sequence drivers and predictive models. Nature (2015). https://doi.org/10.1038/nature14312