The epigenome is established and maintained by the site-specific recruitment of chromatin-modifying enzymes and their cofactors. Identifying the cis elements that regulate epigenomic modification is critical for understanding the regulatory mechanisms that control gene expression patterns. We present Epigram, an analysis pipeline that predicts histone modification and DNA methylation patterns from DNA motifs. The identified cis elements represent interactions with the site-specific DNA-binding factors that establish and maintain epigenomic modifications. We cataloged the cis elements in embryonic stem cells and four derived lineages and found numerous motifs that have location preference, such as at the center of H3K27ac or at the edges of H3K4me3 and H3K9me3, which provides mechanistic insight about the shaping of the epigenome. The Epigram pipeline and predictive motifs are at http://wanglab.ucsd.edu/star/epigram/.
This work was partially supported by the US National Institutes of Health (U01 ES017166 to W.W., principal investigator, B. Ren). The authors wish to thank B. Ren, D.R. Westhead and M.H. Sherman for discussion of this work. We are grateful to M. Snyder for providing the SNP data of the 19 individuals.
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Cross-validation procedures used to assess the prediction of epigenome from DNA motifs.
A schematic describes the cross-validation procedures used to validate our predictions. The top left shows the standard five-fold cross-validation procedure. Beneath it, we show how test sequences are shuffled during cross-validation. This ensures that the prediction performance comes from the motifs and not from simple sequence biases such as GC content. The top right shows how sequences that are removed during SSB are still tested, by training the model on the entire set of sequences that were selected during SSB.
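The two ideas in this caption, five-fold splitting and shuffling of test sequences, can be sketched as follows. This is a minimal illustration and not the Epigram implementation: the function names `k_fold_indices` and `shuffle_sequence` are hypothetical, and the shuffle shown is a simple per-base permutation (which preserves GC content), assumed here because the caption does not specify the shuffling scheme.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Return k (train, test) index splits over n items (hypothetical helper)."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits

def shuffle_sequence(seq, rng):
    """Permute the bases of one sequence; base composition is unchanged.

    Assumption: a mononucleotide shuffle. The motifs are destroyed while
    length and GC content stay the same, so any remaining prediction
    performance would reflect simple sequence bias.
    """
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)
```

A model that still scores well on the shuffled test sequences would be relying on composition rather than motifs, which is the control the caption describes.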
The left ROC curve shows the prediction performance when the SSB step is carried out. The right ROC curve shows the performance on the same dataset but with the omission of the SSB step.
Supplementary Figure 3 The levels of histone modification ChIP-seq reads inside peaks and background regions in H1.
The bar plot shows the AUC from the ‘single mark analysis’ in H1. The violin plots show the normalized ChIP-seq read counts (ChIP RPKM − input RPKM) for each of the modifications, both inside and outside the modification peaks. The read counts were adjusted by region size and total reads, and the corresponding input score was then subtracted. Violin plots with dotted borderlines show the enrichment within modification peaks, whereas violin plots with solid borderlines show the enrichment outside modification peaks.
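The normalization described in this caption can be written out as a short sketch. It assumes the standard RPKM definition (reads per kilobase of region per million mapped reads); the function and parameter names are illustrative and are not taken from the Epigram code.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)

def normalized_signal(chip_reads, input_reads, region_length_bp,
                      chip_total_reads, input_total_reads):
    """ChIP RPKM minus input RPKM for a single region.

    Each library is scaled by its own total read count before the
    input (control) signal is subtracted.
    """
    chip = rpkm(chip_reads, region_length_bp, chip_total_reads)
    control = rpkm(input_reads, region_length_bp, input_total_reads)
    return chip - control
```

For example, a 1,000 bp region with 100 ChIP reads out of 1 million and 20 input reads out of 2 million gives 100 − 10 = 90.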
(a) As Fig. 2a–b but showing the results of the ‘mark-specific analysis’. In the left-hand schematic, different colored stars represent different histone modifications. (b) As Fig. 2a–c but showing the results of the ‘typical background analysis’. (c) As Fig. 2a–b but showing the results of the ‘cell-type-specific analysis’. In the left-hand schematic, pink represents the epigenome of a different cell type. The middle plot shows the H1 and MSC comparison. (d) The average AUC for each histone modification in each of the four analyses.
The average cell-type-specific prediction performance for each mark is shown. On the left are the results from comparing H1 to each of the four derived cell types. On the right is the performance when H1 is compared to IMR90 and seven ENCODE cell types (A549, CD14+, GM12878, HeLa, HepG2, HUVEC and K562).
Supplementary Figure 6 Combined clustering of cell type– and modification–specific interplay between DNA motifs and the epigenome.
The heat map shows the clustering of 589 motif groups. The dendrogram shows the motif groups clustered by their interplay with epigenomic modification and cell type. Motifs that are enriched in modification peaks and motifs that are depleted from them are both shown. On the side of the heat map opposite the dendrogram, a histogram shows the number of motifs in each cluster. Furthest from the dendrogram are two plots that show the locations of (i) motif groups that contain both positive and negative interplay for the same modification and (ii) motif groups that correspond to certain groups of proteins that share DNA-binding motifs.
Schematics represent the types of interplay that were identified for each of the histone modifications. Pie charts represent the proportion of a modification's predictive motifs that are unique to that modification or that overlap with other modifications.
A motif's profile is constructed using the complete (pre-SSB) set of sequences that were identified for that particular cell type, modification and analysis type.
The figure illustrates the SSB process. (a) The sequences from each set are separately binned by region length and GC content. Only a subset of the bins is shown in the figure: region lengths of 500–700 bp and GC content of 45–47%. (b) Bins with uneven numbers of sequences are highlighted in red. (c) Sequences are randomly removed from bins that contain more sequences than their corresponding bin in the other set.
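The balancing steps (a) to (c) can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the bin widths (`length_bin`, `gc_bin_pct`) are hypothetical defaults because the caption gives only the displayed range, and `balance_sets` is an invented name.

```python
import random
from collections import defaultdict

def bin_key(seq, length_bin=100, gc_bin_pct=1):
    """Bin a sequence by (region length, GC percentage). Widths are assumed."""
    s = seq.upper()
    gc_pct = 100 * (s.count("G") + s.count("C")) // max(len(s), 1)
    return (len(s) // length_bin, gc_pct // gc_bin_pct)

def balance_sets(foreground, background, length_bin=100, gc_bin_pct=1, seed=0):
    """Randomly drop sequences so both sets have equal counts in every bin."""
    rng = random.Random(seed)
    fg_bins, bg_bins = defaultdict(list), defaultdict(list)
    for seq in foreground:
        fg_bins[bin_key(seq, length_bin, gc_bin_pct)].append(seq)
    for seq in background:
        bg_bins[bin_key(seq, length_bin, gc_bin_pct)].append(seq)

    kept_fg, kept_bg = [], []
    for key in sorted(set(fg_bins) | set(bg_bins)):
        fg, bg = fg_bins.get(key, []), bg_bins.get(key, [])
        n = min(len(fg), len(bg))  # the smaller bin sets the quota
        kept_fg.extend(rng.sample(fg, n))
        kept_bg.extend(rng.sample(bg, n))
    return kept_fg, kept_bg
```

After balancing, the two sets share the same joint distribution of length and GC content at the chosen bin resolution, so a classifier cannot separate them on those features alone.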
(a) A general overview of the Epigram workflow is shown. (b) On the left, 9-mers that differ from the seed (shown as the top 9-mer) by one or two positions are aligned. Differing positions are highlighted in red. Immediately to the right of the 9-mers are their weight scores (W). To the right of the arrow, a PWM is produced from the alignment. When making the PWM, the 9-mers are weighted by their W scores. (c) The alignment shown in (b) is expanded by one position.
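Panel (b) describes building a PWM from 9-mers within two mismatches of a seed, each weighted by its W score. A minimal sketch of that idea follows. It assumes the PWM is a per-position weighted base frequency table; how Epigram computes W and whether it adds pseudocounts is not stated here, so the weights are simply passed in and no pseudocount is applied. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))

def weighted_pwm(seed, weighted_kmers, max_mismatches=2, alphabet="ACGT"):
    """Build a position weight matrix from k-mers close to the seed.

    weighted_kmers: iterable of (kmer, W) pairs; the seed itself should be
    included if it is meant to contribute. Returns one dict of base
    frequencies per position.
    """
    aligned = [(k, w) for k, w in weighted_kmers
               if len(k) == len(seed) and hamming(k, seed) <= max_mismatches]
    pwm = [{base: 0.0 for base in alphabet} for _ in seed]
    for kmer, w in aligned:
        for pos, base in enumerate(kmer):
            pwm[pos][base] += w  # each k-mer votes with its weight W
    for column in pwm:
        total = sum(column.values())
        if total > 0:
            for base in column:
                column[base] /= total
    return pwm
```

With a seed `ACGTACGTA` of weight 2 and two single-mismatch neighbours of weight 1 each, a position where one neighbour differs ends up with frequency 0.75 for the seed base and 0.25 for the variant.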
Supplementary Figures 1–10 and Supplementary Note (PDF 4459 kb)
Comparison of YY1 ChIP-seq peaks to the ChIP-seq peaks of other factors (XLSX 14 kb)
A list of datasets that were used in this study (XLSX 16 kb)
The motif with the greatest information content from each of the 589 groups (meme format) (TXT 170 kb)
Annotation of the de novo motif groups with known motifs (XLSX 75 kb)
About this article
Cite this article
Whitaker, J., Chen, Z. & Wang, W. Predicting the human epigenome from DNA motifs. Nat Methods 12, 265–272 (2015). https://doi.org/10.1038/nmeth.3065