Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin

Journal name: Nature Genetics
Volume: 48
Pages: 488–496
DOI: 10.1038/ng.3539

Abstract

Discriminating the gene target of a distal regulatory element from other nearby transcribed genes is a challenging problem with the potential to illuminate the causal underpinnings of complex diseases. We present TargetFinder, a computational method that reconstructs regulatory landscapes from diverse features along the genome. The resulting models accurately predict individual enhancer–promoter interactions across multiple cell lines with a false discovery rate up to 15 times smaller than that obtained using the closest gene. By evaluating the genomic features driving this accuracy, we uncover interactions between structural proteins, transcription factors, epigenetic modifications, and transcription that together distinguish interacting from non-interacting enhancer–promoter pairs. Most of this signature is not proximal to the enhancers and promoters but instead decorates the looping DNA. We conclude that complex but consistent combinations of marks on the one-dimensional genome encode the three-dimensional structure of fine-scale regulatory interactions.

Figures

  1. Predictive power of promoter-proximal genomic features.

    (a–h) Ratio of various ChIP-seq signals, including Pol II (POLR2A) (a), enhancer- and promoter-associated histone modifications (b–d), known looping factors (e,f), and selected transcription factors (g,h), anchored at the TSS of interacting versus non-interacting promoters in K562 cells, along with the log2-transformed fold change (L2FC) and P value corrected for multiple testing (q value). All promoters have active chromatin marks and show transcription. The top row shows expected patterns for promoter-associated marks at the TSS, such as a high ratio of H3K4me3 to H3K4me1. Some of these marks are enriched in interacting promoters, whereas others such as lysine 4 methylation patterns are not. The bottom row shows TSS-proximal patterns for several proteins associated with chromatin looping. CTCF and RAD21 are not enriched at interacting promoters, whereas the transcription factors CUX1 and HCFC1 are enriched and depleted, respectively.
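The two summary statistics named in this caption can be sketched as follows. The caption does not state how the fold change is aggregated or which multiple-testing correction produced the q values, so the ratio of mean signals and the Benjamini-Hochberg procedure used here are assumptions, and both function names are hypothetical.

```python
import math


def log2_fold_change(interacting, non_interacting):
    """log2 of the ratio of mean signals (assumes both means are positive)."""
    mean_pos = sum(interacting) / len(interacting)
    mean_neg = sum(non_interacting) / len(non_interacting)
    return math.log2(mean_pos / mean_neg)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q values in the original order.

    Assumption: the caption only says "corrected for multiple testing";
    BH is one common choice, not necessarily the one used for the figure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so q values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values
```

A positive L2FC indicates enrichment at interacting promoters and a negative value indicates depletion.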

  2. Ratio of the CTCF and RAD21 ChIP-seq signals occurring within interacting enhancers and non-interacting enhancers, anchored at peaks for CTCF, RAD21, and the transcription factors CUX1 and HCFC1 for the K562 cell line.

    CUX1 and HCFC1 are highly enriched at loop-associated enhancers when co-occurring with CTCF and RAD21. The context dependence of protein binding is demonstrated by RAD21, which is not enriched at interacting promoters (Fig. 1). Note that CTCF and RAD21 are already enriched at their respective peaks within interacting enhancers but are further enriched when anchored at CUX1 or HCFC1 peaks.

  3. Predicting a chromatin loop that skips over multiple active promoters in K562 cells.

    ENCODE-called peaks are shown for the top nine predictive data sets of an interacting promoter (P1) and enhancer (E1) in K562 cells, separated by other active promoters and enhancers. Active enhancers are segments marked 'E' by combined ChromHMM and Segway annotations, and active promoters are segments marked 'TSS' and expressed in K562 cells with RPKM >0.3. Ensembl genes are also displayed, with introns denoted as thin lines and exons denoted as rectangles. The left and right fragments of the Hi-C assay are also shown to visually confirm that E1 interacts with P1. This figure shows a straightforward example of an enhancer (E1) looping over multiple active promoters (P2–P4) to reach its true target (P1). Existing interactions in the window between E1 and P1 do not block looping, and P1 is the target of other distal regulatory elements within the window. P2–P4 are each missing a looping-associated RAD21 mark that has elevated predictive importance in this cell line. In addition, P2 and P3 are missing the highly predictive CUX1 transcription factor (Fig. 2). Interpreting loops often depends on a more complex interaction of marks (Supplementary Fig. 4).

  4. The TargetFinder pipeline.

    Features are generated from hundreds of diverse data sets for pairs of enhancers and promoters of expressed genes found to have significant Hi-C interactions (positives), as well as random pairs of enhancers and promoters without significant interactions (negatives). These labeled samples are used to train an ensemble classifier that predicts whether enhancer–promoter pairs from new or held-out samples interact, as well as estimates the importance of each feature for accurate prediction. Classifier predictions are probabilities, and a decision threshold (commonly 0.5 but with the possibility of adjustment) converts these to positive or negative prediction labels. This figure excludes selection of minimal predictor sets and evaluation of the accuracy of output predictions using held-out Hi-C interaction data.
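The final step of the pipeline, converting classifier probabilities into labels, can be illustrated with a minimal sketch. The function name, the example probabilities, and the choice to treat a probability exactly at the threshold as positive are assumptions made for illustration only.

```python
def label_predictions(probabilities, threshold=0.5):
    """Convert interaction probabilities to labels.

    Returns 1 (predicted interacting) when the probability is at or above
    the threshold and 0 (predicted non-interacting) otherwise. Treating a
    tie at the threshold as positive is an assumption.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical probabilities for three enhancer-promoter pairs.
example_probabilities = [0.91, 0.12, 0.48]
default_labels = label_predictions(example_probabilities)
lenient_labels = label_predictions(example_probabilities, threshold=0.4)
```

Lowering the threshold labels more pairs as interacting, which generally raises recall at the cost of precision; raising it has the opposite effect.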

  5. TargetFinder performance by cell line, model type, and number of features.

    (a) Performance of boosted trees using features for enhancers and promoters only (E/P), extended enhancers and promoters (EE/P), and enhancers and promoters plus the windows between them (E/P/W). (b) Cross-validated performance of TargetFinder predictions for a baseline (random guessing null) model, a linear support vector machine (SVM), a single decision tree, and a boosted ensemble of decision trees. Performance is given as the balance of precision and recall (F1), averaging 83% across cell lines and corresponding to a mean FDR of 12%. Ensemble methods use complex interactions between features to greatly increase the accuracy of predicted interactions. Performance is also high for a combined cell line set comprising the K562, GM12878, HeLa-S3, and IMR90 data sets, with features restricted to data sets shared by all cell lines. (c) Recursive feature elimination (Online Methods) evaluates predictor subsets of size 1 up to the maximum for each cell line, increasing by powers of 2 for computational efficiency. Near-optimal performance was achieved using ~16 predictors for lineage-specific models as well as the combined model, whereas lower but acceptable performance required 8 predictors. The maximum size for the feature subset shown is 32 to enhance the visibility of smaller feature subsets. NHEK lacks a measurement at a subset size of 32 because its data set included fewer than 32 total features. There were ten runs per cell line; error bars, s.e.m.
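The performance measures and the subset-size schedule described in this caption can be sketched as below. The confusion-matrix counts in any usage are hypothetical, and appending the full feature set as the last subset size is an assumption about how "up to the maximum" is handled; the caption does not give the implementation.

```python
def precision_recall_f1_fdr(tp, fp, fn):
    """Compute precision, recall, F1, and FDR from confusion-matrix counts.

    F1 is the harmonic mean of precision and recall, and the false
    discovery rate is the fraction of positive predictions that are wrong,
    so FDR = 1 - precision.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fdr = fp / (tp + fp)
    return precision, recall, f1, fdr


def subset_sizes(n_features):
    """Subset sizes 1, 2, 4, ... below n_features, then n_features itself.

    Ending with the full feature set is an assumption.
    """
    sizes = []
    size = 1
    while size < n_features:
        sizes.append(size)
        size *= 2
    sizes.append(n_features)
    return sizes
```

Under this schedule, a cell line with fewer than 32 features has no subset of size 32, which is consistent with the missing NHEK measurement noted in the caption.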

  6. Predictive importance of genomic features across cell lines and regions.

    Importance (Online Methods) is discretized by quartiles; grid entries are filled in black when a data set was unavailable in a cell line. The highest average importance is assigned to features in the window region, followed by those in promoters. Promoter methylation and Pol II occupancy are more important in the combined '4 lines' classifier (K562, GM12878, HeLa-S3, and IMR90) than in individual cell lines. Data on highly predictive features such as CAGE were available in most but not all cell lines needed for inclusion in the combined model. Data for certain transcription factors were available in multiple cell lines but are not universally predictive, such as FOS in the window region. Data for other transcription factors were only available in a single cell line but are highly predictive, such as WHSC1 and ZMIZ1 in the window region of K562 cells and RUNX3 in the window region of GM12878 cells.
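Discretizing importance by quartiles, as in this figure, might look like the following sketch. The function name and the use of Python's default (exclusive) quantile method are assumptions; the caption does not specify how quartile boundaries were computed. Features with no data are kept separate, mirroring the black grid entries.

```python
import statistics


def discretize_by_quartile(importances):
    """Map each feature to a quartile bin from 1 (lowest) to 4 (highest).

    Features whose importance is None (data set unavailable in that cell
    line) map to None instead of a bin.
    """
    observed = [value for value in importances.values() if value is not None]
    cuts = statistics.quantiles(observed, n=4)  # three cut points
    bins = {}
    for name, value in importances.items():
        if value is None:
            bins[name] = None
        else:
            # Count how many cut points the value exceeds.
            bins[name] = 1 + sum(value > cut for cut in cuts)
    return bins
```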

  7. Influence of features by region.

    (a,b) Feature values (a) and predictive importance (b) for features in promoter, enhancer, and window regions. Despite having the lowest feature values, the window region had higher predictive importance than the enhancer and promoter regions. There were ten runs per cell line. The middle line in each plot represents the median; error bars represent 1.5 times the interquartile range.

  8. Identification of complex interactions between DNA-binding proteins and epigenetic marks.

    (a–c) Results for three cell lines are shown: K562 (a), GM12878 (b), and HeLa-S3 (c). Scatterplots show univariate feature significance (two-sample Kolmogorov–Smirnov test) versus multivariate feature importance (estimated via a boosted trees classifier) for the three cell lines. To highlight data sets that are predictive in combination with other features (multivariate) but are not predictive individually (univariate), only features with a multivariate rank less than 25 and a univariate rank greater than 25 are shown. For example, the lower right corner of the K562 plot in a shows that H2AZ, WHSC1, CUX1, and SUMO2 are among the top ten predictive features when the colocalization of other proteins is known. H2AZ has similar context-dependent importance in GM12878 (b) and HeLa-S3 (c) cells. Many features predictive in one or more cell lines were not assayed uniformly and thus could not be included in the combined model (for example, HCFC1, CUX1, and SUMO2).
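The rank filter described in this caption can be sketched as follows. The function names are hypothetical, and the scores are assumed to be oriented so that larger means more significant (univariate) or more important (multivariate), with rank 1 assigned to the largest score; the default cutoff of 25 comes from the caption.

```python
def rank_descending(scores):
    """Rank features 1..n, giving rank 1 to the largest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


def context_dependent_features(univariate_scores, multivariate_scores, cutoff=25):
    """Select features that rank well only in combination with others.

    Keeps features with a multivariate rank below the cutoff and a
    univariate rank above it, ordered by multivariate rank.
    """
    univariate_rank = rank_descending(univariate_scores)
    multivariate_rank = rank_descending(multivariate_scores)
    selected = [
        name
        for name in multivariate_rank
        if name in univariate_rank
        and multivariate_rank[name] < cutoff
        and univariate_rank[name] > cutoff
    ]
    return sorted(selected, key=multivariate_rank.get)
```

A feature that passes this filter carries little signal on its own but becomes informative once the classifier can condition on other features.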

Accession codes

Referenced accessions

Gene Expression Omnibus

Author information

Affiliations

  1. Gladstone Institutes, San Francisco, California, USA.

    • Sean Whalen &
    • Katherine S Pollard
  2. Division of Biostatistics, Institute for Human Genetics, University of California, San Francisco, San Francisco, California, USA.

    • Sean Whalen &
    • Katherine S Pollard
  3. Institute for Computational Health Sciences, University of California, San Francisco, San Francisco, California, USA.

    • Sean Whalen &
    • Katherine S Pollard
  4. Invitae Corporation, San Francisco, California, USA.

    • Rebecca M Truty

Contributions

S.W., R.M.T., and K.S.P. designed the experiments and wrote the manuscript. S.W. and R.M.T. implemented the experiments.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (1,286 KB)

    Supplementary Tables 1–3, Supplementary Figures 1–20 and Supplementary Note.
