Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation

Abstract

Transcription factors bind DNA sequence motif vocabularies in cis-regulatory elements (CREs) to modulate chromatin state and gene expression during cell state transitions. A quantitative understanding of how motif lexicons influence dynamic regulatory activity has been elusive due to the combinatorial nature of the cis-regulatory code. To address this, we undertook multiomic data profiling of chromatin and expression dynamics across epidermal differentiation to identify 40,103 dynamic CREs associated with 3,609 dynamically expressed genes, then applied an interpretable deep-learning framework to model the cis-regulatory logic of chromatin accessibility. This analysis framework identified cooperative DNA sequence rules in dynamic CREs regulating synchronous gene modules with diverse roles in skin differentiation. Massively parallel reporter assay analysis validated temporal dynamics and cooperative cis-regulatory logic. Variants linked to human polygenic skin disease were enriched in these time-dependent combinatorial motif rules. This integrative approach shows the combinatorial cis-regulatory lexicon of epidermal differentiation and represents a general framework for deciphering the organizational principles of the cis-regulatory code of dynamic gene regulation.

This is a preview of subscription content

Access options

Rent or Buy article

Get time limited or full article access on ReadCube.

from$8.99

All prices are NET prices.

Fig. 1: A high-resolution integrated multiomic data resource in primary keratinocyte differentiation.
Fig. 2: Epigenomic and transcriptomic landscapes in epidermal differentiation.
Fig. 3: Deep-learning models of chromatin accessibility show dynamic predictive motif instances across the differentiation timecourse.
Fig. 4: Combinatorial in silico perturbation analysis to infer heterotypic cis-regulatory logic.
Fig. 5: Validation of combinatorial motif pairs using MPRAs.
Fig. 6: A combinatorial motif lexicon in keratinocyte differentiation.

Data availability

ATAC-seq, ChIP–seq, PAS-seq, HiChIP and MPRA data can all be found on the Gene Expression Omnibus: GSE181416. There are no restrictions to access of the datasets. Training datasets for machine learning can be found at Zenodo73,74 and trained models can also be found at Zenodo75,76. hg19 annotations can be found at https://hgdownload.soe.ucsc.edu/downloads.html and GENCODE annotations can be found at https://www.gencodegenes.org/human/release_19.html. FANTOM5 (ref. 77) transcription factors can be found at https://fantom.gsc.riken.jp/5/sstar/Browse_Transcription_Factors_hg19. The HOCOMOCO38 database can be found at https://hocomoco11.autosome.ru/.

Code availability

Integrative analysis code and scripts78 can be found at https://github.com/vervacity/ggr-project and the deep-learning code79 can be found at https://github.com/kundajelab/tronn.

References

  1. 1.

    Gray, H. & Lewis, W. H. Anatomy of the Human Body (Bartleby, 1918).

  2. 2.

    Lopez-Pajares, V., Yan, K., Zarnegar, B. J., Jameson, K. L. & Khavari, P. A. Genetic pathways in disorders of epidermal differentiation. Trends Genet. 29, 31–40 (2013).

    CAS  PubMed  Google Scholar 

  3. 3.

    Truong, A. B., Kretz, M., Ridky, T. W., Kimmel, R. & Khavari, P. A. p63 regulates proliferation and differentiation of developmentally mature keratinocytes. Genes Dev. 20, 3185–3197 (2006).

    CAS  PubMed  PubMed Central  Google Scholar 

  4. 4.

    Levine, M. Transcriptional enhancers in animal development and evolution. Curr. Biol. 20, R754–R763 (2010).

    CAS  PubMed  PubMed Central  Google Scholar 

  5. 5.

    Spitz, F. & Furlong, E. E. M. Transcription factors: from enhancer binding to developmental control. Nat. Rev. Genet. 13, 613–626 (2012).

    CAS  PubMed  Google Scholar 

  6. 6.

    Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  7. 7.

    Reiter, F., Wienerroither, S. & Stark, A. Combinatorial function of transcription factors and cofactors. Curr. Opin. Genet. Dev. 43, 73–81 (2017).

    CAS  PubMed  Google Scholar 

  8. 8.

    Rubin, A. J. et al. Lineage-specific dynamic and pre-established enhancer-promoter contacts cooperate in terminal differentiation. Nat. Genet. 49, 1522–1528 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  9. 9.

    Arnosti, D. N., Barolo, S., Levine, M. & Small, S. The eve stripe 2 enhancer employs multiple modes of transcriptional synergy. Development 122, 205–214 (1996).

    CAS  PubMed  Google Scholar 

  10. 10.

    Banerji, J., Rusconi, S. & Schaffner, W. Expression of a beta-globin gene is enhanced by remote SV40 DNA sequences. Cell 27, 299–308 (1981).

    CAS  PubMed  Google Scholar 

  11. 11.

    Levo, M. & Segal, E. In pursuit of design principles of regulatory sequences. Nat. Rev. Genet. 15, 453–468 (2014).

    CAS  PubMed  Google Scholar 

  12. 12.

    Thanos, D. & Maniatis, T. Virus induction of human IFN beta gene expression requires the assembly of an enhanceosome. Cell 83, 1091–1100 (1995).

    CAS  PubMed  Google Scholar 

  13. 13.

    Michaletti, A. et al. Multi-omics profiling of calcium-induced human keratinocytes differentiation reveals modulation of unfolded protein response signaling pathways. Cell Cycle 18, 2124–2140 (2019).

    PubMed  PubMed Central  Google Scholar 

  14. 14.

    Hopkin, A. S. et al. GRHL3/GET1 and trithorax group members collaborate to activate the epidermal progenitor differentiation program. PLoS Genet. 8, e1002829 (2012).

    CAS  PubMed  PubMed Central  Google Scholar 

  15. 15.

    Lopez, R. G. et al. C/EBPalpha and beta couple interfollicular keratinocyte proliferation arrest to commitment and terminal differentiation. Nat. Cell Biol. 11, 1181–1190 (2009).

    CAS  PubMed  Google Scholar 

  16. 16.

    Lopez-Pajares, V. et al. A LncRNA-MAF:MAFB transcription factor network regulates epidermal differentiation. Dev. Cell 32, 693–706 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  17. 17.

    Segre, J. A., Bauer, C. & Fuchs, E. Klf4 is a transcription factor required for establishing the barrier function of the skin. Nat. Genet. 22, 356–360 (1999).

    CAS  PubMed  Google Scholar 

  18. 18.

    Sen, G. L. et al. ZNF750 is a p63 target gene that induces KLF4 to drive terminal epidermal differentiation. Dev. Cell 22, 669–677 (2012).

    CAS  PubMed  PubMed Central  Google Scholar 

  19. 19.

    Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).

    CAS  PubMed  Google Scholar 

  20. 20.

    Kelley, D. R., Snoek, J. & Rinn, J. Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  21. 21.

    Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  22. 22.

    Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  23. 23.

    Avsec, Ž. et al. Base-resolution models of transcription factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).

    CAS  PubMed  Google Scholar 

  24. 24.

    Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15, 20170387 (2018).

    PubMed  PubMed Central  Google Scholar 

  25. 25.

    Greenside, P., Shimko, T., Fordyce, P. & Kundaje, A. Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics 34, i629–i637 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  26. 26.

    Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In Proc. 34th International Conference on Machine Learning (eds. Precup, D. & Teh, W. W.) 3145–3153 (JMLR, 2017).

  27. 27.

    Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).

    CAS  PubMed  PubMed Central  Google Scholar 

  28. 28.

    Rubin, A. J. et al. Coupled single-cell CRISPR screening and epigenomic profiling reveals causal gene regulatory networks. Cell 176, 361–376.e17 (2019).

    CAS  PubMed  Google Scholar 

  29. 29.

    Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).

    PubMed  PubMed Central  Google Scholar 

  30. 30.

    McDowell, I. C. et al. Clustering gene expression time series data using an infinite Gaussian process mixture model. PLoS Comput. Biol. 14, e1005896 (2018).

    PubMed  PubMed Central  Google Scholar 

  31. 31.

    Simpson, C. L., Patel, D. M. & Green, K. J. Deconstructing the skin: cytoarchitectural determinants of epidermal morphogenesis. Nat. Rev. Mol. Cell Biol. 12, 565–580 (2011).

    CAS  PubMed  PubMed Central  Google Scholar 

  32. 32.

    Candi, E., Schmidt, R. & Melino, G. The cornified envelope: a model of cell death in the skin. Nat. Rev. Mol. Cell Biol. 6, 328–340 (2005).

    CAS  PubMed  Google Scholar 

  33. 33.

    Sen, G. L., Webster, D. E., Barragan, D. I., Chang, H. Y. & Khavari, P. A. Control of differentiation in a self-renewing mammalian tissue by the histone demethylase JMJD3. Genes Dev. 22, 1865–1870 (2008).

    CAS  PubMed  PubMed Central  Google Scholar 

  34. 34.

    Ezhkova, E. et al. Ezh2 orchestrates gene expression for the stepwise differentiation of tissue-specific stem cells. Cell 136, 1122–1135 (2009).

    CAS  PubMed  PubMed Central  Google Scholar 

  35. 35.

    Ezhkova, E. et al. EZH1 and EZH2 cogovern histone H3K27 trimethylation and are essential for hair follicle homeostasis and wound repair. Genes Dev. 25, 485–498 (2011).

    CAS  PubMed  PubMed Central  Google Scholar 

  36. 36.

    Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. Preprint at https://arxiv.org/abs/1312.6034 (2013).

  37. 37.

    Harvey, C. T. et al. QuASAR: quantitative allele-specific analysis of reads. Bioinformatics 31, 1235–1242 (2015).

    PubMed  Google Scholar 

  38. 38.

    Kulakovskiy, I. V. et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP–Seq analysis. Nucleic Acids Res. 46, D252–D259 (2018).

    CAS  PubMed  Google Scholar 

  39. 39.

    Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).

    CAS  PubMed  PubMed Central  Google Scholar 

  40. 40.

    Boxer, L. D., Barajas, B., Tao, S., Zhang, J. & Khavari, P. A. ZNF750 interacts with KLF4 and RCOR1, KDM1A, and CTBP1/2 chromatin regulators to repress epidermal progenitor genes and induce differentiation genes. Genes Dev. 28, 2013–2026 (2014).

    CAS  PubMed  PubMed Central  Google Scholar 

  41. 41.

    Liu, T. et al. Cistrome: an integrative platform for transcriptional regulation studies. Genome Biol. 12, R83 (2011).

    CAS  PubMed  PubMed Central  Google Scholar 

  42. 42.

    McDade, S. S. et al. Genome-wide characterization reveals complex interplay between TP53 and TP63 in response to genotoxic stress. Nucleic Acids Res. 42, 6270–6285 (2014).

    CAS  PubMed  PubMed Central  Google Scholar 

  43. 43.

    Li, Z. et al. Identification of transcription factor binding sites using ATAC-seq. Genome Biol. 20, 45 (2019).

    PubMed  PubMed Central  Google Scholar 

  44. 44.

    Nair, M. et al. Ovol1 regulates the growth arrest of embryonic epidermal progenitor cells and represses c-myc transcription. J. Cell Biol. 173, 253–264 (2006).

    CAS  PubMed  PubMed Central  Google Scholar 

  45. 45.

    Chronis, C. et al. Cooperative binding of transcription factors orchestrates reprogramming. Cell 168, 442–459.e20 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  46. 46.

    Li, D. et al. Chromatin accessibility dynamics during iPSC reprogramming. Cell Stem Cell 21, 819–833.e6 (2017).

    CAS  PubMed  Google Scholar 

  47. 47.

    Di Stefano, B. et al. C/EBPα creates elite cells for iPSC reprogramming by upregulating Klf4 and increasing the levels of Lsd1 and Brd4. Nat. Cell Biol. 18, 371–381 (2016).

    PubMed  Google Scholar 

  48. 48.

    Xu, J., Du, Y. & Deng, H. Direct lineage reprogramming: strategies, mechanisms, and applications. Cell Stem Cell 16, 119–134 (2015).

    CAS  PubMed  Google Scholar 

  49. 49.

    Farley, E. K. et al. Suboptimization of developmental enhancers. Science 350, 325–328 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  50. 50.

    Farley, E. K., Olson, K. M., Zhang, W., Rokhsar, D. S. & Levine, M. S. Syntax compensates for poor binding sites to encode tissue specificity of developmental enhancers. Proc. Natl Acad. Sci. USA 113, 6508–6513 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  51. 51.

    Maytin, E. V. et al. Keratin 10 gene expression during differentiation of mouse epidermis requires transcription factors C/EBP and AP-2. Dev. Biol. 216, 164–181 (1999).

    CAS  PubMed  Google Scholar 

  52. 52.

    Li, J.-J., Cao, Y., Young, M. R. & Colburn, N. H. Induced expression of dominant-negative c-jun downregulates NFκB and AP-1 target genes and suppresses tumor phenotype in human keratinocytes. Mol. Carcinogenesis 29, 159–169 (2000).

    CAS  Google Scholar 

  53. 53.

    Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  54. 54.

    Finucane, H. K. et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nat. Genet. 50, 621–629 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  55. 55.

    Hirata, T. et al. Japanese GWAS identifies variants for bust-size, dysmenorrhea, and menstrual fever that are eQTLs for relevant protein-coding or long non-coding RNAs. Sci. Rep. 8, 8502 (2018).

    PubMed  PubMed Central  Google Scholar 

  56. 56.

    Paternoster, L. et al. Multi-ancestry genome-wide association study of 21,000 cases and 95,000 controls identifies new risk loci for atopic dermatitis. Nat. Genet. 47, 1449–1456 (2015).

    PubMed  PubMed Central  Google Scholar 

  57. 57.

    Colonna, M. AHR: making the keratinocytes thick skinned. Immunity 40, 863–864 (2014).

    CAS  PubMed  Google Scholar 

  58. 58.

    Birnbaum, R. Y. et al. Seborrhea-like dermatitis with psoriasiform elements caused by a mutation in ZNF750, encoding a putative C2H2 zinc finger protein. Nat. Genet. 38, 749–751 (2006).

    CAS  PubMed  Google Scholar 

  59. 59.

    Li, M. et al. Topical vitamin D3 and low-calcemic analogs induce thymic stromal lymphopoietin in mouse keratinocytes and trigger an atopic dermatitis. Proc. Natl Acad. Sci. USA 103, 11736–11741 (2006).

    CAS  PubMed  PubMed Central  Google Scholar 

  60. 60.

    Bentsen, M. et al. ATAC-seq footprinting unravels kinetics of transcription factor binding during zygotic genome activation. Nat. Commun. 11, 4267 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  61. 61.

    ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

    Google Scholar 

  62. 62.

    ENCODE Project Consortium. et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).

    Google Scholar 

  63. 63.

    Fornes, O. et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 48, D87–D92 (2020).

    CAS  PubMed  Google Scholar 

  64. 64.

    Luo, K. et al. Quantitative occupancy of myriad transcription factors from one DNase experiment enables efficient comparisons across conditions. Preprint at bioRxiv https://doi.org/10.1101/2020.06.28.171587 (2020).

  65. 65.

    Vierstra, J. et al. Global reference mapping of human transcription factor footprints. Nature 583, 729–736 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  66. 66.

    Weirauch, M. T. et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443 (2014).

    CAS  PubMed  PubMed Central  Google Scholar 

  67. 67.

    Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

    CAS  PubMed  Google Scholar 

  68. 68.

    Ghandi, M., Lee, D., Mohammad-Noori, M. & Beer, M. A. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol. 10, e1003711 (2014).

    PubMed  PubMed Central  Google Scholar 

  69. 69.

    Maslova, A. et al. Deep learning of immune cell differentiation. Proc. Natl Acad. Sci. USA 117, 25655–25666 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  70. 70.

    Sanford, E. M. et al. Gene regulation gravitates toward either addition or multiplication when combining the effects of two signals. eLife 9, e59388 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  71. 71.

    Sharon, E. et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol. 30, 521–530 (2012).

    CAS  PubMed  PubMed Central  Google Scholar 

  72. 72.

    Smith, R. P. et al. Massively parallel decoding of mammalian regulatory sequences supports a flexible organizational model. Nat. Genet. 45, 1021–1028 (2013).

    CAS  PubMed  PubMed Central  Google Scholar 

  73. 73.

    Kim, D. S. & Kundaje, A. Classification dataset for ENCODE-Roadmap DNase-seq peaks and transcription factor ChIP–seq peaks. Zenodo https://doi.org/10.5281/zenodo.4059038 (2020).

  74. 74.

    Kim, D. S. & Kundaje, A. Machine learning datasets for epigenomic landscapes in epidermal differentiation. Zenodo https://doi.org/10.5281/zenodo.4062510 (2020).

  75. 75.

    Kim, D. S. & Kundaje, A. Convolutional neural net (CNN) models for ENCODE-Roadmap DNase-seq peaks and transcription factor ChIP–seq peaks—basset architecture. Zenodo https://doi.org/10.5281/zenodo.4059060 (2020).

  76. 76.

    Kim, D. S. & Kundaje, A. Convolutional neural net (CNN) models for epigenomic landscapes in epidermal differentiation—basset architecture, classification and regression. Zenodo https://doi.org/10.5281/zenodo.4062726 (2020).

  77. 77.

    Lizio, M. et al. Update of the FANTOM web resource: high resolution transcriptome of diverse cell types in mammals. Nucleic Acids Res. 45, D737–D743 (2017).

    CAS  PubMed  Google Scholar 

  78. 78.

    Kim, D. vervacity/ggr-project: first release. Zenodo https://doi.org/10.5281/zenodo.5161189 (2021).

  79. 79.

    Kim, D., Arivazhagan, N., Wu, K. & Sharmin, M. kundajelab/tronn: v.1.0.0. Zenodo https://doi.org/10.5281/zenodo.5160998 (2021).

Download references

Acknowledgements

The authors thank members of the Kundaje, Greenleaf and Khavari laboratories for discussions and Stanford Computing (Sherlock) and Pacific Computing Consortium (Nautilus) for computing resources. This work was supported by USVA Office of Research and Development I01BX00140908 (P.A.K.), NIH CA142635, AR45192, AR076965 and HG007919 (P.A.K.), 1DP2GM123485 (A.K.) and RM1-HG007735 (H.Y.C.). H.Y.C. is an Investigator of the Howard Hughes Medical Institute. V.I.R. was supported by the Walter V. and Idun Berry Postdoctoral Fellowship and the Katharine McCormick Advanced Postdoctoral Fellowship.

Author information

Affiliations

Authors

Contributions

D.S.K., A. Kundaje, W.J.G. and P.A.K. conceived the project. D.S.K., V.I.R., D.L.R., J.C., A.J.R., N.J., L.K.H.D., V.L.-P., H.D., M.S., D.R., S.L., M.P.S., A. Kathiria, M.S. and Z.Z. performed experiments and performed analysis. A. Kundaje and P.A.K. guided experiments and data analysis. D.S.K., W.J.G., A.undaje and P.A.K. wrote the manuscript with input from all authors.

Corresponding authors

Correspondence to William J. Greenleaf, Anshul Kundaje or Paul A. Khavari.

Ethics declarations

Competing interests

H.Y.C. is a cofounder of Accent Therapeutics and Boundless Bio and is an advisor to 10x Genomics, Arsenal Biosciences and Spring Discovery. M.P.S. is a cofounder and on the advisory board of Personalis, SensOmics, January, Filtricine, Qbio, Protos, Mirive and Nimo. M.P.S. is also on the advisory board of Genapsys and Tailai. W.J.G. is a consultant for 10x Genomics who has licensed intellectual property associated with ATAC-seq. W.J.G. has additional affiliations with Guardant Health (consultant) and Protillion Biosciences (cofounder and consultant). A. Kundaje has affiliations with Biogen (consultant), ImmuneAI (scientific advisory board) and RavelBio (scientific cofounder and scientific advisory board). The remaining authors declare no competing interests.

Additional information

Peer review information Nature Genetics thanks Brian Capell, Idan Cohen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Data quality and other characteristics of the regulatory landscape.

(a) Morphology schematic of normal human epidermis. (b) Selected biomarker gene panel from PAS-seq, demonstrating proper differentiation across time in vitro. (c) Principal component analysis (PCA) of other datasets (signal type used for analysis in parenthesis): PAS-seq (log2 of counts), H3K27ac ChIP–seq (log2 of counts), H3K4me1 ChIP–seq (log2 of counts), H3K27me3 ChIP–seq (log2 of counts), HiChIP (normalized fragment counts). (d) Global statistics on ATAC-seq. Top plot shows the number of reproducible peaks across the timepoints. Bottom plot shows the number of up and down regulated differential peaks across time, using day 0 as the baseline. (e) Global statistics on PAS-seq. Top plot shows the number of expressed genes (> approximately 1TPM) at each timepoint. Bottom plot shows the number of up and down regulated differential genes across time, using day 0 as the baseline. (f) Comparison of bulk ATAC-seq in keratinocyte differentiation to scATAC-seq. Each of the bulk ATAC-seq samples was projected into a 2D UMAP of related scATAC-seq data.

Extended Data Fig. 2 Extended analysis of the keratinocyte epigenome.

(a) Analysis of regions with stable (invariant) accessibility and dynamic chromatin modifications surrounding them (28,973 regions). The regions are clustered according to their dynamic chromatin mark patterns and marked with enriched GO terms accordingly. (b) Analysis of regions with stable (invariant) accessibility and stable chromatin modifications (84,678 regions). The regions are clustered according to combinatorial chromatin states and marked with enriched GO terms accordingly. (c) Comparison of accessibility at TSSs, separated into TSSs of dynamic genes, stable genes, and nonexpressed genes, and additionally compared to distal regions. (d) Profile heatmaps for TSSs of dynamic genes. (e) Chromatin states around TSSs of stable genes. (f) Chromatin states around TSSs of nonexpressed genes.

Extended Data Fig. 3 Extended analysis of deep neural net models.

(a) Schematic describing transfer learning. From left to right: first, models are trained on a large compendium of DNase-seq datasets from ENCODE and Roadmap; these weights are used to initialize training for a keratinocyte specific classification model; finally, these weights are used to initialize training for a regression model. (b) Model performance metrics. Left: area under the precision-recall curve (AUPRC) for the ENCODE/Roadmap pre-training classification tasks across 10 folds. Right: AUPRC for accessibility in keratinocyte timepoints across 10 folds, considering transfer learning or fresh initialization (random seeded weights). Box-and-whisker plots show all points, minimum to maximum, with 25th to 75th interquartile range. (c) Precision-recall curves for the classification stage. Top: Precision-recall for prediction of accessible peaks. Bottom: Precision-recall for prediction of strong enhancer state (presence of ATAC-seq, H3K27ac ChIP–seq, and H3K4me1 ChIP–seq). (d) Heatmaps of observed ATAC signal vs neural net predicted ATAC signal across dynamically accessible regions. (e) Validation of contribution scores by comparing to SNPs exhibiting significant allelic imbalance of ATAC-seq signal. Top: Comparison of effect sizes of allelic imbalance of ATAC-seq signal, between SNPs overlapping nonsignificant contribution scores and those overlapping significant contribution scores. Bottom: comparison of model-derived allelic effect predictions (reference allele - alternate allele) on SNPs overlapping significant contribution scores, separated by whether the SNP was considered allele-sensitive (FDR < 0.10) or not allele-sensitive. Box-and-whisker plots show all points, minimum to maximum, with 25th to 75th interquartile range box. (f) Comparison of neural network derived predictive motifs versus enriched motifs derived by HOMER motif discovery. (g) Predictive, active motif instances of KLF4 show higher ChIP–seq signal relative to inactive motifs in CREs. (h) Evaluation of motif instances identified by sequence-only position weight matrix motif match scores against contribution-weighted sequence motif match scores. (i) Predictive motifs show dynamic footprinting. DLX3 motif is shown. (j) Heatmap showing predictive motifs enriched in CREs corresponding to ATAC-seq trajectories. (k) Heatmap showing TFs whose expression was correlated (r > 0.8) with activity of their matched predictive motifs in CREs corresponding to ATAC-seq trajectories.

Extended Data Fig. 4 Repressive motifs in CREs exhibiting decreasing accessibility across keratinocyte differentiation.

(a) Top: dynamics of negative contribution scores of predictive motif instances of CEPBA and KLF4 across time averaged over all CREs exhibiting decreasing accessibility across the timecourse. Bottom: dynamic expression patterns of CEBP and KLF family TFs that exhibit strong anticorrelation with motif activity dynamics across the timecourse. (b) A closing CRE (chr10:60192514-60203992) shows progressively increasing negative contributions of nucleotides in CEBPA motif across the timecourse in concordance with an increasing negative effect on accessibility. Assay ranges are ATAC-seq: 0-500; H3K27ac: 0-20; H3K4me1: 0–50. (c) Functional enrichments for gene sets linked to CREs containing predictive instances of CEBPA and KLF4 motifs with strong negative contribution scores. Left: enrichments linked to closing CREs with negative CEBPA motif scores. Right: enrichments linked to closing CREs with negative KLF4 motif scores.

Extended Data Fig. 5 Analysis of homotypic motif clusters within the keratinocyte epigenome.

(a) Analysis of motif counts on chromatin accessibility using synthetic sequences. Synthetic scrambled background sequences were embedded with varying number of instances of each predictive motif. The neural network was used to predict chromatin accessibility for each synthetic sequence. Left: Each curve summarizes the predicted accessibility with increasing motif density for each motif averaged over 100 random synthetic backgrounds. Middle/right: Predicted chromatin accessibility for increasing density of FOSB, and CEBPD motifs. Each black curve represents a specific random synthetic background sequence, while the red curve is the average pattern across all backgrounds. (b) Relationship between motif affinity and motif density in CREs containing predictive motif instances. Motif affinity is estimated as the average motif PWM match log-odds scores of all predictive instances in a CRE. Motif density is the number of predictive motif instances in each CRE. We observe a striking tradeoff between motif density and the upper limit of average motif affinity. Right: CEBPD motif instances. Left: GRHL motif instances. (c) Motif PWM match scores as a function of distance from the ATAC-seq summit. Left: motif PWM match scores from all motif instances for CEBPD and GRHL motifs. Right: motif PWM match scores for predictive motif instances for CEBPD and GRHL motifs. (d) Proposed principles of cell-type specific homotypic motif clusters. As number of motif instances increases or as motif affinities in a region increase, accessibility increases. The suboptimization of motif sites, particularly when there are more motif instances within a region, acts as an upper limit to prevent ectopic accessibility. Motif affinities are strongest near the accessibility summit.

Extended Data Fig. 6 Examples of interacting pairs of predictive motifs and motif co-occurrence statistics.

(a) Example regions demonstrating interacting motifs. Top row: putative enhancer affecting LAMC2 gene expression with an interacting NFKB2 motif and RUNX1 motif (chr1:183147408-183170430). Assay ranges are ATAC-seq: 0–600; H3K27ac: 0-300; H3K4me1: 0–50. The highlighted region in the signal tracks (left) demonstrates correctly predicted ATAC signal by the neural net (top middle heatmap). Base-resolution contribution score tracks are shown for the wild-type (genomic) sequence and sequences with marginal and joint perturbation of both motifs (middle tracks). The model predicts a super-multiplicative effects of the motif pair on chromatin accessibility (right plot). Bottom row: Analogous plots for a putative enhancer affecting MUC15 gene expression with an interacting GRHL2 motif and ATF4 motif (chr11:26590539-26610606). Assay ranges are ATAC-seq: 0–800; H3K27ac: 0–150; H3K4me1: 0-70. (b) Co-occurrence statistics (size of circle represents number of instances) for motif pairs based on all motif instances identified solely using sequence match scores (left) and motif pairs based on predictive, active motif instances based on contribution-weighted sequence match scores (right). Predictive motif instances highlight less promiscuous, more specific co-occurrence statistics. (c) Analogous co-occurrence statistics for motif pairs using all motif instances (left) and predictive motif instances (right) after filtering for pairs that show significant GO term enrichments for associated target genes. Once again, more specific co-occurrence patterns are observed for the predictive motif instances.

Extended Data Fig. 7 Mapping co-occurring motif pairs to enriched Gene Ontology terms.

Map of combinatorial rules derived from in silico motif interaction analyses. Each row across plots represents a predicted interacting motif pair. From left to right: the motif presence plot demonstrates which motifs are part of the combinatorial rule; the ATAC heatmap demonstrates the average accessibility pattern over CREs containing each motif pair across all time points; the RNA heatmap displays the average gene expression over genes associated with CREs containing each motif pair across the time points; Gene Ontology terms are significantly enriched in the gene sets associated with CREs containing each motif pair.

Extended Data Fig. 8 MPRA data quality and comparisons to epigenomic landscapes.

(a) Distribution of barcodes in plasmid library, demonstrating the skew of barcode representation. (b) Number of barcodes per fragment in plasmid library, demonstrating on average 10 barcodes per fragment tested. (c) Number of reads per MPRA sample. (d) Number of barcodes per fragment in MPRA RNA reads, demonstrating on average 10 barcodes per fragment tested. (e) Average MPRA signal compared to controls, showing ATAC regions on average have activity in between negative controls (genomic negatives and shuffled sequences) and positive controls (promoter sequences). (f) MPRA replicate consistency. Left: Consistency by Pearson R across replicates and timepoints tested. Right: Consistency of MPRA replicate signal for two example replicates in timepoint day 0. (g) Correlation of MPRA signal to various genomic and/or modeling signals: ATAC signal, NN predictions of ATAC signal, H3K27ac, and regression predictions from linear model utilizing NN final layer activations as model inputs (results shown on held-out test data). (h) ATAC signal across timepoints day 0,3, and 6 for sequences containing HOXA1 motif and ETV5 motif. Box-and-whisker plots show all points, minimum to maximum, with 25th to 75th interquartile range.

Extended Data Fig. 9 Additional experimental validation of representative MPRA fragments.

(a) GFP expression of reporters drawn from MPRA fragments (endogenous examples of combinatorial rules) from day 0 to day 7. R4 and R5 are instances of CEBPD/ZNF750 rule. R11 and R12 are instances of KLF4/ZNF750 rule. Scr is a scrambled control fragment. Yellow and black scale bars are 20um. (b) GFP expression from the experiment in (A) quantified. (c) ZNF750 knockout followed by luciferase reporter timecourse expression, demonstrating ZNF750 influence on instances of rules involving ZNF750. Data summarizes three independent experiments and is represented as mean ± s.e.m. One-sided T test was used for comparisons, *P < 0.05, **P < 0.01, ***P < 0.001.

Extended Data Fig. 10 Combinatorial motif pairs are enriched for genetic variants associated with skin phenotypes.

(a) LD score regression analysis showing differential heritability enrichment of various skin-related diseases and traits in different sets of CRE. The skin phenotypes include: psoriasis, dermatitis, acne, actinic keratosis and rosacea. The sets of CREs include: ‘Progenitor’ rules are CREs containing motif pairs that demonstrate decreasing accessibility and activity across the epidermal differentiation timecourse. ‘Early differentiation’ rules are CREs containing motif pairs those that demonstrate maximal accessibility and activity in the middle of the epidermal differentiation timecourse. ‘Late differentiation’ rules are CREs contained motif pairs that demonstrate maximal accessibility and activity at the end of the epidermal differentiation timecourse. ‘Union DHS’ is the union of DNase peaks across all ENCODE DNase datasets. ‘HepG2’ are DNase peaks in the HepG2 liver carcinoma cell line. ‘Union ATAC’ is the union of CREs across all time points of the differentiation timecourse. ‘ATAC timepoints’ are the CREs that are accessible in each time point of the epidermal differentiation timecourse. ‘Dynamic trajectories’ are clusters of CREs that display specific concordant patterns of dynamic accessibility across the epidermal differentiation timecourse. Plots show LDSC score enrichment coefficients with confidence intervals. (b) Predicted dysregulated TF motif lexicon networks by phenotype. Combinatorial rules were overlaid onto the predicted TF network of combinatorial motif interactions to demonstrate dysregulated TF subnetworks. Node size is the sum of the LD score regression coefficients for the significant combinatorial rules involving that node TF motif. Edges and nodes in black represent significantly enriched combinatorial rules, edges and nodes in gray did not pass statistical significance. Edge thickness represents the validated interaction effect of the rule (supermultiplicative, multiplicative, submultiplicative).

Supplementary information

Supplementary Information

Supplementary Materials and Methods, and Fig. 1.

Reporting Summary

Supplementary Tables

Table 1: List of all experiments with metadata. Table 2: ATAC-seq quality control metrics for technical replicates. Table 3: ATAC-seq quality control metrics for biological replicates. Table 4: PAS-seq quality control metrics. Table 5: ChIP–seq quality control metrics. Table 6: HiChIP quality control metrics. Table 7: Cross-validation folds, split by chromosomes. Table 8: List of ENCODE/Roadmap DNase-seq datasets used in transfer learning. Table 9: List of ENCODE/Roadmap TF ChIP–seq datasets used in transfer learning. Table 10: HOCOMOCO reduced set of position weight matrices with associated transcription factors. Table 11: List of FANTOM5 transcription factors. Table 12: List of MPRA sequences. Table 13: List of individual reporter sequences used in validation experiments. Table 14: List of qPCR primers for ChIP–ReChIP.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Kim, D.S., Risca, V.I., Reynolds, D.L. et al. The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation. Nat Genet 53, 1564–1576 (2021). https://doi.org/10.1038/s41588-021-00947-3

Download citation

Search

Quick links