Abstract
Transcription factors bind DNA sequence motif vocabularies in cis-regulatory elements (CREs) to modulate chromatin state and gene expression during cell state transitions. A quantitative understanding of how motif lexicons influence dynamic regulatory activity has been elusive due to the combinatorial nature of the cis-regulatory code. To address this, we undertook multiomic data profiling of chromatin and expression dynamics across epidermal differentiation to identify 40,103 dynamic CREs associated with 3,609 dynamically expressed genes, then applied an interpretable deep-learning framework to model the cis-regulatory logic of chromatin accessibility. This analysis framework identified cooperative DNA sequence rules in dynamic CREs regulating synchronous gene modules with diverse roles in skin differentiation. Massively parallel reporter assay analysis validated temporal dynamics and cooperative cis-regulatory logic. Variants linked to human polygenic skin disease were enriched in these time-dependent combinatorial motif rules. This integrative approach shows the combinatorial cis-regulatory lexicon of epidermal differentiation and represents a general framework for deciphering the organizational principles of the cis-regulatory code of dynamic gene regulation.
This is a preview of subscription content, access via your institution
Relevant articles
Open Access articles citing this article.
-
Gradual differentiation uncoupled from cell cycle exit generates heterogeneity in the epidermal stem cell layer
Nature Cell Biology Open Access 10 November 2022
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
$209.00 per year
only $17.42 per issue
Rent or buy this article
Prices vary by article type
from$1.95
to$39.95
Prices may be subject to local taxes which are calculated during checkout






Data availability
ATAC-seq, ChIP–seq, PAS-seq, HiChIP and MPRA data can all be found on the Gene Expression Omnibus: GSE181416. There are no restrictions to access of the datasets. Training datasets for machine learning can be found at Zenodo73,74 and trained models can also be found at Zenodo75,76. hg19 annotations can be found at https://hgdownload.soe.ucsc.edu/downloads.html and GENCODE annotations can be found at https://www.gencodegenes.org/human/release_19.html. FANTOM5 (ref. 77) transcription factors can be found at https://fantom.gsc.riken.jp/5/sstar/Browse_Transcription_Factors_hg19. The HOCOMOCO38 database can be found at https://hocomoco11.autosome.ru/.
Code availability
Integrative analysis code and scripts78 can be found at https://github.com/vervacity/ggr-project and the deep-learning code79 can be found at https://github.com/kundajelab/tronn.
References
Gray, H. & Lewis, W. H. Anatomy of the Human Body (Bartleby, 1918).
Lopez-Pajares, V., Yan, K., Zarnegar, B. J., Jameson, K. L. & Khavari, P. A. Genetic pathways in disorders of epidermal differentiation. Trends Genet. 29, 31–40 (2013).
Truong, A. B., Kretz, M., Ridky, T. W., Kimmel, R. & Khavari, P. A. p63 regulates proliferation and differentiation of developmentally mature keratinocytes. Genes Dev. 20, 3185–3197 (2006).
Levine, M. Transcriptional enhancers in animal development and evolution. Curr. Biol. 20, R754–R763 (2010).
Spitz, F. & Furlong, E. E. M. Transcription factors: from enhancer binding to developmental control. Nat. Rev. Genet. 13, 613–626 (2012).
Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Reiter, F., Wienerroither, S. & Stark, A. Combinatorial function of transcription factors and cofactors. Curr. Opin. Genet. Dev. 43, 73–81 (2017).
Rubin, A. J. et al. Lineage-specific dynamic and pre-established enhancer-promoter contacts cooperate in terminal differentiation. Nat. Genet. 49, 1522–1528 (2017).
Arnosti, D. N., Barolo, S., Levine, M. & Small, S. The eve stripe 2 enhancer employs multiple modes of transcriptional synergy. Development 122, 205–214 (1996).
Banerji, J., Rusconi, S. & Schaffner, W. Expression of a beta-globin gene is enhanced by remote SV40 DNA sequences. Cell 27, 299–308 (1981).
Levo, M. & Segal, E. In pursuit of design principles of regulatory sequences. Nat. Rev. Genet. 15, 453–468 (2014).
Thanos, D. & Maniatis, T. Virus induction of human IFN beta gene expression requires the assembly of an enhanceosome. Cell 83, 1091–1100 (1995).
Michaletti, A. et al. Multi-omics profiling of calcium-induced human keratinocytes differentiation reveals modulation of unfolded protein response signaling pathways. Cell Cycle 18, 2124–2140 (2019).
Hopkin, A. S. et al. GRHL3/GET1 and trithorax group members collaborate to activate the epidermal progenitor differentiation program. PLoS Genet. 8, e1002829 (2012).
Lopez, R. G. et al. C/EBPalpha and beta couple interfollicular keratinocyte proliferation arrest to commitment and terminal differentiation. Nat. Cell Biol. 11, 1181–1190 (2009).
Lopez-Pajares, V. et al. A LncRNA-MAF:MAFB transcription factor network regulates epidermal differentiation. Dev. Cell 32, 693–706 (2015).
Segre, J. A., Bauer, C. & Fuchs, E. Klf4 is a transcription factor required for establishing the barrier function of the skin. Nat. Genet. 22, 356–360 (1999).
Sen, G. L. et al. ZNF750 is a p63 target gene that induces KLF4 to drive terminal epidermal differentiation. Dev. Cell 22, 669–677 (2012).
Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).
Kelley, D. R., Snoek, J. & Rinn, J. Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
Avsec, Ž. et al. Base-resolution models of transcription factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).
Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15, 20170387 (2018).
Greenside, P., Shimko, T., Fordyce, P. & Kundaje, A. Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics 34, i629–i637 (2018).
Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In Proc. 34th International Conference on Machine Learning (eds. Precup, D. & Teh, W. W.) 3145–3153 (JMLR, 2017).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
Rubin, A. J. et al. Coupled single-cell CRISPR screening and epigenomic profiling reveals causal gene regulatory networks. Cell 176, 361–376.e17 (2019).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
McDowell, I. C. et al. Clustering gene expression time series data using an infinite Gaussian process mixture model. PLoS Comput. Biol. 14, e1005896 (2018).
Simpson, C. L., Patel, D. M. & Green, K. J. Deconstructing the skin: cytoarchitectural determinants of epidermal morphogenesis. Nat. Rev. Mol. Cell Biol. 12, 565–580 (2011).
Candi, E., Schmidt, R. & Melino, G. The cornified envelope: a model of cell death in the skin. Nat. Rev. Mol. Cell Biol. 6, 328–340 (2005).
Sen, G. L., Webster, D. E., Barragan, D. I., Chang, H. Y. & Khavari, P. A. Control of differentiation in a self-renewing mammalian tissue by the histone demethylase JMJD3. Genes Dev. 22, 1865–1870 (2008).
Ezhkova, E. et al. Ezh2 orchestrates gene expression for the stepwise differentiation of tissue-specific stem cells. Cell 136, 1122–1135 (2009).
Ezhkova, E. et al. EZH1 and EZH2 cogovern histone H3K27 trimethylation and are essential for hair follicle homeostasis and wound repair. Genes Dev. 25, 485–498 (2011).
Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. Preprint at https://arxiv.org/abs/1312.6034 (2013).
Harvey, C. T. et al. QuASAR: quantitative allele-specific analysis of reads. Bioinformatics 31, 1235–1242 (2015).
Kulakovskiy, I. V. et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP–Seq analysis. Nucleic Acids Res. 46, D252–D259 (2018).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
Boxer, L. D., Barajas, B., Tao, S., Zhang, J. & Khavari, P. A. ZNF750 interacts with KLF4 and RCOR1, KDM1A, and CTBP1/2 chromatin regulators to repress epidermal progenitor genes and induce differentiation genes. Genes Dev. 28, 2013–2026 (2014).
Liu, T. et al. Cistrome: an integrative platform for transcriptional regulation studies. Genome Biol. 12, R83 (2011).
McDade, S. S. et al. Genome-wide characterization reveals complex interplay between TP53 and TP63 in response to genotoxic stress. Nucleic Acids Res. 42, 6270–6285 (2014).
Li, Z. et al. Identification of transcription factor binding sites using ATAC-seq. Genome Biol. 20, 45 (2019).
Nair, M. et al. Ovol1 regulates the growth arrest of embryonic epidermal progenitor cells and represses c-myc transcription. J. Cell Biol. 173, 253–264 (2006).
Chronis, C. et al. Cooperative binding of transcription factors orchestrates reprogramming. Cell 168, 442–459.e20 (2017).
Li, D. et al. Chromatin accessibility dynamics during iPSC reprogramming. Cell Stem Cell 21, 819–833.e6 (2017).
Di Stefano, B. et al. C/EBPα creates elite cells for iPSC reprogramming by upregulating Klf4 and increasing the levels of Lsd1 and Brd4. Nat. Cell Biol. 18, 371–381 (2016).
Xu, J., Du, Y. & Deng, H. Direct lineage reprogramming: strategies, mechanisms, and applications. Cell Stem Cell 16, 119–134 (2015).
Farley, E. K. et al. Suboptimization of developmental enhancers. Science 350, 325–328 (2015).
Farley, E. K., Olson, K. M., Zhang, W., Rokhsar, D. S. & Levine, M. S. Syntax compensates for poor binding sites to encode tissue specificity of developmental enhancers. Proc. Natl Acad. Sci. USA 113, 6508–6513 (2016).
Maytin, E. V. et al. Keratin 10 gene expression during differentiation of mouse epidermis requires transcription factors C/EBP and AP-2. Dev. Biol. 216, 164–181 (1999).
Li, J.-J., Cao, Y., Young, M. R. & Colburn, N. H. Induced expression of dominant-negative c-jun downregulates NFκB and AP-1 target genes and suppresses tumor phenotype in human keratinocytes. Mol. Carcinogenesis 29, 159–169 (2000).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
Finucane, H. K. et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nat. Genet. 50, 621–629 (2018).
Hirata, T. et al. Japanese GWAS identifies variants for bust-size, dysmenorrhea, and menstrual fever that are eQTLs for relevant protein-coding or long non-coding RNAs. Sci. Rep. 8, 8502 (2018).
Paternoster, L. et al. Multi-ancestry genome-wide association study of 21,000 cases and 95,000 controls identifies new risk loci for atopic dermatitis. Nat. Genet. 47, 1449–1456 (2015).
Colonna, M. AHR: making the keratinocytes thick skinned. Immunity 40, 863–864 (2014).
Birnbaum, R. Y. et al. Seborrhea-like dermatitis with psoriasiform elements caused by a mutation in ZNF750, encoding a putative C2H2 zinc finger protein. Nat. Genet. 38, 749–751 (2006).
Li, M. et al. Topical vitamin D3 and low-calcemic analogs induce thymic stromal lymphopoietin in mouse keratinocytes and trigger an atopic dermatitis. Proc. Natl Acad. Sci. USA 103, 11736–11741 (2006).
Bentsen, M. et al. ATAC-seq footprinting unravels kinetics of transcription factor binding during zygotic genome activation. Nat. Commun. 11, 4267 (2020).
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
ENCODE Project Consortium. et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).
Fornes, O. et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 48, D87–D92 (2020).
Luo, K. et al. Quantitative occupancy of myriad transcription factors from one DNase experiment enables efficient comparisons across conditions. Preprint at bioRxiv https://doi.org/10.1101/2020.06.28.171587 (2020).
Vierstra, J. et al. Global reference mapping of human transcription factor footprints. Nature 583, 729–736 (2020).
Weirauch, M. T. et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443 (2014).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
Ghandi, M., Lee, D., Mohammad-Noori, M. & Beer, M. A. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol. 10, e1003711 (2014).
Maslova, A. et al. Deep learning of immune cell differentiation. Proc. Natl Acad. Sci. USA 117, 25655–25666 (2020).
Sanford, E. M. et al. Gene regulation gravitates toward either addition or multiplication when combining the effects of two signals. eLife 9, e59388 (2020).
Sharon, E. et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol. 30, 521–530 (2012).
Smith, R. P. et al. Massively parallel decoding of mammalian regulatory sequences supports a flexible organizational model. Nat. Genet. 45, 1021–1028 (2013).
Kim, D. S. & Kundaje, A. Classification dataset for ENCODE-Roadmap DNase-seq peaks and transcription factor ChIP–seq peaks. Zenodo https://doi.org/10.5281/zenodo.4059038 (2020).
Kim, D. S. & Kundaje, A. Machine learning datasets for epigenomic landscapes in epidermal differentiation. Zenodo https://doi.org/10.5281/zenodo.4062510 (2020).
Kim, D. S. & Kundaje, A. Convolutional neural net (CNN) models for ENCODE-Roadmap DNase-seq peaks and transcription factor ChIP–seq peaks—basset architecture. Zenodo https://doi.org/10.5281/zenodo.4059060 (2020).
Kim, D. S. & Kundaje, A. Convolutional neural net (CNN) models for epigenomic landscapes in epidermal differentiation—basset architecture, classification and regression. Zenodo https://doi.org/10.5281/zenodo.4062726 (2020).
Lizio, M. et al. Update of the FANTOM web resource: high resolution transcriptome of diverse cell types in mammals. Nucleic Acids Res. 45, D737–D743 (2017).
Kim, D. vervacity/ggr-project: first release. Zenodo https://doi.org/10.5281/zenodo.5161189 (2021).
Kim, D., Arivazhagan, N., Wu, K. & Sharmin, M. kundajelab/tronn: v.1.0.0. Zenodo https://doi.org/10.5281/zenodo.5160998 (2021).
Acknowledgements
The authors thank members of the Kundaje, Greenleaf and Khavari laboratories for discussions and Stanford Computing (Sherlock) and Pacific Computing Consortium (Nautilus) for computing resources. This work was supported by USVA Office of Research and Development I01BX00140908 (P.A.K.), NIH CA142635, AR45192, AR076965 and HG007919 (P.A.K.), 1DP2GM123485 (A.K.) and RM1-HG007735 (H.Y.C.). H.Y.C. is an Investigator of the Howard Hughes Medical Institute. V.I.R. was supported by the Walter V. and Idun Berry Postdoctoral Fellowship and the Katharine McCormick Advanced Postdoctoral Fellowship.
Author information
Authors and Affiliations
Contributions
D.S.K., A. Kundaje, W.J.G. and P.A.K. conceived the project. D.S.K., V.I.R., D.L.R., J.C., A.J.R., N.J., L.K.H.D., V.L.-P., H.D., M.S., D.R., S.L., M.P.S., A. Kathiria, M.S. and Z.Z. performed experiments and performed analysis. A. Kundaje and P.A.K. guided experiments and data analysis. D.S.K., W.J.G., A.undaje and P.A.K. wrote the manuscript with input from all authors.
Corresponding authors
Ethics declarations
Competing interests
H.Y.C. is a cofounder of Accent Therapeutics and Boundless Bio and is an advisor to 10x Genomics, Arsenal Biosciences and Spring Discovery. M.P.S. is a cofounder and on the advisory board of Personalis, SensOmics, January, Filtricine, Qbio, Protos, Mirive and Nimo. M.P.S. is also on the advisory board of Genapsys and Tailai. W.J.G. is a consultant for 10x Genomics who has licensed intellectual property associated with ATAC-seq. W.J.G. has additional affiliations with Guardant Health (consultant) and Protillion Biosciences (cofounder and consultant). A. Kundaje has affiliations with Biogen (consultant), ImmuneAI (scientific advisory board) and RavelBio (scientific cofounder and scientific advisory board). The remaining authors declare no competing interests.
Additional information
Peer review information Nature Genetics thanks Brian Capell, Idan Cohen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Data quality and other characteristics of the regulatory landscape.
(a) Morphology schematic of normal human epidermis. (b) Selected biomarker gene panel from PAS-seq, demonstrating proper differentiation across time in vitro. (c) Principal component analysis (PCA) of other datasets (signal type used for analysis in parenthesis): PAS-seq (log2 of counts), H3K27ac ChIP–seq (log2 of counts), H3K4me1 ChIP–seq (log2 of counts), H3K27me3 ChIP–seq (log2 of counts), HiChIP (normalized fragment counts). (d) Global statistics on ATAC-seq. Top plot shows the number of reproducible peaks across the timepoints. Bottom plot shows the number of up and down regulated differential peaks across time, using day 0 as the baseline. (e) Global statistics on PAS-seq. Top plot shows the number of expressed genes (> approximately 1TPM) at each timepoint. Bottom plot shows the number of up and down regulated differential genes across time, using day 0 as the baseline. (f) Comparison of bulk ATAC-seq in keratinocyte differentiation to scATAC-seq. Each of the bulk ATAC-seq samples was projected into a 2D UMAP of related scATAC-seq data.
Extended Data Fig. 2 Extended analysis of the keratinocyte epigenome.
(a) Analysis of regions with stable (invariant) accessibility and dynamic chromatin modifications surrounding them (28,973 regions). The regions are clustered according to their dynamic chromatin mark patterns and marked with enriched GO terms accordingly. (b) Analysis of regions with stable (invariant) accessibility and stable chromatin modifications (84,678 regions). The regions are clustered according to combinatorial chromatin states and marked with enriched GO terms accordingly. (c) Comparison of accessibility at TSSs, separated into TSSs of dynamic genes, stable genes, and nonexpressed genes, and additionally compared to distal regions. (d) Profile heatmaps for TSSs of dynamic genes. (e) Chromatin states around TSSs of stable genes. (f) Chromatin states around TSSs of nonexpressed genes.
Extended Data Fig. 3 Extended analysis of deep neural net models.
(a) Schematic describing transfer learning. From left to right: first, models are trained on a large compendium of DNase-seq datasets from ENCODE and Roadmap; these weights are used to initialize training for a keratinocyte specific classification model; finally, these weights are used to initialize training for a regression model. (b) Model performance metrics. Left: area under the precision-recall curve (AUPRC) for the ENCODE/Roadmap pre-training classification tasks across 10 folds. Right: AUPRC for accessibility in keratinocyte timepoints across 10 folds, considering transfer learning or fresh initialization (random seeded weights). Box-and-whisker plots show all points, minimum to maximum, with 25th to 75th interquartile range. (c) Precision-recall curves for the classification stage. Top: Precision-recall for prediction of accessible peaks. Bottom: Precision-recall for prediction of strong enhancer state (presence of ATAC-seq, H3K27ac ChIP–seq, and H3K4me1 ChIP–seq). (d) Heatmaps of observed ATAC signal vs neural net predicted ATAC signal across dynamically accessible regions. (e) Validation of contribution scores by comparing to SNPs exhibiting significant allelic imbalance of ATAC-seq signal. Top: Comparison of effect sizes of allelic imbalance of ATAC-seq signal, between SNPs overlapping nonsignificant contribution scores and those overlapping significant contribution scores. Bottom: comparison of model-derived allelic effect predictions (reference allele - alternate allele) on SNPs overlapping significant contribution scores, separated by whether the SNP was considered allele-sensitive (FDR < 0.10) or not allele-sensitive. Box-and-whisker plots show all points, minimum to maximum, with 25th to 75th interquartile range box. (f) Comparison of neural network derived predictive motifs versus enriched motifs derived by HOMER motif discovery. (g) Predictive, active motif instances of KLF4 show higher ChIP–seq signal relative to inactive motifs in CREs. (h) Evaluation of motif instances identified by sequence-only position weight matrix motif match scores against contribution-weighted sequence motif match scores. (i) Predictive motifs show dynamic footprinting. DLX3 motif is shown. (j) Heatmap showing predictive motifs enriched in CREs corresponding to ATAC-seq trajectories. (k) Heatmap showing TFs whose expression was correlated (r > 0.8) with activity of their matched predictive motifs in CREs corresponding to ATAC-seq trajectories.
Extended Data Fig. 4 Repressive motifs in CREs exhibiting decreasing accessibility across keratinocyte differentiation.
(a) Top: dynamics of negative contribution scores of predictive motif instances of CEPBA and KLF4 across time averaged over all CREs exhibiting decreasing accessibility across the timecourse. Bottom: dynamic expression patterns of CEBP and KLF family TFs that exhibit strong anticorrelation with motif activity dynamics across the timecourse. (b) A closing CRE (chr10:60192514-60203992) shows progressively increasing negative contributions of nucleotides in CEBPA motif across the timecourse in concordance with an increasing negative effect on accessibility. Assay ranges are ATAC-seq: 0-500; H3K27ac: 0-20; H3K4me1: 0–50. (c) Functional enrichments for gene sets linked to CREs containing predictive instances of CEBPA and KLF4 motifs with strong negative contribution scores. Left: enrichments linked to closing CREs with negative CEBPA motif scores. Right: enrichments linked to closing CREs with negative KLF4 motif scores.
Extended Data Fig. 5 Analysis of homotypic motif clusters within the keratinocyte epigenome.
(a) Analysis of motif counts on chromatin accessibility using synthetic sequences. Synthetic scrambled background sequences were embedded with varying number of instances of each predictive motif. The neural network was used to predict chromatin accessibility for each synthetic sequence. Left: Each curve summarizes the predicted accessibility with increasing motif density for each motif averaged over 100 random synthetic backgrounds. Middle/right: Predicted chromatin accessibility for increasing density of FOSB, and CEBPD motifs. Each black curve represents a specific random synthetic background sequence, while the red curve is the average pattern across all backgrounds. (b) Relationship between motif affinity and motif density in CREs containing predictive motif instances. Motif affinity is estimated as the average motif PWM match log-odds scores of all predictive instances in a CRE. Motif density is the number of predictive motif instances in each CRE. We observe a striking tradeoff between motif density and the upper limit of average motif affinity. Right: CEBPD motif instances. Left: GRHL motif instances. (c) Motif PWM match scores as a function of distance from the ATAC-seq summit. Left: motif PWM match scores from all motif instances for CEBPD and GRHL motifs. Right: motif PWM match scores for predictive motif instances for CEBPD and GRHL motifs. (d) Proposed principles of cell-type specific homotypic motif clusters. As number of motif instances increases or as motif affinities in a region increase, accessibility increases. The suboptimization of motif sites, particularly when there are more motif instances within a region, acts as an upper limit to prevent ectopic accessibility. Motif affinities are strongest near the accessibility summit.
Extended Data Fig. 6 Examples of interacting pairs of predictive motifs and motif co-occurrence statistics.
(a) Example regions demonstrating interacting motifs. Top row: putative enhancer affecting LAMC2 gene expression with an interacting NFKB2 motif and RUNX1 motif (chr1:183147408-183170430). Assay ranges are ATAC-seq: 0–600; H3K27ac: 0-300; H3K4me1: 0–50. The highlighted region in the signal tracks (left) demonstrates correctly predicted ATAC signal by the neural net (top middle heatmap). Base-resolution contribution score tracks are shown for the wild-type (genomic) sequence and sequences with marginal and joint perturbation of both motifs (middle tracks). The model predicts a super-multiplicative effects of the motif pair on chromatin accessibility (right plot). Bottom row: Analogous plots for a putative enhancer affecting MUC15 gene expression with an interacting GRHL2 motif and ATF4 motif (chr11:26590539-26610606). Assay ranges are ATAC-seq: 0–800; H3K27ac: 0–150; H3K4me1: 0-70. (b) Co-occurrence statistics (size of circle represents number of instances) for motif pairs based on all motif instances identified solely using sequence match scores (left) and motif pairs based on predictive, active motif instances based on contribution-weighted sequence match scores (right). Predictive motif instances highlight less promiscuous, more specific co-occurrence statistics. (c) Analogous co-occurrence statistics for motif pairs using all motif instances (left) and predictive motif instances (right) after filtering for pairs that show significant GO term enrichments for associated target genes. Once again, more specific co-occurrence patterns are observed for the predictive motif instances.
Extended Data Fig. 7 Mapping co-occurring motif pairs to enriched Gene Ontology terms.
Map of combinatorial rules derived from in silico motif interaction analyses. Each row across plots represents a predicted interacting motif pair. From left to right: the motif presence plot demonstrates which motifs are part of the combinatorial rule; the ATAC heatmap demonstrates the average accessibility pattern over CREs containing each motif pair across all time points; the RNA heatmap displays the average gene expression over genes associated with CREs containing each motif pair across the time points; Gene Ontology terms are significantly enriched in the gene sets associated with CREs containing each motif pair.
Extended Data Fig. 8 MPRA data quality and comparisons to epigenomic landscapes.
(a) Distribution of barcodes in plasmid library, demonstrating the skew of barcode representation. (b) Number of barcodes per fragment in plasmid library, demonstrating on average 10 barcodes per fragment tested. (c) Number of reads per MPRA sample. (d) Number of barcodes per fragment in MPRA RNA reads, demonstrating on average 10 barcodes per fragment tested. (e) Average MPRA signal compared to controls, showing ATAC regions on average have activity in between negative controls (genomic negatives and shuffled sequences) and positive controls (promoter sequences). (f) MPRA replicate consistency. Left: Consistency by Pearson R across replicates and timepoints tested. Right: Consistency of MPRA replicate signal for two example replicates in timepoint day 0. (g) Correlation of MPRA signal to various genomic and/or modeling signals: ATAC signal, NN predictions of ATAC signal, H3K27ac, and regression predictions from linear model utilizing NN final layer activations as model inputs (results shown on held-out test data). (h) ATAC signal across timepoints day 0,3, and 6 for sequences containing HOXA1 motif and ETV5 motif. Box-and-whisker plots show all points, minimum to maximum, with 25th to 75th interquartile range.
Extended Data Fig. 9 Additional experimental validation of representative MPRA fragments.
(a) GFP expression of reporters drawn from MPRA fragments (endogenous examples of combinatorial rules) from day 0 to day 7. R4 and R5 are instances of CEBPD/ZNF750 rule. R11 and R12 are instances of KLF4/ZNF750 rule. Scr is a scrambled control fragment. Yellow and black scale bars are 20um. (b) GFP expression from the experiment in (A) quantified. (c) ZNF750 knockout followed by luciferase reporter timecourse expression, demonstrating ZNF750 influence on instances of rules involving ZNF750. Data summarizes three independent experiments and is represented as mean ± s.e.m. One-sided T test was used for comparisons, *P < 0.05, **P < 0.01, ***P < 0.001.
Extended Data Fig. 10 Combinatorial motif pairs are enriched for genetic variants associated with skin phenotypes.
(a) LD score regression analysis showing differential heritability enrichment of various skin-related diseases and traits in different sets of CRE. The skin phenotypes include: psoriasis, dermatitis, acne, actinic keratosis and rosacea. The sets of CREs include: ‘Progenitor’ rules are CREs containing motif pairs that demonstrate decreasing accessibility and activity across the epidermal differentiation timecourse. ‘Early differentiation’ rules are CREs containing motif pairs those that demonstrate maximal accessibility and activity in the middle of the epidermal differentiation timecourse. ‘Late differentiation’ rules are CREs contained motif pairs that demonstrate maximal accessibility and activity at the end of the epidermal differentiation timecourse. ‘Union DHS’ is the union of DNase peaks across all ENCODE DNase datasets. ‘HepG2’ are DNase peaks in the HepG2 liver carcinoma cell line. ‘Union ATAC’ is the union of CREs across all time points of the differentiation timecourse. ‘ATAC timepoints’ are the CREs that are accessible in each time point of the epidermal differentiation timecourse. ‘Dynamic trajectories’ are clusters of CREs that display specific concordant patterns of dynamic accessibility across the epidermal differentiation timecourse. Plots show LDSC score enrichment coefficients with confidence intervals. (b) Predicted dysregulated TF motif lexicon networks by phenotype. Combinatorial rules were overlaid onto the predicted TF network of combinatorial motif interactions to demonstrate dysregulated TF subnetworks. Node size is the sum of the LD score regression coefficients for the significant combinatorial rules involving that node TF motif. Edges and nodes in black represent significantly enriched combinatorial rules, edges and nodes in gray did not pass statistical significance. Edge thickness represents the validated interaction effect of the rule (supermultiplicative, multiplicative, submultiplicative).
Supplementary information
Supplementary Information
Supplementary Materials and Methods, and Fig. 1.
Supplementary Tables
Table 1: List of all experiments with metadata. Table 2: ATAC-seq quality control metrics for technical replicates. Table 3: ATAC-seq quality control metrics for biological replicates. Table 4: PAS-seq quality control metrics. Table 5: ChIP–seq quality control metrics. Table 6: HiChIP quality control metrics. Table 7: Cross-validation folds, split by chromosomes. Table 8: List of ENCODE/Roadmap DNase-seq datasets used in transfer learning. Table 9: List of ENCODE/Roadmap TF ChIP–seq datasets used in transfer learning. Table 10: HOCOMOCO reduced set of position weight matrices with associated transcription factors. Table 11: List of FANTOM5 transcription factors. Table 12: List of MPRA sequences. Table 13: List of individual reporter sequences used in validation experiments. Table 14: List of qPCR primers for ChIP–ReChIP.
Rights and permissions
About this article
Cite this article
Kim, D.S., Risca, V.I., Reynolds, D.L. et al. The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation. Nat Genet 53, 1564–1576 (2021). https://doi.org/10.1038/s41588-021-00947-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41588-021-00947-3
This article is cited by
-
Integrated single-cell chromatin and transcriptomic analyses of human scalp identify gene-regulatory programs and critical cell types for hair and skin diseases
Nature Genetics (2023)
-
Obtaining genetics insights from deep learning via explainable artificial intelligence
Nature Reviews Genetics (2023)
-
Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings
Nature Genetics (2023)
-
Integrative analyses highlight functional regulatory variants associated with neuropsychiatric diseases
Nature Genetics (2023)
-
Gradual differentiation uncoupled from cell cycle exit generates heterogeneity in the epidermal stem cell layer
Nature Cell Biology (2022)