The arrangement (syntax) of transcription factor (TF) binding motifs is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution chromatin immunoprecipitation (ChIP)–nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using clustered regularly interspaced short palindromic repeat (CRISPR)-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data.
The raw sequencing data are available from GEO under accession number GSE137193. Data used to train, evaluate and interpret the BPNet models are found on zenodo at https://doi.org/10.5281/zenodo.3371215. Trained BPNet models and all the model interpretation results are on zenodo at https://doi.org/10.5281/zenodo.3371163. The BPNet model trained on ChIP–nexus data is available on Kipoi under the name BPNet-OSKN (http://kipoi.org/models/BPNet-OSKN/). Genome browser tracks showing observed/predicted ChIP–nexus signal and contribution scores for all factors are available at https://genome.ucsc.edu/s/mlweilert/mesc_OSKN_tracks. ATAC-seq data in mouse ESCs used in Fig. 2 and Supplementary Fig. 7 were obtained from GSE134680. Blacklisted regions used to filter genomic coordinates throughout the analysis are available at https://www.encodeproject.org/files/ENCFF547MET. RepeatMasker mm10 annotations were obtained from http://www.repeatmasker.org/genomes/mm10/RepeatMasker-rm405-db20140131/mm10.fa.out.gz. The nuclear magnetic resonance structure 1O4X used to render Sox2 and Oct1 in Fig. 3 is available at https://www.rcsb.org/structure/1o4x. TRANSFAC (v.7.0) was used to identify the TFIIIC B-box discussed in Fig. 3. The PH0134.1 Pbx PWM used for motif validation in Supplementary Fig. 8 and Extended Data Fig. 5 was obtained from JASPAR at http://jaspar.genereg.net/api/v1/matrix/PH0134.1.jaspar. The MA0141.1 Esrrb PWM used in Extended Data Fig. 5 was obtained from JASPAR at http://jaspar.genereg.net/api/v1/matrix/MA0141.1.jaspar. The transfer RNA database GtRNAdb (v.2.0, release 17.1) annotations and associated tRNAscan-SE scores used in Extended Data Fig. 5 were obtained from http://gtrnadb.ucsc.edu/GtRNAdb_archives/release17/genomes/eukaryota/Mmusc10/mm10-tRNAs.tar.gz. Source data are provided with this paper.
The BPNet software package is available at https://github.com/kundajelab/bpnet/. Code to reproduce the results is available at https://github.com/kundajelab/bpnet-manuscript (https://doi.org/10.5281/zenodo.4294813). The ChIP–nexus data processing pipeline is available at https://github.com/kundajelab/chip-nexus-pipeline. Software to trim and deduplicate ChIP–nexus reads is available at https://github.com/Avsecz/nimnexus/.
We thank M. Levine and R. Krumlauf for comments and J. Israeli for initial technical help. This work was funded by the Stowers Institute for Medical Research (SIMR), NIH grant no. R01HG010211 to J.Z. and NIH grant nos. DP2GM123485, U01HG009431 and R01HG009674 to A.K. Ž.A. was supported by the German Bundesministerium für Bildung und Forschung through the project MechML (no. 01IS18053F). A.S. was supported by the Stanford BioX Fellowship and HHMI International Student Research Fellowship. Illumina sequencing was performed at SIMR (A. Perera and M. Peterson) and the University of Kansas Medical Center Genomics Core, supported by NIH grant nos. U54HD090216, S10OD021743 and COBRE P30GM122731. Generation of CRISPR/Cas9 mouse ESC lines was performed by the following cores at SIMR: Genome Engineering (K. Delventhal, B. Miller and K. Weaver), Tissue Culture (C. Zhao, A. Murray, Y. Wang, O. Kenzior, Q. Jiang, S. Hime and S. Gosh) and Cytometry (J. Haug and D. DeGraffenreid).
J.Z. owns a patent on ChIP–nexus (no. 10287628). All other authors declare no competing interests.
Peer review information Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
a, Observed and predicted ChIP-nexus read counts mapping to the forward strand (dark) and the reverse strand (light) for the Zfp281 and Sall1 enhancers located on the held-out (test) chromosome 1. b, Alternative profile shape evaluation metrics showing the difference from random predictions: multinomial negative log-likelihood and Jensen-Shannon (JS) divergence. Both metrics were computed at different resolutions (from 1 bp to 10 bp windows) in held-out test chromosomes 1, 8 and 9. c, auPRC of profile predictions is high across various learning rates on the tuning set chromosomes 2, 3 and 4, demonstrating the robustness of the model. d, The deconvolutional layer slightly improves the profile predictive performance compared to a point-wise convolutional layer (deconvolution size=1). e, auPRC of profile predictions (top) and the Spearman correlation of total count predictions (bottom) for a range of relative total count weights α in the BPNet loss function parameterized as λ = α/2 n_obs. A relative weight of 1 (center) denotes equal weighting of the counts and profile loss functions. The best performance is obtained for α<1, showing that putting more weight on profile predictions aids both profile and count predictions. f, Observed and predicted total read counts for BPNet (top) and replicate experiments (bottom) across the four studied TFs along with the Spearman correlation coefficient.
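The profile shape metrics in panel b can be illustrated with a minimal Python sketch. It assumes that a lower resolution is obtained by summing counts in non-overlapping windows and that the JS divergence uses base-2 logarithms. The function names are hypothetical and are not taken from the BPNet package.

```python
import math
import numpy as np

def bin_profile(x, width):
    """Sum a 1D profile into non-overlapping windows of `width` bp.
    Assumption: a trailing partial window is dropped."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // width) * width
    return x[:n].reshape(-1, width).sum(axis=1)

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two non-negative profiles."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def multinomial_nll(counts, probs, eps=1e-12):
    """Negative log-likelihood of observed counts under predicted probabilities."""
    counts = np.asarray(counts, dtype=float)
    probs = np.asarray(probs, dtype=float) + eps
    probs = probs / probs.sum()
    log_coef = math.lgamma(counts.sum() + 1) - sum(math.lgamma(c + 1) for c in counts)
    return -(log_coef + float(np.sum(counts * np.log(probs))))
```

To evaluate at, say, 10 bp resolution under these assumptions, both the observed and the predicted profile would be passed through `bin_profile(profile, 10)` before calling either metric.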
Extended Data Fig. 2 Removal of long motifs in retrotransposons and clustering of motifs by similarity.
a, Among all motifs discovered by TF-MoDISco, 18 motifs display unusually high information content (IC) of >30 bits (green). The expected short motifs are shown in gray. b, Histogram of the overlap of short motifs (gray) and long motifs (green) with repeat elements shows that long motifs overlap >80% with annotated retrotransposons. c, Long motifs with their PFM, ID, fraction of motif instances overlapping with a repeat and the most frequent (top class) RepeatMasker annotation. Highlighted within the repeat elements are potential motif instances of Oct4-Sox2, Sox2, Nanog and Klf4 as indicated by the CWMs. d, To identify a set of representative motifs from the 33 short motifs discovered for different TFs (information content <30 bits, shown in Supplementary Fig. 3) and remove redundant short motifs, motifs were clustered by similarity using hierarchical clustering. The results were then manually inspected to select clusters that separate known motifs that are distinct (for example, Oct4-Oct4 resembles the known MORE and PORE motifs that bind Oct4 homodimers, which is different from the monomerically bound Oct4 motif). Among very similar motifs within a cluster, we then selected the most abundant motif that was discovered for the most relevant TF (if known). The 11 representative motifs that we selected are shown on the left. Non-canonical motifs were given a name (Nanog-alt for Nanog alternative, Klf4-long for longer Klf4).
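The 30-bit cutoff in panels a and d can be illustrated with a short sketch of how the information content of a position frequency matrix (PFM) might be computed. The uniform background and the function names are assumptions made for illustration. The legend does not state which background was used.

```python
import numpy as np

def information_content(pfm, background=None, eps=1e-9):
    """Total information content (bits) of a PFM with shape (length, 4).
    Assumption: uniform background unless one is supplied."""
    pfm = np.asarray(pfm, dtype=float) + eps
    pfm = pfm / pfm.sum(axis=1, keepdims=True)
    if background is None:
        background = np.full(pfm.shape[1], 1.0 / pfm.shape[1])
    return float(np.sum(pfm * np.log2(pfm / background)))

def is_long_motif(pfm, threshold_bits=30.0):
    """Flag motifs whose total IC exceeds the 30-bit cutoff named in the legend."""
    return information_content(pfm) > threshold_bits
```

Under a uniform background each fully conserved position contributes 2 bits, so a motif needs more than 15 perfectly conserved positions to exceed the cutoff.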
Extended Data Fig. 3 BPNet and TF-MoDISco outperform traditional methods in motif discovery and the mapping of motif instances.
a, Motifs discovered by ChExMix, HOMER and MEME for Oct4, Sox2, Nanog and Klf4 ChIP-nexus peaks that are closest to the 11 primary representative BPNet motifs (top row). A green checkmark denotes that the discovered motif is similar to the BPNet motif. b, Number of motif instances located up to 500 bp (top) or 100 bp (bottom) away from the ChIP-nexus peak summits that show a strong ChIP-nexus footprint. Only motif instances in peaks from held-out test chromosomes (1, 8 and 9) were used for the evaluation. The x-axis shows the top N motif instances from each method, sorted in descending order of score (PWM log-odds score or CWM contribution score). For BPNet-augm, the center of the genomic region for which the contribution scores were computed was randomly jittered up to 200 bp away from the peak summit. This augmentation prevents BPNet from using the positional information of the peak summit. In the final column (Nanog replicate), the Nanog ChIP-nexus footprint was measured in a separate biological replicate using a different antibody (α-Nanog from Abcam, ab214549), which was not used during training or evaluation.
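The jitter augmentation described for BPNet-augm can be sketched as follows. The 1,000 bp window width, the uniform sampling of the shift and the function name are assumptions for illustration. Only the 200 bp maximum shift comes from the legend.

```python
import numpy as np

def jittered_window(summit, width=1000, max_jitter=200, rng=None):
    """Return (start, end) of a window whose center is shifted by a random
    offset of at most `max_jitter` bp from the peak summit.
    Assumptions: uniform integer shift and a fixed window width."""
    rng = np.random.default_rng() if rng is None else rng
    shift = int(rng.integers(-max_jitter, max_jitter + 1))  # upper bound is exclusive
    center = summit + shift
    start = center - width // 2
    return start, start + width
```

Because the summit can then fall anywhere within 200 bp of the window center, a model scored on such windows cannot rely on the summit always sitting at the same position.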
Extended Data Fig. 4 BPNet training on ChIP-nexus profiles is faster and yields more accurate motif instances than a binary classification model.
a, Predictive performance as measured by the precision-recall curve of the binary classification models predicting the presence or absence of ChIP-nexus peaks from 1 kb DNA sequences evaluated across the held-out (tuning/validation) chromosomes 2, 3 and 4. A model trained only to classify the sequences is outperformed by a model trained to also predict the ChIP-nexus profiles from DNA sequence in addition to classifying them (shown in blue: without profile bias-correction in light blue and with bias-correction in dark blue). b, Training time of the binary classification model trained genome-wide and the sequence-to-profile model (BPNet) trained in ChIP-nexus peaks. c, Motifs detected by TF-MoDISco using the contribution scores in ChIP-nexus peaks of the sequence-to-profile BPNet (profile reg.) or the binary classification model (binary class). A light color denotes a high number of seqlets for each motif. Motifs not discovered or motifs supported by fewer than 100 seqlets are shown in black. Questionable motifs are displayed separately on the right. d, The number of motif instances (within 500 bp of the ChIP-nexus peak summit) showing a ChIP-nexus footprint (y-axis) within the top N motif instances with the highest contribution scores (x-axis) from the held-out (test) chromosomes 1, 8 and 9. A site was considered to show a ChIP-nexus footprint if the number of reads at the position of the aggregate footprint summit (averaged across both strands) was higher than the 90th percentile value of all motif instances detected by the profile regression model for the corresponding TF (that is, the same as in Extended Data Fig. 3b).
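The footprint criterion in panel d can be expressed as a short sketch. The array names and the function name are hypothetical. The reference set stands for the read counts at the footprint summit of all motif instances detected by the profile regression model for the same TF.

```python
import numpy as np

def footprint_hits_in_top_n(contrib_scores, summit_counts, reference_counts, percentile=90):
    """Cumulative number of footprint-positive instances among the top N
    instances ranked by contribution score (N = 1 ... number of instances).

    summit_counts: reads at the aggregate footprint summit position,
        averaged across both strands, one value per motif instance.
    reference_counts: the same quantity for the reference instance set.
    """
    contrib_scores = np.asarray(contrib_scores, dtype=float)
    summit_counts = np.asarray(summit_counts, dtype=float)
    threshold = np.percentile(reference_counts, percentile)
    order = np.argsort(-contrib_scores, kind="stable")  # highest score first
    has_footprint = summit_counts[order] > threshold
    return np.cumsum(has_footprint)
```

Plotting the returned array against N would give a curve of the kind shown in panel d, under the assumptions stated above.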
Extended Data Fig. 5 Strict motif spacings are found on retrotransposons and indirectly bound motifs can be validated.
a, To show that TF binding occurs with strict spacings in retrotransposons and that this is likely ancestral, the RLTR9E N6 motif is shown as an example. Sequences of the individual instances in the genome were sorted by the Kimura distance from the consensus motif, with the most similar sequences on top (which are likely more ancestral). Nanog, Sox2 and Klf4 ChIP-nexus binding footprints are shown in the same order on the right (+ strand reads in red, - strand reads in blue), revealing that the binding site spacing is largely constant across all sequences. b, Analysis of the most frequent distances between motif pairs (with >500 co-occurrences, distance measured at the trimmed motifs’ centers). The top 1% most frequent distances mapped to ERVs in 83% of cases and were often longer than 20 bp. c, To validate the identified Zic3 motif instances, Zic3 ChIP-nexus experiments were performed. The average signal across the Zic3 instances reveals a strong Zic3 binding footprint. d, A similar validation was performed for the Esrrb motif instances, revealing that the Esrrb ChIP-nexus signal is present but more diffuse at the discovered Esrrb motif instances. e, To better understand the binding of Oct4 to the B-box, which is frequently found in tRNA genes, tRNA-overlapping B-box motif instances were reoriented to match the transcriptional direction and sorted by tRNA gene start proximity. This reveals Oct4 binding at tRNA gene start/stop sites. f, Amino acid anti-codons and copy counts of the tRNAs that overlapped with the B-box motif instances.
Extended Data Fig. 6 Additional genomic in-silico interaction analyses confirm the directional effects.
a, Example genomic in-silico mutagenesis analysis at the distal Oct4 enhancer. Predicted ChIP-nexus profiles and the contribution scores greatly decrease at both motifs (Oct4-Sox2 and Nanog) when erasing the Oct4-Sox2 motif (through random sequence insertion). By contrast, when the Nanog motif is erased (right), the predicted profile and the contribution scores of the Oct4-Sox2 motif remain intact. b, Such a directional effect of motifs can be quantified by the corrected binding fold change (Supplementary Fig. 10a) for all motif pairs in the genome and visualized as a scatterplot. c, Example scatterplot for the interaction between Sox2 and Nanog. Sox2 shows a positive directional effect on Nanog that is most pronounced for short motif distances (<35 bp). d, Predicted binding fold changes for all motif pairs in genomic sequences.
Extended Data Fig. 7 Helical periodicity of Nanog motifs is not discovered with traditional methods and requires BPNet’s large receptive field.
a, The pairwise spacing of Nanog motif instances located up to 100 bp away from the ChIP-nexus peak summits in all possible strand orientations (rows) for different methods and/or thresholds (columns). Results for all chromosomes are shown. b, The pairwise spacing of Nanog motif instances when BPNet is trained with different numbers of convolutional layers (Fig. 1g). BPNet with only a single convolutional layer (first column) is unable to capture the 10 bp periodicity due to the limited receptive field similar to PWMs.
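The pairwise spacing analysis described in this caption can be sketched as a histogram of center-to-center distances, split by strand orientation. This is a minimal illustration under assumptions: motif instances are represented as hypothetical (region_id, center, strand) tuples, and the function name and the 150 bp cutoff are chosen for the example rather than taken from the paper.

```python
from collections import Counter
from itertools import combinations


def pairwise_spacing_histogram(instances, max_dist=150):
    """Count center-to-center distances between motif instances per orientation.

    instances : iterable of (region_id, center, strand) tuples (assumed layout)
    Returns a dict mapping an orientation key such as "++" or "+-"
    (upstream strand followed by downstream strand) to a Counter of distances.
    """
    by_region = {}
    for region, center, strand in instances:
        by_region.setdefault(region, []).append((center, strand))

    hist = {}
    for sites in by_region.values():
        sites.sort()  # order instances by genomic position within the region
        for (c1, s1), (c2, s2) in combinations(sites, 2):
            dist = c2 - c1
            if 0 < dist <= max_dist:
                hist.setdefault(s1 + s2, Counter())[dist] += 1
    return hist
```

A roughly 10 bp periodicity would show up as recurring peaks in these counters at distances separated by about one helical turn.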
a, Nanog and Sox2 ChIP-nexus profiles normalized to reads per million (RPM) show highly similar profiles and read counts across known enhancer regions for wild-type (Wt) and CRISPR ESCs with either a mutated Sox2 motif (Sox2 CRISPR) or mutated Nanog motif (Nanog CRISPR) at a selected genomic region (chr10: 85,539,626-85,539,777). b, Pairwise comparisons of ChIP-nexus RPM counts between Wt and CRISPR ESCs at bound genomic regions (151 bp centered on the respective motif) with Sox2 ChIP-nexus counts on Sox2 motifs and Nanog ChIP-nexus counts on Nanog motifs (motifs based on the original model). The bulk data (gray) are highly correlated and known enhancer regions as shown in Supplementary Fig. 5 (green) are highly reproducible between ESC lines. Note the specific loss of counts in the selected mutated genomic region (red) over wild-type. Pearson correlations (Rp) between groups are shown in the top left of each scatter plot.
a, Observed read counts (Obs) and Predicted read counts (Pred) for BPNet trained on ChIP-seq data for the Zfp281 and Lefty1 enhancers located on the held-out (test) chromosome 1, with forward strand reads (dark) and reverse strand reads (light). For Obs, a sliding window of 50 bp was used to smooth the raw 5’ end read counts (line); raw counts are shown as points on the bottom at y = 0. b, BPNet predicts the ChIP-seq profile shape better than replicates. Multinomial log-likelihood difference compared to the constant model was used to evaluate the profile shape quality at different resolutions (from 1 bp to 10 bp windows) in held-out chromosomes 1, 8 and 9. A log-likelihood of 0 corresponds to the constant model. Multinomial log-likelihood was conditioned on the observed number of total counts as in the training loss. c, Total counts in 1 kb regions can be predicted by BPNet (red) at decent accuracy (measured by Pearson correlation with log(1+observed values)). They do not surpass replicate performance (blue), but are well above the Input control (grey). d, Obs and Pred as in panel a, as well as contribution scores for the known Oct4 enhancer. Motif instances derived by CWM scanning are highlighted with a green box.
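The profile shape metric in panel b can be sketched as follows. This is a minimal illustration, assuming that the constant model is a uniform profile over the region and that lower resolutions are obtained by summing counts and probabilities in non-overlapping windows; both are assumptions, and the function names are hypothetical.

```python
import math


def multinomial_loglik(counts, probs):
    """Log-probability of counts under a multinomial with n = sum(counts)."""
    n = sum(counts)
    ll = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    # Positions with zero counts contribute nothing; probs must be > 0
    # wherever counts are > 0, otherwise math.log raises ValueError.
    ll += sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)
    return ll


def bin_sum(values, window):
    """Sum values in non-overlapping windows to reduce the resolution."""
    return [sum(values[i:i + window]) for i in range(0, len(values), window)]


def loglik_gain_over_constant(counts, pred_probs, window=1):
    """Log-likelihood of the predicted profile minus that of a uniform profile."""
    length = len(counts)
    counts_b = bin_sum(counts, window)
    pred_b = bin_sum(pred_probs, window)
    const_b = bin_sum([1.0 / length] * length, window)  # assumed constant model
    return multinomial_loglik(counts_b, pred_b) - multinomial_loglik(counts_b, const_b)
```

Because both terms are conditioned on the same observed total, the multinomial coefficient cancels in the difference, and a value of 0 means the prediction is no better than the constant profile.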
Extended Data Fig. 10 BPNet trained on ChIP-seq discovers similar motifs and recovers the Nanog motif periodicity.
a, BPNet applied to ChIP-seq discovers the majority of the motifs identified by BPNet applied to ChIP-nexus data. The models ‘ChIP-nexus profile cr’ and ‘ChIP-seq profile cr’ were trained on the union of the ChIP-nexus/seq peaks predicting Oct4, Sox2, and Nanog binding and were interpreted on the intersection of the ChIP-nexus/seq peaks. b, The pairwise spacing of Nanog motif instances derived from the ChIP-seq profile model in all possible strand orientations shows helical periodicity (similar to Extended Data Fig. 7a). c, Motif instance calling with CWM scanning has higher accuracy for BPNet trained on ChIP-nexus data than for BPNet trained on ChIP-seq data (evaluated on the union of the ChIP-nexus/seq peaks, 500 bp around the peak summit using ChIP-nexus footprints as ground truth). d, Training a sequence-to-profile model on ChIP-seq data yields more accurate motif instances (500 bp around the ChIP-seq peak summits using ChIP-nexus footprints as ground truth) than training a binary classification model or using a PWM scanning approach using FIMO for motifs derived directly from ChIP-nexus data. See Extended Data Figs. 3b, 4d and Supplementary Note for more details.
Supplementary Figs. 1–14 and Note
(1) List of all ChIP–nexus and ChIP–seq replicate experiments and associated quality control metrics. (2) Further binary classification metrics corresponding to Extended Data Fig. 4a. (3) Sequences of gRNA and single-stranded oligo donors for CRISPR mutations.
Clustered motifs and their labels. Motifs were obtained by TF–Modisco run on BPNet models trained on six different datasets: (1) seq/profile.peaks-union (ChIP–seq profile model trained on a combination of ChIP–nexus and ChIP–seq peaks); (2) seq/binary (binary classification model trained on genome-wide ChIP–seq peaks); (3) seq/profile (ChIP–seq profile model trained on ChIP–nexus peaks); (4) nexus/profile.peaks-union (ChIP–nexus profile model trained on a combination of ChIP–nexus/ChIP–seq peaks); (5) nexus/binary (binary classification model trained on genome-wide ChIP–nexus peaks); and (6) nexus/profile (ChIP–nexus profile model trained on ChIP–nexus peaks). Each motif logo shows the sequence information content of a PFM. The logo title consists of the manually assigned motif label (for example, TE1, Oct4–Sox2) and the motif ID composed from the model name, the task name and TF–Modisco motif ID (for example, seq/profile/Nanog/m0_p13).
BPNet profile predictions averaged across 128 random sequences with two motifs inserted at different positions. Centers of the motifs are marked by the vertical gray line; motif distance is shown on the right. For each motif, the predicted profile of the corresponding TF is shown on the y axis.
See Supplementary Video 1.
Avsec, Ž., Weilert, M., Shrikumar, A. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet 53, 354–366 (2021). https://doi.org/10.1038/s41588-021-00782-6