Base-resolution models of transcription-factor binding reveal soft motif syntax

Abstract

The arrangement (syntax) of transcription factor (TF) binding motifs is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution chromatin immunoprecipitation (ChIP)–nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using clustered regularly interspaced short palindromic repeat (CRISPR)-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data.

Fig. 1: BPNet predicts ChIP–nexus signal at base resolution.
Fig. 2: TF motifs and their genomic instances can be accurately derived from BPNet using interpretation tools.
Fig. 3: Discovery of composite motifs and indirect binding footprints.
Fig. 4: In silico motif interaction analysis reveals TF cooperativity and motif syntax.
Fig. 5: Pervasive helical periodicity between Nanog and partner motifs.
Fig. 6: CRISPR mutations in a Sox2 and a Nanog motif validate BPNet predictions.

Data availability

The raw sequencing data are available from GEO under accession number GSE137193. Data used to train, evaluate and interpret the BPNet models are available on Zenodo at https://doi.org/10.5281/zenodo.3371215. Trained BPNet models and all the model interpretation results are available on Zenodo at https://doi.org/10.5281/zenodo.3371163. The BPNet model trained on ChIP–nexus data is available on Kipoi under the name BPNet-OSKN (http://kipoi.org/models/BPNet-OSKN/). Genome browser tracks showing observed/predicted ChIP–nexus signal and contribution scores for all factors are available at https://genome.ucsc.edu/s/mlweilert/mesc_OSKN_tracks. ATAC-seq data in mouse ESCs used in Fig. 2 and Supplementary Fig. 7 were obtained from GSE134680. Blacklisted regions used to filter genomic coordinates throughout the analysis are available at https://www.encodeproject.org/files/ENCFF547MET. RepeatMasker mm10 annotations were obtained from http://www.repeatmasker.org/genomes/mm10/RepeatMasker-rm405-db20140131/mm10.fa.out.gz. The nuclear magnetic resonance structure 1O4X used to render Sox2 and Oct1 in Fig. 3 is available at https://www.rcsb.org/structure/1o4x. TRANSFAC (v.7.0) was used to identify the TFIIIC B-box discussed in Fig. 3. The PH0134.1 Pbx PWM used for motif validation in Supplementary Fig. 8 and Extended Data Fig. 5 was obtained from JASPAR at http://jaspar.genereg.net/api/v1/matrix/PH0134.1.jaspar. The MA0141.1 Esrrb PWM used in Extended Data Fig. 5 was obtained from JASPAR at http://jaspar.genereg.net/api/v1/matrix/MA0141.1.jaspar. The transfer RNA database GtRNAdb (v.2.0, release 17.1) annotations and associated tRNAscan-SE scores used in Extended Data Fig. 5 were obtained from http://gtrnadb.ucsc.edu/GtRNAdb_archives/release17/genomes/eukaryota/Mmusc10/mm10-tRNAs.tar.gz. Source data are provided with this paper.

Code availability

The BPNet software package is available at https://github.com/kundajelab/bpnet/. Code to reproduce the results is available at https://github.com/kundajelab/bpnet-manuscript (https://doi.org/10.5281/zenodo.4294813). The ChIP–nexus data processing pipeline is available at https://github.com/kundajelab/chip-nexus-pipeline. Software to trim and deduplicate ChIP–nexus reads is available at https://github.com/Avsecz/nimnexus/.

References

  1. Gerstein, M. B. et al. Architecture of the human regulatory network derived from ENCODE data. Nature 489, 91–100 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

    Article  Google Scholar 

  3. Roadmap Epigenomics Consortiumet al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).

    Article  PubMed Central  Google Scholar 

  4. Morgunova, E. & Taipale, J. Structural perspective of cooperative transcription factor binding. Curr. Opin. Struct. Biol. 47, 1–8 (2017).

    Article  CAS  PubMed  Google Scholar 

  5. Zinzen, R. P., Senger, K., Levine, M. & Papatsenko, D. Computational models for neurogenic gene expression in the Drosophila embryo. Curr. Biol. 16, 1358–1365 (2006).

    Article  CAS  PubMed  Google Scholar 

  6. Fiore, C. & Cohen, B. A. Interactions between pluripotency factors specify cis-regulation in embryonic stem cells. Genome Res. 26, 778–786 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Sayal, R., Dresch, J. M., Pushel, I., Taylor, B. R. & Arnosti, D. N. Quantitative perturbation-based analysis of gene expression predicts enhancer activity in early Drosophila embryo. eLife 5, e08445 (2016).

    Article  PubMed  PubMed Central  Google Scholar 

  8. Erceg, J. et al. Subtle changes in motif positioning cause tissue-specific effects on robustness of an enhancer’s activity. PLoS Genet. 10, e1004060 (2014).

    Article  PubMed  PubMed Central  Google Scholar 

  9. Crocker, J. & Ilsley, G. R. Using synthetic biology to study gene regulatory evolution. Curr. Opin. Genet. Dev. 47, 91–101 (2017).

    Article  CAS  PubMed  Google Scholar 

  10. Farley, E. K. et al. Suboptimization of developmental enhancers. Science 350, 325–328 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Swanson, C. I., Evans, N. C. & Barolo, S. Structural rules and complex regulatory circuitry constrain expression of a Notch- and EGFR-regulated eye enhancer. Dev. Cell 18, 359–370 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Liu, F. & Posakony, J. W. Role of architecture in the function and specificity of two Notch-regulated transcriptional enhancer modules. PLoS Genet. 8, e1002796 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Lusk, R. W. & Eisen, M. B. Evolutionary mirages: selection on binding site composition creates the illusion of conserved grammars in Drosophila enhancers. PLoS Genet. 6, e1000829 (2010).

    Article  PubMed  PubMed Central  Google Scholar 

  14. Kulkarni, M. M. & Arnosti, D. N. Information display by transcriptional enhancers. Development 130, 6569–6575 (2003).

    Article  CAS  PubMed  Google Scholar 

  15. Liberman, L. M. & Stathopoulos, A. Design flexibility in cis-regulatory control of gene expression: synthetic and comparative evidence. Dev. Biol. 327, 578–589 (2009).

    Article  CAS  PubMed  Google Scholar 

  16. Junion, G. et al. A transcription factor collective defines cardiac cell fate and reflects lineage history. Cell 148, 473–486 (2012).

    Article  CAS  PubMed  Google Scholar 

  17. King, D. M. et al. Synthetic and genomic regulatory elements reveal aspects of cis-regulatory grammar in mouse embryonic stem cells. eLife 9, e41279 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Bailey, T. L. et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37, W202–W208 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Hughes, J. D., Estep, P. W., Tavazoie, S. & Church, G. M. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296, 1205–1214 (2000).

    Article  CAS  PubMed  Google Scholar 

  20. Pavesi, G., Mereghetti, P., Mauri, G. & Pesole, G. Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 32, W199–W203 (2004).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Thijs, G. et al. A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 17, 1113–1122 (2001).

    Article  CAS  PubMed  Google Scholar 

  22. Cheng, Q. et al. Computational identification of diverse mechanisms underlying transcription factor-DNA occupancy. PLoS Genet. 9, e1003571 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Guo, Y., Mahony, S. & Gifford, D. K. High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Comput. Biol. 8, e1002638 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Wang, J. et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 22, 1798–1812 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Lee, D., Karchin, R. & Beer, M. A. Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 21, 2167–2180 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Erives, A. & Levine, M. Coordinate enhancers share common organizational features in the Drosophila genome. Proc. Natl Acad. Sci. USA 101, 3851–3856 (2004).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. Papatsenko, D., Goltsev, Y. & Levine, M. Organization of developmental enhancers in the Drosophila embryo. Nucleic Acids Res. 37, 5665–5677 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Ng, F. S. L. et al. Constrained transcription factor spacing is prevalent and important for transcriptional control of mouse blood cells. Nucleic Acids Res. 42, 13513–13524 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. Kharchenko, P. V., Tolstorukov, M. Y. & Park, P. J. Design and analysis of ChIP–seq experiments for DNA-binding proteins. Nat. Biotechnol. 26, 1351–1359 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Zhang, Y. et al. Model-based analysis of ChIP–seq (MACS). Genome Biol. 9, R137 (2008).

    Article  PubMed  PubMed Central  Google Scholar 

  31. Rozowsky, J. et al. PeakSeq enables systematic scoring of ChIP–seq experiments relative to controls. Nat. Biotechnol. 27, 66–75 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Guo, Y. et al. Discovering homotypic binding events at high spatial resolution. Bioinformatics 26, 3028–3034 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Kuan, P. F. et al. A statistical framework for the analysis of ChIP–seq data. J. Am. Stat. Assoc. 106, 891–903 (2011).

    Article  CAS  PubMed  Google Scholar 

  34. Hartonen, T., Sahu, B., Dave, K., Kivioja, T. & Taipale, J. PeakXus: comprehensive transcription factor binding site discovery from ChIP–Nexus and ChIP–Exo experiments. Bioinformatics 32, i629–i638 (2016).

    Article  CAS  PubMed  Google Scholar 

  35. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

    Article  CAS  PubMed  Google Scholar 

  36. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  37. Quang, D. & Xie, X. FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods 166, 40–47 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Bogard, N., Linder, J., Rosenberg, A. B. & Seelig, G. A deep neural network for predicting and engineering alternative polyadenylation. Cell 178, 91–106 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  39. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Lanchantin, J., Singh, R., Wang, B. & Qi, Y. Deep motif dashboard: visualizing and understanding genomic sequences using deep neural networks. Pac. Symp. Biocomput. 22, 254–265 (2017).

    PubMed  PubMed Central  Google Scholar 

  41. Shrikumar, A. et al. TF-MoDISco v0.4.2.2-alpha: technical note. Preprint at arXiv https://arxiv.org/abs/1811.00416 (2018).

  42. Jha, A., Aicher, J. K., Singh, D. & Barash, Y. Improving interpretability of deep learning models: splicing codes as a case study. Preprint at bioRxiv https://doi.org/10.1101/700096 (2019).

  43. Greenside, P., Shimko, T., Fordyce, P. & Kundaje, A. Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics 34, i629–i637 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  44. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Gordân, R., Hartemink, A. J. & Bulyk, M. L. Distinguishing direct versus indirect transcription factor-DNA interactions. Genome Res. 19, 2090–2100 (2009).

    Article  PubMed  PubMed Central  Google Scholar 

  46. Mariani, L., Weinand, K., Vedenko, A., Barrera, L. A. & Bulyk, M. L. Identification of human lineage-specific transcriptional coregulators enabled by a glossary of binding modules and tunable genomic backgrounds. Cell Syst. 5, 187–201 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  47. Bailey, T. L. & Machanick, P. Inferring direct DNA binding from ChIP–seq. Nucleic Acids Res. 40, e128 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Rhee, H. S. & Pugh, B. F. Comprehensive genome-wide protein–DNA interactions detected at single-nucleotide resolution. Cell 147, 1408–1419 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  49. He, Q., Johnston, J. & Zeitlinger, J. ChIP–nexus enables improved detection of in vivo transcription factor binding footprints. Nat. Biotechnol. 33, 395–401 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  50. Yamada, N., Lai, W. K. M., Farrell, N., Pugh, B. F. & Mahony, S. Characterizing protein–DNA binding event subtypes in ChIP–exo data. Bioinformatics 35, 903–913 (2019).

    Article  CAS  PubMed  Google Scholar 

  51. Starick, S. R. et al. ChIP–exo signal associated with DNA-binding motifs provides insight into the genomic binding of the glucocorticoid receptor and cooperating transcription factors. Genome Res. 25, 825–835 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  52. Papagianni, A. et al. Capicua controls Toll/IL-1 signaling targets independently of RTK regulation. Proc. Natl Acad. Sci. USA 115, 1807–1812 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  53. Reményi, A. et al. Crystal structure of a POU/HMG/DNA ternary complex suggests differential assembly of Oct4 and Sox2 on two enhancers. Genes Dev. 17, 2048–2059 (2003).

    Article  PubMed  PubMed Central  Google Scholar 

  54. Banerji, J., Rusconi, S. & Schaffner, W. Expression of a beta-globin gene is enhanced by remote SV40 DNA sequences. Cell 27, 299–308 (1981).

    Article  CAS  PubMed  Google Scholar 

  55. Spitz, F. & Furlong, E. E. M. Transcription factors: from enhancer binding to developmental control. Nat. Rev. Genet. 13, 613–626 (2012).

    Article  CAS  PubMed  Google Scholar 

  56. Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548 (2019).

    Article  CAS  PubMed  Google Scholar 

  57. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (eds. He, K. et al.) 770–778 (IEEE, 2016); https://doi.org/10.1109/CVPR.2016.90

  58. Van Den Oord, A. & Dieleman, S. WaveNet: a generative model for raw audio. DeepMind https://deepmind.com/blog/article/wavenet-generative-model-raw-audio (2016).

  59. Terooatea, T. W., Pozner, A. & Buck-Koehntop, B. A. PAtCh-Cap: input strategy for improving analysis of ChIP–exo data sets and beyond. Nucleic Acids Res. 44, e159 (2016).

    PubMed  PubMed Central  Google Scholar 

  60. Whyte, W. A. et al. Enhancer decommissioning by LSD1 during embryonic stem cell differentiation. Nature 482, 221–225 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  61. Novo, C. L. et al. Long-range enhancer interactions are prevalent in mouse embryonic stem cells and are reorganized upon pluripotent state transition. Cell Rep. 22, 2615–2627 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  62. Festuccia, N. et al. Esrrb extinction triggers dismantling of naïve pluripotency and marks commitment to differentiation. EMBO J. 37, e95476 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

  63. Moorthy, S. D. et al. Enhancers and super-enhancers have an equivalent regulatory role in embryonic stem cells through regulation of single or multiple genes. Genome Res. 27, 246–258 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  64. Avsec, Ž. et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37, 592–600 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  65. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In Proc. 34th International Conference on Machine Learning 3145–3153 (2017).

  66. Chew, J.-L. et al. Reciprocal transcriptional regulation of Pou5f1 and Sox2 via the Oct4/Sox2 complex in embryonic stem cells. Mol. Cell. Biol. 25, 6031–6046 (2005).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  67. Chen, X. et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133, 1106–1117 (2008).

    Article  CAS  PubMed  Google Scholar 

  68. Mitsui, K. et al. The homeoprotein Nanog is required for maintenance of pluripotency in mouse epiblast and ES cells. Cell 113, 631–642 (2003).

    Article  CAS  PubMed  Google Scholar 

  69. Loh, Y.-H. et al. The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nat. Genet. 38, 431–440 (2006).

    Article  CAS  PubMed  Google Scholar 

  70. Salmon-Divon, M., Dvinge, H., Tammoja, K. & Bertone, P. PeakAnalyzer: genome-wide annotation of chromatin binding and modification loci. BMC Bioinformatics 11, 415 (2010).

    Article  PubMed  PubMed Central  Google Scholar 

  71. Gagliardi, A. et al. A direct physical interaction between Nanog and Sox2 regulates embryonic stem cell self-renewal. EMBO J. 32, 2231–2247 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  72. He, X. et al. A biophysical model for analysis of transcription factor interaction and binding site arrangement from genome-wide binding data. PLoS ONE 4, e8155 (2009).

    Article  PubMed  PubMed Central  Google Scholar 

  73. Xie, L. et al. A dynamic interplay of enhancer elements regulates Klf4 expression in naïve pluripotency. Genes Dev. 31, 1795–1808 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  74. Mistri, T. K. et al. Dynamic changes in Sox2 spatio-temporal expression promote the second cell fate decision through Fgf4/Fgfr2 signaling in preimplantation mouse embryos. Biochem. J. 475, 1075–1089 (2018).

    Article  CAS  PubMed  Google Scholar 

  75. Tokuzawa, Y. et al. Fbx15 is a novel target of Oct3/4 but is dispensable for embryonic stem cell self-renewal and mouse development. Mol. Cell. Biol. 23, 2699–2708 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  76. Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  77. Friman, E. T. et al. Dynamic regulation of chromatin accessibility by pluripotency transcription factors across the cell cycle. eLife 8, e5008 (2019).

    Article  Google Scholar 

  78. Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).

    Article  CAS  PubMed  Google Scholar 

  79. Tomilin, A. et al. Synergism with the coactivator OBF-1 (OCA-B, BOB-1) is mediated by a specific POU dimer configuration. Cell 103, 853–864 (2000).

    Article  CAS  PubMed  Google Scholar 

  80. Botquin, V. et al. New POU dimer configuration mediates antagonistic control of an osteopontin preimplantation enhancer by Oct-4 and Sox-2. Genes Dev. 12, 2073–2090 (1998).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  81. Mistri, T. K. et al. Selective influence of Sox2 on POU transcription factor binding in embryonic and neural stem cells. EMBO Rep. 16, 1177–1191 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  82. Ambrosetti, D. C., Basilico, C. & Dailey, L. Synergistic activation of the fibroblast growth factor 4 enhancer by Sox2 and Oct-3 depends on protein–protein interactions facilitated by a specific spatial arrangement of factor binding sites. Mol. Cell. Biol. 17, 6321–6329 (1997).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  83. Merino, F., Bouvier, B. & Cojocaru, V. Cooperative DNA recognition modulated by an interplay between protein–protein interactions and DNA-mediated allostery. PLoS Comput. Biol. 11, e1004287 (2015).

    Article  PubMed  PubMed Central  Google Scholar 

  84. Hayashi, Y. et al. Structure-based discovery of NANOG variant with enhanced properties to promote self-renewal and reprogramming of pluripotent stem cells. Proc. Natl Acad. Sci. USA 112, 4666–4671 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  85. Wang, J., Levasseur, D. N. & Orkin, S. H. Requirement of Nanog dimerization for stem cell self-renewal and pluripotency. Proc. Natl Acad. Sci. USA 105, 6326–6331 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  86. Todd, C. D., Deniz, Ö., Taylor, D. & Branco, M. R. Functional evaluation of transposable elements as enhancers in mouse embryonic and trophoblast stem cells. eLife 8, e44344 (2019).

  87. Bourque, G. et al. Evolution of the mammalian transcription factor binding repertoire via transposable elements. Genome Res. 18, 1752–1762 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  88. Kunarso, G. et al. Transposable elements have rewired the core regulatory network of human embryonic stem cells. Nat. Genet. 42, 631–634 (2010).

    Article  CAS  PubMed  Google Scholar 

  89. Sundaram, V. et al. Functional cis-regulatory modules encoded by mouse-specific endogenous retrovirus. Nat. Commun. 8, 14550 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  90. Xie, D. et al. Rewirable gene regulatory networks in the preimplantation embryonic development of three mammalian species. Genome Res. 20, 804–815 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  91. Jankowski, A., Szczurek, E., Jauch, R., Tiuryn, J. & Prabhakar, S. Comprehensive prediction in 78 human cell lines reveals rigidity and compactness of transcription factor dimers. Genome Res. 23, 1307–1318 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  92. Jolma, A. et al. DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature 527, 384–388 (2015).

    Article  CAS  PubMed  Google Scholar 

  93. Mullin, N. P. et al. Distinct contributions of tryptophan residues within the dimerization domain to Nanog function. J. Mol. Biol. 429, 1544–1553 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  94. Kim, S. et al. Probing allostery through DNA. Science 339, 816–819 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  95. Soufi, A. et al. Pioneer transcription factors target partial DNA motifs on nucleosomes to initiate reprogramming. Cell 161, 555–568 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  96. Soufi, A., Donahue, G. & Zaret, K. S. Facilitators and impediments of the pluripotency reprogramming factors’ initial engagement with the genome. Cell 151, 994–1004 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  97. Winter, D. R., Song, L., Mukherjee, S., Furey, T. S. & Crawford, G. E. DNase-seq predicts regions of rotational nucleosome stability across diverse human cell types. Genome Res. 23, 1118–1129 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  98. Zhong, J. et al. Mapping nucleosome positions using DNase-seq. Genome Res. 26, 351–364 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  99. Jin, H., Rube, H. T. & Song, J. S. Categorical spectral analysis of periodicity in nucleosomal DNA. Nucleic Acids Res. 44, 2047–2057 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  100. Drew, H. R. et al. Structure of a B-DNA dodecamer: conformation and dynamics. Proc. Natl Acad. Sci. USA 78, 2179–2183 (1981).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  101. Müller, J., Oehler, S. & Müller-Hill, B. Repression of lac promoter as a function of distance, phase and quality of an auxiliary lac operator. J. Mol. Biol. 257, 21–29 (1996).

    Article  PubMed  Google Scholar 

  102. Hochschild, A. & Ptashne, M. Cooperative binding of lambda repressors to sites separated by integral turns of the DNA helix. Cell 44, 681–687 (1986).

    Article  CAS  PubMed  Google Scholar 

  103. Ghosh, R. P. et al. Satb1 integrates DNA binding site geometry and torsional stress to differentially target nucleosome-dense regions. Nat. Commun. 10, 3221 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  104. Zhu, F. et al. The interaction landscape between transcription factors and the nucleosome. Nature 562, 76–81 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  105. Ptashne, M. Regulation of transcription: from lambda to eukaryotes. Trends Biochem. Sci 30, 275–279 (2005).

    Article  CAS  PubMed  Google Scholar 

  106. Sun, Y. et al. Zelda overcomes the high intrinsic nucleosome barrier at enhancers during Drosophila zygotic genome activation. Genome Res. 25, 1703–1714 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  107. Thanos, D. & Maniatis, T. Virus induction of human IFNβ gene expression requires the assembly of an enhanceosome. Cell 83, 1091–1100 (1995).

    Article  CAS  PubMed  Google Scholar 

  108. Merika, M. & Thanos, D. Enhanceosomes. Curr. Opin. Genet. Dev. 11, 205–208 (2001).

    Article  CAS  PubMed  Google Scholar 

  109. Li, Q. & Wrange, O. Accessibility of a glucocorticoid response element in a nucleosome depends on its rotational positioning. Mol. Cell. Biol. 15, 4375–4384 (1995).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  110. Sharon, E. et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol. 30, 521–530 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  111. Cai, H. N., Arnosti, D. N. & Levine, M. Long-range repression in the Drosophila embryo. Proc. Natl Acad. Sci. USA 93, 9309–9314 (1996).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  112. Cui, F. & Zhurkin, V. B. Rotational positioning of nucleosomes facilitates selective binding of p53 to response elements associated with cell cycle arrest. Nucleic Acids Res. 42, 836–847 (2014).

    Article  CAS  PubMed  Google Scholar 

  113. Suryamohan, K. & Halfon, M. S. Identifying transcriptional cis-regulatory modules in animal genomes. Wiley Interdiscip. Rev. Dev. Biol. 4, 59–84 (2015).

    Article  CAS  PubMed  Google Scholar 

  114. Istrail, S. Eric Davidson’s regulatory genome for computer science: causality, logic, and proof principles of the genomic cis-regulatory code. J. Comput. Biol. 26, 653–684 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  115. Slattery, M. et al. Absence of a simple code: how transcription factors read the genome. Trends Biochem. Sci. 39, 381–399 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  116. Tseng, A. M., Shrikumar, A. & Kundaje, A. Fourier-transform-based attribution priors improve the interpretability and stability of deep learning models for genomics. Preprint at bioRxiv https://doi.org/10.1101/2020.06.11.147272 (2020).

  117. Klemenz, R., Stillman, D. J. & Geiduschek, E. P. Specific interactions of Saccharomyces cerevisiae proteins with a promoter region of eukaryotic tRNA genes. Proc. Natl Acad. Sci. USA 79, 6191–6195 (1982).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  118. Oler, A. J. et al. Human RNA polymerase III transcriptomes and relationships to Pol II promoter chromatin and enhancer-binding factors. Nat. Struct. Mol. Biol. 17, 620–628 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  119. Koenecke, N., Johnston, J., He, Q., Meier, S. & Zeitlinger, J. Drosophila poised enhancers are generated during tissue patterning with the help of repression. Genome Res. 27, 64–74 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  120. Stemmer, M., Thumberger, T., Del Sol Keyer, M., Wittbrodt, J. & Mateo, J. L. Cctop: an intuitive, flexible and reliable crispr/cas9 target prediction tool. PLoS ONE 10, e0124633 (2015).

    Article  PubMed  PubMed Central  Google Scholar 

  121. Labuhn, M. et al. Refined sgRNA efficacy prediction improves large- and small-scale CRISPR-Cas9 applications. Nucleic Acids Res. 46, 1375–1385 (2018).

    Article  CAS  PubMed  Google Scholar 

  122. Connelly, J. P. & Pruett-Miller, S. M. CRIS.py: a versatile and high-throughput analysis program for CRISPR-based genome editing. Sci. Rep. 9, 4194 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  123. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10 (2011).

    Article  Google Scholar 

  124. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  125. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).

    Article  PubMed  PubMed Central  Google Scholar 

  126. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

    Article  PubMed  PubMed Central  Google Scholar 

  127. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  128. Landt, S. G. et al. ChIP–seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 22, 1813–1831 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  129. Kent, W. J., Zweig, A. S., Barber, G., Hinrichs, A. S. & Karolchik, D. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics 26, 2204–2207 (2010).

  130. Li, Q., Brown, J. B., Huang, H. & Bickel, P. J. Measuring reproducibility of high-throughput experiments. Ann. Appl. Stat. 5, 1752–1779 (2011).

  131. Yardımcı, G. G., Frank, C. L., Crawford, G. E. & Ohler, U. Explicit DNase sequence bias modeling enables high-resolution transcription factor footprint detection. Nucleic Acids Res. 42, 11865–11878 (2014).

  132. Chollet, F. et al. Keras. https://keras.io (2015).

  133. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. dblp: Computer Science Bibliography https://dblp.org/rec/journals/corr/KingmaB14.html (2015).

  134. Ward, J. H. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963).

  135. Bar-Joseph, Z., Gifford, D. K. & Jaakkola, T. S. Fast optimal leaf ordering for hierarchical clustering. Bioinformatics 17, S22–S29 (2001).

Acknowledgements

We thank M. Levine and R. Krumlauf for comments and J. Israeli for initial technical help. This work was funded by the Stowers Institute for Medical Research (SIMR), NIH grant no. R01HG010211 to J.Z. and NIH grant nos. DP2GM123485, U01HG009431 and R01HG009674 to A.K. Ž.A. was supported by the German Bundesministerium für Bildung und Forschung through the project MechML (no. 01IS18053F). A.S. was supported by the Stanford BioX Fellowship and HHMI International Student Research Fellowship. Illumina sequencing was performed at SIMR (A. Perera and M. Peterson) and the University of Kansas Medical Center Genomics Core, supported by NIH grant nos. U54HD090216, S10OD021743 and COBRE P30GM122731. Generation of CRISPR/Cas9 mouse ESC lines was performed by the following cores at SIMR: Genome Engineering (K. Delventhal, B. Miller and K. Weaver), Tissue Culture (C. Zhao, A. Murray, Y. Wang, O. Kenzior, Q. Jiang, S. Hime and S. Gosh) and Cytometry (J. Haug and D. DeGraffenreid).

Author information

Contributions

Ž.A., A.K. and J.Z. conceived the project. Ž.A., A.S., A.K. and J.Z. conceived and implemented the computational methods. S.K., K.D. and R.F. performed the experiments. Ž.A., M.W., A.A. and C.M. performed further computational analysis. J.Z., A.K. and J.G. supervised the project. Ž.A., M.W., S.K., J.G., A.K. and J.Z. prepared the manuscript with input from all authors.

Corresponding authors

Correspondence to Anshul Kundaje or Julia Zeitlinger.

Ethics declarations

Competing interests

J.Z. owns a patent on ChIP–nexus (no. 10287628). All other authors declare no competing interests.

Additional information

Peer review information Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Additional performance evaluation of BPNet’s predictions of ChIP-nexus data.

a, Observed and predicted ChIP-nexus read counts mapping to the forward strand (dark) and the reverse strand (light) for the Zfp281 and Sall1 enhancers located on the held-out (test) chromosome 1. b, Alternative profile shape evaluation metrics showing the difference to random predictions: multinomial negative log-likelihood and Jensen-Shannon (JS) divergence. Both metrics were computed at different resolutions (from 1 bp to 10 bp windows) in held-out test chromosomes 1, 8 and 9. c, auPRC of profile predictions is high across various learning rates on the tuning set chromosomes 2, 3 and 4, demonstrating the robustness of the model. d, The deconvolutional layer slightly improves the profile predictive performance compared to a point-wise convolutional layer (deconvolution size = 1). e, auPRC of profile predictions (top) and the Spearman correlation of total count predictions (bottom) for a range of relative total count weights α in the BPNet loss function, parameterized as λ = (α/2)·n_obs. A relative weight of 1 (center) denotes equal weighting of the counts and profile loss functions. The best performance is obtained for α < 1, showing that placing more weight on profile predictions aids both profile and count predictions. f, Observed and predicted total read counts for BPNet (top) and replicate experiments (bottom) across the four studied TFs along with the Spearman correlation coefficient.
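The multi-resolution JS divergence in panel b and the count-loss weight in panel e can be illustrated with a minimal sketch. It assumes that "resolution" means summing per-base counts into consecutive, non-overlapping windows before normalizing, and that the caption's parameterization reads as λ = (α/2)·n_obs; the function names are hypothetical and are not taken from the BPNet code base.

```python
import math


def bin_profile(counts, width):
    """Sum per-base counts into consecutive windows of `width` bp."""
    return [sum(counts[i:i + width]) for i in range(0, len(counts), width)]


def normalize(values):
    """Turn non-negative counts into a probability distribution."""
    total = sum(values)
    if total == 0:
        raise ValueError("profile has no signal")
    return [v / total for v in values]


def kl_bits(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(obs_counts, pred_counts, width=1):
    """JS divergence (bits) between two profiles at a given window width."""
    p = normalize(bin_profile(obs_counts, width))
    q = normalize(bin_profile(pred_counts, width))
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)


def count_loss_weight(alpha, n_obs):
    """Assumed reading of the caption: lambda = (alpha / 2) * n_obs."""
    return alpha / 2 * n_obs
```

Two profiles that disagree at 1 bp resolution can agree perfectly once binned, which is why the metric is reported across window sizes.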

Extended Data Fig. 2 Removal of long motifs in retrotransposons and clustering of motifs by similarity.

a, Among all motifs discovered by TF-MoDISco, 18 motifs display unusually high information content (IC) of >30 bits (green). The expected short motifs are shown in gray. b, Histogram of the overlap of short motifs (gray) and long motifs (green) with repeat elements shows that long motifs overlap >80% with annotated retrotransposons. c, Long motifs with their PFM, ID, fraction of motif instances overlapping with a repeat and the most frequent (top class) RepeatMasker annotation. Highlighted within the repeat elements are potential motif instances of Oct4-Sox2, Sox2, Nanog and Klf4 as indicated by the CWMs. d, To identify a set of representative motifs from the 33 short motifs discovered for different TFs (information content <30 bits, shown in Supplementary Fig. 3) and remove redundant short motifs, motifs were clustered by similarity using hierarchical clustering. The results were then manually inspected to select clusters that separate known motifs that are distinct (for example, Oct4-Oct4 resembles the known MORE and PORE motifs that bind Oct4 homodimers, which differ from the monomerically bound Oct4 motif). Among very similar motifs within a cluster, we then selected the most abundant motif that was discovered for the most relevant TF (if known). The 11 representative motifs that we selected are shown on the left. Non-canonical motifs were given a name (Nanog-alt for Nanog alternative, Klf4-long for longer Klf4).
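The 30-bit threshold separating long from short motifs can be illustrated with a small sketch that computes the total information content of a PFM. It assumes a uniform background (0.25 per base) and a PFM given as a list of per-position probability columns; the exact background used in the paper is not stated in this caption, and the function names are hypothetical.

```python
import math


def information_content(pfm, background=0.25):
    """Total IC in bits: sum over positions of sum_b p_b * log2(p_b / background)."""
    total = 0.0
    for column in pfm:
        # Zero-probability bases contribute nothing to the sum.
        total += sum(p * math.log2(p / background) for p in column if p > 0)
    return total


def is_long_motif(pfm, threshold_bits=30.0):
    """Flag motifs whose total IC exceeds the threshold used in panel a."""
    return information_content(pfm) > threshold_bits
```

With a uniform background, a fully conserved position contributes 2 bits, so a motif needs more than 15 perfectly conserved positions to exceed 30 bits.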

Extended Data Fig. 3 BPNet and TF-MoDISco outperform traditional methods in motif discovery and the mapping of motif instances.

a, Motifs discovered by ChExMix, HOMER and MEME for Oct4, Sox2, Nanog and Klf4 ChIP-nexus peaks that are closest to the 11 primary representative BPNet motifs (top row). A green checkmark denotes that the discovered motif is similar to the BPNet motif. b, Number of motif instances located up to 500 bp (top) or 100 bp (bottom) away from the ChIP-nexus peak summits showing a strong ChIP-nexus footprint. Only motif instances in peaks from held-out test chromosomes (1, 8 and 9) were used for the evaluation. The x-axis shows the top N motif instances from each method, sorted in descending order of score (PWM log-odds score or CWM contribution score). For BPNet-augm, the center of the genomic region for which the contribution scores were computed was randomly jittered up to 200 bp away from the peak summit. This augmentation prevents BPNet from using the positional information of the peak summit. In the final column (Nanog replicate), the Nanog ChIP-nexus footprint was measured by a separate biological replicate using a different antibody (α-Nanog from Abcam, ab214549), which was not used during training or evaluation.

Extended Data Fig. 4 BPNet training on ChIP-nexus profiles is faster and yields more accurate motif instances than a binary classification model.

a, Predictive performance as measured by the precision-recall curve of the binary classification models predicting the presence or absence of ChIP-nexus peaks from 1 kb DNA sequences, evaluated across the held-out (tuning/validation) chromosomes 2, 3 and 4. The model trained only to classify the sequences is outperformed by the model trained to also predict the ChIP-nexus profiles from DNA sequence in addition to classifying them, shown in blue (without profile bias-correction in light blue and with bias-correction in dark blue). b, Training time of the binary classification model trained genome-wide and the sequence-to-profile model (BPNet) trained in ChIP-nexus peaks. c, Motifs detected by TF-MoDISco using the contribution scores in ChIP-nexus peaks of the sequence-to-profile BPNet (profile reg.) or the binary classification model (binary class). A light color denotes a high number of seqlets for each motif. Motifs not discovered or motifs supported by fewer than 100 seqlets are shown in black. Questionable motifs are displayed separately on the right. d, The number of motif instances (within 500 bp of the ChIP-nexus peak summit) showing a ChIP-nexus footprint (y-axis) within the top N motif instances with highest contribution scores (x-axis) from the held-out (test) chromosomes 1, 8 and 9. A site was considered to show a ChIP-nexus footprint if the number of reads at the position of the aggregate footprint summit (averaged across both strands) is higher than the 90th percentile value of all motif instances detected by the profile regression model for the corresponding TF (that is, the same as in Extended Data Fig. 3b).
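The evaluation in panel d (and in Extended Data Fig. 3b) can be sketched as follows. The sketch assumes each motif instance is a hypothetical (score, summit_reads) pair, where summit_reads is the strand-averaged read count at the aggregate footprint summit, and it uses a linear-interpolation percentile; the paper's exact percentile convention is not given here.

```python
def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    rank = (len(ordered) - 1) * q / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def footprints_in_top_n(instances, reference_summit_reads, n_values, q=90):
    """Count instances with a footprint among the top N by score.

    instances: iterable of (score, summit_reads) pairs for one method and TF.
    reference_summit_reads: summit reads of all instances called by the
        profile regression model for that TF; they define the threshold.
    """
    threshold = percentile(reference_summit_reads, q)
    ranked = sorted(instances, key=lambda inst: inst[0], reverse=True)
    cumulative, hits = [], 0
    for _score, reads in ranked:
        hits += reads > threshold  # True counts as 1
        cumulative.append(hits)
    result = {}
    for n in n_values:
        if n <= 0 or not cumulative:
            result[n] = 0
        else:
            result[n] = cumulative[min(n, len(cumulative)) - 1]
    return result
```

Because the threshold is fixed by one reference model, curves for different methods are directly comparable at each N.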

Extended Data Fig. 5 Strict motif spacings are found on retrotransposons and indirectly bound motifs can be validated.

a, To show that TF binding occurs with strict spacings in retrotransposons and that this is likely ancestral, the RLTR9E N6 motif is shown as an example. Sequences of the individual instances in the genome were sorted by the Kimura distance from the consensus motif, with the most similar sequences on top (which are likely more ancestral). Nanog, Sox2 and Klf4 ChIP-nexus binding footprints are shown in the same order on the right (+ strand reads in red, - strand reads in blue), revealing that the binding site spacing is largely constant across all sequences. b, Analysis of the most frequent distances between motif pairs (with >500 co-occurrences, distance measured at the trimmed motifs’ centers). Of the top 1% most frequent distances, 83% mapped to ERVs, and these distances were often longer than 20 bp. c, To validate the identified Zic3 motif instances, Zic3 ChIP-nexus experiments were performed. The average signal across the Zic3 instances reveals a strong Zic3 binding footprint. d, A similar validation was performed for the Esrrb motif instances, revealing that the Esrrb ChIP-nexus signal is present but more diffuse at the discovered Esrrb motif instances. e, To better understand the binding of Oct4 to the B-box, which is frequently found in tRNA, tRNA-overlapping B-box motif instances were reoriented to match the transcriptional direction and sorted by tRNA gene start proximity. This reveals Oct4 binding at tRNA gene start/stop sites. f, Amino acid anticodons and copy counts of the tRNAs that overlapped with the B-box motif instances.

Extended Data Fig. 6 Additional genomic in-silico interaction analyses confirm the directional effects.

a, Example genomic in-silico mutagenesis analysis at the distal Oct4 enhancer. Predicted ChIP-nexus profiles and the contribution scores greatly decrease at both motifs (Oct4-Sox2 and Nanog) when the Oct4-Sox2 motif is erased (through random sequence insertion). By contrast, when the Nanog motif is erased (right), the predicted profile and the contribution scores of the Oct4-Sox2 motif remain intact. b, Such directional effects of motifs can be quantified by the corrected binding fold change (Supplementary Fig. 10a) for all motif pairs in the genome and visualized as a scatterplot. c, Example scatterplot for the interaction between Sox2 and Nanog. Sox2 shows a positive directional effect on Nanog that is most pronounced at short motif distances (<35 bp). d, Predicted binding fold changes for all motif pairs in genomic sequences.

Extended Data Fig. 7 Helical periodicity of Nanog motifs is not discovered with traditional methods and requires BPNet’s large receptive field.

a, The pairwise spacing of Nanog motif instances located up to 100 bp away from the ChIP-nexus peak summits in all possible strand orientations (rows) for different methods and/or thresholds (columns). Results for all chromosomes are shown. b, The pairwise spacing of Nanog motif instances when BPNet is trained with different numbers of convolutional layers (Fig. 1g). BPNet with only a single convolutional layer (first column) is unable to capture the 10 bp periodicity owing to its limited receptive field, which is similar to that of PWMs.
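A pairwise-spacing histogram and a simple test for roughly 10 bp periodicity can be sketched as below. This is an illustrative stand-in rather than the paper's analysis: the maximum pairwise distance, the mean removal and the single-frequency Fourier component are all assumptions, and strand orientation is ignored.

```python
import cmath
from collections import Counter


def pairwise_spacings(positions, max_distance=100):
    """Histogram of distances between all pairs of motif centers in one region."""
    ordered = sorted(positions)
    spacing = Counter()
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:]:
            distance = right - left
            if distance > max_distance:
                break  # positions are sorted, so later ones are farther away
            spacing[distance] += 1
    return spacing


def periodicity_strength(spacing, period=10.0, max_distance=100):
    """Magnitude of the Fourier component at `period`, per histogram bin."""
    hist = [spacing.get(d, 0) for d in range(1, max_distance + 1)]
    mean = sum(hist) / len(hist)
    component = sum(
        (count - mean) * cmath.exp(-2j * cmath.pi * d / period)
        for d, count in enumerate(hist, start=1)
    )
    return abs(component) / len(hist)
```

In practice the spacing counters from all peaks would be summed before calling `periodicity_strength`, and the value at 10 bp compared with neighboring periods.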

Extended Data Fig. 8 The ChIP-nexus data on CRISPR-mutated ESCs are highly reproducible.

a, Nanog and Sox2 ChIP-nexus profiles normalized to reads per million (RPM) show highly similar profiles and read counts across known enhancer regions for wild-type (Wt) and CRISPR ESCs with either a mutated Sox2 motif (Sox2 CRISPR) or mutated Nanog motif (Nanog CRISPR) at a selected genomic region (chr10: 85,539,626-85,539,777). b, Pairwise comparisons of ChIP-nexus RPM counts between Wt and CRISPR ESCs at bound genomic regions (151 bp centered on the respective motif) with Sox2 ChIP-nexus counts on Sox2 motifs and Nanog ChIP-nexus counts on Nanog motifs (motifs based on the original model). The bulk data (gray) are highly correlated and known enhancer regions as shown in Supplementary Fig. 5 (green) are highly reproducible between ESC lines. Note the specific loss of counts in the selected mutated genomic region (red) over wild-type. Pearson correlations (Rp) between groups are shown in the top left of each scatter plot.
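The RPM normalization and Pearson correlation used in this comparison can be sketched as follows. The function names and the idea of passing the library size as `total_mapped_reads` are illustrative assumptions, not a description of the authors' pipeline.

```python
import math


def reads_per_million(region_counts, total_mapped_reads):
    """Scale raw read counts to reads per million mapped reads (RPM)."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    scale = 1_000_000 / total_mapped_reads
    return [count * scale for count in region_counts]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 values")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

Normalizing both libraries to RPM before correlating removes differences in sequencing depth between the wild-type and CRISPR samples.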

Extended Data Fig. 9 The base-resolution BPNet model can be trained on ChIP-seq profiles.

a, Observed read counts (Obs) and predicted read counts (Pred) for BPNet trained on ChIP-seq data for the Zfp281 and Lefty1 enhancers located on the held-out (test) chromosome 1, with forward strand reads (dark) and reverse strand reads (light). For Obs, a sliding window of 50 bp was used to smooth the raw 5’ end read counts (line); raw counts are shown as points on the bottom at y = 0. b, BPNet predicts the ChIP-seq profile shape better than replicates. Multinomial log-likelihood difference compared to the constant model was used to evaluate the profile shape quality at different resolutions (from 1 bp to 10 bp windows) in held-out chromosomes 1, 8 and 9. A log-likelihood of 0 corresponds to the constant model. Multinomial log-likelihood was conditioned on the observed number of total counts as in the training loss. c, Total counts in 1 kb regions can be predicted by BPNet (red) with reasonable accuracy (measured by Pearson correlation with log(1+observed values)). The predictions do not surpass replicate performance (blue), but are well above the input control (gray). d, Obs and Pred as in panel a, as well as contribution scores for the known Oct4 enhancer. Motif instances derived by CWM scanning are highlighted with a green box.
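The 50 bp smoothing of raw 5’ end counts in panel a can be illustrated with a short sketch. It assumes an approximately centered window and a plain mean, with the window truncated at the region edges; the caption does not specify these details.

```python
def sliding_window_mean(counts, window=50):
    """Smooth per-base counts with a roughly centered moving average.

    The window covers positions [i - window // 2, i - window // 2 + window)
    and is truncated at the ends of the region.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    # Prefix sums make each window sum an O(1) lookup.
    prefix = [0]
    for count in counts:
        prefix.append(prefix[-1] + count)
    half = window // 2
    smoothed = []
    for i in range(len(counts)):
        start = max(0, i - half)
        end = min(len(counts), i - half + window)
        smoothed.append((prefix[end] - prefix[start]) / (end - start))
    return smoothed
```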

Extended Data Fig. 10 BPNet trained on ChIP-seq discovers similar motifs and recovers the Nanog motif periodicity.

a, BPNet applied to ChIP-seq discovers the majority of the motifs identified by BPNet applied to ChIP-nexus data. The models ‘ChIP-nexus profile cr’ and ‘ChIP-seq profile cr’ were trained on the union of the ChIP-nexus/seq peaks predicting Oct4, Sox2, and Nanog binding and were interpreted on the intersection of the ChIP-nexus/seq peaks. b, The pairwise spacing of Nanog motif instances derived from the ChIP-seq profile model in all possible strand orientations shows helical periodicity (similar to Extended Data Fig. 7a). c, Motif instance calling with CWM scanning has higher accuracy for BPNet trained on ChIP-nexus data than for BPNet trained on ChIP-seq data (evaluated on the union of the ChIP-nexus/seq peaks, 500 bp around the peak summit using ChIP-nexus footprints as ground truth). d, Training a sequence-to-profile model on ChIP-seq data yields more accurate motif instances (500 bp around the ChIP-seq peak summits using ChIP-nexus footprints as ground truth) than training a binary classification model or using a PWM scanning approach using FIMO for motifs derived directly from ChIP-nexus data. See Extended Data Figs. 3b, 4d and Supplementary Note for more details.

Supplementary information

Supplementary Information

Supplementary Figs. 1–14 and Note

Reporting Summary

Peer Review Information

Supplementary Tables 1–3

(1) List of all ChIP–nexus and ChIP–seq replicate experiments and associated quality control metrics. (2) Further binary classification metrics corresponding to Extended Data Fig. 4a. (3) Sequences of gRNA and single-stranded oligo donors for CRISPR mutations.

Supplementary Data 1

Clustered motifs and their labels. Motifs were obtained by running TF-MoDISco on BPNet models trained on six different datasets: (1) seq/profile.peaks-union (ChIP–seq profile model trained on a combination of ChIP–nexus and ChIP–seq peaks); (2) seq/binary (binary classification model trained on genome-wide ChIP–seq peaks); (3) seq/profile (ChIP–seq profile model trained on ChIP–nexus peaks); (4) nexus/profile.peaks-union (ChIP–nexus profile model trained on a combination of ChIP–nexus and ChIP–seq peaks); (5) nexus/binary (binary classification model trained on genome-wide ChIP–nexus peaks); and (6) nexus/profile (ChIP–nexus profile model trained on ChIP–nexus peaks). Each motif logo shows the sequence information content of a PFM. The logo title consists of the manually assigned motif label (for example, TE1, Oct4–Sox2) and the motif ID, which is composed of the model name, the task name and the TF-MoDISco motif ID (for example, seq/profile/Nanog/m0_p13).
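The motif ID layout described above can be split programmatically. The sketch below assumes that the task name and the TF-MoDISco motif ID are always the last two slash-separated fields, so that model names containing a slash (such as seq/profile) are handled; `MotifId` and `parse_motif_id` are hypothetical names.

```python
from typing import NamedTuple


class MotifId(NamedTuple):
    model: str        # for example "seq/profile"
    task: str         # for example "Nanog"
    modisco_id: str   # for example "m0_p13"


def parse_motif_id(motif_id: str) -> MotifId:
    """Split '<model>/<task>/<modisco id>' where the model may contain '/'."""
    parts = motif_id.split("/")
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"unexpected motif ID: {motif_id!r}")
    return MotifId(model="/".join(parts[:-2]), task=parts[-2], modisco_id=parts[-1])
```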

Supplementary Video 1

BPNet profile predictions averaged across 128 random sequences with two motifs inserted at different positions. Centers of the motifs are marked by the vertical gray line; motif distance is shown on the right. For each motif, the predicted profile of the corresponding TF is shown on the y axis.
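The synthetic inputs behind such an analysis can be sketched as follows: random background sequences into which two motifs are written at a chosen distance. The sequence length, the uniform base composition, the seed and the function names are assumptions for illustration. The model prediction and averaging steps are omitted, and overlapping motif placements are not checked.

```python
import random


def insert_motif(sequence, motif, center):
    """Overwrite `sequence` with `motif` so that the motif is centered at `center`."""
    start = center - len(motif) // 2
    if start < 0 or start + len(motif) > len(sequence):
        raise ValueError("motif does not fit at the requested position")
    return sequence[:start] + motif + sequence[start + len(motif):]


def synthetic_pair_sequences(motif_a, motif_b, center_a, distance,
                             length=1000, n=128, seed=0):
    """Generate n random sequences carrying motif_a and motif_b `distance` bp apart."""
    rng = random.Random(seed)
    sequences = []
    for _ in range(n):
        background = "".join(rng.choice("ACGT") for _ in range(length))
        seq = insert_motif(background, motif_a, center_a)
        seq = insert_motif(seq, motif_b, center_a + distance)
        sequences.append(seq)
    return sequences
```

Averaging predictions over many random backgrounds isolates the effect of the two motifs and their spacing from any particular flanking sequence.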

Supplementary Video 2

See Supplementary Video 1.

Supplementary Video 3

See Supplementary Video 1.

Supplementary Video 4

See Supplementary Video 1.

Supplementary Video 5

See Supplementary Video 1.

Supplementary Video 6

See Supplementary Video 1.

Source data

Source Data Fig. 5

Motif 10-bp periodicity for all motifs visualized in Fig. 5d.

About this article

Cite this article

Avsec, Ž., Weilert, M., Shrikumar, A. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet 53, 354–366 (2021). https://doi.org/10.1038/s41588-021-00782-6

  • DOI: https://doi.org/10.1038/s41588-021-00782-6
