  • Article
DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers

Abstract

Enhancer sequences control gene expression and comprise binding sites (motifs) for different transcription factors (TFs). Despite extensive genetic and computational studies, the relationship between DNA sequence and regulatory activity is poorly understood, and de novo enhancer design has been challenging. Here, we built a deep-learning model, DeepSTARR, to quantitatively predict the activities of thousands of developmental and housekeeping enhancers directly from DNA sequence in Drosophila melanogaster S2 cells. The model learned relevant TF motifs and higher-order syntax rules, including functionally nonequivalent instances of the same TF motif that are determined by motif-flanking sequence and intermotif distances. We validated these rules experimentally and demonstrated that they can be generalized to humans by testing more than 40,000 wildtype and mutant Drosophila and human enhancers. Finally, we designed and functionally validated synthetic enhancers with desired activities de novo.
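As an illustration of what predicting activity "directly from DNA sequence" usually involves for models of this kind, the sketch below shows one common way to turn a sequence into numeric model input. It is a minimal sketch under stated assumptions, not the authors' code: the one-hot scheme, the A, C, G, T channel order and the function name `one_hot_encode` are all hypothetical choices that this text does not specify.

```python
# Assumed channel order; the encoding used by the published model is not given in this text.
ALPHABET = "ACGT"


def one_hot_encode(seq):
    """Return one row per base, with 1.0 in the column of that base.

    Bases outside ALPHABET (for example N) are encoded as all zeros,
    which is an illustrative convention, not a documented one.
    """
    rows = []
    for base in seq.upper():
        row = [0.0] * len(ALPHABET)
        idx = ALPHABET.find(base)
        if idx >= 0:
            row[idx] = 1.0
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Each printed row corresponds to one base of the input sequence.
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

A matrix of this shape (sequence length by four channels) is the kind of input a convolutional sequence model would typically consume.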

Fig. 1: DeepSTARR quantitatively predicts enhancer activity genome wide from DNA sequence.
Fig. 2: DeepSTARR reveals important TF motif types that validate experimentally.
Fig. 3: Instances of the same TF motif have nonequivalent contributions to enhancer activity.
Fig. 4: Contribution of TF motifs depends on the flanking sequence.
Fig. 5: In silico analysis reveals distinct modes of motif cooperativity.
Fig. 6: Motif syntax rules dictate the contribution of TF motif instances in human enhancers.
Fig. 7: DeepSTARR designs synthetic enhancers using optimal sequence rules.

Data availability

The raw sequencing data are available from GEO (https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE183939. Data used to train and evaluate the DeepSTARR model, as well as the final pretrained model, are available on Zenodo at https://doi.org/10.5281/zenodo.5502060. The pretrained DeepSTARR model is also available in the Kipoi model repository (ref. 109; http://kipoi.org/models/DeepSTARR/). Genome browser tracks showing genome-wide UMI-STARR-seq and DeepSTARR predictions in Drosophila S2 cells, including nucleotide contribution scores for all enhancer sequences, together with the enhancers used for mutagenesis, mutated motif instances and respective log2FC in enhancer activity, are available at https://genome.ucsc.edu/s/bernardo.almeida/DeepSTARR_manuscript. Dynamic sequence tracks (https://github.com/pkerpedjiev/higlass-dynseq) and contribution scores are also available as a Reservoir Genome Browser session at https://resgen.io/paper-data/Almeida...%202021%20-%20DeepSTARR/views. TF motif models were obtained from iRegulon (http://iregulon.aertslab.org/collections.html; ref. 101). DNase-seq and ATAC-seq data in Drosophila S2 cells were obtained from refs. 63 and 110, respectively; nascent transcription data from ref. 111; and H3K4me1 and H3K27ac chromatin marks from ref. 112. RepeatMasker dm3 annotations were obtained from http://www.repeatmasker.org/genomes/dm3/RepeatMasker-rm405-db20140131/dm3.fa.out.gz. Genomic DNase I footprinting data of RKO cells were downloaded from https://resources.altius.org/~jvierstra/projects/footprinting.2020/per.dataset/h.RKO-DS40362/. HCT116 DNase-seq, H3K27ac and H3K4me1 data were obtained from ENCODE (ref. 97; https://www.encodeproject.org/; ENCFF001SQU, ENCFF001WIJ, ENCFF001WIK, ENCFF175RBN, ENCFF228YKV, ENCFF851NWR, ENCFF927AHJ, ENCFF945KJN, ENCFF360XGA, ENCFF130JBP and ENCFF400KKD) and ATAC-seq data from ref. 96.
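The log2FC values mentioned above compare the activity of a mutated enhancer with that of its wild-type counterpart. The sketch below shows the generic log2 fold-change calculation. It assumes both activities are positive, linear-scale values. The function name `log2_fold_change` and the error handling are hypothetical, and the authors' processing pipeline may differ (for example in normalization or pseudocounts).

```python
import math


def log2_fold_change(mutant_activity, wildtype_activity):
    """Return log2(mutant / wild type) for two positive activity values.

    Negative results mean the mutation reduced activity; positive results
    mean it increased activity. Inputs are assumed to be on a linear scale.
    """
    if mutant_activity <= 0 or wildtype_activity <= 0:
        raise ValueError("activities must be positive linear-scale values")
    return math.log2(mutant_activity / wildtype_activity)


if __name__ == "__main__":
    # A mutant with a quarter of the wild-type activity gives a log2FC of -2.
    print(log2_fold_change(1.0, 4.0))
```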

Code availability

Code used to process the genome-wide and oligonucleotide UMI-STARR-seq data, train DeepSTARR and predict the enhancer activity for new DNA sequences, as well as to reproduce the results, is available on GitHub (https://github.com/bernardo-de-almeida/DeepSTARR). The code and TF motif compendium are available from https://github.com/bernardo-de-almeida/motif-clustering.

References

  1. Banerji, J., Rusconi, S. & Schaffner, W. Expression of a β-globin gene is enhanced by remote SV40 DNA sequences. Cell 27, 299–308 (1981).

  2. Levine, M. Transcriptional enhancers in animal development and evolution. Curr. Biol. 20, R754–R763 (2010).

  3. Catarino, R. R. & Stark, A. Assessing sufficiency and necessity of enhancer activities for gene expression and the mechanisms of transcription activation. Genes Dev. 32, 202–223 (2018).

  4. Gompel, N., Prud’homme, B., Wittkopp, P. J., Kassner, V. A. & Carroll, S. B. Chance caught on the wing: cis-regulatory evolution and the origin of pigment patterns in Drosophila. Nature 433, 481–487 (2005).

  5. Rickels, R. & Shilatifard, A. Enhancer logic and mechanics in development and disease. Trends Cell Biol. 28, 608–630 (2018).

  6. Spitz, F. & Furlong, E. E. M. Transcription factors: from enhancer binding to developmental control. Nat. Rev. Genet. 13, 613–626 (2012).

  7. Kulkarni, M. M. & Arnosti, D. N. Information display by transcriptional enhancers. Development 130, 6569–6575 (2003).

  8. Zinzen, R. P., Senger, K., Levine, M. & Papatsenko, D. Computational models for neurogenic gene expression in the Drosophila embryo. Curr. Biol. 16, 1358–1365 (2006).

  9. Erceg, J. et al. Subtle changes in motif positioning cause tissue-specific effects on robustness of an enhancer’s activity. PLoS Genet. 10, e1004060 (2014).

  10. Levo, M. & Segal, E. In pursuit of design principles of regulatory sequences. Nat. Rev. Genet. 15, 453–468 (2014).

  11. Crocker, J. et al. Low affinity binding site clusters confer Hox specificity and regulatory robustness. Cell 160, 191–203 (2015).

  12. Farley, E. K. et al. Suboptimization of developmental enhancers. Science 350, 325–328 (2015).

  13. Farley, E. K., Olson, K. M., Zhang, W., Rokhsar, D. S. & Levine, M. S. Syntax compensates for poor binding sites to encode tissue specificity of developmental enhancers. Proc. Natl Acad. Sci. USA 113, 6508–6513 (2016).

  14. Fiore, C. & Cohen, B. A. Interactions between pluripotency factors specify cis-regulation in embryonic stem cells. Genome Res. 26, 778–786 (2016).

  15. Mathelier, A. et al. DNA shape features improve transcription factor binding site predictions in vivo. Cell Syst. 3, 278–286 (2016).

  16. Sayal, R., Dresch, J. M., Pushel, I., Taylor, B. R. & Arnosti, D. N. Quantitative perturbation-based analysis of gene expression predicts enhancer activity in early Drosophila embryo. eLife 5, e08445 (2016).

  17. King, D. M. et al. Synthetic and genomic regulatory elements reveal aspects of cis-regulatory grammar in mouse embryonic stem cells. eLife 9, e41279 (2020).

  18. Jindal, G. A. & Farley, E. K. Enhancer grammar in development, evolution, and disease: dependencies and interplay. Dev. Cell 56, 575–587 (2021).

  19. Swanson, C. I., Evans, N. C. & Barolo, S. Structural rules and complex regulatory circuitry constrain expression of a Notch- and EGFR-regulated eye enhancer. Dev. Cell 18, 359–376 (2010).

  20. Snetkova, V. et al. Ultraconserved enhancer function does not require perfect sequence conservation. Nat. Genet. 53, 521–528 (2021).

  21. Panne, D. The enhanceosome. Curr. Opin. Struct. Biol. 18, 236–242 (2008).

  22. Wang, J. et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 22, 1798–1812 (2012).

  23. Guo, Y., Mahony, S. & Gifford, D. K. High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Comput. Biol. 8, e1002638 (2012).

  24. Junion, G. et al. A transcription factor collective defines cardiac cell fate and reflects lineage history. Cell 148, 473–486 (2012).

  25. Liu, F. & Posakony, J. W. Role of architecture in the function and specificity of two notch-regulated transcriptional enhancer modules. PLoS Genet. 8, e1002796 (2012).

  26. Smith, R. P. et al. Massively parallel decoding of mammalian regulatory sequences supports a flexible organizational model. Nat. Genet. 45, 1021–1028 (2013).

  27. Yáñez-Cuna, J. O. et al. Dissection of thousands of cell type-specific enhancers identifies dinucleotide repeat motifs as general enhancer features. Genome Res. 24, 1147–1156 (2014).

  28. Arnosti, D. N. & Kulkarni, M. M. Transcriptional enhancers: intelligent enhanceosomes or flexible billboards? J. Cell. Biochem. 94, 890–898 (2005).

  29. Berman, B. P. et al. Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol. 5, R61 (2004).

  30. Crocker, J., Ilsley, G. R. & Stern, D. L. Quantitatively predictable control of Drosophila transcriptional enhancers in vivo with engineered transcription factors. Nat. Genet. 48, 292–298 (2016).

  31. He, X., Samee, M. A. H., Blatti, C. & Sinha, S. Thermodynamics-based models of transcriptional regulation by enhancers: the roles of synergistic activation, cooperative binding and short-range repression. PLoS Comput. Biol. 6, e1000935 (2010).

  32. Segal, E., Raveh-Sadka, T., Schroeder, M., Unnerstall, U. & Gaul, U. Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature 451, 535–540 (2008).

  33. Beer, M. A. & Tavazoie, S. Predicting gene expression from sequence. Cell 117, 185–198 (2004).

  34. Zinzen, R. P. & Papatsenko, D. Enhancer responses to similarly distributed antagonistic gradients in development. PLoS Comput. Biol. 3, 0826–0835 (2007).

  35. Ghandi, M., Lee, D., Mohammad-noori, M. & Beer, M. A. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol. 10, e1003711 (2014).

  36. Kwasnieski, J. C., Fiore, C., Chaudhari, H. G. & Cohen, B. A. High-throughput functional testing of ENCODE segmentation predictions. Genome Res. 24, 1595–1602 (2014).

  37. Grossman, S. R. et al. Systematic dissection of genomic features determining transcription factor binding and enhancer function. Proc. Natl Acad. Sci. USA 114, E1291–E1300 (2017).

  38. Kheradpour, P. et al. Systematic dissection of regulatory motifs in 2000 predicted human enhancers using a massively parallel reporter assay. Genome Res. 23, 800–811 (2013).

  39. Svetlichnyy, D., Imrichova, H., Fiers, M., Kalender Atak, Z. & Aerts, S. Identification of high-impact cis-regulatory mutations using transcription factor specific random forest models. PLoS Comput. Biol. 11, e1004590 (2015).

  40. Dibaeinia, P. & Sinha, S. Deciphering enhancer sequence using thermodynamics-based models and convolutional neural networks. Nucleic Acids Res. 49, 10309–10327 (2021).

  41. Zabidi, M. A. et al. Enhancer-core-promoter specificity separates developmental and housekeeping gene regulation. Nature 518, 556–559 (2015).

  42. Arnold, C. D. et al. Genome-wide assessment of sequence-intrinsic enhancer responsiveness at single-base-pair resolution. Nat. Biotechnol. 35, 136–144 (2017).

  43. Haberle, V. et al. Transcriptional cofactors display specificity for distinct types of core promoters. Nature 570, 122–126 (2019).

  44. Kleftogiannis, D., Kalnis, P. & Bajic, V. B. Progress and challenges in bioinformatics approaches for enhancer identification. Brief. Bioinform. 17, 967–979 (2016).

  45. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

  46. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).

  47. Kim, D. et al. The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation. Nat. Genet. 53, 1564–1576 (2021).

  48. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).

  49. Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).

  50. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).

  51. Karbalayghareh, A., Sahin, M. & Leslie, C. S. Chromatin interaction aware gene regulatory modeling with graph attention networks. Preprint at bioRxiv https://doi.org/10.1101/2021.03.31.437978 (2021).

  52. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).

  53. Minnoye, L. et al. Cross-species analysis of enhancer logic using deep learning. Genome Res. 30, 1815–1834 (2020).

  54. Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018).

  55. Janssens, J. et al. Decoding gene regulation in the fly brain. Nature 601, 630–636 (2022).

  56. Bogard, N., Linder, J., Rosenberg, A. B. & Seelig, G. A deep neural network for predicting and engineering alternative polyadenylation. Cell 178, 91–106 (2019).

  57. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In Proc. 34th International Conference on Machine Learning 3145–3153 (2017).

  58. Shrikumar, A. et al. Technical note on transcription factor motif discovery from importance scores (TF-MoDISco) version 0.5.6.5. Preprint at https://doi.org/10.48550/arXiv.1811.00416 (2018).

  59. Zheng, A. et al. Deep neural networks identify sequence context features predictive of transcription factor binding. Nat. Mach. Intell. 3, 172–180 (2021).

  60. Koo, P. K., Majdandzic, A., Ploenzke, M., Anand, P. & Paul, S. B. Global importance analysis: an interpretability method to quantify importance of genomic features in deep neural networks. PLoS Comput. Biol. 17, e1008925 (2021).

  61. Greenside, P., Shimko, T., Fordyce, P. & Kundaje, A. Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics 34, i629–i637 (2018).

  62. Movva, R. et al. Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PLoS One 14, e0218073 (2019).

  63. Arnold, C. D. et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339, 1074–1077 (2013).

  64. Neumayr, C., Pagani, M., Stark, A. & Arnold, C. D. STARR-seq and UMI-STARR-seq: assessing enhancer activities for genome-wide-, high-, and low-complexity candidate libraries. Curr. Protoc. Mol. Biol. 128, e105 (2019).

  65. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Proc. 31st International Conference on Neural Information Processing Systems 4768–4777 (2017).

  66. Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2, 56–67 (2020).

  67. Yáñez-Cuna, J. O., Dinh, H. Q., Kvon, E. Z., Shlyueva, D. & Stark, A. Uncovering cis-regulatory sequence requirements for context-specific transcription factor binding. Genome Res. 22, 2018–2030 (2012).

  68. Scardigli, R., Bäumer, N., Gruss, P., Guillemot, F. & Le Roux, I. Direct and concentration-dependent regulation of the proneural gene Neurogenin2 by Pax6. Development 130, 3269–3281 (2003).

  69. Swanson, C. I., Schwimmer, D. B. & Barolo, S. Rapid evolutionary rewiring of a structurally constrained eye enhancer. Curr. Biol. 21, 1186–1196 (2011).

  70. Crocker, J., Preger-Ben Noon, E. & Stern, D. L. The soft touch: low-affinity transcription factor binding sites in development and evolution. Curr. Top. Dev. Biol. 117, 455–469 (2016).

  71. Crocker, J. & Ilsley, G. R. Using synthetic biology to study gene regulatory evolution. Curr. Opin. Genet. Dev. 47, 91–101 (2017).

  72. Boisclair Lachance, J. F., Webber, J. L., Hong, L., Dinner, A. R. & Rebay, I. Cooperative recruitment of Yan via a high-affinity ETS supersite organizes repression to confer specificity and robustness to cardiac cell fate specification. Genes Dev. 32, 389–401 (2018).

  73. Yu, M. et al. Insights into GATA-1-mediated gene activation versus repression via genome-wide chromatin occupancy analysis. Mol. Cell 36, 682–695 (2009).

  74. Chen, Y. et al. DNA binding by GATA transcription factor suggests mechanisms of DNA looping and long-range gene regulation. Cell Rep. 2, 1197–1206 (2012).

  75. Grossman, S. R. et al. Positional specificity of different transcription factor classes within enhancers. Proc. Natl Acad. Sci. USA 115, E7222–E7230 (2018).

  76. Scully, K. H. et al. Allosteric effects of Pit-1 DNA sites on long-term repression in cell type specification. Science 290, 1127–1131 (2000).

  77. Crocker, J., Tamori, Y. & Erives, A. Evolution acts on enhancer organization to fine-tune gradient threshold readouts. PLoS Biol. 6, 2576–2587 (2008).

  78. Cheng, Q. et al. Computational identification of diverse mechanisms underlying transcription factor-DNA occupancy. PLoS Genet. 9, e1003571 (2013).

  79. Morgunova, E. & Taipale, J. Structural perspective of cooperative transcription factor binding. Curr. Opin. Struct. Biol. 47, 1–8 (2017).

  80. Li, R., Pei, H. & Watson, D. K. Regulation of Ets function by protein–protein interactions. Oncogene 19, 6514–6523 (2000).

  81. Burda, P., Laslo, P. & Stopka, T. The role of PU.1 and GATA-1 transcription factors during normal and leukemogenic hematopoiesis. Leukemia 24, 1249–1257 (2010).

  82. Vierstra, J. et al. Global reference mapping of human transcription factor footprints. Nature 583, 729–736 (2020).

  83. Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).

  84. Dror, I., Golan, T., Levy, C. & Rohs, R. A widespread role of the motif environment in transcription factor binding across diverse protein families. Genome Res. 25, 1268–1280 (2015).

  85. Kvon, E. Z. et al. Genome-scale functional characterization of Drosophila developmental enhancers in vivo. Nature 512, 91–95 (2014).

  86. Yan, J. et al. Systematic analysis of binding of transcription factors to noncoding variants. Nature 591, 147–151 (2021).

  87. Haberle, V. & Stark, A. Eukaryotic core promoters and the functional basis of transcription initiation. Nat. Rev. Mol. Cell Biol. 19, 621–637 (2018).

  88. Sahu, B. et al. Sequence determinants of human gene regulatory elements. Nat. Genet. 54, 283–294 (2022).

  89. Taylor, A. M. et al. Genomic and functional approaches to understanding cancer aneuploidy. Cancer Cell 33, 676–689 (2018).

  90. Baisya, D. R. & Lonardi, S. Prediction of histone post-translational modifications using deep learning. Bioinformatics 36, 5610–5617 (2020).

  91. Mauduit, D. et al. Analysis of long and short enhancers in melanoma cell states. eLife 10, e71735 (2021).

  92. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

  93. Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–329 (2015).

  94. Regev, A. et al. The human cell atlas. eLife 6, e27041 (2017).

  95. Fulco, C. P. et al. Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nat. Genet. 51, 1664–1669 (2019).

  96. Ponnaluri, V. K. C. et al. NicE-seq: High resolution open chromatin profiling. Genome Biol. 18, 122 (2017).

  97. Sloan, C. A. et al. ENCODE data at the ENCODE portal. Nucleic Acids Res. 44, D726–D732 (2016).

  98. Muerdter, F. et al. Resolving systematic errors in widely used enhancer activity assays in human cells. Nat. Methods 15, 141–149 (2018).

  99. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).

  100. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).

  101. Janky, R. et al. iRegulon: from a gene list to a gene regulatory network using large motif and track collections. PLoS Comput. Biol. 10, e1003731 (2014).

  102. Schep, A. motifmatchr: fast motif matching in R. R package version 1.14.0 https://bioconductor.org/packages/release/bioc/html/motifmatchr.html (2021).

  103. Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).

  104. Kuhn, M. caret: classification and regression training. R package version 6.0-80 https://CRAN.R-project.org/package=caret (2018).

  105. Stampfel, G. et al. Transcriptional regulators form diverse groups with context-dependent regulatory functions. Nature 528, 147–151 (2015).

  106. R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2020).

  107. Wickham, H. ggplot2: Elegant Graphics For Data Analysis (Springer, 2016); https://ggplot2.tidyverse.org

  108. Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).

  109. Avsec, Ž. et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37, 592–600 (2019).

  110. Albig, C. et al. Factor cooperation for chromosome discrimination in Drosophila. Nucleic Acids Res. 47, 1706–1724 (2019).

  111. Kwak, H., Fuda, N. J., Core, L. J. & Lis, J. T. Precise maps of RNA polymerase reveal how promoters direct initiation and pausing. Science 339, 950–953 (2013).

  112. Rickels, R. et al. An evolutionary conserved epigenetic mark of polycomb response elements implemented by Trx/MLL/COMPASS. Mol. Cell 63, 318–328 (2016).

Acknowledgements

We thank A. Andersen (Life Science Editors), V. Loubiere and F. Lorbeer (IMP) for comments on the manuscript, G. Hulselmans and S. Aerts (KU Leuven) for sharing the TF motif PWM collection, and P. Kerpedjiev for generating the dynamic sequence tracks. Deep sequencing was performed at the Vienna Biocenter Core Facilities GmbH. Research in the Stark group is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 647320) and by the Austrian Science Fund (FWF, F4303-B09). Basic research at the IMP is supported by Boehringer Ingelheim GmbH and the Austrian Research Promotion Agency (FFG).

Author information

Contributions

B.P.d.A., F.R. and A.S. conceived the project. F.R. and M.P. performed all experiments. B.P.d.A. performed all computational analyses. B.P.d.A., F.R. and A.S. interpreted the data and wrote the manuscript. A.S. supervised the project.

Corresponding author

Correspondence to Alexander Stark.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Ziga Avsec and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–28, Tables 1–18, Methods and References.

Reporting Summary

Peer Review File

Supplementary Table 1

Supplementary Tables 1–18

About this article

Cite this article

de Almeida, B.P., Reiter, F., Pagani, M. et al. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat Genet 54, 613–624 (2022). https://doi.org/10.1038/s41588-022-01048-5

  • DOI: https://doi.org/10.1038/s41588-022-01048-5
