Regulatory activity is the default DNA state in eukaryotes

Abstract

Genomes encode for genes and non-coding DNA, both capable of transcriptional activity. However, unlike canonical genes, many transcripts from non-coding DNA have limited evidence of conservation or function. Here, to determine how much biological noise is expected from non-genic sequences, we quantify the regulatory activity of evolutionarily naive DNA using RNA-seq in yeast and computational predictions in humans. In yeast, more than 99% of naive DNA bases were transcribed. Unlike the evolved transcriptome, naive transcripts frequently overlapped with opposite sense transcripts, suggesting selection favored coherent gene structures in the yeast genome. In humans, regulation-associated chromatin activity is predicted to be common in naive dinucleotide-content-matched randomized DNA. Here, naive and evolved DNA have similar co-occurrence and cell-type specificity of chromatin marks, challenging these as indicators of selection. However, in both yeast and humans, extreme high activities were rare in naive DNA, suggesting they result from selection. Overall, basal regulatory activity seems to be the default, which selection can hone to evolve a function or, if detrimental, repress.
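To make "dinucleotide-content-matched randomized DNA" concrete, here is a minimal Python sketch that samples a random sequence from a first-order Markov chain fitted to a template, so that dinucleotide usage is approximately preserved. It is an illustration under stated assumptions, not the authors' pipeline (the reference list cites BiasAway for composition-matched backgrounds); the function names `fit_transitions` and `sample_matched` and the toy template are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_transitions(template):
    """Count how often each base follows each other base in the template."""
    counts = defaultdict(Counter)
    for first, second in zip(template, template[1:]):
        counts[first][second] += 1
    return counts


def sample_matched(template, length, seed=0):
    """Sample a sequence whose dinucleotide usage approximates the template's.

    Hypothetical helper: a first-order Markov chain, not an exact shuffle.
    """
    rng = random.Random(seed)
    counts = fit_transitions(template)
    base = template[0]
    out = [base]
    while len(out) < length:
        followers = counts.get(base)
        if not followers:
            # Dead end (base never followed by anything): restart anywhere.
            base = rng.choice(template)
        else:
            bases = list(followers)
            weights = [followers[b] for b in bases]
            base = rng.choices(bases, weights=weights, k=1)[0]
        out.append(base)
    return "".join(out)


# Toy template (made up for illustration, not a sequence from the study).
template = "ACGTTGCAACGGTTAACCGGTATATCGCGAT" * 4
naive = sample_matched(template, 60, seed=1)
print(naive)
```

A Markov sampler only matches dinucleotide frequencies in expectation; an exact dinucleotide-preserving shuffle would be a stricter alternative.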

Fig. 1: Experimental approach comparing naive and evolved regulatory activity in humans and yeast.
Fig. 2: Evolutionarily naive DNA includes abundant and heterogeneous transcripts in yeast.
Fig. 3: Enformer predicts that locally dinucleotide-matched DNA has regulatory activity.
Fig. 4: Naive DNA has much of the activity of evolved sequences, and is predicted to have cell-type-specific regulatory elements, with correlated histone marks.

Data availability

All data generated in this study are available at NCBI’s GEO database under accession GSE217781.

Code availability

Code to recreate the computational analysis is available on GitHub (https://github.com/de-Boer-Lab/RGP.git).

References

  1. Cabili, M. N. et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 25, 1915–1927 (2011).

  2. Hangauer, M. J., Vaughn, I. W. & McManus, M. T. Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs. PLoS Genet. 9, e1003569 (2013).

  3. Derrien, T. et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 22, 1775–1789 (2012).

  4. Ramos, A. D. et al. Integration of genome-wide approaches identifies lncRNAs of adult neural stem cells and their progeny in vivo. Cell Stem Cell 12, 616–628 (2013).

  5. Hon, C.-C. et al. An atlas of human long non-coding RNAs with accurate 5′ ends. Nature 543, 199–204 (2017).

  6. Ponting, C. P. & Haerty, W. Genome-wide analysis of human long noncoding RNAs: a provocative review. Annu. Rev. Genomics Hum. Genet. 23, 153–172 (2022).

  7. Palazzo, A. F. & Lee, E. S. Non-coding RNA: what is functional and what is junk? Front. Genet. 6, 2 (2015).

  8. Pertea, M. et al. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol. 19, 208 (2018).

  9. Chen, J. et al. Evolutionary analysis across mammals reveals distinct classes of long non-coding RNAs. Genome Biol. 17, 19 (2016).

  10. Dinger, M. E., Amaral, P. P., Mercer, T. R. & Mattick, J. S. Pervasive transcription of the eukaryotic genome: functional indices and conceptual implications. Brief. Funct. Genomic Proteomic 8, 407–423 (2009).

  11. Ulitsky, I. & Bartel, D. P. lincRNAs: genomics, evolution, and mechanisms. Cell 154, 26–46 (2013).

  12. Mercer, T. R., Dinger, M. E. & Mattick, J. S. Long non-coding RNAs: insights into functions. Nat. Rev. Genet. 10, 155–159 (2009).

  13. Fernandes, J. C. R., Acuña, S. M., Aoki, J. I., Floeter-Winter, L. M. & Muxel, S. M. Long non-coding RNAs in the regulation of gene expression: physiology and disease. Noncoding RNA 5, 17 (2019).

  14. Uszczynska-Ratajczak, B., Lagarde, J., Frankish, A., Guigó, R. & Johnson, R. Towards a complete map of the human long non-coding RNA transcriptome. Nat. Rev. Genet. 19, 535–548 (2018).

  15. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

  16. Ponting, C. P. & Hardison, R. C. What fraction of the human genome is functional? Genome Res. 21, 1769–1776 (2011).

  17. Graur, D. An upper limit on the functional fraction of the human genome. Genome Biol. Evol. 9, 1880–1885 (2017).

  18. Struhl, K. Transcriptional noise and the fidelity of initiation by RNA polymerase II. Nat. Struct. Mol. Biol. 14, 103–105 (2007).

  19. Robinson, R. Dark matter transcripts: sound and fury, signifying nothing? PLoS Biol. 8, e1000370 (2010).

  20. Eddy, S. R. The ENCODE project: missteps overshadowing a success. Curr. Biol. 23, R259–R261 (2013).

  21. Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).

  22. Nutiu, R. et al. Direct measurement of DNA affinity landscapes on a high-throughput sequencing instrument. Nat. Biotechnol. 29, 659–664 (2011).

  23. Yona, A. H., Alm, E. J. & Gore, J. Random sequences rapidly evolve into de novo promoters. Nat. Commun. 9, 1530 (2018).

  24. Vaishnav, E. D. et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 603, 455–463 (2022).

  25. de Boer, C. G. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat. Biotechnol. 38, 56–65 (2020).

  26. Sahu, B. et al. Sequence determinants of human gene regulatory elements. Nat. Genet. 54, 283–294 (2022).

  27. White, M. A., Myers, C. A., Corbo, J. C. & Cohen, B. A. Massively parallel in vivo enhancer assay reveals that highly local features determine the cis-regulatory function of ChIP–seq peaks. Proc. Natl Acad. Sci. USA 110, 11952–11957 (2013).

  28. Galupa, R. et al. Enhancer architecture and chromatin accessibility constrain phenotypic space during Drosophila development. Dev. Cell 58, 51–62.e4 (2023).

  29. Cuperus, J. T. et al. Deep learning of the regulatory grammar of yeast 5′ untranslated regions from 500,000 random sequences. Genome Res. 27, 2015–2024 (2017).

  30. Sample, P. J. et al. Human 5′ UTR design and variant effect prediction from a massively parallel translation assay. Nat. Biotechnol. 37, 803–809 (2019).

  31. Bogard, N., Linder, J., Rosenberg, A. B. & Seelig, G. A deep neural network for predicting and engineering alternative polyadenylation. Cell 178, 91–106 (2019).

  32. Rosenberg, A. B., Patwardhan, R. P., Shendure, J. & Seelig, G. Learning the sequence determinants of alternative splicing from millions of random sequences. Cell 163, 698–711 (2015).

  33. de Boer, C. G. et al. A unified model for yeast transcript definition. Genome Res. 24, 154–166 (2014).

  34. Gvozdenov, Z., Barcutean, Z. & Struhl, K. Functional analysis of a random-sequence chromosome reveals a high level and the molecular nature of transcriptional noise in yeast cells. Mol. Cell. 83, 1786–1797.e5 (2023).

  35. Zhou, J. et al. Exogenous artificial DNA forms chromatin structure with active transcription in yeast. Sci. China Life Sci. 65, 851–860 (2022).

  36. Scherer, S. W. et al. Human chromosome 7: DNA sequence and biology. Science 300, 767–772 (2003).

  37. Parfrey, L. W., Lahr, D. J. G., Knoll, A. H. & Katz, L. A. Estimating the timing of early eukaryotic diversification with multigene molecular clocks. Proc. Natl Acad. Sci. USA 108, 13624–13629 (2011).

  38. Eme, L., Sharpe, S. C., Brown, M. W. & Roger, A. J. On the age of eukaryotes: evaluating evidence from fossils and molecular clocks. Cold Spring Harb. Perspect. Biol. 6, a016139 (2014).

  39. Smale, S. T. & Kadonaga, J. T. The RNA polymerase II core promoter. Annu. Rev. Biochem. 72, 449–479 (2003).

  40. Ulbricht, R. J. & Olivas, W. M. Puf1p acts in combination with other yeast Puf proteins to control mRNA stability. RNA 14, 246–262 (2008).

  41. Schirman, D., Yakhini, Z., Pilpel, Y. & Dahan, O. A broad analysis of splicing regulation in yeast using a large library of synthetic introns. PLoS Genet. 17, e1009805 (2021).

  42. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).

  43. Karollus, A., Mauermeier, T. & Gagneur, J. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biol. 24, 56 (2023).

  44. Kimura, H. Histone modifications for human epigenome analysis. J. Hum. Genet 58, 439–445 (2013).

  45. Karlin, S. Global dinucleotide signatures and analysis of genomic heterogeneity. Curr. Opin. Microbiol. 1, 598–610 (1998).

  46. Mariño-Ramírez, L., Spouge, J. L., Kanga, G. C. & Landsman, D. Statistical analysis of over-represented words in human promoter sequences. Nucleic Acids Res. 32, 5972 (2004).

  47. Bird, A. P. DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Res. 8, 1499–1504 (1980).

  48. Agarwal, V. & Shendure, J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 31, 107663 (2020).

  49. Holoch, D. & Margueron, R. Mechanisms regulating PRC2 recruitment and enzymatic activity. Trends Biochem. Sci. 42, 531–542 (2017).

  50. Malik, H. S. & Henikoff, S. Phylogenomics of the nucleosome. Nat. Struct. Biol. 10, 882–891 (2003).

  51. Kimura, M. Evolutionary rate at the molecular level. Nature 217, 624–626 (1968).

  52. Tenesa, A. et al. Recent human effective population size estimated from linkage disequilibrium. Genome Res. 17, 520–526 (2007).

  53. Sherry, S. T., Harpending, H. C., Batzer, M. A. & Stoneking, M. Alu evolution in human populations: using the coalescent to estimate effective population size. Genetics 147, 1977–1982 (1997).

  54. Hawks, J. In Recent Advances in Palaeodemography: Data, Techniques, Patterns (ed. Bocquet-Appel, J.-P.) 9–30 (Springer, 2008).

  55. Tsai, I. J., Bensasson, D., Burt, A. & Koufopanou, V. Population genomics of the wild yeast Saccharomyces paradoxus: quantifying the life cycle. Proc. Natl Acad. Sci. USA 105, 4957–4962 (2008).

  56. Huang, Y.-F. & Niu, D.-K. Evidence against the energetic cost hypothesis for the short introns in highly expressed genes. BMC Evol. Biol. 8, 154 (2008).

  57. Palazzo, A. F. & Gregory, T. R. The case for junk DNA. PLoS Genet. 10, e1004351 (2014).

  58. Schulz, D. et al. Transcriptome surveillance by selective termination of noncoding RNA synthesis. Cell 155, 1075–1087 (2013).

  59. de Boer, C. Mechanisms of Yeast Gene Definition (University of Toronto, 2014).

  60. Emera, D., Yin, J., Reilly, S. K., Gockley, J. & Noonan, J. P. Origin and evolution of developmental enhancers in the mammalian neocortex. Proc. Natl Acad. Sci. USA 113, E2617–E2626 (2016).

  61. Oss, S. B. V. & Carvunis, A.-R. De novo gene birth. PLoS Genet. 15, e1008160 (2019).

  62. Weisman, C. M. & Eddy, S. R. Gene evolution: getting something from nothing. Curr. Biol. 27, R661–R663 (2017).

  63. Keefe, A. D. & Szostak, J. W. Functional proteins from a random-sequence library. Nature 410, 715–718 (2001).

  64. Blevins, W. R. et al. Uncovering de novo gene birth in yeast using deep transcriptomics. Nat. Commun. 12, 604 (2021).

  65. Hall, C., Brachat, S. & Dietrich, F. S. Contribution of horizontal gene transfer to the evolution of Saccharomyces cerevisiae. Eukaryot. Cell 4, 1102–1115 (2005).

  66. Keeling, P. J. & Palmer, J. D. Horizontal gene transfer in eukaryotic evolution. Nat. Rev. Genet. 9, 605–618 (2008).

  67. Fitzpatrick, D. A. Horizontal gene transfer in fungi. FEMS Microbiol. Lett. 329, 1–8 (2012).

  68. Camellato, B. R., Brosh, R., Ashe, H. J., Maurano, M. T. & Boeke, J. D. Synthetic reversed sequences reveal default genomic states. Nature (in the press).

  69. Landt, S. G. et al. ChIP–seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 22, 1813–1831 (2012).

  70. Jung, Y. L. et al. Impact of sequencing depth in ChIP–seq experiments. Nucleic Acids Res. 42, e74 (2014).

  71. de Boer, C. G. & Taipale, J. Hold out the genome: a roadmap to solving the cis-regulatory code. Nature 625, 41–50 (2024).

  72. Scherer, S. W., Tompkins, B. J. F. & Tsui, L.-C. A human chromosome 7-specific genomic DNA library in yeast artificial chromosomes. Mamm. Genome 3, 179–181 (1992).

  73. Blackburn:Yeast Colony PCR v2.0. OpenWetWare https://openwetware.org/wiki/Blackburn:Yeast_Colony_PCR_v2.0

  74. Kunz, J. et al. Regional localization of 725 human chromosome 7-specific yeast artificial chromosome clones. Genomics 22, 439–448 (1994).

  75. Stuecker, T. RNA Isolation from Yeast. protocols.io https://www.protocols.io/view/rna-isolation-from-yeast-inwcdfe (2017).

  76. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).

  77. Andrews, S. FastQC: a quality control tool for high throughput sequence data. (2010).

  78. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

  79. Trapnell, C. et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).

  80. Ramírez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).

  81. Bailey, T. L. & Grant, C. E. SEA: simple enrichment analysis of motifs. Preprint at bioRxiv https://doi.org/10.1101/2021.08.23.457422 (2021).

  82. de Boer, C. G. & Hughes, T. R. YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic Acids Res. 40, D169–D179 (2012).

  83. Piovesan, A. et al. On the length, weight and GC content of the human genome. BMC Res. Notes 12, 106 (2019).

  84. Khan, A., Riudavets Puig, R., Boddie, P. & Mathelier, A. BiasAway: command-line and web server to generate nucleotide composition-matched DNA background sequences. Bioinformatics 37, 1607–1609 (2021).

  85. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

  86. Weirauch, M. T. et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443 (2014).

Acknowledgements

We thank The Centre for Applied Genomics (ref. 72) and the S. Scherer lab for the human chromosome 7 YACs, and BRC-seq (University of British Columbia) for assisting with RNA-seq. We would also like to thank the labs of Y.-J. Yuan and W. Tao (Tianjin University and Peking University, respectively) for the data-storage YAC reference sequence and helpful insights. This research was supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2020-05425 to C.G.D.), the Stem Cell Network (ECR-C4R1-7 to C.G.D.) and the Canadian Institutes of Health Research (PJT-180537 to C.G.D.). C.G.D. is a Michael Smith Health Research BC Scholar. This research was enabled in part by support provided by WestGrid, Compute Canada (www.computecanada.ca) and Advanced Research Computing at the University of British Columbia.

Author information

Contributions

I.L. and X.E.C. created sequence datasets, set up the Enformer code and performed Enformer analysis. A.M.R. and A.L.S. set up Enformer code. C.J. performed the experiments and analyzed the yeast data. C.G.D. conceived the project and supervised the research.

Corresponding author

Correspondence to Carl G. de Boer.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Structural & Molecular Biology thanks the anonymous reviewers for their contribution to the peer review of this work. Carolina Perdigoto was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 A data storage chromosome lacking evolutionary history produces abundant and heterogeneous transcripts in yeast.

(a) Genome browser shot of 22 kb of the data-carrying chromosome expressed in yeast. Blue = sense transcription; red = antisense transcription (both log scale), with predicted transcripts indicated (red and blue). One representative replicate is shown. (b) Boxplot of lengths (y-axes) for evolved and naïve transcripts (x-axes and colours). Mean transcript lengths are longer in naïve than in evolved DNA (two-sided Wilcoxon rank sum; rep 1 p = 1.8 × 10^−24, rep 2 p = 5.3 × 10^−16, rep 3 p = 3.4 × 10^−27). (c) Boxplot of expression levels (FPKM; y-axes) for evolved and naïve transcripts (x-axes and colours). Mean transcript FPKM is higher for evolved DNA (two-sided Wilcoxon rank sum; rep 1 p = 1.3 × 10^−12, rep 2 p = 6.9 × 10^−11, rep 3 p = 2.90 × 10^−16). (d) Contour plot showing expression on one strand relative to the other (x- and y-axes) for evolved (upper left triangle) and naïve (lower right triangle) DNA. For (b–d), replicates (three separate RNA-seq experiments from the same yeast strain) are shown as subplots. For (b, c), n = total number of predicted transcripts for evolved or naïve DNA for each indicated replicate. Box limits = 25th and 75th percentiles; centre line = median; whiskers = farthest value within 1.5× interquartile range; points = outliers.
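The box-plot conventions stated in this legend (quartile box, median line, whiskers at the farthest value within 1.5× the interquartile range, remaining points as outliers) can be sketched in a few lines of Python. This is an illustrative helper, not the authors' plotting code; the function name `box_stats`, the choice of the "inclusive" quantile method, and the toy values are assumptions.

```python
import statistics


def box_stats(values):
    """Summarise values using the box-plot conventions in the legend."""
    # Quartiles; the "inclusive" interpolation method is an assumption here.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # farthest value within the lower fence
        "whisker_high": max(inside),   # farthest value within the upper fence
        "outliers": outliers,
    }


# Made-up transcript lengths (arbitrary units), not data from the study.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(stats)
```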

Extended Data Fig. 2 Little correspondence between human transcription start sites and promoter activity in yeast.

(a) Metagene profile of the expression level (y-axis) surrounding human transcript start sites (Human RefGene, x-axis) from DNA in human YAC 1 & 2. (b) RNA-seq coverage (colour) heatmaps for regions surrounding predicted human transcription start sites (x-axes) of DNA in human YAC 1 & 2. (c) Metagene profile of the expression level (y-axis) surrounding TSSs of the endogenous yeast genome (yeast RefGene; shown as a control). (d) RNA-seq coverage (colour) heatmaps for regions surrounding endogenous yeast genome transcription start sites (x-axes).

Extended Data Fig. 3 Example spliced naïve transcript.

Data from human YAC 1 is shown. Reads (top) and splice junctions for three isoforms (middle) are shown, with the DNA sequence (colours) at the mid bottom. At the very bottom matches to the known splicing motifs are indicated. The transcript is expressed in the antisense direction.

Extended Data Fig. 4 Evolved DNA has less antisense expression, and more expression extremes.

(a, b) Cumulative fractions of base pairs (y-axes) for each RNA-seq read coverage (x-axes), for both naïve and evolved DNA (colours), for (a) the minimum coverage of the two DNA strands or (b) the maximum coverage of the two strands. P-values are from a two-sided Kolmogorov–Smirnov test, using only every 1000th base (to ensure independence between samples). Data from the strains carrying human YAC 1 (left panel) and human YAC 2 (right panel) are shown as subplots. n_evolved = 12054 for both strains; n_naïve = 760 for human YAC 1 and n_naïve = 811 for human YAC 2. Plots include all bases.
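The quantities in this legend can be illustrated with a small pure-Python sketch: per-base minimum and maximum coverage across the two strands, a cumulative fraction at a given coverage, and the two-sample Kolmogorov–Smirnov statistic D computed on a subsample of bases. The helper names and toy coverage vectors are hypothetical, and the sketch computes only D, not the p-value the authors report.

```python
from bisect import bisect_right


def strand_min_max(sense, antisense):
    """Per-base minimum and maximum coverage across the two strands."""
    lows = [min(s, a) for s, a in zip(sense, antisense)]
    highs = [max(s, a) for s, a in zip(sense, antisense)]
    return lows, highs


def cumulative_fraction(values, threshold):
    """Fraction of bases with coverage <= threshold."""
    ordered = sorted(values)
    return bisect_right(ordered, threshold) / len(ordered)


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Toy coverage vectors (made-up numbers, not data from the study).
evolved_sense = [50, 40, 0, 0, 60, 55]
evolved_anti = [0, 1, 30, 45, 2, 0]
naive_sense = [5, 6, 4, 7, 5, 6]
naive_anti = [4, 5, 6, 3, 5, 4]

evolved_min, _ = strand_min_max(evolved_sense, evolved_anti)
naive_min, _ = strand_min_max(naive_sense, naive_anti)

step = 2  # the legend uses every 1000th base; 2 keeps the toy example small
d = ks_statistic(evolved_min[::step], naive_min[::step])
print(d)
```

Subsampling with a fixed stride is what reduces the dependence between neighbouring bases, since adjacent positions are usually covered by the same reads.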

Extended Data Fig. 5 Expected TF motifs are enriched in both evolved and naïve DNA predicted promoter regions.

(a) The enrichment ratio of motifs at known TSSs relative to control regions (+50 to +150 downstream of TSSs) for both evolved (x-axis) and naïve DNA (human YACs; y-axis). Values above 1 signify enrichment (more motif matches in promoters than in control sequences). Promoter-defining motifs (blue) are bound by yeast-specific TFs (not found in humans). The line y = x (light blue) signifies equal enrichment in both evolved and naïve DNA. Lower enrichment in the naïve DNA likely results, in part, from difficulties in delineating transcription start sites (and thus promoter regions) in widely transcribed naïve DNA and from differences in base composition between human and yeast. (b) The ln(e-value) of enrichment for each motif in (a). All motifs below the dotted blue line are significantly enriched in naïve DNA promoters; all motifs to the left of the red dotted line are significantly enriched in evolved DNA promoters.
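As a rough illustration of an enrichment ratio of this kind, the sketch below compares the fraction of promoter sequences containing a motif with the fraction of control sequences containing it. It is a simplified stand-in, not the SEA tool cited in the references: the motif is a plain regular expression rather than a position weight matrix, the pseudocount of 1 is an assumption, and the function names and sequences are invented for the example.

```python
import re


def count_matches(sequences, motif_regex):
    """Number of sequences containing at least one motif match."""
    pattern = re.compile(motif_regex)
    return sum(1 for seq in sequences if pattern.search(seq))


def enrichment_ratio(promoters, controls, motif_regex, pseudocount=1):
    """Match rate in promoters divided by match rate in controls.

    Values above 1 mean the motif is more common in promoters.
    The pseudocount (an assumption) avoids division by zero.
    """
    promoter_rate = (count_matches(promoters, motif_regex) + pseudocount) / (
        len(promoters) + pseudocount
    )
    control_rate = (count_matches(controls, motif_regex) + pseudocount) / (
        len(controls) + pseudocount
    )
    return promoter_rate / control_rate


# Invented sequences; "TATA[AT]A" is a simple TATA-like pattern.
promoters = ["GGTATAAAGC", "CCTATATAGG", "ACGTACGTAC", "TTTATAAATT"]
controls = ["ACGTACGTAC", "GGGCCCGGGC", "TATAAAGGCC", "CCCGGGAAAC"]
ratio = enrichment_ratio(promoters, controls, "TATA[AT]A")
print(ratio)
```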

Extended Data Fig. 6 Comparing Enformer predictions to random sequence model.

Enformer predictions for both naïve and genomic sequences for the LOVO DNase track are binned (x-axis) and compared to the predicted probability of expression (y-axis) from a random-sequence-trained model (ref. 26). Sequences with Sahu predictions ≥ 0.5 are predicted to be active. Genomic and naïve sequences with higher predicted activity by Enformer are consistent with the results from a model trained on sequences from a random-sequence promoter assay and would be labelled as active. Box limits = 25th and 75th percentiles; centre line = median; whiskers = farthest value within 1.5× interquartile range.

Extended Data Fig. 7 iPSC-related TFs (MYC, NANOG, KLF4) are significantly enriched in both genomic and naïve sequences.

(a) The ln(e-values) of each TF in both the genomic (x-axis) and naïve sequences (y-axis) (n = 1065). Master regulators of PSCs (KLF4, NANOG, and MYC) are significantly enriched in both naïve and genomic sequences. SOX2 and POU5F1 can be seen in the upper right-hand corner (non-significant). (b) Enrichment ratio for genomic (x-axis) and naïve (y-axis) sequences for the same TFs as in (a).

Extended Data Fig. 8 Different chromatin marks appear with different abundance in naïve DNA.

For each Enformer-predicted chromatin mark value (x-axes), the fraction of prediction bins with at least this value that are in ENCODE peaks (left y-axes; cyan) and the ratio of evolved:naïve bins (right y-axes; yellow), as in Fig. 4a. Chromatin marks include (a) H3K27me3, (b) H3K4me3, (c) H3K4me1, (d) H3K27ac.
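The two curves described here can be sketched as a threshold sweep: for each predicted value, count the bins at or above it, then report the fraction of those bins that overlap peaks and the ratio of evolved to naïve bins. The sketch assumes that peak overlap is evaluated on evolved (genomic) bins only, which the legend does not state explicitly; the function name `sweep` and all numbers are made up.

```python
def sweep(evolved_preds, evolved_in_peak, naive_preds, thresholds):
    """For each threshold, return (threshold, fraction in peaks, evolved:naive ratio)."""
    rows = []
    for t in thresholds:
        # Peak flags of evolved bins whose prediction reaches the threshold.
        passing = [
            in_peak
            for value, in_peak in zip(evolved_preds, evolved_in_peak)
            if value >= t
        ]
        n_evolved = len(passing)
        n_naive = sum(1 for value in naive_preds if value >= t)
        frac_in_peak = sum(passing) / n_evolved if n_evolved else None
        ratio = n_evolved / n_naive if n_naive else None  # undefined if no naive bins
        rows.append((t, frac_in_peak, ratio))
    return rows


# Made-up predictions and peak flags (not data from the study).
evolved_preds = [0.1, 0.5, 0.9, 2.0, 3.5, 5.0]
evolved_in_peak = [False, False, True, True, True, True]
naive_preds = [0.1, 0.2, 0.6, 0.8, 1.1, 2.2]

rows = sweep(evolved_preds, evolved_in_peak, naive_preds, [0.5, 1.0, 3.0])
for row in rows:
    print(row)
```

In this toy data the ratio becomes undefined at the highest threshold because no naïve bin reaches it, which mirrors the abstract's point that extreme activities are rare in naïve DNA.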

Extended Data Fig. 9 Similar relationships between chromatin marks for naïve and evolved sequences.

Scatter plots of predicted values for all 1000 regions between all pairs of the 5 tracks. Each scatter plot shows the co-occurrence of the predicted values between two tracks (indicated by axis labels). Scatter plots on the diagonal (upper left to lower right) show the comparison between the naïve and evolved predicted values for the same track.

Extended Data Fig. 10 Similar chromatin mark correlations across cell types.

Pearson correlation coefficients (r; y-axes) for different pairs of chromatin marks (x-axes) for both naïve and genomic sequences (colours) across 11 human cell types (indicated above each graph).
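A minimal sketch of the correlation analysis described in this legend: compute Pearson's r for every pair of chromatin-mark tracks within one sequence set. The implementation is plain Python and the track values are invented; the function names are hypothetical and this is not the authors' analysis code.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(tracks):
    """Correlation for every pair of marks in a {mark: predictions} mapping."""
    return {
        (m1, m2): pearson_r(tracks[m1], tracks[m2])
        for m1, m2 in combinations(sorted(tracks), 2)
    }


# Made-up predicted values for four regions (not data from the study).
naive_tracks = {
    "H3K4me3": [0.2, 0.9, 0.4, 1.5],
    "H3K27ac": [0.3, 1.1, 0.5, 1.4],
    "H3K27me3": [1.2, 0.2, 0.9, 0.1],
}
correlations = pairwise_correlations(naive_tracks)
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

Running the same function on the genomic tracks for each cell type would give the paired values that the figure compares.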

Supplementary information

Supplementary Information

Supplementary Figs. 1–9 and Supplementary Tables 1–4.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Luthra, I., Jensen, C., Chen, X.E. et al. Regulatory activity is the default DNA state in eukaryotes. Nat Struct Mol Biol 31, 559–567 (2024). https://doi.org/10.1038/s41594-024-01235-4

  • DOI: https://doi.org/10.1038/s41594-024-01235-4
