The evolution, evolvability and engineering of gene regulatory DNA

Abstract

Mutations in non-coding regulatory DNA sequences can alter gene expression, organismal phenotype and fitness (refs. 1–3). Constructing complete fitness landscapes, in which DNA sequences are mapped to fitness, is a long-standing goal in biology, but has remained elusive because it is challenging to generalize reliably to vast sequence spaces (refs. 4–6). Here we build sequence-to-expression models that capture fitness landscapes and use them to decipher principles of regulatory evolution. Using millions of randomly sampled promoter DNA sequences and their measured expression levels in the yeast Saccharomyces cerevisiae, we learn deep neural network models that generalize with excellent prediction performance, and enable sequence design for expression engineering. Using our models, we study expression divergence under genetic drift and strong-selection weak-mutation regimes to find that regulatory evolution is rapid and subject to diminishing returns epistasis; that conflicting expression objectives in different environments constrain expression adaptation; and that stabilizing selection on gene expression leads to the moderation of regulatory complexity. We present an approach for using such models to detect signatures of selection on expression from natural variation in regulatory sequences and use it to discover an instance of convergent regulatory evolution. We assess mutational robustness, finding that regulatory mutation effect sizes follow a power law, characterize regulatory evolvability, visualize promoter fitness landscapes, discover evolvability archetypes and illustrate the mutational robustness of natural regulatory sequence populations. Our work provides a general framework for designing regulatory sequences and addressing fundamental questions in regulatory evolution.
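
The two evolutionary regimes named in the abstract can be illustrated with a minimal sketch. It assumes a toy scoring function (`toy_expression`, hypothetical) in place of the trained neural network, treats random genetic drift as fixing a random single-base substitution at each step, and treats strong-selection weak-mutation (SSWM) as a greedy walk that fixes the best single-base mutant only when it improves the objective. All function names here are illustrative and are not taken from the authors' code.

```python
import random

BASES = "ACGT"

def toy_expression(seq):
    """Hypothetical stand-in for a sequence-to-expression model.

    Counts 'TATA' occurrences and adds a small GC-content term.
    This is NOT the model described in the paper.
    """
    motif_hits = sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3))
    gc = sum(b in "GC" for b in seq) / len(seq)
    return motif_hits + 0.5 * gc

def single_mutants(seq):
    """Yield every sequence one substitution away (3 * len(seq) neighbours)."""
    for i, old in enumerate(seq):
        for new in BASES:
            if new != old:
                yield seq[:i] + new + seq[i + 1:]

def drift_step(seq, rng):
    """Random genetic drift: fix one random substitution regardless of effect."""
    i = rng.randrange(len(seq))
    new = rng.choice([b for b in BASES if b != seq[i]])
    return seq[:i] + new + seq[i + 1:]

def sswm_step(seq, model, maximize=True):
    """Greedy SSWM-style step: fix the best single mutant if it improves."""
    pick = max if maximize else min
    best = pick(single_mutants(seq), key=model)
    if maximize:
        improved = model(best) > model(seq)
    else:
        improved = model(best) < model(seq)
    return best if improved else seq

def trajectory(seq, step, n_steps):
    """Apply `step` repeatedly and return the list of visited sequences."""
    out = [seq]
    for _ in range(n_steps):
        out.append(step(out[-1]))
    return out

rng = random.Random(0)
start = "".join(rng.choice(BASES) for _ in range(20))
up = trajectory(start, lambda s: sswm_step(s, toy_expression, True), 5)
down = trajectory(start, lambda s: sswm_step(s, toy_expression, False), 5)
drift = trajectory(start, lambda s: drift_step(s, rng), 5)
```

In this sketch the predicted expression along `up` can never decrease and along `down` can never increase, whereas `drift` changes one base per step with no regard to its effect.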

Fig. 1: The evolution, evolvability and engineering of gene regulatory DNA.
Fig. 2: The evolutionary malleability of gene expression.
Fig. 3: The ECC detects signatures of selection on gene expression using natural genetic variation in regulatory DNA.
Fig. 4: The evolvability vector captures fitness landscapes.

Data availability

Data generated for this study are available at the NCBI GEO with accession numbers GSE163045 and GSE163866. All models and processed data are available on Zenodo at https://zenodo.org/record/4436477.

Code availability

Code is available on GitHub at https://github.com/1edv/evolution and CodeOcean at https://codeocean.com/capsule/8020974/tree. A web app is available at https://1edv.github.io/evolution/.

References

  1. Wittkopp, P. J. & Kalay, G. Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. Nat. Rev. Genet. 13, 59–69 (2011).

  2. Hill, M. S., Vande Zande, P. & Wittkopp, P. J. Molecular and evolutionary processes generating variation in gene expression. Nat. Rev. Genet. 22, 203–215 (2021).

  3. Fuqua, T. et al. Dense and pleiotropic regulatory information in a developmental enhancer. Nature 587, 235–239 (2020).

  4. de Visser, J. A. G. M. & Krug, J. Empirical fitness landscapes and the predictability of evolution. Nat. Rev. Genet. 15, 480–490 (2014).

  5. Kondrashov, D. A. & Kondrashov, F. A. Topological features of rugged fitness landscapes in sequence space. Trends Genet. 31, 24–33 (2015).

  6. de Visser, J. A. G. M., Elena, S. F., Fragata, I. & Matuszewski, S. The utility of fitness landscapes and big data for predicting evolution. Heredity 121, 401–405 (2018).

  7. Weirauch, M. T. & Hughes, T. R. Conserved expression without conserved regulatory sequence: the more things change, the more they stay the same. Trends Genet. 26, 66–74 (2010).

  8. Orr, H. A. The genetic theory of adaptation: a brief history. Nat. Rev. Genet. 6, 119–127 (2005).

  9. Weinreich, D. M., Lan, Y., Wylie, C. S. & Heckendorn, R. B. Should evolutionary geneticists worry about higher-order epistasis? Curr. Opin. Genet. Dev. 23, 700–707 (2013).

  10. Venkataram, S. et al. Development of a comprehensive genotype-to-fitness map of adaptation-driving mutations in yeast. Cell 166, 1585–1596 (2016).

  11. Keren, L. et al. Massively parallel interrogation of the effects of gene expression levels on fitness. Cell 166, 1282–1294 (2016).

  12. Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).

  13. Ogden, P. J., Kelsic, E. D., Sinai, S. & Church, G. M. Comprehensive AAV capsid fitness landscape reveals a viral gene and enables machine-guided design. Science 366, 1139–1143 (2019).

  14. Pitt, J. N. & Ferré-D’Amaré, A. R. Rapid construction of empirical RNA fitness landscapes. Science 330, 376–379 (2010).

  15. Shultzaberger, R. K., Malashock, D. S., Kirsch, J. F. & Eisen, M. B. The fitness landscapes of cis-acting binding sites in different promoter and environmental contexts. PLoS Genet. 6, e1001042 (2010).

  16. Mustonen, V., Kinney, J., Callan, C. G. & Lässig, M. Energy-dependent fitness: a quantitative model for the evolution of yeast transcription factor binding sites. Proc. Natl Acad. Sci. USA 105, 12376–12381 (2008).

  17. Hartl, D. L. What can we learn from fitness landscapes? Curr. Opin. Microbiol. 21, 51–57 (2014).

  18. Otwinowski, J. & Nemenman, I. Genotype to phenotype mapping and the fitness landscape of the E. coli lac promoter. PLoS ONE 8, e61570 (2013).

  19. Sinai, S. & Kelsic, E. D. A primer on model-guided exploration of fitness landscapes for biological sequence design. Preprint at https://arxiv.org/abs/2010.10614 (2020).

  20. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).

  21. Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).

  22. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. Proc. 34th International Conference on Machine Learning 3145–3153 (2017).

  23. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).

  24. Fragata, I., Blanckaert, A., Louro, M. A. D., Liberles, D. A. & Bank, C. Evolution in the light of fitness landscape theory. Trends Ecol. Evol. 34, 69–82 (2019).

  25. Payne, J. L. & Wagner, A. The causes of evolvability and their evolution. Nat. Rev. Genet. 20, 24–38 (2019).

  26. de Boer, C. G. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat. Biotechnol. 38, 56–65 (2020).

  27. Crocker, J. et al. Low affinity binding site clusters confer hox specificity and regulatory robustness. Cell 160, 191–203 (2015).

  28. Habib, N., Wapinski, I., Margalit, H., Regev, A. & Friedman, N. A functional selection model explains evolutionary robustness despite plasticity in regulatory networks. Mol. Syst. Biol. 8, 619 (2012).

  29. Gillespie, J. H. Molecular evolution over the mutational landscape. Evolution 38, 1116–1129 (1984).

  30. Jerison, E. R. & Desai, M. M. Genomic investigations of evolutionary dynamics and epistasis in microbial evolution experiments. Curr. Opin. Genet. Dev. 35, 33–39 (2015).

  31. Sæther, B.-E. & Engen, S. The concept of fitness in fluctuating environments. Trends Ecol. Evol. 30, 273–281 (2015).

  32. Vaswani, A. et al. in Advances in Neural Information Processing Systems 30 (eds. Guyon, I. et al.) 5998–6008 (Curran Associates, 2017).

  33. Weirauch, M. T. et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31, 126–134 (2013).

  34. Yang, Z. & Bielawski, J. P. Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. 15, 496–503 (2000).

  35. Moses, A. M. Statistical tests for natural selection on regulatory regions based on the strength of transcription factor binding sites. BMC Evol. Biol. 9, 286 (2009).

  36. Rifkin, S. A., Houle, D., Kim, J. & White, K. P. A mutation accumulation assay reveals a broad capacity for rapid evolution of gene expression. Nature 438, 220–223 (2005).

  37. Peter, J. et al. Genome evolution across 1,011 Saccharomyces cerevisiae isolates. Nature 556, 339–344 (2018).

  38. Erb, I. & van Nimwegen, E. Transcription factor binding site positioning in yeast: proximal promoter motifs characterize TATA-less promoters. PLoS ONE 6, e24279 (2011).

  39. Gilad, Y., Oshlack, A. & Rifkin, S. A. Natural selection on gene expression. Trends Genet. 22, 456–461 (2006).

  40. Alhusaini, N. & Coller, J. The deadenylase components Not2p, Not3p, and Not5p promote mRNA decapping. RNA 22, 709–721 (2016).

  41. Yang, J.-R., Maclean, C. J., Park, C., Zhao, H. & Zhang, J. Intra and interspecific variations of gene expression levels in yeast are largely neutral: (Nei Lecture, SMBE 2016, Gold Coast). Mol. Biol. Evol. 34, 2125–2139 (2017).

  42. Chen, J. et al. A quantitative framework for characterizing the evolutionary history of mammalian gene expression. Genome Res. 29, 53–63 (2019).

  43. Payne, J. L. & Wagner, A. Mechanisms of mutational robustness in transcriptional regulation. Front. Genet. 6, 322 (2015).

  44. Shoval, O. et al. Evolutionary trade-offs, Pareto optimality, and the geometry of phenotype space. Science 336, 1157–1160 (2012).

  45. van Dijk, D. et al. Finding archetypal spaces using neural networks. IEEE International Conference on Big Data 2634–2643 (2019).

  46. He, X., Duque, T. S. P. C. & Sinha, S. Evolutionary origins of transcription factor binding site clusters. Mol. Biol. Evol. 29, 1059–1070 (2012).

  47. Cliften, P. F. et al. Surveying Saccharomyces genomes to identify functional elements by comparative DNA sequence analysis. Genome Res. 11, 1175–1186 (2001).

  48. Heinz, S., Romanoski, C. E., Benner, C. & Glass, C. K. The selection and function of cell type-specific enhancers. Nat. Rev. Mol. Cell Biol. 16, 144–154 (2015).

  49. Lehner, B. Selection to minimise noise in living systems and its implications for the evolution of gene expression. Mol. Syst. Biol. 4, 170 (2008).

  50. Metzger, B. P. H., Yuan, D. C., Gruber, J. D., Duveau, F. & Wittkopp, P. J. Selection on noise constrains variation in a eukaryotic promoter. Nature 521, 344–347 (2015).

  51. Kosuri, S. et al. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl Acad. Sci. USA 110, 14024–14029 (2013).

  52. Shalem, O. et al. Systematic dissection of the sequence determinants of gene 3’ end mediated expression control. PLoS Genet. 11, e1005147 (2015).

  53. Kinney, J. B., Murugan, A., Callan, C. G. Jr & Cox, E. C. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl Acad. Sci. USA 107, 9158–9163 (2010).

  54. Sharon, E. et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol. 30, 521–530 (2012).

  55. Melnikov, A. et al. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nat. Biotechnol. 30, 271–277 (2012).

  56. Kwasnieski, J. C., Mogno, I., Myers, C. A., Corbo, J. C. & Cohen, B. A. Complex effects of nucleotide variants in a mammalian cis-regulatory element. Proc. Natl Acad. Sci. USA 109, 19498–19503 (2012).

  57. Kircher, M. et al. Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution. Nat. Commun. 10, 3583 (2019).

  58. Townsley, K. G., Brennand, K. J. & Huckins, L. M. Massively parallel techniques for cataloguing the regulome of the human brain. Nat. Neurosci. 23, 1509–1521 (2020).

  59. Renganaath, K. et al. Systematic identification of cis-regulatory variants that cause gene expression differences in a yeast cross. eLife 9, e62669 (2020).

  60. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

  61. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

  62. Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15, 20170387 (2018).

  63. Avsec, Ž. et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37, 592–600 (2019).

  64. Quang, D. & Xie, X. FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods 166, 40–47 (2019).

  65. Zhou, H. et al. Towards a better understanding of reverse-complement equivariance for deep learning models in genomics. Proc. 16th Machine Learning in Computational Biology meeting 165, 1–33 (2022).

  66. Morrow, A. et al. Convolutional kitchen sinks for transcription factor binding site prediction. Preprint at https://arxiv.org/abs/1706.00125 (2017).

  67. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).

  68. Koo, P. K., Majdandzic, A., Ploenzke, M., Anand, P. & Paul, S. B. Global importance analysis: an interpretability method to quantify importance of genomic features in deep neural networks. PLoS Comput. Biol. 17, e1008925 (2021).

  69. Quang, D. & Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44, e107 (2016).

  70. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. International Conference on Learning Representations (Poster) (2015).

  71. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. Software available from https://www.tensorflow.org/ (2015).

  72. Jouppi, N. P. et al. In-datacenter performance analysis of a tensor processing unit. In Proc. 44th Annual International Symposium on Computer Architecture 1–12 (2017).

  73. Li, J., Pu, Y., Tang, J., Zou, Q. & Guo, F. DeepATT: a hybrid category attention neural network for identifying functional effects of DNA sequences. Brief. Bioinform. 22, bbaa159 (2020).

  74. Ullah, F. & Ben-Hur, A. A self-attention model for inferring cooperativity between regulatory features. Nucleic Acids Res. 49, e77 (2021).

  75. Clauwaert, J., Menschaert, G. & Waegeman, W. Explainability in transformer models for functional genomics. Brief. Bioinform. 22, bbab060 (2021).

  76. Hinton, G. & Tieleman, T. Lecture 6.5—RmsProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4, 26–31 (2012).

  77. Sinai, S. et al. AdaLead: a simple and robust adaptive greedy search algorithm for sequence design. Preprint at https://arxiv.org/abs/2010.02141 (2020).

  78. Linder, J., Bogard, N., Rosenberg, A. B. & Seelig, G. A generative neural network for maximizing fitness and diversity of synthetic DNA and protein sequences. Cell Syst. 11, 49–62 (2020).

  79. Brookes, D., Park, H. & Listgarten, J. Conditioning by adaptive sampling for robust design. Proc. Mach. Learn. Res. 97, 773–782 (2019).

  80. Killoran, N., Lee, L. J., Delong, A., Duvenaud, D. & Frey, B. J. Generating and designing DNA with deep generative models. Neurips Computational Biology Workshop (2017).

  81. Fortin, F.-A., Rainville, F.-M. D., Gardner, M.-A., Parizeau, M. & Gagné, C. DEAP: evolutionary algorithms made easy. J. Mach. Learn. Res. 13, 2171–2175 (2012).

  82. Jaeger, S. A. et al. Conservation and regulatory associations of a wide affinity range of mouse transcription factor binding sites. Genomics 95, 185–195 (2010).

  83. Tanay, A. Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res. 16, 962–972 (2006).

  84. Sniegowski, P. D. & Gerrish, P. J. Beneficial mutations and the dynamics of adaptation in asexual populations. Phil. Trans. R. Soc. B 365, 1255–1263 (2010).

  85. Szendro, I. G., Franke, J., de Visser, J. A. & Krug, J. Predictability of evolution depends nonmonotonically on population size. Proc. Natl Acad. Sci. USA 110, 571–576 (2013).

  86. Orr, H. A. The population genetics of adaptation: the adaptation of DNA sequences. Evolution 56, 1317–1330 (2002).

  87. Bailey, T. L. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics 27, 1653–1659 (2011).

  88. de Boer, C. G. & Hughes, T. R. YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic Acids Res. 40, D169–D179 (2012).

  89. Kent, W. J. BLAT—the BLAST-Like Alignment Tool. Genome Res. 12, 656–664 (2002).

  90. Cherry, J. M. et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 40, D700–D705 (2012).

  91. Smith, J. D., McManus, K. F. & Fraser, H. B. A novel test for selection on cis-regulatory elements reveals positive and negative selection acting on mammalian transcriptional enhancers. Mol. Biol. Evol. 30, 2509–2518 (2013).

  92. Liu, J. & Robinson-Rechavi, M. Robust inference of positive selection on regulatory sequences in the human brain. Sci. Adv. 6, eabc9863 (2020).

  93. Rice, D. P. & Townsend, J. P. A test for selection employing quantitative trait locus and mutation accumulation data. Genetics 190, 1533–1545 (2012).

  94. Denver, D. R., Morris, K., Lynch, M. & Thomas, W. K. High mutation rate and predominance of insertions in the Caenorhabditis elegans nuclear genome. Nature 430, 679–682 (2004).

  95. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

  96. Thompson, D. A. et al. Evolutionary principles of modular gene regulation in yeasts. eLife 2, e00603 (2013).

  97. Yassour, M. et al. Strand-specific RNA sequencing reveals extensive regulated long antisense transcripts that are conserved across yeast species. Genome Biol. 11, R87 (2010).

  98. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).

  99. Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005).

  100. Wapinski, I., Pfeffer, A., Friedman, N. & Regev, A. Natural history and evolutionary principles of gene duplication in fungi. Nature 449, 54–61 (2007).

  101. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).

  102. Yates, A. D. et al. Ensembl 2020. Nucleic Acids Res. 48, D682–D688 (2020).

  103. DiCarlo, J. E. et al. Genome engineering in Saccharomyces cerevisiae using CRISPR–Cas systems. Nucleic Acids Res. 41, 4336–4343 (2013).

  104. Fleiss, A. et al. Reshuffling yeast chromosomes with CRISPR/Cas9. PLoS Genet. 15, e1008332 (2019).

  105. Horwitz, A. A. et al. Efficient multiplexed integration of synergistic alleles and metabolic pathways in yeasts via CRISPR–Cas. Cell Syst. 1, 88–96 (2015).

  106. Livak, K. J. & Schmittgen, T. D. Analysis of relative gene expression data using real-time quantitative PCR and the 2−ΔΔCT method. Methods 25, 402–408 (2001).

  107. Vandesompele, J. et al. Accurate normalization of real-time quantitative RT–PCR data by geometric averaging of multiple internal control genes. Genome Biol. 3, research0034.1 (2002).

  108. Teste, M.-A., Duquenne, M., François, J. M. & Parrou, J.-L. Validation of reference genes for quantitative expression analysis by real-time RT–PCR in Saccharomyces cerevisiae. BMC Mol. Biol. 10, 99 (2009).

  109. Mardones, W. et al. Rapid selection response to ethanol in Saccharomyces eubayanus emulates the domestication process under brewing conditions. Microb. Biotechnol. https://doi.org/10.1111/1751-7915.13803 (2021).

  110. Ibstedt, S. et al. Concerted evolution of life stage performances signals recent selection on yeast nitrogen use. Mol. Biol. Evol. 32, 153–161 (2015).

  111. Rich, M. S. et al. Comprehensive analysis of the SUL1 promoter of Saccharomyces cerevisiae. Genetics 203, 191–202 (2016).

  112. Rest, J. S. et al. Nonlinear fitness consequences of variation in expression level of a eukaryotic gene. Mol. Biol. Evol. 30, 448–456 (2013).

  113. Bergen, A. C., Olsen, G. M. & Fay, J. C. Divergent MLS1 promoters lie on a fitness plateau for gene expression. Mol. Biol. Evol. 33, 1270–1279 (2016).

  114. Alstott, J., Bullmore, E. & Plenz, D. Powerlaw: a Python package for analysis of heavy-tailed distributions. PLoS ONE 9, e85777 (2014).

Acknowledgements

We thank Google TPU Research Cloud for TPU access, L. Gaffney for help with figure preparation, Broad Genomics Platform for sequencing work, J.-C. Hütter for advice on fitness responsivity, J. Pfiffner-Borges for help with RNA-seq, R. Yu, B. Lee and N. Jaberi for manuscript feedback and members of the A.R. laboratory for discussions. E.D.V. was supported by the MIT Presidential Fellowship; C.G.d.B. was supported by a Canadian Institutes for Health Research Fellowship and the NIH (K99-HG009920-01); and F.A.C. and J.M. were supported by ANID (Programa Iniciativa Científica Milenio, ICN17_022). Work was supported by the Klarman Cell Observatory, Howard Hughes Medical Institute (HHMI) and Google TPU Research Cloud (https://sites.research.google/trc/about/). A.R. was an Investigator of the HHMI.

Author information

Contributions

E.D.V., C.G.d.B. and A.R. conceived, designed and supervised the study. E.D.V. and C.G.d.B. performed the analyses. M.Y., L.F., X.A. and D.A.T. performed, and D.A.T., J.Z.L. and A.R. supervised, the Ascomycota cross-species RNA-seq experiments. J.M. performed, and F.A.C. supervised, the CDC36 experiments. E.D.V. and C.G.d.B. performed the rest of the experiments. E.D.V., C.G.d.B. and A.R. wrote the manuscript.

Corresponding authors

Correspondence to Eeshit Dhaval Vaishnav, Carl G. de Boer or Aviv Regev.

Ethics declarations

Competing interests

A.R. is a co-founder and equity holder of Celsius Therapeutics and Immunitas and until 31 July 2020 was a member of the scientific advisory board of Thermo Fisher Scientific, Syros Pharmaceuticals, Neogene Therapeutics and Asimov. As of 1 August 2020, A.R. is an employee of Genentech. The other authors declare no competing interests.

Peer review

Peer review information

Nature thanks Martin Taylor and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 The convolutional sequence-to-expression model generalizes reliably and characterizes sequence trajectories under different evolutionary regimes.

a–d, Prediction of expression from sequence in complex (YPD) (a, b) and defined (SD-Uracil) (c, d) medium. Predicted (x axis) and experimentally measured (y axis) expression for (a, c) random test sequences (sampled separately from and not overlapping with the training data) and (b, d) native yeast promoter sequences containing random single base mutations. Top left: Pearson’s r and associated two-tailed P value. Compression of predictions in the lower left results from binning differences during cell sorting in different experiments (Supplementary Notes). e, Experimental validation of trajectories from simulations of random genetic drift. Distribution of measured (light grey) and predicted (dark grey) changes in expression in the defined medium (SD-Uracil) (y axis) for the synthesized randomly designed sequences (n = 2,986) at each mutational step (x axis). Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. f, g, Simulation and validation of expression trajectories under SSWM in defined medium (SD-Uracil). f, Distribution of predicted expression levels (y axis) in defined medium at each evolutionary time step (x axis) for sequences under SSWM favouring high (red) or low (blue) expression, starting with native promoter sequences (n = 5,720). Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. g, Experimentally measured expression distribution in defined medium (y axis) for the synthesized sequences (n = 6,304 sequences; 637 trajectories) at each mutational step (x axis) from predicted mutational trajectories under SSWM, favouring high (red) or low (blue) expression. Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. h–o, Experimental validation of predicted expression for sequences from the random genetic drift and SSWM simulations. Experimentally measured (y axis) and predicted (x axis) expression level (l–o) or expression change from the starting sequence (h–k) in complex (h, j, l, n) or defined (i, k, m, o) medium using sequences from the random genetic drift (Fig. 2e, Extended Data Fig. 1e, h, i, l, m here) and SSWM (Fig. 2g, Extended Data Fig. 1g, j, k, n, o here) validation experiments. Top left: Pearson’s r and associated two-tailed P values.
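
The summary statistics named in this legend (Pearson’s r, and box plots with a median midline, interquartile-range boxes and 5th to 95th percentile whiskers) can be computed as in the following minimal sketch. The data values are invented for illustration, the helper names are hypothetical, and the linear-interpolation percentile is one common convention that the legend does not specify. The P value is omitted here; a library routine such as `scipy.stats.pearsonr` returns both the coefficient and a two-sided P value.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in [0, 100])."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def box_summary(values):
    """Box-plot statistics matching the legend: median, IQR box, 5th-95th whiskers."""
    return {
        "whisker_low": percentile(values, 5),
        "q1": percentile(values, 25),
        "median": percentile(values, 50),
        "q3": percentile(values, 75),
        "whisker_high": percentile(values, 95),
    }

# Invented example values, not data from the study.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
measured = [1.2, 1.9, 3.2, 3.8, 5.1]
r = pearson_r(predicted, measured)
summary = box_summary(measured)
```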

Extended Data Fig. 2 Characterization of sequence trajectories under strong competing selection pressures using the convolutional model.

a, b, Expression is highly correlated between defined and complex medium. Measured (a) and predicted (b) expression in defined (x axis) and complex (y axis) medium for a set of test sequences measured in both media. Top left: Pearson’s r and associated two-tailed P values. c, Opposing relationships between organismal fitness and URA3 expression in two environments. Measured expression (x axis, using a YFP reporter) and fitness (y axis; when used as the promoter sequence for the URA3 gene) for yeast with each of 11 promoters predicted to span a wide range of expression levels in complex medium with 5-FOA (red), where higher expression of URA3 is toxic owing to URA3-mediated conversion of 5-FOA to 5-fluorouracil, and in defined medium lacking uracil (blue), where URA3 is required for uracil synthesis. Error bars: standard error of the mean (n = 3 replicate experiments). d–f, Competing expression objectives constrain adaptation. d, e, Difference in predicted expression (y axis) at each evolutionary time step (x axis) under selection to maximize (red) or minimize (blue) the difference between expression in defined and complex medium, starting with either native sequences (d, as Fig. 2h, n = 5,720) or random sequences (e, n = 10,000). f, Distribution of predicted expression (y axis) in complex (blue) and defined (red) medium at each evolutionary time step (x axis) for a starting set of random sequences (n = 10,000). Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. g, Motifs enriched within sequences evolved for competing objectives in different environments. Top five most enriched motifs, found using DREME (ref. 87; Methods) within sequences computationally evolved from a starting set of random sequences to either maximize (left) or minimize (right) the difference in expression between defined and complex medium, along with DREME E-values, the corresponding rank of the same motif when using native sequences as a starting point, the probable cognate transcription factor and that transcription factor’s known motif.

Extended Data Fig. 3 The transformer sequence-to-expression model generalizes reliably and characterizes sequence trajectories under different evolutionary regimes.

a–d, Prediction of expression from sequence in the complex (a, b) and defined (c, d) medium. Predicted (x axis) and experimentally measured (y axis) expression for (a, c) random test sequences (sampled separately from and not overlapping with the training data) and (b, d) native yeast promoter sequences containing random single base mutations. Top left: Pearson’s r and associated two-tailed P value. Compression of predictions in the lower left results from binning differences during cell sorting in different experiments (Supplementary Notes). e, Predicted (x axis) and experimentally measured (y axis) expression in complex medium (YPD) for all native yeast promoter sequences. Pearson’s r and associated two-tailed P values are shown. f, Predicted expression divergence under random genetic drift. Distribution of the change in predicted expression (y axis) for random starting sequences (n = 5,720) at each mutational step (x axis) for trajectories simulated under random genetic drift. Silver bar: differences in expression between unrelated sequences. g, h, Comparison of the distribution of measured (light grey) and transformer model predicted (dark grey) changes in expression (y axis) in complex medium (g, n = 2,983) and defined medium (h, n = 2,986) for synthesized randomly designed sequences at each mutational step (x axis). i, j, Predicted expression evolution under SSWM. Distribution of predicted expression levels (y axis) in complex medium (i, n = 10,322) and defined medium (j, n = 6,304) at each mutational step (x axis) for sequence trajectories under SSWM favouring high (red) or low (blue) expression, starting with 5,720 native promoter sequences. (f–j) Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. k–r, Comparison of model predicted expression for sequences synthesized previously for the random genetic drift and SSWM analyses.
Experimentally measured (y axis) and transformer model predicted (x axis) expression level (o–r) or expression change from the starting sequence (k–n) in complex (k, m, o, q) or defined (l, n, p, r) medium using sequences from the random genetic drift (Fig. 2c, Extended Data Fig. 1e; k, l, o, p here) and SSWM (Fig. 2g, Extended Data Fig. 1g; m, n, q, r here) validation experiments. Top left: Pearson’s r and associated two-tailed P values.

Extended Data Fig. 4 Signatures of stabilizing selection on gene expression detected from regulatory DNA across natural populations.

a, Expression-altering alleles in the CDC36 promoter are attributed primarily to altered UPC2 binding. Transcription factor interaction strength26 (expression attributable to each transcription factor) difference between the high and low alleles (each point is a transcription factor) for each of two low expression alleles (allele 1: x axis; allele 2: y axis). Each low-expressing allele is compared to the high-expression allele with the most similar sequence (across all promoter sequences analysed from the 1,011 strains; \(e_{\mathrm{TF},A_{\mathrm{high}}}-e_{\mathrm{TF},A_{\mathrm{low}}}\)). b, Distribution of ECC (y axis, calculated from 1,011 S. cerevisiae genomes, top left) for S. cerevisiae genes whose orthologues have divergent (blue) or conserved (purple) expression within Saccharomyces (left, n = 4,191), Ascomycota (middle, n = 4,910) or mammals (right, n = 199), as determined by cross-species RNA-seq (top right). P values: two-sided Wilcoxon rank-sum test. Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. c, Determination of expression change threshold for defining a ‘tolerated mutation’ to compute mutational robustness. We used all genes with an ECC consistent with stabilizing selection (ECC > 0; left), calculated the variance in predicted expression across the 1,011 yeast strains for each gene, and chose the tolerable mutation threshold, \(\epsilon\), as two standard deviations of the distribution of the variance (right). ~73% of genes with ECC > 0 had an expression variation lower than \(\epsilon\). d, Distribution of the effects (magnitude; y axis) of mutations (rank ordered; x axis) on expression for all native regulatory sequences follows a power law with an exponent of 2.252. Shaded regions are equal in area.
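
The threshold described in panel c can be illustrated with a minimal sketch on toy numbers. This is not the authors' code: the function names, the use of population (rather than sample) variance, and the literal reading of "two standard deviations of the distribution of the variance" as 2 × s.d. of the per-gene variances are all assumptions made here for illustration.

```python
from statistics import pstdev, pvariance


def tolerated_mutation_threshold(expression_by_gene):
    """Return (epsilon, per-gene variances).

    expression_by_gene maps a gene name to its predicted expression
    values across strains (hypothetical input format).
    """
    # Variance of predicted expression across strains, one value per gene.
    variances = [pvariance(values) for values in expression_by_gene.values()]
    # Assumption: epsilon is twice the standard deviation of those variances.
    epsilon = 2.0 * pstdev(variances)
    return epsilon, variances


def fraction_below(variances, epsilon):
    """Fraction of genes whose expression variance is below epsilon."""
    return sum(v < epsilon for v in variances) / len(variances)


# Toy data only; real input would be model predictions per strain.
toy = {"geneA": [1.0, 1.0, 1.0], "geneB": [0.0, 2.0], "geneC": [0.0, 4.0]}
epsilon, variances = tolerated_mutation_threshold(toy)
share = fraction_below(variances, epsilon)
```

The second function corresponds to the kind of summary reported in the legend (the share of genes whose expression variation falls below the threshold).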

Extended Data Fig. 5 Fitness responsivity of a gene as the total variation of its expression-to-fitness relationship FGENE curves.

Expression (x axis) and fitness (y axis) level curves for each select gene, fit from experimental measurements of expression and fitness across promoter variants by Keren et al.11. Fitness responsivity calculated as the total variation in each curve is noted above each panel.
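
For a curve sampled at discrete expression levels, total variation is the sum of absolute fitness changes between consecutive points. The sketch below shows that calculation on toy values; the function name, the input format and the sampling grid are assumptions for illustration and are not taken from the paper.

```python
def total_variation(curve):
    """Total variation of a sampled expression-to-fitness curve.

    curve is a list of (expression, fitness) pairs (hypothetical format).
    Points are sorted by expression before differencing.
    """
    fitness = [f for _, f in sorted(curve)]
    # Sum of absolute fitness changes between neighbouring expression levels.
    return sum(abs(b - a) for a, b in zip(fitness, fitness[1:]))


# Toy curves: a flat curve has zero responsivity, a peaked curve does not.
flat = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
peaked = [(0.0, 0.2), (1.0, 1.0), (2.0, 0.4)]
flat_tv = total_variation(flat)      # no change in fitness along the curve
peaked_tv = total_variation(peaked)  # rises by 0.8, then falls by 0.6
```

Under this definition a gene whose fitness rises and then falls with expression scores higher than a monotonic curve spanning the same fitness range, because both the rise and the fall contribute.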

Extended Data Fig. 6 Analysis of regulatory evolvability reveals sequence-encoded signatures of expression conservation from solitary sequences.

a, Selection of optimal number of archetypes. Mean-square-reconstruction error (y axis) for reconstructing the evolvability vectors from the embeddings learned by the autoencoder for an increasing number of archetypes (x axis). Red circle: optimal number of archetypes selected as prescribed45 by the ‘elbow method’. b, The archetypal embeddings learned by the autoencoder accurately capture evolvability vectors. Original (y axis) and reconstructed (x axis) expression changes (the values in the evolvability vectors) for each native sequence (none seen by the autoencoder in training). Top left: Pearson’s r and associated two-tailed P values. c–f, Evolvability space captures regulatory sequences’ evolutionary properties. Proximity to the malleable archetype (Amalleable) (x axis) and mutational robustness (c, e y axis) or ECC (d, f y axis) for all yeast genes (e, f) or the gene for which fitness responsivity was quantified (c, d). Top right: Spearman’s ρ and associated two-sided P value. ‘L’-shape of relationship in e results from the robust cleft, Amaxima, and Aminima all being distal to Amalleable (left side of plot). g, All native (S288C reference) promoter sequences (points) projected onto the archetypal evolvability space learned from random sequences; coloured by their ECC. Large coloured circles: evolvability archetypes. h, The proximity to the malleable archetype (x axis) and fitness responsivity (y axis) for the 80 genes with measured fitness responsivity. Top right: Spearman’s ρ and associated two-tailed P values. Light blue error band: 95% confidence interval. i, All native (S288C reference) promoter sequences (points) projected on the evolvability space learned from random sequences; coloured by their mean pairwise distance in the archetypal evolvability space between all promoter alleles across the 1,011 yeast isolates for that gene (orthologue evolvability dispersion). Large coloured circles: evolvability archetypes.
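
The dispersion measure in panel i is a mean pairwise distance between a gene's promoter alleles in the embedding space. A minimal sketch follows, assuming Euclidean distance and 2D coordinates; the function name, the distance metric and the convention of returning 0.0 for a single allele are assumptions made here, not details confirmed by the legend.

```python
from itertools import combinations
from math import dist


def evolvability_dispersion(allele_coordinates):
    """Mean pairwise Euclidean distance between allele embeddings.

    allele_coordinates is a list of coordinate tuples, one per promoter
    allele of a gene (hypothetical input format).
    """
    pairs = list(combinations(allele_coordinates, 2))
    if not pairs:
        # Assumed convention: fewer than two alleles means no dispersion.
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Toy coordinates: two identical alleles and one distant allele.
toy_alleles = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
dispersion = evolvability_dispersion(toy_alleles)
```

Averaging over all unordered pairs makes the value independent of the order in which alleles are listed.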

Extended Data Fig. 7 Visualizing promoter fitness landscapes in sequence space.

Visualizing the fitness landscapes for the promoters of HXT3 (a), ADH1 (b), GCN4 (c), RPL3 (d), FBA1 (e), TUB3 (f), URA3 (in defined medium) (g), URA3 (in complex medium + 5-FOA) (h). 1,000 promoter sequences represented by their evolvability vectors projected onto the 2D archetypal evolvability space and coloured by their associated fitness as reflected by their predicted growth rate relative to wild type (colour, Methods), estimated by first mapping sequences to expression with our model and then expression to fitness as measured and estimated previously11.

Extended Data Fig. 8 In silico mutagenesis of malleable and robust promoters.

SSWM trajectories for (a) DBP7, a malleable promoter, and (b) UTH1, a robust promoter. Each subplot shows the in silico mutagenesis effects for how expression level (colour) changes when mutating each position (x axis) to each of the four bases (y axis) of each sequence (subplots) in the trajectories. The DNA sequence is indicated above each wild-type subplot (indicated with ‘WT’ at left). Arrows indicate the mutations selected at each step, which always correspond to the mutation of maximal effect; increasing expression goes up the figure from wild type and decreasing expression goes down. Part of the malleability of the DBP7 promoter results from an intermediate-affinity Rap1p-binding site (grey bar). The first mutations in increasing- and decreasing-expression trajectories either increase or decrease (respectively) the affinity of this site. The UTH1 promoter changes gradually in expression and evolves proximal repressor binding sites to dampen expression (grey bars).
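
The procedure behind these panels (score every single-base substitution, then take the one of maximal effect in the chosen direction) can be sketched as below. This is an illustrative toy, not the authors' implementation: `toy_expression` is a hypothetical stand-in for the sequence-to-expression model, and stopping when no substitution moves expression in the desired direction is an assumption.

```python
BASES = "ACGT"


def toy_expression(seq):
    """Hypothetical scoring function standing in for a trained model."""
    return seq.count("G") * 1.0 - seq.count("T") * 0.5


def in_silico_mutagenesis(seq, predict):
    """Map (position, new_base) to the change in predicted expression."""
    baseline = predict(seq)
    effects = {}
    for pos, ref in enumerate(seq):
        for base in BASES:
            if base != ref:
                mutant = seq[:pos] + base + seq[pos + 1:]
                effects[(pos, base)] = predict(mutant) - baseline
    return effects


def sswm_trajectory(seq, predict, steps, direction=1):
    """Greedy trajectory: direction=1 raises expression, -1 lowers it."""
    trajectory = [seq]
    for _ in range(steps):
        effects = in_silico_mutagenesis(seq, predict)
        # Pick the substitution with the largest effect in the chosen direction.
        (pos, base), delta = max(effects.items(), key=lambda kv: direction * kv[1])
        if direction * delta <= 0:
            break  # assumed stopping rule: no substitution helps
        seq = seq[:pos] + base + seq[pos + 1:]
        trajectory.append(seq)
    return trajectory


up = sswm_trajectory("TAC", toy_expression, steps=5, direction=1)
down = sswm_trajectory("TAC", toy_expression, steps=5, direction=-1)
```

Each call to `in_silico_mutagenesis` yields the values that one heat-map subplot would display, with three alternative bases per position; ties between equally good substitutions are broken here by position order, which is an arbitrary choice.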

Supplementary information

Supplementary Information

This file contains Supplementary Notes, Supplementary Figures 1–21, legends for Supplementary Tables 1 and 2, Supplementary Tables 3 and 4, and additional references.

Reporting Summary

Supplementary Tables

This file contains Supplementary Tables 1 and 2; see main Supplementary Information PDF for legends.

About this article

Cite this article

Vaishnav, E.D., de Boer, C.G., Molinet, J. et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 603, 455–463 (2022). https://doi.org/10.1038/s41586-022-04506-6

  • DOI: https://doi.org/10.1038/s41586-022-04506-6
