The evolution, evolvability and engineering of gene regulatory DNA

Vaishnav, Eeshit Dhaval; de Boer, Carl G.; Molinet, Jennifer; Yassour, Moran; Fan, Lin; Adiconis, Xian; Thompson, Dawn A.; Levin, Joshua Z.; Cubillos, Francisco A.; Regev, Aviv

doi:10.1038/s41586-022-04506-6

Article
Published: 09 March 2022

Nature volume 603, pages 455–463 (2022)

Abstract

Mutations in non-coding regulatory DNA sequences can alter gene expression, organismal phenotype and fitness^1,2,3. Constructing complete fitness landscapes, in which DNA sequences are mapped to fitness, is a long-standing goal in biology, but has remained elusive because it is challenging to generalize reliably to vast sequence spaces^4,5,6. Here we build sequence-to-expression models that capture fitness landscapes and use them to decipher principles of regulatory evolution. Using millions of randomly sampled promoter DNA sequences and their measured expression levels in the yeast Saccharomyces cerevisiae, we learn deep neural network models that generalize with excellent prediction performance, and enable sequence design for expression engineering. Using our models, we study expression divergence under genetic drift and strong-selection weak-mutation regimes to find that regulatory evolution is rapid and subject to diminishing returns epistasis; that conflicting expression objectives in different environments constrain expression adaptation; and that stabilizing selection on gene expression leads to the moderation of regulatory complexity. We present an approach for using such models to detect signatures of selection on expression from natural variation in regulatory sequences and use it to discover an instance of convergent regulatory evolution. We assess mutational robustness, finding that regulatory mutation effect sizes follow a power law, characterize regulatory evolvability, visualize promoter fitness landscapes, discover evolvability archetypes and illustrate the mutational robustness of natural regulatory sequence populations. Our work provides a general framework for designing regulatory sequences and addressing fundamental questions in regulatory evolution.

**Fig. 1: The evolution, evolvability and engineering of gene regulatory DNA.**

**Fig. 2: The evolutionary malleability of gene expression.**

**Fig. 3: The ECC detects signatures of selection on gene expression using natural genetic variation in regulatory DNA.**

**Fig. 4: The evolvability vector captures fitness landscapes.**

Data availability

Data generated for this study are available at the NCBI GEO with accession numbers GSE163045 and GSE163866. All models and processed data are available on Zenodo at https://zenodo.org/record/4436477.

Code availability

Code is available on GitHub at https://github.com/1edv/evolution and CodeOcean at https://codeocean.com/capsule/8020974/tree. A web app is available at https://1edv.github.io/evolution/.

References

Wittkopp, P. J. & Kalay, G. Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. Nat. Rev. Genet. 13, 59–69 (2011).
Hill, M. S., Vande Zande, P. & Wittkopp, P. J. Molecular and evolutionary processes generating variation in gene expression. Nat. Rev. Genet. 22, 203–215 (2021).
Fuqua, T. et al. Dense and pleiotropic regulatory information in a developmental enhancer. Nature 587, 235–239 (2020).
de Visser, J. A. G. M. & Krug, J. Empirical fitness landscapes and the predictability of evolution. Nat. Rev. Genet. 15, 480–490 (2014).
Kondrashov, D. A. & Kondrashov, F. A. Topological features of rugged fitness landscapes in sequence space. Trends Genet. 31, 24–33 (2015).
de Visser, J. A. G. M., Elena, S. F., Fragata, I. & Matuszewski, S. The utility of fitness landscapes and big data for predicting evolution. Heredity 121, 401–405 (2018).
Weirauch, M. T. & Hughes, T. R. Conserved expression without conserved regulatory sequence: the more things change, the more they stay the same. Trends Genet. 26, 66–74 (2010).
Orr, H. A. The genetic theory of adaptation: a brief history. Nat. Rev. Genet. 6, 119–127 (2005).
Weinreich, D. M., Lan, Y., Wylie, C. S. & Heckendorn, R. B. Should evolutionary geneticists worry about higher-order epistasis? Curr. Opin. Genet. Dev. 23, 700–707 (2013).
Venkataram, S. et al. Development of a comprehensive genotype-to-fitness map of adaptation-driving mutations in yeast. Cell 166, 1585–1596 (2016).
Keren, L. et al. Massively parallel interrogation of the effects of gene expression levels on fitness. Cell 166, 1282–1294 (2016).
Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).
Ogden, P. J., Kelsic, E. D., Sinai, S. & Church, G. M. Comprehensive AAV capsid fitness landscape reveals a viral gene and enables machine-guided design. Science 366, 1139–1143 (2019).
Pitt, J. N. & Ferré-D’Amaré, A. R. Rapid construction of empirical RNA fitness landscapes. Science 330, 376–379 (2010).
Shultzaberger, R. K., Malashock, D. S., Kirsch, J. F. & Eisen, M. B. The fitness landscapes of cis-acting binding sites in different promoter and environmental contexts. PLoS Genet. 6, e1001042 (2010).
Mustonen, V., Kinney, J., Callan, C. G. & Lässig, M. Energy-dependent fitness: a quantitative model for the evolution of yeast transcription factor binding sites. Proc. Natl Acad. Sci. USA 105, 12376–12381 (2008).
Hartl, D. L. What can we learn from fitness landscapes? Curr. Opin. Microbiol. 21, 51–57 (2014).
Otwinowski, J. & Nemenman, I. Genotype to phenotype mapping and the fitness landscape of the E. coli lac promoter. PLoS ONE 8, e61570 (2013).
Sinai, S. & Kelsic, E. D. A primer on model-guided exploration of fitness landscapes for biological sequence design. Preprint at https://arxiv.org/abs/2010.10614 (2020).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).
Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. Proc. 34th International Conference on Machine Learning 3145–3153 (2017).
Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
Fragata, I., Blanckaert, A., Louro, M. A. D., Liberles, D. A. & Bank, C. Evolution in the light of fitness landscape theory. Trends Ecol. Evol. 34, 69–82 (2019).
Payne, J. L. & Wagner, A. The causes of evolvability and their evolution. Nat. Rev. Genet. 20, 24–38 (2019).
de Boer, C. G. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat. Biotechnol. 38, 56–65 (2020).
Crocker, J. et al. Low affinity binding site clusters confer hox specificity and regulatory robustness. Cell 160, 191–203 (2015).
Habib, N., Wapinski, I., Margalit, H., Regev, A. & Friedman, N. A functional selection model explains evolutionary robustness despite plasticity in regulatory networks. Mol. Syst. Biol. 8, 619 (2012).
Gillespie, J. H. Molecular evolution over the mutational landscape. Evolution 38, 1116–1129 (1984).
Jerison, E. R. & Desai, M. M. Genomic investigations of evolutionary dynamics and epistasis in microbial evolution experiments. Curr. Opin. Genet. Dev. 35, 33–39 (2015).
Sæther, B.-E. & Engen, S. The concept of fitness in fluctuating environments. Trends Ecol. Evol. 30, 273–281 (2015).
Vaswani, A. et al. in Advances in Neural Information Processing Systems 30 (eds. Guyon, I. et al.) 5998–6008 (Curran Associates, 2017).
Weirauch, M. T. et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31, 126–134 (2013).
Yang, Z. & Bielawski, J. P. Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. 15, 496–503 (2000).
Moses, A. M. Statistical tests for natural selection on regulatory regions based on the strength of transcription factor binding sites. BMC Evol. Biol. 9, 286 (2009).
Rifkin, S. A., Houle, D., Kim, J. & White, K. P. A mutation accumulation assay reveals a broad capacity for rapid evolution of gene expression. Nature 438, 220–223 (2005).
Peter, J. et al. Genome evolution across 1,011 Saccharomyces cerevisiae isolates. Nature 556, 339–344 (2018).
Erb, I. & van Nimwegen, E. Transcription factor binding site positioning in yeast: proximal promoter motifs characterize TATA-less promoters. PLoS One 6, e24279 (2011).
Gilad, Y., Oshlack, A. & Rifkin, S. A. Natural selection on gene expression. Trends Genet. 22, 456–461 (2006).
Alhusaini, N. & Coller, J. The deadenylase components Not2p, Not3p, and Not5p promote mRNA decapping. RNA 22, 709–721 (2016).
Yang, J.-R., Maclean, C. J., Park, C., Zhao, H. & Zhang, J. Intra and interspecific variations of gene expression levels in yeast are largely neutral: (Nei Lecture, SMBE 2016, Gold Coast). Mol. Biol. Evol. 34, 2125–2139 (2017).
Chen, J. et al. A quantitative framework for characterizing the evolutionary history of mammalian gene expression. Genome Res. 29, 53–63 (2019).
Payne, J. L. & Wagner, A. Mechanisms of mutational robustness in transcriptional regulation. Front. Genet. 6, 322 (2015).
Shoval, O. et al. Evolutionary trade-offs, Pareto optimality, and the geometry of phenotype space. Science 336, 1157–1160 (2012).
van Dijk, D. et al. Finding archetypal spaces using neural networks. IEEE International Conference on Big Data 2634–2643 (2019).
He, X., Duque, T. S. P. C. & Sinha, S. Evolutionary origins of transcription factor binding site clusters. Mol. Biol. Evol. 29, 1059–1070 (2012).
Cliften, P. F. et al. Surveying Saccharomyces genomes to identify functional elements by comparative DNA sequence analysis. Genome Res. 11, 1175–1186 (2001).
Heinz, S., Romanoski, C. E., Benner, C. & Glass, C. K. The selection and function of cell type-specific enhancers. Nat. Rev. Mol. Cell Biol. 16, 144–154 (2015).
Lehner, B. Selection to minimise noise in living systems and its implications for the evolution of gene expression. Mol. Syst. Biol. 4, 170 (2008).
Metzger, B. P. H., Yuan, D. C., Gruber, J. D., Duveau, F. & Wittkopp, P. J. Selection on noise constrains variation in a eukaryotic promoter. Nature 521, 344–347 (2015).
Kosuri, S. et al. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl Acad. Sci. USA 110, 14024–14029 (2013).
Shalem, O. et al. Systematic dissection of the sequence determinants of gene 3’ end mediated expression control. PLoS Genet. 11, e1005147 (2015).
Kinney, J. B., Murugan, A., Callan, C. G. Jr & Cox, E. C. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl Acad. Sci. USA 107, 9158–9163 (2010).
Sharon, E. et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol. 30, 521–530 (2012).
Melnikov, A. et al. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nat. Biotechnol. 30, 271–277 (2012).
Kwasnieski, J. C., Mogno, I., Myers, C. A., Corbo, J. C. & Cohen, B. A. Complex effects of nucleotide variants in a mammalian cis-regulatory element. Proc. Natl Acad. Sci. USA 109, 19498–19503 (2012).
Kircher, M. et al. Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution. Nat. Commun. 10, 3583 (2019).
Townsley, K. G., Brennand, K. J. & Huckins, L. M. Massively parallel techniques for cataloguing the regulome of the human brain. Nat. Neurosci. 23, 1509–1521 (2020).
Renganaath, K. et al. Systematic identification of cis-regulatory variants that cause gene expression differences in a yeast cross. eLife 9, e62669 (2020).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15, 20170387 (2018).
Avsec, Ž. et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37, 592–600 (2019).
Quang, D. & Xie, X. FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods 166, 40–47 (2019).
Zhou, H. et al. Towards a better understanding of reverse-complement equivariance for deep learning models in genomics. Proc. 16th Machine Learning in Computational Biology meeting 165, 1–33 (2022).
Morrow, A. et al. Convolutional kitchen sinks for transcription factor binding site prediction. Preprint at https://arxiv.org/abs/1706.00125 (2017).
Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
Koo, P. K., Majdandzic, A., Ploenzke, M., Anand, P. & Paul, S. B. Global importance analysis: an interpretability method to quantify importance of genomic features in deep neural networks. PLoS Comput. Biol. 17, e1008925 (2021).
Quang, D. & Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44, e107 (2016).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. International Conference on Learning Representations (Poster) (2015).
Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. Software available from https://www.tensorflow.org/ (2015).
Jouppi, N. P. et al. In-datacenter performance analysis of a tensor processing unit. In Proc. 44th Annual International Symposium on Computer Architecture 1–12 (2017).
Li, J., Pu, Y., Tang, J., Zou, Q. & Guo, F. DeepATT: a hybrid category attention neural network for identifying functional effects of DNA sequences. Brief. Bioinform. 22, bbaa159 (2020).
Ullah, F. & Ben-Hur, A. A self-attention model for inferring cooperativity between regulatory features. Nucleic Acids Res. 49, e77 (2021).
Clauwaert, J., Menschaert, G. & Waegeman, W. Explainability in transformer models for functional genomics. Brief. Bioinform. 22, bbab060 (2021).
Hinton, G. & Tieleman, T. Lecture 6.5—RmsProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4, 26–31 (2012).
Sinai, S. et al. AdaLead: a simple and robust adaptive greedy search algorithm for sequence design. Preprint at https://arxiv.org/abs/2010.02141 (2020).
Linder, J., Bogard, N., Rosenberg, A. B. & Seelig, G. A generative neural network for maximizing fitness and diversity of synthetic DNA and protein sequences. Cell Syst. 11, 49–62 (2020).
Brookes, D., Park, H. & Listgarten, J. Conditioning by adaptive sampling for robust design. Proc. Mach. Learn. Res. 97, 773–782 (2019).
Killoran, N., Lee, L. J., Delong, A., Duvenaud, D. & Frey, B. J. Generating and designing DNA with deep generative models. Neurips Computational Biology Workshop (2017).
Fortin, F.-A., Rainville, F.-M. D., Gardner, M.-A., Parizeau, M. & Gagné, C. DEAP: evolutionary algorithms made easy. J. Mach. Learn. Res. 13, 2171–2175 (2012).
Jaeger, S. A. et al. Conservation and regulatory associations of a wide affinity range of mouse transcription factor binding sites. Genomics 95, 185–195 (2010).
Tanay, A. Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res. 16, 962–972 (2006).
Sniegowski, P. D. & Gerrish, P. J. Beneficial mutations and the dynamics of adaptation in asexual populations. Phil. Trans. R. Soc. B 365, 1255–1263 (2010).
Szendro, I. G., Franke, J., de Visser, J. A. & Krug, J. Predictability of evolution depends nonmonotonically on population size. Proc. Natl Acad. Sci. USA 110, 571–576 (2013).
Orr, H. A. The population genetics of adaptation: the adaptation of DNA Sequences. Evolution 56, 1317–1330 (2002).
Bailey, T. L. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics 27, 1653–1659 (2011).
de Boer, C. G. & Hughes, T. R. YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic Acids Res. 40, D169–D179 (2012).
Kent, W. J. BLAT—the BLAST-Like Alignment Tool. Genome Res. 12, 656–664 (2002).
Cherry, J. M. et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 40, D700–D705 (2012).
Smith, J. D., McManus, K. F. & Fraser, H. B. A novel test for selection on cis-regulatory elements reveals positive and negative selection acting on mammalian transcriptional enhancers. Mol. Biol. Evol. 30, 2509–2518 (2013).
Liu, J. & Robinson-Rechavi, M. Robust inference of positive selection on regulatory sequences in the human brain. Sci. Adv. 6, eabc9863 (2020).
Rice, D. P. & Townsend, J. P. A test for selection employing quantitative trait locus and mutation accumulation data. Genetics 190, 1533–1545 (2012).
Denver, D. R., Morris, K., Lynch, M. & Thomas, W. K. High mutation rate and predominance of insertions in the Caenorhabditis elegans nuclear genome. Nature 430, 679–682 (2004).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Thompson, D. A. et al. Evolutionary principles of modular gene regulation in yeasts. eLife 2, e00603 (2013).
Yassour, M. et al. Strand-specific RNA sequencing reveals extensive regulated long antisense transcripts that are conserved across yeast species. Genome Biol. 11, R87 (2010).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).
Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005).
Wapinski, I., Pfeffer, A., Friedman, N. & Regev, A. Natural history and evolutionary principles of gene duplication in fungi. Nature 449, 54–61 (2007).
Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
Yates, A. D. et al. Ensembl 2020. Nucleic Acids Res. 48, D682–D688 (2020).
DiCarlo, J. E. et al. Genome engineering in Saccharomyces cerevisiae using CRISPR–Cas systems. Nucleic Acids Res. 41, 4336–4343 (2013).
Fleiss, A. et al. Reshuffling yeast chromosomes with CRISPR/Cas9. PLoS Genet. 15, e1008332 (2019).
Horwitz, A. A. et al. Efficient multiplexed integration of synergistic alleles and metabolic pathways in yeasts via CRISPR–Cas. Cell Syst. 1, 88–96 (2015).
Livak, K. J. & Schmittgen, T. D. Analysis of relative gene expression data using real-time quantitative PCR and the 2^−ΔΔCT method. Methods 25, 402–408 (2001).
Vandesompele, J. et al. Accurate normalization of real-time quantitative RT–PCR data by geometric averaging of multiple internal control genes. Genome Biol. 3, research0034.1 (2002).
Teste, M.-A., Duquenne, M., François, J. M. & Parrou, J.-L. Validation of reference genes for quantitative expression analysis by real-time RT–PCR in Saccharomyces cerevisiae. BMC Mol. Biol. 10, 99 (2009).
Mardones, W. et al. Rapid selection response to ethanol in Saccharomyces eubayanus emulates the domestication process under brewing conditions. Microb. Biotechnol. https://doi.org/10.1111/1751-7915.13803 (2021).
Ibstedt, S. et al. Concerted evolution of life stage performances signals recent selection on yeast nitrogen use. Mol. Biol. Evol. 32, 153–161 (2015).
Rich, M. S. et al. Comprehensive analysis of the SUL1 promoter of Saccharomyces cerevisiae. Genetics 203, 191–202 (2016).
Rest, J. S. et al. Nonlinear fitness consequences of variation in expression level of a eukaryotic gene. Mol. Biol. Evol. 30, 448–456 (2013).
Bergen, A. C., Olsen, G. M. & Fay, J. C. Divergent MLS1 promoters lie on a fitness plateau for gene expression. Mol. Biol. Evol. 33, 1270–1279 (2016).
Alstott, J., Bullmore, E. & Plenz, D. Powerlaw: a Python package for analysis of heavy-tailed distributions. PLoS One 9, e85777 (2014).

Acknowledgements

We thank Google TPU Research Cloud for TPU access, L. Gaffney for help with figure preparation, Broad Genomics Platform for sequencing work, J.-C. Hütter for advice on fitness responsivity, J. Pfiffner-Borges for help with RNA-seq, R. Yu, B. Lee and N. Jaberi for manuscript feedback and members of the A.R. laboratory for discussions. E.D.V. was supported by the MIT Presidential Fellowship; C.G.d.B. was supported by a Canadian Institutes for Health Research Fellowship and the NIH (K99-HG009920-01); and F.A.C. and J.M. were supported by ANID (Programa Iniciativa Científica Milenio, ICN17_022). Work was supported by the Klarman Cell Observatory, Howard Hughes Medical Institute (HHMI) and Google TPU Research Cloud (https://sites.research.google/trc/about/). A.R. was an Investigator of the HHMI.

Author information

Aviv Regev
Present address: Genentech, South San Francisco, CA, USA
These authors contributed equally: Eeshit Dhaval Vaishnav, Carl G. de Boer

Authors and Affiliations

Massachusetts Institute of Technology, Cambridge, MA, USA
Eeshit Dhaval Vaishnav
Broad Institute of MIT and Harvard, Cambridge, MA, USA
Eeshit Dhaval Vaishnav, Lin Fan & Dawn A. Thompson
School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada
Carl G. de Boer
Klarman Cell Observatory, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Carl G. de Boer, Moran Yassour, Xian Adiconis, Joshua Z. Levin & Aviv Regev
Departamento de Biología, Facultad de Química y Biología, Universidad de Santiago de Chile, Santiago, Chile
Jennifer Molinet & Francisco A. Cubillos
ANID—Millennium Science Initiative Program, Millennium Institute for Integrative Biology (iBio), Santiago, Chile
Jennifer Molinet & Francisco A. Cubillos
Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem, Israel
Moran Yassour
The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
Moran Yassour
Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Xian Adiconis & Joshua Z. Levin
Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
Aviv Regev

Contributions

E.D.V., C.G.d.B. and A.R. conceived, designed and supervised the study. E.D.V. and C.G.d.B. performed the analyses. M.Y., L.F., X.A. and D.A.T. performed and D.A.T., J.Z.L. and A.R. supervised the Ascomycota cross-species RNA-seq experiments. J.M. performed and F.A.C. supervised the CDC36 experiments. E.D.V. and C.G.d.B. performed the rest of the experiments. E.D.V., C.G.d.B. and A.R. wrote the manuscript.

Corresponding authors

Correspondence to Eeshit Dhaval Vaishnav, Carl G. de Boer or Aviv Regev.

Ethics declarations

Competing interests

A.R. is a co-founder and equity holder of Celsius Therapeutics and Immunitas and until 31 July 2020 was a member of the scientific advisory board of Thermo Fisher Scientific, Syros Pharmaceuticals, Neogene Therapeutics and Asimov. As of 1 August 2020, A.R. is an employee of Genentech. The other authors declare no competing interests.

Peer review

Peer review information

Nature thanks Martin Taylor and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 The convolutional sequence-to-expression model generalizes reliably and characterizes sequence trajectories under different evolutionary regimes.

a–d, Prediction of expression from sequence in complex (YPD) (a, b) and defined (SD-Uracil) (c, d) medium. Predicted (x axis) and experimentally measured (y axis) expression for (a, c) random test sequences (sampled separately from and not overlapping with the training data) and (b, d) native yeast promoter sequences containing random single-base mutations. Top left: Pearson’s r and associated two-tailed P value. Compression of predictions in the lower left results from binning differences during cell sorting in different experiments (Supplementary Notes). e, Experimental validation of trajectories from simulations of random genetic drift. Distribution of measured (light grey) and predicted (dark grey) changes in expression in the defined medium (SD-Uracil) (y axis) for the synthesized randomly designed sequences (n = 2,986) at each mutational step (x axis). Midline: median; boxes: interquartile range; whiskers: 5th to 95th percentile range. f, g, Simulation and validation of expression trajectories under SSWM in defined medium (SD-Uracil). f, Distribution of predicted expression levels (y axis) in defined medium at each evolutionary time step (x axis) for sequences under SSWM favouring high (red) or low (blue) expression, starting with native promoter sequences (n = 5,720). Midline: median; boxes: interquartile range; whiskers: 5th to 95th percentile range. g, Experimentally measured expression distribution in defined medium (y axis) for the synthesized sequences (n = 6,304 sequences; 637 trajectories) at each mutational step (x axis) from predicted mutational trajectories under SSWM, favouring high (red) or low (blue) expression. Midline: median; boxes: interquartile range; whiskers: 5th to 95th percentile range. h–o, Experimental validation of predicted expression for sequences from the random genetic drift and SSWM simulations. Experimentally measured (y axis) and predicted (x axis) expression level (l–o) or expression change from the starting sequence (h–k) in complex (h, j, l, n) or defined (i, k, m, o) medium using sequences from the random genetic drift (Fig. 2e, Extended Data Fig. 1e; h, i, l, m here) and SSWM (Fig. 2g, Extended Data Fig. 1g; j, k, n, o here) validation experiments. Top left: Pearson’s r and associated two-tailed P values.
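The box-plot convention used in these panels (midline at the median, box spanning the interquartile range, whiskers at the 5th and 95th percentiles) can be sketched as follows. This is a minimal illustration, not the authors' plotting code: the function names are hypothetical, and linear interpolation between order statistics is an assumed percentile convention.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation
    between order statistics (an assumed convention)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("values must be non-empty")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def box_summary(values):
    """Summarize one distribution (e.g. expression changes at one
    mutational step) with the statistics drawn in the panels."""
    return {
        "whisker_low": percentile(values, 5),    # lower whisker
        "q1": percentile(values, 25),            # bottom of box
        "median": percentile(values, 50),        # midline
        "q3": percentile(values, 75),            # top of box
        "whisker_high": percentile(values, 95),  # upper whisker
    }
```

Calling `box_summary` once per mutational step on the measured or predicted expression changes would give the five values needed to draw each box.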

Extended Data Fig. 2 Characterization of sequence trajectories under strong competing selection pressures using the convolutional model.

a, b, Expression is highly correlated between defined and complex medium. Measured (a) and predicted (b) expression in defined (x axis) and complex (y axis) medium for a set of test sequences measured in both media. Top left: Pearson’s r and associated two-tailed P values. c, Opposing relationships between organismal fitness and URA3 expression in two environments. Measured expression (x axis, using a YFP reporter) and fitness (y axis; when used as the promoter sequence for the URA3 gene) for yeast with each of 11 promoters predicted to span a wide range of expression levels in complex medium with 5-FOA (red), where higher expression of URA3 is toxic owing to URA3-mediated conversion of 5-FOA to 5-fluorouracil, and in defined medium lacking uracil (blue), where URA3 is required for uracil synthesis. Error bars: standard error of the mean (n = 3 replicate experiments). d–f, Competing expression objectives constrain adaptation. d, e, Difference in predicted expression (y axis) at each evolutionary time step (x axis) under selection to maximize (red) or minimize (blue) the difference between expression in defined and complex medium, starting with either native sequences (d, as Fig. 2h, n = 5,720) or random sequences (e, n = 10,000). f, Distribution of predicted expression (y axis) in complex (blue) and defined (red) medium at each evolutionary time step (x axis) for a starting set of random sequences (n = 10,000). Midline: median; boxes: interquartile range; whiskers: 5th to 95th percentile range. g, Motifs enriched within sequences evolved for competing objectives in different environments. Top five most enriched motifs, found using DREME⁸⁷ (Methods) within sequences computationally evolved from a starting set of random sequences to either maximize (left) or minimize (right) the difference in expression between defined and complex medium, along with DREME E-values, the corresponding rank of the same motif when using native sequences as a starting point, the probable cognate transcription factor and that transcription factor’s known motif.

Extended Data Fig. 3 The transformer sequence-to-expression model generalizes reliably and characterizes sequence trajectories under different evolutionary regimes.

a–d, Prediction of expression from sequence in the complex (a, b) and defined (c, d) medium. Predicted (x axis) and experimentally measured (y axis) expression for (a, c) random test sequences (sampled separately from and not overlapping with the training data) and (b, d) native yeast promoter sequences containing random single-base mutations. Top left: Pearson’s r and associated two-tailed P value. Compression of predictions in the lower left results from binning differences during cell sorting in different experiments (Supplementary Notes). e, Predicted (x axis) and experimentally measured (y axis) expression in complex medium (YPD) for all native yeast promoter sequences. Pearson’s r and associated two-tailed P values are shown. f, Predicted expression divergence under random genetic drift. Distribution of the change in predicted expression (y axis) for random starting sequences (n = 5,720) at each mutational step (x axis) for trajectories simulated under random genetic drift. Silver bar: differences in expression between unrelated sequences. g, h, Comparison of the distribution of measured (light grey) and transformer model predicted (dark grey) changes in expression (y axis) in complex medium (g, n = 2,983) and defined medium (h, n = 2,986) for synthesized randomly designed sequences at each mutational step (x axis). i, j, Predicted expression evolution under SSWM. Distribution of predicted expression levels (y axis) in complex medium (i, n = 10,322) and defined medium (j, n = 6,304) at each mutational step (x axis) for sequence trajectories under SSWM favouring high (red) or low (blue) expression, starting with 5,720 native promoter sequences. (f–j) Midline: median; boxes: interquartile range; whiskers: 5th to 95th percentile range. k–r, Comparison of model predicted expression for sequences synthesized previously for the random genetic drift and SSWM analyses. Experimentally measured (y axis) and transformer model predicted (x axis) expression level (o–r) or expression change from the starting sequence (k–n) in complex (k, m, o, q) or defined (l, n, p, r) medium using sequences from the random genetic drift (Fig. 2c, Extended Data Fig. 1e; k, l, o, p here) and SSWM (Fig. 2g, Extended Data Fig. 1g; m, n, q, r here) validation experiments. Top left: Pearson’s r and associated two-tailed P values.
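The two quantities compared in these scatter panels, expression level and expression change from the starting sequence, together with the Pearson correlation reported in each panel, can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, the trajectory is assumed to be a list of expression values ordered by mutational step, and the two-tailed P value is not computed here.

```python
import math


def change_from_start(trajectory):
    """Expression change of each later step relative to the first
    (starting) sequence in a trajectory of expression values."""
    start = trajectory[0]
    return [value - start for value in trajectory[1:]]


def pearson_r(x, y):
    """Pearson correlation coefficient between predicted (x) and
    measured (y) values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

Applying `pearson_r` to the outputs of `change_from_start` for predicted and measured trajectories corresponds to the change panels; applying it to the raw values corresponds to the level panels.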

Extended Data Fig. 4 Signatures of stabilizing selection on gene expression detected from regulatory DNA across natural populations.

a, Expression-altering alleles in the CDC36 promoter are attributed primarily to altered UPC2 binding. Transcription factor interaction strength²⁶ (expression attributable to each transcription factor) difference between the high and low alleles (each point is a transcription factor) for each of two low-expression alleles (allele 1: x axis; allele 2: y axis). Each low-expression allele is compared to the high-expression allele with the most similar sequence (across all promoter sequences analysed from the 1,011 strains; \(e_{\mathrm{TF},A_{\mathrm{high}}}-e_{\mathrm{TF},A_{\mathrm{low}}}\)). b, Distribution of ECC (y axis, calculated from 1,011 S. cerevisiae genomes, top left) for S. cerevisiae genes whose orthologues have divergent (blue) or conserved (purple) expression within Saccharomyces (left, n = 4,191), Ascomycota (middle, n = 4,910) or mammals (right, n = 199), as determined by cross-species RNA-seq (top right). P values: two-sided Wilcoxon rank-sum test. Midline: median; boxes: interquartile range; whiskers: 5th to 95th percentile range. c, Determination of expression change threshold for defining a ‘tolerated mutation’ to compute mutational robustness. We used all genes with an ECC consistent with stabilizing selection (ECC > 0; left), calculated the variance in predicted expression across the 1,011 yeast strains for each gene, and chose the tolerable mutation threshold, \(\epsilon\), as two standard deviations of the distribution of the variance (right). ~73% of genes with ECC > 0 had an expression variation lower than \(\epsilon\). d, Distribution of the effects (magnitude; y axis) of mutations (rank ordered; x axis) on expression for all native regulatory sequences follows a power law with an exponent of 2.252. Shaded regions are equal in area.
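The threshold described in panel c can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, population (rather than sample) variance and standard deviation are assumed, "two standard deviations of the distribution of the variance" is read literally as twice the standard deviation of the per-gene variances, and mutational robustness is assumed to be the fraction of single mutations whose absolute expression change falls below the threshold.

```python
import statistics


def tolerance_threshold(expr_by_gene):
    """expr_by_gene maps each gene (assumed pre-filtered to ECC > 0)
    to its predicted expression across strains. Returns epsilon as
    twice the standard deviation of the per-gene variances."""
    variances = [statistics.pvariance(values)
                 for values in expr_by_gene.values()]
    return 2.0 * statistics.pstdev(variances)


def mutational_robustness(mutation_effects, epsilon):
    """Fraction of mutations whose absolute expression change is
    below epsilon (assumed definition of a 'tolerated mutation')."""
    if not mutation_effects:
        raise ValueError("mutation_effects must be non-empty")
    tolerated = sum(1 for delta in mutation_effects if abs(delta) < epsilon)
    return tolerated / len(mutation_effects)
```

Whether the threshold should also include the mean of the variance distribution is not stated in the legend, so that choice is left out here.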

Extended Data Fig. 5 Fitness responsivity of a gene as the total variation of its expression-to-fitness relationship F_GENE curves.

Expression (x axis) and fitness (y axis) level curves for each selected gene, fit from experimental measurements of expression and fitness across promoter variants by Keren et al.¹¹. Fitness responsivity, calculated as the total variation in each curve, is noted above each panel.
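For a curve sampled at discrete expression levels, total variation is the sum of absolute fitness differences between consecutive points. The sketch below illustrates that calculation; the function name and the (expression, fitness) pair representation are assumptions, and the sampling grid used by the authors is not specified in this legend.

```python
def fitness_responsivity(curve):
    """Total variation of a sampled expression-to-fitness curve.

    curve: iterable of (expression, fitness) pairs. Points are sorted
    by expression, then absolute fitness changes between neighbouring
    points are summed.
    """
    points = sorted(curve)
    fitness = [f for _, f in points]
    return sum(abs(b - a) for a, b in zip(fitness, fitness[1:]))
```

A flat curve gives a responsivity of zero, whereas a curve that rises and then falls accumulates both the rise and the fall.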

Extended Data Fig. 6 Analysis of regulatory evolvability reveals sequence-encoded signatures of expression conservation from solitary sequences.

a, Selection of optimal number of archetypes. Mean-square-reconstruction error (y axis) for reconstructing the evolvability vectors from the embeddings learned by the autoencoder for an increasing number of archetypes (x axis). Red circle: optimal number of archetypes selected as prescribed⁴⁵ by the ‘elbow method’. b, The archetypal embeddings learned by the autoencoder accurately capture evolvability vectors. Original (y axis) and reconstructed (x axis) expression changes (the values in the evolvability vectors) for each native sequence (none seen by the autoencoder in training). Top left: Pearson’s r and associated two-tailed P values. c–f, Evolvability space captures regulatory sequences’ evolutionary properties. Proximity to the malleable archetype (A_malleable) (x axis) and mutational robustness (c, e; y axis) or ECC (d, f; y axis) for all yeast genes (e, f) or the genes for which fitness responsivity was quantified (c, d). Top right: Spearman’s ρ and associated two-sided P value. ‘L’-shape of relationship in e results from the robust cleft, A_maxima, and A_minima all being distal to A_malleable (left side of plot). g, All native (S288C reference) promoter sequences (points) projected onto the archetypal evolvability space learned from random sequences; coloured by their ECC. Large coloured circles: evolvability archetypes. h, The proximity to the malleable archetype (x axis) and fitness responsivity (y axis) for the 80 genes with measured fitness responsivity. Top right: Spearman’s ρ and associated two-tailed P values. Light blue error band: 95% confidence interval. i, All native (S288C reference) promoter sequences (points) projected onto the evolvability space learned from random sequences; coloured by their mean pairwise distance in the archetypal evolvability space between all promoter alleles across the 1,011 yeast isolates for that gene (orthologue evolvability dispersion). Large coloured circles: evolvability archetypes.
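The orthologue evolvability dispersion in panel i is described as a mean pairwise distance between promoter alleles in the archetypal evolvability space. A minimal sketch of that calculation follows; the function name is hypothetical and Euclidean distance is an assumption, since the legend does not name the distance metric.

```python
import math
from itertools import combinations


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all unordered pairs of points.

    points: list of equal-length coordinate tuples, one per promoter
    allele, in the archetypal evolvability space.
    """
    pairs = list(combinations(points, 2))
    if not pairs:
        raise ValueError("need at least two points")
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)
```

A gene whose alleles all map to the same location has a dispersion of zero; alleles scattered across the space give larger values.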

Extended Data Fig. 7 Visualizing promoter fitness landscapes in sequence space.

Visualizing the fitness landscapes for the promoters of HXT3 (a), ADH1 (b), GCN4 (c), RPL3 (d), FBA1 (e), TUB3 (f), URA3 (in defined medium) (g), URA3 (in complex medium + 5-FOA) (h). 1,000 promoter sequences represented by their evolvability vectors projected onto the 2D archetypal evolvability space and coloured by their associated fitness as reflected by their predicted growth rate relative to wild type (colour, Methods), estimated by first mapping sequences to expression with our model and then expression to fitness as measured and estimated previously¹¹.

Extended Data Fig. 8 In silico mutagenesis of malleable and robust promoters.

SSWM trajectories for (a) DBP7, a malleable promoter, and (b) UTH1, a robust promoter. Each subplot shows the in silico mutagenesis effects for how expression level (colour) changes when mutating each position (x axis) to each of the four bases (y axis) of each sequence (subplots) in the trajectories. The DNA sequence is indicated above each wild-type subplot (indicated with ‘WT’ at left). Arrows indicate the mutations selected at each step, which always correspond to the mutation of maximal effect; increasing expression goes up the figure from wild type and decreasing expression goes down. Part of the malleability of the DBP7 promoter results from an intermediate-affinity Rap1p-binding site (grey bar). The first mutations in increasing- and decreasing-expression trajectories either increase or decrease (respectively) the affinity of this site. The UTH1 promoter changes gradually in expression and evolves proximal repressor binding sites to dampen expression (grey bars).
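The procedure shown in this figure, scoring every single-base substitution and then taking the mutation of maximal effect at each step, can be sketched as follows. This is an illustration under stated assumptions: `predict` stands in for a sequence-to-expression model and `toy_predict` is a made-up scoring function, not the authors' model; stopping when no mutation moves expression in the favoured direction is also an assumption.

```python
BASES = "ACGT"


def single_mutants(seq):
    """Yield (position, new_base, mutant_sequence) for every
    single-base substitution of seq."""
    for i, ref in enumerate(seq):
        for base in BASES:
            if base != ref:
                yield i, base, seq[:i] + base + seq[i + 1:]


def ism_effects(seq, predict):
    """In silico mutagenesis: predicted expression change for each
    single-base substitution, keyed by (position, new_base)."""
    baseline = predict(seq)
    return {(i, b): predict(m) - baseline for i, b, m in single_mutants(seq)}


def greedy_trajectory(seq, predict, steps, direction=1):
    """Follow the maximal-effect mutation at each step.
    direction=+1 favours higher expression, -1 lower expression."""
    trajectory = [seq]
    for _ in range(steps):
        current = trajectory[-1]
        effects = ism_effects(current, predict)
        (i, base), delta = max(effects.items(),
                               key=lambda item: direction * item[1])
        if direction * delta <= 0:
            break  # assumed stop rule: no mutation helps any further
        trajectory.append(current[:i] + base + current[i + 1:])
    return trajectory


def toy_predict(seq):
    """Hypothetical stand-in model: a fixed additive score per base."""
    weights = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}
    return sum(weights[c] for c in seq)
```

With the toy scorer, `greedy_trajectory("AC", toy_predict, 5)` walks from `AC` to `TC` to `TT` and then stops, because no remaining substitution raises the score.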

Supplementary information

Supplementary Information

This file contains Supplementary Notes, Supplementary Figures 1–21, legends for Supplementary Tables 1 and 2, Supplementary Tables 3 and 4, and additional references.

Reporting Summary

Supplementary Tables

This file contains Supplementary Tables 1 and 2; see main Supplementary Information PDF for legends.

About this article

Cite this article

Vaishnav, E.D., de Boer, C.G., Molinet, J. et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 603, 455–463 (2022). https://doi.org/10.1038/s41586-022-04506-6

Received: 08 February 2021
Accepted: 02 February 2022
Published: 09 March 2022
Issue Date: 17 March 2022
DOI: https://doi.org/10.1038/s41586-022-04506-6
