Abstract
Mutations in non-coding regulatory DNA sequences can alter gene expression, organismal phenotype and fitness1,2,3. Constructing complete fitness landscapes, in which DNA sequences are mapped to fitness, is a long-standing goal in biology, but has remained elusive because it is challenging to generalize reliably to vast sequence spaces4,5,6. Here we build sequence-to-expression models that capture fitness landscapes and use them to decipher principles of regulatory evolution. Using millions of randomly sampled promoter DNA sequences and their measured expression levels in the yeast Saccharomyces cerevisiae, we learn deep neural network models that generalize with excellent prediction performance, and enable sequence design for expression engineering. Using our models, we study expression divergence under genetic drift and strong-selection weak-mutation regimes to find that regulatory evolution is rapid and subject to diminishing returns epistasis; that conflicting expression objectives in different environments constrain expression adaptation; and that stabilizing selection on gene expression leads to the moderation of regulatory complexity. We present an approach for using such models to detect signatures of selection on expression from natural variation in regulatory sequences and use it to discover an instance of convergent regulatory evolution. We assess mutational robustness, finding that regulatory mutation effect sizes follow a power law, characterize regulatory evolvability, visualize promoter fitness landscapes, discover evolvability archetypes and illustrate the mutational robustness of natural regulatory sequence populations. Our work provides a general framework for designing regulatory sequences and addressing fundamental questions in regulatory evolution.
Data availability
Data generated for this study are available at the NCBI GEO with accession numbers GSE163045 and GSE163866. All models and processed data are available on Zenodo at https://zenodo.org/record/4436477.
Code availability
Code is available on GitHub at https://github.com/1edv/evolution and CodeOcean at https://codeocean.com/capsule/8020974/tree. A web app is available at https://1edv.github.io/evolution/.
References
Wittkopp, P. J. & Kalay, G. Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. Nat. Rev. Genet. 13, 59–69 (2011).
Hill, M. S., Vande Zande, P. & Wittkopp, P. J. Molecular and evolutionary processes generating variation in gene expression. Nat. Rev. Genet. 22, 203–215 (2021).
Fuqua, T. et al. Dense and pleiotropic regulatory information in a developmental enhancer. Nature 587, 235–239 (2020).
de Visser, J. A. G. M. & Krug, J. Empirical fitness landscapes and the predictability of evolution. Nat. Rev. Genet. 15, 480–490 (2014).
Kondrashov, D. A. & Kondrashov, F. A. Topological features of rugged fitness landscapes in sequence space. Trends Genet. 31, 24–33 (2015).
de Visser, J. A. G. M., Elena, S. F., Fragata, I. & Matuszewski, S. The utility of fitness landscapes and big data for predicting evolution. Heredity 121, 401–405 (2018).
Weirauch, M. T. & Hughes, T. R. Conserved expression without conserved regulatory sequence: the more things change, the more they stay the same. Trends Genet. 26, 66–74 (2010).
Orr, H. A. The genetic theory of adaptation: a brief history. Nat. Rev. Genet. 6, 119–127 (2005).
Weinreich, D. M., Lan, Y., Wylie, C. S. & Heckendorn, R. B. Should evolutionary geneticists worry about higher-order epistasis? Curr. Opin. Genet. Dev. 23, 700–707 (2013).
Venkataram, S. et al. Development of a comprehensive genotype-to-fitness map of adaptation-driving mutations in yeast. Cell 166, 1585–1596 (2016).
Keren, L. et al. Massively parallel interrogation of the effects of gene expression levels on fitness. Cell 166, 1282–1294 (2016).
Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).
Ogden, P. J., Kelsic, E. D., Sinai, S. & Church, G. M. Comprehensive AAV capsid fitness landscape reveals a viral gene and enables machine-guided design. Science 366, 1139–1143 (2019).
Pitt, J. N. & Ferré-D’Amaré, A. R. Rapid construction of empirical RNA fitness landscapes. Science 330, 376–379 (2010).
Shultzaberger, R. K., Malashock, D. S., Kirsch, J. F. & Eisen, M. B. The fitness landscapes of cis-acting binding sites in different promoter and environmental contexts. PLoS Genet. 6, e1001042 (2010).
Mustonen, V., Kinney, J., Callan, C. G. & Lässig, M. Energy-dependent fitness: a quantitative model for the evolution of yeast transcription factor binding sites. Proc. Natl Acad. Sci. USA 105, 12376–12381 (2008).
Hartl, D. L. What can we learn from fitness landscapes? Curr. Opin. Microbiol. 21, 51–57 (2014).
Otwinowski, J. & Nemenman, I. Genotype to phenotype mapping and the fitness landscape of the E. coli lac promoter. PLoS ONE 8, e61570 (2013).
Sinai, S. & Kelsic, E. D. A primer on model-guided exploration of fitness landscapes for biological sequence design. Preprint at https://arxiv.org/abs/2010.10614 (2020).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).
Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. Proc. 34th International Conference on Machine Learning 3145–3153 (2017).
Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
Fragata, I., Blanckaert, A., Louro, M. A. D., Liberles, D. A. & Bank, C. Evolution in the light of fitness landscape theory. Trends Ecol. Evol. 34, 69–82 (2019).
Payne, J. L. & Wagner, A. The causes of evolvability and their evolution. Nat. Rev. Genet. 20, 24–38 (2019).
de Boer, C. G. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat. Biotechnol. 38, 56–65 (2020).
Crocker, J. et al. Low affinity binding site clusters confer hox specificity and regulatory robustness. Cell 160, 191–203 (2015).
Habib, N., Wapinski, I., Margalit, H., Regev, A. & Friedman, N. A functional selection model explains evolutionary robustness despite plasticity in regulatory networks. Mol. Syst. Biol. 8, 619 (2012).
Gillespie, J. H. Molecular evolution over the mutational landscape. Evolution 38, 1116–1129 (1984).
Jerison, E. R. & Desai, M. M. Genomic investigations of evolutionary dynamics and epistasis in microbial evolution experiments. Curr. Opin. Genet. Dev. 35, 33–39 (2015).
Sæther, B.-E. & Engen, S. The concept of fitness in fluctuating environments. Trends Ecol. Evol. 30, 273–281 (2015).
Vaswani, A. et al. in Advances in Neural Information Processing Systems 30 (eds. Guyon, I. et al.) 5998–6008 (Curran Associates, 2017).
Weirauch, M. T. et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31, 126–134 (2013).
Yang, Z. & Bielawski, J. P. Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. 15, 496–503 (2000).
Moses, A. M. Statistical tests for natural selection on regulatory regions based on the strength of transcription factor binding sites. BMC Evol. Biol. 9, 286 (2009).
Rifkin, S. A., Houle, D., Kim, J. & White, K. P. A mutation accumulation assay reveals a broad capacity for rapid evolution of gene expression. Nature 438, 220–223 (2005).
Peter, J. et al. Genome evolution across 1,011 Saccharomyces cerevisiae isolates. Nature 556, 339–344 (2018).
Erb, I. & van Nimwegen, E. Transcription factor binding site positioning in yeast: proximal promoter motifs characterize TATA-less promoters. PLoS One 6, e24279 (2011).
Gilad, Y., Oshlack, A. & Rifkin, S. A. Natural selection on gene expression. Trends Genet. 22, 456–461 (2006).
Alhusaini, N. & Coller, J. The deadenylase components Not2p, Not3p, and Not5p promote mRNA decapping. RNA 22, 709–721 (2016).
Yang, J.-R., Maclean, C. J., Park, C., Zhao, H. & Zhang, J. Intra and interspecific variations of gene expression levels in yeast are largely neutral: (Nei Lecture, SMBE 2016, Gold Coast). Mol. Biol. Evol. 34, 2125–2139 (2017).
Chen, J. et al. A quantitative framework for characterizing the evolutionary history of mammalian gene expression. Genome Res. 29, 53–63 (2019).
Payne, J. L. & Wagner, A. Mechanisms of mutational robustness in transcriptional regulation. Front. Genet. 6, 322 (2015).
Shoval, O. et al. Evolutionary trade-offs, Pareto optimality, and the geometry of phenotype space. Science 336, 1157–1160 (2012).
van Dijk, D. et al. Finding archetypal spaces using neural networks. IEEE International Conference on Big Data 2634–2643 (2019).
He, X., Duque, T. S. P. C. & Sinha, S. Evolutionary origins of transcription factor binding site clusters. Mol. Biol. Evol. 29, 1059–1070 (2012).
Cliften, P. F. et al. Surveying Saccharomyces genomes to identify functional elements by comparative DNA sequence analysis. Genome Res. 11, 1175–1186 (2001).
Heinz, S., Romanoski, C. E., Benner, C. & Glass, C. K. The selection and function of cell type-specific enhancers. Nat. Rev. Mol. Cell Biol. 16, 144–154 (2015).
Lehner, B. Selection to minimise noise in living systems and its implications for the evolution of gene expression. Mol. Syst. Biol. 4, 170 (2008).
Metzger, B. P. H., Yuan, D. C., Gruber, J. D., Duveau, F. & Wittkopp, P. J. Selection on noise constrains variation in a eukaryotic promoter. Nature 521, 344–347 (2015).
Kosuri, S. et al. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl Acad. Sci. USA 110, 14024–14029 (2013).
Shalem, O. et al. Systematic dissection of the sequence determinants of gene 3’ end mediated expression control. PLoS Genet. 11, e1005147 (2015).
Kinney, J. B., Murugan, A., Callan, C. G. Jr & Cox, E. C. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl Acad. Sci. USA 107, 9158–9163 (2010).
Sharon, E. et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol. 30, 521–530 (2012).
Melnikov, A. et al. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nat. Biotechnol. 30, 271–277 (2012).
Kwasnieski, J. C., Mogno, I., Myers, C. A., Corbo, J. C. & Cohen, B. A. Complex effects of nucleotide variants in a mammalian cis-regulatory element. Proc. Natl Acad. Sci. USA 109, 19498–19503 (2012).
Kircher, M. et al. Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution. Nat. Commun. 10, 3583 (2019).
Townsley, K. G., Brennand, K. J. & Huckins, L. M. Massively parallel techniques for cataloguing the regulome of the human brain. Nat. Neurosci. 23, 1509–1521 (2020).
Renganaath, K. et al. Systematic identification of cis-regulatory variants that cause gene expression differences in a yeast cross. eLife 9, e62669 (2020).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15, 20170387 (2018).
Avsec, Ž. et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37, 592–600 (2019).
Quang, D. & Xie, X. FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods 166, 40–47 (2019).
Zhou, H. et al. Towards a better understanding of reverse-complement equivariance for deep learning models in genomics. Proc. 16th Machine Learning in Computational Biology meeting 165, 1–33 (2022).
Morrow, A. et al. Convolutional kitchen sinks for transcription factor binding site prediction. Preprint at https://arxiv.org/abs/1706.00125 (2017).
Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
Koo, P. K., Majdandzic, A., Ploenzke, M., Anand, P. & Paul, S. B. Global importance analysis: an interpretability method to quantify importance of genomic features in deep neural networks. PLoS Comput. Biol. 17, e1008925 (2021).
Quang, D. & Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44, e107 (2016).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. International Conference on Learning Representations (Poster) (2015).
Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. Software available from https://www.tensorflow.org/ (2015).
Jouppi, N. P. et al. In-datacenter performance analysis of a tensor processing unit. In Proc. 44th Annual International Symposium on Computer Architecture 1–12 (2017).
Li, J., Pu, Y., Tang, J., Zou, Q. & Guo, F. DeepATT: a hybrid category attention neural network for identifying functional effects of DNA sequences. Brief. Bioinform. 22, bbaa159 (2020).
Ullah, F. & Ben-Hur, A. A self-attention model for inferring cooperativity between regulatory features. Nucleic Acids Res. 49, e77 (2021).
Clauwaert, J., Menschaert, G. & Waegeman, W. Explainability in transformer models for functional genomics. Brief. Bioinform. 22, bbab060 (2021).
Hinton, G. & Tieleman, T. Lecture 6.5—RmsProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4, 26–31 (2012).
Sinai, S. et al. AdaLead: a simple and robust adaptive greedy search algorithm for sequence design. Preprint at https://arxiv.org/abs/2010.02141 (2020).
Linder, J., Bogard, N., Rosenberg, A. B. & Seelig, G. A generative neural network for maximizing fitness and diversity of synthetic DNA and protein sequences. Cell Syst. 11, 49–62 (2020).
Brookes, D., Park, H. & Listgarten, J. Conditioning by adaptive sampling for robust design. Proc. Mach. Learn. Res. 97, 773–782 (2019).
Killoran, N., Lee, L. J., Delong, A., Duvenaud, D. & Frey, B. J. Generating and designing DNA with deep generative models. Neurips Computational Biology Workshop (2017).
Fortin, F.-A., Rainville, F.-M. D., Gardner, M.-A., Parizeau, M. & Gagné, C. DEAP: evolutionary algorithms made easy. J. Mach. Learn. Res. 13, 2171–2175 (2012).
Jaeger, S. A. et al. Conservation and regulatory associations of a wide affinity range of mouse transcription factor binding sites. Genomics 95, 185–195 (2010).
Tanay, A. Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res. 16, 962–972 (2006).
Sniegowski, P. D. & Gerrish, P. J. Beneficial mutations and the dynamics of adaptation in asexual populations. Phil. Trans. R. Soc. B 365, 1255–1263 (2010).
Szendro, I. G., Franke, J., de Visser, J. A. & Krug, J. Predictability of evolution depends nonmonotonically on population size. Proc. Natl Acad. Sci. USA 110, 571–576 (2013).
Orr, H. A. The population genetics of adaptation: the adaptation of DNA sequences. Evolution 56, 1317–1330 (2002).
Bailey, T. L. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics 27, 1653–1659 (2011).
de Boer, C. G. & Hughes, T. R. YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic Acids Res. 40, D169–D179 (2012).
Kent, W. J. BLAT—the BLAST-Like Alignment Tool. Genome Res. 12, 656–664 (2002).
Cherry, J. M. et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 40, D700–D705 (2012).
Smith, J. D., McManus, K. F. & Fraser, H. B. A novel test for selection on cis-regulatory elements reveals positive and negative selection acting on mammalian transcriptional enhancers. Mol. Biol. Evol. 30, 2509–2518 (2013).
Liu, J. & Robinson-Rechavi, M. Robust inference of positive selection on regulatory sequences in the human brain. Sci. Adv. 6, eabc9863 (2020).
Rice, D. P. & Townsend, J. P. A test for selection employing quantitative trait locus and mutation accumulation data. Genetics 190, 1533–1545 (2012).
Denver, D. R., Morris, K., Lynch, M. & Thomas, W. K. High mutation rate and predominance of insertions in the Caenorhabditis elegans nuclear genome. Nature 430, 679–682 (2004).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Thompson, D. A. et al. Evolutionary principles of modular gene regulation in yeasts. eLife 2, e00603 (2013).
Yassour, M. et al. Strand-specific RNA sequencing reveals extensive regulated long antisense transcripts that are conserved across yeast species. Genome Biol. 11, R87 (2010).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).
Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005).
Wapinski, I., Pfeffer, A., Friedman, N. & Regev, A. Natural history and evolutionary principles of gene duplication in fungi. Nature 449, 54–61 (2007).
Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
Yates, A. D. et al. Ensembl 2020. Nucleic Acids Res. 48, D682–D688 (2020).
DiCarlo, J. E. et al. Genome engineering in Saccharomyces cerevisiae using CRISPR–Cas systems. Nucleic Acids Res. 41, 4336–4343 (2013).
Fleiss, A. et al. Reshuffling yeast chromosomes with CRISPR/Cas9. PLoS Genet. 15, e1008332 (2019).
Horwitz, A. A. et al. Efficient multiplexed integration of synergistic alleles and metabolic pathways in yeasts via CRISPR–Cas. Cell Syst. 1, 88–96 (2015).
Livak, K. J. & Schmittgen, T. D. Analysis of relative gene expression data using real-time quantitative PCR and the 2−ΔΔCT method. Methods 25, 402–408 (2001).
Vandesompele, J. et al. Accurate normalization of real-time quantitative RT–PCR data by geometric averaging of multiple internal control genes. Genome Biol. 3, research0034.1 (2002).
Teste, M.-A., Duquenne, M., François, J. M. & Parrou, J.-L. Validation of reference genes for quantitative expression analysis by real-time RT–PCR in Saccharomyces cerevisiae. BMC Mol. Biol. 10, 99 (2009).
Mardones, W. et al. Rapid selection response to ethanol in Saccharomyces eubayanus emulates the domestication process under brewing conditions. Microb. Biotechnol. https://doi.org/10.1111/1751-7915.13803 (2021).
Ibstedt, S. et al. Concerted evolution of life stage performances signals recent selection on yeast nitrogen use. Mol. Biol. Evol. 32, 153–161 (2015).
Rich, M. S. et al. Comprehensive analysis of the SUL1 promoter of Saccharomyces cerevisiae. Genetics 203, 191–202 (2016).
Rest, J. S. et al. Nonlinear fitness consequences of variation in expression level of a eukaryotic gene. Mol. Biol. Evol. 30, 448–456 (2013).
Bergen, A. C., Olsen, G. M. & Fay, J. C. Divergent MLS1 promoters lie on a fitness plateau for gene expression. Mol. Biol. Evol. 33, 1270–1279 (2016).
Alstott, J., Bullmore, E. & Plenz, D. Powerlaw: a Python package for analysis of heavy-tailed distributions. PLoS One 9, e85777 (2014).
Acknowledgements
We thank Google TPU Research Cloud for TPU access, L. Gaffney for help with figure preparation, Broad Genomics Platform for sequencing work, J.-C. Hütter for advice on fitness responsivity, J. Pfiffner-Borges for help with RNA-seq, R. Yu, B. Lee and N. Jaberi for manuscript feedback and members of the A.R. laboratory for discussions. E.D.V. was supported by the MIT Presidential Fellowship; C.G.d.B. was supported by a Canadian Institutes for Health Research Fellowship and the NIH (K99-HG009920-01); and F.A.C. and J.M. were supported by ANID (Programa Iniciativa Científica Milenio, ICN17_022). Work was supported by the Klarman Cell Observatory, Howard Hughes Medical Institute (HHMI) and Google TPU Research Cloud (https://sites.research.google/trc/about/). A.R. was an Investigator of the HHMI.
Author information
Contributions
E.D.V., C.G.d.B. and A.R. conceived, designed and supervised the study. E.D.V. and C.G.d.B. performed the analyses. M.Y., L.F., X.A. and D.A.T. performed and D.A.T., J.Z.L. and A.R. supervised the Ascomycota cross-species RNA-seq experiments. J.M. performed and F.A.C. supervised the CDC36 experiments. E.D.V. and C.G.d.B. performed the rest of the experiments. E.D.V., C.G.d.B. and A.R. wrote the manuscript.
Ethics declarations
Competing interests
A.R. is a co-founder and equity holder of Celsius Therapeutics and Immunitas and until 31 July 2020 was a member of the scientific advisory board of Thermo Fisher Scientific, Syros Pharmaceuticals, Neogene Therapeutics and Asimov. As of 1 August 2020, A.R. is an employee of Genentech. The other authors declare no competing interests.
Peer review
Peer review information
Nature thanks Martin Taylor and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 The convolutional sequence-to-expression model generalizes reliably and characterizes sequence trajectories under different evolutionary regimes.
a–d, Prediction of expression from sequence in complex (YPD) (a, b) and defined (SD-Uracil) (c, d) medium. Predicted (x axis) and experimentally measured (y axis) expression for (a, c) random test sequences (sampled separately from and not overlapping with the training data) and (b, d) native yeast promoter sequences containing random single base mutations. Top left: Pearson’s r and associated two-tailed P value. Compression of predictions in the lower left results from binning differences during cell sorting in different experiments (Supplementary Notes). e, Experimental validation of trajectories from simulations of random genetic drift. Distribution of measured (light grey) and predicted (dark grey) changes in expression in the defined medium (SD-Uracil) (y axis) for the synthesized randomly designed sequences (n = 2,986) at each mutational step (x axis). Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. f, g, Simulation and validation of expression trajectories under SSWM in defined medium (SD-Uracil). f, Distribution of predicted expression levels (y axis) in defined medium at each evolutionary time step (x axis) for sequences under SSWM favouring high (red) or low (blue) expression, starting with native promoter sequences (n = 5,720). Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. g, Experimentally measured expression distribution in defined medium (y axis) for the synthesized sequences (n = 6,304 sequences; 637 trajectories) at each mutational step (x axis) from predicted mutational trajectories under SSWM, favouring high (red) or low (blue) expression. Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. h–o, Experimental validation of predicted expression for sequences from the random genetic drift and SSWM simulations. 
Experimentally measured (y axis) and predicted (x axis) expression level (l–o) or expression change from the starting sequence (h–k) in complex (h, j, l, n) or defined (i, k, m, o) medium using sequences from the random genetic drift (Fig. 2e, Extended Data Fig. 1e; h, i, l, m here) and SSWM (Fig. 2g, Extended Data Fig. 1g; j, k, n, o here) validation experiments. Top left: Pearson’s r and associated two-tailed P values.
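The two simulated regimes in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that a drift step accepts one uniformly random substitution and that an SSWM step moves to the single-base neighbour with the most extreme predicted expression. `toy_predict` is a hypothetical stand-in for the trained sequence-to-expression model, and the start sequence is made up.

```python
import random

BASES = "ACGT"

def single_base_neighbours(seq):
    """Yield every sequence that differs from seq by exactly one substitution."""
    for i, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield seq[:i] + alt + seq[i + 1:]

def drift_step(seq, rng):
    """Random genetic drift (assumed form): accept one uniformly random substitution."""
    i = rng.randrange(len(seq))
    alt = rng.choice([b for b in BASES if b != seq[i]])
    return seq[:i] + alt + seq[i + 1:]

def sswm_step(seq, predict, favour_high=True):
    """Greedy SSWM step (assumed form): take the neighbour with the highest
    (or lowest) predicted expression."""
    pick = max if favour_high else min
    return pick(single_base_neighbours(seq), key=predict)

def trajectory(seq, step, n_steps):
    """Apply step repeatedly and return the start sequence plus each successor."""
    path = [seq]
    for _ in range(n_steps):
        path.append(step(path[-1]))
    return path

def toy_predict(seq):
    """Hypothetical stand-in for the model: fraction of A/T bases."""
    return sum(b in "AT" for b in seq) / len(seq)

rng = random.Random(0)
start = "GCGCGCGC"  # made-up starting sequence
up = trajectory(start, lambda s: sswm_step(s, toy_predict, True), 3)
drift = trajectory(start, lambda s: drift_step(s, rng), 3)
```

Swapping `toy_predict` for a real model's prediction function, or for a difference between predictions in two media, would leave the trajectory logic unchanged.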
Extended Data Fig. 2 Characterization of sequence trajectories under strong competing selection pressures using the convolutional model.
a, b, Expression is highly correlated between defined and complex medium. Measured (a) and predicted (b) expression in defined (x axis) and complex (y axis) medium for a set of test sequences measured in both media. Top left: Pearson’s r and associated two-tailed P values. c, Opposing relationships between organismal fitness and URA3 expression in two environments. Measured expression (x axis, using a YFP reporter) and fitness (y axis; when used as the promoter sequence for the URA3 gene) for yeast with each of 11 promoters predicted to span a wide range of expression levels in complex medium with 5-FOA (red), where higher expression of URA3 is toxic owing to URA3-mediated conversion of 5-FOA to 5-fluorouracil, and in defined medium lacking uracil (blue), where URA3 is required for uracil synthesis. Error bars: Standard error of the mean (n = 3 replicate experiments). d–f, Competing expression objectives constrain adaptation. d, e, Difference in predicted expression (y axis) at each evolutionary time step (x axis) under selection to maximize (red) or minimize (blue) the difference between expression in defined and complex medium, starting with either native sequences (d, as Fig. 2h, n = 5,720) or random sequences (e, n = 10,000). f, Distribution of predicted expression (y axis) in complex (blue) and defined (red) medium at each evolutionary time step (x axis) for a starting set of random sequences (n = 10,000). Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. g, Motifs enriched within sequences evolved for competing objectives in different environments. 
Top five most enriched motifs, found using DREME87 (Methods) within sequences computationally evolved from a starting set of random sequences to either maximize (left) or minimize (right) the difference in expression between defined and complex medium, along with DREME E-values, the corresponding rank of the same motif when using native sequences as a starting point, the probable cognate transcription factor and that transcription factor’s known motif.
Extended Data Fig. 3 The transformer sequence-to-expression model generalizes reliably and characterizes sequence trajectories under different evolutionary regimes.
a–d, Prediction of expression from sequence in the complex (a, b) and defined (c, d) medium. Predicted (x axis) and experimentally measured (y axis) expression for (a, c) random test sequences (sampled separately from and not overlapping with the training data) and (b, d) native yeast promoter sequences containing random single base mutations. Top left: Pearson’s r and associated two-tailed P value. Compression of predictions in the lower left results from binning differences during cell sorting in different experiments (Supplementary Notes). e, Predicted (x axis) and experimentally measured (y axis) expression in complex medium (YPD) for all native yeast promoter sequences. Pearson’s r and associated two-tailed P values are shown. f, Predicted expression divergence under random genetic drift. Distribution of the change in predicted expression (y axis) for random starting sequences (n = 5,720) at each mutational step (x axis) for trajectories simulated under random genetic drift. Silver bar: differences in expression between unrelated sequences. g, h, Comparison of the distribution of measured (light grey) and transformer model predicted (dark grey) changes in expression (y axis) in complex medium (g, n = 2,983) and defined medium (h, n = 2,986) for synthesized randomly designed sequences at each mutational step (x axis). i, j, Predicted expression evolution under SSWM. Distribution of predicted expression levels (y axis) in complex medium (i, n = 10,322) and defined medium (j, n = 6,304) at each mutational step (x axis) for sequence trajectories under SSWM favouring high (red) or low (blue) expression, starting with 5,720 native promoter sequences. (f–j) Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. k–r, Comparison of model predicted expression for sequences synthesized previously for the random genetic drift and SSWM analyses. 
Experimentally measured (y axis) and transformer model predicted (x axis) expression level (o–r) or expression change from the starting sequence (k–n) in complex (k, m, o, q) or defined (l, n, p, r) medium using sequences from the random genetic drift (Fig. 2c, Extended Data Fig. 1e; k, l, o, p here) and SSWM (Fig. 2g, Extended Data Fig. 1g; m, n, q, r here) validation experiments. Top left: Pearson’s r and associated two-tailed P values.
Extended Data Fig. 4 Signatures of stabilizing selection on gene expression detected from regulatory DNA across natural populations.
a, Expression-altering alleles in the CDC36 promoter are attributed primarily to altered UPC2 binding. Transcription factor interaction strength26 (expression attributable to each transcription factor) difference between the high and low alleles (each point is a transcription factor) for each of two low-expression alleles (allele 1: x axis; allele 2: y axis). Each low-expression allele is compared to the high-expression allele with the most similar sequence (across all promoter sequences analysed from the 1,011 strains; \(e_{\mathrm{TF},A_{\mathrm{high}}}-e_{\mathrm{TF},A_{\mathrm{low}}}\)). b, Distribution of ECC (y axis, calculated from 1,011 S. cerevisiae genomes, top left) for S. cerevisiae genes whose orthologues have divergent (blue) or conserved (purple) expression within Saccharomyces (left, n = 4,191), Ascomycota (middle, n = 4,910) or mammals (right, n = 199), as determined by cross-species RNA-seq (top right). P values: two-sided Wilcoxon rank-sum test. Midline: median; boxes: interquartile range; whiskers: 5th and 95th percentile range. c, Determination of expression change threshold for defining a ‘tolerated mutation’ to compute mutational robustness. We used all genes with an ECC consistent with stabilizing selection (ECC > 0; left), calculated the variance in predicted expression across the 1,011 yeast strains for each gene, and chose the tolerable mutation threshold, \(\epsilon\), as two standard deviations of the distribution of the variance (right). ~73% of genes with ECC > 0 had an expression variation lower than \(\epsilon\). d, Distribution of the effects (magnitude; y axis) of mutations (rank ordered; x axis) on expression for all native regulatory sequences follows a power law with an exponent of 2.252. Shaded regions are equal in area.
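The thresholding described in panel c can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published procedure: it reads "two standard deviations" literally as twice the population standard deviation of the per-gene variances, and it assumes mutational robustness is the fraction of single-base mutants whose predicted expression change stays within that threshold. All numbers are made up.

```python
from statistics import pstdev

def tolerance_threshold(per_gene_variance):
    """Epsilon as two standard deviations of the per-gene variance distribution
    (assumed literal reading of the caption)."""
    return 2.0 * pstdev(per_gene_variance)

def mutational_robustness(wild_type_expr, mutant_exprs, eps):
    """Fraction of mutants whose absolute expression change is at most eps
    (assumed definition of a 'tolerated mutation')."""
    tolerated = [abs(m - wild_type_expr) <= eps for m in mutant_exprs]
    return sum(tolerated) / len(tolerated)

# Made-up per-gene variances in predicted expression across strains.
variances = [0.0, 0.0, 1.0, 1.0]
eps = tolerance_threshold(variances)

# Made-up predicted expression for one promoter and four of its single-base mutants.
robustness = mutational_robustness(10.0, [10.2, 9.5, 12.0, 10.9], eps)
```

In practice the mutant predictions would come from the sequence-to-expression model evaluated on every single-base neighbour of a promoter.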
Extended Data Fig. 5 Fitness responsivity of a gene as the total variation of its expression-to-fitness relationship FGENE curves.
Expression (x axis) and fitness (y axis) level curves for each selected gene, fit from experimental measurements of expression and fitness across promoter variants by Keren et al.11. Fitness responsivity, calculated as the total variation in each curve, is noted above each panel.
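For a curve sampled at a finite set of increasing expression levels, total variation reduces to the sum of absolute differences between consecutive fitness values. The Python sketch below shows that discrete form; the sampling grid and the example curves are hypothetical, and the fitted curves in the figure may be evaluated differently.

```python
def total_variation(fitness_values):
    """Sum of absolute changes between consecutive fitness values along a curve
    sampled at increasing expression levels."""
    return sum(abs(b - a) for a, b in zip(fitness_values, fitness_values[1:]))

# Made-up curves: fitness that ignores expression, and fitness with an optimum.
flat = [1.0, 1.0, 1.0]
peaked = [0.5, 1.0, 0.5]

flat_responsivity = total_variation(flat)
peaked_responsivity = total_variation(peaked)
```

A gene whose fitness is insensitive to expression scores zero, while any rise and fall in the curve adds to the score regardless of direction.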
Extended Data Fig. 6 Analysis of regulatory evolvability reveals sequence-encoded signatures of expression conservation from solitary sequences.
a, Selection of optimal number of archetypes. Mean-square-reconstruction error (y axis) for reconstructing the evolvability vectors from the embeddings learned by the autoencoder for an increasing number of archetypes (x axis). Red circle: optimal number of archetypes selected as prescribed45 by the ‘elbow method’. b, The archetypal embeddings learned by the autoencoder accurately capture evolvability vectors. Original (y axis) and reconstructed (x axis) expression changes (the values in the evolvability vectors) for each native sequence (none seen by the autoencoder in training). Top left: Pearson’s r and associated two-tailed P values. c–f, Evolvability space captures regulatory sequences’ evolutionary properties. Proximity to the malleable archetype (Amalleable) (x axis) and mutational robustness (c, e y axis) or ECC (d, f y axis) for all yeast genes (e, f) or the genes for which fitness responsivity was quantified (c, d). Top right: Spearman’s ρ and associated two-sided P value. ‘L’-shape of relationship in e results from the robust cleft, Amaxima, and Aminima all being distal to Amalleable (left side of plot). g, All native (S288C reference) promoter sequences (points) projected onto the archetypal evolvability space learned from random sequences; coloured by their ECC. Large coloured circles: evolvability archetypes. h, The proximity to the malleable archetype (x axis) and fitness responsivity (y axis) for the 80 genes with measured fitness responsivity. Top right: Spearman’s ρ and associated two-tailed P values. Light blue error band: 95% confidence interval. i, All native (S288C reference) promoter sequences (points) projected onto the evolvability space learned from random sequences; coloured by their mean pairwise distance in the archetypal evolvability space between all promoter alleles across the 1,011 yeast isolates for that gene (orthologue evolvability dispersion). Large coloured circles: evolvability archetypes.
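The orthologue evolvability dispersion in panel i is a mean pairwise distance between the promoter alleles of one gene in the archetypal evolvability space. A minimal sketch follows, assuming Euclidean distance and alleles given as coordinate tuples; the distance metric, function name and toy coordinates are assumptions for illustration.

```python
from itertools import combinations
from math import dist

def evolvability_dispersion(allele_coordinates):
    """Mean pairwise Euclidean distance between allele embeddings.

    allele_coordinates is a list of equal-length coordinate tuples,
    one per promoter allele of a single gene.
    """
    pairs = list(combinations(allele_coordinates, 2))
    if not pairs:
        return 0.0  # a single allele has no dispersion
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

# Toy embeddings (hypothetical values) for three alleles of one gene.
dispersion = evolvability_dispersion([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
```

Identical alleles give a dispersion of zero, and alleles that are spread widely across the space give larger values.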
Extended Data Fig. 7 Visualizing promoter fitness landscapes in sequence space.
Visualizing the fitness landscapes for the promoters of HXT3 (a), ADH1 (b), GCN4 (c), RPL3 (d), FBA1 (e), TUB3 (f), URA3 (in defined medium) (g), URA3 (in complex medium + 5FOA) (h). 1,000 promoter sequences represented by their evolvability vectors projected onto the 2D archetypal evolvability space and coloured by their associated fitness as reflected by their predicted growth rate relative to wild type (colour, Methods), estimated by first mapping sequences to expression with our model and then expression to fitness as measured and estimated previously11.
Extended Data Fig. 8 In silico mutagenesis of malleable and robust promoters.
SSWM trajectories for (a) DBP7, a malleable promoter, and (b) UTH1, a robust promoter. Each subplot shows the in silico mutagenesis effects for how expression level (colour) changes when mutating each position (x axis) to each of the four bases (y axis) of each sequence (subplots) in the trajectories. The DNA sequence is indicated above each wild-type subplot (indicated with ‘WT’ at left). Arrows indicate the mutations selected at each step, which always correspond to the mutation of maximal effect; increasing expression goes up the figure from wild type and decreasing expression goes down. Part of the malleability of the DBP7 promoter results from an intermediate-affinity Rap1p-binding site (grey bar). The first mutations in increasing- and decreasing-expression trajectories either increase or decrease (respectively) the affinity of this site. The UTH1 promoter changes gradually in expression and evolves proximal repressor binding sites to dampen expression (grey bars).
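The procedure in this legend (score every single-base substitution, then take the one of maximal effect at each step) can be sketched as follows. The predictor here is a hypothetical toy function that counts G and C bases; it stands in for the sequence-to-expression model only so that the sketch runs, and the function names are illustrative assumptions. The sketch also applies a substitution at every step without checking whether it changes expression in the desired direction.

```python
BASES = "ACGT"

def in_silico_mutagenesis(sequence, predict):
    """Predicted expression for every single-base substitution.

    Returns a dict keyed by (position, new_base).
    """
    effects = {}
    for position, reference in enumerate(sequence):
        for base in BASES:
            if base != reference:
                mutant = sequence[:position] + base + sequence[position + 1:]
                effects[(position, base)] = predict(mutant)
    return effects

def greedy_trajectory(sequence, predict, steps, increase=True):
    """Apply the maximal-effect substitution at each step."""
    trajectory = [sequence]
    choose = max if increase else min
    for _ in range(steps):
        effects = in_silico_mutagenesis(sequence, predict)
        (position, base), _ = choose(effects.items(), key=lambda item: item[1])
        sequence = sequence[:position] + base + sequence[position + 1:]
        trajectory.append(sequence)
    return trajectory

def toy_predict(sequence):
    """Hypothetical stand-in model: expression equals the G/C count."""
    return sum(base in "GC" for base in sequence)

up = greedy_trajectory("AATT", toy_predict, steps=2, increase=True)
down = greedy_trajectory("GGCC", toy_predict, steps=2, increase=False)
```

With the toy predictor, each step of the increasing trajectory adds one G or C, and each step of the decreasing trajectory removes one.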
Supplementary information
Supplementary Information
This file contains Supplementary Notes, Supplementary Figures 1–21, legends for Supplementary Tables 1 and 2, Supplementary Tables 3 and 4, and additional references.
Supplementary Tables
This file contains Supplementary Tables 1 and 2; see main Supplementary Information PDF for legends.
About this article
Cite this article
Vaishnav, E.D., de Boer, C.G., Molinet, J. et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 603, 455–463 (2022). https://doi.org/10.1038/s41586-022-04506-6
DOI: https://doi.org/10.1038/s41586-022-04506-6
This article is cited by
- Leveraging massively parallel reporter assays for evolutionary questions. Genome Biology (2023)
- Advances in biosynthesis of scopoletin. Microbial Cell Factories (2022)
- Controlling gene expression with deep generative design of regulatory DNA. Nature Communications (2022)
- AI predicts the effectiveness and evolution of gene promoter sequences. Nature (2022)
- Evaluating deep learning for predicting epigenomic profiles. Nature Machine Intelligence (2022)