Abstract
Gene expression is regulated by transcription factors that work together to read cis-regulatory DNA sequences. The ‘cis-regulatory code’ — how cells interpret DNA sequences to determine when, where and how much genes should be expressed — has proven to be exceedingly complex. Recently, advances in the scale and resolution of functional genomics assays and machine learning have enabled substantial progress towards deciphering this code. However, the cis-regulatory code will probably never be solved if models are trained only on genomic sequences; regions of homology can easily lead to overestimation of predictive performance, and our genome is too short and has insufficient sequence diversity to learn all relevant parameters. Fortunately, randomly synthesized DNA sequences enable testing a far larger sequence space than exists in our genomes, and designed DNA sequences enable targeted queries to maximally improve the models. As the same biochemical principles are used to interpret DNA regardless of its source, models trained on these synthetic data can predict genomic activity, often better than genome-trained models. Here we provide an outlook on the field, and propose a roadmap towards solving the cis-regulatory code by a combination of machine learning and massively parallel assays using synthetic DNA.
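The abstract's point that the genome is too short to sample the relevant sequence space can be illustrated with simple arithmetic. The sketch below is only an illustration under stated assumptions, not the authors' analysis: the genome length of about 3.1 billion base pairs is an approximate figure, and the function names are hypothetical. It compares the number of possible DNA sequences of a given length (4 to the power of that length) with the largest number of windows of that length a genome could contain, and draws a random sequence of the kind a synthetic library might hold.

```python
import random

# Assumption: approximate haploid human genome length (~3.1 Gb), for illustration only.
HUMAN_GENOME_BP = 3_100_000_000


def sequence_space_size(length: int) -> int:
    """Number of distinct DNA sequences of the given length (4 bases per position)."""
    return 4 ** length


def max_genomic_windows(genome_bp: int, length: int) -> int:
    """Upper bound on the number of windows of this length on one strand of a genome."""
    return max(genome_bp - length + 1, 0)


def random_dna(length: int, rng: random.Random) -> str:
    """Draw one uniformly random DNA sequence (hypothetical library member)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def smallest_length_exceeding_genome(genome_bp: int) -> int:
    """Shortest length at which possible sequences outnumber available genomic windows."""
    length = 1
    while sequence_space_size(length) <= max_genomic_windows(genome_bp, length):
        length += 1
    return length


if __name__ == "__main__":
    rng = random.Random(0)
    print("example random 80-mer:", random_dna(80, rng))
    print("possible 80-mers:", sequence_space_size(80))
    print("genomic 80-bp windows (upper bound):", max_genomic_windows(HUMAN_GENOME_BP, 80))
    print("length where sequence space first exceeds the genome:",
          smallest_length_exceeding_genome(HUMAN_GENOME_BP))
```

Under these assumptions the genome cannot contain every possible sequence even at lengths far shorter than a typical regulatory element, and overlapping or homologous windows reduce its effective diversity further. The bound counts one strand only; counting both strands would at most double it.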
Acknowledgements
We thank B. Cleary, B. Cohen, G. Eraslan, A. Kundaje, B. Lehner, S. Mostafavi, A. M. Rafi, S. Reilly, C. Rogerson, A. Sasse, J. Schreiber, N. Shakiba, J. Shendure, B. van Steensel, M. Taipale, O. Tariq, X. Tu, M. Underhill, M. Weirauch and N. Yachie for helpful discussions. We apologise to all our colleagues whose work we could not cite due to the limit on the number of references. C.G.d.B. is a Michael Smith Health Research BC Scholar and is supported by a Stem Cell Network Jump Start award (ECR-C4R1-7).
Author information
Contributions
C.G.d.B. and J.T. conceptualized the paper. C.G.d.B. produced the first draft, analysed the data and created the figures with advice from J.T. C.G.d.B. and J.T. edited the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature thanks Shaun Mahony and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
de Boer, C.G., Taipale, J. Hold out the genome: a roadmap to solving the cis-regulatory code. Nature 625, 41–50 (2024). https://doi.org/10.1038/s41586-023-06661-w
DOI: https://doi.org/10.1038/s41586-023-06661-w