How transcription factors (TFs) interpret cis-regulatory DNA sequence to control gene expression remains unclear, largely because past studies using native and engineered sequences had insufficient scale. Here, we measure the expression output of >100 million synthetic yeast promoter sequences that are fully random. These sequences yield diverse, reproducible expression levels that can be explained by their chance inclusion of functional TF binding sites. We use machine learning to build interpretable models of transcriptional regulation that predict ~94% of the expression driven from independent test promoters and ~89% of the expression driven from native yeast promoter fragments. These models allow us to characterize each TF’s specificity, activity and interactions with chromatin. TF activity depends on binding-site strand, position, DNA helical face and chromatin context. Notably, expression level is influenced by weak regulatory interactions, which confound designed-sequence studies. Our analyses show that massive-throughput assays of fully random DNA can provide the big data necessary to develop complex, predictive models of gene regulation.
Open source code for our transcriptional models is available at https://github.com/Carldeboer/CisRegModels
Beer, M. A. & Tavazoie, S. Predicting gene expression from sequence. Cell 117, 185–198 (2004).
Yuan, Y., Guo, L., Shen, L. & Liu, J. S. Predicting gene expression from sequence: a reexamination. PLoS Comput. Biol. 3, e243 (2007).
Kinney, J. B., Murugan, A., Callan, C. G. Jr. & Cox, E. C. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl Acad. Sci. USA 107, 9158–9163 (2010).
van Arensbergen, J. et al. Genome-wide mapping of autonomous promoter activity in human cells. Nat. Biotechnol. 35, 145–153 (2017).
Muerdter, F. et al. Resolving systematic errors in widely used enhancer activity assays in human cells. Nat. Methods 15, 141–149 (2018).
Wang, X. et al. High-resolution genome-wide functional dissection of transcriptional regulatory regions and nucleotides in human. Nat. Commun. 9, 5380 (2018).
Sharon, E. et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol. 30, 521–530 (2012).
Gertz, J., Siggia, E. D. & Cohen, B. A. Analysis of combinatorial cis-regulation in synthetic and genomic promoters. Nature 457, 215–218 (2009).
Hughes, T. R. & de Boer, C. G. Mapping yeast transcriptional networks. Genetics 195, 9–36 (2013).
Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).
Nutiu, R. et al. Direct measurement of DNA affinity landscapes on a high-throughput sequencing instrument. Nat. Biotechnol. 29, 659–664 (2011).
Oliphant, A. R., Brandl, C. J. & Struhl, K. Defining the sequence specificity of DNA-binding proteins by selecting binding sites from random-sequence oligonucleotides: analysis of yeast GCN4 protein. Mol. Cell. Biol. 9, 2944–2949 (1989).
Horwitz, M. S. & Loeb, L. A. Promoters selected from random DNA sequences. Proc. Natl Acad. Sci. USA 83, 7405–7409 (1986).
Cuperus, J. T. et al. Deep learning of the regulatory grammar of yeast 5’ untranslated regions from 500,000 random sequences. Genome Res. 27, 2015–2024 (2017).
Yona, A. H., Alm, E. J. & Gore, J. Random sequences rapidly evolve into de novo promoters. Nat. Commun. 9, 1530 (2018).
Wunderlich, Z. & Mirny, L. A. Different gene regulation strategies revealed by analysis of binding motifs. Trends Genet. 25, 434–440 (2009).
Arnosti, D. N. & Kulkarni, M. M. Transcriptional enhancers: intelligent enhanceosomes or flexible billboards? J. Cell. Biochem. 94, 890–898 (2005).
de Boer, C. G. & Hughes, T. R. YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic Acids Res. 40, D169–D179 (2012).
Kosuri, S. et al. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl Acad. Sci. USA 110, 14024–14029 (2013).
Shalem, O. et al. Systematic dissection of the sequence determinants of gene 3’ end mediated expression control. PLoS Genet. 11, e1005147 (2015).
Granek, J. A. & Clarke, N. D. Explicit equilibrium modeling of transcription-factor binding and gene regulation. Genome Biol. 6, R87 (2005).
Kulkarni, M. M. & Arnosti, D. N. Information display by transcriptional enhancers. Development 130, 6569–6575 (2003).
Bussemaker, H. J., Li, H. & Siggia, E. D. Regulatory element detection using correlation with expression. Nat. Genet. 27, 167–171 (2001).
Conlon, E. M., Liu, X. S., Lieb, J. D. & Liu, J. S. Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci. USA 100, 3339–3344 (2003).
Liu, X., Lee, C. K., Granek, J. A., Clarke, N. D. & Lieb, J. D. Whole-genome comparison of Leu3 binding in vitro and in vivo reveals the importance of nucleosome occupancy in target site selection. Genome Res. 16, 1517–1528 (2006).
Zhang, Z. et al. A packing mechanism for nucleosome organization reconstituted across a eukaryotic genome. Science 332, 977–980 (2011).
Hesselberth, J. R. et al. Global mapping of protein–DNA interactions in vivo by digital genomic footprinting. Nat. Methods 6, 283–289 (2009).
Bernstein, B. E., Liu, C. L., Humphrey, E. L., Perlstein, E. O. & Schreiber, S. L. Global nucleosome occupancy in yeast. Genome Biol. 5, R62 (2004).
Hartley, P. D. & Madhani, H. D. Mechanisms that specify promoter nucleosome location and identity. Cell 137, 445–458 (2009).
Ganapathi, M. et al. Extensive role of the general regulatory factors, Abf1 and Rap1, in determining genome-wide chromatin structure in budding yeast. Nucleic Acids Res. 39, 2032–2044 (2011).
Levo, M. et al. Systematic investigation of transcription factor activity in the context of chromatin using massively parallel binding and expression assays. Mol. Cell 65, 604–617 e606 (2017).
Axelrod, J. D., Reagan, M. S. & Majors, J. GAL4 disrupts a repressing nucleosome during activation of GAL1 transcription in vivo. Genes Dev. 7, 857–869 (1993).
Morse, R. H. Nucleosome disruption by transcription factor binding in yeast. Science 262, 1563–1566 (1993).
Roberts, G. G. & Hudson, A. P. Transcriptome profiling of Saccharomyces cerevisiae during a transition from fermentative to glycerol-based respiratory growth reveals extensive metabolic and structural remodeling. Mol. Genet. Genomics 276, 170–186 (2006).
Forsburg, S. L. & Guarente, L. Identification and characterization of HAP4: a third component of the CCAAT-bound HAP2/HAP3 heteromer. Genes Dev. 3, 1166–1178 (1989).
Hedges, D., Proft, M. & Entian, K. D. CAT8, a new zinc cluster-encoding gene necessary for derepression of gluconeogenic enzymes in the yeast Saccharomyces cerevisiae. Mol. Cell. Biol. 15, 1915–1922 (1995).
Haurie, V. et al. The transcriptional activator Cat8p provides a major contribution to the reprogramming of carbon metabolism during the diauxic shift in Saccharomyces cerevisiae. J. Biol. Chem. 276, 76–85 (2001).
Sato, T. et al. The E-box DNA binding protein Sgc1p suppresses the gcr2 mutation, which is involved in transcriptional activation of glycolytic genes in Saccharomyces cerevisiae. FEBS Lett. 463, 307–311 (1999).
Grauslund, M. & Ronnow, B. Carbon source-dependent transcriptional regulation of the mitochondrial glycerol-3-phosphate dehydrogenase gene, GUT2, from Saccharomyces cerevisiae. Can. J. Microbiol. 46, 1096–1100 (2000).
Madhani, H. D. & Fink, G. R. Combinatorial control required for the specificity of yeast MAPK signaling. Science 275, 1314–1317 (1997).
Gavrias, V., Andrianopoulos, A., Gimeno, C. J. & Timberlake, W. E. Saccharomyces cerevisiae TEC1 is required for pseudohyphal growth. Mol. Microbiol. 19, 1255–1263 (1996).
Cullen, P. J. & Sprague, G. F. Jr. Glucose depletion causes haploid invasive growth in yeast. Proc. Natl Acad. Sci. USA 97, 13619–13624 (2000).
Neely, K. E., Hassan, A. H., Brown, C. E., Howe, L. & Workman, J. L. Transcription activator interactions with multiple SWI/SNF subunits. Mol. Cell. Biol. 22, 1615–1625 (2002).
Kim, T. S., Kim, H. Y., Yoon, J. H. & Kang, H. S. Recruitment of the Swi/Snf complex by Ste12-Tec1 promotes Flo8-Mss11-mediated activation of STA1 expression. Mol. Cell. Biol. 24, 9542–9556 (2004).
Harbison, C. T. et al. Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99–104 (2004).
Hibbs, M. A. et al. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics 23, 2692–2699 (2007).
Bednar, J. et al. Determination of DNA persistence length by cryo-electron microscopy. Separation of the static and dynamic contributions to the apparent persistence length of DNA. J. Mol. Biol. 254, 579–594 (1995).
Tanay, A. Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res. 16, 962–972 (2006).
Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186 (2017).
Liu, X., Li, Y. I. & Pritchard, J. K. Trans effects on gene expression can drive omnigenic inheritance. Cell 177, 1022–1034 e1026 (2019).
Gibson, D. G. et al. Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods 6, 343–345 (2009).
Erb, I. & van Nimwegen, E. Transcription factor binding site positioning in yeast: proximal promoter motifs characterize TATA-less promoters. PLoS One 6, e24279 (2011).
Tong, A. H. & Boone, C. Synthetic genetic array analysis in Saccharomyces cerevisiae. Methods Mol. Biol. 313, 171–192 (2006).
de Boer, C. High-efficiency S. cerevisiae lithium acetate transformation. protocols.io https://doi.org/10.17504/protocols.io.j4tcqwn (2017).
Deng, C., Daley, T. & Smith, A. D. Applications of species accumulation curves in large-scale biological data analysis. Quant. Biol. 3, 135–144 (2015).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Segal, E. & Widom, J. From DNA sequence to transcriptional behaviour: a quantitative approach. Nat. Rev. Genet. 10, 443–456 (2009).
Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. arXiv 1603.04467 (2016).
Kent, N. A., Eibert, S. M. & Mellor, J. Cbf1p is required for chromatin remodeling at promoter-proximal CACGTG motifs in yeast. J. Biol. Chem. 279, 27116–27123 (2004).
Cherry, J. M. et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 40, D700–D705 (2012).
Chua, G. et al. Identifying transcription factor functions and targets by phenotypic activation. Proc. Natl Acad. Sci. USA 103, 12045–12050 (2006).
We thank R. Nelken, J. Weinstein, A. Dixit, B. Cleary, K. Shekhar and U. Eser for analysis advice; C. Muus, B. Cleary, A. Dixit, Y. Oren, T. Jones, L. Mariani, K. Shekhar, J. B. Kinney, D. M. McCandlish and J. Vierstra for feedback on the manuscript; T. Delorey, J. Pfiffner and C. Bashor for experimental advice; L. Gaffney and A. Hupalowska for help with figures; P. Rogers for cell sorting; and E. Segal for the dual reporter yeast vector. C.G.D. was supported by a Fellowship from the Canadian Institutes for Health Research and by the NIH (grant no. K99-HG009920-01). E.D.V. was supported by the MIT Presidential Fellowship. Work was supported by the Klarman Cell Observatory, the NHGRI Center of Excellence in Genome Science, the HHMI (A.R.) and the Israel Science Foundation ICORE on Chromatin and RNA in Gene Regulation (N.F.).
A.R. is an SAB member of Thermo Fisher Scientific, Neogene Therapeutics, Asimov, and Syros Pharmaceuticals, an equity holder of Immunitas, and a founder of and equity holder in Celsius Therapeutics. All other authors declare no competing interests.
Integrated supplementary information
(a) Diverse expression in all tested promoter scaffolds. For each promoter scaffold (right), shown are the distributions of expression levels (log2(YFP/RFP), x axis) measured by flow cytometry for the entire library (gray filled curves) and for a few selected clones, each from a different single promoter from the library (colored line curves). (b) Expression level distribution of random DNA. Shown is the expression distribution of >2,000,000 promoters composed of 3,811 random scaffolds, each in combination with ~660 random 80 bp oligos included in the middle. The periodic peaks occur at the expression bins and result from the large number of promoters that were only observed in a single bin (and so have a discrete expression level). We considered any promoters that, upon re-sorting, would end up in any of the non-zero bins as ‘expressed’ (non-zero bins indicated in red; Methods). Note that the expression units are not equivalent to those used elsewhere; the dynamic range of this library is similar to that of the pTpA library, but it contains fewer high-expressing promoters.
Shown are the model-predicted (x axis) and measured (y axis) expression levels of held-out training data (never seen by the model, but generated in the same experiment as the training data) for (a) Abf1TATA+glucose, (b) pTpA+glucose, (c) pTpA+galactose, and (d) pTpA+glycerol. Since the pTpA+glucose billboard model (b) can explain 68.3% of this held-out data and 92.6% of the high-quality data, we estimate that the training data has ~24% noise, although this varies by experiment. (n = 250,000 promoters for each; Pearson’s r² is shown at bottom right; Blue lines: Generalized Additive Model lines of best fit.)
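The ~24% noise estimate quoted above appears to follow from the gap between the two r² values. A minimal sketch of that arithmetic, assuming the whole gap is attributed to measurement noise in the held-out training data (the variable names are illustrative, not taken from the authors' code):

```python
# Fractions of variance explained (Pearson's r^2) quoted in the caption for
# the pTpA+glucose billboard model.
r2_held_out = 0.683      # held-out data from the training experiment
r2_high_quality = 0.926  # high-quality test data

# Assumption: the difference between the two is attributed entirely to
# measurement noise in the training-quality data.
estimated_noise = r2_high_quality - r2_held_out

print(f"estimated noise fraction: {estimated_noise:.3f}")
```

The difference is 0.243, which matches the caption's figure of roughly 24%.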
Shown are the numbers of distinct promoters detected (y axis) when subsampling the pTpA+glucose sequencing data (black points) to different read coverage (x axis), after combining reads from all expression bins. Red curve: promoters projected to be detected with additional sequencing.
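The subsampling part of this analysis can be illustrated with a short sketch. It assumes reads are represented by promoter identifiers and are subsampled without replacement. The function name and toy data are hypothetical, and the projection to deeper sequencing (red curve) is not covered.

```python
import random


def distinct_at_depths(reads, depths, seed=0):
    """Count distinct promoters seen when subsampling `reads` to each depth.

    Subsamples are nested prefixes of one random shuffle, so each depth is a
    sample without replacement and the counts can only grow with depth.
    """
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)
    return [len(set(shuffled[:depth])) for depth in depths]


# Toy data: 1,000 reads drawn from 300 hypothetical promoter IDs.
toy_rng = random.Random(1)
reads = [toy_rng.randrange(300) for _ in range(1000)]
depths = [100, 250, 500, 1000]
curve = distinct_at_depths(reads, depths)

for depth, n_distinct in zip(depths, curve):
    print(depth, n_distinct)
```

A curve that is still rising steeply at full depth suggests that more sequencing would detect additional promoters.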
(a) Relationship between predicted binding of individual TFs and expression level. Measured expression level (pTpA+Glu data; y axis) vs. predicted binding (x axis) for Abf1 (left), Rsc30 (middle), and Ume6 (right). Ume6, whose motif is similar to that of Rsc30, is positively correlated with expression overall (r=0.14), but has a strong negative trend at high predicted occupancies. (n = 250,000 promoters for each; Pearson’s r shown at bottom right; Top: Motifs. Blue lines: GAM lines of best fit. Gray shaded areas: 95% confidence intervals.) (b) Model correctly predicts expression of designed sequences. Shown are the model-predicted (x axis) and measured (y axis) expression levels for 1,000 in silico generated random sequences designed based on the pTpA+glucose billboard model’s predictions to span a range of expression levels, and then measured experimentally. (n = 1,000 promoters; Pearson’s r² shown at bottom right.)
Distributions of measured expression levels (x axis) for 80 bp fragments of native yeast promoters, corresponding to different regions of the promoter (colors). Although each region included sequences from all promoters, the regions upstream of -200 are covered by nucleosomes in many promoters, which may explain their tendency to express at lower levels.
(a,b) Model predicts nucleosome occupancy well. (a) Model predicted (x axis) vs. measured (MNase-Seq, y axis) nucleosome occupancy. Four MNase biological replicates are shown (Methods). (Blue lines: Generalized Additive Model lines of best fit.) (b) As in (a), with replicates averaged, and only promoters present in both replicates shown. (c) Model-predicted accessibility across the YGR012W locus. A genome browser track showing yeast genes (top), measured nucleosome occupancy (Zhang, second from the top) and DNaseI hypersensitivity (Hesselberth, third from the top), and model predicted accessibilities (Ω). Promoters are predicted to be accessible and models provide similar predicted accessibilities. The four models were learned independently from different experimental data.
Shown are the numbers of TFs classified as activators, repressors, neither, or both in the yeast Gene Ontology (GO, Methods) (x axes) and whether they are predicted as (a) repressor (blue) or activator (red); or (b) closing (blue) or opening (red) chromatin, by each model (labels on top of subpanels). Hypergeometric P-values for overlaps between predicted activator/repressor (or chromatin opener/closer), compared with activator/repressor GO annotations are as shown (‘neither’ and ‘both’ categories are ignored). (n = 1 set of model predictions for each subpanel; 4 models total).
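For readers unfamiliar with the test, a hypergeometric overlap P-value can be computed as below. This is a generic sketch using only the Python standard library. The counts in the example are invented for illustration, and the exact test setup used for the figure is not specified here.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    population: total TFs considered (for example, GO activators + repressors)
    successes:  TFs annotated as activators in GO
    draws:      TFs the model predicts to be activators
    k:          observed overlap between the two sets
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return tail / total


# Hypothetical counts: 40 TFs, 15 GO activators, 12 predicted activators,
# 9 of which are also GO activators.
p_value = hypergeom_sf(9, 40, 15, 12)
print(f"overlap P-value: {p_value:.2e}")
```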
(a) Model initialized with random PWMs learns that General Regulatory Factor motifs open chromatin. Shown are 12 motifs with the highest potentiation parameter values for a model trained on pTpA+glucose data with 1,000 randomly initialized motifs, allowing for multiple variants of the same motif to be learned. Motifs correspond to the three canonical general regulatory factors, and Cbf1, previously suggested to have a similar function. (b-d) Model-refined motifs perform better in predicting TF binding and knockout effects in independent experiments. (b) Similar refinement in independent models. Comparison of the original TFBS motif (top) and model-refined motifs from each of the four models for two example motifs. (c,d) Shown are the absolute values of the Pearson correlation coefficient (|r|) when using either the original motifs (x axis) or the pTpA+Gal model-refined motifs (y axis) to predict whether (c) the gene’s expression will change in the corresponding TF mutant (compared to wild type) based on predicted binding to the promoter, or (d) a ChIP probe will be bound by the TF in a ChIP assay based on predicted binding to the ChIP probe. (Here, data were not subsampled). Overall, model-refined motifs perform better (points above diagonal), but some perform worse. Reduced performance can be due to condition specific regulators that are minimally active in our tested growth conditions (for example, Gcn4), redundancy between motifs (for example, Hsf1 has mono-, di-, and trimeric motifs), and overfitting of the original motif to the test data (for example, ChIP-derived motifs tested on ChIP data). (n = 824 and 303 experiments, respectively for (c) and (d)).
The positional model is identical to that in Fig. 2c, except that the predicted accessibility (Ω) is used to rescale the tensor of DNA binding sites (4) (rather than the DNA binding vector, as before), preserving binding site orientation and location in the resulting chromatin binding tensor. Positional chromatin binding is multiplied pairwise with the learned positional TF activities (5), and then summed (6) to obtain the activity contribution of each TF, which is used to calculate the overall expression level of the sequence (7). Actual parameter values are shown (pTpA+glucose model), with binding sites smoothed over 8 bp for visualization purposes.
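Steps (4) to (7) can be sketched in a few lines. This is a toy illustration rather than the published implementation (see the CisRegModels repository for that). It assumes a scalar accessibility Ω supplied as input and an additive final step with a hypothetical constant `bias`; all names are illustrative.

```python
def positional_expression(binding, activity, omega, bias=0.0):
    """Toy positional model.

    binding[t][s][p]  : predicted DNA binding of TF t on strand s at position p
    activity[t][s][p] : learned positional activity for the same indices
    omega             : predicted accessibility of the promoter (scalar, 0..1)
    bias              : hypothetical constant offset (an assumption)
    """
    contributions = []
    for tf_binding, tf_activity in zip(binding, activity):
        total = 0.0
        for strand_binding, strand_activity in zip(tf_binding, tf_activity):
            for b, a in zip(strand_binding, strand_activity):
                chromatin_binding = omega * b   # step 4: rescale by accessibility
                total += chromatin_binding * a  # steps 5-6: multiply, then sum
        contributions.append(total)
    expression = bias + sum(contributions)      # step 7 (assumed additive)
    return expression, contributions


# Two TFs, two strands, three positions (toy values).
binding = [
    [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
]
activity = [
    [[0.5, 0.5, 0.5], [0.1, 0.1, 0.1]],
    [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]],
]
expression, contributions = positional_expression(binding, activity, omega=0.5)
print(expression, contributions)
```

Because the binding tensor keeps strand and position as separate axes, the same site can contribute differently depending on where and in which orientation it occurs.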
Supplementary Figure 10 Downsampling analysis determines the extent of data needed to train models of varying complexity.
Model performance as reflected by prediction accuracy on the held-out validation data (Pearson’s r², y axis, top) and by the relative performance on validation-vs-training data (y axis, bottom, ratio of validation-vs-training Pearson r²; when <1.00, model is overfit) for models trained using sub-samples of the available training data (x axis) and learning different parameters (colors; parameters are cumulative from top to bottom of legend). Dotted lines: number of training examples needed to eliminate overfitting ((validation r² / training r²) > 0.999). Solid points: number of examples at which the maximal validation performance is achieved.
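The two summary statistics marked on these plots can be derived from a table of downsampling results, as in the sketch below. The function name and r² values are hypothetical, and the sketch assumes that "eliminating overfitting" means the smallest training-set size at which the validation-to-training ratio first exceeds the threshold.

```python
def summarize_downsampling(results, threshold=0.999):
    """Summarize a downsampling experiment.

    results: list of (n_examples, train_r2, val_r2) tuples, in any order.
    Returns (smallest n whose val/train r2 ratio exceeds `threshold`, or None;
             n at which validation r2 is maximal).
    """
    ordered = sorted(results)
    no_overfit_n = None
    for n, train_r2, val_r2 in ordered:
        if val_r2 / train_r2 > threshold:
            no_overfit_n = n
            break
    best_n = max(ordered, key=lambda row: row[2])[0]
    return no_overfit_n, best_n


# Hypothetical results for one model complexity.
toy_results = [
    (1_000, 0.90, 0.40),
    (10_000, 0.80, 0.70),
    (100_000, 0.78, 0.7795),
    (1_000_000, 0.785, 0.785),
]
no_overfit_n, best_n = summarize_downsampling(toy_results)
print(no_overfit_n, best_n)
```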
(a,b) Position-specific pTpA+Glu model-predicted expression levels (x axes) vs. measured expression levels (y axes) for (a) high-quality test data in the pTpA promoter scaffold, grown in glucose, and (b) native yeast promoter sequences, divided into 80 bp fragments and tested in the pTpA promoter scaffold, grown in glucose. (c) Most expression variation is attributed to accessibility. Position-specific pTpA+Glu model-predicted accessibility (Ω; x axis) vs. measured expression levels (y axis) for high-quality test data in the pTpA promoter scaffold, grown in glucose. Performance is better when incorporating positional activities (a,b), and, in particular, accessibility alone cannot distinguish the highest expression levels (measured expression >12). The dominant effect of accessibility on expression likely reflects accessibility being a prerequisite for expression. (n = 9,982, 70,924, and 9,982 promoters for (a-c), respectively; Pearson’s r² shown in bottom right; Red lines: Generalized Additive Model lines of best fit).
Position and strand preferences. Learned activity parameter values (y axis) for motifs in each position (x axis) and strand orientation (upper and lower panels) for each model (colors), for (a) Abf1, (b) Mcm1, (c) Ume6, (d) Mot3, (e) Azf1, and (f) Thi2.
(a) Capturing helically biased positional preferences. For each location within the promoter (x axis), shown are the learned activity bias parameters (red curve; as in Fig. 4b) for the poly-A motif, the long-range trend captured by a loess fit (green), and the short-range residual activity bias after subtracting the loess fit (blue) with reference 10.5 bp sine waves (black) for the minus strand (top) and plus strand (bottom) for the four different models (columns). (b) Modeling positional preferences increases predictive accuracy within the same scaffold but can drastically decrease it between scaffolds. For each training data set (four sub-panels) and both model types (colors), shown is the Pearson r² (y axis) capturing performance on each test dataset (x axis). (n = 1 set of model predictions per bar; 4 independent training sets each with 2 model types, evaluated on each of 3 test datasets total).
Measured (black) and predicted (red) expression levels for sequences containing each motif (rows), tiled at every possible position (x axes) and orientation (top and bottom halves) in each of three random background sequences (columns).
(a,b) Distinct regulatory networks for two sequences with nearly identical expression levels. Model-predicted regulatory networks for two 80 bp native promoter fragments using the positional pTpA+glucose model. Purple nodes: Individual TFs. Edge thickness and spacing: interaction strengths for each TF (green: positive/activating; red: negative/repressing). Both 80 bp promoter sequences have intermediate measured expression levels of ~10.5 in pTpA+glucose, but different regulatory networks: (a) the -160:-80 region of YDR456W is dominated by a single strong interaction, while (b) the -320:-240 region of YIR035C is regulated by many weak interactions. (c) Abundant weak regulatory interactions also explain most of the expression in the most relevant promoter region. Cumulative distributions (y axis) of the number of regulatory interactions (black) and fraction of expression explained (that is, cumulative total/sum of all effect sizes; red) for each interaction strength (x axis). Here, we consider only native promoter fragments from the -160:-80 region (relative to the TSS), which is the region our models are trained on and corresponds to the most active promoter region (Supplementary Fig. 5). The magnitude (and not the sign) of the interaction strength is considered.
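The two cumulative curves in (c) can be computed from a list of interaction strengths as sketched below. The function name and example values are hypothetical. As in the caption, only magnitudes are used and the signs are ignored.

```python
def cumulative_by_strength(strengths):
    """Return (magnitude, cumulative fraction of interactions,
    cumulative fraction of total effect) for each interaction,
    ordered from weakest to strongest."""
    magnitudes = sorted(abs(s) for s in strengths)
    total_effect = sum(magnitudes)
    n = len(magnitudes)
    rows = []
    running = 0.0
    for i, magnitude in enumerate(magnitudes, start=1):
        running += magnitude
        rows.append((magnitude, i / n, running / total_effect))
    return rows


# Hypothetical signed interaction strengths for one promoter fragment.
toy_strengths = [0.1, -0.1, 0.2, -0.6, 1.0]
rows = cumulative_by_strength(toy_strengths)
for magnitude, frac_interactions, frac_effect in rows:
    print(f"{magnitude:.1f}  {frac_interactions:.2f}  {frac_effect:.2f}")
```

In this toy example the three weakest interactions are 60% of all interactions but account for only 20% of the summed effect; the caption describes the opposite pattern for native fragments, where weak interactions explain most of the expression.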
Supplementary Figs. 1–15.
Motifs used in this study. Motif IDs are from the YeTFaSCo database. Motifs excluded from the motif frequency analysis (Fig. 1a) are indicated.
Promoter scaffolds included in the scaffold library. Sequences include 80 Ns in place of the random 80-mers and begin 13 bp upstream of the theoretical TSS.
TSS annotations of native promoters. Coordinates of TSSs for the v64 S. cerevisiae genome sequence.
All model parameters learned. For each of the eight models discussed in this paper (billboard and positional, for each of the four datasets), all model parameters, including TF activities, potentiations, motifs and (where applicable) position-specific activities.
About this article
Cite this article
de Boer, C.G., Vaishnav, E.D., Sadeh, R. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat Biotechnol (2019) doi:10.1038/s41587-019-0315-8