Abstract
Transcription factor binding sites are being discovered at a rapid pace (refs 1, 2). It is now necessary to turn attention towards understanding how these sites work in combination to influence gene expression. Quantitative models that accurately predict gene expression from promoter sequence (refs 3, 4, 5) will be a crucial part of solving this problem. Here we present such a model, based on the analysis of synthetic promoter libraries in yeast (Saccharomyces cerevisiae). Thermodynamic models based only on the equilibrium binding of transcription factors to DNA and to each other captured a large fraction of the variation in expression in every library. Thermodynamic analysis of these libraries uncovered several phenomena in our system, including cooperativity and the effects of weak binding sites. When applied to the S. cerevisiae genome, a model of repression by Mig1 (which was trained on synthetic promoters) predicts a number of Mig1-regulated genes that lack significant Mig1-binding sites in their promoters. The success of the thermodynamic approach suggests that the information encoded by combinations of cis-regulatory sites is interpreted primarily through simple protein–DNA and protein–protein interactions, with complicated biochemical reactions, such as nucleosome modifications, being downstream events. Quantitative analyses of synthetic promoter libraries will be an important tool in unravelling the rules underlying combinatorial cis-regulation.
Thermodynamic models of gene regulation have shown promising results in eukaryotic systems (refs 6, 7) when applied to small gene sets. Owing to limitations in studying genomic promoters, the number of observations in these studies is small compared to the number of molecular events that are modelled, and over-fitting is therefore a serious concern. An approach that circumvents this limitation is to model the expression of synthetic promoters (refs 8, 9, 10). Because in principle any promoter sequence can be created and analysed, a large portion of possible regulatory element combinations can be evaluated.
We constructed synthetic promoter libraries consisting of random combinations of three to four transcription factor binding sites (TFBS), or building blocks (Table 1 and Supplementary Information). In total, we analysed 2,807 promoters among 7 libraries using 18 different building blocks. All promoters were placed upstream of a medium strength basal promoter driving yellow fluorescent protein (YFP; Supplementary Fig. 1) and integrated into the yeast genome at the TRP1 locus. The level of gene expression directed by each synthetic promoter was quantified by flow cytometry of 25,000 individual cells per promoter (Fig. 1a, b).
Figure 1: Gene expression measurements.

a, b, Graphs of cell volume versus fluorescence for 25,000 individual cells containing the promoters S'MM'M (a) and G'S'G'S'M' (b). Here S = Spacer, G = Gcr1 site, M = Mig1 site and prime indicates a site in the reverse orientation. c, Histogram of expression values for all L1 library members. Expression values were computed as the average fluorescence/volume ratio for 25,000 individual cells, and then normalized to plate controls. Control promoters with no library insert are shown in red. a.u., arbitrary units.
Figure 1c shows the expression levels of 429 synthetic promoters from the L1 library (see Supplementary Tables 1–7 for expression and sequence of all promoters). Basal promoter only controls (Fig. 1c, shown in red) were used to estimate the technical variance of our expression measurements, which is 1.3% of the total variance of the L1 library; the average technical variance for all libraries is 0.8% of the total variance. The biological replicate variance, which refers to the gene expression differences between independent transformants that have the same synthetic promoter by chance, is 35% of the total variance in the L1 library and 17% on average. Therefore, a perfect model relating promoter sequence to our expression data would explain 65% of the variance in expression driven by the different promoters in the L1 library.
We constructed a thermodynamic model of the relationship between promoter sequence and expression. The purpose of the model was to provide a formal mathematical framework for predicting the activity of novel combinations of cis-regulatory sites, and to gain insight into the mechanisms that generate diverse expression levels from different arrangements of the same cis-regulatory sites. We used a model first proposed in ref. 11, and later modified in ref. 12. The main assumption of this model is that gene regulation is controlled completely by the equilibrium binding of proteins to DNA and to each other. Enzymatic events, such as chromatin modifications and polymerase phosphorylation, are not taken into account. The model consists of parameters that describe the changes in free energy of particular DNA–protein and protein–protein interactions that can occur on the promoters. These parameters are used to calculate the probability of RNA polymerase (RNAP) being bound to each promoter in the library (see Supplementary Information). We then assume that the probability of RNAP being bound to a given promoter is directly proportional to the intensity of YFP fluorescence measured for that promoter.
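The equilibrium calculation described above can be sketched as follows. This is a minimal illustration of a generic statistical-mechanical occupancy model, not the authors' implementation (which is given in their Supplementary Information). The function name, the free-energy values used in the usage note, and the choice of RT are all assumptions made for illustration; they are not the fitted parameters from Supplementary Table 8.

```python
from itertools import product
from math import exp

RT = 0.6  # kcal/mol; assumed value, roughly room temperature


def p_rnap_bound(site_dG, pair_dG, rnap_dG, rnap_pair_dG):
    """Probability that RNAP is bound, from Boltzmann-weighted promoter states.

    site_dG      -- free energy of each factor binding its site (hypothetical
                    values; assumed to already include the concentration term)
    pair_dG      -- {(i, j): dG} interaction energies between bound factors
    rnap_dG      -- free energy of RNAP binding the basal promoter
    rnap_pair_dG -- interaction energy of each bound factor with RNAP
    """
    n = len(site_dG)
    z_total = 0.0  # partition function over all states
    z_bound = 0.0  # part of the partition function with RNAP bound
    for occ in product((0, 1), repeat=n):  # every occupancy pattern of the sites
        for rnap in (0, 1):
            dG = sum(g for g, o in zip(site_dG, occ) if o)
            dG += sum(g for (i, j), g in pair_dG.items() if occ[i] and occ[j])
            if rnap:
                dG += rnap_dG
                dG += sum(g for g, o in zip(rnap_pair_dG, occ) if o)
            weight = exp(-dG / RT)
            z_total += weight
            if rnap:
                z_bound += weight
    return z_bound / z_total
```

In this sketch a repressor is simply a factor whose interaction energy with RNAP is unfavourable (positive), for example `p_rnap_bound([-1.0], {}, 0.0, [2.0])`, while an activator has a favourable (negative) one. Under the proportionality assumption stated above, the returned probability would be scaled by a single constant to compare with measured fluorescence.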
In every library, thermodynamic models explained 44–59% of the variance in expression (Table 1), which is more than double the amount of variance explained by the best models of genome-wide expression data (refs 4, 5). The thermodynamic model for the L1 library captured 49% of the variance in expression (Supplementary Fig. 2; 75% of the available variance). The overall success of the thermodynamic approach indicates that expression driven by combinations of binding sites can be generally and accurately modelled by simply considering protein–DNA and protein–protein binding events.
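The relationship between the total, available and captured variance for the L1 library can be checked with a few lines of arithmetic. This sketch uses only the percentages quoted in the text; the function names are illustrative, and it assumes (as the quoted 65% figure implies) that the available variance is the total minus the biological replicate variance.

```python
def explainable_fraction(replicate_var_fraction):
    """Fraction of total variance a perfect sequence-based model could explain."""
    return 1.0 - replicate_var_fraction


def fraction_of_available(r_squared, replicate_var_fraction):
    """Variance captured by the model, relative to the explainable variance."""
    return r_squared / explainable_fraction(replicate_var_fraction)


# Values quoted for the L1 library: 35% replicate variance, 49% captured.
available = explainable_fraction(0.35)
captured = fraction_of_available(0.49, 0.35)
print(f"available: {available:.2f}, captured of available: {captured:.2f}")
```

Dividing 0.49 by 0.65 gives approximately 0.75, which matches the 75% of available variance reported for the L1 model.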
To determine the predictive power of our model for the L1 library, we constructed the L1-test library, which consists of novel combinations of the L1 building blocks. With the same parameter values from the L1 library, the model still captures 44% of the variance in expression, implying that the model is not over-fitted. This lack of over-fitting is not surprising, considering that each model contains about 6 parameters fitted to an average of over 400 promoters. The Mig1 parameter values found in the L1 library were held constant among thermodynamic models for three other libraries (L1-test, L1-weak and L2) that all exhibited high predictive power (see Supplementary Table 8 for all parameter values). Our model for the L1 library predicts that the 'Spacer' building block, which we designed to contain no known or predicted regulatory sequence elements, can recruit RNAP to promoters. As about half of the DNA-binding proteins in yeast do not yet have an associated cis-regulatory motif (ref. 1), it is likely that the Spacer site is actually an unidentified cis-regulatory element. The ability of the model to incorporate an unknown sequence element and accurately predict its behaviour points to a strength of the approach.
Analysis of the model for the L1 library suggests that Mig1 binds cooperatively to the synthetic promoters. Because nothing in the previous literature suggested cooperativity between Mig1 monomers, we decided to analyse Mig1 cooperativity independent of the model. We fitted a Hill equation relating percentage repression to the number of Mig1 sites, with the assumption that 100% repression occurs with five Mig1-binding sites. We found that a Hill coefficient of 3.4 ± 0.25 and K = 1.8 (where K is the number of Mig1 sites that causes half maximal repression) gives the best fit, suggesting cooperativity. Figure 2a shows that the observed data fit well to the Hill equation and that without cooperativity the fit is substantially worse. These results are consistent with the thermodynamic model and suggest that Mig1 acts cooperatively to repress transcription in our system, which led us to examine the influence of low affinity TFBS on expression.
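The contrast between a cooperative and a non-cooperative Hill curve can be illustrated with the standard Hill form, using the best-fit values quoted above as defaults. This is a sketch under assumptions: the function name is illustrative, and the text does not state exactly how the curve was normalized to 100% repression at five sites, so no rescaling is applied here.

```python
def hill_repression(n_sites, hill_coeff=3.4, k_half=1.8):
    """Fractional repression for a given number of Mig1 sites.

    Standard Hill form n^h / (K^h + n^h). The normalization used in the
    original fit (100% repression at five sites) is not reproduced here.
    """
    return n_sites**hill_coeff / (k_half**hill_coeff + n_sites**hill_coeff)


for n in range(6):
    cooperative = hill_repression(n)
    independent = hill_repression(n, hill_coeff=1.0)  # no cooperativity
    print(n, round(cooperative, 3), round(independent, 3))
```

With a coefficient of 3.4 the curve is sigmoidal: one site gives less repression than the non-cooperative curve and three sites give more, which is the qualitative signature of cooperativity.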
Figure 2: Mig1-binding sites act cooperatively, and a weak Mig1 site represses weakly.

a, Hill equation with a Hill coefficient of 3.4 and K = 1.8 (red) fits the observed data (blue) well, compared to a Hill equation with a Hill coefficient of 1 (green). b, Plot of average expression versus the number of weak sites (blue) without strong sites, and versus the number of strong sites (red) without weak sites. Error bars, ±1 s.d. c, Plots of expression for pairs of promoters that are almost identical except that either one strong Mig1 site or two strong Mig1 sites replace one strong and one weak Mig1 site. A blue circle represents one promoter pair and the red line represents equal expression.
Low affinity, or weak, TFBS are known to play important roles in prokaryotic promoters (ref. 13) and have been postulated to be important in eukaryotic gene regulation (ref. 14). However, their quantitative effect on gene expression is difficult to determine. To study the effects of weak TFBS, we constructed a library (L1-weak) incorporating a building block matching a Mig1-binding site that was shown to have low affinity for Mig1 in vitro (ref. 15). The sequence of this weak site scores below any reasonable cut-off in a genome scan for Mig1 sites based on a weight matrix derived from known Mig1 sites (refs 16, 17). In our system, the low affinity Mig1 site behaved as a weaker repressor than the strong site (Fig. 2b). However, when there are strong Mig1 sites present in a promoter, the weak sites behave as strong sites. When comparing promoters with the same building block content except for the number of Mig1 sites, promoters with one weak and one strong Mig1 site exhibit lower expression compared to promoters with one strong Mig1 site (Fig. 2c; P < 10⁻⁸, sign test, n = 211) and the same expression as promoters with two strong Mig1 sites (Fig. 2c; P > 10⁻², sign test, n = 177). This behaviour suggests that strong and weak Mig1 sites interact cooperatively to repress transcription in our system. This interaction produces complex patterns of expression in the L1-weak library.
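The sign test used for these paired comparisons asks only how many promoter pairs differ in each direction. A minimal exact two-sided version is sketched below. The function name is illustrative, and the counts in the usage line are hypothetical, since the text reports only the number of pairs and the resulting P value, not the split between directions.

```python
from math import comb


def sign_test_p(n_lower, n_higher):
    """Exact two-sided sign test P value.

    n_lower / n_higher count the pairs in which the first promoter is
    expressed lower / higher than the second; ties are dropped beforehand.
    """
    n = n_lower + n_higher
    k = min(n_lower, n_higher)
    # Binomial(n, 0.5) probability of a split at least as extreme as k.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical split of 211 pairs, for illustration only.
print(sign_test_p(160, 51))
```

Because the test ignores the magnitude of each difference, it makes no assumption about the distribution of expression values, which suits comparisons across promoters with very different absolute expression.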
The thermodynamic model of transcriptional regulation accurately captures many of the complexities of expression in the L1-weak library by adding only one adjustable parameter to the L1 library model parameters: namely, the relative affinity of Mig1 for the weak site. The optimal value of the new parameter corresponds to a 6.7-fold lower relative affinity for Mig1 than the stronger Mig1 site. This value is in good agreement (and within a 95% confidence interval) with an independent computational analysis of a position specific weight matrix for Mig1, which predicted a 9.0-fold lower affinity of the weak site for Mig1 (refs 16, 18). The similarity of the R² of this model with that of the L1 library model demonstrates that we are capturing the additional complexities caused by the effect of weak Mig1 sites on expression.
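To give a sense of scale, a fold difference in relative affinity can be converted to a difference in binding free energy with the standard relation ΔΔG = RT ln(fold). The sketch below applies it to the two fold values quoted above. The temperature of 30 °C is an assumption (a typical yeast growth temperature, not stated in the text), and the function name is illustrative.

```python
from math import log

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def ddg_from_fold(fold_weaker, temp_kelvin=303.15):
    """Free-energy penalty (kcal/mol) for a site bound `fold_weaker` times
    more weakly than a reference site. Temperature defaults to an assumed 30 C.
    """
    return R_KCAL * temp_kelvin * log(fold_weaker)


fitted = ddg_from_fold(6.7)     # fold value fitted from the L1-weak library
predicted = ddg_from_fold(9.0)  # fold value predicted from the weight matrix
print(round(fitted, 2), round(predicted, 2), round(predicted - fitted, 2))
```

On this scale the fitted and matrix-predicted values differ by a fraction of a kcal/mol, which is one way to see why the two estimates are described as being in good agreement.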
We next examined the possibility that weak sites contribute to Mig1 repression of genomic promoters. Weak Mig1-binding sites are over-represented in S. cerevisiae promoters compared to shuffled S. cerevisiae promoters (P < 10⁻³, simulation, n = 1,000; Supplementary Fig. 3). Weak sites are found in 24% of all promoters, while 39% of promoters containing a significant match to a Mig1 weight matrix also contain a weak site (P < 10⁻¹², hypergeometric test), indicating that strong and weak sites tend to co-occur. Of 33 genes that are known to be regulated by Mig1 (refs 19, 20), and whose promoters contain a significant match to a Mig1 weight matrix, 20 also contain a weaker Mig1 site in their promoters compared to 8 genes expected by chance. According to our model of gene regulation, promoters with one strong and one weak site are more sensitive to changes in Mig1 concentrations than are promoters with either two strong or two weak sites; promoters with one strong and one weak site therefore may be best suited to respond to changes in available carbon sources (Supplementary Fig. 4). These results suggest that combinations of strong and weak Mig1-binding sites are commonly found together in genomic promoters, and may provide a sensitive strategy for glucose repression.
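The chance expectation quoted above, and the form of the hypergeometric test, can be sketched as follows. The expected count uses only numbers from the text (33 genes, 24% background frequency of weak sites). The hypergeometric function is a generic upper-tail calculation with illustrative names; the total number of promoters is not given in this passage, so the usage example uses small made-up numbers rather than the genomic counts.

```python
from math import comb


def expected_by_chance(n_genes, background_fraction):
    """Expected number of genes carrying a feature at the background rate."""
    return n_genes * background_fraction


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are taken without replacement from a
    population that contains `successes` marked items."""
    upper = min(successes, draws)
    total = comb(population, draws)
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return hits / total


print(expected_by_chance(33, 0.24))       # close to the 8 genes quoted
print(hypergeom_upper_tail(3, 10, 3, 4))  # toy numbers, not genomic data
```

Multiplying 33 by 0.24 gives about 7.9, consistent with the 8 genes expected by chance against the 20 observed.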
We sought to determine if the properties of Mig1 repression found in the synthetic promoter libraries were informative when studying genomic promoters. In the S. cerevisiae genome, 359 promoters have a significant match to a Mig1 weight matrix and 33 of these promoters correspond to one of 136 documented Mig1-regulated genes. To compare these results directly to our model, we applied the thermodynamic model of Mig1 repression to genomic promoters (see Methods). Out of the top 359 promoters ranked by the thermodynamic model for the strength of Mig1 repression, 41 correspond to one of the 136 documented Mig1-regulated genes. Using the regulatory rules encoded in our thermodynamic model, we explain eight (24%) more known Mig1-regulated genes (HXT9, HXT12, HXT13, GSY1, SOR1, ICS2, YIL172C, YOL153C) than by simply looking for promoters with a significant match to a Mig1 weight matrix. For example, the SOR1 promoter does not harbour a significant match to a Mig1 site but contains a number of weak sites that cluster together (Fig. 3a). As cooperativity between Mig1 sites is an important part of our quantitative model, we correctly predicted that SOR1 is Mig1 regulated and also identified the likely binding sites of Mig1 in this promoter.
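Finding clustered weak sites, as in the SOR1 example, requires scoring every position in a promoter rather than applying a single significance cut-off. The sketch below shows one common way to do this: score each window with a log-odds weight matrix and convert the score difference from a reference site into a relative affinity. Everything here is an assumption made for illustration: the count matrix is a toy example and is not the Mig1 matrix, the exponential score-to-affinity conversion is a standard simplification that the text does not confirm as the authors' procedure, and only one strand is scanned.

```python
from math import exp, log

# Hypothetical 4-position count matrix (one dict per position).
# NOT the Mig1 weight matrix; it only illustrates the calculation.
COUNTS = [
    {"A": 1, "C": 1, "G": 17, "T": 1},
    {"A": 1, "C": 1, "G": 17, "T": 1},
    {"A": 1, "C": 1, "G": 17, "T": 1},
    {"A": 1, "C": 1, "G": 17, "T": 1},
]


def log_odds(counts, background=0.25):
    """Convert per-position base counts to log-odds scores."""
    pwm = []
    for column in counts:
        total = sum(column.values())
        pwm.append({b: log((c / total) / background) for b, c in column.items()})
    return pwm


def score(pwm, site):
    """Sum of log-odds scores for a site the same length as the matrix."""
    return sum(column[base] for column, base in zip(pwm, site))


def relative_affinities(pwm, promoter, reference_site):
    """(position, affinity relative to reference_site) for each window,
    assuming affinity scales as exp(score); forward strand only."""
    width = len(pwm)
    ref = score(pwm, reference_site)
    return [
        (i, exp(score(pwm, promoter[i:i + width]) - ref))
        for i in range(len(promoter) - width + 1)
    ]


pwm = log_odds(COUNTS)
rel = relative_affinities(pwm, "AAGGGGTT", "GGGG")
```

A list like `rel` keeps sub-threshold windows instead of discarding them, so a thermodynamic model can combine several weak sites that a simple cut-off would miss.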
Figure 3: Thermodynamic model explains Mig1 repression in the genome.

a, b, Mig1-binding sites in the promoters of SOR1 (a) and MIG2 (b). The affinity of Mig1 for the site based on a position weight matrix score relative to the strong site is plotted versus the location upstream of the translation start site (TSS). The horizontal line represents the significance threshold for the weight matrix and each square represents a Mig1 site. c, MIG2 promoter activity in a wild-type (WT) strain and a mig1Δ mig2Δ strain. Error bars, ±1 s.d.
Using the thermodynamic model, we also predicted a number of Mig1-regulated genes that were not previously known to be Mig1 targets (Supplementary Table 9). MIG2, a paralogue of MIG1 that represses and binds the same site as Mig1 (ref. 15), was predicted by the model to be auto-regulated on the basis of its promoter sequence (Fig. 3b). To validate this prediction, we measured MIG2 promoter activity (see Methods) in strains deleted for both MIG1 and MIG2. MIG2 promoter activity increased significantly in the mig1Δ mig2Δ strain as compared to wild type (P < 10⁻³, t-test, n = 24), showing that MIG2 is auto-regulated by Mig1/Mig2 (Fig. 3c). The prediction from the model was that MIG2 expression would increase 1.8-fold in a mig1Δ mig2Δ strain, and we observed a 1.5-fold change. The regulation of MIG2 by Mig1/Mig2 represents a previously unreported negative feedback loop in the glucose repression network that was identified on the basis of our analysis of synthetic promoters.
Using a simple system, we have succeeded in constructing an accurate model of the relationship between promoter sequence and gene expression. In part this was because we sampled a much larger fraction of promoter space using our library than we could by sampling genomic promoters. Thus, we were able to fit models containing a small number of parameters to data containing large numbers of observations. We found that a completely thermodynamic model based on the equilibrium binding of the transcription factors and RNAP to each other, and to their cis-regulatory sites, was a reasonable way to capture the relationship between promoter sequence and gene expression in our system for all of the libraries examined. This does not imply that kinetic processes, such as histone or RNAP modification, are unimportant in gene regulation; however, it does suggest that the information encoded in a promoter is decoded primarily by the sequence-specific binding of transcription factors. Our results support the idea that the complexity and variation in gene regulation could stem from very simple rules describing the binding of proteins to DNA and to each other (refs 12, 13, 21).
Methods Summary
To create the building blocks that make up the synthetic promoters, oligonucleotide pairs (each with a 5' phosphate) were annealed by being boiled and then slowly cooled to room temperature (see Supplementary Information for building block sequences). 15 μl of 50 μM double-stranded building blocks were then ligated with 200 U of T4 DNA ligase (New England Biolabs) for 2 hours at 16 °C. The ligation products were then purified using a Microcon YM-100 column (Millipore) to reduce the number of short promoters. 15 ng of purified ligation product were then ligated into the BamH1 site of the integrating reporter plasmid pJG102 (20 ng) and transformed into Escherichia coli. Transformants were scraped into Luria broth plus carbenicillin, grown overnight and then DNA was extracted using the GenElute HP Plasmid Maxiprep kit (Sigma). 130 μg of library DNA was digested with BglI, BamH1, Sal1 and EcoR1 (200 U each) and transformed into yeast as described (ref. 22). Colonies growing on medium lacking uracil were picked into 96-well plates and Trp⁻ colonies were then identified by replica plating onto medium lacking tryptophan. We observed that some building blocks were represented slightly more than others, even though they were added at equal molar concentrations. The relative abundance of each building block in each library scaled similarly to the melting temperature of the building block.


β-galactosidase was monitored using Novagen's
