Central dogma rates and the trade-off between precision and economy in gene expression

Steady-state protein abundance is set by four rates: transcription, translation, mRNA decay and protein decay. A given protein abundance can be obtained from infinitely many combinations of these rates. This raises the question of whether the natural rates for each gene result from historical accidents, or are there rules that give certain combinations a selective advantage? We address this question using high-throughput measurements in rapidly growing cells from diverse organisms to find that about half of the rate combinations do not exist: genes that combine high transcription with low translation are strongly depleted. This depletion is due to a trade-off between precision and economy: high transcription decreases stochastic fluctuations but increases transcription costs. Our theory quantitatively explains which rate combinations are missing, and predicts the curvature of the fitness function for each gene. It may guide the design of gene circuits with desired expression levels and noise.


Supplementary Figure 3.
A. In E. coli, coefficients of variation scale with transcription rates. Transcription and translation rates inferred from Li et al. 4 . CVs from Taniguchi et al. 19 . B. Clone abundance in growth competition experiments can be predicted from theoretical estimates of fitness cost of mRNA c m and fitness cost of protein c p . We assumed a growth time t = 24h in this figure. The correlation between predictions and measurements is independent of the growth time t (Methods). The dotted line represents a perfect fit between measurements and experiments (y = x). C. The 3-stage telegraph process can be used to models how intrinsic noise in gene expression is affected by different reaction rates. D. In the 3-stage model of gene expression of panel C, intrinsic protein noise is mainly due to small mRNA copy number noise (blue line) when β m is small. For large β m , intrinsic protein noise is dominated by gene activation noise (green line), which is almost constant for β m in physiological range (x-axis). Constants are from the experiments of So et al. 20 and Taniguchi et al. 19 : α p = 0.28h −1 , c 2 v0 = 0.07, δ = 800h −1 , k on = 10h −1 . With these constants, the noise floor observed experimentally (orange line) is larger than the gene activation noise. E. By neglecting gene activation dynamics, one can estimate the coefficient of variation from the central dogma rates with an error φ less than 1.5-fold. Note however that this result does not hold for highly expressed genes (see Methods). F. The error rate of transcription is approximately constant across Crick space. Error rates from Gout et al. 21 , transcription and translation rates from Weinberg et al. 1 .

Supplementary Figure 4.
A. Measured curvatures of fitness functions are within measurement error of curvatures predicted by theory from transcription and translation rates. Curvatures were estimated from the experiments of Keren et al. 22 who measured fitness as a function of log protein abundance. The quantity f (p * )p * 2 can be directly estimated from this data (Supplementary Methods). We used the precision -economy theory to predict f (p * )p * 2 from the central dogma rates (Methods). Error bars represent standard errors. B. A noise floor c v0 = 0.22 ± 0.01 is found in the measurements of the protein noise conferred by E. coli promoters of Silander et al. 23 . This is comparable to the noise floor c v0 = 0.27 ± 0.01 found in the measurements of Taniguchi et al. 19 (Fig. 4C). C. A noise floor c v0 is also found in S. cerevisiae. Coefficients of variation and protein abundance from Newman et al. 11 . D. Individual constants cannot predict the position k of the boundary of the depleted region. We correlated the position of the boundary (log k) of the depleted region with 1. the predictions from the theory (Eq. 5 of the main text) and 2. the 8 cell biology constants listed in Table 1 (number of mRNAs N m and protein N p per cell, cell volume V , growth rate µ, active protein decay rate α deg , effective protein decay α p , mRNA decay α m , noise floor c v0 ), plus the maximal translation rate β max p and total transcription output ∑ β m (Table 1). We tested for significant correlations between these constants (in log) and measured log k using Pearson's test. Measured log k correlates significantly with the predictions of the theory. No significant correlation is found between any of the 10 cell biology constants and measured log k. E. The correlation between measurements and theory is robust to varying the fraction of genes used in defining the depleted region from 0.2% to 2% (10-fold). P-values were computed by comparing log k from theory and log k from measurements using Pearson's test. Figure 5. A -B. Genes located close to the boundary of depleted region provide more fitness benefit than genes located far from it. Contours of fitness loss upon gene deletion in S. cerevisiae (panel A) and E. coli (panel B). We obtained fitness data from Steinmetz et al. 24 (S. cerevisiae) and Baba et al. 9 (E. coli). Fitness was Gaussian-smoothed using the same procedure as the CV data (Methods). C. To test for statistical associations between the function of genes and their position in the Crick space, we describe each gene by an angle θ . θ = 0 for genes with lowest possible translation / transcription, whereas genes that maximize the translation / transcription ratio have θ = π/4. θ can be computed from a gene's β m and β p , and from the maximal transcription and translation rates β max m and β max p . D -G. Genes with different functions (GO categories) are found close or far from the boundary of the depleted region. In each panel, black dots represent genes belonging to a specific group of genes. H. Principal component analysis of gene expression in 859 cancer cell lines (gray dots) shows that HeLa cells cluster with ovarian cell lines (red dots) whereas 3T3 cells cluster with skin cell lines (blue dots). Ellipses represent the covariance of skin and ovarian cell lines at the 75% confidence level. I. 3T3 cells over-express Ras signaling genes whereas HeLa cells over-express genes involved in nonsense mediated decay and protein targeting to membrane (Gene Set Enrichment Analysis, see Supplementary Methods). J. Upon a sudden depletion of amino-acids, abundance of cystein biosynthesis proteins increases by up-regulation of transcription, which re-positions genes in the region of Crick space associated to high precision. Ribosome profiling data from Ingolia et al. 18  Genes that are key to growth tend have low β p /β m whereas genes needed for stress response, survival and differentiation have high β p /β m . For each of the four organisms studied in the present study, the table lists GO categories that are significantly enriched (FDR < 0.01) at high and low β p /β m . Genes potentially involved in bet-hedging strategies have high β p /β m . For example, in E. coli, oxidative stress response induces tolerance to antibiotics and persistence 25,26 . Persistence is a bet-hedging strategy used by bacteria to cope with unpredictable environments 27 . Consistent with their role in regulating persistence, oxidative stress response genes have high β p /β m . In S. cerevisiae, a single-cell screen previously identified respiration genes as potential bet-hedging genes 28 . Respiration provides tolerance to glucose starvation at the expense of a slower growth rate [28][29][30][31] . Consistent with their possible role in hedging glucose starvation, respiration genes have high β p /β m ( Supplementary Fig. 5E). Differences in GO terms with high and low β p /β m in human and mouse can be explained by differences in the underlying biology of the cell lines. For example, genes involved in translation initiation, ATP binding, RNA binding as well as ubiquitin-dependent protein catabolism appear both in 3T3 and HeLa cells. Other groups of genes are only found in one of the two cell lines. Such difference in GO terms is expected for cancer cell lines of different tissue origin: 3T3 cells cluster with skin cell lines whereas HeLa cells cluster with ovary cell lines (

Supplementary Discussion
This study is based on an optimality approach to evolutionary biology 32 whose aim is to explain adaptations found in living organisms in terms of selective forces. This approach has been used to explain, for example, bacterial growth laws 33,34 or energy landscape in molecular recognition 35 . It is a fruitful approach in the 11/27 sense that it can suggest new experiments (see discussion).
The quantitative model rests on the assumption that transcription and translation rates can be tuned independently. Although coupling is seen in bacteria 36 and eukaryotes 37 , that coupling has itself evolved rather than being an absolute constraint. Studies on synthetic promoters and ribosomal binding sites in E. coli indicate that transcription and translation rates can be changed independently over a wide range 15 .
In S. cerevisiae too, flow-cytometry analysis of synthetic gene constructs suggest that the eukaryotic gene expression machinery can achieve transcription and translation rates that position genes in the depleted region of Crick space ( Supplementary Fig. 2J).
Another coupling relevant to the present study is co-translational stabilization of mRNA 38 . Since rapid translation stabilizes mRNA and mRNA abundance is used to estimate transcription, our estimates of transcription and translation are statistically coupled. However, variation in co-translation stability is smaller than gene-gene variation in transcription rates by about three orders of magnitude: co-translational stability varies by less than one order of magnitude 38 , compared to 3 orders of magnitude for transcription. Therefore, our estimates of transcription and translation are not formally independent, but the dependence is small.
Our theory assumes that cells are well-adapted to the conditions in which transcription and translation were measured, namely rapid growth in rich medium. Rapid growth is thought to be a key fitness component of microorganisms like E. coli and S. cerevisiae 39,40 . Rapid growth also occurs in several contexts in mammals, including immune expansion, cancer, development and stem cells in tissues with rapid turn-over. The mammalian HeLa and 3T3 cell lines studied here have likely been selected for fast growth. HeLa cells were collected from cervical cancer, a condition which selects for mutations that provide a growth advantage. Following collection, HeLa cells underwent serial dilutions for 4 months 41 , a procedure which selects for growth 40 . In the case of 3T3 cells, embryonic fibroblasts were collected and underwent serial dilutions for 3-4 weeks 42 . In the process, cell growth first collapses due to cell senescence, and then recovers to levels of freshly collected embryonic fibroblasts, presumably because immortalized mutants take over the population. Thus, both HeLa and 3T3 cells are likely well adapted to the conditions in which transcription and translation were measured, namely rapid growth in rich culture medium. Note that these growth conditions may not represent the healthy environment of mammalian 12/27 cells in vivo, but rather the growth conditions found in a tumor or in cell culture.
Here we focused on growing cells because there is ample data covering mRNA and protein abundance and decay as well as translation dynamics through ribosome profiling. In conditions of slower growth such as amino-acids starvation in S. cerevisiae 18 , we also find a depleted region in Crick space ( Supplementary   Fig. 2P). The boundary has slope 0.5 instead of 1, indicating perhaps a different tradeoff. Identifying this tradeoff could be an excellent follow-up study.
We proposed here that a tradeoff between economy and precision in protein abundance explains the depletion of genes with high transcription and low translation. However, the rate of transcription and translation could also impact protein sequence accuracy. For example, the risk that a transcriptional error propagates to many proteins could be reduced by increasing or decreasing transcription, leading to potential tradeoffs between sequence accuracy, precision and economy. At present, tradeoffs involving protein sequencing accuracy are not supported by the data. Measurements of error rates for thousands of genes in S. cerevisiae show no correlation between transcription error rates and transcription rates 21 .
Variations in error rates over Crick space are found in a narrow range between 4.0 × 10 −6 nucleotide −1 and 6.1 × 10 −6 nucleotide −1 , with no clear correlation with transcription, translation or the translation / transcription ratio (Supplementary Fig. 3F). Thus, considerations of protein sequence accuracy are unlikely to deplete genes with high transcription and low translation rate.
Another possible hypothesis to explain the depleted region of Crick space is that mutations that tune protein abundance could affect transcription and translation with different frequency. Difference in the frequency of mutations affecting transcription and translation could stem from a difference in mutational target size. To explore this hypothesis, we simulated evolution towards a given protein abundance (drawn form the natural protein abundance distribution) by mutations that affect transcription and translation rates with different mutational targets (see Supplementary Methods). We find that different mutation target size ratios do not create a depleted region (Supplementary Fig. 5K). The distribution in simulated Crick space is along a diagonal that is nearly orthogonal to the distribution of natural gene. We conclude that target size alone is not a viable explanation for the depleted region.
Our theory focuses how on considerations of precision and economy set transcription and translation rates. Yet, precision in gene expression depends not only on the transcription rate but also on the epigenetic 13/27 environment which affects the transcription mode -burst size, burst frequency 43,44 . As a result, two genes with the same transcription rate may exhibit different noise. Here we ask whether this scaling can predict the noise floor, which would remove the need for the phenomenological factor c 2 v0 in the expression for the protein noise c v (Eq. 3 of the main text).
In the case of the stochastic model of gene expression of the previous section ( Supplementary Fig. 3C), Paulsson 45 showed that the Fano factor for mRNA abundance m can we written as By equating the observed Fano factor power-law (Eq. 1) and the theoretical expression for the Fano factor (Eq. 2), we find how the gene activation noise term varies with the transcription rate β m : where we have used m = β m /α m . Plugging this expression into the expression for c 2 v derived by Paulsson 45 (Eq. 32 of the main text), we can find how protein noise c 2 v varies with transcription rate β m In conclusion, incorporating the Fano factor power law (Eq. 1) into the model, we find that it results in no noise floor for γ 0.5 or in a noise floor that is too low compared to experimental observations.
Since the noise floor is a well-founded experimental observation 11,19,23,46 , the power law by itself cannot explain the full noise behavior. Additional biology, such as extrinsic noise or increased transcriptional bursting at large transcription rates, is needed to understand the noise floor.

Estimating the curvature of fitness functions from the measurements of Keren et al.
Keren et al. 22 measured the fitness of S. cerevisiae cells as a function of log-expression for growth in glucose. Specifically, these measurements map log 10 gene expression x to fitness f (x), with x = log 10 p. These measurements do not allow to estimate the curvature of the fitness function f (p * ) directly. Nevertheless, they allow to estimate a closely related quantity, namely: where we neglected f (p) since we expand around the fitness optimum. In this section, we compare the measured p 2 f (p) to the predictions from the theory.
We focused on genes present in both the study of Keren et al. 22 and the ribosomal profiling data of Weinberg et al. 1 . To have enough context to estimate the local curvature, we considered only genes that could be both under-expressed and over-expressed compared to their wild-type expression by at least half an order of magnitude. Following Keren et al. 22 , we also discarded low-quality genes, such as genes whose fitness value at wild-type expression was significantly lower than the fitness of the wild-type.
This left 34 genes for analysis. We further excluded 9 genes with TATA promoters because these genes

16/27
tend to have large transcriptional burst size 47 : our model assumes a small transcriptional burst size and hence cannot accurately model the noise of genes with TATA promoters. Finally, we did not consider 4 genes whose curvature was too low to be estimated accurately given the accuracy of fitness measurements (| f (p * )p * 2 < 0.006). This leaves 21 genes for the analysis.
To estimate the local curvature at wild-type log 10 expression x wt , we focus on fitness measurements located within one order of magnitude of x wt : x wt − 1 < x < x wt + 1.
To these measurements, we then fit the parameters a, c, d of the polynomial: There is no first order term because fitness peaks at wild-type expression. The third order term allows for asymmetry in the fitness function. c is the curvature of f (x) at x wt . Using Eq. 5 for the curvature of fitness as a function of log 10 p, we find a relationship between the curvature and the c parameter of the polynomial: p * 2 f (p * ) = c log(10) 2 .
We estimate the standard error on c using Fisher's information matrix, assuming a 10% error on fitness measurements 22 .
By rewriting Eq. 55 of the main text for the optimal β p /β m in terms of f (p * )p * 2 , we can predict the local curvature of fitness functions from the precision -economy theory: We estimate the error on these predictions by considering that the sampling error on log 10 β m is about 0.1 (see section on data processing), and a 2-fold error on c m (mainly due to uncertainty in the number of mRNAs per cell). This leads to a standard error of 0.36 on the log 10 predictions.

17/27
To determine the significance of the agreement between predictions and measurements, we compute the root-mean-square deviation (RMSD) between predictions and measurements. Upon shuffling the measurements 10 6 times, only in p = 4.1% of shuffles is the RMSD smaller than the RMSD computed on the non-shuffled measurements. Hence, the agreement between predictions and measurements is unlikely explained by chance. Consistent with this result, a χ 2 test concludes that predictions and measurements do not differ significantly given the measurement error (p = 0.51).

The region in which transcription noise is comparable to the noise floor does not correspond to the depleted region
The depleted region cannot be explained by determining the β m and β p for which increasing transcription provides little extra precision relative to the the noise floor.
To see why, consider Eq. 3 of the main text, which relates transcription β m to the noise σ /p: Transcription noise becomes comparable to the noise floor when Therefore, the predicted boundary is a vertical line in the Crick space. This is a bad model for the data.
For example, in S. cerevisiae, the predicted boundary would be log 10 β m = 1.6 which excludes about half of the genome ( Fig. 2A). have high translation / transcription ratios due to the maximal translation rate β max p . Since translation rates β p cannot exceed β max p , achieving high protein abundance requires recruiting transcription, which decreases β p /β m .

Identifying groups of genes with significantly high or low translation / transcription ratio
As a result, stratifying genes by equal β p /β m arbitrarily groups low abundance proteins with low β p /β m together with high abundance proteins that would have high β p /β m but cannot because translation rates cannot exceed β p .
To address this issue, we stratify genes by their position relative to two boundaries: the line of lowest possible translation / transcription ratios (β p = kβ m ), and the line of highest possible translation rates (β p = β max p ). We summarize the position of genes in between these two boundaries by an angle θ ( Supplementary Fig. 5C). Genes that sit on the line of lowest possible translation / translation ratios have θ = 0, whereas θ = π/4 corresponds to genes with maximal translation / transcription ratios.
θ can be computed from a gene's β m and β p , and from the maximal transcription and translation rates β max m and β max p by trigonometry ( Supplementary Fig. 5C): The line of lowest possible translation / transcription ratio has slope 1 (β p = kβ m ), η + θ = π/4. Therefore, For each group of genes from the Gene Ontology, we use the Mann-Whitney test to compute the p-value that the θ s in that group are significantly larger or smaller compared to the distribution of θ s in genes overall. We estimate the false discovery rates (q-values) using the fdrtool R package 49 .
For all four model organisms, Supplementary Data 1 lists all GO categories with significantly high or low θ with false discovery rate q < 0.01, together with (uncorrected) p-values and θ normalized to π/4 so that θ ranges between 0 (for GO categories with lowest β p /β m ) and 1 (for GO categories with highest β p /β m ). Supplementary Table 2 shows a summary of these results. Supplementary Fig. 5D-G shows the

19/27
position of genes belonging to selected significant GO categories in the Crick space.
Differences in the genes groups with high and low β p /β m in HeLa compared to 3T3 cells can be explained by global differences in gene expression profiles.
Different groups of genes take unusually high or low β p /β m in HeLa cells compared to 3T3 cells (Supplementary Table 2). Here we examine whether these differences can be explained by global differences in the gene expression.
We obtained log 2 RPKMs for HeLa and 3T3 cells the RNAseq experiments of Eichhorn et al. 7 .
Because 3T3 cells are of murine origin, comparing the two cell lines requires mapping mouse gene names to human names. To so so, we used ENSEMBL's orthology database, using only one-to-one orthology relations of highest confidence (confidence = 1). After mapping genes names from mouse to human, we discarded 31 non-unique genes.
To provide more context for the gene expression analysis, we merged the HeLa and 3T3 gene expression Finally, we determined gene sets that were differently regulated in HeLa compared to 3T3 by gene set enrichment analysis 51 (Supplementary Fig. 5I).

Simulating different target size of mutations affecting transcription and translation cannot explain the depleted region
An alternative hypothesis to explain the depleted region of Crick could be that mutations affecting transcription and translation to tune protein protein abundance occur with different frequency. For example, mutational target sizes could be different for transcription and translation.

20/27
To test whether a difference in target size between mutations that affect transcription or translation could explain the depleted region, we perform an evolutionary simulation.
For each gene whose evolution we simulate, we initialize the transcription rate β m and translation rate β p by sampling from a uniform distribution bounded by the range of transcription and translation rates found in E. coli (see inset titled 'Gen. 1' in Supplementary Fig. 5K).
We first pick an optimal protein abundance p * by sampling the protein abundance of E. coli genes found by Li et al. 4 . The fitness of each gene is defined as e − 1 2 log 10 (β m β p /α m α p )−log 10 p * σ 2 with σ = 0.25. Thus, fitness is maximal when β m and β p are such that protein abundance is p * , and decreases with a typical abundance scale of a quarter order of magnitude.
The evolutionary simulation alternates rounds of mutations and of selection. In each mutation round, we first determine whether the mutation would affect transcription or translation by random sampling. The odds r of the mutation affecting translation vs transcription is a control parameter. r represents the ratio between the target size of mutations that affect translation over the target size of mutations that affect transcription. We set r to 0.001, 0.01, 0.1, 1, 10, 100, 1000. Supplementary Fig. 5K shows simulations results for r = 10 and r = 100 in favor of mutations that affect translation.
Having determined whether the mutation would affect transcription or translation, the corresponding rate log 10 β m or log 10 β p is then perturbed by adding noise drawn from a Gaussian distribution of mean 0 and variance 1. Thus, the typical mutation perturbs transcription or translation by an order of magnitude.
Mutations that push transcription or translation outside the range of values found in E. coli are rejected.
Mutations that do not improve fitness are also rejected.
The mutation and selection step define a generation in the simulation. Simulations are run for 10 4 generations. At each generation, we simulate the parallel evolution of 3744 genes, the number of genes for which transcription and translation was measured by Li et al. 4 . We monitor the average fitness at each generation to verify that evolution converged after 10 4 generations (see insets in Supplementary Fig. 5K).
Simulations show that different mutation target size ratios do not create a depleted region (Fig. 5K).

21/27
The distribution in simulated Crick space is along a diagonal that is nearly orthogonal to the distribution of natural gene (Fig. 2D). We conclude that target size alone is not a viable explanation for the depleted region.