Highly expressed genes evolve under strong epistasis from a proteome-wide scan in E. coli

Epistasis or the non-additivity of mutational effects is a major force in protein evolution, but it has not been systematically quantified at the level of a proteome. Here, we estimated the extent of epistasis for 2,382 genes in E. coli using several hundreds of orthologs for each gene within the class Gammaproteobacteria. We found that the average epistasis is ~41% across genes in the proteome and that epistasis is stronger among highly expressed genes. This trend is quantitatively explained by the prevailing model of sequence evolution based on minimizing the fitness cost of protein unfolding and aggregation. The genes with the highest epistasis are also functionally involved in the maintenance of proteostasis, translation and central metabolism. In contrast, genes evolving with low epistasis mainly encode for membrane proteins and are involved in transport activity. Our results highlight the coupling between selection and epistasis in the long-term evolution of a proteome.

non-epistatic rate of substitution as using (u-1)/19 where (u-1) is the number of amino-acid states 37 to which the current wildtype can transition, normalized by all possible substitutions (20-1=19). 38 39

Calculation of dN/dS: 40
We obtained codon alignments with PAL2NAL 4 ; it uses a multiple protein sequence alignment 41 and their corresponding DNA sequences as input. We then used Gblocks as described above but 42 we remove all positions with gaps. Another factor to be considered is the unequal codon frequencies in the MYN model compared 80 to other methods. As thoroughly discussed by Yang and Nielsen 10 , unequal codon frequencies 81 would have the opposite effect to the transition/transversion bias. Therefore, all Li-based models 82 that lack such effect would underestimate dS. Altogether, this shows that the NG method which 83 lacks both effects (transition/transversion and unequal codon biases) would produce the most 84 unbiased estimate of epistasis compared to Li-based methods. All estimated rates are compiled in 85 Table S3. 86 87 88 Amino-acid usage correction for non-fixed states: 89 In principle, polymorphisms, or non-fixed states, in a species can potentially inflate the counting 90 of amino acid usage. To estimate the impact of non-fixed states in our amino-acid usage 91 calculation, we used a correction based on the probability of occurrence of non-fixed amino 92 acids p 3 . To calculate p, we first measured the average amino acid diversity ∏ a for each gene. 93 The amino acid diversity is a measure of the fraction of amino acid mismatches between two 94 sequences of the same species in a pairwise alignment. We calculated the average ∏ a values for 95 each species with at least two sequences, and then averaged ∏ a values across all species. Since 96 amino-acid diversity is roughly the expected density of non-fixed states, the probability of 97 observing a non-fixed state p is ∏ a /6. The factor of 6 is the number of possible non-synonymous mutations away from an average codon. 99 100 If non-fixed states have a probability of p, then fixed states have a probability of q = 1 -p and the 101 distribution of those states in a multiple alignment follows a binomial distribution. To 102 approximate the probability that a non-fixed state is observed k times in an alignment of N 103 sequences, we use the Poisson formula m=(pN) k e -pN /k!. The probability to have a fixed, rather 104 than non-fixed, is then r = 1 -m. Finally, we take the sum of the probabilities of fixed-states for 105 each amino-acid state observed at one site or , where i is the amino-acid state and u is the 106 amino-acid usage. We further average this sum across all sites of the gene to get the corrected u. 107 The average correction to u is Δu ~ 4% (values for each gene are listed in Table S2). 108 Consequently, the correction to R u due to polymorphism is ΔR u = 1 L ( ) Δu where c is the fitness cost imposed by each misfolded protein and is shown to be ~ 10 -7 in yeast. 118 The number of misfolded proteins is the protein abundance A multiplied by the probability of 119 each single protein to be unfolded P unfolded . Based on protein folding thermodynamics and 120 assuming equilibrium between folded and unfolded state, the probability of being unfolded is 121 The distribution P(ΔΔG) is the probability distribution of mutational effects on folding stability 144 known from large-scale mutational studies [16][17][18][19] , and has been parameterized for proteins of 145 different folds. The distribution of background folding stability P(ΔG) is a consequence of 146 mutation-selection balance on the protein folding fitness landscape [20][21][22][23][24] . This distribution has also 147 been documented experimentally from >4000 proteins 25 . 148

149
To calculate the rate obtained from mutational usage R u , we assume that each protein sequence in 150 an MSA corresponds to a ∆G value in P(∆G) 26 . Moreover, in the regime where proteins are very 151 stable (Fig. S10), the fitness landscape is flat with respect to folding stability, which implies that 152

P(ΔG)P(ΔΔG)d(ΔG)d(ΔΔG)
the current amino acid can be substituted by all other amino acids, that is, u=19. In the regime of 153 marginal stability, the presence of selection implies that amino acids are not allowed, that is, 154 u<19. Thus, R u is the rate of evolution of the most stable sequence in an MSA 26 . In the case of 155 continuous distribution of P(∆G), R u is the rate at the probability cutoff, e.g., P(∆G) = 0.01 or 156 The meaning of 0.01 is that the probability of sampling this specific ∆G is one in 100 trials. We 160 have shown that R u calculated using this approach shows an excellent correlation with R u 161 calculated directly from multiple sequence alignment 26 . 162 163

Theoretical prediction of relation between epistasis and expression level 164 165
We calculate epistasis at different mRNA stabilities using the following steps: 166 1. An mRNA expression level is chosen from the set of measured expression levels for 167 different E. coli genes (Table S1). 168 2. The corresponding protein expression level (Table S1) is used in Eqns. S1-S7 169 3. We assumed c=10 -7 (ref. 27 ). 170 4. Distribution of mutational effects on protein folding stability, P(∆∆G), was assumed to be 171 where ! is the weights of first distribution, ! , ! , ! and ! are the average values and 173 standard deviations of each Gaussian distribution found to be 0.53 ± 0.12, 0.56 ± 0.12, 174 1.96 ± 0.53, 0.90 ± 0.16 and, 1.93 ± 0.29 respectively 16 . 175 5. The distribution of mutational effects on protein folding stability, Eq. S7, is not assumed 176 constant at all stabilities. We used the following empirical formula 23,25,28 : 177 Eq. S7 implies that at higher stabilities fewer stabilizing mutants would be available to a 179 protein, as confirmed from large-scale experimental observations 25 . 180 6. Distribution of protein stabilities, P(∆G), was taken from that of experimentally-181 calibrated log-normal distribution for E. coli 29 with following parameters: 182 ΔΔG mean = −0.13(ΔG) + 0.23 (kcal / mol) 183 7. R dN/dS is then calculated from Eq. S6 taking the knowledge of fitness function and 184 probability distribution, P(∆G) and P(∆∆G). 185 8. R u is calculated from Eq. S7. 186 9. Epistasis is calculated from Eq. 1 as ε = 1-(R dN/dS )/(R u ). The steps are repeated for a 187 different mRNA-expression level.  Figure S2. A correction to mutational usage, u, is accounted for by estimating for each gene the 204 probability of counting a non-fixed polymorphism as fixed state ( Table S2). The average 205 correction to u is Δu ~ 4 %. Consequently, the correction to R u due to polymorphism is 206 calculations is 10 6 . The theoretically estimated curve for epistasis is also shown in Fig. 1d. 221 222 Figure S5.         stronger agreement with respect to dN. 245  Folding stability (kcal/mol) Fitness mRNA= 1 copy/cell mRNA= 2 copies/cell mRNA= 3 copies/cell mRNA= 4 copies/cell mRNA= 5 copies/cell mRNA= 6 copies/cell mRNA= 7 copies/cell mRNA= 8 copies/cell mRNA= 9 copies/cell mRNA= 10 copies/cell