Mutational bias and the protein code shape the evolution of splicing enhancers

Exonic splicing enhancers (ESEs) are enriched in exons relative to introns and bind splicing activators. This study considers a fundamental question of co-evolution: How did ESE motifs become enriched in exons prior to the evolution of ESE recognition? We hypothesize that the high exon to intron motif ratios necessary for ESE function were created by mutational bias coupled with purifying selection on the protein code. These two forces retain certain coding motifs in exons while passively depleting them from introns. Through the use of simulations, genomic analyses, and high throughput splicing assays, we confirm the key predictions of this hypothesis, including an overlap between protein and splicing information in ESEs. We discuss the implications of mutational bias as an evolutionary driver in other cis-regulatory systems.


Supplementary Figure 2. Alternative measures of ESE activity do not affect results.
Motifs enriched in exons versus introns (y-axis) exhibit higher ESE activity when using (x-axis), and exhibit higher mutability (ERM rate quintiles). Compared to Fig. 2d, two alternative measures of hexamer ESE activity (A3SS and A5SS) were used instead of EI scores: Rosenberg (a) exonic A3SS and (b) exonic A5SS score of ESE activity is based on the effect (log2 odds ratio) of exonic hexamers on alternative 3'/5' splice site usage as measured by high throughput minigene experiments25. (a,b) Squared Pearson's correlations (n=4,096 hexamers). Source data are provided as a Source Data file. and wild type pairs were synthesized and incorporated into a three-exon minigene library comprised of 1,414 minigenes. The pooled library was sequenced (input), transfected into HEK 293T tissue culture cells, and cDNA was generated and sequenced (output). The sequencing read counts in the input (wti, mti in the contingency table) and output (wto, mto) were used to calculate the mutant to wild type splice ratios (M/W splice ratio). (b) The in vivo splicing assay minigene reporter construct: CMV enhancer and promoter sequence (blue), exon 7 of ACTN4 (uppercase, green) and intron 7 of ACTN4 (lowercase, green), 230-mer library (red), intron 9 of ACTN4 (lowercase, cyan) and exon 10 of ACTN4 (uppercase, cyan), bGH poly(a) (maroon). (c) Comparison between the severity of 54 de novo mutations in the assay in either 3' (blue) and 5' splice site regions (black) as measured by their change in MaxEntScan score (ΔSS score (wt-mt)) or by their in vivo MaPSy splicing disruption results (M/W splice ratio). Quadratic polynomial regression (95% confidence band). (d) Comparison between the severity of splicing disruption (M/W splice ratio, log2) versus severity of protein disruption (synonymous < missense < stop gain) for previously published in vitro MaPSy results29. In vitro results are robust to nonsense-mediated decay (n=7 synonymous, 2,745 missense, 782 stop gain variants). The boxplots indicate the median (middle line), first and third quartiles (box), and 1.5x IQR (whiskers); pairwise p-values from Mann-Whitney two-tailed test. Source data are provided as a Source Data file. Figure 5. Replication of input and output read counts in the splicing assay of de novo variants. Uniquely mapped read counts for both mutant and wildtype species were compared for (a) two technical sequencing replicates of the unspliced minigene reporter (input replicates), and (b-g) four replicate transfections followed by sequencing of the spliced cDNA (output replicates). . Missense mutations were predicted to be more splicing disruptive on average than synonymous mutations, except for the case of WM (no synonymous mutations possible). Stop gain mutations were generally more splicing disruptive than missense mutations, except for the cases of DA, DD, and MD (no stop gains mutations possible), and RR (slightly less disruptive). Boxplots indicate the median (middle line), first and third quartiles (box), and 1.5x IQR (whiskers). Source data are provided as a Source Data file.

Supplementary Figure 8. Mutational bias and protein selection are both necessary to create pre-ESEs and pre-ISEs.
A mathematical model of motif evolution under different levels of mutational bias (mb parameter) and different levels of purifying selection in exons (sp parameter) was developed, assuming non-overlapping loci (independently evolving), infinite population size and genome size (no genetic drift), and mutation-selection equilibrium (Supplementary Note 1). The mutation-selection equilibrium for (a) intron motif frequencies, (b) exon motif frequencies, and (c) exon vs intron motif ratios was solved using Supplementary Equations 1, 2, and 3, respectively, for randomly sampled 20x20 mutation rate matrices and different values of mb and sp (Supplementary Note 1). (a,b) Increasing mb results in more variable motif frequencies in introns and exons, (b) but increasing sp results in more constrained motif frequencies in exons. (c) Increasing both mb and sp results in motifs with more extreme enrichments in exons (pre-ESEs) and introns (pre-ISEs), depending on their relative mutability (Mutability quintile).

Supplementary Figure 9. Simulations with varying levels of mutational bias and an estimate of intrinsic ESE activity in the protein code.
Simulations initialized on a genome of random sequence were run to model evolution with or without protein selection and with varying levels of mutational bias (Methods). (a) Mutability (mean ERM rate) and (b) ESE activity (mean EI score) were recorded over time (x-axis). For uniform mutation rates, mutability and ESE activity remain near initial levels. Increasing mutational bias resulted in greater decreases in mutability and ESE activity. Simulations with protein selection (black, analogous to exons) retained mutability and ESE activity at higher levels compared to simulations without selection (blue, analogous to introns). Simulations were also initialized on a genome of human exonic sequence, again recording (c) mutability and (d) ESE activity. Human values of exonic and intronic mutability and ESE activity (dotted lines) were compared against simulations (solid lines). Simulations without protein selection equilibrated at ESE levels similar to human intronic values (blue), whereas simulations with selection equilibrated at ESE levels about half the distance between human intronic and exonic values (black), suggesting about half the ESE activity in the human genome resides in the protein code.
protein-code shape the evolution of splicing enhancers' Rong et al.

A model with mutational bias and purifying selection in exons
We present a model of sequence evolution with mutational bias (motifs with variable mutation rates) and purifying selection in exons (to preserve protein-coding function) but not in introns. We show that the joint interaction of these two forces results in particular motifs becoming enriched in exons relative to introns at equilibrium, and vice versa.
Let G intron and G exon be the sets of intronic and exonic loci in a genome, respectively, where we assume G intron and G exon are disjoint and both consist of infinitely-many loci.
Define a locus as a random variable whose state is one of n possible motifs. For example, in DNA, the set of possible motifs at a base position may be defined by the set of n = 4 k possible k-mers, when considering a sequence context of length k bps centered at the base position (as in the Main Text, where k = 7). Let us further partition G exon into the subset of loci evolving neutrally, G exon,ne , and the subset of loci evolving under purifying selection, G exon,ps , and assume G intron consists of only neutrally evolving loci. For simplicity, assume loci are non-overlapping and independently evolving, such that their position relative to each other does not matter (loci are exchangeable within G intron , G exon,ne , and G exon,ps ). This assumption is relaxed in the simulations described in the Main Text, where we explicitly simulate mutations based on overlapping heptamer contexts along a linked sequence.
The assumptions of exchangeable and infinitely-many loci imply we only need to keep track of the proportion of loci in G intron , G exon,ne , and G exon,ps in each of the n possible motif states over time. We want to model the evolution of motif proportions as a system of ordinary differential equations. Define the n length vector representation p intron (t) = {p intron,i (t)} as the proportions of loci in G intron found in each of the motif states indexed from 1, · · · , n at time t ≥ 0, where n i=1 p i (t) = 1. Similarly, define p exon,ne (t) with respect to G exon,ne , and p exon,ps (t) with respect to G exon,ps . Finally, define p exon (t) with respect to G exon , and let s p be the proportion of exonic loci evolving under purifying selection, such that p exon (t) = (1 − s p )p exon,ne (t) + s p p exon,ps (t).
Define a mutation rate matrix for n motifs as where the off-diagonals m i,j ≥ 0 represent the rates at which motif i mutates to motif j, the diagonals − k =i m i.k represent the rates at which motif i mutates, and assume M is invertible. For a diploid population with N individuals, define the substitution rate matrix for n motifs as where 2N M is the population-level mutation rate matrix, and U is the probability of fixation Thus, the evolution of exonic and intronic motif proportions can be modeled by the following system of ordinary differential equations: Define the mutation-selection-drift equilibrium of p intron (t) as π intron , which satisfies π intron M * intron = π intron M = 0 n,n , and n i=1 π intron,i = 1.
This gives us the equivalent system which has the solution Define the mutation-selection-drift equilibrium of p exon (t) as π exon , of p exon,ne (t) as π exon,ne , and of p exon,ps (t) as π exon,ps . Since M * exon,ne = M, we have π exon,ne = e α M −1 α .
We can solve this system by similarly defining the augmented matrix m n−1,1 m n−1,2 · · · − k =n−1 m n−1,k 1 m n,1 m n,2 · · · m n,n−1 1 This gives us the equivalent system π exon,ps M * α = e α , which has the solution π exon,ps = Thus, Finally, we want the ratios of motif proportions in exons relative to introns at mutationselection-drift equilibrium (division is element-wise): M has no mutational bias, that is, all off-diagonals are equal, then π intron = 1 n /n. If we make the additional assumption that p exon (0) is approximately uniform to begin with (p exon (0) ≈ 1 n /n), then r equil ≈ 1 n . Under these assumptions, both purifying selection (s p > 0 or s < 0) and mutational bias (M with non-equal off-diagonal entries) are necessary to enrich for motifs in exons relative to introns.
In Supplementary Fig. 8 is sampled as follows: randomly sample m ij = e x i e y ij , where x i ∼ Normal(0, m b ) for each 1 ≤ i ≤ n, and y ij ∼ Normal(0, m b ) for each 1 ≤ j ≤ n and j = i. Here, e x i represents a base-line mutation rate for motif i, and e y ij represents a modifier of the base-line mutation rate for the mutation rate of each motif i to each motif j. If m b = 0, then M(m b ) has no mutational bias.
As we increase m b , M gains larger and smaller off-diagonal entries, consistent with increasing mutational bias. In Supplementary Fig. 8, we plotted quantiles of (a) π intron , (b) π exon , and (c) r equil using Supplementary Equations 1, 2, 3, respectively. We assumed no genetic drift and uniform initial exonic proportions for n = 20 motifs, let s p ∈ {0, 0.05, 0.1, 0.2, 0.4, 0.8}, and randomly generated M(m b ) for each m b ∈ {0, 0.01, 0.02, 0.03, · · · , 1}. We show that as we increase (or decrease) s p and m b , the spread of exon versus intron motif ratios also increases (or decreases). In the main text, we argue this drives ESE and ISE evolution.