Dissecting splicing decisions and cell-to-cell variability with designed sequence libraries

Most human genes are alternatively spliced, allowing for a large expansion of the proteome. The multitude of regulatory inputs to splicing limits the potential to infer general principles from investigating native sequences. Here, we create a rationally designed library of >32,000 splicing events to dissect the complexity of splicing regulation through systematic sequence alterations. Measuring RNA and protein splice isoforms allows us to investigate both cause and effect of splicing decisions, quantify diverse regulatory inputs and accurately predict (R2 = 0.73–0.85) isoform ratios from sequence and secondary structure. By profiling individual cells, we measure the cell-to-cell variability of splicing decisions and show that it can be encoded in the DNA and influenced by regulatory inputs, opening the door for a novel, single-cell perspective on splicing regulation.

. Effects of splice site mutations and splicing factor binding sites on splicing.
A-C. Normalized ratios for variants with endogenous intronic sequences vs. variants in which a canonical branchpoint sequence (CTCAC) has been introduced at position -25 relative to the acceptor splice site (n=14-45 for individual groups); p-values were determined using Wilcoxon signed-rank test; boxes show the quartiles of the dataset, whiskers show the range of the distribution not including outliers (displayed as points). D. Splicing ratio for all tandem 5' or 3' splice site variants, plotted against the difference in splice site strength (second -first) as calculated using the MaxEntScan package. The dark blue line denotes the running average (window size 300) with confidence intervals (as determined by bootstrapping). E. Heatmap of mean effects on usage of the adjacent splice site of introducing sequence motifs in the indicated libraries and locations. 5' and 3' insertion positions in retained introns, cassette exons and the alternative exon between 5' tandem splice sites were treated independently and assigned either as influencing the adjacent acceptor or donor splice site. In the case of tandem 3' splice site insertion locations were not split according to their proximity to either the first or second splice sites due to the low number of possible insertion locations (1-2), but both were assumed to potentially influence the upstream and downstream acceptor. p-values were determined using Wilcoxon signed-rank test and corrected for multiple testing (Benjamini-Hochberg) for each splicing type separately. F. Effect of introducing a motif either towards the 5' or 3' end of retained introns, cassette exons and the alternative exon between tandem 5' splice sites. G. Heatmap of mean effects of introducing a combination of two splicing factor binding sites at the indicated locations. Asterisks denote cases where the trend reached statistical significance (p<0.05, Wilcoxon signed-rank test, n=3-22 for each combination).

Supplementary Figure 3. Coordination and antagonism between genetic building blocks.
AB. Distribution of splicing ratios (2nd/1st) in tandem 3' splice sites in variants with the upstream intron (A, green) or the downstream exon (B, blue) replaced with the corresponding exonic or intronic part from a set of eight constitutive 3' splice sites; gray box plots and data points denote wild type splicing ratios; boxes show the quartiles of the dataset, whiskers show the range of the distribution not including outliers (displayed as points). C. Each data point represents the log-ratio (2nd/1st splice site) for a pair of exonic (x-axis) and intronic (y-axes) from the same native constitutive donor site placed in a library context (based on the same data as the box plots in Fig S3AB). D-E. Each datapoint represents the splicing ratio for a context (containing a retained intron (D) or a cassette exon (E)), in which one of the indicated components has been replaced by the corresponding element from the same (unrelated) native splice site. Full construct indicates that all three components have been replaced with sequences from the same native splice site. Pearson correlation coefficient with the associated p-values are stated for each comparison. A. Outlines for building a regressor model based on MaxEntScan scores for both splice sites, on minimum free energy (delta G) of 24 nt regions across the variable region, on hexamer counts in different parts of the variable region and on 1174 PWM scores for 160 RBPs in different parts of the variable region (shown for retained introns). B. Schematic depicting secondary structure manipulations in the retained intron (left) and tandem 5' splice sites (right) library. Blue, green and red lines indicate sequences whose complement or reverse complement is introduced in the indicated positions, corresponding to the color scheme in the plot below, where mean and 95% CI for the effect on the wild-type splicing ratio of introducing different sequence elements are shown for the indicated locations (n=19-28 for each group). C. Mean difference in splicing ratio between variants in which the indicated position along the x-axis is predicted to be paired and variants in which it is predicted to be unpaired (top) and the associated p-values (Mann-Whitney U) for the comparison between the two groups (position paired vs. not paired) for each indicated position relative to the splice sites (bottom). The darker blue line represents the rolling average of the surrounding five positions. D. The impact of splice site strength (left) and secondary structure around the donor sites (right) on the model prediction (as determined using SHAP) is plotted against the feature values. E. Hexamer feature effects (as determined using SHAP) on the model prediction (ranked by their importance for the prediction) for the different splicing type libraries. The color denotes the feature value (hexamer counts) and the position along the x-axis denotes the impact on model output, for each item in the training set.
Supplementary Figure 5. Feature effects on model output.
A. Feature effects (as determined using SHAP) on the model prediction (ranked by their importance for the prediction) for the different splicing type libraries. The color denotes the feature value (MaxEntScan scores, deltaG, hexamer counts and cumulative binding scores) and the position along the x-axis denotes the impact on model output, for each item in the training set.
Supplementary Figure 6. Testing the model on data from other MPRAs.
A. Prediction of splicing ratios: Pearson correlation coefficients between the splicing ratio prediction of our model (trained on the data indicated above; "training set") for sequences tested in the indicated studies (see schematic above, "test set") and the splicing ratios measured in these studies with the indicated sets of features (top) and scatter plots of measured splicing ratios vs. predictions based on the highest scoring combination of features (bottom). BC. Prediction of variant effects based on prediction of splicing ratios for both the wild-type and the mutant sequence independently and reporting the paired difference: Scatter plots of measured variant effects vs. predictions based on the full set of features and the indicated training and test sets for data from MaPSy (B) and Vexseq (C). Figure 7. A high-throughput assay for determining protein-based splicing values.

Supplementary
A. Boxplots showing splicing values for groups of multiple barcodes for the same variants (n>7 in each group); boxes show the quartiles of the dataset, whiskers show the range of the distribution not including outliers (displayed as points). B. Example for four barcode control groups; the distribution across bins for 8 variants with identical variable region, but different barcodes is shown. C. RNA-based splicing ratios plotted against protein-based splicing values for the tandem 5' splice sites library; the color intensity denotes the RNA expression levels (dark blue corresponds to high and light blue to low RNA expression levels (log2(RNA/DNA reads)). D. Distribution of the effect of recoding exonic and/or intronic components on RNAbased splicing ratios (top), protein-based splicing values (middle) and RNA expression levels (RNA/DNA reads; bottom) for retained introns (values normalized to the respective wild-type sequence; n=20-92 variants in each group); boxes show the quartiles of the dataset, whiskers show the range of the distribution not including outliers (displayed as points). E. Difference in Pearson correlation coefficients between all intronic hexamer counts and either RNA-or protein-based splicing readouts. F. Distribution of differences in Pearson correlation coefficients between hexamer counts in exonic and intronic regions based on the GC content of the hexamer; boxes show the quartiles of the dataset, whiskers show the range of the distribution not including outliers (displayed as points). G. Pearson correlation coefficients between intronic hexamer counts and RNA-based splicing ratios (x-axis) are plotted against Pearson correlation coefficients between the same intronic hexamer counts and protein-based splicing values. Only statistically significant correlations (after multiple testing correction, Bonferroni) are shown and color-coded for their GC content (dark blue -higher hexamer GC content; light blue -lower hexamer GC content).
Supplementary Figure 8. Assessing the regulation of splicing cell-to-cell variability.
A. Distribution across bins for three unrelated variants with similar mean splicing value, but different variance (noise). B-C. Splicing noise strength (variance/mean) for the retained introns (B) and tandem 5' splice sites (C) libraries is plotted against RNA expression (RNA/DNA reads, left). Noise residuals (after correcting for the influence of mean splicing value and expression level) are plotted against RNA expression (middle) and mean splicing value (right), for the retained introns (B) and tandem 5' splice sites (C) libraries, respectively. D. Distribution of noise residuals in variants from the tandem 5' splice sites library with the indicated sequence (endogenous, splicing consensus or nonspliceable) at the first and/or second donor splice sites; boxes show the quartiles of the dataset, whiskers show the range of the distribution not including outliers (displayed as points). E. Mean and 95% CI of the effect on noise residual (top), mean splicing value (middle) or expression level (RNA/DNA reads, bottom) of introducing a motif for the indicated splicing factor within a given region in or around a retained intron. Several points of insertion within this region are treated as one set. n=4-18 and 4-25 variants in each group).