Identification and analysis of splicing quantitative trait loci across multiple tissues in the human genome

Alternative splicing (AS) is a fundamental step in eukaryotic mRNA biogenesis. Here, we develop an efficient and reproducible pipeline for the discovery of genetic variants that affect AS (splicing QTLs, sQTLs). We use it to analyze the GTEx dataset, generating a comprehensive catalog of sQTLs in the human genome. Downstream analysis of this catalog provides insight into the mechanisms underlying splicing regulation. We report that a core set of sQTLs is shared across multiple tissues. sQTLs often target the global splicing pattern of genes, rather than individual splicing events. Many also affect the expression of the same or other genes, uncovering regulatory loci that act through different mechanisms. sQTLs tend to be located in post-transcriptionally spliced introns, which would function as hotspots for splicing regulation. While many variants affect splicing patterns by altering the sequence of splice sites, many more modify the binding sites of RNA-binding proteins. Genetic variants affecting splicing can have a stronger phenotypic impact than those affecting gene expression.

if: i) it is an sQTL, but not an eQTL, for gene g 1 in tissue t 1 , ii) it is an eQTL, but not an sQTL, for gene g 2 in tissue t 2 , iii) it is neither an sQTL nor an eQTL for gene g 2 in tissue t 1 , iv) it is neither an sQTL nor an eQTL for gene g 1 in tissue t 2 . b) Example of a heteropleiotropic locus. The SNP rs11603538 (chr11:63,986,713, C/A) is an sQTL for the gene FERMT3 (chr11:63,974,150-63,991,354, forward strand) in Spleen (n = 146), but not in Muscle Skeletal (n = 491). The SNP is not an eQTL for FERMT3 in any of the two tissues. In contrast, the SNP is an eQTL for the gene TRPT1 (chr11:63,991,272-63,993,726, reverse strand) in Muscle Skeletal, but not in Spleen. The SNP is not an sQTL for TRPT1 in any of the two tissues.
In the left panel, the dots represent the −log 10 p values of association with the expression (two-sided t-test, blue) and splicing (Anderson test, red) of the two genes in the two tissues, for variants in a 20 Kb window centered at rs11603538 (the −log 10 p values corresponding to rs11603538 are highlighted by a diamond). The transparency of the dots depends on the −log 10 p value. The significance level for each molecular trait, gene and tissue is shown as a coloured, horizontal dashed line. When this line is not present, the gene-level p value is above the 0.05 FDR threshold and hence no variant is significantly associated with this molecular trait in this tissue (see Methods). The shaded area represents the position of a H3K36me3 ChIP-seq peak (see below). The right panel shows the fold-change signal of the H3K36me3 histone mark with respect to the input across three ENTEx donors in Spleen and Muscle Skeletal, in the same genomic region of the left panel.
The line and the coloured area correspond, respectively, to the mean fold-change signal and its standard error (SEM) across three ENTEx donors (i.e. mean ± SEM). The location of the SNP (vertical dashed line) and the overlapping ChIP-seq peak (intersection of the peaks in the three donors, black rectangle) are also displayed. Source data are provided as a Source Data file.  Figure 8. sQTL sharing in downsampled GTEx datasets. GTEx tissues were randomly downsampled once to a) 100 b) 200 and c) 300 samples, respectively (see Methods). Sharing estimates range from 0 (low sharing, blue) to 1 (high sharing, red). In addition, we display the hierarchical clustering of the tissues based on sQTL sharing, together with the tissue specificity estimates. Source data for a) -c) are provided as a Source Data file. A SNP with alleles A>G is an sQTL for a gene with three isoforms in a given tissue t 1 (red). The average individual (centroid) of each genotype group can be represented as a point in a three-dimensional space, whose coordinates are the mean relative abundances of each isoform across individuals in this genotype group. As relative abundances add up to one, these points are located on a two-dimensional simplex (i.e. a triangle). The left panel shows a SNP that is an sQTL for the same gene in tissue t 2 (violet), affecting isoform abundances in the same way as in tissue t 1 (red). Note that here we consider a single variant-gene pair (p = 1), and d is given by the sum of the lengths of the dashed lines. Hence, in this case d(t 1 , t 2 ) would be small. In the middle panel, the SNP is not an sQTL for the same gene in tissue t 3 (blue), and therefore d(t 1 , t 3 ) would be large. In the right panel, the SNP is an sQTL for the same gene in tissue t 4 (yellow), but the associated change in splicing isoform abundances is different, and d(t 1 , t 4 ) would be large. b) Hierarchical clustering built using the centroid distance approach. Source data are provided as a Source Data file. c) Estimates of similarity (Baker's Gamma) between tissue dendrograms obtained using MD correlations, the Jaccard index and the centroid distance approach. The dendrogram obtained using mashr on LeafCutter sQTLs from GTEx V8 is also compared. All similarity estimates are significantly different from 0 (permutation test p value < 10 −4 ). 11  In contrast, isoforms ENST00000457498 (green) and ENST00000486090 (yellow) display the opposite behaviour, being less abundant in alternative homozygous individuals. rs9876026 is an sQTL for TAMM41 in a total of 46 tissues (MD = 0.18). Both in a) and b), data is shown as boxplots, where the box represents the first to third quartiles and the median, and the whiskers indicate ± 1.5 × interquartile range (IQR). Source data for both a) and b) are provided as a Source Data file. 18 PPIG (8,198) YBX3 (4,645) SF3B4 (8,120) RBFOX2 (6,125) SUGP2 (7,489) PTBP1 (5,598) KHSRP (12,933) BUD13 (14,141) SUB1 (1,440) ZNF622 (6,198) UCHL5 (7,271) EFTUD2 (14,307) PRPF8 (16,250) DDX24 (4,366) LARP4 (3,463) U2AF2 (5,453) GRWD1 (8,946) RBM15 (4,567) TARDBP (2,218) TIA1 (3,099) NCBP2 (3,001) SAFB2 (1,334) Table 3. b) Overlap between the sGenes identified by sQTLseekeR2 in GTEx V7 and V8. Distribution of the Jaccard index (left) and the minimum overlap (that is, the number of common sGenes between V7 and V8 over the total number of sGenes identified in the smallest set, i.e. V7, right), computed per tissue (n = 48 tissues). To obtain these metrics only the genes tested in both V7 and V8 are considered. c) Relative change in median sQTL effect size (median MD value) between V8 and V7, per tissue (y-axis), vs the relative change in tissue sample size (x-axis). d) Overlap between the sQTLs identified by RSEM + sQTLseekeR2 and the ones obtained using LeafCutter + FastQTL by the GTEx Consortium, measured at the level of sGenes (Jaccard index) and pairs sQTL-sGene (π 1 ) across n = 48 tissues. Note that the overlap was evaluated considering the variant-gene-tissue trios tested in both runs. Both in b) and d), data is shown as boxplots, where the box represents the first to third quartiles and the median, and the whiskers indicate ± 1.
x ij is the expression of isoform i in individual j and q is the number of isoforms of the gene. Note that splicing ratios configure a multivariate phenotype, with as many values as there are transcripts for a given gene. sQTLseekeR uses the Hellinger distance between splicing ratios to estimate their variability across individuals. The Hellinger distance between the splicing ratios of individuals j and k is given by: For a given gene and genetic variant, sQTLseekeR compares the variability of the gene's splicing ratios, calculated as a sum of squares, within and between genotypes at the variant (i.e. 0, 1, 2). The comparison is performed computing a pseudo F score as defined by Anderson 2 : where SS T is the total variability, SS B is the variability between genotypes, and SS W is the variability within genotypes. If N is the total number of individuals, where d 2 H (j, k) is the squared Hellinger distance between the splicing ratios of individuals j and k, and where p is the number of genotypes at the variant, n g the number of individuals with genotype g, and g,j,k = 1 if individuals j and k have genotype g at the variant, otherwise g,j,k = 0. To assess significance, sQTLseekeR relies on the asymptotic distribution of pseudo F scores. Anderson derived the asymptotic distribution for the numerator of the pseudo F score, which is a linear combination of independent χ 2 variables with n g −1 degrees of freedom, where the coefficients are derived from the matrix built upon the distances between every pair of individuals 3 . For any set of permutations of a given data set, the sum of squares of the numerator is a monotonic function of the pseudo F score, thus the permutation p values computed on the numerator and on the pseudo F score are equivalent. Essentially, this approach is a nonparametric analogue to multivariate analysis of variance (MANOVA). sQTLseekeR2, available at https: //github.com/guigolab/sQTLseekeR2, is a completely rewritten, largely enhanced version of sQTLseekeR (see below), although the statistical framework implemented remains the same. Notably, this framework is not necessarily restricted to the analysis of relative transcript abundances, and can be applied to other multivariate AS phenotypes (see also Supplementary Note 2).

Enhancements in sQTLseekeR2 p value computation
To approximate the asymptotic distribution of the test statistic, sQTLseekeR relied on the Monte Carlo gen-

Correction for potential confounders
sQTLseekeR2 allows the incorporation of potential technical or biological confounders (either numerical or categorical) as covariates to be regressed out from the splicing ratios before testing for association with the genotype.

Multiple testing correction scheme
In cis sQTL mapping, all possible variant-gene pairs in cis are tested for association. This implies two multipletesting levels: i) multiple variants are tested per gene, and ii) multiple genes are tested genome-wide. To account for (i), sQTLseekeR2 implements a permutation scheme that empirically characterizes, for each gene, the distribution of nominal p values expected under the null hypothesis of no association 5 . This null distribution is then modeled using a beta distribution as in FastQTL 6  To identify all significant variant-gene pairs, sQTLseekeR2 implements a procedure identical to the one depicted in 7 for expression QTLs: i) it defines a genome-wide empirical p value threshold, p t , as the gene-level p value closest to the 0.05 FDR threshold; ii) for each gene, it computes a nominal p value threshold based on the beta distribution model of the minimum p value distribution f (p min ) (obtained from the permutations for this gene, see above), as F −1 (p t ) (where F −1 is the inverse cumulative distribution); iii) for each gene, variants with a nominal p value below the gene-level threshold are considered significant.

Linkage disequilibrium-based variant clustering
A substantial fraction of the variants tested in cis (up to 50% for a window of 5 Kb around the gene) are in high (r 2 ≥ 0.8) or even complete (r 2 = 1) linkage disequilibrium (LD). sQTLseekeR2 allows clustering variants above a user-defined LD threshold, so that only a representative of each cluster is tested for association. This reduces the number of dependent tests and therefore the running time and stringency of the multiple testing correction. It can be combined with simple FDR as a fast, suboptimal alternative to permutations for multiple testing correction.

Reduced running time
The more efficient p value computation, along with other minor improvements to speed up the preprocessing steps, allowed to achieve a substantial reduction (between three and four times) of the running time of a nominal pass of sQTLseekeR2 with respect to sQTLseekeR (Supplementary Figure 24).

Simulation study
To study the type I error of sQTLseekeR2, we simulated the splicing ratios of 10,000 genes with q transcript

LeafCutter
A number of studies have recently employed LeafCutter to identify sQTLs [9][10][11] . LeafCutter 12 is a method to quantify alternative splicing (AS) from short-read RNA-seq data. It uses split-mapped reads to identify alternatively-excised introns and groups them into clusters. Then, intron usage can be expressed as proportions (i.e. intron excision ratios: the number of reads supporting an intron over the total number of reads supporting the cluster to which the intron has been assigned). To test for association with genetic variants, individual intron excision ratios are often modeled using univariate linear regression (e.g. as implemented in The vector of intron excision ratios obtained by LeafCutter, however, can also be modeled as a multivariate phenotype, and therefore it can be employed as input for sQTLseekeR2 to identify sQTLs. Actually, multivariate modeling through a Dirichlet-multinomial generalized linear model (GLM) is available within LeafCutter, but only for differential splicing analyses. Given the popularity of LeafCutter, we have run sqtlseeker2-nf with LeafCutter quantifications in GTEx V7, and compared the resulting sQTLs with the set obtained using RSEM transcript quantifications.

LeafCutter AS quantification and sQTL mapping
We quantified AS based on the intron excision phenotypes defined by LeafCutter as follows: first, we used the bam2junc.sh script to quantify intron usage, and then we employed the leafcutter cluster.py script (with options --min clu reads 30 --min clu ratio 0.001 --max intron len 500000) to define intron clusters.
Both scripts are provided with LeafCutter software. To map LeafCutter clusters to genes, we employed the map clusters to genes.R script available from https://github.com/broadinstitute/gtex-pipeline, with exon coordinates derived from GENCODE v19.
The resulting * perind numers.counts.gz files were modified to include two ID columns: i) intron:cluster:gene and ii) cluster:gene, and used as input for sqtlseeker2-nf, in place of tissue transcript expression files. Note that these columns replace the transcript and gene ID columns required when using transcript quantifications. By  Table 2).

Comparison between sQTLs identified using RSEM transcript quantifications and LeafCutter intron excision ratios
To obtain comparable results, genes that were not tested in both runs were filtered out, and the multiple testing correction (process permuted mtc in sqtlseeker2-nf) was repeated on the set of common genes. In addition, the selection of the same cis window and parameters regarding variant filtering in both runs ensured that the set of variants tested was virtually identical. We compared the pairs sQTL-sGene identified by i) both approaches (i.e. common sQTLs), ii) LeafCutter quantifications only (i.e. LC-exclusive sQTLs), and iii) transcript quantifications only (i.e. TQ-exclusive sQTLs). Note that in the case of LeafCutter, we considered that an sQTL affects a given gene if it is associated with changes in intron excision ratios of at least one of the intron clusters assigned to this gene.
Overall, we observed a moderate overlap between the sQTL sets identified by the two approaches (median Jaccard index of sGenes across tissues 0.34, median π 1 of TQ in LC sQTL-sGene associations across tissues 0.73, Supplementary Figure 26a). The number of LC-exclusive sQTLs was consistently larger than the number of TQ-exclusive sQTLs. This could be explained by the fact that LeafCutter is able to detect novel introns, whereas transcript quantification relies only on GENCODE v19 annotation. Indeed, 44% (median across tissues) of the most-changing introns within intron clusters associated with LC-exclusive sQTLs are novel. However, we observed that the proportion of LC-exclusive sGenes decreased with the tissue sample size, while the proportion of common and TQ-exclusive sGenes increased (Supplementary Figure 26b).
Furthermore, LC-exclusive sQTLs are associated with intron clusters that have lower read support, when compared to common sQTLs (Supplementary Figure 26c). Hence, this could also indicate an inflated false positive rate associated with LeafCutter quantifications, which has already been observed in differential splicing analyses 13 . For large sample sizes, the number of sQTLs identified by sQTLseekeR2 using LeafCutter or RSEM quantifications is very similar.
We also compared sQTL effect sizes (MD values). In the case of LeafCutter sQTLs, MD values (MD LC ) correspond to the absolute maximum difference in mean adjusted intron excision ratios of a given intron cluster (rather than in mean adjusted transcript relative expression, MD TQ , as when using transcript quantifications) between genotype groups. In the case of sGenes with more than one significant intron cluster, sQTL effect sizes were computed as the median MD LC value across all significant intron clusters. As expected, MD TQ values were smaller (median reduction across tissues 7%) for TQ-exclusive sQTLs than for common sQTLs. The same behaviour was observed for the MD LC values of LC-exclusive sQTLs and common sQTLs, although in this case the reduction was much larger (median reduction across tissues 22%). This suggests that LeafCutter quantifications allow to identify smaller effects, and contributes to explain the larger number of LC-exclusive sQTLs with respect to TQ-exclusive sQTLs found. In addition, despite the marked differences in the splicing phenotypes employed, the effect sizes of common sQTLs obtained by LeafCutter (MD LC ) and transcript quantifications (MD TQ ) displayed a substantial Pearson correlation (median r across tissues of 0.51).
Given that LeafCutter does not provide information about the flanking exons, we could not characterize the AS events associated with LC-exclusive sQTLs. Nevertheless, we computed the AS events associated with TQ-exclusive and common sQTLs (see section Alternative splicing events associated with sQTLs in Methods) and compared them. We observed that the nature of the AS events identified was different between TQ-exclusive and common sQTLs (Supplementary Figure 26d). We further explored individual events using two-sided Wilcoxon Rank-Sum tests to assess the significance of the differences. We found that common sQTLs displayed larger proportions of events involving internal exons, including mutually exclusive exons (p value 6.70 · 10 −10 ) or alternative acceptor (p value 4.51 · 10 −5 ). In contrast, TQ-exclusive sQTLs showed larger proportions of AS events affecting the gene termini, such as alternative first exon (p value 1.56 · 10 −5 ) or alternative 3' UTR (p value 4.74 · 10 −4 ), as well as intron retention (p value 7.51 · 10 −6 ), an event that cannot be identified by LeafCutter. This highlights the complementarity of LeafCutter and transcript quantifications for alternative splicing measurement in the context of sQTL mapping. Jaccard index of sGenes across tissues 0.33, median π 1 of sQTLseekeR2's in the Consortium's sQTL-sGene associations across tissues 0.44, Supplementary Figure 27d). Note that the overlap was evaluated considering the variant-gene-tissue trios tested in both runs. Overall, the number of sGenes identified by the GTEx Consortium was larger than the number of sGenes identified by sQTLseekeR2: on average across tissues, out of 100 sGenes, 44 were identified only by the GTEx Consortium, 24 only by us, and 32 in common. The numbers are analogous for sQTLs (i.e. 46, 26 and 28, see also Supplementary Data 7).
A major source for this difference is the nature of the splicing phenotypes employed in the two analyses (RSEM transcript quantifications vs LeafCutter intron excision ratios), as we extensively discuss in the Supplementary Note 2. Indeed, the larger number of sQTLs identified by the GTEx Consortium could be explained by the fact that LeafCutter is able to detect novel introns, while our analysis employs RSEM transcript quantifications that rely on GENCODE annotation. It could be also related to an inflated false positive rate associated with LeafCutter quantifications, which has already been described in differential splicing analyses 13  Wilcoxon Rank-Sum test p value 2.78 · 10 −11 ), as well as intron retention (two-sided Wilcoxon Rank-Sum test p value 1.47 · 10 −6 ). The latter is particularly relevant, given that this event, which cannot be identified by LeafCutter, is known to be important in tissues such as blood 16 or brain 17 and has been related to several types of cancer 18 . In contrast, sQTLs identified also by the GTEx Consortium showed larger proportions of events involving internal exons (e.g. exon skipping, two-sided Wilcoxon Rank-Sum test p value < 10 −16 ).
We further characterized the sQTLs identified only by one of the approaches (either LeafCutter + FastQTL, LCFQ, or RSEM + sQTLseekeR2, RMSQ). For each tissue, we built a dataset with all the approach-exclusive sQTL-sGene associations, and defined the following set of features: number of transcripts of the sGene, number of introns, average number of introns per transcript, average intron length, average exon length, sGene expression (TPM), MAF of the sQTL variant, whether it is present or not in the primary transcript, distance to the TSS, distance to the closest splice site, TFBS, RBP-binding site, GWAS variant, H3K36me3 peak and Pol II binding site. Then, we split the dataset in training and test sets (70/30), and trained a random forest (as implemented in sklearn.ensemble.randomforestclassifier) to classify each sQTL-sGene association as LCFQ-exclusive or RMSQ-exclusive. We performed a 10-fold cross-validation with hyperparameter tuning to select the optimal model, which was used to make predictions on the test set. Overall, the classification accuracy was high (average across tissues 98%). We determined the importance of each feature for classifi-43 cation as the mean decrease in node impurity from splitting on the feature (measured by the Gini index) and via SHAP values 19 (mean absolute SHAP value). Importance estimates were normalized to sum up to 1. In addition, we employed SHAP values to explore the contribution of each feature to each class at the level of individual observations. In order to evaluate the impact of duplicated feature values in the models, we repeated In summary, these results suggest that our approach has larger power to detect sQTLs for genes with lower expression and shorter introns (longer exons), as well as involving variants with smaller MAF. In contrast, the GTEx Consortium's approach would be better powered to detect sQTLs for genes with higher expression and longer introns (shorter exons). A possible explanation for these observations is that both lowly expressed genes and longer introns result in less reliable transcript quantifications and intron-excision ratios, respectively, which could lead to a larger number of associations. Indeed, the former has been reported 20 , and as for the latter, we have observed a marked decrease in the number of reads that support an intron as the intron length increases (Supplementary Figure 28e). Regarding MAF, our observation could be due to the fact that in sQTLseekeR2 the genotype is encoded as a categorical variable, while the GTEx Consortium's approach treats it as a numerical variable. Hence, our approach would have larger power to detect what is known in this context as non-additive (either dominant or recessive) genetic effects, especially when there are few individuals in one of the genotype groups.