MOCCASIN: a method for correcting for known and unknown confounders in RNA splicing analysis

The effects of confounding factors on gene expression analysis have been extensively studied following the introduction of high-throughput microarrays and, subsequently, RNA sequencing. In contrast, there is a lack of equivalent analyses and tools for RNA splicing. Here we first assess the effect of confounders on both expression and splicing quantifications in two large public RNA-Seq datasets (TARGET, ENCODE). We show that quantifications of splicing variations are affected at least as much as those of gene expression, revealing unwanted sources of variation in both datasets. Next, we develop MOCCASIN, a method to correct for the effect of both known and unknown confounders on RNA splicing quantification, and demonstrate MOCCASIN's effectiveness on both synthetic and real data. Code, synthetic and corrected datasets are all made available as resources.


Toy data
The toy data were constructed to quantify the inaccuracies incurred by various models in a setting where a bias term causes a known change in % TPM (i.e. a change that matches the synthetic batch data generation process described above). Specifically, the more used junction in a two-way LSV changes by that % TPM while the other junction increases by the same amount of TPM, so that overall expression is maintained.
Given the above data we follow the MOCCASIN correction procedure for batch effects: first, we scale the junction reads so the LSV has the same coverage over all samples; then we apply the transformation using numpy (identity, arcsinh, or log), fit the linear model, and perform the read rate adjustments accordingly; finally, we apply the inverse transformation and compute PSI from the adjusted reads. The results of this analysis are plotted with matplotlib in Supplementary Figure S7 (see the Supplementary Files for the code to reproduce the toy data).
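The steps above can be sketched with numpy as follows. This is a minimal illustration of the per-LSV correction, not MOCCASIN's actual API; the function and argument names are hypothetical.

```python
import numpy as np

def correct_lsv(reads, design, confounder_cols,
                transform=np.arcsinh, inverse=np.sinh):
    """Sketch of the batch correction for one LSV.

    reads:  (samples x junctions) junction read counts.
    design: (samples x factors) model matrix (intercept, variables of
            interest, confounders); confounder_cols indexes the columns
            whose fitted effect is removed.
    """
    # 1. Scale junction reads so every sample has the same LSV coverage.
    coverage = reads.sum(axis=1, keepdims=True)
    scaled = reads * (coverage.mean() / coverage)

    # 2. Transform the scaled read rates (identity, arcsinh, or log).
    y = transform(scaled)

    # 3. Fit an OLS linear model per junction.
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)

    # 4. Remove the fitted confounder contribution from the read rates.
    adjusted = y - design[:, confounder_cols] @ beta[confounder_cols]

    # 5. Invert the transformation and compute PSI from adjusted reads.
    corrected = np.clip(inverse(adjusted), 0, None)
    return corrected / corrected.sum(axis=1, keepdims=True)
```

For a two-way LSV with a batch shift, correcting with the batch column as a confounder brings the PSI of the shifted samples back to the unperturbed value.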

TARGET dataset
The TARGET data used in this paper comprise 579 patients, from whom samples were taken either at primary leukemia diagnosis or after relapse. Most of the patients' samples were sequenced in 2-3 technical replicates and, as indicated in the sequencing metadata, samples were sequenced on either an Illumina HiSeq 2000 or 2500. There was also information about sample release and load dates, but these dates were completely confounded with our biological variable of interest (primary diagnosis vs relapse). We observed that samples sequenced on the Illumina HiSeq 2000 vs those on the HiSeq 2500 exhibited batch effects (Fig 1a-b). It is possible that the HiSeq 2000 and 2500 have distinct biases, or that the sequencer label is a proxy for how the samples were handled. For example, samples may have been frozen for a longer period of time prior to HiSeq 2500 vs 2000 sequencing, or different technicians may have processed the 2500 vs 2000 samples.
Without further information about how the samples were processed, handled, stored, etc., it is difficult to determine the cause of 2000 vs 2500 sample batch effects we observed.
The clinical information associated with the 579 TARGET patients indicates their leukemias break down as follows: 160 with a B cell of origin, 265 with a T cell of origin, and 154 with an unknown cell of origin. We predicted the 154 patients' leukemia cells of origin using UMAP of splicing (PSI) and gene expression (TPM). The UMAP analysis is identical to Figure 1a-b, but colored by cell of origin. Of the 154 patients with an unknown cell of origin, 90 cluster with B cell of origin patients and 64 cluster with T cell of origin patients. Thus, we predicted that 250 leukemias have a B cell of origin and 329 leukemias have a T cell of origin.

ENCODE dataset
The ENCODE RNA-Sequencing data used in this paper were described in a preprint (Van Nostrand et al. 2020), which used 1100 ENCODE RNA-Sequencing samples generated in 61 distinct experiments (batches), whereby each batch had a control and at least one knockdown target. Every ENCODE sample, including the controls, had two biological replicates in every batch. 29 batches (548 samples) used HepG2 cells and 32 batches (552 samples) used K562 cells.
Van Nostrand et al. reported that they identified batch effects in their data but did not quantify them. To correct for these effects, Van Nostrand et al. applied the ComBat batch correction algorithm directly to sample read counts, and then averaged the controls across batches to create two "virtual" control replicates before quantifying differential expression and differential splicing.
Similar to Van Nostrand et al., in this work HepG2 cells and K562 cells were considered separately when modelling and quantifying batch effects. However, given the ENCODE experimental design described above, it was not clear what the optimal approach to model batch effects would be. After testing a variety of strategies and measuring their effects using the metrics described in this paper, we converged on the following procedure.
First, batch effects in HepG2 and K562 samples were modelled and removed independently by MOCCASIN. After batch correction, controls across all batches were then considered together as one group when quantifying differential splicing between controls and knockdowns, separately for HepG2 and K562 samples. Differential splicing was quantified with MAJIQ (see 'Splicing analysis' for more details). For differential expression analysis, gene-level TPMs of control replicate 1 samples and of control replicate 2 samples were each averaged across all batches to create two "virtual" control replicates.
Overall, there were 1100 control vs target comparisons included in the analyses presented in the various plots and heatmaps shown here. Specifically, for the heatmaps and clustering figures, LSVs were filtered to select at most one representative junction such that at least one knockdown vs controls comparison had |dPSI| >= 0.1 and a two-sided Wilcoxon test p-value below 0.05. Then, for each LSV, the junction with the greatest |dPSI| value was kept, breaking ties randomly.
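The junction-selection rule for the heatmaps can be sketched as follows. This is a minimal sketch; the container layout (per-LSV arrays of |dPSI| and p-values across comparisons) is a hypothetical encoding of the filtering described above.

```python
import numpy as np

def representative_junctions(dpsi, pvals, rng=None):
    """Pick at most one representative junction per LSV.

    dpsi, pvals: dicts mapping LSV id -> (junctions x comparisons)
    arrays of dPSI point estimates and Wilcoxon p-values.
    Returns a dict mapping retained LSV ids to the chosen junction index.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    chosen = {}
    for lsv, d in dpsi.items():
        p = pvals[lsv]
        # Keep the LSV only if some junction passes both thresholds
        # in at least one knockdown vs controls comparison.
        passing = (np.abs(d) >= 0.1) & (p < 0.05)
        if not passing.any():
            continue
        # Representative junction: greatest |dPSI| anywhere,
        # with ties broken randomly.
        best = np.abs(d).max(axis=1)
        top = np.flatnonzero(best == best.max())
        chosen[lsv] = int(rng.choice(top))
    return chosen
```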
The MOCCASIN model matrix included: intercept, Treatment_OLM, and Treatment_FC. No explicit confounder factor was included, even though we know of the Lab_Wood / Lab_Abel difference. Instead, we used MOCCASIN -U (see methods above) with 2 hidden factors. For evaluation, 10,000 LSVs were selected uniformly at random from those originally quantified in all samples. Principal component analysis (PCA) from scikit-learn was applied to the PSI values of the 10,000 LSVs, and plots were generated with matplotlib. MAJIQ was used to quantify differences in splicing between Wood vs Abel control samples, and a two-sided Wilcoxon rank test was applied to compute p-values.
See Supplementary Figure 1 for the results of this analysis.
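The PCA evaluation step can be sketched as follows; the function name, seed, and sampling details are illustrative rather than the exact code used.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_psi(psi, n_lsvs=10_000, seed=0):
    """Evaluation sketch: sample LSVs uniformly at random from those
    quantified in all samples, then run scikit-learn PCA on the
    (samples x LSVs) PSI matrix and return 2-D coordinates for plotting.
    """
    rng = np.random.default_rng(seed)
    n = min(n_lsvs, psi.shape[1])
    cols = rng.choice(psi.shape[1], size=n, replace=False)
    return PCA(n_components=2).fit_transform(psi[:, cols])
```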

Gene expression analysis
For gene expression level analysis of the TARGET and ENCODE datasets, transcript-level TPM quantifications were collapsed into gene-level TPMs with tximport (bioconductor-tximport 1.12.0). For differential expression analysis of ENCODE, DESeq2 was used to quantify the log2 fold-change (shrunk with DESeq2::lfcShrink using the apeglm method) and statistical significance of differential expression between the two "virtual" control replicates and a given pair of knockdown target replicates. A DESeq2 adjusted p-value below 0.05 was the threshold for identifying a differentially expressed gene as statistically significant.

Splicing analysis
MAJIQ and VOILA (v2.1-46f7268) were used to quantify splicing from sorted BAM files. MAJIQ uses a Bayesian model for both PSI and dPSI, computing expected values and posterior probabilities over those. Thus, when point estimates for PSI or dPSI are described in the text they generally refer to the expected values for PSI or dPSI, with a slight abuse of notation to improve readability: for example, PSI > 0.1 instead of E[PSI] > 0.1, unless the posterior probability distribution is stated explicitly (e.g. P(|dPSI| > C)). For defining differentially spliced events, two MAJIQ-based methods were used. The first is MAJIQ's Bayesian expected delta PSI (dPSI) mentioned above and described in (Vaquero-Garcia et al. 2016), which is geared to produce a set of high-confidence splicing changes between groups of biological replicates. Briefly, MAJIQ dPSI quantifies the probability that a difference in splice inclusion is above some threshold with a certain confidence, i.e. P(|dPSI| >= C1) >= C2. We used thresholds of C1 = 10% or 20% in the results reported in the main and supplementary figures. Significant differences in splicing were those identified as having a C2 = 95% probability of being greater than the given C1 threshold.
The second method to detect differential splicing involved either a Student's t test or a two-sided Wilcoxon rank test, combined with an optional threshold on the observed difference between the medians of the expected PSI of samples from the two groups. This approach aimed to accommodate heterogeneous groups, such as those we observed in ENCODE's control samples (see the ENCODE data section).
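A minimal sketch of this second test, assuming per-sample E[PSI] values for the two groups are available as vectors (the function name and default thresholds here are illustrative):

```python
import numpy as np
from scipy.stats import ranksums

def group_dpsi_test(psi_a, psi_b, dpsi_threshold=0.1, alpha=0.05):
    """Two-sided Wilcoxon rank test on per-sample E[PSI], plus an
    optional threshold on the difference between group medians."""
    stat, p = ranksums(psi_a, psi_b)
    dmedian = np.median(psi_a) - np.median(psi_b)
    passed = bool(p < alpha and abs(dmedian) >= dpsi_threshold)
    return passed, p, dmedian
```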

Uniform manifold approximation and projection (UMAP)
UMAP (McInnes, Healy, and Melville 2018) was applied to gene expression (TPM and log2 fold-change) data and splicing (PSI, dPSI, and difference in median PSI) data using the R package umap (0.2.4.1) with method set to 'umap-learn', n_neighbors set to 25, and the random seed set to 924319452. UMAP projections were visualized with the R packages extrafont, ggplot2, and patchwork.

Calculating percent variance explained by confounders
Percent variance explained by a confounder was calculated as the R^2 obtained when fitting a linear model with the batch effect as a covariate.
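As a minimal sketch of this calculation (assuming a one-hot encoding of integer batch labels, which is an illustrative choice):

```python
import numpy as np

def percent_variance_explained(y, batch):
    """R^2 (as a percentage) of a linear model y ~ batch.

    y:     per-sample quantification (e.g. PSI of one junction).
    batch: vector of integer batch labels.
    """
    # Build a design matrix with an intercept plus one-hot batch columns
    # (dropping the first level to keep the model full-rank).
    levels = np.unique(batch)
    X = np.column_stack([np.ones_like(y)] +
                        [(batch == b).astype(float) for b in levels[1:]])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 100.0 * (1.0 - ss_res / ss_tot)
```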

Hierarchical clustering
For hierarchical clustering of gene expression (TPM and log2 fold-change) data and splicing (PSI, dPSI, and difference in median PSI) data, the ward.D2 algorithm was applied to Euclidean distances between columns and between rows. Clustering results were visualized with the R packages extrafont, ComplexHeatmap, and pheatmap.

Scatterplots and Venn Diagrams
Scatterplots with lines of best fit and Pearson correlations were generated using the base R lm() function. Size-scaled Venn diagrams were generated with the R package eulerr (6.1.0).

Assessing MOCCASIN effect on differentially spliced events
LSVs were called changing if they met the thresholds (e.g. dPSI >= 0.1) specified for a given analysis and matching plot in this work. Note that in this work we only aimed to assess MOCCASIN's correction effect. Thus, to avoid introducing other variables into the analysis, we used as ground truth the same algorithm (MAJIQ) with the same settings applied to the original simulated data without introduced batch effects. Comparing the analyses of the original, perturbed, and corrected data gave the true negative (TN), true positive (TP), false positive (FP), and false negative (FN) estimates and their associated statistics (FPR, FDR) used in the main text and figures.
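Assuming the "changing" calls are available as sets of event identifiers, the comparison described above can be sketched as:

```python
def confusion_stats(universe, truth, called):
    """Compare changing calls against the batch-free ground truth.

    universe: all event ids quantified in every analysis.
    truth:    events called changing in the original (batch-free) data.
    called:   events called changing in the perturbed or corrected data.
    """
    tp = len(truth & called)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = len(universe - truth - called)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fdr = fp / (fp + tp) if (fp + tp) else 0.0
    return dict(TP=tp, FP=fp, FN=fn, TN=tn, FPR=fpr, FDR=fdr)
```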

Modeling unknown confounding factors
MOCCASIN's model matrix, input by the user, includes filenames, variables of interest, and known confounders. The user may optionally request that MOCCASIN identify additional factors of unwanted variation by setting the -U option to 1 or more.
In order to discover factors of variation in the data that are not explained by the factors in the model matrix, MOCCASIN first identifies, for each LSV, the junction with the highest variance in PSI over samples. The top NUM_JUNCS (default: 10,000) such junctions by variance are then selected as the input data Y for the following procedure:
1. Perform OLS regression on the input data Y with the model matrix.
2. Compute the residuals Y_r from this regression.
3. Compute the singular value decomposition Y_r = WSV^T.
4. Append the estimated factors W (up to the number specified by the -U argument) to the original model matrix, retaining them only if the model remains full-rank.
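The steps above can be sketched with numpy as follows; this is a minimal illustration, not MOCCASIN's actual implementation, and the function name is hypothetical.

```python
import numpy as np

def discover_hidden_factors(Y, M, n_factors):
    """Discover unknown confounding factors from residual variation.

    Y: (samples x junctions) PSI of the top-variance junctions.
    M: (samples x factors) known model matrix.
    Returns M with up to n_factors left singular vectors appended.
    """
    # 1. OLS regression of Y on the known model matrix M.
    beta, *_ = np.linalg.lstsq(M, Y, rcond=None)
    # 2. Residuals: variation not explained by the known factors.
    Yr = Y - M @ beta
    # 3. SVD of the residuals: Yr = W S V^T.
    W, S, Vt = np.linalg.svd(Yr, full_matrices=False)
    # 4. Append candidate factors, keeping only those that leave the
    #    model matrix full-rank.
    M_new = M
    for k in range(n_factors):
        candidate = np.column_stack([M_new, W[:, k]])
        if np.linalg.matrix_rank(candidate) == candidate.shape[1]:
            M_new = candidate
    return M_new
```

For data with an unmodeled batch structure, the first appended factor separates the hidden batches.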
In a nutshell, the above method for discovering unknown confounding factors is a variant of matrix decomposition, as used in many previously published works on correcting confounders in expression data. Specifically, the application of SVD to the most highly varying features was previously suggested by Risso et al. and is included in the RUV package.
Compared to RUV, we made the following modifications. First, the RUVs procedure assumes that only two kinds of effects exist, termed "main variables of interest" and "unwanted variation"; it does not handle known confounders and, as currently implemented, would rediscover known confounders corresponding to a large proportion of the variation in the data. Hence, our input to the SVD step used by RUV is the residual matrix Y_r. We note that these residuals may be negative, which means we could not simply apply RUV as a second processing step: RUVs expects counts and tries to take their logarithm, and was therefore unworkable for our purposes.
In addition, as discussed in the main text, we use OLS regression over read rates instead of the negative binomial GLM suggested by the RUV documentation. We also note that, like the RUVr procedure as currently implemented in RUVSeq (July 2019), our derived factors correspond to only the left singular vectors of the singular value decomposition; this implementation contrasts with the method reported in the 2014 Nat Biotech paper, which states that the derived factors should be the left singular vectors composed with the singular values.

Speed and memory optimizations
In contrast to gene or transcript expression studies, which typically involve tens of thousands or at most a few hundred thousand tags, MOCCASIN must model and adjust counts for possibly millions of splice junctions in a single run. The need to handle that many inference tasks motivated MOCCASIN's simple model, which performs linear regression on junction read rates. However, the size of the data necessitates additional speed and memory optimizations.
MOCCASIN partitions LSVs into "chunks" for processing in order to limit peak memory usage. This is possible because LSVs are modeled and adjusted independently of one another in the MOCCASIN framework. The user can set the max chunk size in terms of the max number of coverage table cells to be loaded at once into memory (-M). However, the peak memory use currently is lower-bounded by the size of the coverage table in the largest input .majiq file.
MOCCASIN allows the user to specify the number of processes to use (-J) as a speed optimization. This governs the number of spawned processes which will process each chunk of LSVs in parallel during the modeling and adjustment steps.
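Assuming per-LSV coverage-table sizes are known, the -M chunking behavior can be sketched as follows; this is a hypothetical illustration, not MOCCASIN's actual code.

```python
def pack_chunks(lsv_ids, cell_counts, max_cells):
    """Greedily pack LSVs into chunks whose total coverage-table cells
    stay under the -M limit. LSVs can be split this way because each
    LSV is modeled and adjusted independently.
    """
    chunks, current, size = [], [], 0
    for lsv, cells in zip(lsv_ids, cell_counts):
        # Close the current chunk if adding this LSV would exceed -M.
        if current and size + cells > max_cells:
            chunks.append(current)
            current, size = [], 0
        current.append(lsv)
        size += cells
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be handed to one of the -J worker processes (e.g. via multiprocessing). Note that an LSV larger than the limit still forms a chunk on its own, which is consistent with peak memory being lower-bounded by the largest input coverage table.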
In practice, when running MOCCASIN on our simulated datasets, we found MOCCASIN (-J 8) took 13 seconds on 4 samples and 34 seconds on 16 samples, with a roughly linear increase in time, and used less than 1 GB of RAM (Supplementary Fig 8).
For large (i.e. >500 samples) datasets, I/O becomes rate-limiting, since we keep memory use low (see above) by writing and reading temporary files at a user-specified tmp location. These temporary files would not be necessary if we used shared memory across processes. Python 3.8 includes shared-memory capabilities, but it is not yet as widely used as Python 3.6, which lacks them. To facilitate MOCCASIN's usability, we chose Python 3.6 and, to compensate for the lack of shared memory, we suggest users point the MOCCASIN tmp directory to the fastest disk available (e.g. a RAM drive at /dev/shm). At the end of the MOCCASIN execution, the tmp subdirectory is automatically deleted (unless the -K flag is supplied). For example, on the TARGET dataset (885 samples), using a RAM drive and -J 40, MOCCASIN took 2.8 days to finish and used a maximum of 26 GB of RAM. We note that, especially for large datasets, the main bottleneck for execution time is I/O rather than the algorithm itself.