Abstract
Identification of cancer subtypes is a pivotal step for developing personalized treatment. Specifically, subtyping based on changes in RNA splicing has been motivated by several recent studies. We thus develop CHESSBOARD, an unsupervised algorithm tailored for RNA splicing data that captures “tiles” in the data, defined by a subset of unique splicing changes in a subset of patients. CHESSBOARD allows for a flexible number of tiles, accounts for uncertainty of splicing quantification, and is able to model missing values as additional signals. We first apply CHESSBOARD to synthetic data to assess its domain specific modeling advantages, followed by analysis of several leukemia datasets. We show detected tiles are reproducible in independent studies, investigate their possible regulatory drivers and probe their relation to known AML mutations. Finally, we demonstrate the potential clinical utility of CHESSBOARD by supplementing mutation based diagnostic assays with discovered splicing profiles to improve drug response correlation.
Similar content being viewed by others
Introduction
Analysis of RNA sequencing (RNAseq) data from large patient cohorts is commonly used to reveal transcriptomic variations that are associated with complex disease. Within the framework of machine learning, such analysis can be framed as either supervised or unsupervised learning tasks. In a supervised setting, the objective is usually to identify transcriptomic variations that act as predictive markers for disease state or are strongly correlated with clinically significant variables^{1,2,3}. Unsupervised analysis typically involves identifying latent substructures in the data which can be used to learn more about disease etiology, such as cancer subtypes^{4,5}. One approach to quantify changes in the transcriptome is at the level of alternative splicing (AS). AS is the process by which different segments of premRNA can be removed while others are joined, or spliced, together to form mature mRNA. AS is regulated by intricate interactions between hundreds of RNA binding proteins (RBPs) and is thus highly susceptible to diseasecausing disruption, especially in cancer^{6,7,8}. Given the strong association between splice variants and disease, we propose an unsupervised learning algorithm for identifying substructures in a matrix of RNA splicing measurements from cancer patients.
The focus on identifying substructures in RNA splicing cancer data is motivated by several additional observations. First, in cancers such as acute myeloid leukemia (AML), the mutation burden is particularly low such that analyzing genetic mutations alone is insufficient for explaining disruption of key oncogenic pathways^{9}. Instead, several works have pointed to splicing aberrations which can severely perturb the function of regulatory proteins involved in apoptosis and cancer suppression^{10,11}. Second, many cancers have been shown to be heterogeneous, with patients exhibiting high variability in splicing measurements. While some of this variability is likely due to confounders such as batch effects, recent studies have shown that this variability can result from mutations which seldom appear in a large fraction of the patients^{12}. Rather, they are observed in small subsets of patients with both cis acting effects due to mutations at splice sites and trans acting effects due to mutations in splicing regulatory machinery^{13}. These observations motivate the derivation of a dedicated method for identifying “tile” substructures in the RNA splicing data matrix, i.e., subsets of patients that exhibit distinct splicing changes in a subset of genes.
Splicing variations derived from RNAseq are commonly defined at the level of whole transcripts or at the level of AS “events”. Transcript based approaches rely on estimating the abundance of whole isoforms (e.g., RSEM^{14}, SALMON^{15}, Kalisto^{16}) or relative isoform usage (e.g., MISO^{17}, BANDITS^{18}). In contrast, methods such as MAJIQ^{19}, SUPPA2^{20}, rMATS^{21}, quantify splicing of events, or local splicing variations (LSVs), such as cassette exons. These local splice variations measure splicing as the ratio of RNA segment (e.g., exon) inclusion (defined as percent spliced in or Ψ ∈ [0, 1]) for isoforms that contain the segment vs. isoforms that do not. While quantifying whole isoforms is clearly appealing, we focus here on event based splicing quantification. Some advantages of using AS events include the fact they do not require apriori assumptions about the underlying isoforms, can handle unannotated (de novo) isoforms which is crucial in cancer analysis, and can be directly validated with orthogonal methods (e.g., RTPCR).
Despite the above advantages, there are several modeling challenges associated with unsupervised tile identification using event based quantifications. First, splicing measurements are inherently different from gene expression measurements which are modeled as Gaussian (TPM) or negative binomial (read counts) distributions in many previous works^{22,23,24,25,26}. In contrast, Ψ is bounded in the [0, 1] interval, with inclusion levels commonly exhibiting a U shape distribution, favoring either high or low exon inclusion values. Furthermore, Ψ quantifications typically involve only a small subset of reads that span splice junctions, thus accounting for uncertainty of Ψ estimates is important. Finally, when identifying substructures in cancer RNA splicing measurements, it is important to address the inherent heterogeneous nature of the data and natural variations between individuals. Specifically, global patterns across rows (Ψ for specific AS events) or columns (patients) are unlikely. Instead, only a small subset of LSVs may be perturbed in a subset of samples.
Another important challenge we address here, which has implications beyond the analysis of RNA splicing data, is the modeling of missing values. Genomics data often contains missing values that result from technical limitations in sequencing technologies and are assumed to be missing completely at random (MCAR). Under this model, the missingness rate does not depend on observed or unobserved values and can be imputed or ignored^{27}. However, in RNAseq data, the missingness rate is inversely proportional to the sequencing depth where higher read coverage results in a lower probability of missingness. Furthermore, splicing quantifications, unlike expression measurements, cannot be meaningfully imputed since a missing Ψ quantification can be denoted by a 0 or 1 representing alternate junction usage. Thus, naive imputation (e.g., mean) can lead to unlikely intermediate values. This necessitates an alternate model in which values are MNAR (missing not at random). Under this model, the missingness rate depends on observations in the data matrix and external factors such as coverage. In cancer data specifically, values can also be systematically missing due to genetic mutations which could result in a specific junction not being observed (e.g., mutations near splice sites) and should be modeled as a secondary signal.
Here we address the above modeling challenges, by developing CHESSBOARD (Characterizing Heterogeneity of Expression and Splicing by Search for Blocks of Abnormalities and Outliers in RNA Datasets). CHESSBOARD is a Bayesian tile finding algorithm tailored for splicing data with missing values and includes a suite of data processing and visualization tools (Fig. 1). The input consists of a matrix of junction spanning reads counts to account for uncertainty in splicing quantifications (Fig. 1a). The algorithmic task is to identify splicing patterns in the form of tiles in this matrix (Fig. 1b). This is achieved by first employing model based prefiltering to remove irrelevant LSVs and reduce and data size (Fig. 1c, left). Next, CHESSBOARD’s nonparametric Bayesian tiles model is fit to the data using efficient blocked Gibbs sampling (Fig. 1c center). Finally, posterior summary statistics can be visualized to perform downstream analysis (Fig. 1c right).
We first apply CHESSBOARD to synthetic datasets to show it outperforms several baseline methods and validate the effectiveness of our modeling approach. Next, we show that CHESSBOARD recovers tiles characterized by splicing aberrations which are reproducible in multiple AML patient cohorts. Finally, we show that tiles we discover are correlated with drug responses, pointing to the translational potential of our findings. We also develop GAMBIT (Graphical Annotated Map for Basic Inspection of Tiles), a webbased visualization tool which allows users to visually explore the discovered tile structures and Bayesian output. Both CHESSBOARD and GAMBIT are available as open source tools to facilitate reproducible workflow and analysis.
Results
CHESSBOARD robustly models alternative splicing and missing values to discover tile structures in large heterogeneous datasets
To address the challenges of analyzing heterogeneous AS datasets, CHESSBOARD directly models properties of the data that arise from biological and technical processes. Briefly, the model’s input data matrix X contains the number of junction spanning reads x_{ij} mapped to the representative (i.e., most variable) splice junction of LSV j in sample i and the total number of reads mapped to the LSV, denoted η_{ij} (more input details in Supplementary Note 1.2). Under CHESSBOARD’s model, naturally occurring splicing variations in each LSV are captured by a (learned) mixture of a Betabinomial distribution over each x_{ij} and a binomial distribution (defined by missingness rate θ_{j0}) for having a missing value. This mixture distribution over observed and missing values captures the background. In specific patient subsets however, additional variation or signal in underlying Ψ (captured by a separate Betabinomial distribution over observed values) or an elevated missingness rate (captured by a separate θ_{j1}) may be observed. Thus, the first part of the CHESSBOARD pipeline is to filter out noninformative splicing events which can be captured well by the background distribution. This is achieved using a parametric bootstrap KolmogorovSmirnov test (Supplementary Note 2.1). Then, for the remaining LSVs in the data matrix, CHESSBOARD aims to find latent “tiles” in which multiple LSVs deviate from the background in the same subset of samples. In practice, this means that every sample i belonging to tile k has its (unknown) group indicator variable set c_{i} = k and every LSVs j belonging to this tile has a matching (also unknown) indicator variable r_{jk} = 1. A specific CHESSBOARD model is represented by a learned tile configuration and distribution parameters for all the background and signal groups. Under this Bayesian formulation, every such model can be assigned a posterior probability, and the CHESSBOARD algorithm uses an efficient blocked Gibbs sampling procedure to sample from the posterior distribution over possible models given the observed data matrix X. See Methods for a detailed description of the CHESSBOARD model.
In this section we demonstrate the utility of the CHESSBOARD model formulation described above. First, we show that CHESSBOARD accounts for uncertainty in splicing measurements due to low sequencing coverage. Specifically, CHESSBOARD uses a beta binomial distribution which attributes higher variance to LSVs with low coverage, capturing increased uncertainty in their underlying Ψ. To assess whether this model is advantageous for estimating variability in splicing data, we simulate Ψ values from a Beta distribution modeling low exon inclusion (Beta(10, 90)) and generate reads for each Ψ at various coverage levels from a binomial distribution (Fig. 2a). We compute the empirical variance using MLE (maximum likelihood estimation) under the CHESSBOARD model and two alternative models: a Beta model which also functions on a domain of [0,1] analogous to Ψ values and a Gaussian model which represents a generic approach. This analysis shows the error in the variance estimation is lowest for the Beta Binomial model and all 3 models converge at about 50 reads. However, in real life datasets the majority of quantifiable LSVs have read coverage significantly lower than 50. For example, in the beatAML and TARGET datasets used in this study (see Fig. 2a as red and blue histograms), 38% and 88% respectively have coverage below this level, indicating a substantial portion of the data benefits from CHESSBOARD’s modeling.
CHESBOARD’s model further alleviates the effect of coverage dependent uncertainty in heterogeneous data by sharing information across samples. Specifically, CHESSBOARD uses empirical Bayesian shrinkage to learn group specific priors, taking advantage of samples with higher coverage assigned to the same cluster to improve estimates for samples with lower coverage (Methods). To demonstrate the advantages of CHESSBOARD’s modeling approach we generated Ψ values from Beta(10, 90) and read counts from each Ψ at varying levels of coverage as before. The results shown in Fig. 2b demonstrate that indeed there is lower error in Ψ estimates in samples with lower read counts when estimates are shrunken to the group mean compared to computing \(\hat{\Psi }\) without prior information. Furthermore, we show that shrinkage significantly increases correlation of \(\hat{\Psi }\) and the true value of Ψ, especially in samples with low coverage. As denoted in Fig. 2c, \(\hat{\Psi }\) for darker data points representing low coverage samples is closer to the ground truth with group shrinkage (right) compared to individual quantification (left). Together these experiments show that CHESSBOARD’s generative BetaBinomial model acting on read counts can substantially improve analysis of splicing data by accounting for uncertainty in the RNAseq measurements.
Next, we turned to assess CHESSBOARD’s missing values modeling. To account for missing values, CHESSBOARD uses a MNAR model where missing values are treated as a secondary signal when the missingness rate of an LSV is much higher than expected under the null missingness rate associated with sequencing limitations. We first replace each unquantifiable splicing event with a missingness indicator. We then estimate priors for the missingness rates using an Empirical Bayes procedure (Methods). During CHESSBOARDs model fit, we obtain posterior estimates for both the background and signal missingness rate, where the latter can account for other factors such as unobserved values due to mutations (Methods). We show that the MNAR missing value model is effective in identifying tiles containing missing value signals. For comparison, we implemented an alternative version of CHESSBOARD that uses the MCAR model assumption where missing values are integrated out. Both CHESSBOARD and CHESSBOARDMCAR were then applied to a simulated homogeneous data matrix in which the read counts for each LSV (row) were drawn from a background distribution with parameters estimated from beatAML and values were missing at a fixed dropout rate of 10%. To these we added a single tile of varying size with a missingness rate of 60%. We then assessed the algorithm’s ability to recover this tile by information gain which measures the purity of the clusters (Supplementary Note 2.2). Fig. 2d shows that the MCAR model was unable to identify this tile (information gain close to 0) regardless of tile size, as it relies solely on the observed Ψ values to identify tiles. The MNAR model is able to effectively recover the tile, but as expected the information gain decreases as the tile size decreases. When the size of the tile increases to 50 LSVs, the information gain reaches a maximum.
After assessing the CHESSBOARD modeling components individually, we turned to assess its ability to recover tiles. For this, we generated synthetic splicing data modeled based on statistics collected from the BeatAML dataset (Supplementary Note 2.3). For comparison, we also ran two commonly used algorithms for biclustering as baselines: singlelink hierarchical biclustering (HBC) and spectral coclustering (SCC)^{28}. Since these algorithms lack some of CHESSBOARD’s features, both were given data with scalar Ψ and no missing values to fit their input definition. We also ran both with the correct number of clusters given as input. Since CHESSBOARD learns the number of clusters using an infinite mixture modeling approach with a Chinese Restaurant Process (CRP) prior (Methods), we first evaluated the behavior of this feature under various parameters (Supplementary Note 2.4). Then the ability of the algorithms to recover tiles was evaluated using a tile precision (τ_{pr}) and recall (τ_{rc}) statistic adapted from the recovery and relevance score^{25} (Supplementary Note 2.2). Intuitively, these scores identify the tile in the test set that maximizes precision or recall with respect to each of the reference tiles, then average the precision or recall across the tiles. In addition, we evaluated sample group clustering using adjusted rand index (ARI) (Supplementary Note 2.2). The results in Fig. 2e show that all algorithms were able to recover sample groups well (ARI > 0.9). This result is to be expected given the strong group signal (number of changing LSV, magnitude of change) in the original data (see BeatAML analysis below) and the fact the baseline algorithms were initialized with the exact cluster number. However, CHESSBOARD significantly outperformed the baseline algorithms in recovering the exact tiles, achieving τ_{pr} = 1.00, τ_{rc} = 0.98 compared to τ_{pr} = 0.33, τ_{rc} = 0.68 for HBC and τ_{pr} = 0.78, τ_{rc} = 0.55 for SCC.
CHESSBOARD discovers reproducible tiles in AML data which correlate with cancer associated regulators
Having established strong performance of the CHESSBOARD model on synthetic data, we applied it to several primary leukemia sample datasets to discover tiles that correspond to cancer associated regulators. We ran the standard CHESSBOARD pipeline (Supplementary Note 3.1) on the beatAML^{12} dataset (samples = 477, LSVs = 2299) (Supplementary Data 1). The algorithm detected a single large tile consisting of 217 samples and 1910 LSVs (Fig. 3a). Confidence in the predicted tile structure was high with most probabilities of sample assignment to the tile cluster and LSV assignment to the signal distribution being close to 1 (Supplementary Fig. 4a). To confirm whether this tile constitutes a real biological signal, we first assessed its reproducibility in Penn HTSC, an independent inhouse dataset consisting of 77 adult AML samples. We trained the CHESSBOARD model on a random subset of the beatAML cohort (samples = 400) and used this as a predictive model to predict the tile assignments of the held out beatAML samples (samples = 77) and Penn HTSC (samples = 77) samples (Supplementary Note 3.3)). We used MOCCASIN^{29} to account for confounding factors between the datasets. The prediction yielded a similar tile structure in the Penn HTSC dataset (Fig. 3b). Furthermore, the median(ΔΨ) (change in Ψ) of LSVs belonging to the tile between the 2 groups in each dataset are highly correlated (r = 0.779) suggesting that the splicing perturbations captured by the tile are similar in both datasets (Fig. 3c). Sample likelihoods were also comparable between the held out and external data indicating that the model has similar confidence in the tile structure predictions (Supplementary Fig. 4b).
Having established the reproducibility of the AML splicing tile in two independent cohorts, we then turned to investigate potential mechanisms for formation of this tile. First, we tested whether the identified tile was enriched for differentially spliced junctions that are coregulated by RNA Binding Proteins (RBPs). Intersecting the tile’s differentially spliced junctions with those observed as differentially spliced in ENCODE’s RBP knockdown experiments implicated 17 RBPs, all of which were either differentially expressed or spliced between the signal and background patient groups (Supplementary Note 3.4). Put together, all 106 RBPs considered in the analysis affected approximately 11.75% of the junctions in the signal tile. Notably, two RBPs with the most significantly enriched DS junction overlap include SRSF1 (2.48%) and U2AF2 (1.54%), both of which have known roles in promoting expression of antiapoptotic isoforms of oncogenes in several hematopoietic maglicanies^{30}. Another candidate splicing regulator which appeared to be differentially expressed and spliced between the signal and background groups is HNRNPC (2.44%), which has been implicated in AML in a recent study^{31}. Next, we analyzed eCLIP data for each RBP to test whether there was evidence for direct RBP binding around the tile’s differentially spliced junctions. We observed high binding rates for SRSF1 and U2AF2 (>4% of tile junctions). However, the binding rate was lower compared to spliceosome components including AQR, SF3B4, PRPF8 and EFTUD2 which are known to bind spuriously to constitutive splice sites. Surprisingly, almost no binding was observed for HNRNPC. For SRSF1 specifically, there was also significant enrichment of CLIP binding to junctions that were also DS suggesting direct splicing regulation by SRSF1 (Fig. 3d).
Interestingly, SRSF1 itself undergoes AS whereby one isoform includes exon 4 for the production of the full protein while the other skips exon 4, resulting in a transcript that contains a premature termination codon that is targeted for nonsensemediated decay^{32}. We thus assessed whether variation in SRSF1 exon 4 splicing between the 2 clusters corresponds to splicing variations in its known targets. Observed differences in SRSF1 splicing between the signal and background occurred almost exclusively at exon 4 (Supplementary Fig. 4c). The background cluster had a higher rate of inclusion for exon 4 (Ψ = 0.759) compared to the signal cluster (Ψ = 0.490) and higher expression of the functional transcript (log2FC = 1.16). This suggests there is higher expression of the productive isoform in the background cluster due to lack of NMDinduced degradation. Over expression of SRSF1 has been associated with aberrant splicing of several apoptotic factors in cancer^{32,33}. We analyzed several cancerassociated genes with experimentally verified splice variations that are affected by SRSF1 overexpression. Notably, BIN1 exon 12a inclusion is upregulated in the background and is associated with antiapoptotic processes^{33}. Furthermore, exon 36 inclusion in CASP9 is upregulated in the background and is associated with proapoptotic processes^{34} (Supplementary Fig. 4c).
CHESSBOARD offers a gene ranking method that implicates mTORC signaling in identified differentially spliced gene set
In order to more broadly assess whether the identified tiles correspond to known biological functions, we performed a gene ontology analysis of biological processes with the genes harboring LSVs in an extended tile containing all DS LSVs between the two clusters. The analysis shows that genes with differential splicing in the tile are enriched for roles related to general functions commonly found in cancer transcriptomics studies such as gene expression and transcription, RNA processing, and posttranslational modifications (Supplementary Fig. 4d). However, we also found that a subset of genes participate in stressrelated cellular responses, including regulation of cholesterol/lipid storage and MAPKsignaling (Supplementary Fig. 4d highlighted in red) suggesting that samples from the two clusters exhibit different cellular stress profiles.
To further explore possible tile characterization, we sought to use gene set enrichment analysis (GSEA) to identify similar pathways in the tile gene sets. Since GSEA requires ranking genes within a group we developed a probabilistic ranking method based on the CHESSBOARD model which account for both splicing changes and missingness rates. Specifically, we score each LSV based on the likelihood gain achieved in the learned tile configuration compared to an inverted tile configuration and then used the score for the highest scoring LSV in each gene as input to GSEA (see Supplementary Note 3.5 for details). Indeed, GSEA revealed an enrichment of differentially spliced genes in the tile among the hallmark mTORC1 signaling gene set (Fig. 3e), a signaling pathway centrally involved in stress response^{35}. Drawing from experimentally validated interactions extracted from the Ingenuity Pathway Analysis software, we confirmed that several genes that harbor highranking LSVs in the tile interact directly with mTORC1 or one of its direct regulators, and that many of these genes activate mTOR signaling, although it is unclear how the splicing variations that we observe might affect the function of the proteins (Supplementary Fig. 4e). Collectively, these results suggest that the main tile structure CHESSBOARD identified in the BeatAML data represents a highly reproducible and biologically relevant AML subtype.
CHESSBOARD enables scalable recursive clustering to discover alternate subtype definitions
Although CHESSBOARD was able to successfully discover a tile corresponding to an AML subtype characterized by a specific set of splicing events, other subtype definitions may exist. Alternative tile structures representing these subtypes can emerge when the inclusion of additional features or exclusion of selected features alters the amount of evidence supporting existing tile boundaries. Intuitively, a tile can be interpreted as a collection of correlated transcriptomic signatures that each capture a misregulated biological process. For example, the tile discovered in the previous section is partially explained by misregulation of RBPs. Removal of a signal dominated by certain processes can lead to discovery of tiles characterized by misregulation of orthogonal, possibly less pronounced (in terms of number of splicing changes and their magnitude), pathways where other splicing perturbations exist in a different subset of patients. Similarly, LSVs not present in the initial prefiltered data matrix may provide additional support for such tiles. To address this scenario we developed a recursive clustering solution that naturally fits into CHESSBOARD’s probabilistic model and serves as a scalable means to probe the entire transcriptome (Supplementary Note 4.1). In short, our approach iteratively reclusters the LSVs that are not assigned to a tile and after each recursive step, tests for termination by assessing the likelihood ratio of the tile model to a null model in which tile structure is removed. This null model can be interpreted as a distribution over tile structures in datasets where tiles are not expected to occur. Furthermore, the result of each recursive step can be extended to the whole transcriptome using MAJIQ or similar tools for differential splicing analysis between the sample groups identified by CHESSBOARD.
We performed recursive clustering on the beatAML dataset using this approach, treating the result from the previous section as the base case. The first recursive step yielded a smaller signal tile (samples = 196, LSVs = 389) corresponding to a different subset of the patients (ARI = 0.0066 between recursive and base case sample clusters) (Supplementary Data 2). In addition, the new cluster had high correlation with several known AML mutations (Fig. 4a). Specifically, we observed a significant permutation test p value for enrichment of mutations in FLT3ITD (p < 0.001), NPM1 (p < 0.001), and CEBPA (p = 0.025) in patients assigned to the tile cluster (Supplementary Note 4.2). Mutations in these 3 genes are associated with normal karyotype AML which is a known subtype of the disease^{36}. We observed a fourth association of the cluster with mutations in NRAS (p = 0.025), but mutations in this gene were actually depleted in the tile samples. Continued recursive tile discovery showed a sharp decrease in the likelihood ratio (Fig. 4b) indicating there are no more significant tiles in the matrix to discover.
CHESSBOARD identifies tiles that correlate with drug responses
To demonstrate the translational utility of conducting a CHESSBOARD analysis, we assessed whether tiles discovered by the algorithm correlate with patient response to therapeutics. We ran the CHESSBOARD pipeline on the beatAML dataset again but now limited the analysis to only LSVs in 70 AML associated genes (Supplementary Data 3). These genes have been identified as commonly mutated, truncated or translocated in AML patients^{36,37} and their mutational status is used by clinicians to decide on drug administration. This targeted tile finding approach based on known gene sets is motivated by several observations. First, we demonstrated above that a transcriptome wide approach can be dominated by signals orthogonal to pathways inhibited by a drug. Second, and as we show below, CHESSBOARD’s unsupervised approach can detect splicing signals not directly captured by the mutational landscape in such AML associated genes. Finally, as demonstrated in Rivera et al. 2021^{10}, clustering splicing changes across those 70 genes gives rise to clear groups and several candidate regulators.
Our splicing analysis of the 70 AML associated genes recovered 2 clusters (Fig. 5a) with resulting patient subgroups similar to the original clustering (ARI = 0.958). This result is notable since the LSV set used for this analysis was significantly different, with shared LSVs constituting only 0.57% of the original LSV set and 14.4% of the current set. This result suggests the splicing changes in AML related genes are part of perturbations to pathways captured in the original, unbiased, LSVs tile finding.
Since the mutational status in many of these AML associated genes is used by clinicians to decide on drug administration we correlated the samples belonging to each tile with drug response measured by area under the IC_{50} curve (AUC). Details about this measurement are discussed in Supplementary Note 5.1. We observed strong correlations between drug response and the tiles, and noticed the tiles included aberrant splicing in many gene targets of the most correlated drugs (Supplementary Data 4). We therefore first tested whether our splicing based patient stratification can serve as good predictors of drug response. Specifically, we computed for each drug the percent of AUC variance that can be explained (Supplementary Note 5.2) by CHESSBOARD’s discovered sample groupings compared to that explained by known mutations. However, this analysis conclusively found the variance explained by the patient subgroups to be relatively low, maxing at 6.7% compared to a much higher percentage for known mutations and drug combinations (Fig. 5b). Specifically, the variance explained by FLT3 mutation is highest for Gilteritinib and FLT3ITD is highest for Sunitinib/Sorafenib which are known mutationdrug associations.
Next, given the possible functional consequences of splicing changes between the identified tiles in many AML associated genes, we hypothesized that our splicing based patient grouping could improve clinical decisions that are based on mutation analysis alone. Notably, the added value of splicing changes to AML classification has been shown recently for FLT3ITD and NPM1 for FAB classification AML genes^{38} as well as RUNX1 and SF3B1^{39}. To assess usefulness of combining splicing changes and mutations to predict drug response, we prioritize FLT3ITD and Sorafenib and NPM1 and Venetoclax due to reliable mutation calls and their prominent role in AML clinical diagnosis. Specifically, we developed a simple decision tree (Fig. 5c) combining both FLT3ITD and patient subgroups which increased the AUC variance explained by FLT3ITD for Sorafenib from 26.0 to 36.8% (Fig. 5d). We also looked at median change in AUC and IC_{50} fold change (FC) to confirm that the effect size differences between the groups is biologically meaningful. Accordingly, using FLT3ITD alone had median(ΔAUC) = 64.36 and Log3FC = 2.09, while the combined classification had a median(ΔAUC) = 76.29 and Log3FC = 2.73 (p = 0.034 by permutation test, see Supplementary Note 5.3). Similarly for NPM1, using the NPM1 mutation status alone had a median(ΔAUC) = 23.34 and Log3FC = 0.86, while the combined classification had a median(ΔAUC) = 57.91 and Log3FC = 2.65 (p = 0.048).
CHESSBOARD’s identified tiles include splicing changes in AML drugs’ target genes
To assess potential mechanisms by which CHESSBOARD tiles explain drug response as described above, we looked for specific targets of Sorafenib in the tile that were differentially spliced between the 2 groups. Sorafenib is commonly used as a treatment for AML patients with a FLT3ITD mutation and functions as a tyrosine kinase inhibitor with high specificity for FLT3. At a molecular level, FLT3ITD result in constitutive activation of receptors that lead to downstream activation of PI3K/AKT/mTOR, Ras/Raf/MEK/ERK, and JAK/STAT pathways. This activation in turn results in enhanced proliferation and reduced apoptosis of the myeloblasts, which contribute to leukemogenesis^{40}. Inline with this known mechanism, we observed multiple LSVs in FLT3 that were differentially spliced between the patients’ subgroups. We therefore analyzed several specific splicing events in FLT3 to determine if there was enrichment of splice isoforms in the signal that may lead to reduced transcript viability and thus higher sensitivity. Indeed, for FLT3 we identified two differential splicing events involving skipping of exon 4b (p = 1.13e−43, ΔΨ = 0.110) and exon 17b (p = 4.34e−33, ΔΨ = 0.115) that are highly correlated (r = 0.858) and have a higher skipping rate in the background. Notably, exon 4b has not been previously reported (de novo exon and junctions) and both events have not been previously reported with respect to AML to the best of our knowledge. Skipping both of these exons results in the functional canonical isoform of FLT3 which was correlated with an increase in expression of FLT3 (Fig. 5e). In contrast, inclusion of this exon introduces a PTC or frameshift in the alternate isoform.
Taken together, the above analysis suggests that there is overexpression of FLT3 in the background cluster due to constitutive expression of canonical FLT3 and failure of regulatory systems to induce NMD. Noting that Sorafenib is a tyrosine kinase inhibitor, which includes FLT3, an increase in the concentration of its target would therefore be expected to increase drug sensitivity, as the splicing changes we detected could mimic the gain of function effect of FLT3ITD. Inline with this mechanistic hypothesis, we find that when combining the splicing signals and FLT3ITD status, the group of patients that were FLT3ITD negative and assigned to the signal group had much worse responses than patients that were FLT3ITD positive and assigned the background tile (Fig. 5d). Indeed, this latter group of patients has significant enrichment of patients (55/66 patients, p = 8.95e−8) with constitutive FLT3 canonical isoform expression, defined as inclusion of both events being > 0.9. In contrast, the patients that were FLT3ITD and in the signal group showed enrichment for intermediate canonical isoform levels (152/159 patients, p = 4.59e−22).
Finally, in another investigation of tile associated splicing changes in AML genes we observed high enrichment of missing values in the background cluster for EZH2. The change in inclusion levels was not particularly large, yet there was over a 15% increased missingness rate of two EZH2 LSV in the background cluster (Fig. 5e). One of these events corresponds to an event recently reported by Rivera et al. 2021^{10} and validated to introduce a PTC that induces NMD and results in reduced protein levels. The other splicing change we identified introduces an unannotated exon into the highly conserved WD domain of the protein. This suggests there was rapid degradation of the transcript making it more difficult to sequence which in turn resulted in elevated missing values. In summary, our analysis of CHESSBOARD’s tile with respect to drug response indicated that the RNA splicing tiles correlate with AML specific drug responses and offer insights into potential underlying mechanisms captured by both changes of Ψ and missing values.
CHESSBOARD finds more complex tile structures in other leukemia datasets
While our analysis focused on adult AML, we also applied CHESSBOARD to several other datasets and disease to demonstrate CHESSBOARD’s general utility for splicing pattern discovery. First we applied CHESSBOARD to a joint dataset (samples = 1089, LSV = 2965) consisting of TARGET pediatric AML and beatAML samples (Supplementary Data 5). Studies have show that there are many genetic differences between pediatric and adult AML^{41}. However the mutation burden in pediatric AML is lower suggesting that alternative diseasecausing modalities should be investigated. Specifically, LSVs that are included in tiles that are enriched for samples of a single disease type can be used to distinguish the diseases at the transcriptomic level. On the other hand, LSVs which appear in tiles with mixed sample composition represent splicing variations that are shared between diseases. CHESSBOARD discovered 5 clusters in this dataset. Notably, tiles segregate by disease with C1, C2, and C4 representing pediatric AML and C3 and C5 representing adult AML (Fig. 6a). However a subset of LSVs are unique to adult (green) and pediatric (blue) AMLs respectively. Other LSVs are either shared between subtypes of each disease type (yellow) or unique to only a single subtype of a disease (purple). Many of these splice variations occur in genes that are commonly differentially mutated between pediatric and adult disease types^{42,43}.
Next we applied CHESSBOARD to TARGET BALL (Bcell Acute Lymphoblastic Leukemia) data (samples = 517, LSVs = 1562), a markedly different type of leukemia characterized by proliferation of lymphoid blasts in the bone marrow (Supplementary Data 6). We recovered five clusters with a distinctively more complex tile structure compared to the result on the beatAML dataset (Fig. 6b). Of note, one identified subgroup was enriched for patients which are RUNX1ETV6 fusion negative who also have high relapse rates. This mutation is often used as a positive prognostic marker which suggests the splicing signature associated with this tile can be used in a similar manner^{44}.
Discussion
There is increasing evidence for the pathogenicity of splicing aberrations in heterogeneous cancers such as AML and BALL^{10,11,45,46}, pointing to a need for methods dedicated to unsupervised discovery of splicing based disease subtypes. Here, we develop CHESSBOARD, a method which offers several contributions to the densely populated area of clustering and missing value modeling. Specifically, previous works on tile finding and biclustering approaches were either not domain specific^{23,24} or tailored for other data modalities such as gene expression and genetic mutations^{22,25,26}. Consequently, these algorithms do not consider crucial characteristics of heterogeneous splicing cancer data such as the uncertainty in splicing quantifications and missing values. We demonstrate here using both synthetic and real data, the usefulness of modeling these data characteristics. Furthermore, CHESSBOARD’s MNAR model could also be applicable in domains well beyond RNA splicing or even clustering, for example in algorithms for dimensionality reduction such as sparse probabilistic PCA or factor analysis^{47,48}.
Beyond the CHESSBOARD model, we also implement several additional algorithms and tools to enable more extensive exploration of the data. First, we developed a prefiltering and recursive clustering method to facilitate analysis of the entire transcriptome. We then used the recursive clustering to discover alternate AML subgroups definitions which strongly correlated with mutations in key AML genes. Second, we implement a LSV ranking system to enable prioritization of driver genes for use in downstream analysis like GSEA. This system is unique in that we can rank LSVs based on differences in Ψ distribution and enrichment of missingess value signals. Finally we implement the CHESSBOARD algorithm and all analytical tools in a Python package. The package is accompanied by an online interactive visualization tool called GAMBIT that enables users to manually inspect the LSVs and samples contained in each tile.
While we applied CHESSBOARD to several leukemia datasets, we focused on beatAML as it offered both a large set of samples and drug response measurements. In beatAML, we found a single strong “signal” tile that divided the dataset into two main subgroups of patients which were highly reproducible in an independent dataset. Investigating possible splicing factors which may form these tiles, we find that SRSF1 is a key regulatory factor and affects the splicing of 2.49% of the junctions in the tile through direct binding. However it is important to note that taken together all of these RBPs can still only explain 11.75% of the splicing variations in the observed tiles. This arguably low fraction could be due to a myriad of reasons, including the difference between ENCODE’s cell lines and tumor specimens, the limited number of RBPs served by ENCODE, and inherent noise in the CLIP and KD assays.
When we investigated possible functional consequences of the two BeatAML subgroups, we found that the genes containing events differentiating the groups were enriched for genes in the mTORC1 pathway. As mTOR is frequently activated in cancer in a manner that affects drug susceptibility^{49,50}, our clusters might reflect variations in cellular metabolism that could alter drug susceptibilities in AML samples. Furthermore, we suspect that the signal tile corresponds to a subtype of AML that may be less adverse (see Supplemental Note 7.1 and Supplementary Fig. 7). Inline with this hypothesis, we find that the background tile was characterized by SRSF1 misregulation which affects several oncogenes including BIN1 and CASP9 in the tile.
We demonstrated the utility of CHESSBOARD’s recursive clustering by detecting an alternative tile in the BeatAML data which correlated with FLT3ITD, NPM1, and CEBPA mutations, defined together as normal karyotype AML^{36}. The discovery of this known subtype points to the power of recursive clustering. By removing the dominant signal driven by splicing variations caused by misregulation of RBPs/SFs, we enabled further discovery of an alternate tile structure associated with a different AML subtype characterized by weaker splicing signals but a strong mutation signature. We then demonstrated the clinical utility of CHESSBOARD by analyzing correlation of tiles with drug response data. Notably, we found that while mutations were better predictors of drug response than splicing signals, combining the two yielded a better prediction overall, specifically for FLT3ITD and Sorafenib and NPM1 and Venetoclax. An interesting hypothesis related to these results is that Sorafenib sensitivity may have been reduced by enrichment of the PI3K/mTOR pathway in the signal group as suggested by previous work^{51}. Indeed, such a connection between Sorafenib and the mTOR pathway has also been observed in hepatocellular carcinoma where treatment with Sorafenib in patients with increased PI3K/mTOR pathway activity results in reduced relapse rates^{52,53,54}. A similar effect has recently been observed in AML patients too^{55}.
There are several limitations in this study which are important to highlight. Specifically, the narrow IC_{50} concentration ranges used in the beatAML experiments limited fitting of sensitivity curves and thus we had to use AUC as a proxy for sensitivity. Furthermore, despite the many advantages of the CHESSBOARD model, we make several modeling assumptions that could be improved upon. For example, CHESSBOARD assumes there is only a single signal distribution. In many scenarios, there can be multiple sources of heterogeneity that lead to signal distributions that are a mixture of Beta distributions. We note though that in practice, given the noisy nature of splicing and its quantification from limited read counts, we did not find many clear LSV cases in the data used here that would justify the additional complexity beyond a two component mixture model of signal vs. background.
In summary, we developed CHESSBOARD, the first RNA splicing tailored algorithm for signal detection in heterogeneous RNAseq datasets. We showed its applicability on several leukemia datasets, connecting the splicing tiles discovered to potential regulators, drug response, and known pathways. Although we present a model of splicing, CHESSBOARD can be easily adapted for alternate datatypes such as expression and multiomics data integration using a multiview model^{56}. We also hope the research community will take advantage of the open source code and apply CHESSBOARD to many other analysis tasks in large, heterogeneous cancer datasets, pushing further our understanding of the role of splicing in complex disease.
Methods
Filtering
To enable analysis of large datasets, CHESSBOARD uses a prefiltering pipeline to select LSVs of interest followed by a recursive clustering procedure. This twostep process allows the algorithm to analyze the most potentially interesting splicing events at a high resolution by removing noisy events that could potentially confound true signals. The filtering pipeline is detailed below:

Remove LSVs that correspond to lowly expressed genes. We quantified gene expression using Salmon and aggregated transcript level quantifications into gene level quantifications by summing the TPMs. Any LSV corresponding to a gene with TPMs in the lowest 5th percentile was removed from further analysis.

Remove LSVs with high missingness rates. A LSV is considered quantifiable if at least 10 reads are observed as being mapped to its splice junctions. Any LSV that is not quantifiable in more than 80% of the samples is removed. Note that we allow for such a high missingness rate because the algorithm is designed to handle missing values.

Select highly variable LSVs. For each LSV j in sample i, we compute the variance across all samples \({\sigma }_{j}^{2}=\mathop{\sum }\nolimits_{i}^{N}\frac{{({\Psi }_{ij}{\mu }_{j})}^{2}}{N}\). We construct the empirical CDF of variances and choose a cutoff based on where the graph plateaus. This procedure selects for approximated 1500–2500 events in our datasets.

Select for LSVs with a bimodal Ψ distribution. Intuitively, a mode that tends toward 0 or 1 with low variance is likely to represent a background distribution since most splicing events favor high or low inclusion. A mode with high variance favoring intermediate values is likely to represent an interesting biological signal that could explain disease state. To select for bimodal LSVs, we use the parametricbootstrap KolmogorovSmirnov test. Under this test, the null hypothesis H_{0} is the data was drawn from a single component beta distribution while the alternate hypothesis H_{1} is the data was drawn from multiple beta distributions. The steps for the test are as follows:

For each LSV j, fit a beta distribution to the observed Ψ values by obtaining the maximum likelihood estimates of α and β. Since there is no closed for solution for the MLE, we optimize it numerically.

Obtain the observed test statistic, the KolmogorovSmirnov D, using a 1 sample KS test with the observed data and the CDF of \(Beta(\hat{\alpha },\hat{\beta })\)

Given \(\hat{\alpha }\) and \(\hat{\beta }\), simulate B bootstrapped datasets.

For each bootstrapped dataset, estimate \({\hat{\alpha }}_{b}\) and \({\hat{\beta }}_{b}\) and compute D_{b}.

Compute the empirical p value of the test as the fraction of boostrapped test statics that are greater than the observed test statistic.

We then select all LSVs with p < 0.05. This can be interpreted as selecting LSVs that are multimodal with a 5% chance of being a false positive. If a lower proportion of false positives is desired, one could correct the false discovery rate using a procedure such as Benjamini–Hochberg.
Modeling observed splicing events
Consider a data matrix X_{n×m} with n columns representing patient samples and m rows representing AS events or LSVs. For a given sample i and LSV j, x_{ij} contains the number of junction spanning reads that are mapped to a splice junction of interest while η_{ij} denotes the total number of reads mapped to all junctions in the LSV (e.g two alternative 5′ splice sites of an exon). Under CHESSBOARD’s formulation, each sample has an unobserved label {c_{1}, c_{2}, …, c_{n}} which assigns it to a patients’ group or type \(k\in {{{{{{{{\mathcal{Z}}}}}}}}}^{+}\). Each such group k is defined by a vector r_{k} ∈ {0, 1}^{m} where m is the dimension of the vector. The assignment r_{jk} = 0 indicates LSV j is not part of the unique pattern of group k such that observed inclusion levels for this LSV in samples that do not belong to group k follow some (learned) background Ψ distribution. In contrast, r_{jk} = 1 indicates an abnormal splicing signal in LSV j across all samples belonging to group k. We thus formulate the generative process for each observed Ψ entry of the data matrix as
A plate visualization of this model is shown in Supplementary Fig. 8. A table documenting all variables is given in Supplementary Note 6.1. Under this model, the read rate of sample i in LSV j follows a Binomial distribution with a Beta mixture prior over the level of inclusion Ψ_{ij}. This BetaBinomial model naturally handles uncertainty in Ψ estimates since observations with low read counts will have higher variance. If observation x_{ij} is assigned to the signal in group k as denoted by c_{i} = k, r_{jk} = 1, its likelihood is evaluated using the signal prior distribution Beta(μ_{j1}, κ_{1}). Likewise, the observation is evaluated using the background prior distribution Beta(μ_{j0}, κ_{0}) when r_{jk} = 0. Notice that we reparameterize the Beta in terms of mean and variance using μ_{j} = α_{j}/(α_{j} + β_{j}) and \({\kappa }_{j}=({\mu }_{j}(1{\mu }_{j}))/{\sigma }_{j}^{2}\) where the concentration κ is inversely proportional to variance. This reparameterization enables the use of a Beta hyperprior over the mean of each mixture component to capture known biological behavior of AS. Specifically, normal splicing dynamics have a propensity toward high or low inclusion levels which can be modeled using the Jeffery prior α_{0} = 0.5, β_{0} = 0.5 and high concentration κ. Intermediate levels of inclusion modeled by a distribution with a long tail generally indicate aberrant splicing and can be modeled as a Beta distribution with α_{1} = β_{1} and low concentration. To control the number of tiles, we impose a L1 penalty with hyperparameter λ to induce sparsity in the number of groups for which LSV j is assigned to the signal distribution. Specifically, ∑_{k}r_{jk} ~ Exp(λ).
In most biological contexts, the number of tiles is unknown a priori. This can be modeled as an infinite mixture of groups using a Dirichlet Process prior with Bernoulli base distribution
where G_{0} is Bernoulli(δ)^{m} and α_{0} is the concentration parameter. A larger value of α will result in the discovery of more sample groups. In taking the limit of k, the distribution of c_{i} can be interpreted as a distribution of partitions of natural numbers which is usually formulated as the CRP or stick breaking process.
Modeling missing values
In gene expression data, missing values typically arise due to low sequencing coverage which results in some transcripts lacking any observable reads even if they are expressed. However in splicing data, a lack of junction spanning reads mapped to a transcript that is expressed indicates inclusion of an alternative exon. Many algorithms handle missing values under the MCAR model of missingness by integrating missing values out of the model^{27}. Under this model, the missingness rate of a feature does not depend on any observed or unobserved values. However, this is generally only a valid assumption in scenarios when the data generating instrument malfunctions such as a defective microarray probe. The missingness rate of transcriptomic quantifications from RNAseq is proportional to sequencing depth thus a model in which values are MNAR (missing not at random) will yield better estimates for the missingness rate. Under this model, the missingness rate of features depends on observations in the data matrix and external factors. In cancer data specifically, values can also be systematically missing due to genetic mutations resulting in no reads being mappable to the splice junction. This can occur when a mutation near a splice site reduces junction usage due to changes in splice factor binding or when the mutation introduces a PTC into an exon resulting in rapid degradation of the transcript due to NMD. Thus it becomes necessary to treat missing values as a secondary signal when the missingness rate of an LSV is much higher than expected under the null missingness rate associated with sequencing limitations. To handle missing values under the MNAR model and detect missing value signals, CHESSBOARD identifies unquantifiable LSVs with η_{ij} < 10 and replaces x_{ij} with indicator ω_{ij} = 1. The indicator is modeled as
where θ_{0} is the background missingness rate that represents values that are missing due to techinical factors such as coverage and θ_{1} represents the signal missingness rate that is expected to be higher and represents values that are missing due to mutations. The priors can be estimated empirically. Specifically, we estimate the background priors by fitting the following BetaBinomial regression model.
Here, χ_{j} = median({η_{1j}, η_{2j}, …, η_{nj}}) is the median number of reads η that are mapped to LSV j across the n samples. υ_{j} is the number of samples with missing observations for LSV j. The fitted model returns MLE estimates for the coefficients β_{0} and β_{1} and the dispersion Φ. This trained model can then be used to estimate background priors α_{j0} = μ_{j}Φ and β_{j0} = (1 − μ_{j})Φ by predicting μ_{j} from the median read depths for each LSV j in the cancer data to be analyzed. In this study the above model was fitted to whole blood samples from GTEX V8. Users can of course fit the model to more relevant healthy tissue samples for their specific cancer of interest. Generally, we expat that in healthy control samples the missingness rate will be inversely proportional to sequencing depth (MNAR) but these missing values would not represent signals caused by cancer. In a similar way, we use the same procedure over the training data (beatAML or TARGET) to get estimates for the matching signal prior α_{j1} and β_{j1}.
Posterior sampling
The entire joint likelihood of the CHESSBOARD model is given by:
where a bold variable indicates a vector containing the variables across all possible indices. To sample from the model’s posterior, we develop an efficient blocked Gibbs sampling scheme which we will use to sample from each conditional posterior. The full conditional posterior of c_{i} is denoted by
Due to betabinomial conjugacy, we can integrate out Ψ and θ. This allows us to write
Here, the term before the product represents the prior P(c_{i} = k) which Intuitively captures the cluster’s proportion. The term Θ captures all other variables in the above likelihood not explicitly written again (for clarity). Since the integral over each r_{j(k+1)} here is intractable, we follow Neal 2000 and sample vector r_{(k+1)} from its prior distribution when attempting to open a new cluster. We choose the prior to be δ_{j} = P(r_{jk} = 1) for each element for the vector j. Note that ω is not included in the notation above for clarity but is trivial to include. With Ψ and θ integrated out, we will only need to then explicitly sample r, α and β. The full posterior conditional of \({r}_{j{c}_{i}}\) is given below.
However, due to high correlation between the \({r}_{j{c}_{i}}\)s, we must sample them simultaneously. In other words, rather than sampling \({r}_{j{c}_{i}}\) for each c_{i} = k, sample a vector r_{j.} with length k. Note that as k becomes large (i.e., the number of clusters grows in each MCMC iteration), computing this posterior becomes intractable since there are 2^{k} binary vectors. Therefore, we approximate this posterior by sampling r_{j.} in blocks of r_{ja}. . . r_{jb} where b − a is the maximum blocksize.
Finally, to sample μ_{j0} and μ_{j1}, we use a discrete approximation. We use possible values of μ on the interval p ∈ [0.025, 0.975] from 20 discrete bins. Thus we can evaluate the posterior of μ using a discrete categorical distribution defined as:
The concentration κ_{0} and κ_{1} are hyperparameters that are inversely proportional to the variance of the prior distribution. We choose κ_{0} = 20 and κ_{1} = 10 to model low expected variance of the background and high expected variance of the signal.
Posterior summary and convergence
To obtain a point estimate for r_{jk} and c_{i}, we apply the following posterior summary procedure. First, we obtain the pairwise matrix of probabilities that any two samples clustered together across all posterior samples (after burnin and thinning). In other words, the probability that c_{i} = c_{j} for any two samples i and j. Determining the portion of the chain to discard can be evaluated using the HeidelbergerWelch diagnostic (Supplementary Note 3.2). However in practice, we found for real high dimensional splicing data as we used here that the estimated model parameters converge very quickly and exhibit low posterior variance. In such cases, few MCMC iterations are needed and the optimization can be treated as a variational Bayes approximation (Supplementary Note 3.1). Convergence is then determined to be where the posterior likelihood of the model stops changing. We apply hierarchical clustering to the pairwise probability matrix with the number of clusters k being the median number of clusters across all posterior samples to obtain final clustering assignments. To obtain a point estimate for r, we obtain a matrix of the marginal probabilities that sample i in LSV j is assigned to the signal distribution. r_{jk} = 1 if the mean of \({r}_{j{c}_{i}}\forall {c}_{i}=k \ > \ 0.7\). Note that we also provide an alternative approach to point summary by using the posterior sample that minimizes the MSE to the posterior mean. In other words, we generate the mean pairwise clustering matrix and pick the sample that minimizes MSE to this mean matrix.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The beatAML dataset can be accessed through the National Cancer Institute (NCI) at https://www.cancer.gov/aboutnci/organization/ccg/blog/2019/beataml. The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) dataset, phs000218, managed by the NCI can be accessed at www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs000218.v22.p8. Information about TARGET can be found at http://ocg.cancer.gov/programs/target. The Penn HTSC dataset is available at GEO (GSE142514). The ENCODE knockout and eCLIP datasets from Van Nostrand et al. 2020 are available at https://www.encodeproject.org^{57}. The GTEx v7 whole blood data is available at https://www.gtexportal.org/home/datasets. Source data are provided with this paper. All processed datasets are available in the Zenodo repository associated with this publication. The data generated in this study including algorithm output and data used in figures is described in the Supplementary Information and Source Data files and can be accessed in the Zenodo repository. Source data are provided with this paper.
Code availability
All code for the algorithm, Python API and GAMBIT is publically available at https://bitbucket.org/biociphers/chessboard/src/master/. A list of Python package dependencies (pandas, scipy, numpy, seaborn, statmodels, scikitlearn, matplotlib) are listed in the installation instructions in the repository and will be automatically installed when installing our software. The GAMBIT tool is available online at https://paros.pmacs.upenn.edu/gambit/. Sample data for GAMBIT can be downloaded from the bitbucket repository. Documentation for the CHESSBOARD python API can be found at https://chessboard.readthedocs.io/en/latest/index.html. All code to reproduce figures and analysis can be found in the Zenodo repository.
References
Lee, S.I. et al. A machine learning approach to integrate big data for precision medicine in acute myeloid leukemia. Nat. Commun. 9, 1–13 (2018).
Way, G. P. et al. Machine learning detects pancancer ras pathway activation in the cancer genome atlas. Cell Rep. 23, 172–180.e3 (2018).
Huang, C. et al. Machine learning predicts individual cancer patient responses to therapeutic drugs with high accuracy. Sci. Rep. 8, 1–8 (2018).
Shen, R., Olshen, A. B. & Ladanyi, M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 25, 2906–2912 (2009).
Robertson, A. G. et al. Comprehensive molecular characterization of muscleinvasive bladder cancer. Cell 171, 540–556.e25 (2017).
Choi, P. S. & ThomasTikhonenko, A. RNAbinding proteins of COSMIC importance in cancer. J. Clin. Investig. 131, e151627 (2021).
Hentze, M. W., Castello, A., Schwarzl, T. & Preiss, T. A brave new world of RNAbinding proteins. Nat. Rev. Mol. Cell Biol. 19, 327–341 (2018).
Gebauer, F., Schwarzl, T., Valcárcel, J. & Hentze, M. W. RNAbinding proteins in human genetic disease. Nat. Rev. Genet. 22, 185–198 (2021).
Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).
Rivera, O. D. et al. Alternative splicing redefines landscape of commonly mutated genes in acute myeloid leukemia. Proc. Natl Acad. Sci. 118, e2014967118 (2021).
Kim, E. et al. SRSF2 mutations contribute to myelodysplasia by mutantspecific effects on exon recognition. Cancer Cell 27, 617–630 (2015).
Tyner, J. W. et al. Functional genomic landscape of acute myeloid leukaemia. Nature 562, 526–531 (2018).
Cherry, S. & Lynch, K. W. Alternative splicing and cancer: insights, opportunities, and challenges from an expanding view of the transcriptome. Genes Dev. 34, 1005–1016 (2020).
Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNASeq data with or without a reference genome. BMC Bioinform. 12, 1–16 (2011).
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and biasaware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Nearoptimal probabilistic RNAseq quantification. Nat. Biotechnol. 34, 525–527 (2016).
Katz, Y., Wang, E. T., Airoldi, E. M. & Burge, C. B. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7, 1009–1015 (2010).
Tiberi, S. & Robinson, M. D. BANDITS: Bayesian differential splicing accounting for sampletosample variability and mapping uncertainty. Genome Biol. 21, 1–13 (2020).
VaqueroGarcia, J. et al. A new view of transcriptome complexity and regulation through the lens of local splicing variations. elife 5, e11752 (2016).
Trincado, J. L. et al. SUPPA2: fast, accurate, and uncertaintyaware differential splicing analysis across multiple conditions. Genome Biol. 19, 1–11 (2018).
Shen, S. et al. rMATS: robust and flexible detection of differential alternative splicing from replicate RNASeq data. Proc. Natl Acad. Sci. 111, E5593–E5601 (2014).
Gu, J. & Liu, J. S. Bayesian biclustering of gene expression data. BMC Genom. 9, 1–10 (2008).
Meeds, E. & Roweis, S. Nonparametric bayesian biclustering. Tech. Rep. Citeseer (2007).
Givoni, I., Cheung, V. & Frey, B. J. Matrix tile analysis. arXiv preprint arXiv:1206.6833 (2012).
Gao, C., McDowell, I. C., Zhao, S., Brown, C. D. & Engelhardt, B. E. Context specific and differential gene coexpression networks via Bayesian biclustering. PLoS Comput. Biol. 12, e1004791 (2016).
Xu, Y. et al. Nonparametric bayesian biclustering for next generation sequencing count data. Bayesian Analy. (Online) 8, 759 (2013).
Rubin, D. B. Inference and missing data. Biometrika 63, 581–592 (1976).
Dhillon, I. S. Coclustering documents and words using bipartite spectral graph partitioning. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, 269–274 https://doi.org/10.1145/502512.502550 (2001).
Slaff, B. et al. MOCCASIN: A method for correcting for known and unknown confounders in RNA splicing analysis. Nat. Commun. 12, 1–9 (2021).
de NecocheaCampion, R., Shouse, G. P., Zhou, Q., Mirshahidi, S. & Chen, C.S. Aberrant splicing and drug resistance in AML. J. Hematol. Oncol. 9, 1–9 (2016).
Anande, G. et al. RNA Splicing Alterations Induce a Cellular Stress Response Associated with Poor Prognosis in Acute Myeloid LeukemiaRNA Splicing Alterations in AML. Clin Cancer Res. 26, 3597–3607 (2020).
Das, S. & Krainer, A. R. Emerging functions of SRSF1, splicing factor and oncoprotein, in RNA metabolism and cancer. Mol. Cancer Res. 12, 1195–1204 (2014).
Anczuków, O. et al. The splicing factor SRSF1 regulates apoptosis and proliferation to promote mammary epithelial cell transformation. Nat. Struct. Mol. Biol. 19, 220–228 (2012).
Massiello, A. & Chalfant, C. E. SRp30a (ASF/SF2) regulates the alternative splicing of caspase9 premRNA and is required for ceramideresponsiveness. J. Lipid Res. 47, 892–897 (2006).
Reiling, J. H. & Sabatini, D. M. Stress and mTORture signaling. Oncogene 25, 6373–6383 (2006).
Port, M. et al. Prognostic significance of FLT3 internal tandem duplication, nucleophosmin 1, and CEBPA gene mutations for acute myeloid leukemia patients with normal karyotype and younger than 60 years: a systematic review and metaanalysis. Ann. Hematol. 93, 1279–1286 (2014).
Network, C. G. A. R. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N. Engl J. Med. 368, 2059–2074 (2013).
Wojtuszkiewicz, A. et al. Maturation StateSpecific Alternative Splicing in FLT3ITD and NPM1 Mutated AML. Cancers 13, 3929 (2021).
van der Werf, I. et al. Splicing factor gene mutations in acute myeloid leukemia offer additive value if incorporated in current risk classification. Blood Adv. 5, 3254–3265 (2021).
Steelman, L. S. et al. Contributions of the Raf/MEK/ERK, PI3K/PTEN/Akt/mTOR and Jak/STAT pathways to leukemia. Leukemia 22, 686–707 (2008).
Bolouri, H. et al. The molecular landscape of pediatric acute myeloid leukemia reveals recurrent structural alterations and agespecific mutational interactions. Nat. Med. 24, 103–112 (2018).
Gröbner, S. N. et al. The landscape of genomic alterations across childhood cancers. Nature 555, 321–327 (2018).
Ma, X. et al. Pancancer genome and transcriptome analyses of 1,699 paediatric leukaemias and solid tumours. Nature 555, 371–376 (2018).
Krentz, S. et al. Prognostic value of genetic alterations in children with first bone marrow relapse of childhood Bcell precursor acute lymphoblastic leukemia. Leukemia 27, 295–304 (2013).
Black, K. L. et al. Aberrant splicing in Bcell acute lymphoblastic leukemia. Nucleic Aacids Res. 46, 11357–11369 (2018).
Sotillo, E. et al. Convergence of acquired mutations and alternative splicing of CD19 enables resistance to CART19 immunotherapy. Cancer Discov. 5, 1282–1295 (2015).
Pierson, E. & Yau, C. ZIFA: Dimensionality reduction for zeroinflated singlecell gene expression analysis. Genome Biol. 16, 1–10 (2015).
Argelaguet, R. et al. MOFA+: a statistical framework for comprehensive integration of multimodal singlecell data. Genome Biol. 21, 1–17 (2020).
Mossmann, D., Park, S. & Hall, M. N. mTOR signalling and cellular metabolism are mutual determinants in cancer. Nat. Rev. Cancer 18, 744–757 (2018).
Jiang, B.H. & Liu, L.Z. Role of mTOR in anticancer drug resistance: perspectives for improved drug treatment. Drug Resist. Updates 11, 63–76 (2008).
Lindblad, O. et al. Aberrant activation of the PI3K/mTOR pathway promotes resistance to sorafenib in AML. Oncogene 35, 5119–5131 (2016).
Yi, H. et al. Inhibition of the AKT/mTOR pathway augments the anticancer effects of sorafenib in thyroid cancer. Cancer Biother. Radiopharmaceut. 32, 176–183 (2017).
Kim, M. N., Lee, S. M., Kim, J. S. & Hwang, S. G. Preclinical efficacy of a novel dual PI3K/mTOR inhibitor, CMG002, alone and in combination with sorafenib in hepatocellular carcinoma. Cancer Chemother. Pharmacol. 84, 809–817 (2019).
Gedaly, R. et al. The role of PI3K/mTOR inhibition in combination with sorafenib in hepatocellular carcinoma treatment. Anticancer Res. 32, 2531–2536 (2012).
Damnernsawad, A. et al. Genomewide CRISPR screen identifies regulators of MAPK and MTOR pathways mediating sorafenib resistance in acute myeloid leukemia. Haematologica 107, 77–85 (2020).
Burdziak, C., Azizi, E., Prabhakaran, S. & Pe’er, D. A nonparametric multiview model for estimating cell typespecific gene regulatory networks. arXiv preprint arXiv:1902.08138 (2019).
Van Nostrand, E. L. et al. A largescale binding and functional map of human RNAbinding proteins. Nature 583, 711–719 (2020).
Acknowledgements
We would like to thank Matthew Gazzara, Dr. Martin Carroll and Dr. Sara Cherry for their input on the data analysis interpretation, Dr. Caleb Radens for assistance with data processing, and Joseph Aicher for early input on the model/algorithm. This work was supported by NIH grants R01 GM128096 (Y.B.) and U01 CA232563/CA23256301S3 (Y.B., A.T.T, and K.W.L.) and grants from CureSearch For Children’s Cancer (A.T.T. and Y.B.) and Emerson Collective (A.T.T., Project 886246066). D.W. is supported by NIH fellowship 1F31CA26521801.
Author information
Authors and Affiliations
Contributions
Y.B. conceived the project. D.W. and Y.B. developed the methods and planned the experiments. D.W. wrote the code and carried out the experiments and analysis. M.Q.V. performed the gene ontology analysis. S.J. and M.E. developed the GAMBIT tool. M.Q.V., K.L., and A.T.T. provided feedback on data analysis interpretation. D.W. and Y.B. wrote the paper with input from M.Q.V., K.L., and A.T.T. All authors read and approved the final paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Simone Tiberi, Liliana Florea and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, D., QuesnelVallieres, M., Jewell, S. et al. A Bayesian model for unsupervised detection of RNA splicing based subtypes in cancers. Nat Commun 14, 63 (2023). https://doi.org/10.1038/s41467022353690
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467022353690
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.