Improved methods for RNAseq-based alternative splicing analysis

Halperin, Rebecca F.; Hegde, Apurva; Lang, Jessica D.; Raupach, Elizabeth A.; Legendre, Christophe; Liang, Winnie S.; LoRusso, Patricia M.; Sekulic, Aleksandar; Sosman, Jeffrey A.; Trent, Jeffrey M.; Rangasamy, Sampathkumar; Pirrotte, Patrick; Schork, Nicholas J.

doi:10.1038/s41598-021-89938-2

Article
Open access
Published: 24 May 2021

Rebecca F. Halperin¹,
Apurva Hegde²,
Jessica D. Lang³,
Elizabeth A. Raupach³,
C4RCD Research Group,
Christophe Legendre³,
Winnie S. Liang^3,4,
Patricia M. LoRusso⁵,
Aleksandar Sekulic⁶,
Jeffrey A. Sosman⁷,
Jeffrey M. Trent³,
Sampathkumar Rangasamy⁴,
Patrick Pirrotte²^na1 &
…
Nicholas J. Schork¹^na1

Scientific Reports volume 11, Article number: 10740 (2021)

Abstract

The robust detection of disease-associated splice events from RNAseq data is challenging due to the potential confounding effect of gene expression levels and the often limited number of patients with relevant RNAseq data. Here we present a novel statistical approach to splicing outlier detection and differential splicing analysis. Our approach tests for differences in the percentages of sequence reads representing local splice events. We describe a software package called Bisbee which can predict the protein-level effect of splice alterations, a key feature lacking in many other splicing analysis resources. We leverage Bisbee’s prediction of protein level effects as a benchmark of its capabilities using matched sets of RNAseq and mass spectrometry data from normal tissues. Bisbee exhibits improved sensitivity and specificity over existing approaches and can be used to identify tissue-specific splice variants whose protein-level expression can be confirmed by mass spectrometry. We also applied Bisbee to assess evidence for a pathogenic splicing variant contributing to a rare disease and to identify tumor-specific splice isoforms associated with an oncogenic mutation. Bisbee was able to rediscover previously validated results in both of these cases and also identify common tumor-associated splice isoforms replicated in two independent melanoma datasets.

Introduction

Alternative splicing has been shown to play an important role in normal cellular processes as well as a wide range of pathogenic processes underlying many different diseases^{1, 2}. For example, global dysregulation of splicing, as well as mutations in genes regulating splicing, such as SF3B1, have been observed in a variety of tumors^{3, 4}. In addition, the results of genome-wide association studies (GWAS) focusing on common chronic conditions have identified a number of disease-associated variants that influence splicing, suggesting a role for alternative splicing in mediating many common diseases^{5, 6}. Furthermore, highly penetrant variants that affect splicing have been classified as pathogenic in a number of monogenic disorders⁷. The detection of disease-relevant splice alterations is not trivial, as there are hundreds of thousands of annotated splice sites in the human genome. There is also great potential for the emergence of novel unannotated splice sites at countless locations in the genome. This suggests a need for robust statistical methods for detecting and quantifying differential splice events in comparative studies in health and disease. We have developed a novel statistical framework for differential splicing and splice outlier detection. This framework identifies splice events in an individual sample based on a metric evaluating the percentage of reads supporting the event, or the ‘percent spliced in’ (PSI), which reflects evidence supporting that event beyond the range observed in a set of reference samples. The methods are implemented in a package called Bisbee designed for RNAseq data splicing analysis (Fig. 1). Bisbee also provides protein-level splicing effect predictions. We validated these predictions and benchmarked our statistical methods using normal tissue samples with both RNAseq and mass spectrometry data⁸.
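As a minimal illustration of the PSI metric described above, the sketch below computes PSI from the read counts supporting the two alternatives of a single splice event. The function name and the convention that the first count is the "spliced in" isoform are assumptions made for illustration and are not taken from the Bisbee code.

```python
from typing import Optional


def percent_spliced_in(iso1_reads: int, iso2_reads: int) -> Optional[float]:
    """Fraction of reads supporting isoform 1; None when no reads cover the event."""
    depth = iso1_reads + iso2_reads
    if depth == 0:
        return None  # PSI is undefined without coverage
    return iso1_reads / depth


# The same PSI point estimate can rest on very different amounts of evidence.
low_depth_psi = percent_spliced_in(2, 2)       # depth 4
high_depth_psi = percent_spliced_in(200, 200)  # depth 400
```

Both calls give the same point estimate although the second is supported by one hundred times more reads, which is why read depth needs to enter the statistical model rather than only the PSI value.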

Alternative splicing analysis consists of three main steps: detection, statistical comparison, and effect prediction. Here we leverage an existing tool for detection and implement new methods for the statistical analysis and effect prediction steps. Software packages for detecting splicing alterations may be broadly broken down into two categories: those that only identify events found in annotated transcripts such as ballgown⁹, MISO¹⁰, rMATS¹¹, and SUPPA2¹², and those that additionally detect novel splice events, such as ASPLI¹³, SplAdder¹⁴, SGSeq¹⁵, LeafCutter¹⁶, and MAJIQ¹⁷. As aberrant splicing in disease states may result in novel transcripts, we sought to identify and extend the capability of a tool that can identify novel splice events. We chose SplAdder for basic splice event detection because it has demonstrated utility in a large pan-cancer study and can enable comparisons to large sets of normal tissues from GTEx without requiring access to raw GTEx data¹⁴. SplAdder also has advantages in terms of its modularity, facilitating analysis of large datasets in a cluster computing environment, and it also reports splice event coordinates in a straightforward manner.

Several splicing analysis packages include functions for testing differential splicing between two groups including ballgown⁹, ASPLI¹³, and SplAdder¹⁴. These typically use a generalized linear model and treat the overall expression level of the gene as a covariate to normalize expression differences that may confound the detection of splicing differences. However, a more straightforward approach would be to explicitly test the difference in the ‘percent spliced in’ (PSI), and therefore obviate the need to normalize for library size or expression level. We selected a beta binomial model, as such models are used in many DNA sequencing variant calling strategies to model the distribution of reads supporting the existence of reference and alternate alleles^{18,19,20,21}. Here we consider that each splice event has two alternate alleles, such as an exon included versus skipped in a particular gene (Fig. 1). The binomial model captures the noise in the technical measurement of the PSI value due to the depth of coverage at the splice event without needing to rely on replicates, and the beta distribution models the biological variation in the splicing. Our proposed approach is most similar to the strategy implemented in the ‘LeafCutter’ program, which uses a multinomial model to test for differences in intron usage within a region, reporting clusters of likely splicing disruption rather than defined event types such as exon skipping¹⁶. We chose to work with defined splice event types for improved interpretability and potential for insight into the mechanism of splice dysregulation.
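The idea of combining a binomial read-sampling model with a beta distribution over PSI can be sketched as follows. This is an illustrative simplification and not the Bisbee implementation: the precision `a + b` is held fixed, the mean PSI is maximized over a coarse grid, and all function names and the `precision` and `grid` parameters are assumptions introduced here. The fixed precision is not the $\omega$ parameter of the paper.

```python
from math import lgamma


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_logpmf(k, n, a, b):
    """Log-probability of k isoform-1 reads out of n under a beta-binomial(a, b)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def max_loglik(samples, precision, grid=99):
    """Best summed log-likelihood over a grid of mean PSI values (precision a + b fixed)."""
    best = float("-inf")
    for i in range(1, grid + 1):
        mean = i / (grid + 1)
        a, b = mean * precision, (1.0 - mean) * precision
        loglik = sum(betabinom_logpmf(k, n, a, b) for k, n in samples)
        best = max(best, loglik)
    return best


def likelihood_ratio(group1, group2, precision=20.0):
    """Log-likelihood gain from giving each group its own mean PSI instead of a shared one."""
    separate = max_loglik(group1, precision) + max_loglik(group2, precision)
    shared = max_loglik(group1 + group2, precision)
    return separate - shared


# Each sample is (isoform-1 reads, total reads at the event); values are made up.
similar = likelihood_ratio([(9, 10), (18, 20)], [(17, 20), (10, 11)])
different = likelihood_ratio([(9, 10), (18, 20)], [(1, 10), (2, 20)])
```

Because the shared-mean fit is a special case of the separate-mean fit on the same grid, the ratio is never negative, and it grows with both the PSI difference and the read depth supporting it.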

The detection of splice events that are specific to an individual when compared to a large reference set of samples is useful in several clinical applications. This outlier analysis may be used to identify disruption of splicing due to somatic mutations or expression of known tumor-specific splice isoforms in an individual’s tumor^{22, 23}. The analysis of splicing outliers may also be used to identify splice variant-induced antigens in a target individual’s tumor that do not exist in normal tissues^{3, 23,24,25}. In addition, some rare Mendelian disorders are caused by variants that disrupt splicing. These disruptions may be detected by comparing an individual transcriptome to a set of reference samples. We are aware of only one tool designed for detecting splicing outliers in individual genomes, LeafCutterMD²⁶, which does not enable the prediction of the effect of the splice variant on an encoded protein.

Predicting the protein level impact of a splice variant is critical for understanding the biological implications and potential mechanisms underlying disease states, yet most RNAseq alternative splicing analysis packages do not incorporate an effect prediction component. Splice variants may result in truncations or deletions at the protein level that result in a loss of protein function. Alternatively, spliced protein isoforms may also exhibit qualitative differences in function. For example, BCL2L1 splice isoforms have opposing effects on apoptosis²⁷. In addition to identifying the expression of such known isoforms, it would also be possible to predict the functional consequences of novel splice isoforms using the domains impacted and other in silico approaches. Alternative splicing may also give rise to novel protein sequences in a cancer cell that could be recognized by the immune system^{3, 23,24,25}.

Genomics, transcriptomics, and proteomics are being used together more often in an effort to better characterize phenotypic effects resulting from genomic alterations and pathway dysregulation^{28,29,30}. There are many existing proteogenomics pipelines that use transcriptome sequencing to generate protein sequence databases for matching mass spectra^{31,32,33,34,35}. Many of these pipelines take a comprehensive approach using all detected splice junctions and translating them in all six reading frames^{31,32,33,34,35,36}. Such studies have been extremely useful for elucidating gene structure and cataloging splice junctions in specific samples^{31, 37}. Other proteogenomics pipelines leverage transcript assembly and generate protein sequences via either three-frame translation or translation of open reading frames of the reconstructed sequences^{38,39,40}. However, transcriptome assembly is computationally intensive and not necessary for integration with mass spectrometry, as only peptides, rather than full-length proteins, are detected. We are aware of only one other proteogenomics pipeline that predicts protein sequences from splice events rather than junctions or assembled transcripts; however, it does not detect events involving novel splice sites⁴¹. Our study utilizes splice event-level analysis as it is more amenable to comparisons between samples and facilitates interpretation.

In order to benchmark Bisbee’s methods against similar approaches we developed a ‘truth set’ with splice events validated through the detection of corresponding protein isoforms. This truth set was generated using mass spectrometry and RNAseq data on a set of normal tissues from Wang et al.⁸. We identified several other splice variant analysis tools to consider for benchmarking against Bisbee. However, only a few of them provide utilities for predicting the effect of splice alterations at the protein level, which would be necessary for use with our mass spectrometry truth set. Using real data with complementary measurements provides a more robust framework for benchmarking and validation. Our truth set takes advantage of the naturally occurring differences in splicing between different tissues³⁰ to evaluate Bisbee’s differential splicing and splice outlier modules.

Results

Predicted splice isoforms are detected at the protein level

In order to validate the existence of proteins/peptides corresponding to splice variants, we leveraged a dataset from Wang et al., which includes paired RNA-seq and proteomics data from normal tissues⁸. In this validation dataset, SplAdder identified 268,791 total splice events, of which 125,683 were predicted by Bisbee to be protein coding. The mass spectrometry searches identified 182,662 unique peptides (Supplemental Fig. 1). Protein evidence of alternative splicing, defined as having at least one peptide supporting each of the two isoforms, was detected for 1587 of the protein-generating events, including 1082 generating novel sequences (Fig. 2). The event categories that generate longer stretches of altered sequence have higher proportions of protein level detection as expected (Supplemental Fig. 2). We observed 330 events showing tissue specific detection patterns at the protein level, and these were used for benchmarking and validation.

Bisbee Diff more accurately detects differentially spliced isoforms

The beta binomial differential splicing test implemented in Bisbee has one parameter that requires tuning, ${\omega }_{M}$, which is a constraint on the parameter $\omega$ which controls the overall shape of the beta distribution, particularly in its tails. In order to avoid overfitting, we reserved the tissue-specific protein isoforms dataset from Wang et al. to compare the accuracy of differential splicing methods¹. GTEx was used for parameter optimization and threshold selection. We compared the distribution of the test statistic for the Bisbee Diff test between sets of samples from the same tissue versus different tissues using different values of the ${\omega }_{M}$ parameter. The percentage of events passing a given threshold in the ‘different’ versus the ‘same’ comparison is used as an indicator of the specificity of the test, while the percentage of events in the different comparisons passing the thresholds is used as an indicator of sensitivity. Setting the ${\omega }_{M}$ parameter to 200, and using a log likelihood ratio (LR) threshold of 8 provides optimal enrichment of splice events detected as different between different tissues compared to splice events detected as different between samples from the same tissue (Supplemental Fig. 2a).
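The threshold selection logic described above can be illustrated with a short sketch that compares how often a score threshold is passed in different-tissue versus same-tissue comparisons. The function names, the use of "greater than or equal to" for passing, and the example scores are assumptions made for illustration.

```python
def fraction_passing(scores, threshold):
    """Fraction of events whose score meets the threshold (0.0 for an empty list)."""
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)


def enrichment(different_scores, same_scores, threshold):
    """Ratio of passing fractions: different-tissue over same-tissue comparisons."""
    same = fraction_passing(same_scores, threshold)
    different = fraction_passing(different_scores, threshold)
    if same == 0.0:
        return float("inf") if different > 0.0 else 0.0
    return different / same


# Made-up likelihood ratio scores for the two kinds of comparison.
different_tissue = [12.0, 9.0, 3.0, 15.0, 8.5, 1.0]
same_tissue = [0.5, 2.0, 9.0, 1.0, 0.2, 3.0]

# At a threshold of 8, 4 of 6 different-tissue and 1 of 6 same-tissue events pass.
sensitivity_proxy = fraction_passing(different_tissue, 8.0)
enrichment_at_8 = enrichment(different_tissue, same_tissue, 8.0)
```

Scanning such an enrichment over candidate parameter values and thresholds is one way to arrive at an operating point, under the assumption that same-tissue differences mostly represent noise.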

We identified 281 instances of protein expression-confirmed isoform switches over six pairwise tissue comparisons, which represent 196 unique isoform switch events. For comparison, SplAdder’s test module was run as an example of a program that uses a generalized linear model approach. As a simple approach, a t-test on the PSI values was performed both using all of the PSI values regardless of depth and only including PSI values with a sequencing read depth at the position of greater than 10. To evaluate these methods, we compared the total number of events passing a given threshold to the number of protein-confirmed events passing the threshold. The Bisbee Diff method consistently found higher enrichment of confirmed events out of total events passing a threshold (Fig. 3A, Supplemental Table 1). In order to see how the magnitude of PSI differences and the read depths of the events influence the performance of each of the differential splicing tests, we made a volcano plot of the brain versus small intestine comparison (Fig. 4).
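The simple t-test baseline, with and without the depth filter, might look like the sketch below. The paper does not state which t-test variant was used, so Welch's unequal-variance statistic is an assumption here, as are the function names and the example counts.

```python
from math import sqrt
from statistics import mean, variance


def psi_values(counts, min_depth=None):
    """PSI per sample from (iso1_reads, iso2_reads); optionally keep only depth > min_depth."""
    values = []
    for iso1, iso2 in counts:
        depth = iso1 + iso2
        if depth == 0 or (min_depth is not None and depth <= min_depth):
            continue
        values.append(iso1 / depth)
    return values


def welch_t(x, y):
    """Welch's t statistic; None when a group is too small or has no variance to estimate."""
    if len(x) < 2 or len(y) < 2:
        return None
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    if standard_error == 0:
        return None
    return (mean(x) - mean(y)) / standard_error


# Made-up counts; the last sample in each group has very low depth and a noisy PSI.
group_a = [(18, 2), (40, 10), (1, 1)]
group_b = [(4, 16), (5, 45), (3, 0)]

t_all = welch_t(psi_values(group_a), psi_values(group_b))
t_filtered = welch_t(psi_values(group_a, min_depth=10), psi_values(group_b, min_depth=10))
```

In this toy example the low-depth samples inflate the variance and weaken the unfiltered statistic, while filtering discards data outright; neither option weights samples by their depth.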

Bisbee outlier more accurately detects splice outliers

The Bisbee outlier detection method parameter ${\beta }_{M}$ was optimized using GTEx data. The percentage of outlier scores passing a threshold for models trained on a different tissue was compared to the percentage passing for matching tissue models. We found that a ${\beta }_{M}$ value of 80 provides the best enrichment of different tissue outliers with a log likelihood (LL) cutoff of 10 (Supplemental Fig. 2b). We used these values for benchmarking on the Wang et al. dataset with matching proteomics data⁸. We used a set of GI tissues as the reference set and detected outliers in three other tissues. We identified 140 outlier events across the three tissues, which represents 134 unique outlier events. Since we are not aware of another tool that is able to detect splice outliers and generate predicted protein sequences, we implemented two simple methods using the distribution of PSI values in the reference dataset. The first simple outlier method uses the median absolute deviation (MAD) and the second uses the interquartile range (IQR) of the PSI values. For both of these methods we performed the analysis using PSI values for all data points as well as using only PSI values for data points with a depth greater than 10. The Bisbee outlier method detected more proteomics-confirmed events for similar numbers of total events passing the same score threshold (Fig. 3B, Supplemental Table 2).
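The two simple baseline outlier methods can be sketched as below. The text does not specify how the MAD and IQR are converted into scores, so the scaling used here (distance from the median in MAD units, and distance outside the quartile range in IQR units) is an assumption, as are the function names and reference values.

```python
from statistics import median, quantiles


def mad_score(value, reference):
    """Distance of a PSI value from the reference median, in units of the MAD."""
    center = median(reference)
    mad = median([abs(r - center) for r in reference])
    distance = abs(value - center)
    if mad == 0:
        return float("inf") if distance > 0 else 0.0
    return distance / mad


def iqr_score(value, reference):
    """Distance of a PSI value outside the reference quartile range, in units of the IQR."""
    q1, _, q3 = quantiles(reference, n=4)
    iqr = q3 - q1
    distance = max(q1 - value, value - q3, 0.0)
    if iqr == 0:
        return float("inf") if distance > 0 else 0.0
    return distance / iqr


# Made-up reference PSI values for one event, and two test samples.
reference_psi = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.12, 0.11]
typical = 0.11
unusual = 0.90
```

Neither score uses read depth, so a PSI of 0.90 from two reads and from two hundred reads would be scored identically unless a depth filter is applied first.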

Case study: detection of a splice event in rare disease

In order to examine the utility of the Bisbee package for research and clinical applications, we analyzed a disease-causing splice mutation in the nuclear-encoded mitochondrial methionyl-tRNA formyltransferase (MTFMT)^{42,43,44}. We previously identified a homozygous mutation (c.626C>T) in the MTFMT gene in three children from two unrelated families (ClinVar Accession #VCV000039827.4) with Leigh syndrome and combined oxidative phosphorylation (OXPHOS) deficiency. The MTFMT mutation c.626C>T in the coding region results in a Ser209Leu (S209L) amino acid substitution, which is likely a non-pathogenic event. However, c.626C>T is predicted to generate a splicing suppressor that results in skipping of exon 4, leading to frame shift and truncation of the protein (p.R181SfsX5)^{42, 43, 45}. The substitution at c.626C, 20 base pairs (bp) upstream of the 3′ end of exon 4, is predicted to eliminate two overlapping exonic splicing enhancers (ESE) (GTCAAG and TCAAGA) and generate an exonic splicing suppressor (ESS) sequence (GTTGTT)^{46, 47}. To confirm the expected exon skipping and truncation, we performed differential splicing analysis of RNA sequencing data obtained from the primary fibroblast cells from three patients carrying the homozygous c.626C>T mutation and five unaffected controls using Bisbee. We found that the MTFMT exon 4 skipping event was the 14th highest scoring differentially spliced event. Though the LR (7.999) was just barely below the optimal threshold determined in the GTEx analysis, its high rank makes it likely to be considered in a candidate variant analysis. It is not surprising that the event did not quite pass the threshold as coverage of the event in the cases was only 10, 6 and 2 reads. If we use the protein effect predictions to filter down to events predicted to generate novel sequences that were expressed more highly in the cases compared to the controls, we find that the MTFMT exon 4 skip is the highest scoring of these events (Supplemental Fig. 3).

When trying to discover the causal variant in a rare disease, there is often only one affected case available for sequencing, so we also ran Bisbee outlier analyses on each of the three cases to illustrate the single case scenario. Since it is desirable to have a large set of reference samples for outlier analysis, but there are technical differences in the sequencing between GTEx and this dataset, we performed the outlier analysis both using GTEx fibroblasts as the reference samples and using the five unaffected fibroblast samples used in the differential splicing analysis as the reference samples, and used the minimum score of the two analyses. The Bisbee outlier scores for the MTFMT exon 4 skip in the cases were 4.8, 10.6, and 3.4, ranking 386, 16, and 1746 of all events (Supplemental Fig. 3). When only considering events generating novel protein sequences with increased expression in the cases, the MTFMT event ranked 145, 2, and 587, respectively, in each of the three cases. Despite the very low coverage of the event in the cases, Bisbee was still able to rank the event in the top 1% of all events in all three cases.
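Taking the minimum score across two reference sets and then ranking events can be sketched as follows. The dictionary layout, the event identifiers, and the choice to drop events scored against only one reference are assumptions made for illustration.

```python
def combined_scores(scores_ref_a, scores_ref_b):
    """Minimum outlier score per event across two reference sets (shared events only)."""
    shared = scores_ref_a.keys() & scores_ref_b.keys()
    return {event: min(scores_ref_a[event], scores_ref_b[event]) for event in shared}


def rank_events(scores):
    """Rank events from highest to lowest score, starting at 1."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {event: position + 1 for position, event in enumerate(ordered)}


# Made-up scores for one case against a large public reference and a small in-house one.
public_reference = {"event_1": 12.0, "event_2": 3.0, "event_3": 7.5}
inhouse_reference = {"event_1": 10.6, "event_2": 9.0, "event_3": 1.0}

scores = combined_scores(public_reference, inhouse_reference)
ranks = rank_events(scores)
```

Using the minimum is conservative: an event scores highly only if it looks unusual relative to both references, which limits outliers driven by technical differences between a case and one particular reference set.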

The Bisbee annotation output is shown for the MTFMT exon 4 skipping event in Table 1. Each event is assigned a unique identifier (event_jid) using the contig, strand, and junction coordinates to facilitate comparing results between datasets. The effects at the transcript (event_cat) and protein level (effect_cat) are described, as well as whether the splice event is found in Ensembl transcripts (aa_change_type). The sequence headers of the two isoforms are provided in order to locate the protein sequences in the fasta output. The sample group with increased expression of the isoform labeled “alt” is indicated (group_increased_alt). The location within the protein sequence as well as the altered amino acid sequence fragments are also provided. These results confirm the expected R181SfsX5 frame shift truncation.

Table 1 Example Bisbee output.

Application to TCGA Uveal Melanoma dataset

We selected the TCGA uveal melanoma dataset as an example application as there is a recurrent mutation in the splicing factor 3B1 gene (SF3B1) that has been previously shown to cause aberrant 3′ splice site usage^{48, 49}. To identify tumor-specific splice events, we performed Bisbee Outlier analysis using the complete GTEx tissue library excluding testis tissue samples. Testis was excluded as it may express developmentally restricted proteins not found in normal somatic tissues^{39, 40}. We also used the TCGA normal samples as a reference and took the minimum score of the two analyses. In examining the total number of splice outliers per patient, we observed a large increase in alternative 3′ splice site outliers with SF3B1 mutation as well as significantly increased exon skipping, intron retention, and mutually exclusive exon outlier burden (Fig. 5A, rank sum p value < 0.01). We also ran Bisbee Diff to identify differentially spliced events between SF3B1 mutant and wild-type tumors. We found 19,950 differentially spliced events of which 72% were mutually exclusive exons and 15% were alternative 3′ splice sites. The alternative 3′ differentially spliced events had higher Bisbee Diff LR and a greater overlap with events also observed in the outlier analysis (Fig. 5B). Alsafadi et al. previously identified differentially spliced events between SF3B1 mutant and wild-type tumors in an independent dataset, and selected seven of these events to validate in isogenic cell lines using a mini-gene splice assay⁴⁹. All seven of these events were detected as differentially spliced by Bisbee Diff (Table 2).

Table 2 Validated SF3B1 mutant versus wild-type differentially spliced events.

In order to identify protein isoforms that may be specific to SF3B1 mutant tumors, we selected splice events that were common between the differential splicing and outlier analysis (494) and then identified those predicted to result in altered protein sequence (321). These events are primarily alternative 3′ events causing insertions or frame disruptions resulting in novel protein isoforms in the uveal melanoma tumors (Fig. 5C).

Replication of common melanoma associated splice events in an independent dataset

In addition to observing splice events associated with SF3B1 mutation, we also observed splice events common across the TCGA uveal melanoma cohort, irrespective of SF3B1 mutation status. In order to validate this finding, we performed the Bisbee splicing analysis on an independent melanoma cohort (SU2C), consisting of 37 patients with BRAF wild-type recurrent tumors including 13 cutaneous, 7 mucosal, 10 uveal, 5 acral, and 1 melanoma of unknown primary. We performed the Bisbee outlier analysis using both GTEx excluding testis as the reference and a set of 28 normal tissues or cell lines sequenced at the same institution as the reference, and took the minimum score of the two analyses. We compared the number of patients passing the outlier threshold for each event between the two datasets. We identified 23 splice events with 20 or more tumors meeting the outlier criteria in the TCGA dataset, and found that 10 of these events were also detected as outliers in at least one of the SU2C tumors (Fig. 6A). When only considering events with predicted protein sequence changes, there are ten events meeting the outlier criteria in 20 or more of the TCGA tumors and nine of these events are detected as outliers in at least one of the SU2C tumors (Fig. 6B). These nine events identified in both datasets include five intron exclusion events in GAPDHS predicted to result in novel sequence in the reference samples. There is also an alternative 5′ splice site in EXOC3, and intron retention in TBL1X, PTPRH, and PALM that are predicted to result in novel sequence in the tumors (Fig. 6C). The intron retention event in SLC24A5 was not detected by SplAdder in the SU2C dataset.
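The cross-cohort comparison amounts to counting, for each event, how many tumors pass the outlier threshold in each dataset. A minimal sketch is given below; the nested dictionary layout, function names, and example values are assumptions, and the cutoffs in the toy example are lowered from the 20-tumor criterion used in the text.

```python
from collections import Counter


def tumors_passing(outlier_scores, threshold):
    """Count tumors passing the threshold per event, from {tumor_id: {event_id: score}}."""
    counts = Counter()
    for events in outlier_scores.values():
        for event_id, score in events.items():
            if score >= threshold:
                counts[event_id] += 1
    return counts


def replicated_events(discovery_counts, replication_counts, min_discovery=20, min_replication=1):
    """Return (common discovery events, subset also seen in the replication cohort)."""
    common = {e for e, n in discovery_counts.items() if n >= min_discovery}
    replicated = {e for e in common if replication_counts.get(e, 0) >= min_replication}
    return common, replicated


# Made-up outlier scores for a discovery cohort and a replication cohort.
discovery = {
    "tumor_1": {"event_a": 12.0, "event_b": 11.0},
    "tumor_2": {"event_a": 15.0, "event_b": 2.0},
    "tumor_3": {"event_a": 3.0, "event_b": 14.0, "event_c": 20.0},
}
replication = {
    "tumor_x": {"event_a": 10.5, "event_b": 1.0},
    "tumor_y": {"event_a": 4.0},
}

common, replicated = replicated_events(
    tumors_passing(discovery, 10.0), tumors_passing(replication, 10.0), min_discovery=2
)
```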

Discussion

We have developed a new package for splicing data analysis called Bisbee. Bisbee provides functions for differential splicing analysis, splicing outlier analysis, and protein effect prediction. Using a dataset with matched RNAseq and mass spectrometry data on normal human tissues we constructed a truth set to benchmark differential splicing and outlier methods, including Bisbee. We found that Bisbee’s differential splicing approach had substantially better enrichment of proteomics-confirmed events than the other approaches. Bisbee’s outlier test also outperformed other outlier approaches. We demonstrated the utility of the approach in both a rare disease and a cancer context.

The Bisbee package goes beyond many other RNAseq-focused splicing analysis packages by generating protein sequences for the observed splice events. While there are other tools for transcriptomic-proteomic integration, they generally focus on generating comprehensive databases for mass spectra matching rather than on generating high confidence protein sequence predictions from the RNAseq data. By using Bisbee to generate a database of patient-specific protein isoforms from RNAseq data, and then using mass spectrometry to detect which ones have protein level evidence, one could identify high confidence disease-specific protein isoforms for further characterization. The protein domains impacted and other downstream functional predictions from the protein sequences enable further insight into the impact of splicing alterations and can identify splicing-derived pathogenic variants that would go undetected by DNA sequencing alone.

The Bisbee pipeline currently relies on SplAdder for splice event detection¹⁴. While the work presented here as well as previous work demonstrate that SplAdder is a robust tool for splice event detection, in the future, we plan to adapt Bisbee to work with input from other splice detection tools and benchmark against SplAdder. Bisbee is also limited to the types of splice events detected by SplAdder. Other event types that are not currently detected include alternate first exon, alternate terminal exon, and complex events involving more than one type of alteration. Another limitation of the current approach is that it relies on short read sequencing and does not attempt to assemble a full-length transcript but rather focuses on the local changes in the transcript and protein sequences. An expansion of the approach to incorporate long read data would be useful for enabling full length sequence analysis. Currently, Bisbee only offers two statistical tests: comparison between two groups and outlier detection compared to a reference set. Future work may extend the methods to test for associations with continuous variables or other more complex experimental designs.

The differential splicing test in Bisbee uses a novel beta binomial model to test for differences in PSI. Most differential splicing tools, including the SplAdder test included in our evaluation, test for differences in expression level of the splice isoform, controlling for the overall expression level of the gene. Many of the events that are highly significant in SplAdder’s test have relatively small differences in mean PSI between the two groups (Fig. 4B). In order to identify events with more substantial differences in mean PSI between the two groups, one may directly test for a difference in PSI values using a t-test. However, we have shown that the beta binomial model implemented in Bisbee better addresses the relationship between PSI measurement accuracy and depth. Bisbee is able to detect both low coverage events with dramatic differences in PSI and high coverage events with small differences in PSI (Fig. 4A).

Bisbee is the second splicing tool that we are aware of to offer an outlier detection test. This test is intended for identifying splice isoforms unique to an individual patient compared to a set of reference samples. We were not able to compare directly to the other splice outlier detection tool (LeafCutterMD) as it does not report splice events in a way that is amenable to protein sequence generation. The case studies we presented illustrate the utility of the outlier approach in both rare disease and cancer research.

Currently, the collective use of whole-exome sequencing (WES), overlaid with RNA-Seq data, has enhanced the identification of disease-causing splice mutations and has significantly improved the diagnostic rate of rare diseases. We provided a case study of a rare disease in which three patients with a known pathogenic splice variant were available. We performed differential splicing analysis and the likely pathogenic or causal event was the highest scoring of those predicted to generate a novel amino acid sequence, illustrating how the protein level annotation can aid variant prioritization. Outlier analysis is an important approach in studying rare disease, as often only one case is available. The event was difficult to detect in these cases by the outlier analysis alone due to the very low coverage at the event locus. However, it is conceivable that the Bisbee output could still help identify the causal variant when examined alongside candidate variants from WES and knowledge of the phenotype and underlying pathways.

Previous work has suggested that splicing dysregulation in cancer may be a greater source of tumor specific antigens than somatic point mutations^{3, 25}. Application of the Bisbee outlier test to cancer patient samples may enable the discovery of tumor-specific splicing-derived neoantigens, which could be therapeutic or vaccine targets. Splice events that are both outliers compared to normal tissues and differentially spliced between SF3B1 mutant and wild-type tumors are promising candidates as tumor-specific neoantigens, as many of these are predicted to generate novel sequences through frame disruptions and insertions in the tumor-specific isoforms (Fig. 5C). SF3B1 mutant uveal melanomas have better prognosis than SF3B1 wild-type⁵⁰. We hypothesize that the tumor-specific splice isoforms associated with SF3B1 mutations may act as antigens enabling better immune control of the tumors. The protein sequence output from Bisbee would facilitate in silico MHC binding prediction to further investigate the potential immunogenicity of these splice variant generated neoantigens.

We also detected splice outliers common to uveal melanoma regardless of SF3B1 mutation status, and these results showed strong concordance in an independent melanoma cohort. Interestingly, events with predicted protein sequence impact showed stronger concordance than those with no predicted impact (Fig. 6A, B). Nine of the ten events identified as common splice variant outliers with protein impact in the TCGA uveal melanoma dataset were also detected in the SU2C melanoma dataset. These melanoma associated splice variants included several intron retention events in GAPDHS, with the tumors having lower expression of the intron-retained transcripts compared to the normal reference tissues. GAPDHS is typically expressed in sperm, but not in normal somatic tissues, and has previously been shown to be expressed in melanoma⁵¹. We hypothesize that we are seeing these events in GAPDHS due to expression of the immature transcript in the normal tissues. Four melanoma associated splice events were identified that were predicted to lead to frame disruptions in the tumors, resulting in novel protein sequence. These events are most promising for further investigation as candidate targets in melanoma.

In summary, the Bisbee package is able to predict protein sequences of both known and novel protein isoforms. It provides a more statistically powerful differential splicing test than existing methods. It also provides an outlier detection approach, which will be useful in a number of different contexts, including cancer and rare disease. The Bisbee package is publicly available, and should enable the robust detection of aberrant splicing.

Methods

Description of datasets used

For initial evaluation and optimization of the differential and outlier splicing tests implemented in Bisbee, we compared the distribution of the likelihood ratios between tests comparing samples from the same tissue and tests comparing samples from different tissues. For this analysis GTEx SplAdder results were downloaded from GDC (https://gdc.cancer.gov/about-data/publications/PanCanAtlas-Splicing-2018)³. For the differential splicing evaluation, 50 random pairs of tissues were selected, six random samples were selected from each tissue, and 100,000 events were selected for each tissue pair. The beta binomial differential splicing test was applied by grouping the samples into two groups of three replicates, both within each tissue and between the pairs of different tissues. For the outlier evaluation, 12 tissues with at least 100 samples were selected; 80 samples were randomly selected for fitting the model and 20 were selected for determining the outlier scores.

For further evaluation and benchmarking, we identified a dataset where RNAseq and mass spectrometry data were available on the same set of tissues⁸. We selected a total of seven tissues including four lower GI tissues (colon, duodenum, rectum, and small intestine) to serve as the reference set for relevant outlier analyses and three diverse tissues for comparison purposes (brain, ovary, and tonsil). RNAseq reads were downloaded from ArrayExpress (E-MTAB-2836) and aligned to the human reference genome (GRCh38) using STAR 2.7.3a in two-pass basic mapping mode, and splice events were detected using SplAdder v2.3.0 with default parameters. Two databases for searching were constructed in order to separately assess the FDR for known and novel sequences. The first consisted of known splice isoforms from Bisbee prot as well as canonical sequences from Ensembl, and the second included only novel sequences. The LC–MS/MS spectra were downloaded for the seven tissue types from the EBI PRIDE database (PXD010154). The spectra were searched using Mascot (Matrix Science, London, UK; version 2.6.0) through Proteome Discoverer 2.4 (Thermo Fisher Scientific, Waltham, MA), allowing for oxidation (Met) and carbamidomethylation (Cys) as dynamic and static modifications, respectively. A maximum of two missed cleavages was allowed, with a fragment mass tolerance of 0.02 Da and a precursor mass tolerance of 10 ppm. FDR thresholds for PSMs, peptides and proteins were set at 0.01, with a minimum of 1 peptide required for protein identification. Peptides that mapped to protein products of more than one gene were excluded from downstream analysis. Peptides that matched exclusively to one protein isoform sequence were taken as evidence for that isoform. Events where only one isoform was detected in one tissue and the other isoform was detected in a different tissue were taken as protein-level evidence of tissue-specific splicing.

For an example use case, three Leigh syndrome and five unaffected control fibroblast cell lines from the study participants were established. RNA was extracted and sequenced by Illumina paired end sequencing and aligned to the reference genome using STAR. Please see the supplementary methods for details of the cell culture, sequencing, and alignment.

For the uveal melanoma analysis, TCGA SplAdder results were downloaded from GDC (https://gdc.cancer.gov/about-data/publications/PanCanAtlas-Splicing-2018)³. The SF3B1 mutation status was obtained from cBioportal (https://bit.ly/3hagZvp)⁵².

We used an independent set of melanoma patients for comparison with the TCGA melanoma dataset, referred to here as the SU2C melanoma cohort. RNA was extracted from core needle biopsies, sequenced using Illumina paired end sequencing, and reads were aligned to the reference genome using STAR. Please see the supplementary methods for additional details on the cohort, sequencing, and alignment.

Splice event protein sequence prediction

In order to generate protein sequences corresponding to each splice event, we use known transcript sequences from Ensembl as a starting point. We first determine whether each isoform of the event exists in any known transcript by comparing the event junction coordinates to the exon coordinates (retrieved using the Python package pyEnsembl) of protein coding transcripts for that gene. Each transcript is categorized as matching isoform one, isoform two, or neither for the splice event. For each transcript matching isoform one, the isoform one junctions are removed and replaced with the isoform two junctions to make the altered sequence, and vice versa for those matching isoform two. The region of altered amino acids is found by aligning the two sequences. If the altered amino acid sequence is not found in any of the canonical sequences, the event is categorized as novel. If no transcript is found that matches either isoform, no sequence is generated and the event’s effect is categorized as unknown. In order to narrow down to one pair of protein sequences per event, the sequences are prioritized as follows: (1) pair of known transcripts, (2) longest altered amino acid sequence, (3) longest starting isoform sequence.
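The three-step prioritization can be illustrated with a short sketch. This is not the Bisbee implementation: the `CandidatePair` container, its field names, and the example transcript IDs are hypothetical, and the sketch assumes that each criterion is only used to break ties in the one before it.

```python
from dataclasses import dataclass


@dataclass
class CandidatePair:
    """One candidate pair of protein sequences for an event (hypothetical container)."""
    starting_transcript: str  # ID of the known transcript used as the starting point
    both_known: bool          # True if both isoforms match known transcripts
    altered_aa: str           # altered amino acid region found by alignment
    starting_protein: str     # protein sequence of the starting isoform


def pick_pair(candidates):
    """Return the top candidate under the three-step priority order, or None."""
    if not candidates:
        return None  # no matching transcript: the effect would be 'unknown'
    # Tuples compare element by element, so later fields only break ties.
    return max(
        candidates,
        key=lambda c: (c.both_known, len(c.altered_aa), len(c.starting_protein)),
    )


candidates = [
    CandidatePair("tx_A", False, "MKTLLV", "M" * 300),
    CandidatePair("tx_B", True, "MK", "M" * 120),
    CandidatePair("tx_C", True, "MKT", "M" * 100),
]
best = pick_pair(candidates)
print(best.starting_transcript)
```

Here `tx_C` is chosen: it is a pair of known transcripts, and its altered region is longer than that of `tx_B`, even though `tx_A` has the longest altered region overall.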

Differential splicing test (Bisbee diff)

Read counts for a splice variant are modeled as following a beta binomial distribution. Here the number of reads supporting the first isoform is the number of successes, the total number of reads covering the event is the number of trials, and the expected PSI (percent spliced in) value across the samples of interest is represented by the beta distribution. The beta distribution is reparameterized as $\psi =\frac{\alpha }{\alpha +\beta }$ and $\omega =\alpha +\beta$. $\psi$ represents the expected value of the beta distribution and $\omega$ affects the sharpness of the distribution, but is more intuitive than the actual variance of the beta distribution, which is a much more complex function of $\alpha$ and $\beta$. In the one group model, all of the samples are assumed to have the same underlying distribution of PSI values and a maximum likelihood estimate is made for $\psi$ and $\omega$. In the models below, ${I}_{n}$ indicates the number of reads supporting the first isoform for sample n, ${T}_{n}$ indicates the total number of reads supporting either isoform for sample n, and ${N}_{A}$ indicates the total number of samples.

$$L_{A} = \mathop \sum \limits_{n = 1}^{{N_{A} }} betabinomial_{pdf} \left( {I_{n} ,T_{n} ,\omega_{A} \times \Psi_{A} ,\omega_{A} \times (1 - \Psi_{A} )} \right)$$
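The last two arguments of the beta binomial density are the $\alpha$ and $\beta$ of the beta distribution written in terms of the reparameterization. As a worked aside (a standard property of the beta distribution, not a result specific to Bisbee), inverting the definitions of $\psi$ and $\omega$ gives the shape parameters and shows how $\omega$ controls the spread:

$$\alpha = \omega \psi , \quad \beta = \omega \left( 1 - \psi \right), \quad Var\left( PSI \right) = \frac{\alpha \beta }{\left( \alpha + \beta \right)^{2} \left( \alpha + \beta + 1 \right)} = \frac{\psi \left( 1 - \psi \right)}{\omega + 1}$$

For example, $\psi = 0.2$ and $\omega = 10$ correspond to $\alpha = 2$ and $\beta = 8$, with variance $0.16/11 \approx 0.015$; increasing $\omega$ at fixed $\psi$ narrows the distribution around the same expected PSI.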

In the two-group model, it is assumed that the two groups have different expected PSI values, but similar distribution shapes, so the two groups have different values of $\psi$ but the same $\omega$.

$$\begin{aligned} L_{G} & = \mathop \sum \limits_{n = 1}^{{N_{1} }} betabinomial_{pdf} \left( {I_{n} ,T_{n} ,\omega_{G} \times \Psi_{1} ,\omega_{G} \times \left( {1 - \Psi_{1} } \right)} \right) \\ & \quad + \mathop \sum \limits_{n = 1}^{{N_{2} }} betabinomial_{pdf} \left( {I_{n} ,T_{n} ,\omega_{G} \times \Psi_{2} ,\omega_{G} \times \left( {1 - \Psi_{2} } \right)} \right) \\ \end{aligned}$$

For both the one group and two group models we find the values of the parameters that maximize the sum of the probability densities across the data points. In fitting the model, we use logistic transformations as shown below to constrain $\omega$ to be greater than 2 and less than ${\omega }_{M}$ and constrain $\Psi$ to be between 0 and 1. $a$ and $b$ are the parameters that are optimized in order to ensure that the values of $\Psi$ and $\omega$ remain within the constraints.

$$\omega = \frac{{\omega_{M} }}{{1 + e^{a} }} + 2, \Psi = \frac{1}{{1 + e^{b} }}$$
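A minimal sketch of these transformations is given below, assuming nothing beyond the two expressions above. The function names are illustrative and are not taken from the Bisbee source. Evaluated as written, the expression for $\omega$ takes values between 2 and ${\omega }_{M}+2$.

```python
import math


def inverse_logistic(x):
    """Compute 1 / (1 + e^x) without overflow for large |x|."""
    if x >= 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))


def omega_from_a(a, omega_max):
    """Map the unconstrained parameter a to omega = omega_max / (1 + e^a) + 2."""
    return omega_max * inverse_logistic(a) + 2.0


def psi_from_b(b):
    """Map the unconstrained parameter b to a PSI expectation between 0 and 1."""
    return inverse_logistic(b)


def beta_params(a, b, omega_max):
    """Return (alpha, beta) = (omega * psi, omega * (1 - psi))."""
    omega = omega_from_a(a, omega_max)
    psi = psi_from_b(b)
    return omega * psi, omega * (1.0 - psi)


print(beta_params(0.0, 0.0, 100.0))
```

Because $a$ and $b$ can take any real value, an unconstrained optimizer can search over them directly while the implied $\omega$ and $\Psi$ stay within their bounds.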

The difference in the sum of the log probability densities across the two models is used to identify events that have different underlying PSI distributions in the two groups.

$$LR = log\left( {L_{A} } \right) - log\left( {L_{G} } \right)$$
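The comparison of the two models can be sketched in a few lines. This is an illustration under stated assumptions rather than the Bisbee code: it sums log probability densities, as described in the text, it replaces the numerical optimizer with a coarse grid search over $\Psi$ and $\omega$, and the grid values and example read counts are arbitrary.

```python
import math
from itertools import product


def betabinom_logpmf(k, n, alpha, beta):
    """Log probability of k successes in n trials under a beta binomial."""
    def lbeta(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + lbeta(k + alpha, n - k + beta) - lbeta(alpha, beta)


def group_loglik(counts, psi, omega):
    """Summed log probability over (isoform-one reads, total reads) pairs."""
    return sum(betabinom_logpmf(i, t, omega * psi, omega * (1 - psi)) for i, t in counts)


PSI_GRID = [x / 50 for x in range(1, 50)]    # 0.02 .. 0.98, illustrative
OMEGA_GRID = [2.5, 5, 10, 20, 50, 100, 200]  # illustrative values only


def likelihood_ratio(group1, group2):
    """One-group minus two-group log likelihood, in the order of the LR equation."""
    best_one = max(
        group_loglik(group1 + group2, psi, omega)
        for psi, omega in product(PSI_GRID, OMEGA_GRID)
    )
    best_two = -math.inf
    for omega in OMEGA_GRID:  # omega is shared; psi is fit per group
        ll1 = max(group_loglik(group1, psi, omega) for psi in PSI_GRID)
        ll2 = max(group_loglik(group2, psi, omega) for psi in PSI_GRID)
        best_two = max(best_two, ll1 + ll2)
    return best_one - best_two


low = [(10, 100), (12, 100), (9, 100)]
also_low = [(11, 100), (10, 100), (13, 100)]
high = [(80, 100), (85, 100), (78, 100)]
print(likelihood_ratio(low, also_low), likelihood_ratio(low, high))
```

With this sign convention the value is never positive, because the one-group model is a special case of the two-group model; values far below zero indicate that separate PSI expectations explain the counts much better than a shared one.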

Splicing outlier test (Bisbee outlier)

As in the two-group test, the read counts are assumed to follow a beta binomial distribution. The beta distribution parameters are found that maximize the sum of the log probability densities across a set of reference samples.

$$\{{\alpha }_{R},{\beta }_{R}\} =argmax\left({\sum }_{n=1}^{{N}_{R}}{betabinomial}_{pdf}\left({I}_{n},{T}_{n},{\alpha }_{R},{\beta }_{R}\right)\right)$$

Here ${I}_{n}$ indicates the read count supporting the isoform with mean PSI < 0.5 across the samples. Nelder-Mead optimization⁵³ (or BFGS if Nelder-Mead fails) is used to find the maximum likelihood values of ${\alpha }_{R},{\beta }_{R}$. The reparameterizations below are used to constrain ${\alpha }_{R}$ to be between $\frac{1}{{\beta }_{M}}$ and 1 and ${\beta }_{R}$ to be between 1 and ${\beta }_{M}$ so that the beta distribution is strictly decreasing. The values of the $a$ and $b$ parameters are optimized to find the maximum likelihood values of ${\alpha }_{R},{\beta }_{R}$ within the constraints.

$$\alpha_{R} = \frac{{1 - \frac{1}{{\beta_{M} }}}}{{1 + e^{a} }} + \frac{1}{{\beta_{M} }},\beta_{R} = \frac{{\beta_{M} - 1}}{{1 + e^{b} }} + 1$$

If zero reads are detected supporting the minor isoform in the reference sample set, alpha is set to one and beta is set as shown below.

$$\beta_{R} = min\left( {\beta_{M} ,\mathop \sum \limits_{n = 1}^{{N_{R} }} T_{n} } \right)$$

For each sample of interest, the log cumulative probability of the major isoform read counts being less than or equal to those observed given the total read depth and the beta distribution fit to the reference sample is used as the outlier score.

$$LL_{s} = log\left( {betabinomial_{cdf} \left( {I_{s} ,T_{s} ,\alpha_{R} ,\beta_{R} } \right)} \right)$$
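The outlier score can be sketched as follows. This is an illustration, not the Bisbee implementation. It assumes that ${\alpha }_{R}$ and ${\beta }_{R}$ describe the PSI of the minor isoform, so that the major isoform read count follows a beta binomial with the two parameters swapped; the function names, the value used for ${\beta }_{M}$, and the example read counts are all hypothetical.

```python
import math


def betabinom_logpmf(k, n, alpha, beta):
    """Log probability of k successes in n trials under a beta binomial."""
    def lbeta(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + lbeta(k + alpha, n - k + beta) - lbeta(alpha, beta)


def betabinom_logcdf(k, n, alpha, beta):
    """Log of P(X <= k), accumulated in log space for numerical stability."""
    logs = [betabinom_logpmf(i, n, alpha, beta) for i in range(k + 1)]
    peak = max(logs)
    total = peak + math.log(sum(math.exp(v - peak) for v in logs))
    return min(total, 0.0)  # guard against rounding slightly above log(1)


def fallback_params(reference_totals, beta_max):
    """Parameters used when the reference set has zero minor-isoform reads."""
    return 1.0, float(min(beta_max, sum(reference_totals)))


def outlier_score(major_reads, total_reads, alpha_r, beta_r):
    """Log probability of observing this many major-isoform reads or fewer.

    Assumption: alpha_r and beta_r were fit to the minor isoform, so the
    major-isoform count is beta binomial with the parameters swapped.
    """
    return betabinom_logcdf(major_reads, total_reads, beta_r, alpha_r)


typical = outlier_score(99, 100, 0.5, 50.0)  # 1 minor-isoform read out of 100
extreme = outlier_score(40, 100, 0.5, 50.0)  # 60 minor-isoform reads out of 100
print(typical, extreme)
```

Scores close to zero indicate samples that are consistent with the reference fit, while strongly negative scores flag samples carrying far more of the minor isoform than the reference set would predict.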

Implementation of other differential splicing methods

For benchmarking we selected SplAdder’s differential splicing test using default parameters as a representative of the generalized linear model approach. We also wanted to include a simple method directly testing differences in PSI values. While a non-parametric test would be more appropriate, as PSI values are unlikely to be normally distributed, we would not have any power to detect differences with only three or four replicates per group. Instead, we used a two-sample t-test, with the more conservative assumption of unequal variance, on the PSI values. SplAdder only reports PSI values for samples with a coverage of at least 10 for a given event, though PSI values can still be calculated from the isoform one and two coverages. We applied the t-test both to all PSI values and with the data points with depth less than 10 treated as missing data.
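The unequal-variance (Welch) comparison of PSI values can be sketched with the standard library alone. The function names and the example coverages are hypothetical, and the sketch stops at the test statistic and the Welch-Satterthwaite degrees of freedom; it assumes each group keeps at least two samples with non-zero variance after filtering.

```python
import math
from statistics import mean, variance


def psi_values(iso1_cov, iso2_cov, min_depth=None):
    """PSI per sample from the two isoform coverages; low depth becomes None."""
    out = []
    for c1, c2 in zip(iso1_cov, iso2_cov):
        depth = c1 + c2
        if depth == 0 or (min_depth is not None and depth < min_depth):
            out.append(None)  # treated as missing data
        else:
            out.append(c1 / depth)
    return out


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    x = [v for v in x if v is not None]
    y = [v for v in y if v is not None]
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


group1 = psi_values([10, 20, 30], [90, 80, 70])
group2 = psi_values([60, 70, 80], [40, 30, 20])
print(welch_t(group1, group2))
```

Passing `min_depth=10` reproduces the variant in which low-coverage data points are treated as missing. Where SciPy is available, `scipy.stats.ttest_ind(x, y, equal_var=False)` performs the same test and also returns a p-value.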

Implementation of other splicing outlier detection methods

We implemented two simple methods using the distribution of PSIs for comparison. The first finds the median absolute deviation from the set of reference samples. Below, ${x}_{s}$ indicates the PSI of the sample s and ${x}_{R}$ indicates the array of PSI values associated with reference samples.

$${MAD}_{s}=\frac{{x}_{s}-median({x}_{R})}{max(median(\left|{x}_{R}-median({x}_{R})\right|),0.01)}$$

The second normalizes to the interquartile range.

$${x}_{s}<{Q}_{25}:{IQR}_{s}= \frac{{Q}_{25}-{x}_{s}}{max({Q}_{75}-{Q}_{25},0.01)}$$

$${Q}_{25}\le {x}_{s}\le {Q}_{75}: {IQR}_{s}=0$$

$${x}_{s}>{Q}_{75}:{IQR}_{s}= \frac{{x}_{s}-{Q}_{75}}{max({Q}_{75}-{Q}_{25},0.01)}$$

We applied both of these methods using all of the PSI values, as well as treating data points with depth less than 10 as missing data.
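Both comparison scores are simple enough to sketch directly from their definitions. The function names are illustrative, and the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption, since the interpolation rule is not specified above.

```python
from statistics import median, quantiles


def mad_score(x_s, x_ref, floor=0.01):
    """Signed deviation from the reference median, in units of the MAD."""
    med = median(x_ref)
    mad = median(abs(x - med) for x in x_ref)
    return (x_s - med) / max(mad, floor)


def iqr_score(x_s, x_ref, floor=0.01):
    """Distance outside the interquartile range, in units of the IQR."""
    q25, _, q75 = quantiles(x_ref, n=4, method="inclusive")
    spread = max(q75 - q25, floor)
    if x_s < q25:
        return (q25 - x_s) / spread
    if x_s > q75:
        return (x_s - q75) / spread
    return 0.0  # inside the interquartile range


reference_psi = [0.1, 0.2, 0.3, 0.4, 0.5]
print(mad_score(0.6, reference_psi), iqr_score(0.6, reference_psi))
```

The floor of 0.01 in both denominators keeps the scores finite when the reference PSI values are nearly identical. To treat low-depth data points as missing, they would be dropped from `x_ref` before calling either function.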

Ethics approval and consent to participate

The study protocol and written informed consent for the use of human fibroblast cells of the MTFMT cases and controls were approved by the Western Institutional Review Board (WIRB; study number 20120789). The SU2C melanoma biopsies were collected through a clinical trial (NCT02094872). Ethics review boards at all participating institutions approved the study, which was conducted in accordance with the Declaration of Helsinki and Good Clinical Practice guidelines. All patients provided written informed consent.

Data availability

The RNA sequencing data of the MTFMT cases and controls will be deposited in dbGaP. The SU2C RNA sequencing data are available in dbGaP under accession phs001786.v1.p1. The Bisbee package is available at https://github.com/tgen/bisbee.

References

Park, E., Pan, Z., Zhang, Z., Lin, L. & Xing, Y. The expanding landscape of alternative splicing variation in human populations. Am. J. Hum. Genet. 102, 11–26 (2018).
Gamazon, E. R. & Stranger, B. E. Genomics of alternative splicing: Evolution, development and pathophysiology. Hum. Genet. 133, 679–687 (2014).
Kahles, A. et al. Comprehensive analysis of alternative splicing across tumors from 8,705 patients. Cancer Cell 34, 211.e6-224.e6 (2018).
Dvinge, H., Kim, E., Abdel-Wahab, O. & Bradley, R. K. RNA splicing factors as oncoproteins and tumor suppressors. Nat. Rev. Cancer. 16, 413–430 (2016).
Li, Y. I. et al. RNA splicing is a primary link between genetic variation and disease. Science 352, 600–604 (2016).
Barbeira, A. N. et al. Widespread dose-dependent effects of RNA expression and splicing on complex diseases and traits. bioRxiv 814350 (2019).
Anna, A. & Monika, G. Splicing mutations in human genetic disorders: examples, detection, and confirmation. J. Appl. Genet. 59, 253–268 (2018).
Wang, D. et al. A deep proteome and transcriptome abundance atlas of 29 healthy human tissues. Mol. Syst. Biol. 15, e8503 (2019).
Frazee, A. C. et al. Ballgown bridges the gap between transcriptome assembly and expression analysis. Nat. Biotechnol. 33, 243–246 (2015).
Katz, Y., Wang, E. T., Airoldi, E. M. & Burge, C. B. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7, 1009–1015 (2010).
Shen, S. et al. rMATS: Robust and flexible detection of differential alternative splicing from replicate RNA-Seq data. Proc. Natl. Acad. Sci. 111, E5593–E5601 (2014).
Trincado, J. L. et al. SUPPA2: fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions. Genome Biol. https://doi.org/10.1186/s13059-018-1417-1 (2018).
Mancini, E., Iserte, J., & Yanovsky, M. ASpli: An integrative R package for analysing alternative splicing using RNA-Seq. 34.
Kahles, A., Ong, C. S., Zhong, Y. & Rätsch, G. SplAdder: Identification, quantification and testing of alternative splicing events from RNA-Seq data. Bioinformatics 32, 1840–1847 (2016).
Goldstein, L. D. et al. Prediction and quantification of splice events from RNA-Seq data. PLoS ONE 11, e0156132 (2016).
Li, Y. I. et al. Annotation-free quantification of RNA splicing using LeafCutter. Nat. Genet. 50, 151–158 (2018).
Vaquero-Garcia, J. et al. A new view of transcriptome complexity and regulation through the lens of local splicing variations. Elife 5, e11752 (2016).
Gerstung, M. et al. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nat. Commun. 3, 811 (2012).
Christoforides, A. et al. Identification of somatic mutations in cancer through Bayesian-based analysis of sequenced genome pairs. BMC Genomics 14, 302 (2013).
Halperin, R. F. et al. Leveraging spatial variation in tumor purity for improved somatic variant calling of archival tumor only samples. Front. Oncol. https://doi.org/10.3389/fonc.2019.00119/abstract (2019).
Shiraishi, Y. et al. An empirical Bayesian framework for somatic mutation detection from cancer genome sequencing data. Nucleic Acids Res. 41, e89 (2013).
Urbanski, L. M., Leclair, N. & Anczuków, O. Alternative-splicing defects in cancer: Splicing regulators and their downstream targets, guiding the way to novel cancer therapeutics. Wiley Interdiscip. Rev. RNA. 9, e1476 (2018).
Jayasinghe, R. G. et al. Systematic analysis of splice-site-creating mutations in cancer. Cell Rep. 23, 270.e3-281.e3 (2018).
Hoyos, L. E. & Abdel-Wahab, O. Cancer-specific splicing changes and the potential for splicing-derived neoantigens. Cancer Cell 34, 181–183 (2018).
Frankiw, L., Baltimore, D. & Li, G. Alternative mRNA splicing in cancer immunotherapy. Nat. Rev. Immunol. 19, 675–687 (2019).
Jenkinson, G. et al. LeafCutterMD: An algorithm for outlier splicing detection in rare diseases. Bioinformatics 36, 4609–4615. https://doi.org/10.1093/bioinformatics/btaa259/5823301 (2020).
Boise, L. H. et al. bcl-x, a bcl-2-related gene that functions as a dominant regulator of apoptotic cell death. Cell 74, 597–608 (1993).
Zhang, H. et al. Integrated proteogenomic characterization of human high-grade serous ovarian cancer. Cell 166, 755–765 (2016).
Mertins, P. et al. Proteogenomics connects somatic mutations to signalling in breast cancer. Nature 534, 55–62 (2016).
Vasaikar, S. et al. Proteogenomic analysis of human colon cancer reveals new therapeutic opportunities. Cell 177, 1035.e19-1049.e19 (2019).
Castellana, N. & Bafna, V. Proteogenomics to discover the full coding content of genomes: A computational perspective. J. Proteomics 73, 2124–2135 (2010).
Nesvizhskii, A. I. Proteogenomics: Concepts, applications and computational strategies. Nat. Methods 11, 1114–1125 (2014).
Alfaro, J. A., Sinha, A., Kislinger, T. & Boutros, P. C. Onco-proteogenomics: cancer proteomics joins forces with genomics. Nat. Methods 11, 1107–1113 (2014).
Ruggles, K. V. & Fenyö, D. Next generation sequencing data and proteogenomics. In Proteogenomics (ed. Végvári, Á.) 11–19 (Springer, 2016). https://doi.org/10.1007/978-3-319-42316-6_2.
Sheynkman, G. M., Shortreed, M. R., Cesnik, A. J. & Smith, L. M. Proteogenomics: Integrating next-generation sequencing and mass spectrometry to characterize human proteomic variation. Annu. Rev. Anal. Chem. 9, 521–545 (2016).
Ruggles, K. V. et al. An analysis of the sensitivity of proteogenomic mapping of somatic mutations and novel splicing events in Cancer. Mol. Cell Proteomics MCP 15, 1060–1071 (2016).
Ye, X. et al. Improving silkworm genome annotation using a proteogenomics approach. J. Proteome Res. 18, 3009–3019 (2019).
Wu, P. et al. PASS: A proteomics alternative splicing screening pipeline. Proteomics 19, 1900041 (2019).
Krasnov, G. S. et al. PPLine: An automated pipeline for SNP, SAP, and splice variant detection in the context of proteogenomics. J. Proteome Res. 14, 3729–3737 (2015).
Jeong, S.-K., Kim, C.-Y. & Paik, Y.-K. ASV-ID, a proteogenomic workflow to predict candidate protein isoforms on the basis of transcript evidence. J. Proteome Res. 17, 4235–4242 (2018).
Komor, M. A. et al. Identification of differentially expressed splice variants by the proteogenomic pipeline splicify. Mol. Cell Proteomics 16, 1850–1863 (2017).
Tucker, E. J. et al. Mutations in MTFMT underlie a human disorder of formylation causing impaired mitochondrial translation. Cell Metab. 14, 428–434 (2011).
Haack, T. B. et al. Phenotypic spectrum of eleven patients and five novel MTFMT mutations identified by exome sequencing and candidate gene screening. Mol Genet Metab. 111, 342–352 (2014).
Hayhurst, H. et al. Leigh syndrome caused by mutations in MTFMT is associated with a better prognosis. Ann. Clin. Transl. Neurol. 6, 515–524 (2019).
Neeve, V. C. M. et al. Clinical and functional characterisation of the combined respiratory chain defect in two sisters due to autosomal recessive mutations in MTFMT. Mitochondrion 13, 743–748 (2013).
Fairbrother, W. G., Yeh, R.-F., Sharp, P. A. & Burge, C. B. Predictive identification of exonic splicing enhancers in human genes. Science 297, 1007–1013 (2002).
Wang, Z. et al. Systematic identification and analysis of exonic splicing silencers. Cell 119, 831–845 (2004).
Furney, S. J. et al. SF3B1 mutations are associated with alternative splicing in uveal melanoma. Cancer Discov. 3, 1122–1129 (2013).
Alsafadi, S. et al. Cancer-associated SF3B1 mutations affect alternative splicing by promoting alternative branchpoint usage. Nat. Commun. 7, 10615 (2016).
Harbour, J. W. et al. Recurrent mutations at codon 625 of the splicing factor SF3B1 in uveal melanoma. Nat. Genet. 45, 133–135 (2013).
Sevostyanova, I. A., Kulikova, K. V., Kuravsky, M. L., Schmalhausen, E. V. & Muronetz, V. I. Sperm-specific glyceraldehyde-3-phosphate dehydrogenase is expressed in melanoma cells. Biochem. Biophys. Res. Commun. 427, 649–653 (2012).
Cerami, E. et al. The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2, 401–404. https://doi.org/10.1158/2159-8290.CD-12-0095 (2012).
Nelder, J. A. & Mead, R. A simplex method for function minimization. Comput. J. 7, 308–313 (1965).

Acknowledgements

The authors would like to thank Krystine Garcia-Mansfield for input on setting up the mass spectrometry searches and Megan Johnson for beta testing and debugging the Bisbee package.

Funding

The development and evaluation of Bisbee were funded by the Sylvia Chase Early Career award to RFH as well as donations from Dell, Inc. AH, EAR, JL and PP are funded in part by R01 CA195670. NJS is funded in part by NIH grants UH2 AG064706, U19 AG023122, U24 AG051129, U24 AG051129-04S1; NSF grant (FAIN number) 2031819; and the Ivy and Ottesen Foundations. RS and the C4RCD team are supported by private donations to the TGen Foundation and Center for Rare Childhood Disorders (C4RCD). The melanoma patient study was funded by a Stand Up To Cancer (SU2C) – Melanoma Research Alliance Melanoma Dream Team Translational Cancer Research Grant (#SU2C-AACR-DT0612) and the Gateway for Cancer Research Foundation (#G-12–500). Stand Up To Cancer is a program of the Entertainment Industry Foundation administered by the American Association for Cancer Research (AACR). Research reported in this publication included work performed in the Mass Spectrometry & Proteomics Core Facility supported by the National Cancer Institute of the National Institutes of Health under grant number P30CA033572. This work was also supported by the Ovarian Cancer Research Alliance [Ann and Sol Schreiber Mentored Investigator Award 650001] to EAR. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

These authors contributed equally: Patrick Pirrotte and Nicholas J. Schork.
A comprehensive list of consortium members appears at the end of the paper.

Authors and Affiliations

Quantitative Medicine and Systems Biology Division, Translational Genomics Research Institute, Phoenix, AZ, USA
Rebecca F. Halperin & Nicholas J. Schork
Collaborative Center for Translational Mass Spectrometry, Translational Genomics Research Institute, Phoenix, AZ, USA
Apurva Hegde & Patrick Pirrotte
Integrated Cancer Genomics Division, Translational Genomics Research Institute, Phoenix, AZ, USA
Jessica D. Lang, Elizabeth A. Raupach, Christophe Legendre, Winnie S. Liang & Jeffrey M. Trent
Neurogenomics Division, Translational Genomics Research Institute, Phoenix, AZ, USA
Vinodh Narayanan, Matt Huentelman, Newell Belnap, Anne-Marie Aziz, Keri Ramsey, Winnie S. Liang & Sampathkumar Rangasamy
Yale Cancer Center, New Haven, CT, USA
Patricia M. LoRusso
Mayo Clinic, Scottsdale, AZ, USA
Aleksandar Sekulic
Northwestern Medicine, Chicago, IL, USA
Jeffrey A. Sosman

Consortia

C4RCD Research Group

Vinodh Narayanan, Matt Huentelman, Newell Belnap, Anne-Marie Aziz & Keri Ramsey

Contributions

R.F.H. designed and implemented the Bisbee package and performed the splicing analysis. A.H. performed the proteomics analysis and beta tested the Bisbee package. J.D.L., E.A.R., P.P., and N.J.S. contributed to the interpretation of the data and writing of the manuscript. R.S., C.L., and the C4RCD research group designed and performed the experiments and contributed to the analysis and interpretation of the rare disease dataset. W.S.L., P.M.L., A.S., J.A.S., and J.M.T. designed and performed the experiments and contributed to the analysis and interpretation of the melanoma dataset. P.P. conceived the proteomics evaluation strategy and N.J.S. contributed to the statistical analysis. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Rebecca F. Halperin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Halperin, R.F., Hegde, A., Lang, J.D. et al. Improved methods for RNAseq-based alternative splicing analysis. Sci Rep 11, 10740 (2021). https://doi.org/10.1038/s41598-021-89938-2

Received: 31 August 2020
Accepted: 13 April 2021
Published: 24 May 2021
DOI: https://doi.org/10.1038/s41598-021-89938-2
