Integrated analysis of genomic and transcriptomic data for the discovery of splice-associated variants in cancer

Somatic mutations within non-coding regions, and even within exons, may have unidentified regulatory consequences that are often overlooked in analysis workflows. Here we present RegTools (www.regtools.org), a computationally efficient, free, and open-source software package designed to integrate somatic variants from genomic data with splice junctions from bulk or single-cell transcriptomic data to identify variants that may cause aberrant splicing. We apply RegTools to over 9000 tumor samples with both tumor DNA and RNA sequence data. RegTools discovers 235,778 events where a splice-associated variant significantly increases the splicing of a particular junction, across 158,200 unique variants and 131,212 unique junctions. To characterize these somatic variants and their associated splice isoforms, we annotate them with the Variant Effect Predictor, SpliceAI, and Genotype-Tissue Expression junction counts and compare our results to other tools that integrate genomic and transcriptomic data. While many events are corroborated by the aforementioned tools, the flexibility of RegTools also allows us to identify splice-associated variants in known cancer drivers, such as TP53, CDKN2A, and B2M, and in other genes.


Supplementary Figure 2. Overview of input data considered and significant events identified by RegTools for each tumor type
A) Summary of initial variants considered for analysis by RegTools per sample per tumor cohort. Each sample's variant count is plotted, and violin plots are overlaid for each cohort. B) Summary of unique exon-exon junction observations for each sample. Each sample's unique junction count is plotted, and violin plots are overlaid for each cohort. C) Summary of significant junction types for each cohort for each of the splice variant window parameters that were used in this analysis. Source data are provided as a Source Data file.

Supplementary Figure 3. Summary of variants analyzed by RegTools in each tumor cohort
Summary of the starting number of high-quality variants per sample, the number of initial variants considered for analysis by RegTools for each variant window used per tumor cohort, and the number of significant variants for each variant window used per tumor cohort. Source data are provided as a Source Data file.
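The variant windows referred to in these captions can be illustrated with a small sketch. It assumes that a label such as i2e3 denotes variants lying within 2 bp on the intronic side or 3 bp on the exonic side of an exon edge. This reading of the label, the `Exon` type, and the `in_splice_window` helper are illustrative assumptions and are not taken from the RegTools source code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    """Hypothetical exon record with 1-based, inclusive coordinates."""
    start: int
    end: int


def in_splice_window(pos: int, exon: Exon,
                     intronic_bp: int = 2, exonic_bp: int = 3) -> bool:
    """Return True if `pos` falls inside the window around either exon edge.

    Assumption: the window spans `exonic_bp` bases inside the exon
    (counting the edge base itself) and `intronic_bp` bases of the
    flanking intron.
    """
    # For the start edge, moving right goes into the exon (+1);
    # for the end edge, moving left goes into the exon (-1).
    for edge, into_exon in ((exon.start, 1), (exon.end, -1)):
        # offset >= 0: inside the exon (0 is the edge base itself);
        # offset < 0: in the flanking intron.
        offset = (pos - edge) * into_exon
        if -intronic_bp <= offset < exonic_bp:
            return True
    return False
```

Widening `intronic_bp` or `exonic_bp` in this sketch corresponds to considering more initial variants per sample, which is the quantity summarized per window in the figure.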

Supplementary Figure 6. Pan-cancer analysis of cohorts from TCGA and MGI reveals genes recurrently disrupted by variants that promote splicing of particular canonical junctions
Results of analysis for recurrently disrupted genes in each TCGA and MGI cohort. A) Rows correspond to the 40 most frequently recurring genes, as ranked by binomial p-value (see Methods, Identification of genes with recurrent splice-associated variants). Genes are clustered by whether they were annotated by the CGC as an oncogene (red), both an oncogene and a tumor suppressor gene (yellow), a tumor suppressor gene (green), or another type of cancer-relevant gene (black, bold). Shading corresponds to −log10(p-value) and columns represent cancer types. Blue marks within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort. B) Rows correspond to the 40 most frequently recurring genes, as ranked by the fraction of samples. Shading corresponds to the fraction of samples and columns represent cancer types. Blue dots within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort. These results were obtained using the default splice variant window parameter (i2e3). Source data are provided as a Source Data file.
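The ranking and shading described in panel A can be sketched as follows. The Methods section is not reproduced here, so this is only a minimal illustration under stated assumptions: a one-sided (upper-tail) binomial test per gene, with a per-gene background probability supplied by the caller. The function names, the background rates, and the input layout are hypothetical and may differ from the analysis actually performed.

```python
import math


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are computed in log space so that large cohorts do not
    overflow the binomial coefficient.
    """
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def rank_genes(hits: dict, n_samples: int, background: dict) -> list:
    """Rank genes by binomial p-value, smallest first.

    hits[gene]       -- samples with a significant event in the gene (assumed input)
    background[gene] -- assumed per-sample probability of such an event by chance
    Returns (gene, p_value, minus_log10_p) tuples; the last value is the
    quantity used for shading in a heatmap of this kind.
    """
    rows = []
    for gene, k in hits.items():
        pval = binomial_upper_tail(k, n_samples, background[gene])
        shade = -math.log10(pval) if pval > 0 else math.inf
        rows.append((gene, pval, shade))
    return sorted(rows, key=lambda row: row[1])
```

Keeping the first 40 entries of the returned list per cohort would give the rows of a panel like A. Panel B instead ranks by the fraction of samples, which in this sketch would be `hits[gene] / n_samples`.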

Supplementary Figure 7. TCGA pan-cancer analysis reveals genes recurrently disrupted by variants that cause non-canonical splicing patterns
Results of analysis for recurrently disrupted genes in each TCGA cohort. A) Rows correspond to the 40 most frequently recurring genes, as ranked by binomial p-value (see Methods, Identification of genes with recurrent splice-associated variants). Genes are clustered by whether they were annotated by the CGC as an oncogene (red), both an oncogene and a tumor suppressor gene (yellow), a tumor suppressor gene (green), or another type of cancer-relevant gene (black, bold). Shading corresponds to −log10(p-value) and columns represent cancer types. Blue marks within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort. B) Rows correspond to the 40 most frequently recurring genes, as ranked by the fraction of samples. Shading corresponds to the fraction of samples and columns represent cancer types. Blue dots within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort. These results were obtained using the default splice variant window parameter (i2e3). Source data are provided as a Source Data file.

Supplementary Figure 8. TCGA pan-cancer analysis reveals genes recurrently disrupted by variants that promote splicing of particular canonical junctions
Results of analysis for recurrently disrupted genes in each TCGA cohort. A) Rows correspond to the 40 most frequently recurring genes, as ranked by binomial p-value (see Methods, Identification of genes with recurrent splice-associated variants). Genes are clustered by whether they were annotated by the CGC as an oncogene (red), both an oncogene and a tumor suppressor gene (yellow), a tumor suppressor gene (green), or another type of cancer-relevant gene (black, bold). Shading corresponds to −log10(p-value) and columns represent cancer types. Blue marks within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort. B) Rows correspond to the 40 most frequently recurring genes, as ranked by the fraction of samples. Shading corresponds to the fraction of samples and columns represent cancer types. Blue dots within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort. These results were obtained using the default splice variant window parameter (i2e3). Source data are provided as a Source Data file.