Empirical prediction of variant-activated cryptic splice donors using population-based RNA-Seq data

Dawes, Ruebena; Joshi, Himanshu; Cooper, Sandra T.

doi:10.1038/s41467-022-29271-y

Article
Open access
Published: 29 March 2022

Nature Communications volume 13, Article number: 1655 (2022)

Abstract

Predicting which cryptic-donors may be activated by a splicing variant in patient DNA is notoriously difficult. Through analysis of 5145 cryptic-donors (versus 86,963 decoy-donors not used; any GT or GC), we define an empirical method predicting cryptic-donor activation with 87% sensitivity and 95% specificity. Strength (according to four algorithms) and proximity to the annotated-donor appear important determinants of cryptic-donor activation. However, other factors such as splicing regulatory elements, which are difficult to identify, play an important role and are likely responsible for current prediction inaccuracies. We find that the most frequently recurring natural mis-splicing events at each exon-intron junction, summarised over 40,233 RNA-sequencing samples (40K-RNA), predict with accuracy which cryptic-donor will be activated in rare disease. 40K-RNA provides an accurate, evidence-based method to predict variant-activated cryptic-donors in genetic disorders, assisting pathology consideration of possible consequences of a variant for the encoded protein and RNA diagnostic testing strategies.

Introduction

Genetic variants affecting the conserved sequences of the consensus splicing motifs can alter binding of spliceosomal components and induce mis-splicing of precursor messenger RNA (pre-mRNA)¹, making them a common cause of inherited disorders^2,3,4,5. Splicing variants can simultaneously induce different mis-splicing outcomes, including skipping of one or more exons, activation of a cryptic splice-site(s), and/or retention of one or more introns¹. Whether induced mis-splicing disrupts the reading frame or affects a region of known functional (and clinical) importance has significant diagnostic implications. Therefore, knowing the specific mis-splicing outcome of a genetic variant is necessary to conclusively link it to a disease. While the accuracy of in silico algorithms in predicting whether a variant will cause mis-splicing has been comprehensively assessed^6,7,8,9, there is currently no reliable means to predict which mis-splicing event(s) may occur in response to a variant that activates mis-splicing. As a result of this and other factors, the vast majority of splice site variants are classified as variants of uncertain significance (VUS); a non-actionable diagnostic endpoint in genomic medicine¹⁰.

We recently evaluated the accuracy and concordance of SpliceAI (SAI)¹¹ and algorithms within Alamut Visual® (Interactive Biosoftware, Rouen, France)^12,13, to predict splicing outcomes arising from genetic variants identified in 74 families with monogenetic conditions subject to RNA diagnostic studies (79 variants; 19% essential GT-AG splice-site variants and 71% extended splice-site variants)¹⁴. Algorithmic predictions of the strengths of activated cryptic splice sites were highly discordant, especially for cryptic donors. SAI’s deep learning showed the greatest accuracy in predicting activated cryptic splice-site(s) (66% true positive with 34% false negative), whereas historical algorithms within Alamut Visual® resulted in 34–69% false negatives¹⁴.

In this study we focus on determining empirical features that inform prediction of variant-associated spliceosomal selection of a cryptic-donor, in preference to the annotated-donor and other nearby decoy-donors (any GT or GC not used by the spliceosome). Through analysis of 4811 variants in 3399 genes, we show that while splice-site strength and proximity to the annotated-donor strongly influence spliceosomal selection of a cryptic-donor, these factors alone are not sufficient for accurate prediction. Importantly, we show that the most common mis-splicing events seen at each exon-intron junction across 40,233 publicly available RNA-seq samples compiled within the 40K-RNA database, predict with accuracy which cryptic-donor will be activated in rare disease.

Results

Reference database of variants activating a cryptic-donor

We collate a database of cryptic-donor variants, defined as variant-associated erroneous use of a donor other than the annotated-donor. Variants were derived from several sources^11,15,16 (Fig. 1a, see methods). The genomic locations and extended sequences of the annotated-donor, cryptic-donor(s), as well as any decoy-donors (any GT/GC motif within 250 nucleotides (nt) of the annotated-donor), were compiled for analysis. We define the extended donor splice-site region as spanning the fourth to last nucleotide of the exon (E^-4, E = exon) to the eighth nucleotide of the intron (D⁺⁸; D = donor), as constraint on sequence diversity eases beyond this point (supplementary Fig. 1).

Cryptic-donor variants fall into three categories (Fig. 1b, Box 1): A) Annotated-Modified (AM): a genetic variant modifies the annotated-donor, resulting in activation of one or more unmodified cryptic-donors (n = 2186) (Fig. 1c–e). AM-variants which are SNVs and DNA insertions commonly affect the E^-1, D⁺¹, D⁺² and D⁺⁵ positions of the annotated-donor (Fig. 1c), and AM-variants which are DNA deletions ranged from 1 to 57 nts in length (Fig. 1d). 89% of AM-variants result in use of a single cryptic-donor, 9% activate 2 cryptic-donors and 2% activate 3 or more cryptic-donors (Fig. 1e).

B) Cryptic-Modified (CM): a genetic variant modifies a cryptic-donor and does not affect the annotated-donor (n = 2252) (Fig. 1f, g). CM-variants most frequently affect the D⁺² position of the cryptic-donor (Fig. 1f), with 32% of all CM SNVs changing the cryptic-donor essential splice motif from GC to GT (Fig. 1g).

C) Annotated-Modified/Cryptic-Modified (AM/CM): a genetic variant that simultaneously modifies the annotated-donor and nearby cryptic-donor (n = 373) (Fig. 1h–j). AM/CM-variants which are SNVs and DNA insertions also most frequently affect the D⁺² position (122/373) of the cryptic-donor (Fig. 1h), with 31% of SNVs converting a GC to GT (Fig. 1i). AM/CM-variants which are DNA deletions range from 1 to 36 nts in length (Fig. 1j).

Box 1 Glossary

Annotated-donor: A donor in an Ensembl-annotated transcript.

Decoy-donor: Any essential donor dinucleotide (GT/GC) that is not an annotated-donor.

Cryptic-donor: A decoy-donor shown to be activated (i.e. used by the spliceosome) by a genetic variant.

Annotated-Modified (AM): A genetic variant modifies the annotated-donor, resulting in activation of one or more unmodified cryptic-donors.

Cryptic-Modified (CM): A genetic variant modifies a cryptic-donor and does not affect the annotated-donor.

Annotated-Modified/Cryptic-Modified (AM/CM): A genetic variant that simultaneously modifies the annotated-donor and nearby cryptic-donor.

Donor Frequency (DF): A measure of donor strength based on how many annotated-donors in the human genome have the exact same sequence.

Competitive decoy-donor: A decoy-donor with a DF score at least 10% the score of the nearby annotated-donor.

40K-RNA: An aggregated database of splice-junctions detected across 40,233 publicly available RNA-seq samples.

87% of cryptic-donors lie within 250 nt of the annotated-donor

99% of cryptic-donors activated by AM-variants and 71% of cryptic-donors activated by CM-variants lie within 250 nt of the annotated-donor (87% collectively, Fig. 2a, b). By definition, AM/CM-variants activate a cryptic-donor that spatially overlaps the annotated-donor; 26% of AM/CM cryptic-donors lie at either the E^-4 or D⁺⁵ position (Fig. 2c). For exonic cryptic-donors activated at E^-4, the GT at D^+1/+2 of the annotated-donor becomes D^+5/+6 of the cryptic-donor; conversely, for intronic cryptic-donors activated at D⁺⁵, the GT at D^+5/+6 of the annotated-donor becomes D^+1/+2 of the cryptic-donor.

**Fig. 2: Cryptic-donor activation is influenced by proximity to the annotated-donor.**

While decoy-donors are present everywhere, which ones are selected as cryptic-donors by the spliceosome in the context of a genetic variant appears strongly influenced by their proximity to the annotated-donor (Fig. 2a, b), as shown by their enrichment at proximal locations relative to all decoys present in the genome (Fig. 2d). The steep decline in exonic decoys (Fig. 2d, left) is due to the shorter lengths of exons limiting their frequency at these distances (50th and 90th percentile for exon length shown). Notably, each annotated-donor has on average 36 decoy-donors within ±250 nt not used by the spliceosome – indicating that features other than proximity to the annotated-donor define a usable cryptic-donor (Fig. 2e).

Only 31–67% of cryptic-donors are stronger than the annotated-donor

We examined the ability of four algorithmic measures of splice-site strength to predict cryptic-donor activation (Fig. 3). We compared the performance of MaxEntScan (MES)¹³, NNSplice (NNS)¹² and SpliceAI (SAI)¹¹ as well as our own method termed Donor Frequency (DF) (see methods and supplementary Fig. 1 for details, supplementary Fig. 2a–c for full plots). DF measures donor strength based on how many annotated-donors in the human genome have the exact same sequence. DF calculates the median frequency of four consecutive windows of nine nucleotides in length (between E^-4 and D⁺⁸) among all annotated-donors, converted to a cumulative frequency distribution. For example, if an E^-3 to D⁺⁶ sequence has a raw frequency of 222, this combination of nine bases occurs at the analogous position for 222 annotated-donors, corresponding to the 35th percentile of a cumulative frequency distribution across the human genome (see supplementary Fig. 1c). For these and all further analyses, we excluded the 1113 cryptic variants in the database derived from SAI predictions already validated on GTEx RNA-seq data¹¹. Our nomenclature of REF and VAR corresponds to the reference (REF) or variant (VAR) donor sequence.
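The DF calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`window_counts`, `raw_df`, `df_percentile`) are hypothetical, each donor is assumed to be supplied as a 12-nt E^-4 to D⁺⁸ string, and the percentile is taken as the share of annotated-donors whose raw DF is at or below that of the query, which may differ from the exact cumulative distribution used in the paper.

```python
from bisect import bisect_right
from collections import Counter
from statistics import median

WINDOW = 9  # window length in nt
SITE = 12   # length of the E-4..D+8 region (4 exonic + 8 intronic nt)


def window_counts(annotated_sites):
    """Count every 9-mer at each of the four window offsets across annotated donors."""
    counts = [Counter() for _ in range(SITE - WINDOW + 1)]
    for site in annotated_sites:
        for k, counter in enumerate(counts):
            counter[site[k:k + WINDOW]] += 1
    return counts


def raw_df(site, counts):
    """Median, over the four windows, of how many annotated donors share that 9-mer."""
    return median(counts[k][site[k:k + WINDOW]] for k in range(len(counts)))


def df_percentile(site, annotated_sites, counts):
    """Assumed percentile scale: % of annotated donors with raw DF <= the query's raw DF."""
    reference = sorted(raw_df(s, counts) for s in annotated_sites)
    return 100.0 * bisect_right(reference, raw_df(site, counts)) / len(reference)


# Toy reference set (hypothetical sequences, for illustration only).
annotated = ["CAAGGTAAGTCC"] * 3 + ["AAAGGTGAGTAA"]
counts = window_counts(annotated)
print(raw_df("CAAGGTAAGTCC", counts), df_percentile("CAAGGTAAGTCC", annotated, counts))
```

A sequence that never occurs at an annotated exon-intron junction receives a raw DF of 0 in this sketch, matching the DF = 0 class of decoy-donors discussed later in the text.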

**Fig. 3: Cryptic donor activation is influenced by relative strength.**

The four algorithms use different methods to measure the intrinsic strength of a given donor splice-site. In the following discussion we use the terms stronger and weaker to denote a donor that has a higher or lower score, respectively, according to that algorithm. Comparisons such as weaker by >50% denote that the donor score has been reduced by more than half by the variant.

For AM-variants, activation of a cryptic-donor typically occurs in the context of a variant that weakens the annotated-donor to less than half of its original strength (Fig. 3a, dark blue). While many AM cryptics are stronger than the annotated_VAR (Fig. 3c, example shown in Fig. 3b), a substantial subset are not the strongest decoy-donor within 250 nt (Fig. 3d). In fact, many activated cryptic-donors are not recognised as bona fide donors by the respective algorithms, notably NNS (Fig. 3e).

Intuitively, for most CM-variants the cryptic is strengthened by the variant (Fig. 3f, orange, example shown in Fig. 3g). However, less than half of activated cryptics are stronger than the annotated-donor (Fig. 3h). Along similar lines, for a majority of AM/CM-variants the annotated-donor is weakened (Fig. 3i, blue) while the adjacent cryptic is strengthened by the variant (Fig. 3j, orange, example shown in Fig. 3k). However, only 29–67% of AM/CM-cryptics_VAR are stronger than the annotated-donor_VAR (Fig. 3l). Despite similar overall performance for each algorithm, they showed discordance in variant outcome predictions (Fig. 3m, n) and measures of splice-site strength (Supplementary Fig. 2d).

In summary, four independent algorithms concur that cryptic-donor activation typically occurs in response to weakening of the annotated-donor (85–99% of variants) or strengthening of the cryptic-donor (67–98% of variants). However, only 35–70% of activated cryptic-donors are stronger than the annotated-donor_VAR, and for unmodified cryptic-donors, 29–62% are not the strongest decoy-donor within 250 nt. Thus, while the relative strength of the annotated- and cryptic-donor influences spliceosomal use, there are other factors at play.

Competitive decoy-donors are depleted close to annotated-donors

Decoy-donors of comparable or greater strength to the annotated-donor rarely occur within 150 nt (Fig. 4a, red). However, exonic and intronic regions around donors have characteristic single and dinucleotide frequencies which may contribute to the rarity of decoy-donors (supplementary Fig. 3). In particular, the first 50 nt of the intron often shows enrichment in G and T dinucleotides, with distinct patterns: 1) G repeats are enriched in the shortest of introns and T repeats in the longest (supplementary Fig. 3c); 2) Introns with G (or C) at the D⁺³ position are enriched in G dinucleotides whereas introns with A (or T) at the D⁺³ position are enriched in T dinucleotides (supplementary Fig. 3d); 3) Introns with rare donors (low DF) are enriched in T-repeats compared to introns with the most common donors (supplementary Fig. 3e). Therefore, we adapted a previously used method¹⁷ which takes dinucleotide preferences into account, to assess if decoy-donors occur more or less commonly than expected by random chance (see Methods and supplementary Fig. 4).

**Fig. 4: Competitive decoy-donors are specifically depleted surrounding annotated-donors.**

GT decoy-donors show increasing exonic depletion approaching the annotated-donor, with out-of-frame decoys (red) depleted more than in-frame decoy-donors (orange), while showing negligible depletion in the intron (Fig. 4b). GC decoy-donors show no depletion in either the exon or the intron (supplementary Fig. 5a).

We next assessed what proportion of decoy-donors in the genome are used, albeit rarely, via unannotated splice-junctions detected across 40,233 publicly available RNA-seq samples from GTEx¹⁸ and Intropolis¹⁹ (40K-RNA). Unannotated splice-junctions (representing stochastic mis-splicing), seen rarely in RNA-seq samples aggregated across a population, constitute empirical evidence that both splicing reactions can be executed using a decoy-donor. Therefore, we mined 40K-RNA for splice-junctions representing the use of cryptic-donors within 250 nt of any annotated-donor, and ranked them according to the number of samples they were present in (see methods). Overall, ~7% of all unannotated decoy-donors are in fact present as rare, stochastic mis-splicing events in 40K-RNA.

The proportion of exonic GT decoy-donors present in 40K-RNA (relative to all decoys) dramatically increases with proximity to the annotated-donor, with intronic decoys showing only a modest change (Fig. 4c). This mirrors depletion patterns (Fig. 4b) and confirms that decoy-donors closer to the annotated-donor are inherently more likely to be used by the spliceosome. Less than 4% of exonic GC decoy-donors are present in 40K-RNA, even very close to the exon/intron junction, in line with their observed lack of depletion (Supplementary Fig. 5b).

The ability of DF to measure donor strength is evidenced by Fig. 4d, e. While there is negligible depletion of decoy-donor sequences that do not exist as a bona fide donor at any exon-intron junction in GRCh37 (DF = 0, grey), there is clear depletion of exonic decoy-donors closer in DF (50–90% DF, mid-blue), or of similar or greater DF (≥90% DF, dark blue) (Fig. 4d, left), relative to the annotated-donor. Depletion is even evident for decoy-donors that have DF of only 10% relative to the annotated-donor, and so we define a competitive decoy-donor as one above this threshold. Interestingly, except for the most competitive decoy-donors (≥90% DF; Fig. 4d, right, dark blue), decoy-donors show no depletion in the intron. Concordantly, the proportion of exonic decoy-donors present in 40K-RNA increases with increasing relative DF, and to a lesser extent at the start of the intron (Fig. 4e).

Why are intronic decoy-donors less likely to be used by the spliceosome?

The fact that intronic decoy-donors are less depleted and less likely to be seen in 40K-RNA (Fig. 4b–e) was initially perplexing, given that cryptic-donors are just as common in the intron as in the exon (Fig. 2a, b). However, we reasoned distinctive nucleotide preferences in the first ~50 nt of the intron could affect measures of depletion, and/or, influence the usability of decoy-donors in this region. For example, G-repeat splicing regulatory elements (SREs) are abundant within the first ~50 nt of the intron^20,21,22.

We defined competitive decoy-donors as those with a DF of at least 10% that of the associated annotated-donor (see Fig. 4d, e). In the first 50 nt of the intron, competitive decoy-donors overlapping G-triplets show no depletion and conversely appear enriched (Fig. 5a, intron- orange). In contrast competitive decoy-donors not overlapping G-triplets are depleted (Fig. 5a, intron- grey). Additionally, a higher proportion of intronic decoy-donors not overlapping G-triplets are seen in 40K-RNA than those overlapping G-triplets (Fig. 5b, intron- grey). The reciprocity in these data is consistent with a masking effect of intronic G-repeat motifs on (competitive) decoy-donors, likely due to RNA secondary structure and/or RNA binding proteins preventing their use.

**Fig. 5: Utility of 40K-RNA to identify decoy-donors able to be used by the spliceosome.**

Figure 5c shows an example variant in gene GAA (NM_000152.3:c.2646+2T>A) identified in an individual affected with glycogen storage disease type II²³ that induces splicing to an exonic cryptic-donor 21 nt upstream of the annotated-donor. NNS, MES, and DF rank the decoy-donor at +57 as the strongest donor; however, this donor is enveloped within a G-repeat rich region which may mask it, and accordingly is not present in 40K-RNA. SAI instead predicts use of the cryptic-donor at −21. Notably, this cryptic-donor is present in 137 samples in 40K-RNA, providing empirical evidence that despite its weak primary motif, it can be used by the spliceosome.

SAI in silico mutagenesis of the cryptic-donor at −21 and decoy-donor at +57 shows that SAI deep-learning perceives the negative impact of the G-repeats on the usability of the +57 decoy-donor (Fig. 5d). Intronic G-repeats are known examples of SREs^20,24 (see Fig. 5 and additional examples supplementary Fig. 5c–f). Whether or not a cryptic donor can be used is influenced by a constellation of features: the consensus donor sequence, as well as proximal and more distal splicing regulatory elements. Regulatory elements are not factored in by many algorithms, though they may be identified by SAI, likely underpinning its enhanced capabilities in recognition of usable (cryptic) splice-sites. In contrast, 40K-RNA uses empirical evidence from RNA-Seq data that reveals which cryptic splice-sites are usable in the context of the specimens tested.

90% of cryptic-donors in AM-variants are present in 40K-RNA

We assessed whether 40K-RNA provides a viable means to prioritise cryptic-donors likely to be activated in the context of a genetic variant affecting the annotated-donor (i.e. AM-variants). 90% of AM-variant activated cryptic-donors are present in 40K-RNA, while 91% of unused decoy-donors are absent. Therefore, 40K-RNA provides potent predictive information with respect to both true positives (cryptic-donors) and true negatives (decoy-donors). Notably, while cryptic-donors were observed in multiple independent samples across 40K-RNA, they were typically very low frequency splice-junctions (44% had a maximum of 4 reads or less in any one sample, supplementary Fig. 6b).

We chose the top 4 40K-RNA events at each splice-junction (or all events if fewer than 4 were detected) as our predicted cryptic-donors as this maximised sensitivity (87%) without compromising specificity (95%) (Fig. 6a). Use of 40K-RNA had a higher sensitivity than all four algorithms assessed (Fig. 6a, b, supplementary Fig. 6a). The sensitivity of 40K-RNA is inherently influenced by read-depth of the target transcript: more than 85% of cryptic-donors are detected in transcripts with a read depth of >250 for the annotated exon-exon splice-junction (normal splicing); whereas only 29% of cryptic-donors are detected in 40K-RNA in transcripts where normal splicing had a maximum read count of <100 (supplementary Fig. 6c). Consequently, we assessed SAI as a complementary approach for situations where our empirical method is underpowered or not well suited.
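The top-4 rule and the way sensitivity and specificity follow from it can be sketched as below. This is an illustration under assumptions rather than the authors' code: the function names are hypothetical, each event is represented by its offset in nt from the annotated-donor, and the sample counts in the example are invented for demonstration.

```python
def top_events(junction_events, n=4):
    """Return the offsets of the n events seen in the most samples.

    junction_events maps a decoy-donor offset (nt from the annotated donor,
    negative = exonic) to the number of RNA-seq samples containing that event.
    If fewer than n events exist, all of them are returned.
    """
    ranked = sorted(junction_events.items(), key=lambda kv: kv[1], reverse=True)
    return [offset for offset, _ in ranked[:n]]


def sensitivity_specificity(predicted, cryptic, decoy):
    """Sensitivity over true cryptic-donors and specificity over unused decoy-donors."""
    true_pos = len(predicted & cryptic)
    true_neg = len(decoy - predicted)
    return true_pos / len(cryptic), true_neg / len(decoy)


# Hypothetical sample counts for one exon-intron junction.
events = {-21: 137, 5: 40, -8: 12, 33: 9, 120: 3}
predicted = set(top_events(events))
print(sorted(predicted))
```

Changing `n` trades sensitivity against specificity: a larger `n` captures more true cryptic-donors but also labels more unused decoy-donors as predicted.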

**Fig. 6: 40K-RNA potently informs cryptic-donor activation.**

We define SAI prediction of cryptic-donor activation as a donor-gain Δ-score of 0.1 or greater, which accurately predicts 75% of cryptic-donors and inaccurately predicts only 1% of decoy-donors (Fig. 6b). SAI showed higher sensitivity than NNS, and comparable sensitivity to MES and DF, while greatly improving on their specificity (supplementary Fig. 6a). However, the sensitivity of SAI is compromised for cryptics at increasing distance from the annotated-donor: only 55% of cryptic-donors further than 100 nt from the annotated splice site had a Δ-score above 0.1 (supplementary Fig. 6d, e). If we take the union of SAI and 40K-RNA cryptic-donor predictions (i.e., cryptics predicted by either of the two methods), we accurately predict 2210/2389 (93%) of cryptic-donors (Fig. 6c) and inaccurately predict 6% of unused decoy-donors.
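Taking the union of the two methods amounts to a simple set operation, sketched here under assumptions: the function name and the input layout (a mapping from donor offset to SAI donor-gain Δ-score, plus a list of top-4 40K-RNA offsets) are hypothetical, and the example scores are invented.

```python
SAI_THRESHOLD = 0.1  # donor-gain delta score cut-off stated in the text


def combined_prediction(sai_delta, top4_offsets):
    """Offsets predicted by either SAI (delta >= 0.1) or the top-4 40K-RNA events."""
    sai_hits = {offset for offset, delta in sai_delta.items() if delta >= SAI_THRESHOLD}
    return sai_hits | set(top4_offsets)


# Hypothetical scores: a distal cryptic-donor (+140) missed by SAI but ranked in 40K-RNA.
sai_delta = {-21: 0.35, 57: 0.02, 140: 0.05}
print(sorted(combined_prediction(sai_delta, [140, -21])))

# Reported combined sensitivity: 2210 of 2389 cryptic-donors.
print(round(100 * 2210 / 2389))
```

The union raises sensitivity at the cost of specificity, consistent with the reported increase in mispredicted decoy-donors to 6%.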

Use of 40K-RNA has caveats for CM-variants and AM/CM-variants, and cannot be used for variants that create a GT (or GC) motif. However, for the subset of CM-variants and AM/CM-variants where the variant modifies the extended splice site region of an extant GT/GC decoy-donor (1525 variants, Fig. 6d, top- blue), 76% are present in 40K-RNA, with 56% in the top 4 events.

40K-RNA is least sensitive for variants that most significantly impact the strength of the cryptic-donor: For D⁺² CM-variants, only 32% of the cryptic-donors are present in the top 4 events, as compared to 85% for E^-3 variants (supplementary Fig. 6f; E^-3 is the third to last exonic base). Accordingly, even if a GC decoy-donor is not present in 40K-RNA, conversion by a variant to a GT donor presents high risk for cryptic-activation. SAI performed well for CM-variants and AM/CM-variants, correctly predicting 96% of variants that created an essential donor motif and 90% which modified an existing essential motif (Fig. 6d).

Discussion

The ultimate goal of splicing predictions is to determine if and how a genetic variant will induce mis-splicing of pre-mRNA. Even for essential splice-site variants that almost invariably cause mis-splicing, consideration of probable consequences of the variant is critical for pathology application of the ACMG-AMP code PVS1²⁵ (null variant due to presumed mis-splicing of the pre-mRNA) and of equal importance to strategise functional testing for RNA diagnostics¹⁴. While activation of a cryptic-donor 6 nucleotides away will remove or insert two amino-acids, activation of a cryptic-donor 4 nucleotides away will induce a frameshift, with vastly different implications for pathology interpretation.
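The arithmetic behind this distinction is a modulo-3 check on the distance between the cryptic-donor and the annotated-donor. The sketch below is illustrative only: the function name is hypothetical, offsets are assumed to be negative for exonic cryptic-donors (sequence lost) and positive for intronic ones (sequence gained), and it does not check whether retained intronic sequence introduces a stop codon.

```python
def frame_consequence(offset):
    """Classify cryptic-donor use by its distance (nt) from the annotated donor.

    offset < 0: exonic cryptic-donor, exonic sequence is removed.
    offset > 0: intronic cryptic-donor, intronic sequence is inserted.
    """
    if offset == 0:
        raise ValueError("offset 0 is the annotated donor itself")
    distance = abs(offset)
    if distance % 3 != 0:
        return "frameshift"
    change = "deletion" if offset < 0 else "insertion"
    return f"in-frame {change} of {distance // 3} amino acids"


for offset in (-6, 6, 4, -21):
    print(offset, frame_consequence(offset))
```

Under these assumptions the GAA cryptic-donor at −21 discussed in the Results would be classified as an in-frame event, since 21 is divisible by 3.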

We learned five key lessons from our analyses of 4811 cryptic-donor variants in 3399 genes: (1) Decoy-donors that show evidence of natural stochastic use by the spliceosome in population-based RNA-Seq data (i.e., are present in 40K-RNA) have the greatest probability of activation as cryptic-donors. (2) Decoy-donors closer to the annotated splice site are inherently more likely to be used by the spliceosome, likely due to the presence of all required sequence features that are facilitating use of the annotated donor.

(3) Cryptic-donors do not necessarily need to be stronger than the annotated-donor to present substantive risk for mis-splicing, with decoy-donors only 10% of the strength of the annotated-donor able to compete for spliceosomal binding. (4) Intronic G-repeats can diminish the likelihood of spliceosomal recognition and use of intronic decoy splice sites. (5) SAI’s deep-learning appreciates the broader sequence context determining spliceosomal usability of a cryptic-donor, though less accurately predicts activation of more distal cryptic-donors (>100 nt from the annotated-donor).

SAI’s deep learning presents a major improvement in predicting cryptic-donor activation. However, use of SAI in a pathology context is limited by the challenge of deriving a clinically-meaningful interpretation of a number on a 0–1 scale, without independently verifiable evidence. In contrast, 40K-RNA provides an accurate, evidence-based means to rank cryptic-donors likely to be activated by genetic variants.

Brandão et al.²⁶ used deep sequencing of twelve major cancer susceptibility genes to catalogue all alternative and aberrantly spliced transcripts. They found variant-activated cryptic splicing was often seen at much lower levels in disease controls, suggesting that the dominant transcript in rare disease may be seen as a stochastic mis-splicing event in other samples. We use this insight, mining the breadth of publicly available RNA-seq data across numerous tissues to comprehensively catalogue stochastic cryptic splicing events across all genes.

The heightened sensitivity and empirical nature of using 40K-RNA is of vital importance for pathology assessment of variants affecting the essential donor splice-site, as overlooking a cryptic-donor that is likely to be activated can lead to profound complications in variant interpretation. Prospectively, the sensitivity of 40K-RNA can be enhanced by ultra-deep sequencing. It is also easy to envisage extending the method to predict other mis-splicing events such as exon skipping, and mis-splicing events at the acceptor splice site. 40K-RNA can reliably identify distal cryptic-donors with high likelihood of activation, which may not be identified by SAI. Conversely, SAI can reliably identify cryptic donors with high likelihood of activation that are not detected in 40K-RNA due to low read depth of the target gene.

In conclusion, we define an accurate, evidence-based method to predict cryptic-donor activation in the context of a variant affecting the annotated-donor, based on stochastic mis-splicing events observed in 40,233 publicly available RNA-seq samples (40K-RNA). We provide a web resource that reports and ranks the most commonly (mis)used cryptic donors proximal to every Ensembl annotated-donor²⁷ (https://kidsneuro.shinyapps.io/splicevault-40k/). Our research establishes that for AM-variants, if a cryptic-donor is activated, in 87% of cases it will be among the top 4 events. We hope this evidence-based method may improve clinical interpretability of donor variants.

Methods

Creating a database of cryptic-donor variants

Variants were derived from several sources: (1) 439 variants curated from literature, predominantly comprised of 364 variants in DBASS5¹⁵ and supplemented by curation from published literature of 75 additional variants^28,29; (2) 4372 variants derived from RNA-seq studies: variant-associated aberrant cryptic-donor activation detected from RNA-seq data identified by SAVnet in somatic tumor samples (n = 3259)¹⁶ and 1113 variants identified in GTEx samples by SpliceAI and verified using RNA-seq data¹¹. The following inclusion criteria applied: (1) Variants had to occur within E^-4-D⁺⁸ of the annotated or the cryptic-donor, otherwise they were excluded as outside the bounds of this analysis. (2) Annotated cryptic-donors had to lie within the same exon/intron as the variant (i.e., between the 5′ end of the exon and 3′ end of the intron surrounding the affected donor). (3) The annotated cryptic-donor VAR sequence had to have an essential GT/GC dinucleotide at D⁺¹/D⁺², to minimise misannotated variants being included.

Annotating variant categories

We annotated variants with the categories we defined: if the variant occurred within E^-4-D⁺⁸ of the annotated-donor, it was an AM-variant; if it occurred within E^-4-D⁺⁸ of the cryptic-donor, it was a CM-variant; and if it occurred within E^-4-D⁺⁸ of both the annotated- and cryptic-donor, it was an AM/CM-variant. For 37/373 of AM/CM-variants, an unmodified cryptic-donor was activated in addition to the cryptic-donor modified by the variant; these were excluded from analyses.
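The category rules above can be expressed as an interval-overlap test. The sketch below is a hedged illustration, not the authors' pipeline: the function names are hypothetical, positions are assumed to be 0-based indices on the transcribed strand, each donor is identified by the index of its D⁺¹ base, and a variant is given as a half-open interval of affected reference bases.

```python
def overlaps_donor(var_start, var_end, d1):
    """True if the half-open variant interval [var_start, var_end) overlaps the
    E-4..D+8 window of the donor whose D+1 base is at index d1."""
    window_start, window_end = d1 - 4, d1 + 8  # 4 exonic + 8 intronic nt
    return var_start < window_end and var_end > window_start


def categorise(var_start, var_end, annotated_d1, cryptic_d1):
    """Return 'AM', 'CM', 'AM/CM', or None if the variant is outside both windows."""
    am = overlaps_donor(var_start, var_end, annotated_d1)
    cm = overlaps_donor(var_start, var_end, cryptic_d1)
    if am and cm:
        return "AM/CM"
    if am:
        return "AM"
    if cm:
        return "CM"
    return None


# SNV at D+2 of an annotated donor (D+1 at index 100), cryptic-donor 21 nt upstream.
print(categorise(101, 102, annotated_d1=100, cryptic_d1=79))
```

Representing the variant as an interval lets the same test cover SNVs and multi-nucleotide deletions; insertions would need an anchoring convention that this sketch does not define.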

Compiling annotated-, cryptic- & decoy-donor sequences

The R package BSgenome.Hsapiens.1000genomes.hs37d5³⁰ was used to extract (up to) 500 nt of genomic sequence preceding and succeeding the annotated-donor (GRCh37). For each variant in the cryptic-donor database, we extracted up to 250 exonic nucleotides in the 5′ direction (i.e., if the exon was only 50 nt the window of analysis would be 50 nucleotides), and up to 250 intronic nucleotides in the 3′ direction, in the same fashion (Fig. 1a).

From the (up to) 500 nt of sequence we pulled E^-4-D⁺⁸ sequences for the annotated- and cryptic-donor before and after each variant (REF and VAR respectively). We also identified any other essential donor dinucleotides (i.e., GT or GC) which were present in the sequence and extracted the E^-4-D⁺⁸ sequence surrounding them. We define these as decoy-donors: sequences containing the essential donor dinucleotide (i.e., a GT or a GC) but which were not utilised by the spliceosome as a result of the variant (Fig. 1a, lower). For intronic decoy-donors, we excluded any which would result in an intron too short to be spliced (as defined by the 1st percentile for intron length in the human genome = 80 nt)³¹. Importantly, without additional filtering, no cryptic-donors in the database violated this rule.
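Window extraction and decoy-donor scanning can be sketched as follows. This is a simplified illustration with hypothetical function names: the sequence is assumed to be on the transcribed strand, the annotated-donor is identified by the 0-based index of its D⁺¹ base, and the minimum-intron-length filter and the removal of known cryptic-donors are not implemented.

```python
def donor_window(seq, d1):
    """Return the 12-nt E-4..D+8 window for a donor whose D+1 base is at index d1,
    or None if the window runs off either end of the sequence."""
    start, end = d1 - 4, d1 + 8
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


def find_decoy_donors(seq, annotated_d1, max_dist=250):
    """List (offset, window) for every GT/GC other than the annotated donor.

    offset is the distance in nt from the annotated D+1 (negative = exonic side).
    """
    decoys = []
    for i in range(len(seq) - 1):
        if i == annotated_d1 or abs(i - annotated_d1) > max_dist:
            continue
        if seq[i:i + 2] in ("GT", "GC"):
            window = donor_window(seq, i)
            if window is not None:
                decoys.append((i - annotated_d1, window))
    return decoys


# Toy sequence: 9 exonic nt followed by 15 intronic nt (hypothetical).
seq = "ACAGGCAAG" + "GTAAGTCCGTACCCC"
print(donor_window(seq, 9))
print(find_decoy_donors(seq, 9))
```

Note that the GT near the end of the toy sequence is skipped because a full 12-nt window cannot be extracted for it.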

Algorithms for splice site strength

We retrieved predicted scores for annotated-donors, cryptic-donors and decoy-donors in the database in both the REF and VAR sequence context, for four algorithms. (1) MaxEntScan (MES)¹³ scores were retrieved using the perl script provided at http://hollywood.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html. MES scores below 0 were standardised to 0 as predicted non-functional splice sites. (2) NNSplice (NNS)¹² scores were retrieved using the online portal (https://www.fruitfly.org/seq_tools/splice.html), set to human, with default settings (i.e., a minimum score of 0.4, with any scores below predicting a non-functional splice site). (3) SpliceAI (SAI)¹¹ scores were retrieved using a script adapted from the SAI GitHub repository (https://github.com/Illumina/SpliceAI) which allows SpliceAI to score custom sequences. We rounded the scores to three decimal places, and scores at 0 when rounded as such (i.e., < 0.01) were termed non-functional splice site predictions. (4) Donor Frequency (DF) calculates the median frequency among four 9 nt windows of sequence spanning the donor (see supplementary Fig. 1b, c) converted to a cumulative percentile distribution scale. DF measures donor strength based on how many annotated-donors in the human genome have the exact same sequence. An example of a DF calculation is shown in supplementary Fig. 1c, where a median DF raw value of 179 lies at the 31st percentile of a cumulative frequency distribution. After assessing several window lengths (supplementary Fig. 1a) we used 9 nt windows as optimally encompassing the splice site.

Naturally occurring decoy-donors

Our set of naturally occurring human decoy-donors were derived from the set of all canonical Ensembl transcripts (Release 75)²⁷, with first and last introns and single exon transcripts removed. We filtered the set so that junctions were within the open reading frame for the given gene, so we knew that cryptic splicing here would affect the protein. We also removed exons with alternative 5′ or 3′ ends already annotated in different transcripts. We used these criteria to form a set of 142,014 canonical exon-intron junctions that are not alternatively spliced (or at least not annotated as such). We extracted sequences surrounding annotated-donors and extracted all decoy-donors just as for the cryptic database (see methods section creating a database of cryptic-donor variants).

Decoy-donor depletion

Decoy-donor depletion was calculated using a method we adapted from a previous study¹⁷ that controls for dinucleotide frequencies (supplementary Fig. 4). For exonic sequences, we took up to 150 nt or the maximum length of the exon, whichever was shorter (and similarly for the intron, stopping 50 nt from the acceptor). We limited analysis to 150 nt of exonic sequence as the majority of exons are shorter than this. We then shuffled exonic and intronic sequences separately, maintaining dinucleotide frequencies (using shuffle_sequences with the euler method from the universalmotif R package³²). We performed the shuffle 15 times and took the average count of decoy-donors at each nucleotide position as our expected count at this position. The observed count of decoy-donors was then divided by the expected count at each position.
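The final observed/expected step can be sketched as follows. This Python example assumes the per-position decoy-donor counts have already been tallied for the real sequences and for each shuffled set (the shuffle itself was done in R with universalmotif and is not reimplemented here); the function name and the handling of positions with zero expected count are assumptions.

```python
def depletion(observed, shuffled_runs):
    """Divide observed counts by the mean count across shuffles, position by position."""
    n_runs = len(shuffled_runs)
    expected = [
        sum(run[i] for run in shuffled_runs) / n_runs
        for i in range(len(observed))
    ]
    # A position with zero expected count has no defined ratio; NaN is used here (assumption).
    return [o / e if e > 0 else float("nan") for o, e in zip(observed, expected)]


# Toy counts at three positions, with two shuffles instead of the 15 used in the study.
ratios = depletion([2, 0, 5], [[4, 1, 5], [4, 3, 5]])
```

A ratio below 1 indicates fewer decoy-donors than expected from dinucleotide composition alone, and a ratio near 1 indicates no depletion.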

Creating 40K-RNA

We had two sources of RNA-seq data for 40K-RNA: Intropolis¹⁹ and GTEx¹⁸. Intropolis is a set of ~42 M splice-junctions found across 21,504 human RNA-seq samples from the Sequence Read Archive (SRA). Samples were aligned using Rail-RNA³³, the annotation-agnostic aligner of Nellore et al. Intropolis was downloaded from its dedicated GitHub repository (https://github.com/nellore/intropolis). Per-sample splice-junction files were obtained from GTEx (phs000424.v8.p2 [https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v8.p2]). Using Datamash³⁴, splice-junction read counts were summarised across all samples for each unique splice-junction and translated from GRCh38 to GRCh37 using liftOver³⁵.

For each set of splice-junctions (Intropolis and GTEx), we cross-referenced and located junctions within Ensembl transcripts. We filtered to cryptic-donor events by scanning for any unannotated donors used between the 5′ end of the exon and the 3′ end of the intron for the respective exon-intron junction, where the junction also spliced to the next annotated acceptor. Events from the two sources were merged, sample counts were tallied across the two datasets, and splice-junctions present in at least 3 samples and representing cryptic-donor use within 250 nt of any annotated-donor were retained.
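The merge-and-filter step can be sketched as below. This Python example assumes each source has already been reduced to a mapping from a junction key to the number of samples in which it was seen; the key layout (chromosome, strand, cryptic-donor position, acceptor position, annotated-donor position) and the function name are hypothetical, and the sketch checks distance against a single annotated-donor per event for simplicity.

```python
from collections import Counter


def merge_and_filter(intropolis, gtex, min_samples=3, max_distance=250):
    """Tally sample counts across both sources and keep well-supported, nearby events."""
    merged = Counter(intropolis)
    merged.update(gtex)  # adds sample counts for junctions present in both sources
    kept = {}
    for key, n_samples in merged.items():
        _chrom, _strand, cryptic_pos, _acceptor_pos, annotated_pos = key
        distance = abs(cryptic_pos - annotated_pos)
        if n_samples >= min_samples and 0 < distance <= max_distance:
            kept[key] = n_samples
    return kept


# Toy input: one event reaches 3 samples only after merging the two sources.
intropolis = {("chr1", "+", 100, 500, 120): 2, ("chr1", "+", 900, 1500, 1300): 10}
gtex = {("chr1", "+", 100, 500, 120): 1, ("chr2", "-", 50, 10, 60): 2}
events = merge_and_filter(intropolis, gtex)
```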

Sashimi plots

For Fig. 5c and supplementary Fig. S6b, c, sashimi plots were generated using 3 GTEx bam files for each example, each from the tissue with the highest TPM for the respective gene. Sashimi plots were created using ggsashimi³⁶.

SpliceAI in silico mutagenesis plots

For Fig. 5d and supplementary Fig. S6b, c, we performed the in silico mutagenesis method described by Jaganathan et al.¹¹ That is, the importance score of each nucleotide was calculated as:

$$s_{\text{actual}}-\frac{s_{A}+s_{C}+s_{G}+s_{T}}{4} \tag{1}$$

where s_actual is the score calculated on the genuine sequence, and s_A, for example, is the score calculated when an A is substituted at this position.
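Equation (1) translates directly into code. In the Python sketch below, `score_fn` is a hypothetical stand-in for whatever function returns the SpliceAI score of the site of interest for a given sequence; the toy scorer is an assumption used only to make the example runnable and has nothing to do with SpliceAI itself.

```python
def importance_scores(seq, score_fn):
    """Per-nucleotide importance: score of the real sequence minus the mean over A, C, G, T."""
    s_actual = score_fn(seq)
    scores = []
    for i in range(len(seq)):
        # Substitute each of the four bases at position i (one of them is the real base).
        mutant_mean = sum(score_fn(seq[:i] + base + seq[i + 1:]) for base in "ACGT") / 4
        scores.append(s_actual - mutant_mean)
    return scores


def toy_score(seq):
    """Toy scorer (assumption): 1.0 if the sequence contains a GT, otherwise 0.0."""
    return 1.0 if "GT" in seq else 0.0


importance = importance_scores("AGTA", toy_score)
```

With the toy scorer, only the G and T of the GT dinucleotide receive a non-zero importance, because changing either of them removes the feature being scored.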

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The variants used in the cryptic-donor database are provided in the Source Data file. 40K-RNA is available as a web-resource at: https://kidsneuro.shinyapps.io/splicevault-40k/. Additionally, the full dataset is available under restricted access to limit hosting costs. Access can be obtained by creating a Google Cloud billing account and downloading the file from this link using Google Cloud tools: https://storage.googleapis.com/misspl-db-data/misspl_events_40k_hg19.sql.gz. The GTEx v8 data used in this study were obtained from dbGaP accession number phs000424.v8.p2 [https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v8.p2]. Intropolis data used in this study were obtained from the dedicated GitHub repository https://github.com/nellore/intropolis. Source data are provided with this paper.

Code availability

All code required to replicate figures in the study is available in a GitHub repository: https://github.com/kidsneuro-lab/cryptic_donor_prediction. Additionally, code required to create 40K-RNA is available in a separate repository: https://github.com/kidsneuro-lab/40K-RNA.

References

1. Anna, A. & Monika, G. Splicing mutations in human genetic disorders: Examples, detection, and confirmation. J. Appl. Genet. 59, 253–268 (2018).
2. López-Bigas, N., Audit, B., Ouzounis, C., Parra, G. & Guigó, R. Are splicing mutations the most frequent cause of hereditary disease? FEBS Lett. 579, 1900–1903 (2005).
3. Ars, E. Mutations affecting mRNA splicing are the most common molecular defects in patients with neurofibromatosis type 1. Hum. Mol. Genet. 9, 237–247 (2000).
4. Ezquerra-Inchausti, M. et al. High prevalence of mutations affecting the splicing process in a Spanish cohort with autosomal dominant retinitis pigmentosa. Sci. Rep. 7, 39652 (2017).
5. Teraoka, S. N. et al. Splicing defects in the ataxia-telangiectasia gene, ATM: Underlying mutations and consequences. Am. J. Hum. Genet. 64, 1617–1631 (1999).
6. Colombo, M. et al. Comparative in vitro and in silico analyses of variants in splicing regions of BRCA1 and BRCA2 genes and characterization of novel pathogenic mutations. PLoS ONE 8, e57173 (2013).
7. Houdayer, C. et al. Guidelines for splicing analysis in molecular diagnosis derived from a set of 327 combined in silico/in vitro studies on BRCA1 and BRCA2 variants. Hum. Mutat. 33, 1228–1238 (2012).
8. Jian, X., Boerwinkle, E. & Liu, X. In silico tools for splicing defect prediction: A survey from the viewpoint of end users. Genet. Med. 16, 497–503 (2014).
9. Tang, R., Prosser, D. O. & Love, D. R. Evaluation of Bioinformatic Programmes for the Analysis of Variants within Splice Site Consensus Regions. Adv. Bioinforma. 2016, 5614058 (2016).
10. Truty, R. et al. Spectrum of splicing variants in disease genes and the ability of RNA analysis to reduce uncertainty in clinical interpretation. Am. J. Hum. Genet. 108, 696–708 (2021).
11. Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548.e24 (2019).
12. Reese, M. G., Eeckman, F. H., Kulp, D. & Haussler, D. Improved splice site detection in genie. J. Comput. Biol. 4, 311–323 (1997).
13. Yeo, G. & Burge, C. B. Maximum entropy modeling of short sequence motifs with applications to RNA Splicing signals. J. Comput. Biol. 11, 377–394 (2004).
14. Bournazos, A. M. et al. Standardized practices for RNA diagnostics using clinically accessible specimens reclassifies 75% of putative splicing variants. Genet. Med. (2021) https://doi.org/10.1016/j.gim.2021.09.001.
15. Buratti, E., Chivers, M., Hwang, G. & Vorechovsky, I. DBASS3 and DBASS5: databases of aberrant 3′- and 5′-splice sites. Nucleic Acids Res. 39, D86–D91 (2011).
16. Shiraishi, Y. et al. A comprehensive characterization of cis-acting splicing-associated variants in human cancer. Genome Res. 28, 1111–1125 (2018).
17. Iacono, M., Mignone, F. & Pesole, G. uAUG and uORFs in human and rodent 5′untranslated mRNAs. Gene 349, 97–105 (2005).
18. GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
19. Nellore, A. et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 17, 266 (2016).
20. McCullough, A. J. & Berget, S. M. G triplets located throughout a class of small vertebrate introns enforce intron borders and regulate splice site selection. Mol. Cell. Biol. 17, 4562–4571 (1997).
21. Caputi, M. & Zahler, A. M. Determination of the RNA binding specificity of the heterogeneous nuclear ribonucleoprotein (hnRNP) H/H′/F/2H9 Family. J. Biol. Chem. 276, 43850–43859 (2001).
22. Rosenberg, A. B., Patwardhan, R. P., Shendure, J. & Seelig, G. Learning the sequence determinants of alternative splicing from millions of random sequences. Cell 163, 698–711 (2015).
23. Huie, M. L., Anyane-Yeboa, K., Guzman, E. & Hirschhorn, R. Homozygosity for multiple contiguous single-nucleotide polymorphisms as an indicator of large heterozygous deletions: Identification of a novel heterozygous 8-kb intragenic deletion (IVS7–19 to IVS15–17) in a patient with glycogen storage disease Type II. Am. J. Hum. Genet. 70, 1054–1057 (2002).
24. Xiao, X. et al. Splice site strength–dependent activity and genetic buffering by poly-G runs. Nat. Struct. Mol. Biol. 16, 1094–1100 (2009).
25. Abou Tayoun, A. N. et al. Recommendations for interpreting the loss of function PVS1 ACMG/AMP variant criterion. Hum. Mutat. 39, 1517–1524 (2018).
26. Brandão, R. D. et al. Targeted RNA-seq successfully identifies normal and pathogenic splicing events in breast/ovarian cancer susceptibility and Lynch syndrome genes. Int. J. Cancer 145, 401–414 (2019).
27. Howe, K. L. et al. Ensembl 2021. Nucleic Acids Res. 49, D884–D891 (2021).
28. Leman, R. et al. Novel diagnostic tool for prediction of variant spliceogenicity derived from a set of 395 combined in silico/in vitro studies: An international collaborative effort. Nucleic Acids Res. 46, 7913–7923 (2018).
29. Pros, E. et al. Nature and mRNA effect of 282 different NF1 point mutations: Focus on splicing alterations. Hum. Mutat. 29, E173–E193 (2008).
30. Gehring, J. BSgenome.Hsapiens.1000genomes.hs37d5: 1000genomes Reference Genome Sequence (hs37d5). R package version 0.99.1. (2016).
31. Bryen, S. J. et al. Pathogenic abnormal splicing due to intronic deletions that induce biophysical space constraint for spliceosome assembly. Am. J. Hum. Genet. 105, 573–587 (2019).
32. Tremblay, B. universalmotif: Import, Modify, and Export Motifs with R. R package version 1.8.4. (2021).
33. Nellore, A. et al. Rail-RNA: Scalable analysis of RNA-seq splicing and coverage. Bioinformatics 33, 4033–4040 (2016).
34. Free Software Foundation, I. GNU Datamash, Available at: https://www.gnu.org/software/datamash/. (2014).
35. Hinrichs, A. S. et al. The UCSC genome browser database: Update 2006. Nucleic Acids Res. 34, D590–D598 (2006).
36. Garrido-Martín, D., Palumbo, E., Guigó, R. & Breschi, A. ggsashimi: Sashimi plot revised for browser- and annotation-independent splicing visualization. PLOS Comput. Biol. 14, e1006360 (2018).

Acknowledgements

This project was supported by a National Health and Medical Research Council of Australia Senior Research Fellowship (S.T.C. APP1136197) and Ideas Grant (S.T.C. APP1186084). R.D. is supported by a University of Sydney Research Training Program Scholarship and Merit Award Supplementary Scholarship. The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH and NINDS.

Author information

Authors and Affiliations

Kids Neuroscience Centre, Kids Research, Children’s Hospital at Westmead, Sydney, NSW2145, Australia
Ruebena Dawes, Himanshu Joshi & Sandra T. Cooper
Discipline of Child and Adolescent Health, Faculty of Health and Medicine, University of Sydney, Sydney, NSW2006, Australia
Ruebena Dawes & Sandra T. Cooper
The Children’s Medical Research Institute, 214 Hawkesbury Road, Westmead, NSW, 2145, Sydney, Australia
Sandra T. Cooper

Authors

Ruebena Dawes
Himanshu Joshi
Sandra T. Cooper

Contributions

Data curation and analysis: R.D. and H.J. Funding acquisition and supervision: S.T.C. Visualization: R.D. Writing – original draft: R.D. Writing – review and editing: R.D. and S.T.C.

Corresponding author

Correspondence to Sandra T. Cooper.

Ethics declarations

Competing interests

S.T.C. and H.J. are named inventors of Intellectual Property (IP) described in part within this manuscript owned jointly by the University of Sydney and Sydney Children’s Hospitals Network. S.T.C. is director of Frontier Genomics Pty Ltd (Australia) who have licenced this IP. S.T.C. receives no payment or other financial incentives for services provided to Frontier Genomics Pty Ltd (Australia). Frontier Genomics Pty Ltd (Australia) has no existing financial relationships that will benefit from publication of these data. The remaining co-authors declare no conflicts of interest.

Peer review

Peer review information

Nature Communications thanks Graziano Pesole and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Dawes, R., Joshi, H. & Cooper, S.T. Empirical prediction of variant-activated cryptic splice donors using population-based RNA-Seq data. Nat Commun 13, 1655 (2022). https://doi.org/10.1038/s41467-022-29271-y

Received: 18 July 2021
Accepted: 01 March 2022
Published: 29 March 2022
DOI: https://doi.org/10.1038/s41467-022-29271-y
