Empirical prediction of variant-activated cryptic splice donors using population-based RNA-Seq data

Predicting which cryptic-donors may be activated by a splicing variant in patient DNA is notoriously difficult. Through analysis of 5145 cryptic-donors (versus 86,963 decoy-donors not used; any GT or GC), we define an empirical method predicting cryptic-donor activation with 87% sensitivity and 95% specificity. Strength (according to four algorithms) and proximity to the annotated-donor appear important determinants of cryptic-donor activation. However, other factors such as splicing regulatory elements, which are difficult to identify, play an important role and are likely responsible for current prediction inaccuracies. We find that the most frequently recurring natural mis-splicing events at each exon-intron junction, summarised over 40,233 RNA-sequencing samples (40K-RNA), predict with accuracy which cryptic-donor will be activated in rare disease. 40K-RNA provides an accurate, evidence-based method to predict variant-activated cryptic-donors in genetic disorders, assisting pathology consideration of possible consequences of a variant for the encoded protein and RNA diagnostic testing strategies.

This paper was well written and clear, with an appropriate level of detail split between the methods and results. The figures are also excellent. I have only minor comments: -Have you seen "authentic" used elsewhere? I am used to calling these "canonical" SD. I don't hate "authentic" but I would prefer papers stick to standard nomenclature when possible.
-"with deletions ranging from 1 to 57 nts in length" make it clearer this means deletions in the transcript not indels in the DNA (as I originally thought reading this sentence!) -REF and VAR should be REF and ALT since this is more standard -you should describe HOW you combine DF and SAI.
-"We used the 1000genomes Phase2 Reference Genome Sequence (hs37d5) version of hg1927 implemented in RStudio" I would call this "in R"... and you implemented the genome?! -"as defined by the 5th percentile for intron length in the human genome"... which is?
-"Decoy-donor depletion was calculated using an adapted method" adaptive? -aligned Rail-RNA -> aligner -"translated from GRCh38 to GRCh37" urgh come on it's 2021. But I realize it would be a lot of work to redo this.
The work seems technically solid to me, if not extremely innovative. The observation that large RNAseq cohorts are valuable for this task is valuable and insightful however.
-"Counter-intuitively, less than half of activated cryptic..." This is only counter-intuitive to me if the canonical SD is now not used at all in favor of the cryptic SD, but I don't think that is the case typically here? If the cryptic SD is close in strength to the canonical it is intuitive to me the cryptic will be chosen some proportion of the time. Splicing is somewhat noisy after all.
-"85-99% weaker" There is an awkward question of scale here. What does splice site strength mean? I'm not sure there's a great answer since it will depend on the specific algorithm. e.g. for spliceAI it means "how probable is this to be an annotated SD", but MES takes a more enrichment based approach. Equivalently one could think of considering the log odds instead of the probability... 90% would then mean something different. I don't think there is a good answer (no fault of the authors) but it might be worth discussing a bit.
-"NNS, MES, and DF rank the decoy-donor at +57 as the most" ... it is not surprising to me that NNS and MES get this wrong since they don't know (or even see) the G-rich SRE. But it is surprising to me that DF gets this wrong: shouldn't +57 not be used in the RNA-seq database following your logic? Discussion: -"determining spliceosomal 'usability' of a decoy splice-site, including 2-4 above,". I agree SpliceAI models your points 3 and 4, but not 2. When it makes a prediction it doesn't know anything about where the canonical SD is.
I recommend acceptance on the condition the authors make their literature curated set of known (variant, canonical SD, cryptic SD) publicly available, since this will be a nice resource for the community. Oh a final point: I don't like the title. Donor -> Splice donor, and replace the numbers with just "high". The precise numbers are too dependent on many factors and not particularly meaningful in themselves.
Reviewer #3: Remarks to the Author: The manuscript by Dawes et al presents an interesting overview of variant-associated cryptic splicing donors by integrating the analysis of 4811 genetic variants with available RNA-seq data. Although some interesting results are presented, I have some serious concerns as detailed below.
1) The title is misleading, the approach proposed by the authors is not general and can not be applied for the "prediction of variant-associated cryptic-donors" as it is. It is my understanding that RNAsequencing data are required, which are not always available in clinical studies. The authors claim that their method can be used to evaluate and understand the pathological potential of variants currently considered to be VUS, however this is not the case, since the approach they propose can be used only for a limited selection of cases.
2) The evaluation of the sensitivity and specificity of the approach proposed by the authors is not carried out in a rigorous manner. No cross fold validation, training and/or testing sets are used to provide an unbiased estimate. No independent dataset is used (for example one based on long read sequencing technologies). No wet-lab validation is performed. With the paper in the current form authors can not claim that they have 87% sensitivity and 95% specificity.
3) The method for the prediction of cryptic donor sites is based on the integration of a pre-existing tool SpliceAI with RNA-sequencing data. Per se it does not provide the required level of scientific advancement/novelty required for a top-level scientific publication. Additionally levels of sensitivity depend dramatically on depth of sequencing of the RNAseq. Finally, it is the opinion of this reviewer that availability of RNA sequencing data is not a common feature in several clinical studies. Additionally it not clear why the authors decided to use SpliceAI and not any other tool/tools that they considered in the study. 4) Most of the analyses are purely descriptive (as in they present descriptive statistics about splicing donor sites), the novel approach proposed by the authors Donor Frequency (DF), simply captures the nucleotide composition around splice donor sites and is not substantially different from any consensus-PWSM (positional weight scoring matrix) based method. As such the analyses based on DF score do not provide interesting additional insights apart from the fact that a specific consensus sequence is preferentially observed. 5) Splicing is also mediated/controlled by epigenetic factors and/or regulation, so it would be appropriate do discuss the approach proposed by the user should be at least discussed in this context

Minor remarks
Some references are incomplete (see ref. 12 or 14, where the journal is missing). Please check all.

REVIEWER COMMENTS
As a general note, in the writing of our paper we had a difficult job trying to devise easily understandable terminology for several concepts. We can see from our reviewer's comments a need for further improvement. Therefore, we have refined our language throughout the manuscript, for example:  substituting 'annotated' for 'authentic' donors  defining our database of splice-junctions derived from 40,233 RNA-seq samples as '40K-RNA' and removing use of the term 'splice competent'.  We take care to define each new term in the paper with (what we hope is) improved clarity, and ensure we use the same terminology consistently throughout. We now also include a glossary in the methods section which we think may be helpful for our readers to explain a number of new terms: Our selection of E-4 to D+8 region is a fundamental element of our analysis, and we agree that we do not articulate clearly enough how and why it was selected. We have reworded to improve clarity: PG3, Line 55: Fig. 1)."

"We define the extended donor splice-site region as spanning the fourth to last nucleotide of the exon (E -4 , E = exon) to the eighth nucleotide of the intron (D +8 ; D = donor), as constraint on sequence diversity eases beyond this point (Supplementary
PG28, Line 1315, Figure S1 legend: " Supplementary Fig. 1  To clarify here: a) Figure S1 shows evolutionary constraint upon sequence combinations across the donor splice-site varying window length from 6 nt to 10 nt. Constraint upon sequence combinations reduces sharply at the peripheries (most obvious with short 6 nt or 7 nt windows). The jump in sequence diversity seen with windows upstream of/including E-5, and downstream of/including E+9, defined the boundaries of E-4 to D+8 as the most constrained region. b) Figure 1Ci and 1Di both show sharply decreasing numbers of variants towards the periphery of our window, proportional to the sequence constraint of these bases, and providing further justification for using the E-4 to D+8 window. Figure 1Ei is the exception, however the higher numbers of splice-altering E-4 and E-3 variants is due specifically to the mechanism highlighted in Figure 2C, whereby the +5 and +6 of the annotated-donor now becomes the +1 and +2 GT of the cryptic-donor.

R1.2 It is not clear to me why you used a 150 nt window to calculate decoy donor depletion whereas you used 250 nt window for your other analyses.
As the 50 th percentile of human exon lengths is 123 nt, the number of exons declines exponentially with longer exon lengths (as can be seen in figure 2D). We originally plotted +/-250 nt. Decoy depletion on the exonic side also did not change between 150 and 250 nt, though the plots got 'messier', due to declining numbers of exons to input into the observed / expected calculation. Additionally, decoy-depletion on the intronic side was uniform (i.e. a straight line). Thus, for simplicity and to communicate the most relevant findings, we present a 150 nt window. To address this point, we emend the legend for Figure 4 to clarify our reasoning for presenting a 150 window rather than 250 nt.
PG 11, Line 400, Figure 4 legend: "The relative DF of all decoy-donors within 150 nt of annotated-donors (decoy-donor DF / annotated-donor DF). Plots are shown for +/-150 nt of annotated-donors due to the steeply declining number of exons longer than this"

R1.3
As extensively used in the manuscript; the 'strong' donor needs more clear definition.
The term 'strong'/ 'stronger' is always in the context of comparing measures of splice-site strength by the same algorithm. We initially used the term 'scored as stronger by the respective algorithm' in drafting the manuscript, though felt this was a little unwieldly. Therefore, we address this point by providing a specific definition at first mention: PG 8, Line 240: "The four algorithms use different methods to measure the intrinsic 'strength' of a given donor splice-site. In the following discussion we use the term 'stronger' and 'weaker' to denote a donor that has a higher or weaker score, respectively, according to that algorithm. Comparisons such as 'weaker by >50%' denote that the donor score has been reduced by more than half by the variant" R1.4 Fig. 2

a&b, I am confused by the presence of cryptic donors that have 0 distance from the annotated-donor. Are there overlaps between the cryptic and annotated donor sites?
We acknowledge the confusion, due to each bar in these histograms corresponding to a 10 nt window. Thus, cryptic donors positioned between +/-5 nt of the annotated-donor appeared in the '0' tick label. To improve clarity, we have split histograms 2a, b (and d) into intronic and exonic cryptics so that the bars do not cross over '0'.

R1.5 A brief highlight and conclusion in the end to inform readers the potential applications of the proposed method/dataset should be helpful.
We've added a concluding paragraph to the end of the discussion (line 454) to address this point, which was also raised by R3:

PG 18, Line 1002 "In conclusion, we define an accurate, evidence-based method to predict crypticdonor activation in the context of a variant affecting the annotated-donor, based on stochastic mis-splicing events observed in 40,233 publicly available RNA-seq samples (40K-RNA). We provide a web resource that reports and ranks the most commonly (mis)used cryptic donors proximal to every ensembl annotated-donor 20 (link). Our research establishes that for AM-variants, if a cryptic-donor is activated, in 87% of cases it will be among the top 4 events. We hope this evidence-based method may improve clinical interpretability of donor variants."
(Link will be provided upon acceptance of the manuscript)

Reviewer #2 (Remarks to the Author):
R2.1 Have you seen "authentic" used elsewhere? I am used to calling these "canonical" SD. I don't hate "authentic" but I would prefer papers stick to standard nomenclature when possible.
We used the terminology "authentic" rather than "canonical" as a more general term for any annotated-donor, as variants in our database didn't necessarily affect canonical transcripts, and our 40K-RNA similarly includes non-canonical as well as canonical transcripts. In consideration of our reviewer comments, we substitute the term "authentic" with "annotated". We feel this nomenclature is more explicit and intuitive and is now used throughout the paper.

R2.2 "with deletions ranging from 1 to 57 nts in length" make it clearer this means deletions in the transcript not indels in the DNA (as I originally thought reading this sentence!)
We apologise for the lack of clarity. This sentence does in fact refer to deletions in the DNA and we have now modified the text within the results section to clarify.

R2.3 REF and VAR should be REF and ALT since this is more standard
While alternate (ALT) is more standard terminology in data science, variant (VAR) is the standard clinical terminology in reference to DNA variants identified by genomic sequencing. Therefore, we prefer to maintain use of the term VAR in order to reach our clinical genetics readership, in the hope of maximising the clinical applicability of our study.

R2.4 you should describe HOW you combine DF and SAI.
Agreed. We have clarified in the text: PG 16, Line 879 "If we take the union of SAI and 40K-RNA cryptic-donor predictions (i.e., cryptics predicted by either of the two methods), we accurately predict 2210/2389 (93%) of cryptic-donors ( Figure 6Aiii) and inaccurately predict 6% of unused decoy-donors." R2.5 "We used the 1000genomes Phase2 Reference Genome Sequence (hs37d5) version of hg1927 implemented in RStudio" I would call this "in R"... and you implemented the genome?!
We agree and have revised this (clumsy) sentence: PG 20, Line 1070 "The R package BSgenome.Hsapiens.1000genomes.hs37d5 27 was used to extract (up to) 500 nt of genomic sequence preceding and succeeding the annotated-donor (GRCh37)."

R2.6 "as defined by the 5th percentile for intron length in the human genome"... which is?
We have refined this sentence and corrected an error (we had written 5 th instead of 1 st percentile in error).
PG 20, Line 1082: "(as defined by the 1 st percentile for intron length in the human genome = 80 nt)"

R2.7 "Decoy-donor depletion was calculated using an adapted method" adaptive?
We have edited this sentence to improve the wording.
PG 21, line 1128: "Decoy-donor depletion was calculated using a method we adapted from a previous study…."
R2.9 "translated from GRCh38 to GRCh37" urgh come on it's 2021. But I realize it would be a lot of work to redo this.
We recognise this point and agree it is time to contemporise to GRCh38. However, to explain: our discovery of the predictive utility of splice-junction data in 40K-RNA occurred >2 years into a PhD study, with all other data and analysis founded on the GRCh37 genome assembly. We considered conversion to the GRCh38 genome assembly, but as this would not alter the key learnings, and with funding and time constraints trying to complete the PhD project with published outcomes, whilst managing ongoing COVID lockdowns for our laboratory, we elected to proceed to manuscript submission using GRCh37 datasets.
However, since manuscript submission, we have expanded our method using 300,000 publicly available RNA-Seq files annotated in GRCh38 (300K-RNA). Importantly, we confirm our method of ranking the Top 4 events reliably predicts the nature of multiple forms of variant-associated mis-splicing (exon-skipping, cryptic splice-site activation) with 90% sensitivity -for ~100 clinical cases subject to RNA Diagnostics (donor and acceptor variants). These data will be submitted separately for publication (soon), and the GRCh38 300K-RNA will be made publicly available as a web resource on publication of this second manuscript.
R2.10 "Counter-intuitively, less than half of activated cryptic..." This is only counter-intuitive to me if the canonical SD is now not used at all in favor of the cryptic SD, but I don't think that is the case typically here? If the cryptic SD is close in strength to the canonical it is intuitive to me the cryptic will be chosen some proportion of the time.
Splicing is somewhat noisy after all.
We agree that this word is unnecessary and have changed the sentence: PG 8, Line 252 "Intuitively, for most CM-variants the cryptic is strengthened by the variant (Figure  3Bi, orange, example shown in Figure 3Bii). However, less than half of activated cryptics are stronger than the annotated-donor (Figure 3Biii)." R2.11 "85-99% weaker" There is an awkward question of scale here. What does splice site strength mean? I'm not sure there's a great answer since it will depend on the specific algorithm. e.g. for spliceAI it means "how probable is this to be an annotated SD", but MES takes a more enrichment based approach. Equivalently one could think of considering the log odds instead of the probability... 90% would then mean something different. I don't think there is a good answer (no fault of the authors) but it might be worth discussing a bit.
In the particular sentence quoted we were actually referring to the proportion of variants that caused a decrease in score for that donor (across the four algorithms). We have amended the text to clarify: PG 10, Line 356 "In summary, four independent algorithms concur that cryptic-donor activation typically occurs in response to weakening of the annotated-donor (85 -99 % of variants)…" However, we do appreciate the reviewers point that the comparison of predicted splice-site strengths is complex. For that reason, we elected to simply compare the scores given by a specific algorithm on a linear scale for the purpose of summarising key insights, while also providing plots in the supplementary materials reporting the full dataset. We've made more explicit reference to this in the text and legend for figure 3.
PG 8, Line 240: "The four algorithms use different methods to measure the intrinsic 'strength' of a given donor splice-site. In the following discussion we use the term 'stronger' and 'weaker' to denote a donor that has a higher or weaker score, respectively, according to that algorithm. Comparisons such as 'weaker by >50%' simply denote that the score given to the donor has been reduced by more than half by the variant." PG 9, Line 278: "Assessment of algorithmic scores of splice-site strength for AM-variants. Categories such as 'weaker by >50%' are assigned based on how the score has been impacted by the variant (i.e. more than halved)." R2.12 "NNS, MES, and DF rank the decoy-donor at +57 as the most" ... it is not surprising to me that NNS and MES get this wrong since they don't know (or even see) the G-rich SRE. But it is surprising to me that DF gets this wrong: shouldn't +57 not be used in the RNA-seq database following your logic?
We suspect in this instance R2 has confused Donor Frequency (DF) with 40K-RNA (previously called Splice Competent Events (SCE)). We hope our emended terminology partly addresses this confusion and we have further refined the next sentence to improve clarity: PG 14, Line 608: "NNS, MES, and DF rank the decoy-donor at +57 as the strongest donor -however this donor is enveloped within a G-repeat rich region which may mask it, and accordingly is not present in 40K-RNA." Discussion: R2.13 "determining spliceosomal 'usability' of a decoy splice-site, including 2-4 above,". I agree SpliceAI models your points 3 and 4, but not 2. When it makes a prediction it doesn't know anything about where the canonical SD is.
Agreed. We have removed the phrase 'including 2-4 above' from the sentence (line 608).
R2.14 I recommend acceptance on the condition the authors make their literature curated set of known (variant, canonical SD, cryptic SD) publicly available, since this will be a nice resource for the community.
We agree and have provided the 'cryptic-donor database' (annotated-, cryptic-and decoydonors, with predicted scores from all 4 algorithms for REF and VAR sequences) in the source data.