Introduction

Removal of introns from precursor messenger RNA (pre-mRNA) by splicing is a critical step in eukaryotic gene expression. Splicing of human pre-mRNAs is mediated by conserved but highly degenerate sequences that include the MAG/GURAGU consensus (M is A or C; R is purine; / is the exon–intron boundary) at the 5′ splice site (ss) and the YAG/R motif (Y is pyrimidine) at the 3′ss, which is preceded by an upstream poly-Y tract and the branch point sequence. In addition to these traditional sequences, accurate recognition of exons and introns by the spliceosome requires auxiliary elements in the pre-mRNA that repress or promote splicing, termed exonic or intronic splicing silencers (ESSs/ISSs) or enhancers (ESEs/ISEs). These signals are thought to act through combinatorial effects of RNA secondary structure1, 2 and/or numerous regulatory factors that bind to pre-mRNAs, including serine/arginine-rich (SR) proteins3, 4 and heterogeneous nuclear ribonucleoproteins.5, 6, 7, 8

Auxiliary splicing sequences have been characterized experimentally,3, 9, 10 computationally11, 12 or by a combination of the two approaches.13, 14, 15 Recent gene-specific16 and genome-wide17 comparisons showed that exonic sequences between authentic and aberrant splice sites that were activated by splice-site mutations have lower frequencies of ESEs and higher densities of ESSs than average exons. Conversely, intronic sequences between authentic and cryptic/de novo splice sites have more enhancers and fewer silencers than average introns.16, 17 Although the relative importance of various auxiliary signals in the development of aberrant splice-site activation in vivo is poorly understood, silencers, and putative octamer ESSs (PESSs)12 in particular, were identified as stronger predictors of aberrant splice-site activation than ESEs,16, 17, 18 consistent with a dominant role of repressive elements that keep highly abundant decoy splice-site signals in check.

Splicing mutations play a major role in the development of hereditary diseases and may represent up to 50% of disease-causing alterations in genes with a large number of introns.19, 20 This figure is likely to be an underestimate, because current mutation-screening policies are biased towards coding DNA, leaving aberrant splicing undetected in many cases, particularly de novo splice sites in introns. Pre-mRNA splicing can be impaired by mutations located anywhere in the gene, but most have been found in GT and AG dinucleotides that define 5′ and 3′ intron ends,21 reflecting their highest level of conservation among splice-site consensus sequences.22 Although accurate description of the resulting abnormal transcripts is important for predicting the degree of severity and the age of onset of both Mendelian and complex traits, RNA samples from affected individuals or their family members are often not available and functional splicing assays are costly and time-consuming. Computational prediction of these outcomes from genomic sequences would therefore provide a useful alternative, but such methods have not been available.

Here, we describe the development of a method that can distinguish exons that are skipped and exons that activate cryptic splice sites as a result of splicing mutations. The new procedure was capable of predicting the correct outcome in 72% of the cases and was implemented as an easy-to-use web application termed CRYP-SKIP, which is freely available at http://www.dbass.org.uk/cryp-skip/.

Materials and methods

To distinguish exon skipping and aberrant splice-site activation from pre-mRNA sequences, we set out to compare both traditional and auxiliary splicing elements between two well-defined groups of sequences. The first group (dataset termed EXSK) contained a set of 250 exons that were skipped as a result of disease-causing splicing mutations but did not activate cryptic splice sites in flanking exons or introns.17 We used the same ascertainment criteria17 to obtain 47 additional EXSK sequences from recently published reports (Supplementary Table 1), giving 297 EXSK sequences in total. For the second group, we analyzed a total of 204 exonic sequences that sustained cryptic splice-site activation as a result of germ-line or somatic splicing mutations. This dataset (termed CR-E) is available from the updated Database of Aberrant Splice Sites (DBASS),23, 24 maintained at http://www.dbass.org.uk, and includes all mutation-induced cryptic splice sites reported in peer-reviewed journals between 1981 and June 2008.

In each exonic sequence, we determined the location and strength of predicted (decoy) 3′ and 5′ss, and counts and densities of previously identified auxiliary splicing sequences. For decoy splice sites, we employed a neural network (NN) splice-site prediction algorithm and the NN Splice server (http://www.fruitfly.org/seq_tools/splice.html).25 ESSs discovered by a fluorescence-activated screen (FAS-ESSs)13 were computed using the FAS-ESS server (http://genes.mit.edu/fas-ess/). Putative enhancers and silencers obtained by comparing non-coding exons and pseudoexons or untranslated regions of intron-less genes (PESXs) were identified with the PESX algorithm12 (http://cubweb.biology.columbia.edu/pesx/). Scores for SF2/ASF, which was confirmed as the most important SR protein for aberrant splice-site activation in vivo,17 were computed using the updated matrix26 and a standard threshold implemented in the ESEFinder (version 3; http://rulai.cshl.edu/tools/ESE/).27 In addition, we employed a recently published set of 1131 and 708 exon and intron identity elements (EIEs and IIEs, respectively) and their Z-scores derived from DNA-strand asymmetry patterns.28 Finally, we examined both datasets using the Neighborhood Inference (NI) method29 that predicts the activity of splicing regulatory elements based on the local density of known sites in sequence space. For each potential predictor variable, we computed the total number of ESSs/ESEs per exon and, where applicable, the sum of their scores. Each count and score density was calculated per 100 nucleotides as described earlier.17
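The density calculation described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, the two hexamers are arbitrary placeholders rather than entries from any published motif set, and counting overlapping matches is an assumption.

```python
from __future__ import annotations


def count_motif_hits(exon: str, motifs: set[str]) -> int:
    """Count occurrences of any motif in the exon (overlaps are counted)."""
    exon = exon.upper()
    hits = 0
    for k in {len(m) for m in motifs}:
        for i in range(len(exon) - k + 1):
            if exon[i:i + k] in motifs:
                hits += 1
    return hits


def density_per_100nt(total: float, exon_length: int) -> float:
    """Normalise a motif count (or a summed score) to 100 nucleotides."""
    if exon_length <= 0:
        raise ValueError("exon length must be positive")
    return 100.0 * total / exon_length


# Placeholder motifs, chosen only so the arithmetic is easy to follow.
placeholder_motifs = {"AAAAAA", "CCCCCC"}
example_exon = "AAAAAAA" + "GG" + "CCCCCC" + "TT" + "GT" * 16 + "G"  # 50 nt

example_hits = count_motif_hits(example_exon, placeholder_motifs)
example_density = density_per_100nt(example_hits, len(example_exon))
```

The same normalisation applies to score densities: pass the summed scores instead of the hit count as `total`.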

To model the relationship between the predictor variables and a dichotomous response (either EXSK or CR-E), we used multiple logistic regression to estimate the probability of cryptic splice-site activation (PCR-E; defined below for the final model) and exon skipping (1−PCR-E). Eighty per cent of EXSK (n=238) and CR-E (n=163) sequences were randomly chosen as a training set, whereas the remaining EXSK (n=59) and CR-E (n=41) sequences were used as a test set to validate the performance of our discrimination procedure. Competing models were compared by the likelihood ratio test and the likelihood-based Akaike's information criterion. The discrimination ability of each model was assessed using leave-one-out cross-validation with the training dataset. For data handling and statistical analysis, we employed the R statistical software (http://www.r-project.org).30
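The per-class 80/20 split can be illustrated with a short sketch. The analysis itself was done in R. This Python version is hypothetical and only reproduces the set sizes: the identifiers, seeds and rounding rule are assumptions.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=0):
    """Randomly split one class of sequences into training and test subsets."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers standing in for the real sequences.
exsk_ids = [f"EXSK_{i}" for i in range(250 + 47)]   # 297 skipped exons
cre_ids = [f"CR-E_{i}" for i in range(204)]         # 204 cryptic-site exons

exsk_train, exsk_test = split_train_test(exsk_ids, seed=1)
cre_train, cre_test = split_train_test(cre_ids, seed=2)
```

Splitting each class separately keeps the EXSK:CR-E ratio similar in the training and test sets, which matches the per-class counts given in the text.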

Results

Algorithm

Median values of the predictor variables in EXSK and CR-E datasets and their distribution are shown in Table 1 and Supplementary Figure 1; full datasets are available in Supplementary Table 2. After comparing univariate models (Supplementary Table 3), we built an initial multivariate model using appropriately transformed predictor variables (Supplementary Figure 2). The NI and IIE score densities were omitted from this model, as they did not improve the model fit in the presence of the remaining variables (P-value of the likelihood ratio test was 0.83). Count densities were also left out, because they were highly correlated with the score densities and provided less information (for example, Pearson's correlation coefficients for EIEs and IIEs count/score densities were 0.86 and 0.94, respectively). The EIE score density was retained in the model despite its non-significance, as this predictor appeared to improve discrimination of EXSK and CR-E (Supplementary Table 4).

Table 1 Median values of potential predictors of the splicing outcome

On the basis of our final multivariate model, we define PCR-E as:

where L is exon length (in nucleotides), PESS is the PESS density, NN5 is the density of decoy 5′ss, SF2 is the SF2/ASF score density, FAS is the FAS-ESS hex2 density and EIE is the EIE score density (Supplementary Table 2). Coefficient estimates of the final model, their standard errors and significance are shown in Table 2.
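For orientation, a multiple logistic regression on these six predictors has the following generic form. This is a sketch of the model family only, not a reconstruction of Equation (1): the coefficients β (reported in Table 2) and the transformations f (Supplementary Figure 2) are left symbolic because their values are not given in this text.

```latex
\[
P_{\text{CR-E}} = \frac{1}{1 + e^{-\eta}}, \qquad
\eta = \beta_0
     + \beta_{L}\, f_{L}(L)
     + \beta_{\text{PESS}}\, f_{\text{PESS}}(\text{PESS})
     + \beta_{\text{NN5}}\, f_{\text{NN5}}(\text{NN5})
     + \beta_{\text{SF2}}\, f_{\text{SF2}}(\text{SF2})
     + \beta_{\text{FAS}}\, f_{\text{FAS}}(\text{FAS})
     + \beta_{\text{EIE}}\, f_{\text{EIE}}(\text{EIE})
\]
```

In any model of this form the odds satisfy \(P_{\text{CR-E}}/(1-P_{\text{CR-E}}) = e^{\eta}\), so each term contributes multiplicatively to the odds.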

Table 2 Multiple logistic regression table

We next evaluated our discrimination procedure using an independent set of exons. Figure 1 shows the PCR-E distribution computed separately for the training and independent set. In the independent dataset, 54% of CR-E sequences had a PCR-E value >0.5, whereas 85% of EXSK sequences had an estimated PCR-E value ≤0.5. Conversely, 72% of exons with a PCR-E value ≤0.5 underwent exon skipping, whereas 71% of exons with a PCR-E value >0.5 sustained cryptic splice-site activation. Taken together, 72/100 (72%) sequences in the test set were correctly classified.
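These percentages are mutually consistent, as the following arithmetic check shows. The counts are inferred from the reported figures (test-set sizes from the 80/20 split, correct calls from the rounded percentages). They are not taken from a published confusion matrix.

```python
# Test-set sizes implied by the dataset totals and the training-set sizes.
n_exsk_test = (250 + 47) - 238   # 59 skipped exons
n_cre_test = 204 - 163           # 41 cryptic-site exons

# Correct calls inferred from the reported percentages (assumed rounding).
cre_correct = round(0.54 * n_cre_test)     # CR-E with P > 0.5
exsk_correct = round(0.85 * n_exsk_test)   # EXSK with P <= 0.5

# Exons falling on each side of the 0.5 threshold.
predicted_skip = exsk_correct + (n_cre_test - cre_correct)
predicted_cryptic = cre_correct + (n_exsk_test - exsk_correct)

skip_precision_pct = round(100 * exsk_correct / predicted_skip)
cryptic_precision_pct = round(100 * cre_correct / predicted_cryptic)
accuracy_pct = round(100 * (cre_correct + exsk_correct) / (n_exsk_test + n_cre_test))
```

Under these assumptions the inferred counts give 72% and 71% for the two threshold-side percentages and 72/100 correct overall, in agreement with the text.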

Figure 1

PCR-E distribution of exonic sequences that underwent cryptic splice-site activation or exon skipping. (a) Training set; (b) independent set. All exonic sequences, the intrinsic strength of their authentic and cryptic splice sites, underlying mutations and their phenotypic consequences are described in the online Database of Aberrant Splice Sites. Their PCR-E values were determined as in Equation (1). Each column shows the number of EXSK (gray) or CR-E (black) events that had the PCR-E value in the interval shown on the x-axis.

The dependence of the PCR-E on predictor variables can be described in terms of the odds of a cryptic splice-site activation (OCR-E), defined as PCR-E/(1−PCR-E). Assuming the above model, a 33% increase in exon length would increase the estimated OCR-E by 42% if the values of the remaining predictors are kept unchanged. The same increase in the estimated OCR-E would require a rise in NN5 by 0.076, or in SF2 by 3.4, or in EIE by 346.6, with the values of the remaining predictors fixed. Conversely, a 42% decrease in the estimated OCR-E would result from an increase in PESS by 4.1 or from an increase in FAS density by 1.54, again without changing the values of the remaining five predictors. The OCR-E value was not much influenced by NN5 higher than 0.2, SF2 higher than 10 and FAS lower than 6.5 (Supplementary Figure 2). Together, these results illustrate how the values of some predictors influence the odds of cryptic splice-site activation, with practical implications for predicting aberrant splice-site activation ab initio.
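The relation between odds and probability used above can be made concrete with a small sketch. The function names and the starting probability of 0.5 are illustrative assumptions. Only the definition OCR-E = PCR-E/(1−PCR-E) and the 42% figure come from the text.

```python
import math


def odds_from_probability(p: float) -> float:
    """O = P / (1 - P)."""
    return p / (1.0 - p)


def probability_from_odds(o: float) -> float:
    """Inverse of the odds transformation: P = O / (1 + O)."""
    return o / (1.0 + o)


def shift_probability(p: float, odds_change_percent: float) -> float:
    """Probability after the odds change by the given percentage."""
    new_odds = odds_from_probability(p) * (1.0 + odds_change_percent / 100.0)
    return probability_from_odds(new_odds)


# Starting from an undecided exon (P = 0.5, odds = 1):
p_after_increase = shift_probability(0.5, +42)   # e.g. a 33% longer exon
p_after_decrease = shift_probability(0.5, -42)   # e.g. PESS up by 4.1

# A 42% rise in the odds corresponds to this shift on the log-odds scale.
log_odds_shift = math.log(1.42)
```

For an exon starting at 0.5, a 42% increase in the odds moves the probability to about 0.59 and a 42% decrease moves it to about 0.37. The same percentage change has a smaller effect on probabilities that are already close to 0 or 1.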

To further validate the performance of the algorithm using experimental data, we compared the previously observed outcomes of splicing mutations in the RB1 gene31 with PCR-E values for each exon (Supplementary Table 5). This comparison showed that 13/14 (93%) exons that were skipped as a result of mutation were correctly predicted by the CRYP-SKIP algorithm. The only exception was exon 25, in which the predicted cryptic 5′ss activation (ATG/GTATGT in the middle of the exon) was not observed, despite a relatively high PCR-E value of 0.64. A failure to activate this splice site in vivo (mutation IVS25+1G>A; ref. 31) may be explained by the small size of the reduced exonic segment (33 nt), which may need additional splicing enhancer elements in flanking intronic sequences.

Finally, we calculated PCR-E values for 43 243 constitutively spliced human exons32 and for a set of 1909 alternatively spliced exons that are conserved between mouse and humans.33 As expected, the average PCR-E values were higher in the former group (0.39 vs 0.34, t-test, P<10⁻¹⁵), suggesting that alternatively spliced or weakly included exons are less likely to sustain cryptic splice-site activation when their authentic site is mutated. This is probably attributable mainly to a smaller average exon size of the latter group (139 vs 128 nt).

CRYP-SKIP

Our final regression model was incorporated in a new algorithm termed CRYP-SKIP, which was implemented as a common gateway interface script on a public server available at http://www.dbass.org.uk/cryp-skip/ or http://cryp-skip.img.cas.cz. The algorithm determines the exon length for each submitted sequence and performs a search for PESSs with a Z-score cut-off value of −2.62 (ref. 12), the FAS-ESS hex2 set13 and EIEs.28 For the analysis of decoy splice sites and SF2/ASF scores, the script (programmed in Perl) interacts with the NN Splice server and the ESEFinder, respectively. The web application computes count and score densities of these elements and employs the logistic regression model as described above to calculate the PCR-E value (Equation (1)) for each submitted sequence.

CRYP-SKIP users submit a DNA sequence (FASTA format) consisting of one exon (in upper case) and 100 nt of flanking intervening sequences (in lower case). Pairs of wild-type and mutated sequences or multiple FASTA sequences are permitted, as long as the total sequence does not exceed a limit of 4000 bp. The server output is a single page with a summary table containing a list of predictors, their calculated values and PCR-E, which is graphically shown as a pointer next to the table (Figure 2). PCR-E takes values between 0 and 1, with higher values speaking in favor of cryptic splice-site activation and lower values in favor of exon skipping (Figure 1). Finally, CRYP-SKIP shows predicted cryptic splice sites as vertical marks in the input sequence; their size reflects the relative intrinsic strength of decoy splice sites and sums to the PCR-E value of the submitted exon.
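The input convention (upper-case exon, lower-case flanking introns, a 4000 bp limit) can be illustrated with a small pre-submission check. This is a hypothetical sketch and not the server's Perl code: the function names are invented, the restriction to A/C/G/T and the counting of flanks towards the limit are assumptions, and the example uses shortened flanks instead of 100 nt.

```python
import re

MAX_TOTAL_BP = 4000  # limit quoted in the text

# lower-case intron, upper-case exon, lower-case intron
EXON_PATTERN = re.compile(r"([acgt]*)([ACGT]+)([acgt]*)")


def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def exon_lengths(text):
    """Check the submission and return the exon length of each record."""
    records = parse_fasta(text)
    if sum(len(seq) for _, seq in records) > MAX_TOTAL_BP:
        raise ValueError("total sequence exceeds 4000 bp")
    lengths = {}
    for name, seq in records:
        match = EXON_PATTERN.fullmatch(seq)
        if match is None:
            raise ValueError(f"{name}: expected intron-EXON-intron layout")
        lengths[name] = len(match.group(2))
    return lengths


example_submission = ">wild_type\nttttcagGATTACAgtaagt\n>mutant\nttttcagGATTACAataagt\n"
example_lengths = exon_lengths(example_submission)
```

The example pairs a wild-type sequence with a mutant in which the first intronic nucleotide after the exon is changed (g to a), mirroring the paired submissions the server accepts.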

Figure 2

A screenshot of the CRYP-SKIP output. PCR-E is shown on the right as a pointer balancing between exon skipping (EXSK) and cryptic splice-site activation (CR-E). Values of each predictor variable used in the regression model are summarized in a table on the left. The last row of each table shows the numerical value of PCR-E for each input sequence. Exonic sequences (highlighted in light blue) are in upper case and flanking introns are in lower case. Predicted aberrant splice sites are shown as arrows, with their size reflecting their relative predicted strength (scale 0–1 on the right). The sum of their sizes equals the PCR-E value for the submitted sequence. The color reproduction of this figure is available in the HTML full-text version of the manuscript.

Discussion

CRYP-SKIP is a comprehensive computational tool that predicts whether inactivation of authentic splice sites by mutation is more likely to result in exon skipping or aberrant splice-site activation. As the two events represent the vast majority of pathogenic transcripts induced by splice-site mutations, the algorithm will facilitate prediction of aberrant splicing outcomes for most splicing mutations in human genes. Because cis-acting splicing signals and spliceosome components are generally well conserved in higher eukaryotes, the same algorithm should discriminate the two aberrant splicing outcomes in other mammalian or vertebrate species as well, although this remains to be tested.

Our model is based on a comprehensive sample of carefully selected and well-documented dichotomous events that gives us a unique opportunity to study splice-site selection in vivo as opposed to in vitro experiments. This approach should be instrumental when addressing the question of why a particular decoy splice-site signal was selected by the spliceosome despite having a lower intrinsic strength than similar signals in the vicinity that were not recognized. Both CR-E and EXSK datasets are likely to expand in the future, which will be facilitated by a more widespread use of RNA-based mutation screening and by more rigorous characterization of aberrant transcripts at the nucleotide level in affected individuals. Published reports should include estimates of the relative amounts of RNAs transcribed from mutated alleles and also evidence for adequate separation of RT-PCR products using polyacrylamide gel electrophoresis, as many cryptic splice sites are activated just a few nucleotides away from authentic sites. Finally, because ins/del polymorphisms represent an important and underappreciated source of disease-associated cryptic splice sites or pseudoexon activation, particularly in repetitive sequences (Meili et al.34), DNA-based mutation screening of disease genes should employ and further develop methods capable of detecting structural variants.

In the future, it will be desirable to extend this tool to ab initio prediction of mutation-induced cryptic splice sites in flanking intronic sequences. Location of aberrant splice sites in introns is not symmetrical, reflecting the more complicated pattern of 3′ss organization compared with the 5′ss.23, 35 The variability of traditional splicing signals at the 3′ss (branch site, polypyrimidine tracts, 3′YAG and upstream AG exclusion zones) from one intron to another is considerable, and cooperative assembly of spliceosomal complexes at these signals is further confounded by local secondary structure and by multiple, distant or non-canonical branch points. In addition, auxiliary splicing sequences have been studied less in introns than in exons, although some of them have recently been characterized in more detail, such as short G-rich repeats.36, 37, 38 The extended datasets should provide more power for building robust models that include additional predictors, which did not give significant P-values in the analyzed sample and/or did not significantly improve our model, including decoy 3′ss, RESCUE-ESE11 and NI.29 Future efforts should also be facilitated by a more comprehensive dissection of cooperative interactions between splicing signals upstream of 3′ss, including distant branch sites. Thus, prediction algorithms discriminating the two aberrant RNA outcomes from DNA sequences are likely to be further improved, ultimately leading to better understanding of splice-site selection in vivo and more accurate characterization of human mutations and their phenotypic consequences.