The Usage of Exon-Exon Splice Junctions for the Detection of Alternative Splicing using the REIDS model

Alternative gene splicing is a common phenomenon in which a single gene gives rise to multiple transcript isoforms. The process is strictly guided and involves a multitude of proteins and regulatory complexes. Unfortunately, aberrant splicing events have been linked to genetic disorders. Therefore, understanding mechanisms of alternative splicing regulation and differences in splicing events between diseased and healthy tissues is crucial in advancing personalized medicine and drug developments. We propose a linear mixed model, Random Effects for the Identification of Differential Splicing (REIDS), for the identification of alternative splicing events using Human Transcriptome Arrays (HTA). For each exon, a splicing score is calculated based on two scores, an exon score and an array score. The junction information is used to rank the identified exons from strongly confident to less confident candidates for alternative splicing. The design of junctions was also discussed to highlight the complexity of exon-exon and exon-junction interactions. Based on a list of Rt-PCR validated probe sets, REIDS outperforms AltAnalyze and iGems in the % recall rate.

gene level intensities, and compares the normalized intensities between sample groups. More recent methods for alternative splicing detection include the Robust Alternative Splicing Analysis for Human Transcriptome Arrays method (RASA) 23 , Integrated Gene and Exon Model of Splicing (iGEMS) 14 , AltAnalyze 24 and the Random Effects model for the Identification of Differential Splicing (REIDS) 25 .
In this paper, we focus on HTA microarrays 26 . This array platform does not only assess each exon with more probes but also contains exon-exon junction probes that cover the region between two exons. In the current paper we extend the REIDS model 25 to include junction information. The presence of junction probes can be beneficial for detecting alternatively spliced exons and for unravelling the composition of transcript isoforms. The analytical framework presented in this paper consists of two steps. In the first step the REIDS model was used for alternative splicing detection. In the second step the reliability of the junctions is evaluated to support (or not) alternatively spliced exons. The REIDS analytical framework model was compared with RASA, iGEMS, the Affymetrix TAC tool 12 and the AltAnalyze software 24 .

Methods and Materials
Data. The HJAY Platform. The Affymetrix Human Transcriptome Array 1.0 (HJAY, HTA 1.0) 27 is an expansion of the Human Exon array containing 10 probes per probe set. The HJAY array also contains exon-exon junctions which are supported by four to eight probes per junction. Junction probes were designed to reveal alternative splicing events and transcript isoforms. There are three types of junctions: a 5′ end junction, a 3′ end junction (located at the 5′ end and the 3′ end of the probe sets respectively) and an exclusion junction. The data we analysed in this paper is described in detail in 23 . Briefly, the data consists of two tissues, liver and muscle, each with three replicates. In total 33,516 genes were measured using 298,281 exons and 249,475 junctions, each represented by eight probes on average. The data is publicly available on the website http://igenomed.stanford. edu/~junhee/RASA/.
The HTA-2.0 Platform. The HTA-2.0 microarray is a more recent and updated version of the HJAY platform. The number of interrogated probe sets is increased to 1,048,904 and the number of junctions to 339,000. On avergae, exons are measured with 10 probes and junctions with four probes. The analyzed data set contains cell lines with A549 lung adenocarcinoma that were treated with either scrambled RNA or transfected with a siRNA that targets SRSF1. Each condition is represented by three independent subjects with three replicates per subjects, leading to nine samples per condition. The data is publicly available in the GEO database under the accession number GSE76902 28 .

Detection of Alternative Spliced Exons: A Mixed Effects Model Approach. This section presents
the extension of REIDS analytical framework to incorporate exon-exon junctions. In the first step the REIDS model was applied to find alternative spliced (AS) candidates as originally proposed 25 . In the second step a scheme was developed using junction formation to classify the identified exons from strongly confident to less confident candidates for alternative splicing.
The REIDS Model. The REIDS model 25 formulates alternative splicing detection as a variance decomposition problem based on the assumption that between array variability of an alternatively spliced exon would be higher than the within array variability among the exons of the same transcript cluster (a gene). A non-alternatively spliced exon would have a between array variability that is at most the within array variability across all exons of the same transcript cluster. The two variance components, representing the between and within array variability, can be estimated using a gene specific mixed model given by The observed perfect match (PM) probe intensities are modelled in terms of the jth probe effect (p j , j = 1, …, J) and the overall effect of the ith array (c i , i = 1, …, I). Both p j and c i are considered to be fixed effects. The specific deviation of the kth exon from the overall gene effect is captured by the random effects, b ik , k = 1, …, K. The random effects b ik ~ N(0, D k ), are assumed to be independent of the background noise, ε ijk ~ N(0, σ 2 ), which captures the within array variability with σ 2 . Differential expression between the arrays is expected to be negated by incorporating the gene effect parameter c i . As a consequence, the exon specific parameters capture the deviation of a particular exon from its corresponding gene. A relatively small deviation implies that the exon is present in the sample while a large deviation indicates that the exon is likely to be absent.
Detection of Alternatively Spliced Exons. The advantage of a mixed model formulation for alternative splicing detection is the existence of a standard score to quantify the trade-off between signal and noise, the so-called intra-cluster correlation (ICC) 29 . From here onward we use the term exon score which is given, for the kth exon, by: where σ 2 is the same for all exons belonging to the same transcript cluster. A value of ρ k > 0.5 indicates that the kth exon is likely to be alternatively spliced. Note that the threshold for the exon score could be adjusted to the relative amount of signal in the data. Given that the kth exon has been identified to have substantial between array variability, the estimated random effects b ik can be used as array scores to identify arrays in which the alternatively spliced exon is expressed. Positive array scores denote an enrichment while negative values denote a depletion of the exon. The REIDS method is applied to each gene to obtain exon and array scores after which the exons are prioritized according to their exon scores. Probe sets with exon scores greater than a pre-specified threshold are retained for further investigation. The differences in array scores between biological conditions or tissue types are tested using the t-test for independent groups or the paired t-test for paired data. A probe set is considered to be alternatively spliced if the difference between the biological conditions or tissue types is statistically significant after the Benjamini-Hochberg multiplicity correction 30 .
Using Junction Information to Support Detection of Alternative Splicing Exons. In this section we explain how exon-exon junctions can either support or not support an exon to be allternatively spliced.
Motivation and Design. We illustrate the exon-exon junctions by inspecting the design of probe set PSR010025633 of the transcript cluster TC0102569, annotated to the CSDE1 gene, on the HJAY microarray. This probe set has five annotated junctions: a 5′ end junction, a 3′ end junction and three exclusion junctions. Figure 1 illustrates the design of the probe set and its junctions.
The upper panel of Fig. 1 shows that junction JUC0100236357 is the 3′ end junction of PSR010025632 and the 5′ end junction of PSR010025633. This implies that the sequence of junction JUC0100236357 is constructed from the end (3′ end) of PSR010025632 and the start (5′ end) of PSR010025633. This is shown in Fig. 2. Notice how the sequence of JUC0100236357 overlaps with the sequence of its annotated probe sets. Similarly, JUC0100236756 is the 3′ end junction of PSR010025633 and the 5′ end junction of PSR010025634. This is shown in Supplementary  Fig. 1.
We expect the presence of a junction only when both of its anchor points are present. This implies that JUC0100236357 will be quantified when both PSR010025632 and PSR010025633 are present in a transcript isoform without interruption. Similarly for junction JUC0100236756 and probe sets PSR010025633 and PSR010025634. Consequently, the presence or absence of the 5′ end and/or 3′ end junctions can indicate the presence or absence of a link between probe sets. As shown in Fig. 2, Junction JUC0100236357 has a matched sequence with the designed probes for both PSR010025632 and PSR010025633. This does not hold for all junctions. It is possible for a junction to have no 5′ end and/or 3′ end annotation or that the sequences of the junction do not match with the sequences of the probe sets. The latter implies that the junction might not quantify the same transcript as its adjacent probe sets.
The lower panel of Fig. 1 shows three exclusion junctions annotated to probe set PSR010025633: JUC0100236628, JUC0100236286 and JUC0100236573. These junctions are present when the exon is absent. Junction JUC0100236628 is annotated as the 3′ end junction of PSR010025630 and the 5′ end junction of PSR010025634. This entails that it will only be observed if probe sets PSR010025630 and PSR010025634 are  Junction Assessment Procedure. First, we consider the 5′ end and 3′ end linking junctions. The linking junctions indicate whether the annotated anchor points are neighbouring in a transcript isoform. A linking junction reflects such a bond if the junction represents both exon probe sets simultaneously, i. e. if the junction probe set reflects the median profile of the exon probe sets. In order to validate this, a model is fitted on the junction probe set and it's annotated exon probe sets. Since the pattern of splicing is more important than the expression values, the values are transformed to their corresponding ranks (R xk ). Next, the ranks of the individual probes of the exon probe sets (i.e. not of the junction) are averaged and ranked anew to obtain a median profile representing both anchor points of the junction. Finally, two models were fitted on the averaged ranks of the probe sets and the ranks of the junction.
with k = 1, …, K and x = 1, 2. Both models include a probe set effect and a group effect. The probe set effect is a factor with two levels either representing the profile of the anchor points or the junction. For the null model, M1, the junction has the same profile as the probe sets. Hence, no interaction effect is specified. The alternative model, M2, assumes a conflict between the junction and the probe sets, which is captured by an interaction effect. A significant interaction term implies that the junction is not an end product of the two probe sets. Figure 3 illustrates two figurative scenarios of alternatively spliced exons between condition A (samples A 1 , A 2 and A 3 ) and a condition B (samples B 1 , B 2 and B 3 ). Panel (a) reflects a situation in which the median profile of the exon probe sets (PSR1 and PSR2) is reflected by the linking junction probe set (JUC). The values of the exon probe sets PSR1 and PSR2 are high in condition A and low in condition B. The median profile of PSR1 and PSR2 shows the expected the behaviour of the linking junction: high in condition A and low in condition B. The expression values of the junction reflect the same pattern as the prob sets, resulting in an insignificant interaction term. Therefore, the junction supports the link between PSR1 and PSR2. In the second scenario, presented in panel (b), the probe sets and the junction show a conflicting pattern. PSR1 is highly expressed in condition A and depleted in condition B while PSR2 shows a reversed pattern. The resulting median profile is straight line. However, the observed junction values show a deviating pattern. Consequently, the interaction term will be significant. A real data example and the conversion to ranks are illustrated in Supplementary Figs 4

and 5.
An exclusion junction is considered to be absent if it is not detected above background (DABG) 31 level across all samples. This means that at least one of the skipped probe sets is considered present in an isoform and the junction does not support an alternative splicing event. Otherwise, the exclusion junction is present and support Otherwise, the junction is present and supports of the linking bond as well as of the AS candidacy of the excluded probe sets. The linking probe sets are considered to be neighbouring in at least one isoform, skipping all probe sets in between. For example in Fig. 1, if JUC0100236573 is deemed present by DABG analysis, probe sets PSR010025632 and PSR010025638 are neighbouring in at least one transcript isoform. However, if the junction is deemed absent by DABG analysis, then there is at least one exon probe set in between PSR010025632 and PSR010025638. DABG scores were calculated using the Affymetrix tool TAC. More information on exclusion junctions is given in Section 2.2 of the Supplementary Material.
The are four resulting categories of junction assessment of probe sets identified by REIDS: probe sets with linkages supported by all junctions, probe sets with linkages supported by at least one junction, not supported probe sets and probe sets without a 5′ end or 3′ end junction.
Alternative Splicing Types. Behind an AS identification, there is a specific AS type. This is determined by the location of the probe set in the transcript isoforms and the pattern of the neighbouring probe sets. The basic AS types are: a cassette exon, mutually exclusive exons, an alternative 5′ site, an alternative 3′ site, an alternative last, an alternative first and an intron retention. If information regarding the isoform composition of a transcript cluster is available, it can be deduced to which class an AS probe set belongs. In combination with the junction assessment procedure, the number of isoforms can be reduced if a reported linkage between probe sets is unsupported. A probe set is appointed the "complex event" class if the event does not belong to the arbitrary classes. Availability of data and materials. The HJAY data can be retrieved from the repository of the RASA paper 23 and the HTA-2.0 date set from GEO database under the accession number GSE76902 28 . The code to process the data and perform the analysis is bundled into an R package, REIDS, available on CRAN.

Results
The HJAY and HTA-2.0 data were preprocessed with the R package aroma.affymetrix 32 . An annotated.cdf file was created by merging the information from the provided.pgf and.clf files. In order to prepare the data for the REIDS model, the raw.CEL files were background corrected using the rma background correction, normalized with quantile-normalization and log2-transformed but without summarization. We illustrate the REIDS model with the HJAY data and benchmark the method with TAC and AltAnalyze based on the HTA-2.0 data.

Illustration of the REIDS model on the HJAY platform. Step 1: Identification of Candidate AS Exons.
A total of 33,516 genes were retained after pre-processing the data to probe level, 24,096 out of 33,516 genes had more than one probe set and were therefore selected for alternative splicing detection. Combined, these genes had 288,515 exon probe sets of which eventually 17,700 were identified by REIDS as candidates for alternative splicing. Figure 4 presents probe sets PSR010025633 of transcript TC0102569 and PSR080007308 of transcript TC0800969 which were found to be alternatively spliced between the liver (L) samples and the muscle samples (M) in the first step of the analysis.
The identified probe sets were further analysed using their associated junction information. In the conservative mode we require the support of all annotated junctions. 8,626 out of the 17,700 exon probe sets were identified as alternative spliced candidates and were supported by all their annotated junctions. 9,946 out of the 17,700 exon probe sets were supported by at least one of their annotated junctions. Finally, 12,323 probe setshad no supporting junctions. Figure 5 provides an overview of the classification of the probe sets with and without supporting junctions. Step 2: Junction Based Classification. We show examples of exon probe sets which had support of all their annotated junctions to be alternatively spliced and those which had no junction support.
Alternative Splicing With Supporting Junctions. A top ranked probe set is PSR010025633 of transcript TC0102569 (Fig. 1) with an exon score of 0.99 and a p-value < 0.01 for the significance testing of its array scores between the liver and the muscle tissues. This exon belongs to category (a) in Fig. 5. Figure 6 shows the probe data and the summarized values of the probe set, its 5′ end and 3′ end junctions. Probe set PSR010025633 is depleted in the liver samples and enriched in the muscle samples. This pattern is also seen for the neighbouring probe sets PSR010025632 and PSR010025634 and the junction probe sets JUC0100236357 and JUC0100236756. Both linking junctions had a similar pattern as the probe set and support the exon to be alternatively spliced. The three exclusion junctions are presented in Supplementary Fig. 13 and revealed an opposite pattern. The presence of Figure 5. The distribution of the 17,700 exon probe sets that were identified as alternatively spliced by the REIDS model in the first step of the analytical framework. 8,626 probe sets were found to be AS and supported by all their annotated junctions. 9,946 probe sets were found to be AS with the support of at least one of their annotated junctions. There were 12,323 probe sets without junction support. the exclusion junction confirms the absence of the probe set in at least some transcript isoforms. The alternative splicing detection of PSR010025633 was supported by all its annotated junctions.
Probe set PSR010025632 and PSR010025634 were both identified as constituitive probe sets. This implies that PSR010025633 is a cassette exon. Further, JUC0100236357 was not DABG in the liver samples. Therefore, a transcript which contains PSR010025632 and PSR010025633 simultaneously was not present in the liver samples. In the muscle samples, some transcripts include PSR010025633 while other transcripts exclude this probe set. Supplementary Fig. 14 shows the transcript composition of TC0102569 confirming the candidacy of probe set PSR010025633.
Alternative Splicing With At Least One Supporting Junction. Probe set PSR080007308 is part of the TC0800969 transcript annotated to the ASPH gene. It has a 5′ end, a 3′ end and three exclusion junctions. The junction architecture is shown in Supplementary Fig. 15. Figure 7 illustrates the probe set and its annotated junctions. In the upper row, 3′ end junction JUC0800063802 reflects the pattern of its anchor points PSR080007308 and PSR080007307. The junction supports a linkage between the two probe sets. However, in the middle row, the 5′ end junction JUC0800064097 shows probe is depleted for all samples with a DABG p-values < 0.05. A possible reason could be the pattern of its other anchor point. However, no 3′ end annotation was found for this junction. The junction was annotated as an exclusion junction for probe sets PSR080007309 and PSR080007310. The consistent presence of either of these probe sets can disrupt the junction sequence leading to its depletion. Probe set PSR080007309 was excluded for all samples but the values of PSR080007310 were indicative of an inclusion in at least a transcript isoform.
The observed pattern of the junction might also arise from the design of its probes. The designed sequences of the probes of the 5′ end junction JUC0800064097 and probe set PSR080007308 are shown in Supplementary  Fig. 16. Both probe sets show overlapping sequences. In contrary, the whole sequence of probe set PSR080007308 has a different beginning in the ensemble genome browser as presented in Fig. 8.
The sequence of PSR080007308 starts with the base pairs "TT" which is not part of the sequence of the designed junction. The sequence of the junction in the whole genome actually matches with an intron adjacent to PSR080007308. Consequently, it will not be present in the cDNA sequence and thus cannot be measured with a microarray. It is difficult to assess whether the pattern of the junction is due to this apparent annotation to an intron or not.
The exclusion junctions JUC0800064160 and JUC0800064078 do not have two anchor probe sets, but they had low expression levels. The exclusion junction JUC0800063922 also had low expression levels for all samples. This is supported by the DABG values which confirms the presence of probe set PSR080007308. The transcript composition of PSR080007308 is shown in Supplementary Fig. 17 and shows a cassette exon event for PSR080007308. Since the probe set is supported by its 3′ end junction and one exclusion junction, it could be considered to be alternatively spliced. Section 3. 3 Supplementary  Figs 18 and 19.

of the Supplementary Material shows an additional examples in
Alternative Splicing Without Supporting Junctions. In this section we present exon probe set PSR010016411 which was not supported by its annotated exclusion junction to be alternatively spliced This example belongs to category d in Fig. 5.
Probe set PSR010016411 is part of transcript TC0101665 referring to the KMO gene. Figure 9 presents the probe set, the exclusion junction JUC0100146697 and the anchor probe sets of the junction: PSR010016408 and PSR010016412. The junction is not DABG indicating that PSR010016408 and PSR010016412 are not neighbouring probe sets in the isoform composition in the tissues. The sequence can be interrupted by the presence of either PSR010016408, PSR010016409 or PSR010016410. It is certain that at least one of these probe sets interrupts the sequence but without further junction annotation, it is unknown which probe set is continuously included in the isoforms.
The transcript composition is shown in Supplementary Fig. 20. The ensemble data base indicates that only the four isoforms containing PSR010016408 are protein coding. Therefore, PSR010016408 can be consid- Validation of the REIDS model. This section assess the performance of the REIDS model on a list of 373 exons that were identified by the RASA method and validated as alternatively spliced between liver and muscle tissues by RT-PCR 23 . A total of 49,014 probe sets were identified as alternatively spliced by the RASA method. As shown in Fig. 5, REIDS identifies 17,700 candidacies for alternative splicing in the first step of the analytical framework. 8,626 out of the 17,700 probe sets supported by all their annotated junctions while 9,946 probe sets were supported by at least one of their annotated junctions. The RASA and REIDS model shared 8,273 probe sets in common after step one but this reduced to 4,973 and 5,736 probe sets depending on whether the probe sets were supported by all their annotated junction or by at least one of their annotated junctions. Table 1 shows the comparison of the REIDS analytical framework with AltAnalyze and iGEMS based on the list of probe sets identified by RASA and validated by either RNA sequencing or RT-PCR.
The REIDS model identified 287 (77%) probe sets in common with the list without using junction probe sets, 210 (56%) were supported by all their annotated junctions and 244 (65%) were supported by at least one of their annotated junctions. This shows that the REIDS analytical framework detected between 65-77% of the validated exon probe sets. The true positive rates for AltAnalyze and iGEMS were 27% and 62%, respectively. Supplementary Section 4 illustrates differences in identifications between REIDS and RASA on the HJAY microarray platform.
Comparison with Other Software. The REIDS analytical framework was further compared with the Affymetrix TAC tool and AltAnalyze software using a more recent HTA-2.0 platform. The data set contains A549 lung adenocarcinoma cell lines which were treated with either scrambled RNA (SCR) or transfected with a siRNA that targets SRSF1. The TAC tool identifies alternatively spliced exons based on a splicing index (SI) 12 while AltAnalyze relies on the ASPIRE algorithm 24 . The Venn diagram in Fig. 10 compares the number of probe sets identified by the three methods.
All probe sets identified by REIDS were also identified by TAC and AltAnalyze. TAC identifies the largest number of probe sets but most of the probe sets had little variation in their intensities with less than 50% signal (exon scores <0.5) between arrays or biological samples. This means that the TAC method is prone to identifying probe sets which do not show variation across the conditions and it is therefore likely to be more prone to false positives. Table 2 shows the top 10 probe sets identified by the REIDS method. Section 3.8 in the Supplementary Material shows similar tables for the top 10 probe set obtained by AltAnalyze and TAC.
An example of a probe set that was highly ranked by all the three methods is probe set PSR12000150 of transcript cluster TC12000010 (WNK1 gene). The probe set has two annotated 5′ end, one 3′ end and two exclusion junctions. Figure 11 illustrates the isoform transcriptions of TC12000010. Black probe sets were identified as constituitive while coloured probe sets were identified as alternatively spliced. Green colour indicates an enrichment of a probe set and red colour denotes depletion. Junction which were DABG are shown in blue while depleted junctions are again shown in red. PSR17017166 was identified as a cassette exon which was enriched in the SCR samples (upper panel) and depleted in the siRNA samples (lower panel). Supplementary Section 3.7 shows the junction architecture and expression levels for probe set PSR12000150. Probe set PSR01003418 was ranked higher by REIDS (32) compared to AltAnalyze (6,119) and TAC (15,074). The probe set was part of transcript cluster TC01000205 (SZRD1 gene) and was identified as an complex event. Figure 12 illustrates the composition of transcript cluster TC01000205. The isoform which has probe sets PSR01003417 and PSR01003417 as neighbouring probe sets seems more prominent in the siRNA samples.
Finally, we present probe set PSR17017175 of transcript cluster TC17001298 (SPAG5 gene) which was ranked higher by TAC (187) and AltAnalyze (178) than by the REIDS method (4545). Figure 13 shows the gene model. PSR17017175 was not one of the top ranked probe sets by the REIDS method since the probe set levels reflect the gene expression levels as shown in Supplementary Section 3.7.

Discussion
The identification of alternative splicing events is important drug targeting. Numerous studies have shown that aberrant splice variants are involved in cancer development, neurodegenerative diseases, autosomal recessive diseases and more. Therefore, the identification of alternatively spliced probe set regions is an important step forward. The HTA microarray platform probe sets target exons as well as exon-exon junctions. An alternatively used probe set region supported by at least one of its linking junctions is less likely to be false positive compared to a probe set without junction support.   We proposed an extended REIDS analytical framework to incorporate information from junction probe sets. The REIDS analytical frame work outperforms AltAnalyze and iGEMS based on validated alternatively splcied exons. Although, only 65-77% of the validated alternatively spliced exons were detected by the REIDS method, this is partly because of the exon score filtering step. This filtering step assumed that transcript clusters with low variation between exons are not likely to be alternatively spliced. The REIDS method was further compared with  TAC and AltAnalyze based on a more recent HTA-2.0 microarray platform. The REIDS method performs as good AltAnalyze, and is less prone to false positives than TAC.
The presence of exon-exon junctions provides valuable information on reducing the false positive rates in the detection of alternative spliced events using a microarray platform. However, caution is needed. In some cases behaviour of junction probe sets cannot be wholly captured by a test statistic or numeric metric. A careful inspection of the different junctions and their annotated transcript could provide useful information on the composition of the transcript isoforms. This means in addition to the numerical algorithms, a careful qualitative inspection of the different junctions and annotations is important in minimising false positive rates. In conclusion, the REIDS method incorporates exon-exon junctions to robustly identified alternatively spliced events. We have shown how junction information can be used to make a more adequate decision about alternatively spliced events. Like any other numerical method, the REIDS analytical framework has its drawback because of the pre-specified threshold required for exon score. A careful consideration of an appropriate threshold should be considered before applying the proposed REIDS analytical framework 33 .