Introduction

Alternative splicing regulates developmental stage-specific and tissue-specific gene expression and markedly expands proteome diversity with a limited number of genes. High-throughput sequencing of total mRNAs expressed in cells has revealed that 98% or more of multiexon genes are alternatively spliced,1 with an average of seven alternative splicing events per multiexon gene.2 Alternative splicing is achieved by exonic/intronic splicing enhancers/silencers (ESE, ISE, ESS, ISS) in combination with spatial and temporal expression of trans-acting splicing factors, such as serine/arginine-rich (SR) proteins and heterogeneous nuclear ribonucleoproteins.3, 4 Aberrations of alternative splicing are mediated by either mutations disrupting splicing cis-elements or dysregulation of splicing trans-factors.5, 6

Myotonic dystrophy is an autosomal dominant multisystem disorder affecting the skeletal muscles, eye, heart, endocrine system and central nervous system. The clinical symptoms include muscle weakness and wasting, myotonia, cataract, insulin resistance, hypogonadism, cardiac conduction defects, frontal balding and intellectual disabilities.7 Myotonic dystrophy is caused by abnormally expanded CTG repeats in the 3′ untranslated region of the DMPK gene encoding the dystrophia myotonica protein kinase on chromosome 19q13 (myotonic dystrophy type 1, DM1)8, 9, 10 or by abnormally expanded CCTG repeats in intron 1 of the ZNF9 gene encoding the zinc finger protein 9 on chromosome 3q21 (myotonic dystrophy type 2, DM2).11 In DM1, normal individuals have 5–30 repeats; mildly affected individuals have 50–80 repeats; and severely affected individuals have 2000 or more repeats of CTG.12, 13 In DM2, the size of expanded repeats is extremely variable, ranging from 75 to 11 000 repeats, with a mean of 5000 CCTG repeats.11, 14

In DM1 and DM2, expanded CTG or CCTG repeats in the non-coding regions sequester a splicing trans-factor, muscleblind, encoded by MBNL1, to intranuclear RNA foci harboring mutant RNA.15 In addition, in DM1 cells, another splicing trans-factor, CUG-binding protein, encoded by CUGBP1, is hyperphosphorylated by protein kinase C and is stabilized.16, 17, 18 Dysregulation of the two splicing trans-factors then causes aberrant splicing of their target genes. A total of 28 aberrantly spliced exons/introns of 22 genes have been identified to date in the skeletal and cardiac muscles in myotonic dystrophy (Table 1).

Table 1 A total of 28 aberrantly spliced exons and introns identified to date in skeletal and cardiac muscles in myotonic dystrophy

The Affymetrix GeneChip Human Exon 1.0 ST array contains ∼1.4 million probe sets comprised of ∼5.4 million probes. The exon array is designed to measure the expression level of each exon and to enable quantitative analysis of alternative splicing. In designing the exon array, probe selection region(s) (PSR) are placed within each exon throughout the genome. Each PSR has a unique probe set ID and carries four probes. A group of PSRs placed on a single exon is given a unique exon cluster ID. In most instances, each exon cluster ID, which represents an individual exon, carries a single PSR. A group of exon clusters spanning a single gene has a unique transcript cluster ID. The exon array thus carries 30–40 probes along the entire length of each gene. Data analysis of the exon array, however, is more complicated than that of the expression array because, in the HuEx1.0 ST exon array, (i) each probe set is comprised of only four probes, (ii) no probe has a corresponding mismatched probe and (iii) probes cannot always be optimally designed because the PSR spans only a short region.

In an effort to elucidate a diverse array of alternatively spliced genes in myotonic dystrophy, we performed exon array analysis with skeletal muscles of three DM1 patients and three normal controls. In the course of the analysis, we tested alternative splicing of 72 exons by RT-PCR and found that 27 were alternatively spliced, whereas 45 were not. We sought parameters that best discriminate true and false positives, and found that four parameters discriminate the true and false positives with a sensitivity of 77.8% and a specificity of 95.6%.

Materials and methods

Patient samples

Skeletal muscles were previously biopsied for diagnostic purposes. Clinical features of the patients are summarized in Supplementary Table S1. Two control muscles were biopsied muscle specimens that showed no pathological abnormalities. One control muscle RNA was the Human Skeletal Muscle PCR-Ready cDNA from Life Technologies (Carlsbad, CA, USA). All experiments were performed under the IRB approvals of the Nagoya University Graduate School of Medicine and the Aichi Medical University. The samples were used for the current studies after appropriate informed consent was obtained. High-molecular weight DNA was extracted by the conventional proteinase K and phenol chloroform method. We determined the CTG repeat numbers at the 3′ UTR of the DMPK gene by Southern blotting and found that skeletal muscles of patients 1 and 2 and white blood cells of patient 3 carried 3430, 4500 and 1500 CTG repeats, respectively. These results are consistent with the notion that skeletal muscles carry larger numbers of repeats than leukocytes.19

RNA preparation and array hybridization

Total RNA was extracted by the RNeasy Mini Kit (Qiagen, Hilden, Germany). We confirmed that the RNA integrity numbers were all above 7.0. Hybridization and signal acquisition of the HuEx1.0 ST exon array (Affymetrix, Santa Clara, CA, USA) were performed according to the manufacturer’s instructions. The signal intensities were normalized by the RMA method using the Expression Console 1.1 (Affymetrix).

Exclusion of genes with undependable signals using four criteria

Before we analyzed our exon array data, we excluded genes and probe sets with undependable signals using the following criteria. First, the gene must be comprised of four or more exons. Second, for a probe set to be analyzed, the smaller of its detection above background (DABG) P-values in controls and in DM1 muscles must be ⩽0.01. DABG is a detection metric generated by comparing perfectly matched probes to a distribution of background probes. Affymetrix expression arrays use a mismatched probe to measure the background signal for a specific probe, whereas Affymetrix exon arrays use shared background probes to estimate the background signals. As exon skipping results in low signals that give rise to high DABG P-values, we retained probe sets whose signals were unreliable in only one of the two groups (controls or DM1) and discarded only those that were unreliable in both. Third, a gene to be analyzed must carry three or more dependable probe sets with DABG P-value ⩽0.01, and such probe sets must comprise 15% or more of all the probe sets on the gene. Fourth, the average signal intensity of either the controls or DM1 must be no less than 150. Among the 336 293 exonic probe sets in our data set, 103 543 probe sets met these criteria.
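The four exclusion criteria can be sketched in a few lines of Python. This is only an illustrative sketch and not the VBA or Perl code used in the study: the `ProbeSet` record, its field names and the `exon_counts` mapping are hypothetical, and it assumes that one DABG P-value and one average signal intensity per group (controls, DM1) are already available for each probe set, and that the fourth criterion is applied per probe set.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ProbeSet:
    # Hypothetical record; field names are illustrative only.
    probe_set_id: str
    gene: str
    dabg_control: float    # DABG P-value summarized for controls (assumed)
    dabg_dm1: float        # DABG P-value summarized for DM1 (assumed)
    signal_control: float  # average signal intensity in controls
    signal_dm1: float      # average signal intensity in DM1


def is_dependable(ps, dabg_cutoff=0.01):
    """Criterion 2: the smaller of the two DABG P-values is <= cutoff."""
    return min(ps.dabg_control, ps.dabg_dm1) <= dabg_cutoff


def filter_probe_sets(probe_sets, exon_counts, min_exons=4,
                      min_dependable=3, min_fraction=0.15, min_signal=150.0):
    """Return the probe sets that satisfy all four criteria."""
    by_gene = defaultdict(list)
    for ps in probe_sets:
        by_gene[ps.gene].append(ps)

    kept = []
    for gene, members in by_gene.items():
        # Criterion 1: the gene has four or more exons.
        if exon_counts.get(gene, 0) < min_exons:
            continue
        dependable = [ps for ps in members if is_dependable(ps)]
        # Criterion 3: enough dependable probe sets, in number and fraction.
        if len(dependable) < min_dependable:
            continue
        if len(dependable) / len(members) < min_fraction:
            continue
        # Criterion 4: average signal of either group is no less than 150.
        kept.extend(ps for ps in dependable
                    if max(ps.signal_control, ps.signal_dm1) >= min_signal)
    return kept
```

Taking the minimum of the two DABG P-values keeps a probe set that is silent in one group only, which is the pattern expected for a skipped exon.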

Unique exon cluster IDs and unique transcript cluster IDs

In order to provide our unique exon cluster IDs and unique transcript cluster IDs, we analyzed annotations of the NCBI human gene database build 36.3 by writing and running Perl programs on the PrimePower HPC2500/Solaris 9 supercomputer (Fujitsu Ltd, Tokyo, Japan). We analyzed the exon array signals in Microsoft Excel using custom VBA programs. Partitioning of parameters to distinguish true and false positives was performed with the JMP statistical software Ver. 8.0.1 (SAS Institute, Cary, NC, USA) with its default settings.

Results

We provided our unique transcript cluster IDs and exon cluster IDs for the exon array based on the NCBI RefSeq database

We analyzed muscle specimens of three DM1 and three controls using the HuEx1.0 ST exon array. In the course of the analysis, we noticed that the exon array annotations provided by the manufacturer are based on comprehensive collation of several different gene databases and do not match any single annotation database. We thus exploited the NCBI RefSeq annotation and provided our unique exon cluster IDs (Supplementary data). We also provided our unique transcript cluster IDs because in the manufacturer’s annotations some exon clusters either upstream or downstream of the RefSeq-defined gene region are given the same transcript cluster ID as intragenic exon clusters. An example of the MBNL1 annotations is shown in Figure 1. Our data set was comprised of 336 293 probe sets that were grouped into 218 622 exonic clusters on 27 208 transcript clusters. For 1766 probe sets, we assigned duplicated exonic and transcript clusters, as two genes shared the same exonic regions. Our data set thus utilized 23.5% (336 293/1 432 144) of probe sets placed on the array.

Figure 1

Comparison of Affymetrix annotations and our annotations. Seven alternative transcripts of MBNL1 according to the NCBI Build 36.3 are drawn to individual exonic and intronic scales indicated at the top. The PSR bars represent the ‘core’ probe sets by Affymetrix. Each probe set is comprised of four probes (not shown). Each exon cluster corresponds to a single exon and carries one or more probe sets. A transcript cluster is comprised of exon clusters on the gene. Note that three 3′ exon clusters (asterisks) by Affymetrix are discordant with the genomic structure by NCBI: the Affymetrix exon cluster ID 207258 corresponds to exons 10 and 11 by NCBI, and we provided two different exon cluster IDs of 56123037 and 56123039. The exon cluster IDs 207259 and 207260 correspond to exon 12 by NCBI, and we provided a single exon cluster ID of 56123041. Our unique exon cluster IDs and transcript cluster IDs throughout the genome are in the Supplementary data. Arrows indicate aberrant exons 6 and 10 identified in the current studies, and daggers indicate the corresponding probe sets. Shaded areas represent non-coding regions.

The Affymetrix HuEx-1.0-st-v2 annotations release 32 carries 284 805 ‘core’ probe sets. The ‘core’ probe sets represent the RefSeq transcripts and the full-length GenBank mRNAs. The ‘core’ probe sets are grouped into 192 554 exonic clusters on 19 231 transcript clusters. When the Affymetrix ‘core’ annotations are compared with the NCBI Build 36.3 database that we utilized, 17 372 (6.1%) of the 284 805 ‘core’ probe sets are on non-exonic regions according to NCBI. Conversely, 69 278 (6.0%) of the 1 147 338 ‘non-core’ probe sets are on exonic regions according to NCBI.

Deviation values (DVs) of splice indices (SIs) provide essential information to distinguish true and false positives

According to the manufacturer’s suggestions, we first calculated the normalized intensity (NI) of each exon cluster by dividing the signal intensity of a given exon cluster by the sum of all the exonic signals throughout the gene. We next calculated the SI by dividing NI(DM1) by NI(control).20 We also calculated t-test P-values of SIs between three controls and three patients.
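The NI and SI calculations described above can be illustrated with a minimal Python sketch. The function names and the toy signal values are hypothetical, and the sketch assumes that each group (controls, DM1) has already been summarized to one linear-scale signal per exon cluster of a single gene.

```python
def normalized_intensities(exon_signals):
    """Divide each exon cluster signal by the sum of all exonic signals."""
    total = sum(exon_signals.values())
    return {exon: signal / total for exon, signal in exon_signals.items()}


def splice_indices(control_signals, dm1_signals):
    """SI of each exon cluster: NI in DM1 divided by NI in controls."""
    ni_control = normalized_intensities(control_signals)
    ni_dm1 = normalized_intensities(dm1_signals)
    return {exon: ni_dm1[exon] / ni_control[exon] for exon in ni_control}


# Toy gene with three exon clusters; the signal of e2 is halved in DM1.
control = {"e1": 100.0, "e2": 100.0, "e3": 200.0}
dm1 = {"e1": 100.0, "e2": 50.0, "e3": 200.0}
si = splice_indices(control, dm1)
```

In this toy example the SI of e2 falls well below 1, whereas e1 and e3 share the same SI slightly above 1, because the loss of e2 signal lowers the gene-wide sum used for normalization.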

Validation by RT-PCR of ∼20 exons, however, revealed that only about a quarter of candidate exons were aberrantly spliced, whereas three quarters were not. In the course of analysis, we noticed that, in most cases, the SIs of the other exons of a truly positive gene were all close to 1.0 throughout the gene, whereas those of a falsely positive gene were variable from probe set to probe set (Figure 2). In order to quantify how much the normalized SI of a particular exon deviates from those of the other exons, we calculated the mean and standard deviation (s.d.) of SIs of the other exons. We then calculated the DV of the SI of the exon cluster of interest, that is, its distance from that mean expressed in units of that s.d.
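A DV of this kind can be sketched as follows. This is an assumption-laden illustration rather than the authors' implementation: the function name is hypothetical, the sample standard deviation (n − 1 denominator) is assumed, and the sign is retained so that a skipped exon yields a negative DV.

```python
from statistics import mean, stdev


def deviation_value(si_by_exon, exon):
    """Distance of one exon's SI from the mean SI of the other exons of the
    same gene, in units of the s.d. of those other exons."""
    others = [si for name, si in si_by_exon.items() if name != exon]
    if len(others) < 2:
        raise ValueError("at least two other exons are needed for an s.d.")
    spread = stdev(others)  # sample s.d.; an assumption of this sketch
    if spread == 0:
        raise ValueError("the other exons have identical SIs")
    return (si_by_exon[exon] - mean(others)) / spread


# Toy gene: three exons with SIs near 1.0 and one exon with a reduced SI.
toy_si = {"e1": 1.0, "e2": 1.1, "e3": 0.9, "e4": 0.4}
dv_e4 = deviation_value(toy_si, "e4")
```

Here the other exons have a mean SI of 1.0 and an s.d. of 0.1, so e4 lies about six s.d. below them. A threshold on the DV would presumably be applied to its magnitude, which is also an assumption.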

Figure 2

Falsely predicted skipping of ABCC5 exon 17, and correctly predicted skipping of UBE2D3 exon 7. (a) RT-PCR analysis of three DM1 muscles and three control muscles. (b) Splice index (an exonic signal divided by the sum of all the exonic signals throughout a gene) of each probe set on ABCC5 in DM1 is normalized to that in controls. The normalized splice indices in DM1 are variable from probe set to probe set, giving rise to the mean and s.d. of 1.00±0.20 (bold symbols). An arrow points to exon 17. The deviation value of probe set ID 2708314 on exon 17 is −3.7 s.d. (c) Normalized splice indices of UBE2D3 in DM1 are less variable compared with those of ABCC5, which gives rise to the mean and s.d. of 1.00±0.08 (bold symbols). An arrow points to exon 7. The deviation value of probe set ID 2780003 on exon 7 is −9.7 s.d.

We indeed found that SIs of the truly positive genes were all close to 1.0 and that alternative exons tend to give rise to high DVs, as explained in Figure 4a and the relevant statements below.

We analyzed 72 exons and identified 27 aberrant exons in DM1

To search for aberrantly spliced exons/introns in DM1, we arbitrarily set three thresholds of DV >3.0, SI >1.5 and t-test P-value <0.1. The three criteria were satisfied by 256 exons. Among these, we arbitrarily chose 72 exons. As we started our analysis without knowing which parameters could efficiently predict true positives, there were no strict objective criteria for choosing these exons. We, however, considered the following features when we chose candidate exons: (i) a conspicuous value in one or more of the three parameters; (ii) alternatively spliced exons annotated in the NCBI database; or (iii) a possible pathogenic gene that can be causally associated with DM1. We also avoided previously known aberrant splicing except for LDB3 exon 7 and MBNL1 exon 6, as the two exons were of special interest to us. RT-PCR analysis of the 72 exons revealed that 27 exons were indeed aberrantly spliced (Supplementary Figure S1; Table 2), whereas 45 exons were not (Supplementary Table S2), which gave rise to a positive predictive value of 27/72=37.5% (Figure 3b). Most aberrant fragments were observed in normal and disease controls to variable extents, and we defined splicing as ‘aberrant’ when the ratios of aberrant fragments in DM1 were higher than those in any normal controls. In 11 of the 27 exons, intensities of aberrant fragments in DM1 exceeded those of normal controls but not those of all disease controls, indicating that the splicing aberrations are not specific to DM1. In the remaining 16 ‘specific’ exons (asterisks in Supplementary Figure S1), the ratios of aberrant fragments in DM1 were higher than those in any disease controls. In particular, splicing aberrations in LDB3 exon 4 and TTN exon 45 were almost exclusively observed in DM1, and were ‘unique’ to DM1. To summarize, among the 27 aberrantly spliced exons that we identified in the current studies, 25 were novel, 16 were ‘specific’ to DM1 and 2 were ‘unique’ to DM1 (LDB3 exon 4 and TTN exon 45).

Table 2 A total of 27 aberrantly spliced exons in DM1 identified in the current studies
Figure 3

(a) Arbitrary criteria to search for aberrant splicing (left panel) and their results (right panel). (b) Four discriminative criteria indicated by JMP-IN (left panel) and their results (right panel).

In an effort to understand the rarity of ‘specific’ and ‘unique’ aberrant splicing, we examined the disease specificity of four previously reported aberrant splicing events in DM1.21 Limited availability of biopsied muscles prevented us from analyzing all the 28 previously reported splicing aberrations shown in Table 1. We found that aberrant splicing of PDLIM3 exon 5 was ‘specific’; that of CAPN3 exon 16 was ‘unique’; and those of GFPT1 exon 10 and NRAP exon 12 were observed in normal and disease controls (Supplementary Figure S2). Thus, some of the previously reported splicing aberrations in DM1 (Table 1) are likely to represent muscle degeneration and/or regeneration.22

Four parameters increased the sensitivity of exon array analysis

Using the 72 analyzed exons, we next asked which parameters were able to discriminate the true and false positives. We analyzed 10 parameters and found that DV was the most discriminative, followed by SI (Figure 4). Additionally, we observed significant differences in average signal intensities and in ratios of probe sets with DABG P-value=0.000. A DABG P-value is attached to the signal intensity of each probe set and represents the reliability of that signal intensity. The t-test P-values were lower in true positives, but without statistical significance.

Figure 4

(a) Parameters that differentiate the true and false positives. Values are normalized to those of false positives, and the mean and s.e. are indicated. The true and false positives are indicated by solid and gray symbols, respectively. For SI, an inverse of the SI value is taken when the signal is decreased in DM1. Four parameters exhibit statistical significance with the Student’s t-test. (b) Deviation values of 27 true and 45 false positives are plotted on a logarithmic scale. Gray lines indicate means and 95% confidence intervals.

Comparison of the true (Table 2) and false (Supplementary Table S2) positives using the recursive partitioning functionality of the JMP 8.0.1 statistical software indicated four thresholds. The partitioning functionality of JMP seeks the best splitting point of the best factor, Xi, among a group of factors that best discriminate the response Y. The 10 factors indicated in Figure 4a were analyzed to discriminate the true and false positives. First, the signal intensities of either controls or DM1 should be more than 270. Second, the ratio of probe sets with DABG P-value=0.000 either in controls or DM1 should be more than 0.05. Third and fourth, either DV should be more than 10.0 or SI should be more than 2.6. Application of the four thresholds excluded 6 out of 27 true positives and 43 out of 45 false positives, and gave rise to a sensitivity of 21/27=77.8% and a specificity of 43/45=95.6% (Figure 3b). If we exclude the threshold for the DV and include a threshold for the t-test P-value of less than 0.05, which is commonly used in the analysis of exon arrays, the sensitivity becomes as low as 15/27=55.6%, whereas the specificity becomes 45/45=100%. Thus, the inclusion of DVs in the analysis increases the sensitivity for aberrantly spliced exons by 22.2 percentage points, although the proportion of false positives that are retained increases by 4.4 percentage points.
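The thresholds and the resulting sensitivity and specificity can be written out as a short Python sketch. The function and parameter names are hypothetical. The sketch assumes that the signal and DABG thresholds must both be met together with either the DV or the SI threshold, that the DV is compared by magnitude, and that a positive SI below 1 is inverted before comparison, as described for SI in the legend of Figure 4.

```python
def fold_change(si):
    """Express a positive SI as a fold change >= 1, inverting it when the
    signal is decreased in DM1."""
    return si if si >= 1.0 else 1.0 / si


def passes_thresholds(signal_control, signal_dm1,
                      zero_dabg_ratio_control, zero_dabg_ratio_dm1, dv, si):
    """Apply the four thresholds indicated by recursive partitioning."""
    strong_signal = max(signal_control, signal_dm1) > 270
    reliable_signal = max(zero_dabg_ratio_control, zero_dabg_ratio_dm1) > 0.05
    deviant = abs(dv) > 10.0 or fold_change(si) > 2.6
    return strong_signal and reliable_signal and deviant


def sensitivity(true_pos, false_neg):
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    return true_neg / (true_neg + false_pos)


# Counts reported in the text for the 27 true and 45 false positives.
with_dv = (sensitivity(21, 6), specificity(43, 2))
without_dv = (sensitivity(15, 12), specificity(45, 0))
```

With these counts, `with_dv` corresponds to 77.8% and 95.6%, and `without_dv` to 55.6% and 100%, which reproduces the differences of 22.2 and 4.4 percentage points.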

Although the significance of the four parameters is demonstrated in our data set (Figure 4a), the thresholds are likely to be specific to our data set, and different thresholds need to be applied to different data sets. To test this, we analyzed four human exon array data sets, GSE21795 (ref. 23), GSE28672 (ref. 24), GSE24581 (ref. 25) and GSE21840 (ref. 26), in the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/). Each data set was comprised of a pair of groups of three to five samples each, and aberrant and alternative splicing events of a total of 23 exons were validated by RT-PCR in the original papers. Although SIs and DVs of the 23 exons were as high as 2.58±1.05 and 8.3±12.8 (mean and s.d.), respectively, 9 exons did not meet the criteria indicated in Figure 3b, which gave rise to a sensitivity of 14/23=60.9%. As no false positive results were documented in these papers, we could not calculate the specificity, and we could not estimate whether calculation of the four parameters indeed increases the sensitivity and specificity of the exon array analysis for these data sets.

Discussion

A total of 25 novel aberrantly spliced exons in DM1

We identified 25 novel aberrantly spliced exons in DM1. Among these, aberrant splicing events of LDB3 exon 4 and TTN exon 45 are ‘unique’ to DM1. Aberrations of the other 23 exons are observed in other muscle diseases to variable degrees. MBNL1 normally translocates from cytoplasm to nucleus in the postnatal period to induce adult-type splicing, and lack of muscleblind due to sequestration to RNA foci in myotonic dystrophy recapitulates fetal splicing patterns.21, 27 Downregulation of MBNL1 and upregulation of CUGBP1 are likely to occur in regenerating muscle fibers, and are likely to result in the altered splicing patterns that we observe in disease controls. Not all aberrantly spliced exons in DM1, however, are observed in disease controls. Pathological significance of aberrant splicing in disease controls thus remains to be elucidated.

Exon array analysis

In expression arrays, fold-changes and t-test P-values have been successfully employed to detect altered gene expression. On the other hand, these parameters are not sufficient to detect aberrant splicing in the exon array data. We thus sought additional parameters and found that four parameters are informative to discriminate true and false positives: (i) the DV, (ii) the normalized splice index, (iii) the signal intensity and (iv) the ratio of probe sets with DABG P-value=0.000. Application of these four parameters has enabled us to achieve a sensitivity of 77.8% and a specificity of 95.6%. On the other hand, the t-test P-values are not significantly lower in true positives. This suggests that the threshold of t-test P-value ⩽0.10 is likely to be sufficient to exclude a large number of false positives and that more stringent P-values would not help discriminate true and false positives.

In addition, our unique annotations of exon cluster IDs and transcript cluster IDs also make the DVs more dependable. This is because probe sets on rare transcripts or probe sets outside of the NCBI-defined gene region sometimes give rise to falsely strong signals with dependable DABG P-values. Inclusion of these probe sets increases the standard deviation and decreases the DV of the aberrantly spliced exon.
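The effect described above can be illustrated with toy numbers. The SI values below are invented for illustration only, and the sample standard deviation is an assumption of this sketch.

```python
from statistics import mean, stdev


def deviation_value(target_si, other_sis):
    """Distance of the target SI from the mean of the other SIs, in s.d."""
    return (target_si - mean(other_sis)) / stdev(other_sis)


skipped_exon_si = 0.4            # hypothetical aberrantly spliced exon
intragenic_sis = [1.0, 1.1, 0.9]  # hypothetical exons within the gene region
spurious_si = 2.0                # hypothetical probe set outside the gene

dv_clean = deviation_value(skipped_exon_si, intragenic_sis)
dv_noisy = deviation_value(skipped_exon_si, intragenic_sis + [spurious_si])
```

Adding the single spurious probe set widens the s.d. of the reference exons, so the magnitude of `dv_noisy` is much smaller than that of `dv_clean` even though the SI of the skipped exon is unchanged.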

Potential roles of novel aberrant splicing events in DM1

In this study, we identified 27 aberrant splicing events in DM1, of which 25 have not been reported previously. Among the 25 exons, aberrant splicing events of two exons were ‘uniquely’ observed in DM1: one is inclusion of the LIM domain binding 3 (LDB3) exon 4 and the other is inclusion of titin (TTN) exon 45. Interestingly, both encode structural proteins of muscle fiber.

LDB3, also known as Cypher/ZASP (Z-band alternatively spliced PDZ-motif protein), contains a PDZ domain at the N-terminus and one or three LIM domains at the C-terminus. LDB3 is localized to the Z-line and interacts with α-actinin 2 through its PDZ-domain and with protein kinase C via its C-terminal LIM domains.28 LDB3 is likely to have an essential role in supporting Z-line structure and muscle function during contraction.29 LDB3 has several isoforms. As inclusion of exon 4 is preferentially observed in the fetal heart,30 the aberrant inclusion of exon 4 in the skeletal muscles in DM1 might lead to dysfunction or morphological abnormalities of muscle fiber. Recently, phosphoglucomutase 1 (PGM1), an enzyme involved in glycolysis and gluconeogenesis, has been shown to bind to the domain encoded by exon 4 of LDB3. LDB3 mutations in exon 4 reduce the binding to PGM1 and cause dilated cardiomyopathy.31 On the other hand, the increased binding of PGM1 and LDB3 through aberrant inclusion of exon 4 might be involved in the pathogenesis of muscle atrophy, weakness and histological abnormalities in DM1.

TTN encodes the largest protein in mammals and the third most abundant protein in muscle.32 An N-terminal Z-disc region and a C-terminal M-line region bind to the Z-line and M-line of the sarcomere, respectively, so that a single molecule extends half the length of a sarcomere. Titin is critically important for myofibril elasticity and structural integrity. Its elastic segment lies in the I-band region and contains two elements in series with different properties: the tandem immunoglobulin (Ig) and PEVK domains.33 Different TTN isoforms contribute to differences in elasticity of different muscle types.34 As exon 45 is located at the tandem Ig domains, aberrant inclusion of exon 45 in DM1 might lead to defective myofibril assembly and function.