Targeted RNA-Seq profiling of splicing pattern in the DMD gene: exons are mostly constitutively spliced in human skeletal muscle

We have analysed the splicing pattern of the human Duchenne Muscular Dystrophy (DMD) NB transcript in normal skeletal muscle. To achieve depth of coverage required for the analysis of this lowly expressed gene in muscle, we designed a targeted RNA-Seq procedure that combines amplification of the full-length 11.3 kb DMD cDNA sequence and 454 sequencing technology. A high and uniform coverage of the cDNA sequence was obtained that allowed to draw up a reliable inventory of the physiological alternative splicing events in the muscular DMD transcript. In contrast to previous assumptions, we evidenced that most of the 79 DMD exons are constitutively spliced in skeletal muscle. Only a limited number of 12 alternative splicing events were identified, all present at a very low level. These include previously known exon skipping events but also newly described pseudoexon inclusions and alternative 3′ splice sites, of which one is the first functional NAGNAG splice site reported in the DMD gene. This study provides the first RNA-Seq-based reference of DMD splicing pattern in skeletal muscle and reports on an experimental procedure well suited to detect condition-specific differences in this low abundance transcript that may prove useful for diagnostic, research or RNA-based therapeutic applications.

We have analysed the splicing pattern of the human Duchenne Muscular Dystrophy (DMD) NB transcript in normal skeletal muscle. To achieve depth of coverage required for the analysis of this lowly expressed gene in muscle, we designed a targeted RNA-Seq procedure that combines amplification of the fulllength 11.3 kb DMD cDNA sequence and 454 sequencing technology. A high and uniform coverage of the cDNA sequence was obtained that allowed to draw up a reliable inventory of the physiological alternative splicing events in the muscular DMD transcript. In contrast to previous assumptions, we evidenced that most of the 79 DMD exons are constitutively spliced in skeletal muscle. Only a limited number of 12 alternative splicing events were identified, all present at a very low level. These include previously known exon skipping events but also newly described pseudoexon inclusions and alternative 3′ splice sites, of which one is the first functional NAGNAG splice site reported in the DMD gene. This study provides the first RNA-Seq-based reference of DMD splicing pattern in skeletal muscle and reports on an experimental procedure well suited to detect condition-specific differences in this low abundance transcript that may prove useful for diagnostic, research or RNA-based therapeutic applications.
Alternative splicing (AS) is a key mechanism for generating tissue or developmental stage-specific proteomic diversity in eukaryotes 1,2 . Muscle was one of the first tissues in which AS was widely observed in particular in contractile protein genes 3 . Recent global analyses of splicing programs have confirmed that skeletal muscle is among the tissues showing the largest number of tissue-specific AS events (ASEs) 4 .
The DMD gene encodes dystrophin, a cytoskeletal protein of 427 kDa accounting for only approximately 0.002% of the total striated muscle protein content but playing an essential role in muscle fiber integrity and function. Loss-of-function mutations cause Duchenne muscular dystrophy (DMD), the most common and severe form of progressive muscular dystrophy in children 5 . The DMD transcript can be categorized among the low abundance transcripts relative to other muscle transcripts. In adult skeletal muscle tissues, its concentration is estimated to be 5-10 molecules per nucleus as compared to 25,000-50,000 copies per nucleus for the highly abundant muscle transcript encoding the myosin heavy chain 6 . The DMD gene is remarkable by its length (2.2 Mb, the longest human gene) and genomic structure. About 99% of the gene is made of introns, some of them being very long introns exceeding 200 kb, while the translated coding sequence, which is fragmented into 79 exons, is only 11.3 kb. Seven independent tissue-specific promoters encode three full-length isoforms (including the Dp427m muscle one) and four N-terminally truncated proteins 5 .
There is no clear picture of the global splicing profile of the muscle transcript. Available data are not readily comparable because techniques of different sensitivity were used in previous studies that generally focused only on specific regions of the gene [7][8][9][10] . Considering the currently developed therapeutic approaches for DMD based on antisense oligonucleotide-mediated splicing modulation 11 , as well as the importance of splicing defects as a cause or a modifier of disease severity 12,13 , there is a clear need to elucidate the full splicing pattern of the DMD transcript. Massively parallel RNA sequencing (RNA-Seq) has become a powerful technology to explore the complexity of mammalian transcriptomes 14 . Besides allowing for the comparison of gene expression changes in response to cellular differentiation, environmental factors or disease conditions, RNA-Seq can be used to accurately identify novel isoforms, assess relative transcript abundances and detect alternative exon and splice site usage in tissues or cells 15 . However, this approach has not yet provided information for all genes uniformly. Indeed a large fraction of sequence reads in RNA-Seq experiments are used up by highly expressed transcripts, thereby lowering the ability to detect other transcripts present at low levels 16 . This limitation is particularly detrimental to splicing analysis, which requires basically more input data than for gene expression analysis since a read must include the AS region to count towards splicing analysis 17 . Thus detection of alternative splicing within low abundance genes remains challenging. In this study, we designed a suitable strategy that relies on a DMD-targeted RNA-Seq protocol and the 454 technology to allow accurate and comprehensive inventory of physiological ASEs of the DMD transcript in human skeletal muscle.

Results
Design of an optimized RNA-Seq protocol for human DMD transcript analysis. Before designing our experimental workflow, a preliminary analysis of RNA sequencing data of human skeletal muscle tissue available in the Illumina's Body Map (IBM) transcriptome project 18 had shown us that the read depth along the DMD transcript was globally insufficient (50X-200X) despite the huge amount of reads produced, to allow reliable detection of ASEs present at low level. This observation prompted us to set up a DMD-targeted RNA-Seq approach (Fig. 1) that combines the amplification of the full DMD coding sequence (11.3 kb) with the Roche 454 sequencing technology. Typically 454 reads are > 300 bp long and are likely to span more than two exon junctions that helps to identify non-canonical splicing events. To assess the performance of our procedure, a first comparison has been made between the data obtained from the analysis of one skeletal muscle sample using our targeted approach (average read length: 387 bp, 15.8 Mb) and the data from the IBM project (human skeletal muscle RNA GSM759515, 2 × 50 bp paired end, 8.2 Gb). Visualization of the aligned reads to the DMD Dp427m isoform showed differences in DMD mRNA sequencing depth and coverage (Fig. 2). With the total mRNA sequencing technology, the mean depth per base was 142 (+ /− 82 SD) reads and a marked decrease of coverage at the 5′ end of the transcript was noticed (about 1 kb is covered by less than 50 reads) due to a bias toward the identification of sequences from the 3′ end of DMD mRNA as previously described 19 . By using our DMD-targeted approach, a relatively uniform coverage was obtained with a mean depth of 1,363 (+ /− 302 SD) reads per base that would theoretically allow ASEs as low as 1% to be reliably detected (> 10 reads).
The Spliced Transcripts Alignment to a Reference software (STAR), specially designed for aligning long reads 20 , was then used to identify DMD canonical splice junctions as well as potential new ones from junction-spanning reads. Compared to TopHat2, a popular mapper suitable for processing relatively short reads, STAR performed better both in the total mRNA-Seq and our DMD-targeted mRNA-Seq datasets (Table 1) for the positioning of the 78 canonical junctions and the discovery of alternative splice junctions as illustrated with pseudoexon 1a, an already known case of pseudoexon insertion in the DMD transcript 21 .
Splicing pattern of DMD transcript in skeletal muscle tissue. Four independent skeletal muscle tissue samples from normal young men were used to establish the splicing pattern of the muscular DMD transcript. We applied our established DMD-targeted RNA-Seq protocol, and the splice junction reads obtained both for the 78 canonical junctions and novel splice junctions in the 4 samples are detailed in the Supplementary Table S1. Among the non-canonical splice junctions detected, we chose to consider only those consistently detected with a minimum of 5 read counts in at least 2 out of the 4 muscle samples analysed for further analyses. Strikingly only 12 ASEs could be identified (Table 2), which can be divided into three different categories: exon skipping (n = 5), cryptic exon inclusion (n = 3) and 3′ alternative splice site (ss) usage (n = 4). Most of them (9/12) are low level ASEs (≤ 1%) and the three most represented events (del71, del78, PE1a) do not exceed 3%. It is noteworthy that the majority of them (66%, 8/12) preserve an open reading frame. While some events have already been reported (del9, del71, del78, PE1a), the other ones are here described for the first time (Supplementary Table S2). The very low level of detected ASEs precluded their experimental validation by RT-PCR in most cases, but we were able to conduct these analyses for three of them (del71, del78, and PE1a). We performed either standard or fluorescent RT-PCR using primers that amplify both the included and the skipped isoforms, and analysed PCR products by gel electrophoresis or fragment analysis on a capillary sequencer, respectively. The three ASEs were successfully validated (Fig. 3), with a greater concordance between RNA-Seq and fragment analysis data (Table 3).  In addition to exon skipping events, the RNA-Seq analysis has revealed the usage of alternative splice sites located either at proximity of the natural ones or deep in the introns ( Table 2). The reliability of the bioinformatics analysis for these new sites was systematically verified by inspection of the raw sequencing reads. In particular for pseudoexons, we verified that the whole sequence of the pseudoexon could be retrieved from a single read. Erroneously annotated alternative splice sites were found at the exons 3/4 and exons 14/15 junctions due to the presence of homopolymers and/or sequence homologies that led to misalignment (Supplementary Table S1). These alternative junctions were thus excluded from our analysis. With the exception of the PE1a, the 6 other reported events (2 pseudoexons, and 4 alternative 3′ ss) are newly described in this study ( Table 2). The strenght of these alternative splice sites was evaluated by maximum entropy scores (MaxEntScan, MES) and the Human Splicing Finder (HSF) program (Fig. 3).
All three pseudoexons displayed high 5′ and 3′ splice site scores, thus supporting their use to some extent in the DMD transcripts. Regarding the detected 3′ alternative ss, all are exonic and their calculated scores were lower than the scores of the adjacent natural splices in all cases except for the exon 76 + 59 alternative 3′ ss, which exhibits scores (HSF: 89.39 and MES: 9.59) significantly higher than the natural 3′ ss of exon 76 (HSF: 69.15 and MES: 3.81) (Fig. 3). When used, a new transcript is produced carrying an in-frame deletion of 20 amino acids (Pro3600 to Gln3619) in the C-terminal domain of dystrophin that is not predicted to alter a protein-protein interaction domain. The RNA-Seq analysis also disclosed splicing at a previously undescribed NAGNAG tandem site in DMD exon 54 that deletes one amino acid (p.Glu2625). This novel junction was reliably detected in all four analysed skeletal muscle samples (11-28 junction reads).
Lastly, we have drawn up a list of all detected exon skipping events but that failed to reach the minimum criteria set (i.e. a minimum of 5 reads in 2 out of the 4 samples analysed) (Supplementary Table S1). Interestingly we noticed that half of them (13/25, 52%) are located in the contiguous block of the 20 in-frame exons, encoding portions of the central rod domain of dystrophin, and the most represented ones (del25, del28, del28 + 29, del38) match with exons frequently involved in nonsense-induced exon skipping events that contribute to modify disease severity in patients. This observation further supports the idea that these exons have peculiar features regarding splicing, which may make them much more prone to exon skipping when mutated in patients 22 .

Discussion
Over half of the multi-exonic human genes are believed to have splice variants 23 . Due to its large number of exons, the DMD gene was a priori a good candidate for the occurrence of multiple alternative splicing events. A literature survey supports this common assumption with many reports describing alternative splicing events in the DMD gene since its discovery in 1986 (Supplementary Table S2). However these data derive from studies that are very heterogeneous in terms of the techniques used (northern blot, PCR, nested PCR), specificity of the transcript isoform analysed (Dp427m and/or other isoforms), biological samples analysed (human versus animal tissues, skeletal muscle versus other tissues or lymphocytes, patients versus normal controls), and gene regions explored (most studies focused on specific regions), that makes qualitative and quantitative comparisons between these different datasets difficult. A comprehensive and unbiased inventory of the DMD splicing events in skeletal muscle was lacking.
The recently developed deep-sequencing technologies allows a far more precise quantification of transcript levels and their isoforms than other methods 14 . In this study, a targeted RNA-Seq procedure was chosen to increase the sequencing depth and therefore to allow reliable detection of alternative splice junctions in the DMD gene. The deep RNA sequencing strategy was reasonably expected to expand the repertoire of ASEs in the DMD gene. Yet surprisingly, the vast majority of the 79 DMD exons were found to be constitutively spliced in
skeletal muscle under physiological conditions. Hence, our results would indicate that the frequency of ASEs previously reported in the DMD muscular transcript may have been overestimated, possibly due to methodological considerations. Exon-skipping was the most common form of the ASEs detected (5/12, 41.7%). Notwithstanding the large number of exons in the DMD gene, skipping of only a single or a limited number of exons were characterized (Table 2 and Supplemental Table 1). Skipping of long stretches of exons were not found in normal human skeletal muscle contrary to what can be observed in pathological condition, where multiple exon skipping rearrangements have been reported to occur, notably in revertant fibers to restore dystrophin expression in patients 24 or in the mdx model 25 . Though few in number, the identity of the identified ASEs in this study provides strong support for the reliability of our RNA-Seq analysis. Indeed, several of them correspond to previously described ASEs occurring in a tissue-specific or developmental stage-specific manner in the full-length muscular transcript or in other isoforms 9,26,27 . In particular, exon 71 and exon 78 are alternatively spliced in Dp71, one of the shortest dystrophin isoform and the major DMD gene product in many non-muscle tissues. DMD transcripts alternatively spliced at the 3′ end encode functionally distinct protein isoforms that are likely to remodel protein-protein interaction networks 28 in particular with components of the dystrophin-glycoprotein complex (DGC). The in-frame skipping of exon 71 results in loss of the syntrophin-binding site from the protein 29 , while the absence of exon 78 causes a frameshift that replaces the 13 C-terminal dystrophin amino acids residues with 31 new ones defining a protein with a novel hydrophobic carboxy terminus 30 . Exon 78 is spliced out in the embryonic muscle dystrophin isoform. This developmentally regulated alternative splicing is highly conserved throughout vertebrates, arguing for a critical functional role of this longer C-terminal protein domain during the embryonic stage of development. By contrast, dystrophin exon 78 is required for muscle structure maintenance in adult skeletal muscle. Its abnormal exclusion in patients suffering from myotonic dystrophy (DM) due to depletion of the MBNL1 splicing factor likely contributes to the progressive dystrophic process in DM type1 patients 31 . In accordance with these data, exon 78 was found to be included in about 98% of the muscular transcripts in our RNA-Seq experiments as was exon 71. The traces of exon skipping detected in muscle may reflect the complexity of the mechanisms controlling the inclusion of these two exons in a tissue-specific or developmental stage-specific manner 9,32 . It is noteworthy that DMD exon 71 and exon 78 are very short exons (39 and 32 bp, respectively) that share some characteristics with the recently described microexons in neurons, a set of highly conserved short exons, which are strongly regulated by RNA-binding proteins (RBPs) and functionally modulate tissue-dependent protein-protein networks 33,34 .
The RNA-Seq analysis has revealed the usage of alternative splice sites located either in the close vicinity of the natural sites or deep in the introns (Table 2), which are all newly described except pseudoexon 1a. The pseudoexon 1a originating from intron 1 was initially identified in lymphocytes 21 , where it is included in more than 50% of the illegitimate DMD transcripts. The inclusion of this out-of-frame extra-exon may represent a post-transcriptional control in cells that normally do not express the dystrophin protein (like lymphocytes), while a low inclusion rate is observed in skeletal muscle transcripts (3%). Due to the size of the DMD introns, multistep splicing events (recursive or nested splicing) are frequently used to splice out a single intron 35 . Interestingly, the 5′ and 3′ genomic coordinates of the identified pseudoexons in our study (1a, 21X, 51X) are similar to 6 out of the 145 biocomputational predicted positions associated with either 5′ or 3′ recursive splicing of multi-step intron removal recently identified by capture-pre-mRNA-seq of intermediately spliced dystrophin transcript 35 . Whether these 6 positions represent true motifs required for 5′ and 3′ recursive splicing remains to be experimentally demonstrated, but our findings show that they can be used in combination for the inclusion of pseudoexons in the mature transcripts.
As previously reported in other human genes 36 , we observed that alternative acceptors were the second most common (4/12, 33%) type of AS detected in DMD after exon skipping. All are exonic alternative 3′ splice sites that induce partial exon deletion when used, and the major one involves a NAGNAG tandem site in DMD exon 54. NAGNAG motif occurs in 30% of human genes and appears to be functional in at least 5% of human genes 37,38 . Their use results in the production of the two distinct isoforms distinguishable by three nucleotides (NAG). These subtle changes may nonetheless be of functional relevance by changing local hydrophobicity and charge, varying the distances between relevant sites in proteins or changing recognition sequences for post-translational modifications. This is the first functional NAGNAG motif identified in the DMD gene. Splicing at this site deletes the polar amino acid glutamine at position 2625 of dystrophin repeat 21 which forms part of the region involved in specific protein/lipid interactions that favours homogeneous dystrophin distribution along the membrane 39 . Inspection of the 77 remaining 3′ ss did not reveal any other NAGNAG tandem acceptor motifs in the DMD gene.
In conclusion, this study provides the first RNA-Seq-based reference of DMD splicing pattern in normal muscle. The strategy used allows the analysis of the whole 11.3 kb-cDNA sequence all at once that may prove useful for mutational analysis or monitoring the impact of splicing interventions on transcript structure in patients with Duchenne muscular dystrophy. Undesirable alternative splicing events may impact the outcome of exon skipping or trans-splicing approaches. Unlike other isoforms 26,27,40 , expression of the DMD gene is tightly regulated in skeletal muscle to preserve the production of a full-length transcript and a 427 kDa dystrophin protein, even if some protein domains are considered to be functionally dispensable 41 . The DMD gene represents a paradigm for extreme splicing conditions that illustrates perfectly how accurate the process of splice site selection shoud be to include the 79 exons into the mature transcript concomitantly with the repression of near perfect pseudoexons or cryptic 3′ ss activation. A large piece of work remains to be done to elucidate the splicing regulation in this huge gene, which will contribute to reveal new insights into basic splicing mechanisms. RNA extraction and cDNA synthesis. Total RNA was extracted from biopsies using the SV total RNA extraction kit (Promega) and RNA integrity was checked using the Agilent RNA 6000 Nano kit with Agilent 2100 bioanalyzer (Agilent Technologies). 700 ng of total RNA were used to synthesize template cDNA by reverse transcription with the Superscript ® II (Thermo Scientific) and oligodT primers for RNA-Seq experiments.
Long Range (LR)-PCR. One fifth of the RT product was used as template to amplify the full-length DMD cDNA as a single long fragment (11.3 kb) by using the GoTaq ® Long PCR Master Mix (Promega) and primers located in exon 1 of the muscle isoform (Dp427m) (5′ -CTTTCCCCCTACAGGACTCAG-3′ ) and in the 3′ UTR (5′ -CCAAATCATCTGCCATGTGG-3′ ). Cycle parameters were programmed as 94 °C for 2 min, followed by the first 10 cycles of 93 °C for 15 s, 58 °C for 30 s, and 68 °C for 11 min and a 20 s auto-extension of elongation time for cycles 11-35. After verification of amplicon size by agarose gel electrophoresis, LR-PCR reactions were purified using Qiaquick PCR purification kit (Qiagen) and quantified using the NanoDrop 2000 spectrophotometer (Thermo Scientific).

Construction of 454 libraries and sequencing.
For each library, around 1 μ g of purified PCR product were fragmented by nebulization (2.1 bars of nitrogen for 1 min) and subjected to library preparation with the Rapid Library Preparation Kit (Roche) according to manufacturer's instructions. The different samples were labeled via ligation of Multiplex Identifiers (MID) oligonucleotide adaptors to allow multiplexing. Three MID-containing libraries were quantified using the Qubit ® Fluorometer and pooled in equimolar amounts into a single sample prior to emulsion PCR amplification and sequencing in parallel on the Roche GS Junior 454 sequencer giving a mean total number of 69,060 reads (387 bp mean length) per library. The data obtained were then sorted according to their tag sequences.
Bioinformatics. Raw reads were edited and filtered prior to analysis. A dedicated analysis pipeline was developed using the Galaxy framework (http://galaxyproject.org) 42 . First, relevant adapter sequences were removed with Cutadapt (v.1.3, default parameters), and quality-based trimming at the 3′ ends of reads was performed using the Qtrim tool (v.1.1, parameters: mean quality = 25, window size = 20, minimum read length = 40 nt). Cleaned reads were then mapped to the Human X chromosome reference sequence (hg19, UCSC) with STAR (v. 2.3, annotations from UCSC, parameters: -sjdbOverhang 29 -outFilterMismatchNoverLmax 0.05 -outSJfilterReads Unique, outFilterMultimapNmax 1) 20 . A script developed in-house was developed to annotate identified splice junctions, which can be obtained upon request. The output data is processed to obtain DMD mRNA coverage, as well as a list of identified splice junctions and their counts. Only new junctions covered by a minimum of 5 reads in at least two out of the four biological replicates were considered for further analysis. The Percent-Spliced-In (PSI) was calculated for each splicing event using intron-centric metrics 43 , with the SJPIPE pipeline, from the ipsa package (parameters: margin = 0, deltaSS = 0, mincount = 0, https://github.com/pervouchine/ipsa). For the comparison of the DMD-targeted versus publicly available total mRNA sequencing data from the Illumina Human Body Map project 2.0 (skeletal muscle tissue, 2 × 50 bp, GEO sample GSM759515), raw reads were mapped to the Human X chromosome sequence with Bowtie2 (v2.default parameters) 44 and exon junctions detection and count was performed by TopHat2 45 (v2.0.2, Gene Model Annotations option: chrX_GTF downloaded from UCSC). Two different splice site prediction algorithms were used for computational scoring of 5′ and 3′ splice sites: the Human Splicing Finder tool (http://www.umd.be/HSF3/), which uses the weight matrix model 46  Experimental validation of Alternative Splicing Events. Independent RT experiments were performed by using random primers for experimental validation. 1.5 μ L of cDNA was used as template for PCR amplification in a 25 μ L total volume with the Taq DNA Polymerase (New Englands Biolabs) and primers hybridizing to upstream and downstream exons (sequences available upon request). For conventional RT-PCR, the 30 cycles-amplified products were separated on 1.8% agarose gels and spliced products were quantified with the Quantity One (v.4.6.9) software (Bio-Rad). For semi-quantitative PCR, fluorescein-labeled forward primers were used and 26 cycles were done. Capillary electrophoresis analysis was performed using 1 μ l of PCR product added to 18 μ l of formamide and 0.5 μ l of ROX 400 HD fluorescent size standards (Applied Biosystems). Amplified products were separated on an ABI 3130 XL DNA analyzer and the peaks areas were measured by the GeneMapper v4.0 software. Ratios of splicing isoforms were determined as the peak area of one specific isoform divided by the total peak areas for the two or three detected isoforms. Data represent the mean + /− SD of at least three independent assays.