It is generally believed that splicing removes introns as single units from precursor messenger RNA transcripts. However, some long Drosophila melanogaster introns contain a cryptic site, known as a recursive splice site (RS-site), that enables a multi-step process of intron removal termed recursive splicing (refs 1, 2). The extent to which recursive splicing occurs in other species and its mechanistic basis have not been examined. Here we identify highly conserved RS-sites in genes expressed in the mammalian brain that encode proteins functioning in neuronal development. Moreover, the RS-sites are found in some of the longest introns across vertebrates. We find that vertebrate recursive splicing requires initial definition of an ‘RS-exon’ that follows the RS-site. The RS-exon is then excluded from the dominant mRNA isoform owing to competition with a reconstituted 5′ splice site formed at the RS-site after the first splicing step. Conversely, the RS-exon is included when preceded by cryptic promoters or exons that fail to reconstitute an efficient 5′ splice site. Most RS-exons contain a premature stop codon such that their inclusion can decrease mRNA stability. Thus, by establishing a binary splicing switch, RS-sites demarcate different mRNA isoforms emerging from long genes by coupling cryptic elements with inclusion of RS-exons.
References
- Subdivision of large introns in Drosophila by recursive splicing at nonexonic elements. Genetics 170, 661–674 (2005)
- Generation of alternative Ultrabithorax isoforms and stepwise removal of a large intron by resplicing at exon–exon junctions. Mol. Cell 2, 787–796 (1998)
- An apparent pseudo-exon acts both as an alternative exon that leads to nonsense-mediated decay and as a zero-length exon. Mol. Cell. Biol. 26, 2237–2246 (2006)
- The peculiarities of large intron splicing in animals. PLoS ONE 4, e7853 (2009)
- Gene regulation and priming by topoisomerase IIα in embryonic stem cells. Nature Commun. 4, 2478 (2013)
- Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain. Nature Struct. Mol. Biol. 18, 1435–1440 (2011)
- Widespread binding of FUS along nascent RNA regulates alternative splicing in the brain. Sci. Rep. 2, 603 (2012)
- Context-dependent splicing regulation: exon definition, co-occurring motif pairs and tissue specificity. RNA Biol. 8, 384–388 (2011)
- Exon definition may facilitate splice site selection in RNAs with multiple exons. Mol. Cell. Biol. 10, 84–94 (1990)
- Alternative splicing resulting in nonsense-mediated mRNA decay: what is the meaning of nonsense? Trends Biochem. Sci. 33, 385–393 (2008)
- Intrasplicing coordinates alternative first exons with alternative splicing in the protein 4.1R gene. EMBO J. 27, 122–131 (2008)
- Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J. Comp. Biol. 11, 377–394 (2004)
- Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431, 946–957 (2004)
- The effect of intron length on exon creation ratios during the evolution of mammalian genomes. RNA 14, 2261–2273 (2008)
- Noisy splicing drives mRNA isoform diversity in human cells. PLoS Genet. 6, e1001236 (2010)
- Divergent roles of ALS-linked proteins FUS/TLS and TDP-43 intersect in processing long pre-mRNAs. Nature Neurosci. 15, 1488–1497 (2012)
- Long pre-mRNA depletion and RNA missplicing contribute to neuronal vulnerability from loss of TDP-43. Nature Neurosci. 14, 459–468 (2011)
- Topoisomerases facilitate transcription of long genes linked to autism. Nature 501, 58–62 (2013)
- Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010)
- Quality control parameters on a large dataset of regionally dissected human control brains for whole genome expression studies. J. Neurochem. 119, 275–28 (2011)
- STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013)
- GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012)
- iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nature Struct. Mol. Biol. 17, 909–915 (2010)
- Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)
- Rates of in situ transcription and splicing in large human genes. Nature Struct. Mol. Biol. 16, 1128–1133 (2009)
- TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013)
- GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 10, 48 (2009)
- WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190 (2004)
- Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnol. 28, 511–515 (2010)
- Core promoter factor TAF9B regulates neuronal gene expression. eLife 3, e02559 (2014)
- Hydroxymethylation at gene regulatory regions directs stem/early progenitor cell commitment during erythropoiesis. Cell Rep. 6, 231–244 (2014)
Extended data figures and tables
Extended Data Figures
- Extended Data Figure 1: Long gene expression is enriched in the brain. (437 KB)
a, GO term analysis of genes >150 kb relative to all human genes. All GO terms are associated with enrichment scores >2. b, The log2-fold gene expression ratios following DESeq (ref. 19) analysis of all human protein-coding genes between the brain and all other tissues. Data are represented as Loess smoothing curves after sorting the genes by their maximum length in kilobases. Hashed vertical line indicates 150 kb gene length. RNA-seq data were obtained from the GTEx consortium. c, Individual scatterplots used to create Fig. 1b, representing DESeq (ref. 19) analysis of individual genes within indicated tissues compared to the brain. Red dots indicate genes that contain RS-sites, blue dots indicate dystrophin, and black dots indicate titin (two long genes most highly expressed in muscle tissues). Grey dots are all remaining genes. d, DESeq (ref. 19) analysis of individual gene expression after vs before differentiation of C2C12 mouse myoblasts (GSM521256) into myogenic lineage (GSM521259; ref. 29), after vs before differentiation of mouse embryonic stem cells (GSM1346027) into motor neurons (GSM1346035; ref. 30), or after vs before differentiation of haematopoietic stem cells (GSM992931) into erythroid lineage (GSM992934; ref. 31). Loess smoothing curves are shown after sorting the genes by their maximum length in kilobases. Hashed vertical line indicates 150 kb gene length.
- Extended Data Figure 2: Linear regression analysis and novel junction sequence considerations used to identify mammalian recursive splice sites. (364 KB)
a, Examples of RNA-seq read density patterns for three genes together with their calculated gradients across (1) the first intron >50 kb, and (2) the average across all other >50-kb introns within the same gene. RNA-seq reads are grouped in 5-kb windows and linear regression is performed on the resulting histograms, so each gradient represents the change in summed read count per 5 kb. b, Density plot of the ratio between the average gradient of all other >50-kb introns within the same gene and the gradient of the first intron >50 kb. Blue hashed line represents a ratio of 1, which would indicate that gradients for long introns within the same gene are comparable and that transcription is proceeding at a largely constant rate. c, Schematic of the bioinformatics pipeline used to identify novel junctions. d, Ranking of human 5′ splice site pentamer usage genome-wide. e, Nucleotide usage frequency at human 3′ splice sites genome-wide, and branch-point positioning relative to 3′ splice site genome-wide.
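The gradient calculation described in panels a and b can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names, the 0-based coordinates and the plain least-squares fit are all hypothetical choices made for the example.

```python
from collections import Counter

WINDOW = 5_000  # 5-kb windows, as stated in the legend


def binned_counts(read_positions, intron_start, intron_end, window=WINDOW):
    """Count reads per window across one intron (hypothetical 0-based coordinates)."""
    n_bins = (intron_end - intron_start + window - 1) // window
    counts = Counter(
        (pos - intron_start) // window
        for pos in read_positions
        if intron_start <= pos < intron_end
    )
    return [counts.get(i, 0) for i in range(n_bins)]


def gradient(counts):
    """Least-squares slope of read count versus window index (change per 5 kb)."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two windows")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    sxy = sum((i - mean_x) * (c - mean_y) for i, c in enumerate(counts))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx


def gradient_ratio(first_intron_counts, other_introns_counts):
    """Average gradient of the other long introns divided by the first-intron gradient."""
    others = [gradient(c) for c in other_introns_counts]
    return (sum(others) / len(others)) / gradient(first_intron_counts)
```

A ratio close to 1 from `gradient_ratio` corresponds to the blue hashed line in panel b.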
- Extended Data Figure 3: Inferred splicing patterns identify recursive splice sites within mammalian >150 kb intron genes. (540 KB)
a–g, RNA-seq (red) read density patterns and normalized FUS iCLIP (green) cross-link density patterns for the OPCML (a), ROBO2 (b), HS6ST3 (c), ANK3 (d), CADM2 (e), NCAM1 (f) and PDE4D (g) genes within human brains. RNA-seq reads and normalized FUS iCLIP cross-links are grouped in 5-kb windows. RefSeq introns >150 kb were searched for novel junctions and linear regression performed on all Ensembl introns >50 kb in which novel junctions were located. Gene isoforms displayed are those including introns within which significant junctions were identified. Red novel junctions represent significant improvements in goodness-of-fit in both RNA-seq and FUS regression analysis (P < 0.01 in both data sets, F-test). Blue novel junctions contact RS-exons. Grey novel junctions were not deemed significant following regression analysis. Zoomed area represents sequence at deep intronic loci surrounding novel junction. Phylo-P conservation track indicates sequence conservation across 46 levels of mammalian evolution.
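One way to picture the goodness-of-fit comparison is an F statistic that contrasts a single regression line across the intron with two independent lines split at a candidate junction. The sketch below assumes that nested-model form; the legend does not spell out the exact model, so the function names and the split-at-window design are hypothetical, and converting the statistic to a P value would additionally require the F distribution.

```python
def rss_of_line(ys):
    """Residual sum of squares of a least-squares line through consecutive windows."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two windows")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * i)) ** 2 for i, y in enumerate(ys))


def f_statistic_for_break(counts, break_index):
    """F statistic for adding a break at break_index (2 versus 4 fitted parameters)."""
    left, right = counts[:break_index], counts[break_index:]
    n = len(counts)
    if len(left) < 2 or len(right) < 2 or n <= 4:
        raise ValueError("each segment needs >= 2 windows and n must exceed 4")
    rss_single = rss_of_line(counts)
    rss_split = rss_of_line(left) + rss_of_line(right)
    if rss_split == 0:
        # Perfect split fit: the improvement is unbounded unless both fits are perfect.
        return float("inf") if rss_single > 0 else 0.0
    return ((rss_single - rss_split) / 2) / (rss_split / (n - 4))
```

A saw-tooth read density, in which counts jump back up at the candidate junction, gives a large statistic, whereas a uniformly declining density gives a small one.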
- Extended Data Figure 4: Inferred recursive splicing patterns in the OPCML gene across four separate brains. (505 KB)
a, RNA-seq read density patterns for the OPCML gene across 12 different regions of four separate brains. Gene isoform displayed is that which included the long first intron within which a significant novel junction was identified. RNA-seq reads are grouped in 5-kb windows. Dotted arrows indicate location of experimentally derived RS-site.
- Extended Data Figure 5: RT–PCR confirmation of RS-sites in human and zebrafish samples, and prediction of mouse RS-exons. (218 KB)
a, Schematic of primer design used for RT–PCR validation of novel junctions. b–g, RT–PCR analysis of CADM2 (b), HS6ST3 (c), ROBO2 (d), PDE4D_1_1 (e), PDE4D_1_2 (f) and PDE4D_2_2 (g) genes around RS-sites using indicated primers. For PDE4D sites, first number after gene name indicates RS-site studied, second number indicates the upstream exon used. See Extended Data Fig. 3g for junctions detected. h, RT–PCR analysis of cadm2a RS-site junction in adult male and female zebrafish embryos, together with an alignment of zebrafish (ZF) cadm2a RS-site to human (HS) CADM2 RS-site. i, Map of consensus splice site location and in-frame termination codons following RS-sites in indicated mouse genes. Strong consensus splice sites are GTAAG, GTGAG, GTAGG and GTATG. Weak consensus splice sites are GTAAA, GTAAT, GTGGG, GTAAC, GTCAG and GTACG.
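The map in panel i combines two simple sequence scans: locating 5′ splice site pentamers from the strong and weak lists above, and locating termination codons in frame with the upstream sequence. A minimal sketch, assuming a DNA string that starts immediately after the RS-site and a hypothetical `frame_offset` parameter giving the reading frame carried over from the upstream exon:

```python
STRONG = {"GTAAG", "GTGAG", "GTAGG", "GTATG"}
WEAK = {"GTAAA", "GTAAT", "GTGGG", "GTAAC", "GTCAG", "GTACG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_pentamer(pentamer):
    """Return 'strong', 'weak' or None using the pentamer lists from the legend."""
    pentamer = pentamer.upper()
    if pentamer in STRONG:
        return "strong"
    if pentamer in WEAK:
        return "weak"
    return None


def consensus_sites(seq):
    """List (position, pentamer, class) for every listed pentamer in seq."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 4):
        label = classify_pentamer(seq[i:i + 5])
        if label is not None:
            hits.append((i, seq[i:i + 5], label))
    return hits


def in_frame_stops(seq, frame_offset=0):
    """Positions of stop codons in the frame that starts at frame_offset (assumed)."""
    seq = seq.upper()
    return [
        i
        for i in range(frame_offset, len(seq) - 2, 3)
        if seq[i:i + 3] in STOP_CODONS
    ]
```

For example, in the made-up sequence `ATGCCCTAAGGTAAGTTT` the scan reports one strong site (`GTAAG` at position 10) and one in-frame stop (`TAA` at position 6).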
- Extended Data Figure 6: Conservation of inferred recursive splicing patterns in the mouse brain. (434 KB)
a–h, Normalized Fus iCLIP read density patterns for the Opcml (a), Robo2 (b), Hs6st3 (c), Ank3 (d), Cadm1 (e), Ncam1 (f), Cadm2 (g) and Pde4d (h) genes within the mouse brain. Normalized FUS iCLIP cross-link sites are grouped in 5-kb windows, and the displayed linear regression lines were computed on resulting histograms. Zoomed area at deep intronic loci represents RS-site sequences conserved from humans to mouse.
- Extended Data Figure 7: Promoter-dependent inclusion of RS-exons in CADM2 and NTM genes. (336 KB)
a, Number of cassette and constitutive exons starting with motif GURAG. b–d, RT–PCR of CADM2 gene in the frontal cortex using primers indicated in b or Fig. 4a. RT–PCR was carried out on one (b) or four (c, d) human brains. In c, the inclusion of the second RS-exon occurs together with the minor promoter. Two bands are present for both PCR reactions due to the presence of an alternatively spliced exon following the RS-exon. This can result in two distinct long or short isoforms. In d, the inclusion of the second RS-exon occurs when the first RS-exon is included. Schematics in c and d represent examined splicing products together with expected length of products. e, RNA-seq read density patterns for the NTM gene and expected human isoforms. RNA-seq reads are grouped in 5-kb windows and linear regression performed on resulting histograms. A cryptic minor promoter/exon detected by RNA-seq is indicated by vertical red line. The annotated RS-exon is indicated by the vertical blue line. Zoomed area represents RS-site sequence at start of the annotated RS-exon. Primers to assess the major and minor promoter products associated with the RS-exon are indicated by coloured arrows. f, RT–PCR of NTM gene around RS-exon using indicated primers. g, RT–PCR analysis of NTM products in which the upstream exon is either derived from the major upstream promoter or the cryptic upstream promoter/exon. RT–PCR was performed in the frontal cortex of three human brains using primer sets indicated by coloured arrows in e. Schematics represent possible splicing products together with expected length of products. Top panel assesses RS-exon inclusion, bottom panel assesses RS-site junction detection.
- Extended Data Figure 8: Recursive splicing regulates the alternative splicing of RS-exons. (190 KB)
a, QIAxcel analysis and quantification of the splicing intermediates of indicated CADM2 splicing reporter products following transfection in SH-SY5Y cells. Primers used are indicated by red arrows in schematic, together with expected products and their sizes. b, RT–PCR analysis of the zebrafish cadm2a mRNA after in vivo injection of AON-2. Sequencing reveals that RS-exon inclusion results in subsequent splicing to additional downstream cryptic elements before the second exon, explaining why the RS-exon-included product is larger than expected. c, qRT–PCR analysis of exon–exon junctions surrounding the RS-site-containing introns following AON-A1-mediated inhibition of RS-site use of the human CADM1 and ANK3 genes (n = 3, 1 experiment) or the zebrafish cadm2a gene (n = 7, 3 separate experiments). d, Splice site scores of reconstituted 5′ splice sites following first step of recursive splicing versus the 5′ splice sites of corresponding recursive exons.
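The comparison in panel d rests on the fact that joining the upstream exon to the sequence that follows the RS-site creates a new candidate 5′ splice site, which then competes with the 5′ splice site at the end of the RS-exon. The sketch below builds both 9-nucleotide sites (3 exonic plus 6 downstream nucleotides) and compares them with a toy score that simply counts matches to the consensus CAG|GTAAGT. That count is a deliberately crude stand-in for illustration, not the maximum entropy model cited in the reference list, and all names and sequences are hypothetical.

```python
CONSENSUS_5SS = "CAGGTAAGT"  # 3 exonic + 6 intronic positions


def five_prime_site(exonic, downstream):
    """9-mer formed by the last 3 nt of an exon and the next 6 nt downstream."""
    if len(exonic) < 3 or len(downstream) < 6:
        raise ValueError("need >= 3 exonic and >= 6 downstream nucleotides")
    return (exonic[-3:] + downstream[:6]).upper()


def consensus_matches(nine_mer):
    """Toy score: number of positions matching the consensus (not a MaxEnt score)."""
    return sum(a == b for a, b in zip(nine_mer.upper(), CONSENSUS_5SS))


def compare_sites(upstream_exon, after_rs_site, rs_exon, intron_after_rs_exon):
    """Return toy scores for the reconstituted site and the RS-exon 5' splice site."""
    reconstituted = five_prime_site(upstream_exon, after_rs_site)
    rs_exon_site = five_prime_site(rs_exon, intron_after_rs_exon)
    return consensus_matches(reconstituted), consensus_matches(rs_exon_site)
```

With an upstream exon ending in `CAG` and an RS-site followed by `GTAAGT`, the reconstituted site matches the consensus at all nine positions, which is the situation in which the RS-exon would be expected to lose the competition.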
- Extended Data Figure 9: Cryptic elements are frequent in long first introns. (205 KB)
a, UCSC annotated isoforms of the OPCML gene together with spliced expressed sequence tags (ESTs) detected across the OPCML locus. Recursive exon is marked in blue, and the preceding exons produced by minor promoter or cryptic splicing of the long first intron are marked in red. b, Lengths of the 9 introns containing the high-confidence RS-sites compared to other introns across vertebrates. Results are an extension of Fig. 4g. c, Boxplot showing the detected number of unannotated alternative start exons that form junctions with the dominant second exon of brain-expressed genes. Only novel junctions that do not match UCSC/GENCODE transcripts are considered for analysis. Genes are separated into bins based on the first intron length of the canonical isoform. Boxplot presents median, first and third quartile boundaries for each bin. Additional red diamonds indicate mean values for each bin. *P < 10^−10 (Mann–Whitney U test). Only tests between the 100 kb+ bin and other bins are shown. Right panel shows cartoon of the implications of boxplot results.
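The analysis behind panel c can be sketched in three steps: assign each gene to a bin by first intron length, summarise the per-gene counts in each bin, and compare the 100 kb+ bin with another bin using the Mann–Whitney U statistic. In this illustration the bin edges below 100 kb are hypothetical (the legend names only the 100 kb+ bin), and only the U statistic is computed, not its P value.

```python
import statistics

# Hypothetical bin edges in base pairs; only "100 kb+" is named in the legend.
BIN_EDGES = [
    (0, 10_000, "<10 kb"),
    (10_000, 100_000, "10-100 kb"),
    (100_000, None, "100 kb+"),
]


def bin_label(first_intron_length):
    """Assign a gene to a bin by the first intron length of its canonical isoform."""
    for low, high, label in BIN_EDGES:
        if first_intron_length >= low and (high is None or first_intron_length < high):
            return label
    raise ValueError("length must be non-negative")


def summarise(counts):
    """Median, quartiles and mean, matching what the boxplot displays."""
    q1, _, q3 = statistics.quantiles(counts, n=4)
    return {
        "q1": q1,
        "median": statistics.median(counts),
        "q3": q3,
        "mean": statistics.mean(counts),
    }


def mann_whitney_u(a, b):
    """U statistic for sample a versus sample b; ties count as one half."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```

The pairwise loop is quadratic in sample size, which keeps the definition of U explicit; a rank-based implementation would be preferable for large bins.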
- Supplementary Information (285 KB)
This file contains a Supplementary Note, Supplementary References and full legends for Supplementary Tables 1-4.
- Supplementary Table 1 (22.5 MB)
This table contains novel junction detection and linear regression analysis – see Supplementary Information file for full legend.
- Supplementary Table 2 (13 KB)
This table contains functions and disease associations of high confidence RS-site containing genes.
- Supplementary Table 3 (3.3 MB)
This table contains RS-site splice site competition scores and cryptic splice site usage across the transcriptome – see Supplementary Information file for full legend.
- Supplementary Table 4 (15 KB)
This table contains reporter constructs and primer sequences used in this study – see Supplementary Information file for full legend.