Recursive splicing in long vertebrate genes

Journal name:
Nature
Volume:
521
Pages:
371–375
DOI:
doi:10.1038/nature14466

It is generally believed that splicing removes introns as single units from precursor messenger RNA transcripts. However, some long Drosophila melanogaster introns contain a cryptic site, known as a recursive splice site (RS-site), that enables a multi-step process of intron removal termed recursive splicing (refs 1, 2). The extent to which recursive splicing occurs in other species and its mechanistic basis have not been examined. Here we identify highly conserved RS-sites in genes expressed in the mammalian brain that encode proteins functioning in neuronal development. Moreover, the RS-sites are found in some of the longest introns across vertebrates. We find that vertebrate recursive splicing requires initial definition of an ‘RS-exon’ that follows the RS-site. The RS-exon is then excluded from the dominant mRNA isoform owing to competition with a reconstituted 5′ splice site formed at the RS-site after the first splicing step. Conversely, the RS-exon is included when preceded by cryptic promoters or exons that fail to reconstitute an efficient 5′ splice site. Most RS-exons contain a premature stop codon such that their inclusion can decrease mRNA stability. Thus, by establishing a binary splicing switch, RS-sites demarcate different mRNA isoforms emerging from long genes by coupling cryptic elements with inclusion of RS-exons.
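The two-step model summarized above can be illustrated with a toy sketch. The Python below is not the authors' code: the segment names, the helper functions (`splice`, `sequence`, `reconstituted_5ss`, `first_inframe_stop`) and every nucleotide sequence are hypothetical and chosen only for illustration. It assumes the open reading frame starts at the first nucleotide of the upstream exon, and it reports the 5′ splice site as a 9-nt window (3 exonic plus 6 downstream nucleotides) without attempting to score it.

```python
# Toy model of vertebrate recursive splicing.
# All names and sequences are invented for illustration only.

STOP_CODONS = {"TAA", "TAG", "TGA"}

# Hypothetical pre-mRNA as an ordered list of named segments.
# The RS-exon begins directly after the AG that closes intron_a,
# and it starts with a 5' splice-site-like motif (GTAAGT).
pre_mrna = [
    ("exon1", "ATGGCTCAG"),
    ("intron_a", "GTAAGAACTTTTTCTCTTTCAG"),
    ("rs_exon", "GTAAGTTGACCT"),
    ("intron_b", "GTGAGTCCTTTTCTTTCAG"),
    ("exon2", "GGTCTGAAA"),
]


def splice(segments, removed):
    """Return the segments that remain after removing the named ones."""
    return [(name, seq) for name, seq in segments if name not in removed]


def sequence(segments):
    """Concatenate the segment sequences into one string."""
    return "".join(seq for _, seq in segments)


def reconstituted_5ss(segments):
    """9-nt window at the exon1/RS-exon junction: 3 exonic + 6 downstream nt."""
    parts = dict(segments)
    return parts["exon1"][-3:] + parts["rs_exon"][:6]


def first_inframe_stop(mrna):
    """Index of the first stop codon in frame 0, or None if there is none."""
    for i in range(0, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i
    return None


# Step 1: the RS-exon is defined and exon1 is joined to the RS-site.
intermediate = splice(pre_mrna, {"intron_a"})
site = reconstituted_5ss(intermediate)

# Step 2, option A: the reconstituted 5' splice site wins the competition,
# so the RS-exon is removed together with the downstream intron segment.
skipped = sequence(splice(intermediate, {"rs_exon", "intron_b"}))

# Step 2, option B: the RS-exon's own 5' splice site is used instead,
# so the RS-exon stays in the mRNA.
included = sequence(splice(intermediate, {"intron_b"}))

print("reconstituted 5' splice site window:", site)
print("RS-exon skipped, first in-frame stop:", first_inframe_stop(skipped))
print("RS-exon included, first in-frame stop:", first_inframe_stop(included))
```

In this toy example only the isoform that retains the RS-exon acquires an in-frame stop codon, which mirrors the observation that RS-exon inclusion can decrease mRNA stability. Changing the last three nucleotides of the hypothetical `exon1` changes the reconstituted window, which is the intuition behind upstream exons that fail to reconstitute an efficient 5′ splice site.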

Figures

  1. Figure 1: Detection of recursive splice sites within long genes expressed in the human brain.

    a, Schematic of the D. melanogaster recursive splicing mechanism. b, The log2-fold gene expression ratios following differential expression sequencing (DESeq; ref. 19) analysis of all human protein-coding genes between the brain and all other tissues. Data are represented as Loess smoothing curves after defining genes by their maximum length in kilobases. Dashed vertical line indicates 150 kb. RNA-seq data were obtained from the Illumina Human Body Map 2.0 total RNA-seq library (GEO accession GSE30611). Skel. m., skeletal muscle; WBC, white blood cell. c, Schematic of the theoretical RNA abundance across long introns demonstrating linear regression analysis performed on introns before/after novel junction consideration. d, All novel junctions identified within CADM1 by RNA-seq data are shown on top of experimentally derived RNA-seq (red) and FUS iCLIP (green) read densities, both grouped in 5-kb windows. The displayed linear regression line was determined after the intron was split at the red novel junction. This split significantly improved the regression in both RNA-seq and FUS iCLIP (P < 0.01 in both, F-test). Blue novel junction contacts the RS-exon. Phylo-P sequence conservation scores are shown around the CADM1 RS-site across 46 mammalian species. e, Ratio of after:before gradients at long gene novel junctions in RNA-seq (x axis) and FUS iCLIP (y axis) data sets. Black and red dots represent junctions that significantly improve the regression gradient and goodness-of-fit, whereas grey dots show no improvement. Black dots are junctions contacting the sequence of 3′ splice sites (SS), whereas red dots contact the sequence of RS-sites. Dashed lines mark top and bottom quartile ratios for each data set. f, WebLogo of RS-sites identified by red junctions from e.

  2. Figure 2: Recursive splicing requires initial definition of RS-exons.

    a, RT–PCR validation of recursive splicing in ANK3 and CADM1 genes. The length of expected products in nucleotides is marked below the gels. No products are expected in the lanes marked by the asterisk. b, Consensus splice site location and in-frame termination codons at RS-exons in indicated human genes. c, d, Phylo-P conservation scores aligned at RS-sites (c) and 5′ splice sites (d) of RS-exons. Conservation at the two nearest cryptic 5′ splice sites following RS-exons (nearest 5′ splice site) and the canonical 5′ and 3′ splice sites in the same genes are also shown. e, Schematic of the exon definition model and AON-A1 design strategy. f, g, Quantitative RT–PCR (qRT–PCR) analysis of RS-site junctions in human CADM1 and ANK3 genes (n = 4 for non-specific AON (NS), n = 5 for AON-A1, two separate experiments) (f) or zebrafish cadm2a gene after treatment with AON-A1 (n = 7, 3 separate experiments) (g). h, qRT–PCR analysis of intronic RNA upstream of RS-sites in CADM1 and ANK3 genes after treatment with AON-A1. Location of primer pair is indicated by red arrow in schematic, and expected changes in intronic abundance indicated by grey triangles (n = 4 for NS, n = 5 for AON-A1, 2 separate experiments). i, qRT–PCR analysis of zebrafish cadm2a mRNA using two separate primer sets targeting constitutive exons after in vivo injection of AON-A1 (n = 7, 3 separate experiments). j, qRT–PCR analysis of human CADM1 and ANK3 mRNAs after 48 h treatment with AON-A1. mRNA for both genes was assessed in nuclear fractions (n = 4 for NS, n = 5 for AON-A1, 2 separate experiments). *P < 0.05 (two-tailed Student’s t-test). Data are mean ± s.d. Unless indicated otherwise, primers are indicated by coloured arrows within schematics. Replicate data are shown in the source data.

  3. Figure 3: The reconstituted 5′ splice site is required for RS-exon skipping.

    a, Schematic of splice site competition model, and the design strategy for the CADM2 splicing reporter P1 variants and AON-A2 experiments. b, c, Qiaxcel analysis of indicated CADM2 splicing reporter products after transfection in SH-SY5Y cells (n = 3–5, 2 separate experiments) (b), or human CADM1 and ANK3 genes after 48-h treatment with AON-A2 (n = 4 for CADM1, n = 5 for ANK3, 2 separate experiments) (c). d, Quantification of CADM1 and ANK3 RS-exon inclusion after treatment with AON-A2 then dimethylsulphoxide (DMSO) or cycloheximide (CHX) in SH-SY5Y cells (n = 4, 2 separate experiments). *P < 0.05 (two-tailed Student’s t-test). Data are mean ± s.d. Primers used are indicated by red arrows in schematics. Replicate data are shown in the source data.

  4. Figure 4: Splice site competition allows a binary splicing switch for RS-exons.

    a, RNA-seq read density patterns in the CADM2 gene shown in 5-kb windows, with linear regression performed after the first intron is split at the two RS-sites indicated with blue vertical lines. Isoforms expressed from the dominant and minor promoters in human frontal cortex tissue are shown, and primer locations used for b indicated by coloured arrows. Grey forward primer is located in the first exon of dominant isoform, blue forward primer is located in the first RS-exon, red forward primer is located in the first exon of alternative isoform (P2). Zoomed area represents the sequence at the start of the second RS-exon. b, c, RT–PCR analysis of RS-exon inclusion in indicated CADM2 isoforms (b) or indicated NTM isoforms (c) (n = 4 and n = 3 respectively; Extended Data Fig. 7). Values are mean ± s.d. d, Schematic of CADM2 splicing reporter variants P1 and P1-m3, based on the dominant CADM2 isoform (white), and P2 and P2-m1, based on the minor CADM2 isoform (red). Splice site scores for reconstituted and RS-exon 5′ splice sites are indicated. e, f, Qiaxcel analysis of indicated CADM2 splicing reporter products after transfection in SH-SY5Y cells (n = 3 or n = 4, 2 separate experiments). The expected size of PCR products is shown next to each electropherogram. g, Lengths of the 9 introns containing high-confidence RS-sites compared to other vertebrate introns. h, Histogram of human gene lengths plotted alongside the percentage of genes with RS-site-containing novel junctions. i, Schematic representation of the mechanism of recursive splicing and the binary splicing switch as described in main text. For relevant panels, replicate data are shown in the source data.

  5. Extended Data Fig. 1: Long gene expression is enriched in the brain.

    a, GO term analysis of genes >150 kb relative to all human genes. All GO terms are associated with enrichment scores >2. b, The log2-fold gene expression ratios following DESeq (ref. 19) analysis of all human protein-coding genes between the brain and all other tissues. Data are represented as Loess smoothing curves after sorting the genes by their maximum length in kilobases. Hashed vertical line indicates 150 kb gene length. RNA-seq data were obtained from the GTEx consortium. c, Individual scatterplots used to create Fig. 1b and representing DESeq (ref. 19) analysis of individual genes within indicated tissues compared to the brain. Red dots indicate genes that contain RS-sites, blue dots indicate dystrophin, and black dots indicate titin (two long genes most highly expressed in muscle tissues). Grey dots are all remaining genes. d, DESeq (ref. 19) analysis of individual gene expression after vs before differentiation of C2C12 mouse myoblasts (GSM521256) into myogenic lineage (GSM521259; ref. 29), after vs before differentiation of mouse embryonic stem cells (GSM1346027) into motor neurons (GSM1346035; ref. 30), or after vs before differentiation of haematopoietic stem cells (GSM992931) into erythroid lineage (GSM992934; ref. 31). Loess smoothing curves are shown after sorting the genes by their maximum length in kilobases. Hashed vertical line indicates 150 kb gene length.

  6. Extended Data Fig. 2: Linear regression analysis and novel junction sequence considerations used to identify mammalian recursive splice sites.

    a, Examples of RNA-seq read density patterns for three genes together with their calculated gradients across the (1) first intron >50 kb, and (2) the average across all other >50-kb long introns within the same gene. Gradients represent the change in summated read count every 5 kb since RNA-seq reads are grouped in 5-kb windows and linear regression performed on resulting histograms. b, Density plot indicating the ratio of gradients of all other >50 kb introns within the same gene: the gradient of the first intron >50 kb. Blue hashed line represents ratio of 1. This would indicate that gradients for long introns within the same gene are comparable and transcription is proceeding at a largely constant rate. c, Schematic of the bioinformatics pipeline used to identify novel junctions. d, Ranking of human 5′ splice site pentamer usage genome-wide. e, Nucleotide usage frequency at human 3′ splice sites genome-wide, and branch-point positioning relative to 3′ splice site genome-wide.

  7. Extended Data Fig. 3: Inferred splicing patterns identify recursive splice sites within mammalian >150 kb intron genes.

    a–g, RNA-seq (red) read density patterns and normalized FUS iCLIP (green) cross-link density patterns for the OPCML (a), ROBO2 (b), HS6ST3 (c), ANK3 (d), CADM2 (e), NCAM1 (f) and PDE4D (g) genes within human brains. RNA-seq reads and normalized FUS iCLIP cross-links are grouped in 5-kb windows. RefSeq introns >150 kb were searched for novel junctions and linear regression performed on all Ensembl introns >50 kb in which novel junctions were located. Gene isoforms displayed are those including introns within which significant junctions were identified. Red novel junctions represent significant improvements in goodness-of-fit in both RNA-seq and FUS regression analysis (P < 0.01 in both data sets, F-test). Blue novel junctions contact RS-exons. Grey novel junctions were not deemed significant following regression analysis. Zoomed area represents sequence at deep intronic loci surrounding novel junction. Phylo-P conservation track indicates sequence conservation across 46 levels of mammalian evolution.

  8. Extended Data Fig. 4: Inferred recursive splicing patterns in the OPCML gene across four separate brains.

    a, RNA-seq read density patterns for the OPCML gene across 12 different regions of four separate brains. Gene isoform displayed is that which included the long first intron within which a significant novel junction was identified. RNA-seq reads are grouped in 5-kb windows. Dotted arrows indicate location of experimentally derived RS-site.

  9. Extended Data Fig. 5: RT–PCR confirmation of RS-sites in human and zebrafish samples, and prediction of mouse RS-exons.

    a, Schematic of primer design used for RT–PCR validation of novel junctions. b–g, RT–PCR analysis of CADM2 (b), HS6ST3 (c), ROBO2 (d), PDE4D_1_1 (e), PDE4D_1_2 (f) and PDE4D_2_2 (g) genes around RS-sites using indicated primers. For PDE4D sites, first number after gene name indicates RS-site studied, second number indicates the upstream exon used. See Extended Data Fig. 3g for junctions detected. h, RT–PCR analysis of cadm2a RS-site junction in adult male and female zebrafish embryos, together with an alignment of zebrafish (ZF) cadm2a RS-site to human (HS) CADM2 RS-site. i, Map of consensus splice site location and in-frame termination codons following RS-sites in indicated mouse genes. Strong consensus splice sites are GTAAG, GTGAG, GTAGG and GTATG. Weak consensus splice sites are GTAAA, GTAAT, GTGGG, GTAAC, GTCAG and GTACG.

  10. Extended Data Fig. 6: Conservation of inferred recursive splicing patterns in the mouse brain.

    a–h, Normalized Fus iCLIP read density patterns for the Opcml (a), Robo2 (b), Hs6st3 (c), Ank3 (d), Cadm1 (e), Ncam1 (f), Cadm2 (g) and Pde4d (h) genes within the mouse brain. Normalized FUS iCLIP cross-link sites are grouped in 5-kb windows, and the displayed linear regression lines were computed on resulting histograms. Zoomed area at deep intronic loci represents RS-site sequences conserved from humans to mouse.

  11. Extended Data Fig. 7: Promoter-dependent inclusion of RS-exons in CADM2 and NTM genes.

    a, Number of cassette and constitutive exons starting with motif GURAG. b–d, RT–PCR of CADM2 gene in the frontal cortex using primers indicated in b or Fig. 4a. RT–PCR was carried out on one (b) or four (c, d) human brains. In c, the inclusion of the second RS-exon occurs together with the minor promoter. Two bands are present for both PCR reactions due to the presence of an alternatively spliced exon following the RS-exon. This can result in two distinct long or short isoforms. In d, the inclusion of the second RS-exon occurs when the first RS-exon is included. Schematics in c and d represent examined splicing products together with expected length of products. e, RNA-seq read density patterns for the NTM gene and expected human isoforms. RNA-seq reads are grouped in 5-kb windows and linear regression performed on resulting histograms. A cryptic minor promoter/exon detected by RNA-seq is indicated by vertical red line. The annotated RS-exon is indicated by the vertical blue line. Zoomed area represents RS-site sequence at start of the annotated RS-exon. Primers to assess the major and minor promoter products associated with the RS-exon are indicated by coloured arrows. f, RT–PCR of NTM gene around RS-exon using indicated primers. g, RT–PCR analysis of NTM products in which the upstream exon is either derived from the major upstream promoter or the cryptic upstream promoter/exon. RT–PCR was performed in the frontal cortex of three human brains using primer sets indicated by coloured arrows in e. Schematics represent possible splicing products together with expected length of products. Top panel assesses RS-exon inclusion, bottom panel assesses RS-site junction detection.

  12. Extended Data Fig. 8: Recursive splicing regulates the alternative splicing of RS-exons.

    a, Qiaxcel analysis and quantification of the splicing intermediates of indicated CADM2 splicing reporter products following transfection in SH-SY5Y cells. Primers used are indicated by red arrows in schematic, together with expected products and their sizes. b, RT–PCR analysis of the zebrafish cadm2a mRNA after in vivo injection of AON-2. Sequencing reveals RS-exon inclusion results in subsequent splicing to additional downstream cryptic elements before the second exon, explaining why RS-exon included product size is larger than expected. c, qRT–PCR analysis of exon–exon junctions surrounding the RS-site containing introns following AON-A1 mediated inhibition of RS-site use of the human CADM1 and ANK3 genes (n = 3, 1 experiment) or the zebrafish cadm2a gene (n = 7, 3 separate experiments). d, Splice site scores of reconstituted 5′ splice sites following first step of recursive splicing versus the 5′ splice sites of corresponding recursive exons.

  13. Extended Data Fig. 9: Cryptic elements are frequent in long first introns.

    a, UCSC annotated isoforms of the OPCML gene together with spliced expressed sequence tags (ESTs) detected across the OPCML locus. Recursive exon is marked in blue, and the preceding exons produced by minor promoter or cryptic splicing of the long first intron are marked in red. b, Lengths of the 9 introns containing the high-confidence RS-sites compared to other introns across vertebrates. Results are an extension of Fig. 4g. c, Boxplot showing the detected number of unannotated alternative start exons that junction to the dominant second exon of brain expressed genes. Only novel junctions that do not match UCSC/GENCODE transcripts are considered for analysis. Genes are separated into bins based on the first intron length of the canonical isoform. Boxplot presents median, first and third quartile boundaries for each bin. Additional red diamonds indicate mean values for each bin. *P < 10^−10 (Mann–Whitney U test). Only tests between the 100 kb+ bin to other bins are shown. Right panel shows cartoon of the implications of boxplot results.
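The regression analysis mentioned in the legends of Fig. 1c–e and Extended Data Fig. 2a (reads grouped in 5-kb windows, a linear fit across the intron, and an F-test for whether splitting the intron at a novel junction improves the fit) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the legends do not specify the exact form of the F-test, so the sketch assumes a Chow-style comparison of one line against two independently fitted lines. The function names (`bin_reads`, `linear_fit`, `split_f_statistic`) and the sawtooth read densities are hypothetical.

```python
# Sketch of a split-regression test on windowed read densities.
# Assumption: a Chow-style F statistic; the data below are synthetic.

def bin_reads(positions, intron_length, window=5000):
    """Count read start positions (0-based, < intron_length) per window."""
    n_windows = (intron_length + window - 1) // window
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // window] += 1
    return counts


def linear_fit(xs, ys):
    """Ordinary least squares; returns (slope, intercept, residual sum of squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, rss


def split_f_statistic(xs, ys, split_index):
    """F statistic for one line (2 parameters) versus two lines (4 parameters).

    Needs at least two points on each side of split_index, more than four
    points overall, and a split fit that is not perfect (non-zero residuals).
    """
    _, _, rss_single = linear_fit(xs, ys)
    _, _, rss_left = linear_fit(xs[:split_index], ys[:split_index])
    _, _, rss_right = linear_fit(xs[split_index:], ys[split_index:])
    rss_split = rss_left + rss_right
    extra_params = 2
    residual_df = len(xs) - 4
    return ((rss_single - rss_split) / extra_params) / (rss_split / residual_df)


# Synthetic sawtooth: density declines along the intron, then resets at
# window 10, as expected if a recursive splice site sits at that position.
windows = list(range(20))
noise = [1, -1, 0, 2, -2]
density = [100 - 8 * (i % 10) + noise[i % 5] for i in windows]

slope_left, _, _ = linear_fit(windows[:10], density[:10])
slope_right, _, _ = linear_fit(windows[10:], density[10:])
f_stat = split_f_statistic(windows, density, 10)

print("gradient before junction:", round(slope_left, 2))
print("gradient after junction:", round(slope_right, 2))
print("F statistic for the split:", round(f_stat, 1))
```

Only the F statistic is computed, so the sketch needs nothing beyond the standard library. Turning it into a P value requires the F-distribution survival function with 2 and n − 4 degrees of freedom, for example `scipy.stats.f.sf(f_stat, 2, len(windows) - 4)` if SciPy is available.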

References

  1. Burnette, J. M., Miyamoto-Sato, E., Schaub, M. A., Conklin, J. & Lopez, A. J. Subdivision of large introns in Drosophila by recursive splicing at nonexonic elements. Genetics 170, 661–674 (2005)
  2. Hatton, A. R., Subramaniam, V. & Lopez, A. J. Generation of alternative Ultrabithorax isoforms and stepwise removal of a large intron by resplicing at exon–exon junctions. Mol. Cell 2, 787–796 (1998)
  3. Grellscheid, S. N. & Smith, C. W. An apparent pseudo-exon acts both as an alternative exon that leads to nonsense-mediated decay and as a zero-length exon. Mol. Cell. Biol. 26, 2237–2246 (2006)
  4. Shepard, S., McCreary, M. & Fedorov, A. The peculiarities of large intron splicing in animals. PLoS ONE 4, e7853 (2009)
  5. Thakurela, S. et al. Gene regulation and priming by topoisomerase IIα in embryonic stem cells. Nature Commun. 4, 2478 (2013)
  6. Ameur, A. et al. Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain. Nature Struct. Mol. Biol. 18, 1435–1440 (2011)
  7. Rogelj, B. et al. Widespread binding of FUS along nascent RNA regulates alternative splicing in the brain. Sci. Rep. 2, 603 (2012)
  8. Ke, S. & Chasin, L. A. Context-dependent splicing regulation: exon definition, co-occurring motif pairs and tissue specificity. RNA Biol. 8, 384–388 (2011)
  9. Robberson, B. L., Cote, G. J. & Berget, S. M. Exon definition may facilitate splice site selection in RNAs with multiple exons. Mol. Cell. Biol. 10, 84–94 (1990)
  10. McGlincy, N. J. & Smith, C. W. Alternative splicing resulting in nonsense-mediated mRNA decay: what is the meaning of nonsense? Trends Biochem. Sci. 33, 385–393 (2008)
  11. Parra, M. K., Tan, J. S., Mohandas, N. & Conboy, J. G. Intrasplicing coordinates alternative first exons with alternative splicing in the protein 4.1R gene. EMBO J. 27, 122–131 (2008)
  12. Yeo, G. & Burge, C. B. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J. Comp. Biol. 11, 377–394 (2004)
  13. Jaillon, O. et al. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431, 946–957 (2004)
  14. Roy, M., Kim, N., Xing, Y. & Lee, C. The effect of intron length on exon creation ratios during the evolution of mammalian genomes. RNA 14, 2261–2273 (2008)
  15. Pickrell, J. K., Pai, A. A., Gilad, Y. & Pritchard, J. K. Noisy splicing drives mRNA isoform diversity in human cells. PLoS Genet. 6, e1001236 (2010)
  16. Lagier-Tourenne, C. et al. Divergent roles of ALS-linked proteins FUS/TLS and TDP-43 intersect in processing long pre-mRNAs. Nature Neurosci. 15, 1488–1497 (2012)
  17. Polymenidou, M. et al. Long pre-mRNA depletion and RNA missplicing contribute to neuronal vulnerability from loss of TDP-43. Nature Neurosci. 14, 459–468 (2011)
  18. King, I. F. et al. Topoisomerases facilitate transcription of long genes linked to autism. Nature 501, 58–62 (2013)
  19. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010)
  20. Trabzuni, D. et al. Quality control parameters on a large dataset of regionally dissected human control brains for whole genome expression studies. J. Neurochem. 119, 275–282 (2011)
  21. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013)
  22. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012)
  23. König, J. et al. iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nature Struct. Mol. Biol. 17, 909–915 (2010)
  24. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)
  25. Singh, J. & Padgett, R. A. Rates of in situ transcription and splicing in large human genes. Nature Struct. Mol. Biol. 16, 1128–1133 (2009)
  26. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013)
  27. Eden, E., Navon, R., Steinfeld, I., Lipson, D. & Yakhini, Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 10, 48 (2009)
  28. Crooks, G. E., Hon, G., Chandonia, J. M. & Brenner, S. E. WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190 (2004)
  29. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnol. 28, 511–515 (2010)
  30. Herrera, F. J., Yamaguchi, T., Roelink, H. & Tjian, R. Core promoter factor TAF9B regulates neuronal gene expression. eLife 3, e02559 (2014)
  31. Madzo, J. et al. Hydroxymethylation at gene regulatory regions directs stem/early progenitor cell commitment during erythropoiesis. Cell Rep. 6, 231–244 (2014)

Author information

  1. These authors contributed equally to this work.

    • Christopher R. Sibley &
    • Warren Emmett

Affiliations

  1. Department of Molecular Neuroscience, UCL Institute of Neurology, Queen Square, London WC1N 3BG, UK

    • Christopher R. Sibley,
    • Lorea Blazquez,
    • Nejc Haberman,
    • Daniah Trabzuni,
    • Mina Ryten,
    • John Hardy &
    • Jernej Ule
  2. MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, UK

    • Christopher R. Sibley,
    • Michael Briese,
    • Miha Modic &
    • Jernej Ule
  3. University College London Genetics Institute, Gower Street, London WC1E 6BT, UK

    • Warren Emmett &
    • Vincent Plagnol
  4. Department of Cell and Developmental Biology, University College London, Gower Street, London WC1E 6BT, UK

    • Ana Faro &
    • Stephen W. Wilson
  5. Institute for Clinical Neurobiology, University of Würzburg, Versbacherstr. 5, 97078, Würzburg, Germany

    • Michael Briese
  6. Department of Genetics, King Faisal Specialist Hospital and Research Centre, Riyadh 11211, Saudi Arabia

    • Daniah Trabzuni
  7. Department of Medical & Molecular Genetics, King’s College London, Guy’s Hospital, London SE1 9RT, UK

    • Mina Ryten &
    • Michael E. Weale
  8. Institute of Stem Cell Research, German Research Center for Environmental Health, Helmholtz Center Munich, 85764 Neuherberg, Germany

    • Miha Modic
  9. Faculty of Computer and Information Science, University of Ljubljana, 1000 Ljubljana, Slovenia

    • Tomaž Curk

Contributions

C.R.S., M.B. and J.U. conceived and designed the project; C.R.S., L.B., A.F., M.B., M.M. and D.T. performed experiments; C.R.S., W.E., L.B., V.P., T.C. and J.U. analysed the data and interpreted results with contributions from M.R., M.E.W. and J.H.; C.R.S. and J.U. wrote the manuscript with contributions from W.E., V.P., L.B. and S.W.W.

Competing financial interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to:

The sequence data and scripts are publicly available from the European Genome-phenome Archive under the accession number EGAS00001001170, ArrayExpress (E-MTAB-3534), http://icount.biolab.si/ and https://github.com/vplagnol/recursive_splicing.

Author details

Extended data figures and tables

Extended Data Figures

  1. Extended Data Figure 1: Long gene expression is enriched in the brain. (437 KB)

    a, GO term analysis of genes >150 kb relative to all human genes. All GO terms are associated with enrichment scores >2. b, The log2-fold gene expression ratios following DESeq19 analysis of all human protein-coding genes between the brain and all other tissues. Data are represented as Loess smoothing curves after the genes by their maximum length in kilobases. Hashed vertical line indicates 150 kb gene length. RNA-seq data was obtained from the GTEX consortium. c, Individual scatterplots used to create Fig. 1b and representing DESeq19 analysis of individual genes within indicated tissues compared to the brain. Red dots indicate genes that contain RS-sites, blue dots indicate dystrophin, and black dots indicate titin (two long genes most highly expressed in muscle tissues). Grey dots are all remaining genes. d, DESeq19 analysis of individual gene expression after vs before differentiation of C2C12 mouse myoblasts (GSM521256) into myogenic lineage (GSM521259)29, after vs before differentiation of mouse embryonic stem cells (GSM1346027) into motor neurons (GSM1346035)30, or after vs before differentiation of haematopoietic stem cells (GSM992931) into erythroid lineage (GSM992934)31. Loess smoothing curves are shown after sorting the genes by their maximum length in kilobases. Hashed vertical line indicates 150 kb gene length.

  2. Extended Data Figure 2: Linear regression analysis and novel junction sequence considerations used to identify mammalian recursive splice sites. (364 KB)

    a, Examples of RNA-seq read density patterns for three genes together with their calculated gradients across the (1) first intron >50 kb, and (2) the average across all other >50-kb long introns within the same gene. Gradients represent the change in summated read count every 5 kb since RNA-seq reads are grouped in 5-kb windows and linear regression performed on resulting histograms. b, Density plot indicating the ratio of gradients of all other >50 kb introns within the same gene: the gradient of the first intron >50 kb. Blue hashed line represents ratio of 1. This would indicate that gradients for long introns within the same gene are comparable and transcription is proceeding at a largely constant rate. c, Schematic of the bioinformatics pipeline used to identify novel junctions. d, Ranking of human 5′ splice site pentamer usage genome-wide. e, Nucleotide usage frequency at human 3′ splice sites genome-wide, and branch-point positioning relative to 3′ splice site genome-wide.

  3. Extended Data Figure 3: Inferred splicing patterns identify recursive splice sites within mammalian >150 kb intron genes. (540 KB)

    ag, RNA-seq (red) read density patterns and normalized FUS iCLIP (green) cross-link density patterns for the OPCML (a), ROBO2 (b), HS6ST3 (c), ANK3 (d), CADM2 (e), NCAM1 (f) and PDE4D (g) genes within human brains. RNA-seq reads and normalized FUS iCLIP cross-links are grouped in 5-kb windows. RefSeq introns >150 kb were searched for novel junctions and linear regression performed on all Ensembl introns >50 kb in which novel junctions were located. Gene isoforms displayed are those including introns within which significant junctions were identified. Red novel junctions represent significant improvements in goodness-of-fit in both RNA-seq and FUS regression analysis (P < 0.01 in both data sets, F-test). Blue novel junctions contact RS-exons. Grey novel junctions were not deemed significant following regression analysis. Zoomed area represents sequence at deep intronic loci surrounding novel junction. Phylo-P conservation track indicates sequence conservation across 46 levels of mammalian evolution.

  4. Extended Data Figure 4: Inferred recursive splicing patterns in the OPCML gene across four separate brains. (505 KB)

    a, RNA-seq read density patterns for the OPCML gene across 12 different regions of four separate brains. Gene isoform displayed is that which included the long first intron within which a significant novel junction was identified. RNA-seq reads are grouped in 5-kb windows. Dotted arrows indicate location of experimentally derived RS-site.

  5. Extended Data Figure 5: RT–PCR confirmation of RS-sites in human and zebrafish samples, and prediction of mouse RS-exons. (218 KB)

    a, Schematic of primer design used for RT–PCR validation of novel junctions. bg, RT–PCR analysis of CADM2 (b), HS6ST3 (c), ROBO2 (d), PDE4D_1_1 (e), PDE4D_1_2 (f) and PDE4D_2_2 (g) genes around RS-sites using indicated primers. For PDE4D sites, first number after gene name indicates RS-site studied, second number indicates the upstream exon used. See Extended Data Fig. 3g for junctions detected. h, RT–PCR analysis of cadm2a RS-site junction in adult male and female zebrafish embryos, together with an alignment of zebrafish (ZF) cadm2a RS-site to human (HS) CADM2 RS-site. i, Map of consensus splice site location and in-frame termination codons following RS-sites in indicated mouse genes. Strong consensus splice sites are GTAAG, GTGAG, GTAGG and GTATG. Weak consensus splice sites are GTAAA, GTAAT, GTGGG, GTAAC, GTCAG and GTACG.

  6. Extended Data Figure 6: Conservation of inferred recursive splicing patterns in the mouse brain. (434 KB)

    ah, Normalized Fus iCLIP read density patterns for the Opcml (a), Robo2 (b), Hs6st3 (c), Ank3 (d), Cadm1 (e), Ncam1 (f), Cadm2 (g) and Pde4d (h) genes within the mouse brain. Normalized FUS iCLIP cross-link sites are grouped in 5-kb windows, and the displayed linear regression lines were computed on resulting histograms. Zoomed area at deep intronic loci represents RS-site sequences conserved from humans to mouse.

  7. Extended Data Figure 7: Promoter-dependent inclusion of RS-exons in CADM2 and NTM genes. (336 KB)

    a, Number of cassette and constitutive exons starting with motif GURAG. bd, RT–PCR of CADM2 gene in the frontal cortex using primers indicated in b or Fig. 4a. RT–PCR was carried out on one (b) or four (c, d) human brains. In c, the inclusion of the second RS-exon occurs together with the minor promoter. Two bands are present for both PCR reactions due to the presence of an alternatively spliced exon following the RS-exon. This can result in two distinct long or short isoforms. In d, the inclusion of the second RS-exon occurs when the first RS-exon is included. Schematics in c and d represent examined splicing products together with expected length of products. e, RNA-seq read density patterns for the NTM gene and expected human isoforms. RNA-seq reads are grouped in 5-kb windows and linear regression performed on resulting histograms. A cryptic minor promoter/exon detected by RNA-seq is indicated by vertical red line. The annotated RS-exon is indicated by the vertical blue line. Zoomed area represents RS-site sequence at start of the annotated RS-exon. Primers to assess the major and minor promoter products associated with the RS-exon are indicated by coloured arrows. f, RT–PCR of NTM gene around RS-exon using indicated primers. g, RT–PCR analysis of NTM products in which the upstream exon is either derived from the major upstream promoter or the cryptic upstream promoter/exon. RT–PCR was performed in the frontal cortex of three human brains using primer sets indicated by coloured arrows in e. Schematics represent possible splicing products together with expected length of products. Top panel assesses RS-exon inclusion, bottom panel assesses RS-site junction detection.

  8. Extended Data Figure 8: Recursive splicing regulates the alternative splicing of RS-exons. (190 KB)

    a, QIAxcel analysis and quantification of the splicing intermediates of indicated CADM2 splicing reporter products following transfection in SH-SY5Y cells. Primers used are indicated by red arrows in the schematic, together with expected products and their sizes. b, RT–PCR analysis of the zebrafish cadm2a mRNA after in vivo injection of AON-2. Sequencing reveals that RS-exon inclusion results in subsequent splicing to additional downstream cryptic elements before the second exon, explaining why the RS-exon-included product is larger than expected. c, qRT–PCR analysis of exon–exon junctions surrounding the RS-site-containing introns following AON-A1-mediated inhibition of RS-site use in the human CADM1 and ANK3 genes (n = 3, 1 experiment) or the zebrafish cadm2a gene (n = 7, 3 separate experiments). d, Splice site scores of reconstituted 5′ splice sites following the first step of recursive splicing versus the 5′ splice sites of corresponding recursive exons.

  9. Extended Data Figure 9: Cryptic elements are frequent in long first introns. (205 KB)

    a, UCSC annotated isoforms of the OPCML gene together with spliced expressed sequence tags (ESTs) detected across the OPCML locus. Recursive exon is marked in blue, and the preceding exons produced by minor promoter or cryptic splicing of the long first intron are marked in red. b, Lengths of the 9 introns containing the high-confidence RS-sites compared to other introns across vertebrates. Results are an extension of Fig. 4g. c, Boxplot showing the detected number of unannotated alternative start exons that junction to the dominant second exon of brain expressed genes. Only novel junctions that do not match UCSC/GENCODE transcripts are considered for analysis. Genes are separated into bins based on the first intron length of the canonical isoform. Boxplot presents median, first and third quartile boundaries for each bin. Additional red diamonds indicate mean values for each bin. *P < 10⁻¹⁰ (Mann–Whitney U test). Only tests between the 100 kb+ bin and the other bins are shown. Right panel shows a cartoon of the implications of the boxplot results.
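
Several legends above describe read or cross-link densities grouped in 5-kb windows, with linear regression computed on the resulting histograms. A minimal sketch of that kind of calculation follows. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the half-open interval convention and the use of window midpoints as x values are all hypothetical choices, and normalization of the counts is omitted.

```python
def bin_counts(positions, start, end, window=5000):
    """Count positions in consecutive fixed-size windows over [start, end).

    The 5000-nt default mirrors the 5-kb windows named in the legends;
    the interval convention is an assumption made for this sketch.
    """
    n_bins = (end - start + window - 1) // window  # ceiling division
    counts = [0] * n_bins
    for p in positions:
        if start <= p < end:
            counts[(p - start) // window] += 1
    return counts


def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def window_regression(positions, start, end, window=5000):
    """Bin positions, then regress window counts on window midpoints."""
    counts = bin_counts(positions, start, end, window)
    mids = [start + (i + 0.5) * window for i in range(len(counts))]
    return linear_fit(mids, counts)
```

To compare an intron before and after splitting at a candidate junction, one could call `window_regression` once over the whole intron and once over each segment on either side of the junction, then compare the fits. The F-test used for that comparison in the main figures is not implemented here.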

Supplementary information

PDF files

  1. Supplementary Information (285 KB)

    This file contains a Supplementary Note, Supplementary References and full legends for Supplementary Tables 1-4.

Excel files

  1. Supplementary Table 1 (22.5 MB)

    This table contains novel junction detection and linear regression analysis – see Supplementary Information file for full legend.

  2. Supplementary Table 2 (13 KB)

    This table contains functions and disease associations of high confidence RS-site containing genes.

  3. Supplementary Table 3 (3.3 MB)

    This table contains RS-site splice site competition scores and cryptic splice site usage across the transcriptome – see Supplementary Information file for full legend.

  4. Supplementary Table 4 (15 KB)

    This table contains reporter constructs and primer sequences used in this study – see Supplementary Information file for full legend.

Additional data