The discovery of introns four decades ago was one of the most unexpected findings in molecular biology [1]. Introns are sequences interrupting genes that must be removed as part of messenger RNA production. Genome sequencing projects have shown that most eukaryotic genes contain at least one intron, and frequently many [2, 3]. Comparison of these genomes reveals a history of long evolutionary periods during which few introns were gained, punctuated by episodes of rapid, extensive gain [2, 3]. However, although several detailed mechanisms for such episodic intron generation have been proposed [4, 5, 6, 7, 8], none has been empirically supported on a genomic scale. Here we show how short, non-autonomous DNA transposons independently generated hundreds to thousands of introns in the prasinophyte Micromonas pusilla and the pelagophyte Aureococcus anophagefferens. Each transposon carries one splice site. The other splice site is co-opted from the gene sequence that is duplicated upon transposon insertion, allowing perfect splicing out of the RNA. The distributions of sequences that can be co-opted are biased with respect to codons, and phasing of transposon-generated introns is similarly biased. These transposons insert between pre-existing nucleosomes, so that multiple nearby insertions generate nucleosome-sized intervening segments. Thus, transposon insertion and sequence co-option may explain the intron phase biases [2] and prevalence of nucleosome-sized exons [9] observed in eukaryotes. Overall, the two independent examples of proliferating elements illustrate a general DNA transposon mechanism that can plausibly account for episodes of rapid, extensive intron gain during eukaryotic evolution [2, 3].
References
1. Why genes in pieces? Nature 271, 501 (1978)
2. Origin and evolution of spliceosomal introns. Biol. Direct 7, 11 (2012)
3. Origin of spliceosomal introns and alternative splicing. Cold Spring Harb. Perspect. Biol. 6, a016071 (2014)
4. Selfish DNA and the origin of introns. Nature 315, 283–284 (1985)
5. The splicing of transposable elements and its role in intron evolution. Genetica 86, 295–303 (1992)
6. Duplication-degeneration as a mechanism of gene fission and the origin of new genes in Drosophila species. Nat. Genet. 36, 523–527 (2004)
7. Extensive, recent intron gains in Daphnia populations. Science 326, 1260–1262 (2009)
8. Identifying the mechanisms of intron gain: progress and trends. Biol. Direct 7, 29 (2012)
9. Chromatin organization marks exon-intron structure. Nat. Struct. Mol. Biol. 16, 990–995 (2009)
10. Green evolution and dynamic adaptations revealed by genomes of the marine picoeukaryotes Micromonas. Science 324, 268–272 (2009)
11. The complex intron landscape and massive intron invasion in a picoeukaryote provides insights into intron evolution. Genome Biol. Evol. 5, 2393–2401 (2013)
12. Dnmt1-independent CG methylation contributes to nucleosome positioning in diverse eukaryotes. Cell 156, 1286–1297 (2014)
13. Niche of harmful alga Aureococcus anophagefferens revealed through ecogenomics. Proc. Natl Acad. Sci. USA 108, 4352–4357 (2011)
14. Birth of new spliceosomal introns in fungi by multiplication of introner-like elements. Curr. Biol. 22, 1260–1265 (2012)
15. Intron invasions trace algal speciation and reveal nearly identical Arctic and Antarctic Micromonas populations. Mol. Biol. Evol. 32, 2219–2235 (2015)
16. Group II introns: mobile ribozymes that invade DNA. Cold Spring Harb. Perspect. Biol. 3, a003616 (2011)
17. DNA sequence at the integration sites of the insertion element IS1. Cell 13, 411–418 (1978)
18. IS1 insertion generates duplication of a nine base pair sequence at its target site. Cell 13, 419–426 (1978)
19. LTR-retrotransposons and MITEs: important players in the evolution of plant genomes. Curr. Opin. Genet. Dev. 5, 814–821 (1995)
20. Evidence-based green algal genomics reveals marine diversity and ancestral characteristics of land plants. BMC Genomics 17, 267 (2016)
21. DNA transposon Hermes inserts into DNA in nucleosome-free regions in vivo. Proc. Natl Acad. Sci. USA 107, 21966–21972 (2010)
22. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10, 1213–1218 (2013)
23. Estimating the timing of early eukaryotic diversification with multigene molecular clocks. Proc. Natl Acad. Sci. USA 108, 13624–13629 (2011)
24. DNA transposons and the evolution of eukaryotic genomes. Annu. Rev. Genet. 41, 331–368 (2007)
25. RNA-RNA interactions and pre-mRNA mislocalization as drivers of group II intron loss from nuclear genomes. Proc. Natl Acad. Sci. USA 111, 6612–6617 (2014)
26. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013)
27. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 12, R22 (2011)
28. WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190 (2004)
29. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990)
30. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013)
31. Molecular Evolution and Phylogenetics (Oxford Univ. Press, 2000)
32. MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol. Biol. Evol. 30, 2725–2729 (2013)
33. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013)
34. Haplotype-based variant detection from short-read sequencing. Preprint at https://arxiv.org/abs/1207.3907 (2012)
35. T-lex2: genotyping, frequency estimation and re-annotation of transposable elements using single or pooled next-generation sequencing data. Nucleic Acids Res. 43, e22 (2015)
36. RetroSeq: transposable element discovery from next-generation sequencing data. Bioinformatics 29, 389–390 (2013)
37. Discovery and characterization of Alu repeat sequences via precise local read assembly. Nucleic Acids Res. 43, 10292–10307 (2015)
38. The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16, 1–11 (1970)
Extended data figures and tables
Extended Data Figures
- Extended Data Figure 1: M. pusilla introner elements are in phase with nucleosome linker DNA, even without methylation.
Unmethylated regions (indicated by the line with arrowheads) are defined as containing no base positions with fractional methylation of 0.5 or greater in a window starting 50 bp upstream of the 5′ end of the introner element intron and continuing 234 bp downstream, which is 50 bp beyond the predominant M. pusilla introner element intron size of 184 bp (Fig. 2a). Mean values at each base position are shown for chromatin maps [12] aligned to the subset (7%) of introner element introns residing in unmethylated regions (dark grey and dark blue for nucleosome centres and DNA methylation, respectively), compared with alignment to all introner element introns (light grey and light blue; same data as in Fig. 1b for introner element introns). Conversely, to assess whether introner elements could be in phase with methylated regions that are not also nucleosome linkers, we looked for introner elements that had both ends in methylated DNA regions [12] but not in nucleosome linkers, which gave 35 potential candidates (1% of introner elements). Manual inspection revealed that 34 of the 35 candidates nonetheless appear to have ends in nucleosome linkers that were simply missed by the filtering criteria we used for calling linkers. This leaves one candidate, indicating little evidence that introner element ends lie in DNA-methylated regions that are not also nucleosome linkers. Thus, unmethylated nucleosome linkers could be the primary determinant of introner element insertion in at least some cases, whereas we find virtually no evidence that methylated regions could be the primary determinant of introner element insertion without also being nucleosome linkers.
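The window criterion for unmethylated regions can be illustrated with a minimal sketch. This is not the authors' code: the function name, the dictionary representation of methylation calls, the 0-based forward-strand coordinates and the treatment of uncovered positions as unmethylated are all assumptions made for illustration.

```python
def is_unmethylated_region(frac_meth, intron_start, upstream=50,
                           downstream=234, threshold=0.5):
    """Return True if no position in the window reaches the methylation threshold.

    frac_meth: dict mapping 0-based position -> fractional methylation (0 to 1).
               Positions missing from the dict count as unmethylated (assumption).
    intron_start: 0-based coordinate of the intron 5' end (forward strand assumed).
    """
    # Window runs from 50 bp upstream of the 5' end to 234 bp downstream of it.
    window = range(intron_start - upstream, intron_start + downstream)
    return all(frac_meth.get(pos, 0.0) < threshold for pos in window)
```

With the default values the window covers the predominant 184-bp intron plus 50 bp on either side, matching the definition in the caption.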
- Extended Data Figure 2: A. anophagefferens introner elements insert into pre-existing nucleosome linkers.
a, Introner element (IE) introns are generally in phase with nucleosome positions, whereas other introns are not. DNA methylation [12] was aligned to the 5′ ends of introner element introns (dark blue) or other introns (light blue). We did not previously generate nucleosome data for A. anophagefferens, but DNA methylation is a reliable indicator of linker locations [12]. b, Introner elements are in phase with the starts of genes, indicating insertion between pre-existing nucleosomes. The 5′ ends of introner element introns and DNA methylation [12] were aligned to gene starts. A kernel density estimate of introner element ends is displayed with peaks marked by vertical broken lines.
- Extended Data Figure 3: Target site duplications (TSDs) at introner element introns.
a, c, Intron sequences contain directly repeated sequences at their ends. The 5′ and 3′ ends of each A. anophagefferens (a) and M. pusilla (c) intron are aligned directly at each possible offset from −10 to 10 bp. Positions relative to the 5′ splice site from 10 bp upstream to 10 bp downstream are shown. Introner element (IE) introns are shown on the left, other (non-introner element) introns in the centre, and the differences obtained by subtracting the identity percentages of other introns from those of introner element introns on the right. Each panel is divided by a vertical black line and a diagonally stepped black line to delineate different regions: the upper left region represents alignment of upstream exon versus 3′ intron end sequence; the upper right represents 5′ intron end versus 3′ intron end; the lower right represents 5′ intron end versus downstream exon; and the lower left represents upstream exon versus downstream exon. The red arrowheads on the right indicate the offset with maximum average identity (0 in both cases). The red boxes in the right panels highlight the identified TSD length and position (see Supplementary Discussion). b, d, An example of an aligned 5′ (above) and 3′ (below) intron end of an introner element for the offset with maximum identity is shown in b for A. anophagefferens and d for M. pusilla. Exonic sequence is uppercase and boxed; intronic is lowercase. Vertical lines show identities that are part of at least an identical 2-mer, with the red lines corresponding to the boxed regions in a and c.
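The offset scan used to reveal direct repeats can be sketched as below. This is a simplified illustration rather than the published analysis: the function name and input layout are hypothetical, and identity is averaged over the whole window for each offset instead of being reported per position as in the figure panels.

```python
def identity_by_offset(five_windows, three_windows, max_offset=10):
    """Mean fraction of identical bases between paired windows at each offset.

    five_windows[i] and three_windows[i] are equal-length sequences around the
    5' and 3' splice sites of intron i (hypothetical input layout).
    Returns a dict mapping offset -> mean identity (0.0 if nothing overlaps).
    """
    result = {}
    for offset in range(-max_offset, max_offset + 1):
        matches = compared = 0
        for five, three in zip(five_windows, three_windows):
            for i, base in enumerate(five):
                j = i + offset  # position in the 3' window paired with position i
                if 0 <= j < len(three):
                    compared += 1
                    matches += base == three[j]
        result[offset] = matches / compared if compared else 0.0
    return result
```

The offset with the highest value, for example `max(result, key=result.get)`, corresponds to the red arrowhead in the panels; subtracting the values obtained for other introns from those for introner element introns gives the difference shown on the right.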
- Extended Data Figure 4: Terminal inverted repeats (TIRs) in introner element introns.
a, c, Intron end sequences contain inverted repeats. The 5′ end and the reversed 3′ end of each A. anophagefferens (a) and M. pusilla (c) intron are aligned at each possible offset from −30 to 30 bp. Positions relative to the 5′ splice site from 30 bp upstream to 30 bp downstream are shown. Introner element (IE) introns are shown on the left and other (non-introner element) introns are on the right. In each panel the upper left region represents upstream exon versus downstream exon sequence, the upper right represents 5′ intron end versus downstream exon, the lower right represents 5′ intron end versus 3′ intron end, and the lower left represents upstream exon versus 3′ intron end. The red arrowheads (right) indicate the offset with maximum average complementarity. b, d, An example of an aligned 5′ (top) and 3′ (bottom, reversed so that it is 3′ to 5′) end of an introner element intron for the offset with maximum complementarity is shown in b for A. anophagefferens (offset of +8) and d for M. pusilla (offset of −5). Exonic sequence is uppercase and boxed; intronic is lowercase. Vertical lines show complementarities that are part of at least an identical 2-mer.
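For inverted repeats the comparison is against the reversed 3′ end and scores base pairing rather than identity. The sketch below is an assumption-laden illustration (hypothetical function name, uppercase A/C/G/T input only, a single pair of ends at a single offset), not the authors' implementation.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complementarity(five_end, three_end, offset=0):
    """Fraction of overlapping positions where five_end pairs with reversed three_end.

    Sequences are expected in uppercase; any other symbol never counts as a pair.
    """
    reversed_three = three_end[::-1]  # read the 3' end in the 3'-to-5' direction
    pairs = [
        (five_end[i], reversed_three[i + offset])
        for i in range(len(five_end))
        if 0 <= i + offset < len(reversed_three)
    ]
    if not pairs:
        return 0.0
    return sum(COMPLEMENT.get(a) == b for a, b in pairs) / len(pairs)
```

A perfect terminal inverted repeat, where the 3′ end is the reverse complement of the 5′ end, scores 1.0 at offset 0.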
- Extended Data Figure 5: Intron gain templated by nucleosomes and co-opted sequences.
Model for intron generation by introner elements acting as short non-autonomous DNA transposons that carry a splice site and insert between nucleosomes with co-option of the other splice site sequence.
- Extended Data Figure 6: Diploid genomic sequence variation in a more recent isolate of A. anophagefferens.
a, Calling of sequence variation from genomic sequencing reads without an assumption of ploidy reveals a peak at an alternate allele fraction of approximately 0.5. The most likely scenario is that this A. anophagefferens isolate has a diploid genome. It is not physically plausible for it to have higher ploidy because that amount of chromatin could not fit into its extremely compact nucleus [12]. b, An example reference introner element (IE) is present within one allele and absent from the alternate allele. The locus is displayed as in Fig. 3a. The reference introner element is located in an annotated protein-coding gene with a 200-bp RNA sequencing-validated intron in the reference isolate. The alternate allele is probably exonic without an intron (broken lines), so that it encodes the same amino acid sequence. The TSD within the reference allele is 8 bp, immediately flanking the introner element TIRs. c, An example introner element not found within the reference allele is present within the alternate allele. The locus is displayed as in Fig. 3a. The alternate introner element is within an annotated protein-coding gene with a predicted 200-bp intron (broken lines). If the predicted intron is indeed spliced out of the RNA, then the alternate allele encodes the same amino acid sequence. The TSD within the alternate allele is 8 bp, immediately flanking the introner element TIRs.
- Extended Data Figure 7: Splice site sequences.
Logos for the 10 bp upstream and downstream of 5′ and 3′ splice sites for introner element and other introns are shown for each organism. The rectangles show exonic positions. The core splice sites are GY (Y is C or T) and AG. Introner elements (IEs) combine with co-opted exonic sequence that is duplicated (Fig. 3) to generate particular sequences that extend beyond the core sites (bracketed). Specifically, this results in a predominance of AG|GY sequences (| denotes the position of splicing that ultimately occurs) at 5′ splice sites in M. pusilla introner element introns and 3′ splice sites in A. anophagefferens introner element introns. Similar respective sequences are observed in other introns in each organism: G|GT for M. pusilla 5′ splice sites and AG|G for A. anophagefferens 3′ splice sites. In non-introner element introns, these sequences have been under selection for long periods of time to promote RNA splicing, revealing the sequences extending beyond core sites that probably contribute to optimal splicing in each organism. The similarity of introner element intron splice sites to other intron splice sites thus suggests that introner elements in each organism generate new introns that are spliced reasonably well.
- Extended Data Figure 8: Most introner elements are located in genes expressing low to average RNA levels.
Distributions of detectable RNA levels of all transcripts (black) and only those containing at least one introner element (IE-containing, green) are shown as measured by RNA sequencing. Box plots indicate the median, first and third quartiles with whiskers extending up to data 1.5 times the interquartile range away from the box. For M. pusilla, introner element-containing gene expression does not differ significantly from that of all genes, P = 0.59. For A. anophagefferens, introner element-containing gene expression is slightly lower than that of all genes, P = 0.041.
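The box plot convention described here (median, quartiles, and whiskers reaching data within 1.5 times the interquartile range of the box) can be computed as in the sketch below. The function name is hypothetical and the "inclusive" quartile method is an assumption; plotting packages differ in how they interpolate quartiles, so values may deviate slightly from the published figure.

```python
import statistics

def box_plot_summary(values):
    """Return the median, quartiles and whisker ends for a box plot.

    Whiskers extend to the most extreme data points that lie within
    1.5 times the interquartile range (IQR) of the box edges.
    """
    data = sorted(values)
    # Quartile interpolation method is an assumption, not taken from the source.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(v for v in data if v >= low_fence),
        "whisker_high": max(v for v in data if v <= high_fence),
    }
```

Points beyond the whiskers are the outliers that a box plot would draw individually or omit.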
- Supplementary Information
This file contains a Supplementary Discussion regarding TSD and TIR identification and splice site orientation in introner elements.