MicroRNA retrocopies generated via L1-mediated retrotransposition in placental mammals help to reveal how their parental genes were transcribed

In mammalian genomes, most retrocopies emerged via the L1 retrotransposition machinery. The hallmarks of an L1-mediated retrocopy, i.e., intronlessness, a 3′ poly-A tail, and target site duplications (TSDs) at both ends, have frequently been used to identify retrotransposition events. However, most previous studies considered only protein-coding genes as possible parental sources, so only a few retrocopies derived from non-coding genes have been reported. Remarkably, none of them was derived from a microRNA gene. In this study, we found several retrocopies generated from the mir-302–367 cluster gene (MIR302CHG) and identified a novel alternatively spliced exon encoding mir-302a. The other recognized microRNA retrotransposition events are primate-specific, with mir-373 and mir-498 as their parental genes. The 3′ poly-A tracts of these two retrocopy groups are directly attached to the end of the regions homologous to the microRNA precursors, which suggests that their parental transcripts might alternatively terminate at the end of mir-373 and mir-498. All three parental microRNAs are highly expressed in specific tissues with elevated retrotransposon activity, such as embryonic stem cells and the placenta. This might explain why the first microRNA retrocopies to be found were derived from these three microRNA genes.
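As an illustration of two of the hallmarks named above, the following minimal Python sketch measures a terminal poly-A tract and looks for a TSD flanking a candidate insertion. It is not the procedure used in this study: the function names, the exact-match criterion, and the length bounds (`min_len`, `max_len`) are assumptions chosen for the example, and real TSDs often carry mismatches that this sketch would miss.

```python
def poly_a_tail_length(insert_seq: str) -> int:
    """Length of the uninterrupted run of A's at the 3' end of the insert."""
    seq = insert_seq.upper()
    return len(seq) - len(seq.rstrip("A"))


def find_tsd(upstream: str, downstream: str,
             min_len: int = 5, max_len: int = 20) -> str:
    """Longest exact repeat shared by the end of the upstream flank and the
    start of the downstream flank (hypothetical, simplified TSD criterion).

    Returns an empty string when no repeat of at least min_len is found.
    """
    up, down = upstream.upper(), downstream.upper()
    longest = min(max_len, len(up), len(down))
    # Try the longest candidate first so the first hit is the longest TSD.
    for k in range(longest, min_len - 1, -1):
        if up[-k:] == down[:k]:
            return up[-k:]
    return ""


if __name__ == "__main__":
    # Toy sequences invented for this example, not taken from the MSA files.
    upstream = "GGCTAAGTCCATG"
    insert = "ACGTACGTAAAAAAAA"
    downstream = "AAGTCCATGTTCG"
    print("poly-A length:", poly_a_tail_length(insert))
    print("TSD:", find_tsd(upstream, downstream))
```

A candidate would additionally need to lack the introns of its parental gene, which requires an alignment against the parental locus and is outside the scope of this sketch.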


Supplementary figures
Table S1. Genome assemblies used for 45 mammalian species in this study.

Table S2. The genomic information for each identified retrocopy and the independent transposable element insertion events.

Fig. S1. The typical hallmarks of L1-mediated retrotransposition recognized in (A) MIR302CHG retrocopies, (B) mir-373 retrocopies and (C) mir-498 retrocopies. This figure is composed of screenshots cropped from the three supplementary MSA files. The black rectangles indicate the identified TSDs for each retrotransposition event except for T2S5, which was accompanied by a possible L1-mediated deletion without the generation of TSDs. The character "D" in the MSA represents removed insertions, simple repeats, or omitted fragments without specifying their actual sequence lengths. The character "B" in the MSA separates two discontinuous regions resulting from genomic rearrangements. Boundaries of the inserted retrocopies are denoted by the characters "SS."

Fig. S2. The sequence comparison of exon E302 and its flanking splice sites between the parental MIR302CHG genes (PG) and the derived retrocopies among various mammalian species. The exon-exon junctions in both types of retrocopies are consistent with the splice sites in their parental genes, supporting the existence of the newly identified exon E302. The black and red triangles and bold lines indicate the junctions of previously annotated and newly proposed alternative splice sites, respectively. The two alternative splice sites SS1 and SS2 located at the 3' end of exon E1 described in Fig. 1 are also specified. The nucleotides with a green background represent the canonical GT-AG intronic splice sites.

Fig. S3. The transcripts containing exon E302 were verified by NGS paired-end reads that span the junction between E1 and E302 or the junction between E302 and E3. Six paired-end reads from four runs (representing four samples) of three independent studies were selected and displayed as examples. All these reads align perfectly to the parental E1, E302 and E3 regions.

Fig. S4. The sequence comparison of exon E1 and its flanking regions between the parental MIR302CHG genes (PG) and four derived retrocopies among various mammalian species. The transcription start site (TSS, blue bold lines with arrows) and the splice site SS1 (the black inverted triangle and black bold lines) were validated and annotated in the current human MIR302CHG gene. In contrast, two alternative transcription start sites were inferred from the retrocopies T1S4 and T2S6 (yellow bold lines with arrows) in this study. Similarly, the alternative splice site SS2 (the red inverted triangle and the red bold line) was inferred from most of our identified retrocopies. Apart from the four retrocopies displayed, all other (omitted) retrocopies were derived from splice site SS2 and the annotated TSS. The nucleotides with a green background indicate the canonical intronic splice sites. Note that SS1 and SS2 may currently co-exist (both shown with a green background) in many mammalian species. In contrast, in humans, treeshrews and Glires, the SS2 splice site is mutated and disabled, so SS1 is presumably used obligatorily in these lineages.

Fig. S5. Possible core promoter elements found in the exon E1 and upstream regions of T1S1, T1S2 and T1S3 retrocopies. Species are colored according to their superorders: Euarchontoglires (blue), Laurasiatheria (green), Afrotheria (red), and Xenarthra (orange). The TATA-box and mammalian initiator patterns were taken from Smale and Kadonaga (ref. 1) and are highlighted on the alignment with orange and purple backgrounds, respectively.

Fig. S6. The predicted secondary structures of the mir-302a precursor homologous regions of (A) T1S1, (B) T1S2, (C) T1S3 and the bat-specific T1S4 retrocopies, (D) the entire mir-373 retrocopy and (E) the entire mir-498 retrocopy, as predicted with the RNAfold Web Server. In some cases, the inspected sequences were probably disrupted by insertions of transposable elements (TEs) or accumulated simple repeats. Where the hairpin structure could be regenerated after removing these repetitive elements, the re-evaluated result is shown instead and marked with an asterisk (*). The "(rc)" tag indicates that the prediction and evaluation results were generated from the reverse complement sequence.

Aye-aye: secondary structure prediction was not performed because of a long deletion spanning almost the entire region.

Supplementary MSA files
Three multiple sequence alignment (MSA) files in FASTA format are provided for the MIR302CHG, mir-373 and mir-498 retrocopies. The sequences of the inserted retrocopies are aligned with the corresponding regions of their parental genes. The flanking regions of each retrocopy are also displayed and aligned with the homologous regions of species lacking that retrotransposition event.
The identified TSDs for each retrotransposition event are shown in a separate row. The character "D" in the MSA represents removed insertions, simple repeats, or omitted fragments without specifying their actual sequence lengths. The character "B" in the MSA separates two discontinuous regions resulting from genomic rearrangements within a single scaffold (e.g., T1S1 in hedgehogs) or between different scaffolds (e.g., T1S3 in cats). The type of retrocopy (T1S1~T2S8, 373-S1~S2 and 498-S1~S4) is specified, as are sequences obtained from the parental genes (PG), DNA transposon insertions (DI), independent SINE insertions (ISI), and species without any insertion (WO). The characters "SS" denote the boundaries of the inserted retrocopies, of other independent transposable elements inserted at or near the investigated retrotransposition sites, and of the parental mir-373 and mir-498 microRNA precursors, as well as the transcription start sites (TSS) of the parental MIR302CHG genes. The single character "S" indicates splice junctions. Note that the aardvark E3 is located within an unassembled region ("N") of the scaffold; its position was inferred from the conserved downstream sequences (the region outside the MSA).
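For readers who wish to process the MSA files programmatically, the marker conventions described above could be located with a short Python sketch such as the following. It assumes that "D", "B", "S" and "SS" appear as literal uppercase characters inside the aligned rows and that the rows do not use these letters as IUPAC ambiguity codes; the record names in the example are hypothetical, and the actual header layout of the supplementary files is not assumed here.

```python
import re
from typing import Dict, Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# "SS" is listed first so that a boundary marker is not read as two "S" marks.
MARKER = re.compile(r"SS|S|D|B")


def marker_columns(row: str) -> Dict[str, List[int]]:
    """Return the 0-based alignment columns at which each marker starts."""
    found: Dict[str, List[int]] = {"SS": [], "S": [], "D": [], "B": []}
    for match in MARKER.finditer(row):
        found[match.group()].append(match.start())
    return found


if __name__ == "__main__":
    # Toy alignment invented for illustration; not taken from the real files.
    example = ">PG_example\nACGTSSACGT\nDACGSTT\n>T1S1_example\nAC--BACGT\n"
    for name, row in read_fasta(example):
        print(name, marker_columns(row))
```

Because the markers occupy alignment columns, any downstream coordinate calculation would need to skip them together with the gap characters.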