LINE-1-like retrotransposons contribute to RNA-based gene duplication in dicots

RNA-based duplicated genes or functional retrocopies (retrogenes) are known to drive phenotypic evolution. Retrogenes emerge via retroposition, which is mainly mediated by long interspersed nuclear element 1 (LINE-1 or L1) retrotransposons in mammals. By contrast, long terminal repeat (LTR) retrotransposons appear to be the major player in plants, although an L1-like mechanism has also been hypothesized to be involved in retroposition. We tested this hypothesis by searching for young retrocopies, as these still retain the sequence features associated with the underlying retroposition mechanism. Specifically, we identified polymorphic retrocopies (retroCNVs) by analyzing public Arabidopsis (Arabidopsis thaliana) resequencing data. Furthermore, we searched for recently originated retrocopies encoded by the reference genome of Arabidopsis and Manihot esculenta. Across these two datasets, we found cases with L1-like hallmarks, namely, the expected target site sequence, a polyA tail and target site duplications. Such data suggest that an L1-like mechanism could operate in plants, especially dicots.


Identification and assembly of retroCNVs
We identified retroCNVs in Arabidopsis (Arabidopsis thaliana) by integrating several published retroCNV identification strategies (Fig. 2) 1,2 . Specifically, we downloaded the fastq-format Illumina sequencing data of 18 Arabidopsis accessions, with sequencing coverage ranging from 27-fold to 36-fold and read length ranging from 36 to 51 bp 3 .
We split all paired-end reads into single reads to retain cases of retroposition in which one read is mapped to the parental locus and its partner is mapped to the insertion site. We also retrieved the reference genome and gene annotations from TAIR10 4,5 .
We mapped reads against the exon-exon junction library (200 bp) using NovoAlign (version 2.08, www.novocraft.com) with the parameters "-o Softclip -r All 10 -s 1", which allows up to the top 10 alignment hits for one read and enables automatic trimming at two ends. For reads with more than one hit, we kept reads with differences in their alignment scores of greater than 5 and with identities greater than 5% between the top 1 and 2 alignment positions. In this way, we excluded all multimapping reads, i.e., reads that mapped to more than one location equally well. Then, we pulled out reads that uniquely mapped to the exon-exon junctions after removing reads generated by PCR duplication using Picard (picard.sourceforge.net). We called an event retroposition or intron loss if there were at least three reads from one accession spanning the exon-exon junctions with an overhang (≥10 bp). The introns of parental genes are present during retroposition but are absent during intron loss.
Thus, in the case of retroposition, we expected to detect reads that mapped to both exon-intron and intron-exon junctions, suggesting the existence of parental introns.
On this basis, we further selected cases in which at least three reads spanned the exonintron or intron-exon junctions with an overhang (≥10 bp). After applying these two filtering steps, we identified candidate retroCNVs in Arabidopsis accessions.
In contrast to previous efforts 1,2 , we next performed targeted de novo assembly. We collected reads that were uniquely mapped to exon-exon junctions and the 500-bp flanking regions of the parental genes. We assembled these reads using MIRA 6 , which is able to recognize differences between parental and retro copies, including SNPs and intron deletions. Afterwards, we retrieved exon-exon junction-spanning sequences with 20 bp on each side and used BLAT 7 to align these 40-bp sequences against contigs assembled by MIRA. Given the output, we further extracted candidate retroCNVs that covered the junctions with both identity and coverage higher than 95%. Then, we mapped all retroCNVs against the genome using BLAT on the Arabidopsis Genome Browser (epigenomics.mcdb.ucla.edu) and retained only those cases with the hallmark of intron loss compared with the presumed parental gene. Finally, for these retroCNVs, we implemented the PRICE package 8 to extend the flanking regions by searching and merging reads aligned to the contigs generated by MIRA. We were able to assemble the flanking regions of four retroCNVs. For each retroCNV, we took the longest contig (median length, 528 bp) in one accession as the template for downstream mechanistic analyses. All the reads mapping to retroCNVs and their insert sites, especially those spanning exon-exon junctions, and the breaking points between retroCNVs and flanking regions are shown in Fig. S2, S3, S5 and S7 (Panels D-F).

RetroCNV genotyping in Arabidopsis accessions
Next, we employed a conservative approach to determine the presence/absence of the four retroCNVs across the 18 accessions (Fig. S12). Specifically, we mapped the reads of the different accessions onto the reference genome as well as the template retroCNVs using NovoAlign (www.novocraft.com) with the aforementioned parameters "-r All 10 -s 1". We extracted the reads that were uniquely mapped to retroCNVs and used BLAT 7 to align these reads against the retroCNVs and the corresponding parental genes. We retained reads that mapped to the retroCNVs with at least 95% coverage and that had higher identity than when mapped to the parental genes. If at least two reads mapped better to the retroCNVs, we again assembled these reads into longer contigs using MIRA 6 . We then searched the contigs against the -5 -retroCNVs and the reference genome using BLAT 7 . If the retroCNV produced a better alignment, we classified it as presence in this accession.
In parallel, given that the 3' breakpoint may be disturbed by polyA sequences, we extended the 5' breakpoint of the retroCNV of interest by 50 bp and used the spanning 100-bp sequence to search reads in the 18 accessions. We then assembled the mapped reads using MIRA 6 . If we generated a contig that spanned the 5' breakpoint, we classified this as the presence of the insertion site.
Overall, we conservatively assigned the presence of retroCNVs across accessions by requiring both the presence of the retroposed region and the corresponding 5' breakpoint sequences. In other words, the population frequency in Table S1 represents a lower-bound estimation.

LTR/LINE retrotransposon inference
To infer whether the flanking regions of retroCNVs and recently evolved retrocopies encoded by the reference genomes consist of LTR/LINE retrotransposons, we used RepeatMasker (www.repeatmasker.org) to search a customized repeat library.
Considering the issue of incomplete annotation, this library not only included known plant retrotransposons listed in Repbase 9,10 and TIGR 11 but also covered retrotransposons predicted in the reference genomes of Arabidopsis and the cassava, Manihot esculenta (M. esculenta) (version 4.1) 12 by two de novo strategies based on MGESCan-LTR 13 and MGEScan-nonLTR 14 . We clustered all repeat elements via CD-HIT 15 , with an identity cutoff of 80%, considering that in MGESCan-LTR, the identity cutoff of the upstream and downstream long terminal repeat was set at 80% 13 . Based on our comprehensive and non-redundant repeat library, we scanned for repeat elements using RepeatMasker with the parameter "-s" to increase the sensitivity.

Identification of newly evolved retrocopies in dicot reference genomes
As in the case of retroCNVs, we again searched for the hallmark of intron loss and identified retrocopies encoded by Arabidopsis and the M. esculenta (version 4.1). For both species, we first extracted exon-exon junction sequences by extending 20 bp beyond the junction. Then, we aligned these sequences against the reference genome via BLAT. We extracted the possible segments of candidate retrocopies by only retaining alignments with a single block, suggesting an intron loss event. Moreover, because we were interested in recently evolved retrocopies, we required that the alignment showed high identity (≥95%) and high coverage (≥95%). Given these short alignments indicating intron loss, we further extended the boundaries of each candidate retrocopy and its corresponding parental gene by 1,000 bp and aligned them using BLAT. We reiterated this step until we could no longer extend the alignment.
After this step, we inferred the breakpoint of the retrocopy and the insertion site.
For all candidate retrocopies, we performed the following two filters. First, there were cases with one parental gene and multiple paralogous retrocopies that may represent secondary DNA-level duplications of the first retrocopy. Thus, we only kept the retrocopy with the highest similarity to the parental gene (i.e., the one with the highest BLAT score). Second, we manually checked the candidate retrocopies by mapping them onto the reference genome using BLAT on the Arabidopsis Genome Browser (epigenomics.mcdb.ucla.edu) and only retained the entries that showed spliced alignment against the corresponding parental genes. We genotyped 10 recently evolved retrocopies in the 18 Arabidopsis accessions by searching these retrocopies in the assembled genomes of the 18 accessions 3 via BLAT 7 .  Tables   Table S1. Genotyping of retroCNVs across 18 accessions.
The name of each retroCNV is represented by "RC_" (short for retroCNV) followed by its parental gene accession. "Y" denotes presence, whereas "N" denotes absence.

RetroCNV
Bur  Table S3. Newly evolved retrocopies encoded by the Arabidopsis reference genome.
The convention largely follows Table 1 in the main text. "Reported" denotes whether the retrocopy is identified previously [17][18][19] . Only the retrocopy derived from the parental gene AT5G37150.1 is associated with LTR retrotransposon at two sides. Another three cases associated with a polyA tail but not LTR retrotransposons were possibly driven by an L1-like mechanism. The remaining six cases are associated with either a previously undescribed mechanism or with an LTR or L1-like mechanism in which the sequence signatures have already degenerated over evolutionary time.
Thus, we called these cases "uncertain" in the last column. After aligning the sequences between the retrocopy and its corresponding parental gene on the basis of reading frame of the later, we calculated non-synonymous substitution rate (Ka) and synonymous substitution rate (Ks) via the codeml program in the PAML package 20 . We then performed the likelihood ratio test on whether Ka/Ks is significantly smaller than 0.5 21 .
Such a conservative criteria ensured that the retrocopy must be under constraint even the parental locus is neutrally evolving with a Ka/Ks of 1.
-10 -  The name of each retrocopy is represented by "R_" (short for recently evolved retrocopy) and its parental gene. "Y" denotes presence, whereas "N" denotes absence.
The conventions of Panels A, B and C follow Fig. S3. Panel D follows Fig. 1. The short reads in C are from the accession "No-0".