Sense-oriented AluYRa1 elements provide a lineage-specific transcription environment for polyadenylation

Transposable elements cause alternative splicing (AS) in different ways, contributing to transcript diversification. Alternative polyadenylation (APA), one of the AS events, is related to the generation of mRNA isoforms in 70% of human genes. In this study, we tried to investigate AluYRa1s located at the terminal region of cynomolgus monkey genes, utilizing both computational analysis and molecular experimentation. We found that ten genes had AluYRa1 at their 3′ end, and nine of these AluYRa1s were sense-oriented. Furthermore, in seven genes, AluYRa1s were expected to have a similar consensus sequence for polyadenylation cleavage. Additional computational analysis using the annotation files from the UCSC database showed that AluYRa1 was more involved in polyadenylation than in open reading frame exon splicing. To examine the extent of AluYRa1 involvement in polyadenylation, RNA-seq data from 30 normal cynomolgus monkeys were analyzed using TAPAS, a recently devised software that detects all the promising polyadenylation sites including APA sites. We observed that approximately 74% of possible polyadenylation sites in the analyzed genes were provided by sense-oriented AluYRa1. In conclusion, AluYRa1 is an Old-World monkey-specific TE, and its sense-oriented insertion at the 3′UTR region tends to provide a favorable environment for polyadenylation, diversifying gene transcripts.

www.nature.com/scientificreports/ enzymatic degradation or stored in a translationally dormant state 17 . In recent studies, the location where polyadenylation occurs, at 3′UTR or upstream of the 3′ of most exons, is proven to be crucial because it correlates with protein expression and localization 14 .
The Alu element, one of the primate-specific SINEs, was evolutionarily derived from 7SL RNA ~ 65 million years ago 18,19 . It is the most evolutionarily successful element, with more than 1 million copies, equivalent to 11% of the human genome 20,21 . This element is about 300 bases long, formed from two similar monomers, called "leftarm" and "right-arm" 18,21 . Antisense Alu in the genic region tends to provide potential splicing donor (GT) and acceptor (AG) sites, creating a new exon for the transcript 10,22 . This Alu-involved AS event accounts for ~ 5% of all internal alternative exons in the human genome 23 . Moreover, recent studies showed that Alu insertion within or near the genes in the sense orientation generates new cleavage and polyadenylation sites 24 and establishes a dynamic polyadenylation signal (PAS) in the gene 25 . AluYRa1, one of the 14 different Old-World monkey (OWM)-specific AluY subfamilies, emerged after the hominoid-OWM divergence 26 . In our brief in silico analysis, we found several genes that contain sense-oriented AluYRa1 at their 3′UTR end in the cynomolgus macaque genome. We focused on macaque monkeys because they are considered a crucial animal model for biomedical research owing to their behavioral, physiological, and genetic similarities to humans 27,28 . They also have ~ 50% of TEs on the genome, similar to that in the human genome 29 .
In the present study, we aimed to analyze all AluYRa1 elements in the cynomolgus macaque. Molecular experiments were performed to validate the AluYRa1 insertion in cynomolgus macaques and its transcript expression, and computational analyses were carried out to investigate its characteristics. The results of this study might provide essential clues for further biomedical studies in humans.

Results
Computational analysis of AluYRa1. Several AluYRa1s, located at the 3′UTR-end of the transcripts, were identified in our brief in silico analysis on the cynomolgus monkey. Before the investigation, we computationally analyzed how many AluYRa1s are located at the 3′UTR-end. First, we counted the number of genes that have a 3′UTR end located at AluYRa1 via computational calculations. We found that the transcripts of seven genes (TK2, PEX26, GTPBP4, IRF9, BLOC1S6, UBE2B, PAICS) had 3′UTR end AluYRa1 sequences. When we expanded the search to the SINE family, 72 elements were located at the 3′UTR-end of each gene; this is higher than the LINE and LTR family that had 36 and 16, respectively. Among 72 SINEs, there were 56 Alus, 13 MIRs, 2 FLAMs, and 1 FRAM. Among 56 Alus, 11 belonged to the AluJ family, 18 to the AluY family, and 27 to the AluS family. To be specific, eight were AluSz, seven were AluSx1, and seven were AluYRa1. AluYRa1 was one of the most widely available Alu elements associated with transcript termination, as observed from our results. When we further considered the accession number XM, which refers to the predicted transcript, three more genes (CMBL, SLC16A14, PDK4) were identified to have the AluYRa1 at their 3′UTR-end. To sum up, 10 AluYRa1s (seven in registered genes and three in predicted genes) were located at the terminal region of the gene transcript of cynomolgus macaque (Fig. 1).
Comparative studies on gene structure following AluYRa1 insertion. Ten gene structures in three genomes, namely human, rhesus macaque, and cynomolgus macaque genomes, were comparatively analyzed ( Fig. 2 and Supplementary Fig. S1). As AluYRa1 is Old World monkey-specific 26 , it does not exist in the human genome. We focused mostly on the structure of the 3′UTR to examine the AluYRa1 insertion at the 3′UTR-end and determined the difference in the 3′UTR length between human and macaque lineage monkeys. In 8 (TK2, PDK4, GTPBP4, PEX26, CMBL, BLOC1S6, UBE2B, PAICS) out of 10 genes, the 3′UTR length of the cynomolgus macaque was shorter than that of humans ( Fig. 2 and Supplementary Fig. S1). For instance, the 3′UTR length of PDK4, which had an AluYRa1 sequence, was 1412 bp in the cynomolgus macaque but 2283 bp in humans (Fig. 2). In the case of TK2, which had an AluYRa1, the 3′UTR length was 4064 bp in humans but 1897 bp in the cynomolgus macaque. Additionally, in four rhesus macaque genes (PDK4, GTPBP4, CMBL, IRF9) that were registered in the database, the transcript terminated at or near the AluYRa1 insertion site ( Fig. 2 and Supplementary  Fig. S1). Meanwhile, in the case of two genes (SLC16A14, IRF9), the 3′UTR length was similar among the three species, regardless of the AluYRa1 insertion.
Evolutionary analysis of AluYRa1 integration and polymorphism. The integration time of target AluYRa1s was traced using genomic DNA samples from nine primates: hominoids (human, chimpanzee, gorilla), OWMs (rhesus monkey, cynomolgus monkey, African green monkey), New World monkeys (NWMs; marmoset, squirrel monkey) and prosimians (ring-tailed monkey). Genomic PCR and the subsequent gene cloning step were performed to confirm and analyze AluYRa1 sequences. The alignment of these sequences revealed that the approximate integration time was different depending on the genes. For six genes (GTBPB4, PEX26, CMBL, SLC16A14, PAICS, UBE2B), AluYRa1 was located at the terminal region of each gene in two macaques, the cynomolgus monkey, and the rhesus monkey. For two other genes (IRF9, BLOC1S6), however, the same Alu was located at the terminal region of the genes in two macaques and the African green monkey. This evolutionary analysis indicates that the AluYRa1 in the former six genes were integrated into the 3′UTR-end of the macaque genome after it diverged from the African green monkey approximately 10 million years ago 30 . The one in the latter two genes was inserted into the 3′UTR-end between 10 and 25 million years ago after OWM diverged from hominoids (Fig. 3).
In the genomic PCR analysis, polymorphic AluYRa1 was detected in two genes, TK2 and PDK4 (Fig. 2). For TK2, the first round of genomic PCR did not show any insertion of AluYRa1 in the cynomolgus macaque, although the genomic sequence registered in the database had it. Therefore, we expanded the number of samples to 10  www.nature.com/scientificreports/ Fig. S2). We found that, of 10 samples, four had AluYRa1-inserted allele, one had AluYRa1-vacant allele, and five had both inserted and vacant alleles, which meant that the target AluYRa1 in TK2 was polymorphic. Similarly, for PDK4, we could not detect AluYRa1-inserted allele in the first round of genomic PCR, and hence we expanded the number of samples to 10. It turned out that two samples had a vacant allele, three had an inserted allele, and five had both alleles. These also represented the polymorphism of target AluYRa1 in PDK4. The results indicate that AluYRa1 integrated into TK2 after it was diverged from the rhesus macaque and integrated into PDK4 before that event (Fig. 3).
Expression pattern of sense-oriented 3′UTR-end AluYRa1. Reverse transcriptase PCR (RT-PCR) amplification was conducted to check if target AluYRa1 was expressed as part of the transcript, and also observe its expression patterns. cDNA from 11 tissues (cerebrum, cerebellum, heart, liver, lung, large intestine, pancreas, kidney, uterus, testis, stomach), extracted from cynomolgus monkey, were used for RT-PCR. The results showed that the target AluYRa1 was ubiquitously expressed in various tissues of all analyzed genes (Fig. 4). The patterns of expression level specifically displayed that the targets of three genes (PEX26, CMBL, BLOC1S6) were strongly expressed in all eleven tissues, and the targets of all analyzed ten genes were expressed in cerebrum and heart.

Structural analysis of sense-oriented AluYRa1s at the 3′UTR-end. To clarify the pattern by which
AluYRa1 was integrated into the genome, we created the Python script that we named it a locater, which indicated where all repeated elements were located in the reference gene sequence. We ran the locator and obtained the annotated file as an output. We first sorted the information on all Alus in the output file before specific AluYRa1 analysis. It revealed that Alus were located at 3′ UTR-end more frequently than other regions (5′UTR start, ORF start, inside exon, exon splicing, and ORF end) in all the three species (Fig. 5a). Besides, senseoriented Alus were almost three times more frequently observed than antisense-oriented ones in the gene termination region of cynomolgus monkeys and humans (Fig. 5a). On the contrary, antisense-oriented Alus were found in the exon splicing junction of human reference genes six times more than the sense-oriented ones, and similar patterns were observed in cynomolgus and rhesus monkeys (Fig. 5a). When we focused on specific Alu subfamilies, AluJ, AluS, and AluY, we could see similar patterns, and the AluS family was proportionally the most abundant at the 3′UTR-end, especially in humans ( Supplementary Fig. S3). Next, we did the same sorting for AluYRa1, and we realized that AluYRa1s were located at the 3′UTR-end more than the exon splicing junction, as observed in the patterns of Alus (Fig. 5b). The ratio of sense-and antisense-oriented AluYRa1 was also similar to that of Alus (Fig. 5b). Figure 1. Identification of genes that had an AluYRa1-derived 3′UTR-end. www.nature.com/scientificreports/ We subsequently analyzed the detailed structure of AluYRa1-inserted sections in the genes using the information from the output file described above. Therefore, 14 genes were identified to have AluYRa1 at their 3′ and 5′ UTR regions (Fig. 6). Two genes (ZNF575, KCNIP3) had AluYRa1 in the middle UTR, which was not relevant to AS events during transcription. In the case of PRELID3B, antisense-oriented AluYRa1 was intriguingly involved in the generation of both ends of exon number 1, which was associated with the start of transcription. In JOSD1, the AluYRa1 sequence was near the 3′UTR termination region, but it did not overlap. Finally, the remaining 10 genes had their AluYRa1 sequences overlapped with the 3′UTR-end (Fig. 6). Among these 10 genes, the 3′UTRends of UBE2B, BLOC1S6, and PAICS were located on the right-arm, A-rich region, and the left-arm of AluYRa1, respectively. In the case of seven genes (TK2, GTPBP4, PEX26, CMBL, SLC16A14, IRF9, PDK4), sense-oriented AluYRa1 was located at the end of the gene transcript sequence, and precisely that location was at or near the poly-A tail of AluYRa1 (Fig. 6). They had a similar cleavage sequence (CA) based on the analyzed sequence 31,32 (Supplementary Tables S1 and S2) 33 .
AluYRa1 is the element that gives rise to possible APA sites. AluYRa1 is located at the terminal region of 10 gene transcripts of cynomolgus monkey, as mentioned above, and it tended to be overlapped with the sequences that were associated with the polyadenylation cleavage site. Therefore, we were curious to know how closely AluYRa1 was related to the terminal sequence of the transcript. To do this, we used TAPAS software which is the tool for detecting such alternative (or all) polyadenylation sites within a gene from RNAseq data 34 . We downloaded relevant reference annotation file (refFlat.txt) and 30 individuals of RNA sequencing data as input files to run TAPAS. We finally achieved reliable output files that showed information on the polyadenylation site of each gene. Of the seven genes that had the sense-oriented AluYRa1 in which its poly-A region overlapped with the 3′UTR-end, 4 (PEX26, TK2, IRF9, GTPBP4) were included in the reference annotation file (refFlat.txt), which means they were qualified for TAPAS analysis of this study. The results data revealed that approximately 74% of polyadenylation site including APA sites were located on the sense-oriented AluYRa1 sequences (Fig. 7). To be specific, 40% of them were located on the poly-A tail and another 34% were on the A rich region of the sense-oriented AluYRa1 (Fig. 7).   25 . Since now we have more reliable reference gene transcript sequences over various species, which are registered at specific databases, we mainly focused on macaque monkeys. The computational analysis of the cynomolgus monkey genes along with the experimental validation in this study revealed that Alu elements were the majority of TEs that were located mostly at the terminal region of the reference gene transcript compared to the other TEs. AluYRa1 was the second largest element among these Alus. Therefore, we would insist that Alu elements were the most influential TEs on polyadenylation than any other TEs in the cynomolgus monkey and AluYRa1 was also one of the most important elements. In addition, we found that more than 80% of both 3′UTR-end located Alus and AluYRa1 had the same orientation as the genes they were inserted into. This result indicated the possible correlation between sense-oriented Alus and more powerful PAS or sensitive cleavage sites for the mRNA sequence termination mechanism. The phenomenon of AluYRa1 insertion providing different polyadenylation cleavage sites depending on the species could have an impact on the species-specific 3′UTR length. In 2012, Clen et al. proposed that 3′UTR extension is correlated with organismal complexity in animal evolution because a longer 3′UTR contains more putative targets for microRNAs, which is related to morphological complexity 35,36 . In the current study, we identified eight genes in the cynomolgus monkey that underwent 3′UTR shortening, compared to humans, via AluYRa1 insertion. Therefore, we propose that AluYRa1 might play evolutionary roles between humans and cynomolgus macaques in terms of different polyadenylation patterns.

HU_AluYRa1
Sense-oriented AluYRa1 is more associated with polyadenylation than exon splicing. In 2015, Tajnik et al. reported that intergenic Alu exonization contributes to both AS and polyadenylation in the upstream genes 37 . The tendency of Alu insertion in the antisense orientation, providing potential 5′SS or 3′SS, has been studied more than the alternative polyadenylation of the Alu sequence 38 . This is because the study of the exon splicing mechanism related to the diseases is essential, and the research protocol for exon splicing is simple. However, polyadenylation study associated with the Alu element is an increasing concern among genome biologists. According to our results, Alus were located at the 3′UTR-end more than the exon splicing junction, implying that Alu insertion was more influential in the polyadenylation than exon splicing. We also verified that Alus prompting polyadenylation tended to be more sense-oriented, whereas those involved in exon splicing were antisense-oriented.
With these results, we could predict that Alus were possible causative elements for polyadenylation. To make our results more reliable, we conducted RNA transcriptome sequencing on 30 cynomolgus monkeys and checked the expression patterns of their 3′UTR-end using the TAPAS software. The result from the run showed that 74%  www.nature.com/scientificreports/ of predicted polyadenylation sites and APA sites were located at the inserted AluYRa1 sequence, showing that AluYRa1 elements were correlated with polyadenylation phenomenon including APA. Hence, we proposed that AluYRa1 insertion at the 3′UTR generated various APA sites contributing to the diversification of transcripts. The computational analysis in this study also represented that the same phenomenon was likely to occur with many other Alu elements because our results showed that numerous Alus were located at the 3′UTR-end. 3′UTRs have been reported to adjust mRNA stability through the regulation of gene expression via RNA binding protein, determining the protein levels 36,39 . Moreover, the localization element located on the 3′UTRs mediates mRNA localization for translation 40 . In the same manner, various transcripts with different 3′UTR lengths generated by AluYRa1 insertion could alter gene expression and mRNA localization in a tissue-specific or species-specific way. Therefore, it is possible to conclude that these changes affect the phenotypic function of the organisms or lead to diseases such as cancer 41 .

Conclusion
From the above pieces of evidence, AluYRa1, an Old-World monkey-specific TE, tends to provide appropriate conditions for polyadenylation when it is sense-oriented, contributing to the diversification of gene transcripts. Further computational analyses also indicate that Alus tends to follow the same phenomenon as AluYRa1. The results of this study, such as Alu-mediated polyadenylation pattern and cynomolgus macaque-specific AluYRa1 polymorphism in the several genes, might be a valuable source for future non-human primate research. Further, an in-depth analysis of TE distribution patterns between non-human primates and humans would lead to therapeutic advances in biomedical research and evolutionary understanding of primate radiation.  Table S3). The GAPDH gene, a standard control, was analyzed using specific primer pairs (S: 5′-GAA ATC CCA TCA CCA TCT TCC AGG-3′, AS: 5′-GAG CCC CAG CCT TCT CCA TG-3).
Molecular cloning and sequencing procedures. PCR products were separated on 1.2-1.5% agarose gels and purified using the Gel SV Extraction kit (GeneAll, Seoul, Korea). The purified DNA was ligated into a T&A Cloning Vector (RBC Bioscience), then transformed into ECOS 101 (Yeastern Biotech, New Taipei City, Taipei) competent cells (strain: DH5α), which were then grown on agar plates containing 100 µg/ml of ampicillin. The cloned vectors were isolated using the Hybrid-Q Plasmid Rapidprep kit (GeneAll, Seoul, Korea). PCR product sequencing was performed by a commercial sequencing company (Macrogen Inc, Seoul, Korea).

3′UTR-end-located TE analysis.
We performed basic Python coding to build a locater that showed the specific location of repeated sequences in certain organisms. The run of this program yields detailed information about the location pattern, such as the exon number, 5′UTR or 3′UTR region, and the direction ( Supplementary  Fig. S4). To run the locator, we downloaded two types of input files from the UCSC database. The first input file (ftp://hgdow nload .soe.ucsc.edu/golde nPath /macFa s5/datab ase/refGe ne.txt) was the reference gene annotation file that contained all the position information on transcript-start and -end, exon-start and -end, the number of exons, and their direction. The second input file (ftp://hgdow nload .soe.ucsc.edu/golde nPath /macFa s5/datab ase/ rmsk.txt) was the repeat masking information file that contained the genomic position of the repeated sequence, its name, and its direction. We ran this analysis with the two input files of the cynomolgus macaque, rhesus macaque, and humans. We sorted the output file of these 3 species to obtain all Alus or AluYRa1s that were www.nature.com/scientificreports/ located at the end of the registered gene transcript. This sorting was performed using Awk languages in Linux operating system. TAPAS analysis. TAPAS, a recently devised software, enabled us to detect which is the tool for detecting such alternative (or all) polyadenylation sites within a gene from RNAseq data 34 . To run TAPAS, two types of file are required; one reference annotation file (ftp://hgdow nload .soe.ucsc.edu/golde nPath /macFa s5/datab ase/ refFl at.txt), which was downloaded from the UCSC database, and another one called the read coverage file that was generated from the BAM file using samtools (the program that utilizes SAM file) command upon the Linux operating system. The BAM file (.bam), which contained information about the read sequence, was generated from the RNA sequencing data (unpublished) of 30 cynomolgus macaques' blood samples ( Supplementary  Fig. S5) 42 . These healthy macaques were from the NPRC of the KRIBB. The reference annotation file and read coverage file were used as input files for TAPAS analyses. The output file consisted of six columns: gene name, chromosome name, the strand of the gene, detected APA sites, the abundance of those APA sites, and read count 34 . We sorted and tried to extract the information from the output file for the 10 genes that had 3′UTR-end AluYRa1s, but we could extract information on only seven genes because of the absence of information in the reference gene file for the remaining 3 genes (CMBL, SLC16A14, PDK4). Another group of three genes (BLOC1S6, UBE2B, PAICS) was also excluded from the TAPAS analysis because of their different direction or inappropriate gene structure. Therefore, we extracted APA information for four genes (TK2, GTPBP4, PEX26, IRF9) from the TAPAS output file and conducted the statistical analysis.  Figure 7. TAPAS analysis. The TAPAS software run on 30 cynomolgus macaques resulted in a certain output file, and we statistically analyzed this output file. Dots represent APA sites. Pink bar indicates the accumulation of the dots (APA sites) above, and the red bar represents the same, but the total red bars represent the total APA sites that are located on the AluYRa1 sequence. The total ratio is 74%. APA; alternative polyadenylation.