Detection of non-reference porcine endogenous retrovirus loci in the Vietnamese native pig genome

The Vietnamese native pig (VnP)—a porcine breed with a small body—has proven suitable as a biomedical animal model. Here, we demonstrate that, compared to other breeds, VnPs have fewer copies of porcine endogenous retroviruses (PERVs), which pose a risk for xenotransplantation of pig organs to humans. More specifically, we sought to characterize non-reference PERVs (nrPERVs) that were previously unidentified in the reference genome. To this end, we used whole-genome sequencing data to identify nrPERV loci with long terminal repeat (LTR) sequences in VnPs. RetroSeq was used to estimate nrPERV loci based on the most current porcine reference genome (Sscrofa11.1). LTRs were detected using de novo sequencing read assembly near the loci containing the target site duplication sequences in the inferred regions. A total of 21 non-reference LTR loci were identified and separated into two subtypes based on phylogenetic analysis. Moreover, PERVs within the detected LTR loci were identified, the presence of which was confirmed using conventional PCR and Sanger sequencing. These novel loci represent previously unknown PERVs as they have not been identified in the porcine reference genome. Thus, our RetroSeq method accurately detects novel PERV loci, and can be applied for development of a useful biomedical model.

Northern Vietnam is a center of pig domestication 1 . Vietnamese native pigs (VnPs) have acquired unique biological characteristics through a long history of breeding and fixation 2 . We recently identified 32 populations of indigenous VnP breeds that widely differ in appearance 3 . Using single-nucleotide polymorphism array and microsatellite marker data 4,5 , the genetic characteristics of the VnP populations were found to be closely correlated with the geographic distribution of their habitats. However, certain VnP populations had been hybridized with exotic breeds, such as Landrace, imported for industrialized pig farming. Meanwhile, a recent study revealed that VnP genomes have lower porcine endogenous retrovirus (PERV) copy numbers than those of Western pig breeds 6 .
Endogenous retroviruses (ERVs) are viral elements integrated into the host genome. An exogenous retrovirus infection integrates the viral RNA genome as a provirus into the host genome. When this virus infects germline cells, the provirus is transmitted to the offspring as an ERV 7 . The ERV incorporated into the pig genome is known as a PERV and contains the functional genes gag, pol, and env with two long terminal repeats (LTRs) at the 5′ and 3′ ends of each locus. Typically, in the ERV, the functional genes gag, pol, and env encode proteins involved in viral particle formation, reverse transcriptase, as well as the glycoprotein of the viral envelope, which is associated with adhesion and invasion of host cells, respectively 8 .
Recombination occurs between the 5′ and 3′ LTRs to form solo-LTRs 9 . LTRs contain internal promoters and regulatory sequences, such as transcription factor binding sites, that alter the expression of adjacent host genes 9 . In fact, gene regulation by ERVs and solo-LTRs can alter the human phenotype 10,11 . Thus, determining the loci of PERVs, solo-LTRs corresponding to PERVs, and their neighboring functional genes is necessary to predict possible influences of PERVs on the host genome. In this way, the domestication and distinctive characteristics of VnPs may be better understood. www.nature.com/scientificreports/ Moreover, PERVs are unfavorable genomic elements that can pose a significant risk for xenotransplantation of porcine tissues to human recipients. However, the copy number of PERV from VnPs, especially in the northern region of Vietnam was lower than that in other regions 6 . We collected pig samples from the Ban population in Yen Bai province (BanYB) to evaluate the PERV copy number. Depending on the PERV copies, this evaluation may help increase the gene modification success rate for producing PERV-free organisms. Moreover, due to their small body size, the organs of VnPs exhibit physiological similarities to those of humans 3 . As such, the VnP is expected to be representing a suitable biomedical model for producing xenotransplants for humans.
PERVs and solo-LTRs are dispersed in mutually similar sequences throughout the genome. It is, therefore, difficult to establish their precise genomic locations. Recently, whole-genome sequencing (WGS) data have enabled the examination of near-complete genomes for numerous species. To date, 20 PERVγ1 loci have been identified in swine using WGS. However, prior studies on PERV loci have been restricted to an earlier version (Sscrofa10.2) of the Duroc breed reference genome 12 . Therefore, non-reference PERVs (nrPERVs) may exist, however, do not appear in the reference genome. It is also possible that insertion loci (LTRs and PERV copy numbers) may vary among individuals and populations. In general, however, the reference genome was compiled without the repeat sequences (including ERVs) as integration of these elements has proven challenging. To address this issue, we have researched prior studies to identify a suitable method for detecting non-reference ERVs, as there is currently no established method for identifying nrPERVs in pigs.
The primary aim of the current study is to identify the nrPERV loci in VnPs to facilitate the establishment of a candidate biomedical model applicable for use in xenotransplantation. More specifically, we carried out quantitative real-time PCR (qRT-PCR) to analyze the PERV copy number as a simple measurement method. Collectively, we present the estimated numbers and loci of the LTRs and nrPERVs in the VnP genome.

Results
Sequencing data quality. We defined the three VnPs as VnP1, VnP2, and VnP3. qRT-PCR data indicated that the PERV pol gene copy numbers for VnP1, VnP2, and VnP3 were 7.3, 8.2, and 8.9, respectively. The PERV copy number identified with the RetroSeq method was then evaluated based on the qRT-PCR results. Data obtained by Illumina HiSeq X for these individuals are shown in Suppl. Table S2. The 150 bp paired-end reads exceeded, by more than 50-fold, the coverage of the entire genome for all three pigs. Trimming removed only 0.042% of the sequence reads and those remaining covered over forty-six-fold of the whole genome. The mapping results are shown in Suppl. Table S3. For each pig, > 94.2% of the paired-end reads mapped on the reference pig genome, whereas 1.63-1.69% did not. Moreover, 0.44-0.46% of the reads were singletons and were mapped on only one side. Non-proper pairs, such as discordant and split reads (Fig. 1), comprised 3.57-3.72% of the total genome. Sequence reads classified as non-proper pairs and singletons were used in the subsequent RetroSeq step. No quality control-failed reads were permitted to pass through the Burrows-Wheeler Alignment.
In silico identification of the non-reference LTR breakpoint. The analytical procedure is schematically represented in Fig. 2. In the RetroSeq "discover" step, we identified singleton and non-proper pairs among the read pairs supporting PERV-LTR (Fig. 2). We detected 8,884, 8,475, and 8,253 reads supporting LTRs in the genomic sequencing data of VnP1, VnP2, and VnP3, respectively. We then identified the LTR insertion loci (breakpoint) from the output of the RetroSeq "call" step. The candidate breakpoint was selected when the filter level was set to seven or eight, which is the range used in RetroSeq. A total of 220, 197, and 205 candidate LTR insertion loci were identified for VnP1, VnP2, and VnP3, respectively (Suppl . Table S4). We then used the merged Binary Alignment Map (BAM) data and Integrative Genomic Viewer (IGV) to detect 4-5 bp of target site duplication (TSD)-containing positions. TSDs were identified in loci on 23 autosomes and three X chromosomes. IGV mapping around the breakpoint showed that reads mapping on either the 5´ or 3´ end were broken at the TSD border (Fig. 2). Next, contigs were generated using a set of singleton and non-proper pair reads that mapped within 150 bp of the TSD and obtained them where one end matched the reference genome while the other did not (non-reference sequence). We then investigated whether these non-reference sequences matched the LTRs associated with PERV. Local de novo assembly generated the sequences containing the LTRs derived from all TSD-containing positions. The TSD sequences and the loci where the LTRs were detected in silico are shown in Table 1. The TSD sequences lacked any specific pattern. The lengths of the nrPERV-LTRs were in the range of 598-710 bp including their TSD sequences. The LTR sequences are shown in Suppl. Table S5. The contigs generated on the 5′ and 3′ ends of the TSD boundary were combined and the region between the TSDs was defined as the nrPERV-LTR sequence. Of the 26 LTRs, 21 were identified with TSDs at both the 5′ and 3′ ends. However, five LTRs were identified with only one TSD at either the 5′ or 3′ end. Therefore, we further analyzed the phylogenetics and LTR characteristics using 21 LTRs. The LTR chr13_57502585 harbored a mutation in the region overlapping combined contigs. Six other LTRs were identified within or around the functional genes (Table 1). Finally, PCR amplification and cycle sequencing analysis was performed on 26 LTR loci, from which nine, seven, and five nrPERV sequences were detected for VnP1, VnP2, and VnP3, respectively (Table 1).

Non-reference PERV-LTR characteristics and phylogenetic analysis. Among the nrPERV-LTR
sequences detected, there were several mutations, insertions, and deletions. In the maximum likelihood tree, the nrPERV-LTRs were classified into cluster LTR-A and LTR-B (Fig. 3). Of the 21 LTRs, 10 were classified as LTR-A and 11 as LTR-B. The LTR has U3, R, and U5 regions. We detected 18-bp and 21-bp repeats in the U3 region of LTR-B; however, the number of these repeats varied among LTR-B. In contrast, LTR-A lacked these repeats but had sub-repeat sequences resembling those of LTR-B (Fig. 4). www.nature.com/scientificreports/

Discussion
The sequencing data obtained here had a high depth even after trimming based on strict criteria. Over 94% of all read pairs mapped onto the reference pig genome. Hence, the assembled sequence data was deemed to be high in quality. Moreover, we attempted to detect nrPERVs with LTR (nrPERV-LTR) sequences using reads that were non-proper pairs and singleton sequences that did not map correctly to the reference genome. Hence, we detected 26 novel nrPERV-LTR loci in the porcine genome. Long PCR between the 5′-LTR and 3′-LTR confirmed the presence of PERVs and the existence of heretofore unreported nrPERV loci. The RetroSeq-based method used in this study primarily targets non-reference LTR loci, which, in theory, were excluded from the reference pig genome. Previous studies identified PERV loci in the reference sequence of the pig genome using RetroTector software 13 . However, considering that these results did not overlap with those of the present study, the 26 nrPERV-LTR loci found herein are considered novel. Furthermore, RetroSeq analysis using WGS data and long PCR disclosed nine, seven, and five nrPERVs for VnP1, VnP2, and VnP3, respectively. Meanwhile, analysis of the whole-genome assembly (Sscrofa10.2) revealed 9 and 11 PERV-A and PERV-B loci, respectively 12 . However, we could not generate the corresponding data for these VnPs in the present study as the method used in this study is not applicable for detecting the PERV loci previously identified in the reference pig genome. Moreover, copy number values of 7.3, 8.2, and 8.9 were estimated for the PERVs in VnP1, VnP2, and VnP3, respectively, using qRT-PCR. As qRT-PCR could detect both the reference and non-reference LTRs, we could not to distinguish them. Thus, the qRT-PCR results may show lower values than the actual estimates. Overall, qRT-PCR is suitable for broad comparisons of PERV copy numbers among breeds; however, it cannot precisely discriminate them owing to bias effects resulting from PCR inhibitors and variable amplification efficiencies. In future, use of droplet digital PCR, which has been recognized as the most suitable method for absolute quantification of gene copy numbers, will help determine the PERV copy numbers in VnPs. This will enable comparison of results between the present and previous studies 14 .
The method used in the present study identified nrPERV loci and PERV types with greater accuracy than qRT-PCR as the former used long PCR validation. However, our methodology was restricted to non-reference genomes. It may be possible to improve nrPERV-LTR detection sensitivity and accuracy by increasing the amount www.nature.com/scientificreports/ of data or by adding long-read sequencing data. However, undetected PERVs at the "non-LTR" loci (Table 1) might exist. Certain loci might have been overlooked if long PCR amplification was inefficient when LTRs with repetitive sequences were present. In fact, our long PCR did not identify any LTR or PERV bands, even though LTR sequences were detected in silico for chr18_4030456. Although we designed primers from the candidate loci flanking sequences based on the Duroc reference genome, the VnPs used might have contained a mutation in the flanking sequence. Detection of all intact PERVs with LTRs required de novo assembly without the reference genome. Seven of the nine nrPERV loci detected here were primarily PERV-B-the oldest phylogenetic PERV-while only one locus was found for PERV-A and PERV-C. These are all known competent subtypes of PERV which have high homology in the genes encoding gag and pol, however, differ in the genes encoding env proteins 15,16 . All members of the Suidae, including warthogs and red river hogs, harbor PERV-B. However, PERV-A and PERV-C are absent in warthogs while PERV-C is missing in red river hogs 17 . A study applying qRT-PCR reported that crossbreeding with Western species may increase PERV copy numbers 6 . Here, the copy numbers of PERV-A and PERV-C were lower than those of PERV-B. The latter commonly occurs in the wild boar and may have been conserved during the evolution of domesticated pigs. In contrast, the copy numbers of PERV-A and PERV-C are low in VnPs as this breed has occasionally been hybridized with Western species. Several studies have been conducted using different approaches including selection, short-interfering RNA, antibodies, and genome-editing technology (CRISPR/Cas9) to avoid PERV transmission during xenotransplantation 14,[18][19][20][21] . The results of these previous studies suggest the possibility of producing PERV-free pigs, which can be used as breeding organisms to establish a novel biomedical model for xenotransplantation. For this purpose, additional genetic modification should be introduced into the VnP, because of their suitable size for humans and the presence of fewer PERV copies.
A phylogenetic tree was constructed using the LTR sequences detected in the current study. The sequences were divided into LTR-A and LTR-B (Fig. 3). LTR-A and LTR-B differ in terms of how many 18-bp and 21-bp repeats were present in the U3 region, and the presence or absence of alternating repeats (Fig. 4). The sequence characteristics determined here were consistent with those of previous reports [22][23][24] . The inserted LTRs act as host gene enhancers or promoters. In humans, the growth factor pleiotrophin has mitogenic, growth-promoting, and angiogenic properties and is expressed by the ERV-derived LTR promoter. The LTR promoter enables trophoblast-specific placental gene expression 25,26 . Regarding the porcine LTR, promoter activity increases when there are 39-bp repeats in the U3 region 22 . Some of the nrPERV-LTRs detected here were inserted along within functional genes. For example, the LTR detected in chr8_137488280 was LTR-B3 and was inserted into CFAP299 with many repeats in the U3 region (Table 1; Fig. 4). Although CFAP299 reportedly regulates murine spermatogenesis 27 , its precise function in pigs is unknown. Nevertheless, it is primarily expressed in the ovaries 28 and, therefore, likely plays a role in reproduction. The LTRs detected in chr4_78524842 and chrX_75151968 www.nature.com/scientificreports/ were classified as LTR-B1. Each LTR was inserted into SNTG1 and PCDH11X, respectively (Table 1). Though the functions of these genes in pigs remain unknown, SNTG1 is associated with idiopathic scoliosis in humans 29 . Moreover, PCDHX11 is related to the development of primary ovarian insufficiency in human females 30 and might be involved in sexual maturation in cattle 31 . The relationships between certain LTRs and specific biological effects require further investigation. However, our observations suggest that the LTRs inserted within these genes might affect VnP traits. LTRs have been implicated in evolutionary research. Studies have been conducted to estimate the time at which endogenous retroviruses were first inserted by comparing mutations between 5′ and 3′ LTRs 32 . Here, we detected a mutation between the 5′ and 3′ ends of the chr13_57502585 LTR. However, no mutations were detected in any of the other LTRs. Moreover, if chromosomal rearrangements occurred due to homologous recombination between distant proviruses, the flanking TSDs should differ. These points were mentioned in a previous study on ERV-mediated genome rearrangements in primates 33 . However, the TSDs detected in the present study were similar on both the 5′ and 3′ sides. Hence, the LTRs detected may have been recently inserted.
RetroSeq was used in the present study to identify novel PERV loci with LTRs not identified in the reference genome. The findings of this study contribute to the continued evaluation of whether pigs represent an ideal biomedical xenotransplantation model, however, PERV copy number is not the only factor to consider and further investigation is necessary in this regard.  www.nature.com/scientificreports/   Detection of non-reference LTRs. The types of read pairs that mapped to the reference genome were defined to extract sequencing reads that were useful for this research. Most paired end reads derived from WGS, map to the reference genome. However, discordant read pairs may also occur and may have unexpected span sizes/inconsistent orientations. The designation "non-proper pairs" refers to a 5′ or 3′ end that maps to a contig sequence of the reference genome, while the other end fully or partially maps to an unexpected locus. The designation "singleton" refers to one end of a read pair that does not map to the reference genome, whereas "unmapped read pairs" refers to both ends of a read pair that do not map to the reference genome ( Fig. 1). Discordant read pairs may provide insights regarding LTR-related loci as the anchor 37 . RetroSeq software 37 was used to detect non-reference transposon elements (TEs) using mismatched reads. The process flow is shown in Fig. 2.
The LTR sequences were based on PERV-LTR sequences acquired from the National Center for Biotechnology Information (NCBI, Bethesda, MD, USA) under accession Nos. AF435966, AF546883-AF546887, AJ279057, AJ298073-AJ298075, AY312534-AY312550, EF133960, EU789636, and HQ540595. The reference genome was Sscrofa11.1. which contains only chromosomes 1-18 and X. In the RetroSeq "call" step, the TE insertion loci (breakpoints) were inferred using reads detected during the "discover" phase, as previously reported 38 . The "call" step read option was set to ≥ 10 to reduce false positives. The maximum read depth option per call was set to 10,000 to increase BAM coverage. All other RetroSeq options were used at their default values. At least seven filter level breakpoints were employed. Calls within 500-bp of a detected breakpoint were considered identical and were excluded. The IGV 39 was used to detect loci containing TSDs 4-5 bp long. The loci were presumed to be TSD if they mapped on reads detected during the "discover" phase either from the 5′ or 3′ side, overlapping by 4-5 bp (Fig. 2). The 5′ and 3′ reads mapping within 150 bp of the TSD were extracted with SAMtools 40 (Fig. 2). The read sets were used to generate contig sequences by local de novo assembly with CAP3 software 41 . The presence of the LTR sequence was confirmed from the contig sequence created by CAP3.

PCR amplification and nucleotide sequencing determination of non-reference PERVs.
For the loci wherein non-reference LTRs were identified, PCR was performed to confirm the presence of PERVs. The final PCR mixture consisted of 0.4 U KOD FX neo Taq (TaKaRa Bio Inc., Kusatsu, Shiga, Japan), 10 μL of 2 × KOD FX neo buffer (TaKaRa Bio Inc.), 1.6 μL of deoxynucleotide triphosphate (dNTP; 2.5 mL of each type; TaKaRa Bio Inc.), 1.2 μL of 10 µM forward and reverse primers per site, and 10 ng DNA in a total volume of 20 μL. The primers used for PCR are listed in Suppl. Table S1. The PCR program comprised a denaturation step at 95 °C for 2 min, followed by 40 cycles of 95 °C for 10 s and 68 °C for 10 min. After PCR, the 8000-10,000bp DNA band was purified and subjected to a second PCR using the primers for the PERV pol region 6 . The amplicon of the second PCR cycle was sequenced with an ABI3130 BigDye Terminator v. 3.1 Cycle Sequencing Kit (Applied Biosystems, Foster City, CA, USA) and analyzed with a ABI3130 Genetic Analyzer (Applied Biosystems) to detect the presence of PERV sequences. The types of PERVs were determined using the Basic Local Alignment Search Tool 42 according to the partial sequences of the PERV pol region.
LTR structure and phylogenetic tree analysis. The detected non-reference LTRs were aligned with Multiple Sequence Comparison by the Log-Expectation program 43,44 . The phylogenetic tree was inferred by the maximum likelihood method and the kimura three-parameter model included in Molecular Evolutionary Genetics Analysis software (MEGA X) to classify the non-reference LTRs 45-47 . qRT-PCR for copy number estimation. The numbers of PERV pol gene copies in the three pigs were estimated by qRT-PCR according to a previously reported method. β-actin (ACTB) was used as an endogenous control (reference) gene. The primers and probes were the same as those used in a previous study 6 . The PCR amplicons of these primer sets were cloned into the pCR-TOPO2.1 vector. The standard curves for absolute quantification were plotted using serial dilutions of linearized DNA from ACTB and pol gene plasmid clones. The qRT-PCR mixtures used for DNA amplification were prepared by adding 7.5 µL of 2 × TaqMan Gene Expres-