In Silico identification of a common mobile element insertion in exon 4 of RP1

Mobile element insertions (MEIs) typically exceed the read lengths of short-read sequencing technologies and are therefore frequently missed. Recently, a founder Alu insertion in exon 4 of RP1 was detected in Japanese patients with macular dystrophy by PCR and gel electrophoresis. We aimed to develop a grep search program for the detection of the Alu insertion in exon 4 of RP1 using unprocessed short reads. Among 494 unrelated Korean patients with inherited eye diseases, 273 patients with specific retinal phenotypes who were previously genotyped by targeted panel or whole exome sequencing were selected. Five probands had a single heterozygous truncating RP1 variant, and one of their unaffected parents also carried this variant. To find a hidden genetic variant, whole genome sequencing was performed in two patients, and it revealed an AluY c.4052_4053ins328/p.(Tyr1352Alafs*9) insertion in RP1 exon 4. This AluY insertion was additionally identified in three other families and was confirmed by PCR and gel electrophoresis. We developed a simplified grep search program to detect this AluY insertion in RP1 exon 4. The simple grep search revealed a median variant allele frequency of 0.282 (interquartile range, 0.232–0.383), with no false-positive results using 120 control samples. The MEI in RP1 exon 4 was a common founder mutation in Koreans, occurring in 1.8% of our cohort. The RP1-Alu grep program efficiently detected the AluY insertion, without the preprocessing of raw data or complex installation processes.

www.nature.com/scientificreports/ As transposable elements in the human genome account for approximately 45% of the total DNA content, it is difficult to determine whether certain MEIs are pathogenic. Recently, a founder Alu (a short interspersed nuclear element) insertion in exon 4 of RP1 has been reported in Japanese patients with macular dystrophy 10,14,15 , as determined by optimized polymerase chain reaction (PCR)-based amplification with gel electrophoresis or Sanger sequencing. The genetic relatedness of the Korean and Japanese populations suggests that founder RP1-Alu insertions may also be found in Koreans with macular dystrophy or cone-rod dystrophy (CRD). However, PCR and gel electrophoresis are time- and labor-intensive methods. Therefore, we developed a simple approach for detecting Alu insertions in RP1 exon 4 from raw NGS data based on the known sequence of the mutant junction and applied the method to a Korean cohort with IRD for whom targeted panel NGS or WES data had previously been generated.

Results
At the time of the analysis, 233 patients with IRDs were sequenced and analyzed by WES (clinicaltrials.gov: NCT03613948) (Fig. 1), including 168 patients with Leber congenital amaurosis (LCA), CRD, Stargardt disease, macular dystrophy, and RP. We identified four unsolved cases with macular dystrophy or CRD carrying a single heterozygous truncating mutation in RP1 (NM_006269.1) based on a WES analysis (Fig. 2 and Figure S1). However, autosomal dominant inheritance was unlikely because an unaffected parent also had this variant and the minor allele frequency (MAF) was high. Additionally, the variants were located in a region associated with autosomal recessive inheritance 16 . Trio-based WGS of family D and proband-only WGS of one proband (family B) revealed no RP1 copy number variants or structural variants. The RP1 genomic region showed a more complex event leading to incorrect variant calling. A WGS analysis of one patient (family B) revealed the c.4052_4053insGGC CGG GCG CGG TGG CTC ACG CCT GTA ATC CCA GCA CTT TGG G: p.(Tyr1352Alafs*9) variant along with c.5797C > T:p.(R1933*), with a low variant allele frequency (VAF) (19.6%, allele depth = 41:10, REF:ALT) and VQSRTrancheINDEL 99.95-100.00. The Genome Analysis Toolkit uses a machine learning model to differentiate true variants from false positives. A VQSRTrancheINDEL of ≥ 99.00 corresponds to tranches with more false positives. Therefore, VQSRTrancheINDEL 99.95-100.00 indicates a high probability of a false positive. However, the soft-clipped part of the reads (the longest being 122 bp) and the poly-A tail on the opposite side in the WGS analysis of B.II-1 revealed an insertion of an AluY retrotransposon at chr8:55,540,494 (hg19) (Fig. 3A,B). WES or WGS analyses of the four probands (A-D) and the unaffected mother of patient D yielded similar results. Accordingly, we re-examined targeted NGS data for 105 patients with LCA, CRD, Stargardt disease, macular dystrophy, or RP (Fig. 1).
We identified one additional patient (family E) with macular dystrophy previously thought to be unsolved because he harbored only one heterozygous nonsense c.5797C > T variant in RP1 (Fig. 2E). Likewise, abnormal reads with low VAFs were suspected between c.4052 and c.4053 in RP1, as determined using Integrative Genomics Viewer, but no variants were called at the position.
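The low VAF quoted above for the family B proband follows directly from the reported allele depth (REF:ALT = 41:10). The short sketch below only reproduces that arithmetic; the function name is illustrative and is not part of any variant-calling toolkit.

```python
def vaf_from_allele_depth(ref_depth: int, alt_depth: int) -> float:
    """Return ALT / (REF + ALT) for one site."""
    total = ref_depth + alt_depth
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_depth / total


# Allele depth reported for the family B proband: REF = 41, ALT = 10.
vaf = vaf_from_allele_depth(41, 10)
print(f"{vaf:.1%}")  # prints 19.6%
```

A heterozygous variant is expected near 50%, so a value close to 20% is consistent with insertion-spanning reads being poorly aligned or soft-clipped rather than counted as ALT.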
To better detect Alu insertions at this location in RP1 in patients with macular dystrophy or CRD, we designed a simple grep search program including the reference sequence (13 bp) and AluY sequence (13 bp) at the junction. We identified an Alu insertion in RP1 exon 4 in unsolved patients with the disease-causing variant p.(Arg1933*) in families B, C, and E, p.(Ile1528Valfs*10) in family A, and p.(Cys1399Leufs*5) in family D. Interestingly, macular dystrophy without peripheral retinal dystrophy was observed with c.5797C > T:p.(Arg1933*) and early-onset CRD was observed with c.4582_4585del:p.(Ile1528Valfs*10) (family A.II-2) and c.4196del:p.(Cys1399Leufs*5) (family D.II-2). The latter two patients had childhood-onset nystagmus and were legally blind at the age of 20 years (Table 1).
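The junction probe described above (13 bp of reference sequence followed by 13 bp of AluY sequence) can be sketched as follows. This is a minimal illustration, not the authors' program: the reference flank is a placeholder rather than the real RP1 exon 4 sequence, the AluY part is simply the first 13 bases of the inserted sequence reported in the Results, and searching the reverse complement as well is an assumption made here because reads can derive from either strand.

```python
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def build_junction_probes(ref_flank: str, insert_start: str) -> Tuple[str, str]:
    """Return (forward probe, reverse-complement probe) spanning the junction."""
    probe = (ref_flank + insert_start).upper()
    return probe, reverse_complement(probe)


# PLACEHOLDER flank: not the real RP1 exon 4 sequence.
REF_FLANK_13 = "ACGTTGCAACGTA"
# First 13 bases of the inserted sequence reported for c.4052_4053ins.
ALU_START_13 = "GGCCGGGCGCGGT"

forward, reverse = build_junction_probes(REF_FLANK_13, ALU_START_13)
print(len(forward), forward)
print(len(reverse), reverse)
```

A 26-bp probe of this kind is long enough to be specific to the mutant junction, while still fitting comfortably inside a typical short read.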
Alu insertions between c.4052 and c.4053 in RP1 were suspected for the five patients described above. We confirmed the Alu insertion in exon 4 of RP1 by PCR and gel electrophoresis using samples from four patients and their available parents (Figure S2); patient D was excluded owing to the lack of residual sample. An approximately 300-bp insertion was identified in the probands (families A-C and E); in the probands of families A and B, the insertion originated from the father and mother, respectively. The mother of the proband in family E had no RP1 insertion; thus, the insertion likely originated from the father. Sanger sequencing of RP1 of the parents in families A and B and WGS of the parents in family D revealed that the insertion and another truncating variant in RP1 in families A, B, and D were located in trans. Sanger sequencing revealed that RP1 c.5797C > T:p.(Arg1933*) in the proband of family E originated from the mother, indirectly confirming that the variant and Alu insertion in the proband of family E were in trans. The Alu sequence was determined by Sanger sequencing of a purified ~ 672 bp band in gel electrophoresis. Except for the poly(A) tail, 5 of 282 bases in Alu differed from the previously reported Alu Y reference (Figure S3) 17 . Interestingly, some bases preceding the Alu insertion were detected in duplicate behind the poly(A) of Alu Y. These two direct repeats were likely introduced during the Alu insertion. The Alu sequence found in the Japanese population has not been reported, and thus it was not possible to confirm that the same element was present. However, the high prevalence in cases in both Korea and Japan and the identical position strongly suggest a common founder effect. The predicted pathogenicity and population frequency of the variants are summarized in Table S1.
Validation of the grep search. Using the bash grep command, we found a median VAF of 0.282 (interquartile range, IQR, 0.231-0.383) in nine sets of sequencing data from six patients (five probands and the mother of proband D) for the heterozygous RP1-Alu insertion (Table S2)
Comparison with other mobile element detection tools. We used the MELT, Mobster, and SCRAMble tools to compare the efficacy and runtime for MEI detection in RP1 exon 4 12,13,18 . The RP1-Alu was not called in two targeted NGS samples using the MELT algorithm and in one WES sample (Patient C.II-2) using the Mobster and SCRAMble algorithms (Table S3). Computation time is a limiting factor when running MEI detection tools on large datasets. For targeted panel data, the median runtimes were 101.  (Fig. 4). The runtimes did not account for the pre-processing, filtering, and annotation of MEIs.

Discussion
RP1 is located on chromosome 8 and comprises 4 exons (3 coding) and 2156 amino acids. Most reported disease-causing variants are clustered in the largest and terminal exon 4, and RP1 disease-causing variants show autosomal dominant or recessive inheritance patterns depending on the type and position of the variants 16 . We found 5 unsolved cases with a single disease-causing variant in the RP1 region associated with autosomal recessive inheritance based on NGS data. The c.4052_4053ins328 Alu element insertion in RP1 seems to be the second variant in the East Asian population.  7,11,19,20 . The AluYb8 insertion in MAK is a founder mutation in the Jewish population 8 , and a BBS1 SVA F retrotransposon insertion is a frequent cause of Bardet-Biedl syndrome in Europeans 11 . Furthermore, recent studies have identified MEIs as causative mutations in 0.04-0.15% of cases 18,21 . The MAK-Alu grep program is an efficient tool for the detection of founder MEIs in the Jewish population 22,23 . Studies aimed at detecting pathogenic MEIs in Asian populations are relatively limited, despite the potential for population-specific founder MEIs. Recently, a founder Alu insertion in exon 4 of RP1 has been reported with autosomal recessive inheritance in Japanese patients with macular dystrophy 10,14,15,24 . Therefore, the founder MEI found in the Japanese population should also be investigated in the Korean population.
MEIs can often be missed by NGS methods due to PCR amplification and targeted capture in both targeted panel and WES data. PCR and gel electrophoresis have been used to identify the Alu in exon 4 of RP1 in cases with a heterozygous, disease-causing variant in RP1 found by targeted panel sequencing 15 . However, this approach is time-consuming, expensive, and laborious. Therefore, we created a grep search program to detect the Alu in exon 4 of RP1 in previously generated raw NGS data, without requiring further experiments. By incorporating the simplified grep program in our clinical diagnostic pipeline, we detected the MEI in RP1, which can provide a definitive molecular diagnosis that is typically missed by short-read sequencing. In our cohort with compatible phenotypes (n = 273), the MEI was detected in 1.8% of patients, consistent with the frequencies reported in previous studies 24  However, the Alu insertion in trans with more proximal frameshift mutations (c.4196del or c.4582_4585del) causes childhood-onset nystagmus and severe macular dystrophy with rod involvement, consistent with early-onset CRD. Early-onset CRD occurs during childhood, with the first symptoms recognized in the first decade 25 . Compared with that in LCA, visual function in early-onset CRD is slightly better, but progressive loss of retinal function leads to blindness in the second to third decade of life. We found that the RP1-Alu variant along with other frameshift mutations can cause childhood-onset retinal dystrophy with nystagmus, mimicking LCA or Stargardt disease. As RP1 mutations cause CRD, RP, or macular dystrophy in either autosomal recessive or dominant states depending on the mutation location and type 26 , careful evaluations of the family history and the locations of variants in RP1 are important, particularly when a single heterozygous disease-causing RP1 variant is found and the family history does not indicate autosomal dominant inheritance.
MEIs can be detected using the MELT or Mobster algorithm based on discordant read pairs and clipped reads in combination with consensus sequences of known mobile elements 12,13 . Additionally, SCRAMble shows relatively high sensitivity for the detection of MEIs occurring within a targeted capture region 18 . These tools show reduced sensitivity for target enrichment sequencing relative to PCR-free genome sequencing because discordant read pairs can exist outside of target regions. Indeed, RP1-Alu was not detected in two targeted NGS samples using the MELT algorithm and in one WES sample using both Mobster and SCRAMble. Furthermore, our grep program has practical advantages over other algorithms, including reduced computational time and no need for complex installation processes or preprocessing steps.
Despite these advantages, it should be emphasized that our RP1-Alu grep program is only useful for detecting the founder MEI in exon 4 of RP1. Although no common variants have been reported within 13 bp upstream of the Alu insertion in gnomAD v2.1.1, a rare variant was found in gnomAD v3.1 (hg38: 8-54,627,925-A-G: MAF = 1/152,184) 10 bp upstream of the Alu insertion site. If one mismatch is allowed within the junction of the Alu insertion, the R agrep program will yield positive results in such cases. We were also unable to confirm the validity of the method in patients with a homozygous RP1-Alu insertion or in other populations.
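The effect of a rare variant inside the probe region, and the benefit of tolerating one mismatch, can be illustrated with the sketch below. It is not the R agrep implementation: agrep uses a generalized edit distance that also covers insertions and deletions, whereas this simplified version counts substitutions only. The probe is a placeholder, and all function names are hypothetical.

```python
def hamming_within(a: str, b: str, max_mismatches: int) -> bool:
    """True if equal-length strings differ at no more than max_mismatches positions."""
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def read_contains_probe(read: str, probe: str, max_mismatches: int = 1) -> bool:
    """True if any window of the read matches the probe within the mismatch budget."""
    k = len(probe)
    return any(
        hamming_within(read[i:i + k], probe, max_mismatches)
        for i in range(len(read) - k + 1)
    )


# PLACEHOLDER 26-bp junction probe (13 bp flank + 13 bp insertion start).
PROBE = "ACGTTGCAACGTAGGCCGGGCGCGGT"
# A read carrying a single-base substitution inside the flank.
variant_read = "TT" + PROBE[:10] + "A" + PROBE[11:] + "AA"

print(PROBE in variant_read)                     # exact matching, as with plain grep
print(read_contains_probe(variant_read, PROBE))  # one substitution tolerated
```

Exact matching misses the read carrying the substitution, while the tolerant search still recovers it, at the cost of a slightly higher chance of spurious matches.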
In conclusion, our results showed that the RP1-Alu insertion is common in Koreans with IRD, occurring in 1.8% of patients with IRD. RP1-Alu grep detected this common MEI with no false-positive results. These findings provide a basis for further studies of the founder RP1-Alu insertion in pre-existing NGS data in East Asian patients with unsolved IRD. We also determined the full sequence of the inserted Alu Y. In unsolved early-onset CRD or macular dystrophy, RP1-Alu should be investigated using short-read sequencing data in East Asians.

Methods
Patient cohort and Alu detection process. The study protocol adhered to the tenets of the Declaration of Helsinki and was approved by the Institutional Review Boards of Yonsei University College of Medicine, Gangnam Severance Hospital (3-2020-0330). All probands were unrelated. Patients with clinical information were recruited and clinically examined at Severance Hospital, Yonsei University College of Medicine. Informed consent was obtained from all subjects or, for subjects under 18 years of age, from a parent or legal guardian; informed consent included consent for the publication of identifying information/images. Blood samples were collected for DNA extraction; 494 unrelated patients with inherited eye diseases, including FRMD7-related infantile nystagmus, congenital cataract, Stickler syndrome, familial exudative vitreoretinopathy, inherited optic atrophy, RP, LCA, CRD, and macular dystrophy, were identified. In total, 261 patients were evaluated by targeted panel NGS and 233 patients were evaluated by WES using xGen Exome Research Panel v1 (Integrated DNA Technologies, Coralville, IA, USA) and Twist Human Comprehensive Exome (Twist Bioscience, San Francisco, CA, USA). Proband-only WGS or trio WGS was additionally performed for 16 unresolved cases after targeted panel NGS or WES. Sequencing and bioinformatic analyses were performed as described previously and are summarized briefly in the Supplementary methods 27,28 . Probands with LCA, Stargardt disease, CRD, macular dystrophy, and RP were screened. Among the selected probands, we evaluated unsolved patients with only one disease-causing variant in RP1 and applied a newly developed grep search program to their FASTQ files to detect the Alu insertion in exon 4 of RP1. We additionally tested the program using control samples. Suspected Alu insertions in RP1 were confirmed by PCR and electrophoresis.
Grep program to detect RP1-Alu. The Linux grep command was used to search FASTQ files for the 5′ junction between the reference sequence of exon 4 and the beginning of the Alu insertion in RP1. Most FASTQ files without the insertion returned a count of "0," though in rare cases a false-positive read count of 1 or 2 was detected in wild-type samples depending on the coverage depth and sequencing method. The variant allele frequency (VAF) was calculated as mutant read count/(wildtype read count + mutant read count). The program returns "No AluY insertion: VAF < 0.1," "AluY insertion suspected: 0.1 ≤ VAF < 0.3," or "AluY insertion detected: VAF ≥ 0.3." The grep search program is described in detail in the Supplementary methods.
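The counting and classification steps described above can be sketched in a few lines. This is an illustration under stated assumptions, not the program from the Supplementary methods: both probe sequences are placeholders, the way wild-type reads are counted (an exact match to a reference-only probe) is assumed, and treating a site with no matching reads as VAF = 0 is a choice made here. Only the VAF formula and the three reporting thresholds are taken from the text.

```python
import io
from typing import Iterable, Sequence, Tuple


def count_matching_reads(fastq_lines: Iterable[str], probes: Sequence[str]) -> int:
    """Count FASTQ records whose sequence line contains any probe (like grep -c)."""
    count = 0
    for index, line in enumerate(fastq_lines):
        # In a 4-line FASTQ record, the sequence is the second line.
        if index % 4 == 1 and any(probe in line for probe in probes):
            count += 1
    return count


def classify_vaf(mutant: int, wildtype: int) -> Tuple[float, str]:
    """Apply VAF = mutant / (wildtype + mutant) and the three reporting thresholds."""
    total = mutant + wildtype
    vaf = mutant / total if total else 0.0  # assumption: no coverage reported as 0
    if vaf < 0.1:
        label = "No AluY insertion: VAF < 0.1"
    elif vaf < 0.3:
        label = "AluY insertion suspected: 0.1 ≤ VAF < 0.3"
    else:
        label = "AluY insertion detected: VAF ≥ 0.3"
    return vaf, label


# PLACEHOLDER probes: not the real RP1 junction or reference sequences.
MUTANT_PROBE = "ACGTTGCAACGTAGGCCGGGCGCGGT"
WILDTYPE_PROBE = "ACGTTGCAACGTATTCAGGATCCAGA"


def make_fastq(reads: Sequence[str]) -> str:
    """Build a small in-memory FASTQ text for demonstration."""
    return "".join(
        f"@read{i}\n{seq}\n+\n{'I' * len(seq)}\n" for i, seq in enumerate(reads)
    )


reads = ["TT" + MUTANT_PROBE + "AA"] * 2 + ["TT" + WILDTYPE_PROBE + "AA"] * 8
fastq_text = make_fastq(reads)
mutant = count_matching_reads(io.StringIO(fastq_text), [MUTANT_PROBE])
wildtype = count_matching_reads(io.StringIO(fastq_text), [WILDTYPE_PROBE])
print(mutant, wildtype, classify_vaf(mutant, wildtype))
```

For real data, the in-memory text would be replaced by an open file handle, for example one returned by `gzip.open(path, "rt")` for a compressed FASTQ file, and the reverse complement of each probe would typically be added to the probe list.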

Data availability
Data supporting the findings of this manuscript are available from the corresponding author upon reasonable request.