Detection of single nucleotide and copy number variants in the Fabry disease-associated GLA gene using nanopore sequencing

More than 900 variants have been described in the GLA gene. Some intronic variants and copy number variants in GLA can cause Fabry disease but will not be detected by classical Sanger sequence. We aimed to design and validate a method for sequencing the GLA gene using long-read Oxford Nanopore sequencing technology. Twelve Fabry patients were blindly analyzed, both by conventional Sanger sequence and by long-read sequencing of a 13 kb PCR amplicon. We used minimap2 to align the long-read data and Nanopolish and Sniffles to call variants. All the variants detected by Sanger (including a deep intronic variant) were also detected by long-read sequencing. One patient had a deletion that was not detected by Sanger sequencing but was detected by the new technology. Our long-read sequencing-based method was able to detect missense variants and an exonic deletion, with the added advantage of intronic analysis. It can be used as an efficient and cost-effective tool for screening and diagnosing Fabry disease.

. Next generation sequencing enabled diagnosis since the male patient was mosaic for an SNV (58% of the reads detected the mutated variant although 100% variant frequency would be expected for an X-linked allele). Although in this case the mosaic was also detected by Sanger sequencing, in other cases of mosaicism where the ratio between the variant and wild-type molecules are more extreme, they might be missed by Sanger sequencing.
Nanopore-based sequencing has recently emerged as a powerful technology for nucleic acid sequencing in all fields of biology 13,14 . This technology was successfully commercialized by Oxford Nanopore Technologies (ONT) due to its ability to sequence ultra-long-reads, perform real-time basecalling and analysis, provide base modification detection, all with short sample preparation time and low instrument costs 15 . These advantages allowed the successful complete assembly of human chromosomes 16 , tumor structural and epigenetic variations associated with cancer 17 , detection of imprinted DNA methylation 18 , identification of novel transcript isoforms 19 and fast detection of viral and bacterial pathogens [20][21][22][23] , including the SARS-CoV-2 virus 24,25 .
Amplification-based targeted sequencing can be performed using both short and long-read sequencing as well as Sanger sequencing. However, as ONT sequencing is able to sequence full amplicons in a single read, it has a clear advantage in detection of structural and intronic variants, variant phasing and accurate mapping in cases of highly homologues genes (e.g. pseudogene, duplicated gene). For the lysosomal storage disorder Gaucher disease, ONT sequencing was shown to accurately analyze the GBA gene, previously considered challenging due to a nearby pseudogene 26 . This analysis also identified intronic variants that are missed by the classical Sanger sequencing-based GLA genotyping. Similar approaches were used to identify variations in FCGR3A gene 27 and TP53 28 .
The aim of our study was to design a Long-read amplicon sequencing assay and to validate it by determining its accuracy for detection of variants in the GLA gene.

Results
Clinical description of the patients. The clinical signs and symptoms are presented in Table 1. All patients were previously diagnosed by measuring the alpha galactosidase A enzyme activity and Sanger sequencing. The samples tested were blinded except for gender, meaning that aside from male sex, the genetic variant and any relationship among the samples was not shared with our lab. Nanopore testing of the patients. GLA amplicon sequencing. In order to detect genomic variants in the GLA locus, we designed a PCR amplicon that produces a 13 kb product including the entire gene, and 800 bp and 2000 bp up-and down-stream sequences, respectively. Sequencing the pooled PCR products of the 12 samples on one MinION flow cell yielded a median of 88,800 reads and 617 million bases per sample (Table S1). Read length distributions of the 12 samples show a peak around 13 kb, demonstrating that most reads are of full amplicons (Fig. S1). Mapping the reads to the human reference genome showed a median of 44,000 × coverage per sample around the GLA region.
SNV and indel detection. The nanopolish tool was then used to call single nucleotide variants (SNV) and short indels. The resulting variants were filtered based on quality score to eliminate false positives calls. The filtered variants were first searched in the ClinVar database for any known classification. The variants that did not match any entry in known databases were manually classified for their predicted effect on the protein using several prediction tools (see "Material and methods"). A summary of the variants with likely pathogenic effect www.nature.com/scientificreports/ on the protein translation is shown in Table 2. In 10 of the 12 samples analyzed we identified a pathogenic or likely pathogenic exonic SNV (6 samples) or short indel (4 samples). For another sample, we detected an intronic SNV (c.547 + 404T>G) at a possible branch site that is likely to affect splicing. According to SpliceAI (https:// splic eailo okup. broad insti tute. org/) this variant has a score of 0.55 as a donor gain, above the 0.5 recommended score cut-off. This variant was detected previously by Sanger sequencing of multiple GLA intronic amplicons. With nanopore sequencing, the same variant was detected by a single PCR assay.

Structural variants.
Although structural variants (SVs), such as insertions and deletions longer than 50 bp, are rare compared to SNVs, many of them result in a pathogenic effect on the encoded protein.
Only 54 SVs (all pathogenic) in GLA are currently described, compared to > 900 pathogenic SNVs and short indels. Surprisingly, in one sample in our study cohort, we detected a 2914 bp deletion between introns 1 and 2, which completely removes exon 2 (Fig. S2). The deletion removes amino acids 65 to 123, located within the alpha-galactosidase domain, and will also change the reading frame from position 65 in the new protein onwards, introducing a new stop codon at position 70 of the new predicted protein. This change to the protein is likely pathogenic. The read length distribution of the amplicon sequencing shows a peak around 10 kb, the expected amplicon length of the deletion variant (Fig. S1, sample 11). As in the case of the intronic SNV, the precise breakpoints of this deletion could only be detected using whole gene amplicon sequencing as both of its boundaries are deep intronic.
Variant phasing. One of the advantages of using ONT in this workflow is the fact that > 50% of the reads include the full length gene, allowing haplotype phase determination of variants that originate from a single molecule. As the analysis was performed blindly to the personal background data of the patients, we could detect haplotypes that are shared between two samples (samples 3 and 5, and samples 7 and 8). We suspected that the samples from each of these two sets belonged to patients with blood relationships. At the end of the analysis, when the blinded file was decoded these findings were validated by the fact that the two sets of samples were actually siblings. Moreover, all 12 SNVs/CNVs were confirmed by prior Sanger sequence clinical testing.
Downsampling for designing higher multiplexed sequencing. In this study we sequenced in multiplex 12 samples on one MinION flowcell, yielding an average coverage of 45,000 ×, much higher than needed for high quality variant calling using ONT reads. In order to evaluate how many samples could be multiplexed in future analyses, we randomly downsampled the reads output to several points. Then, we repeated the analysis workflow on the downsampled reads and evaluated the detection rate of the variants found in the full dataset. Downsampling up to 500 reads per sample allowed detection of all variants from all 12 samples (Fig. 1). While for several samples using only 30 reads were sufficient for 100% true positive detection, for other samples using 250 reads or lower achieved only partial detection, and the rate of several false positive variants passing the quality filter increased. For future GLA genotyping using this pipeline we estimate that 1000 reads per sample Deep intronic and large copy number variant detection using long amplicon sequencing. Two patients in the cohort were of special note: one with the deep intronic variant c.547 + 404T>G and the second one with the deletion of exon 2.
The patient with the variant c.547 + 404T>G suffered acroparesthesia, typical Fabry pain crises and abdominal cramping from childhood. Then, a suspicion of FD was raised, and his enzyme activity was found 8% of normal. His candidate pathogenic variant remained unidentified for 10 years until Sanger sequencing was performed on multiple amplicons spanning the entire non-coding sequence of GLA. The patient with the deletion of exon 2, and his brother and mother were all clinically diagnosed, with Fabry disease. The patient had zero enzyme activity, but the variant was not found for several years by conventional Sanger sequencing. By extracting mRNA and sequencing the cDNA the deletion was found, as mentioned above, with a delay of several years. In contrast, both variants were identified with ease using the ONT long amplicon sequencing method.
All but two variants in Table 2 were classified as pathogenic by a combination of variant effect prediction tools (CADD, REVEL, SIFT, MutationTaser, Polyphen2) and 9 of them were already classified by ClinVar 29 or VarSome 30 as pathogenic or likely-pathogenic according to ACMG guidelines 31 . One patient carried the variant R118C that has conflicting pathogenicity interpretation 32,33 . Indeed, he did not present any of the signs and symptoms that are shown in Table 1. The second variant T194I was classified as likely pathogenic. This patient presented all the signs and symptoms described in Table 1 except for stroke.

Discussion
The purpose of our study is to validate the technique of Oxford Nanopore long amplicon sequencing of the GLA gene.
Fabry disease is a rare lysosomal genetic disorder and delay in diagnosis between the appearances of the first symptoms until the disease is recognized can span many years 34 . While measuring alpha galactosidase enzyme activity is accurate in males, in women, there is an overlap between Fabry diseased and healthy females regarding levels of enzyme activity. Furthermore, definitive diagnosis is recommended and it implies finding the genetic variant causing Fabry. A genetic diagnosis is not only important for the patient but also for screening of the entire family. This can be performed only when the disease causing variant of the proband patient is detected.
The GLA protein has a 48,767 Da mass and the encoding gene encompasses 7 exons. Over 900 pathogenic variants have been described including, SNVs, small deletions, insertions, CNV and IVS variants 5 .
While in most cases Sanger sequencing is straightforward, some cases can be challenging and remain without a molecular diagnosis for many years. Two examples are described in this paper, in which standard Sanger sequencing of the GLA coding sequence was unable to detect a deep intronic variant in one patient and a complete exon deletion in another.
For these two patients, although a clinical diagnosis was suspected, the final diagnosis based on the genetic variant was delayed by 10 years. While screening in the high risk males from both families could be performed by GLA enzyme activity measurements, lack of a molecular genetic diagnosis precluded genetic screening of the high risk females in each family. www.nature.com/scientificreports/ Given that FD is a multi-system disorder, it is often that the coding sequence of GLA is sequenced in order to confirm FD diagnosis. Screening is also necessary for the entire GLA gene sequence because most pathogenic variants in GLA are "private" and therefore not expected to resurface in other patients with similar ancestry. Hence, cost-effective genetic screening methods are necessary in order to conduct FD testing at high scale.
In the present study, 12 samples were long-read sequenced at high depth (45,000 ×). Using downsampling experiments, we demonstrate that such high coverage is unnecessary to enable accurate genotyping of the full gene sequence. Indeed, we propose that 1000 × sequencing depth will be sufficient to genotype any haploid male or diploid female sample across the entire coding and non-coding GLA sequence. Using long amplicon sequencing, this strategy would enable 96 samples to be sequenced at once on a single ONT Flongle flow cell. At this plexity, we expect the cost per sample of ONT to be much lower than that of Sanger sequencing ($10 ONT vs. $70 Sanger cost per sample). This cost savings would also be in addition to the aforementioned enhanced molecular coverage of CNVs and deep intronic variants.
In conclusion, we describe an accurate and cost-effective method for complete gene sequencing of the FDassociated GLA gene. Our new molecular assay is well adapted for high throughput Fabry pathogenic variant screening in the clinic. In addition, we expect our assay to reduce time to FD diagnosis in the long run, due to its enhanced sequencing coverage of the disease-associated coding and non-coding sequences.

Material and methods
Patients and samples. This study was conducted in accordance with the principles of the Helsinki Declaration. The research received the approval by the local IRBs of Shaare Zedek Medical Center and Zurich University, Switzerland, All patients signed a written informed consent. All authors have read and approved the manuscript.
Twelve consecutive adult male patients (age range 31-60 years) with genetically confirmed FD, who were regularly followed at the specialized FD center of the University Hospital Zurich Switzerland, were recruited into the study during their annual examinations. Patients data included medical history, cardiac, renal, and neurological evaluations. The presence of stroke or TIA (transient ischemic attack) was evaluated during annual examinations by asking the patient and/or using the medical records. Standard transthoracic 2D-echocardiography was routinely performed in all patients. Cardiomyopathy was defined as the presence of Fabry-typical electrocardiogram (ECG)-changes and/or signs of diastolic dysfunction and/or left ventricular hypertrophy on echocardiography or heart MRI. Kidney involvement was defined as either having protein/creatinine coefficient > 0.015 g/mmol and/ or estimated glomerular filtration rate (eGFR) according to Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) formula of < 90 mL/min/1.73 m 2 . Cerebral involvement was defined as the presence of stroke or TIA. For the present study, all clinical and laboratory results were obtained from the patients' medical records.
All variants have been classified as coding for the classic or late-onset phenotype based on genotype and residual α-Gal A activity in males and are published in the International Fabry Disease Genotype/Phenotype Database (www. dbFGP. org 35 ) and in previous studies 9, 36 . The phenotypic assignments of the variants are supported by the clinical manifestations in males, the age of symptoms onset and by in vitro expression assays as reported previously 37,38 . DNA extraction and amplification. DNA was extracted from 200 µL blood samples using FlexiGene DNA kit (Qiagen) according to manufacturer's instructions. The GLA genomic region was amplified by long PCR using LA Taq polymerase (TaKaRa) with the following primers: GLA-Fwd: TTT CTG TTG GTG CTG ATA TTG CTT GGG AGG GAA TAA GCT AGA GCC ATC A; GLA-Rev: ACT TGC CTG TCG CTC TAT CTT CCT TTG TCA AGC ACG CAT TTG CCT AGA T. The 5′ ends of each primer included adapter sequences for subsequent priming with PCR Barcoding Kit (ONT; SQK-PBK004) barcoded sequencing primers. Two rounds of long PCR were performed. First, 1.25 units of TaKaRa LA Taq® DNA Polymerase (catalog number: RR002T) were used to amplify a 13 kb amplicon (capturing the entire GLA genomic DNA sequence including promoter region) in a 25ul reaction with 100 ng input DNA and 200 nM each of the aforementioned primers in 1X LA PCR Buffer ll (Mg2 + plus) and 400uM dNTPs. Thermocycling was as follows: 94 °C for 1 min followed by 20 cycles of 98 °C for 10 s and 68 °C for 13 min, then 72 °C final extension for 10 min. PCR products were purified with 0.45X Ampure XP beads (Beckman Coulter), eluted in low TE (10 mM Tris-HCl (pH 8.0), 0.1 mM EDTA), then subjected to another round of PCR using 200 nM of barcoded LWB primer pairs from the ONT PCR Barcoding Kit. The second PCR was at 50ul total volume and also included 2.5 units of TaKaRa LA Taq® DNA Polymerase, 1X LA PCR Buffer ll (Mg2 + plus), and 400uM dNTPs. Thermocycling was as follows: 94 °C for 1 min followed by 30 cycles of 98 °C for 10 s, and 55 °C for 30 s, and 68 °C for 13 min; then 72 °C final extension for 10 min. The second PCR products were purified with 0.45X Ampure XP beads (Beckman Coulter) and eluted in Tris-NaCl (10 mM Tris-HCl (pH 8.0), 5 mM NaCl).
Library preparation and long-read sequencing. DNA concentration of final purified PCR products was determined by Qubit BR (Thermo Fischer) measurement and 12 samples were pooled at equimolar concentrations. Subsequently, 50 femtomol of pooled PCR products were loaded onto a MinION flow cell (FLO-MIN106D) according to the manufacturer's protocol (PCR Barcoding Kit SQK-PBK004,,Oxford Nanopore Technologies).and sequencing was performed using a MinION device and MinKnow software (MinION Release 19.12.5) for 48 h.
Ethics approval. Details of ethics approval:The research received IRB approval in both Medical centers from Shaare Zedek Medical Center, Jerusalem Israel and Zurich University, Switzerland; A patient consent statement. Informed consent was obtained from all patients and is available upon request.

Data availability
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.