Introduction

Heterozygous variants in the glucocerebrosidase (GBA1) gene, which encodes the enzyme β-glucocerebrosidase (GCase), are increasingly recognized as the most common genetic risk factor for the development of Parkinson’s disease (PD). Homozygous variants in GBA1 cause the most frequent autosomal-recessive lysosomal storage disorder, Gaucher disease (GD)1. GD is characterized by a deficiency of GCase, which is required to hydrolyze the β-glucosyl linkage of the lipid glucosylceramide in lysosomes to form glucose and ceramide2.

Accurate variant calling in the GBA1 gene is challenging due to the presence of the highly homologous untranslated pseudogene GBAP1, which is located 16 kilobases downstream3 and shares 96% sequence homology within the coding region4. In addition, recombination and structural chromosomal variation within and around the GBA1 locus further complicate the analysis5. Complex alleles, which include several single nucleotide variants, are derived from recombination between the functional GBA1 gene and the GBAP1 pseudogene6. The most common recombinant allele, RecNciI, includes the amino acid changes p.L483P and p.A495P and the synonymous variant p.V499V6.

Our study aimed to accurately assess all rare coding variants in the GBA1 gene in all participants of the Luxembourg Parkinson’s study7, a case and control cohort including patients with PD and atypical parkinsonism. To assess the accuracy of the targeted GBA1 DNA sequencing method using the Pacific Biosciences (PacBio)8 technology, which targets only the GBA1 gene without sequencing the GBAP1 pseudogene, we compared this method with genotyping using the NeuroChip array9 and short-read whole genome sequencing (WGS) data using Sanger sequencing as the gold standard for validation. We identified several types of pathogenic GBA1 variants (severe, mild, and risk) and further characterized genotype–phenotype associations to better understand the influence of each variant type and their effect on disease severity.

Results

Demographic and clinical characteristics

A total of 760 patients (660 PD patients (nPD) and 100 patients with other forms of parkinsonism (npark)) and 808 healthy controls (nHC) from the Luxembourg Parkinson’s study (Fig. 1) were genotyped using NeuroChip and screened for GBA1 variants using the targeted PacBio DNA sequencing method, while a subset of 72 patients was screened with WGS. Among the patients, 66.4% (n = 499) were male, with a mean age at disease onset (AAO) of 63 ± 11.5 years (Supplementary Table 1). The control group consisted of 52.7% (n = 426) males with a mean age at assessment (AAA) of 59.3 ± 12.2 years. All samples were retained because they passed the MultiQC step with more than 30-fold coverage from the long-read DNA sequencing (Supplementary Table 9). To ensure ethnic homogeneity and exclude other genetic factors that may bias the assessment of the genetic contribution of GBA1 to PD in the Luxembourgish population, we excluded from the cohort carriers of mutations in other PD-associated genes (point mutations: n = 10, nPD = 8, nHC = 2; CNV: nPD = 4; no CNVs in GBA1 were detected), first-degree family members (n = 64, nPD = 8, npark = 2, nHC = 54), younger HC (<60 AAA) with first-degree relatives having PD (nHC = 74), and individuals of non-European descent (n = 6). The final cohort consisted of 735 patients (nPD = 637, npark = 98) and 675 HC, with a mean AAO among the patients of 63.2 ± 11.3 years and a mean AAA for HC of 61 ± 11.5 years. Based on NeuroChip and WGS data, none of the GBA1 carriers carried pathogenic variants in other PD-associated genes as defined by MDSGene10.

Fig. 1: Description of the study dataset and methodology.

HC Healthy controls, PD Parkinson’s Disease and Parkinson’s Disease with Dementia, PSP Progressive Supranuclear Palsy, DLB Dementia with Lewy Body, MSA Multiple System Atrophy, FTDP Fronto-temporal dementia with parkinsonism, GBA1 glucocerebrosidase gene, VUS Variants of unknown significance, PD+GBA1 PD patients with GBA1 pathogenic variant, PD-GBA1 PD patients without GBA1 pathogenic variant, CNV copy number variants, AAA age at assessment.

Targeted PacBio DNA sequencing showed the highest specificity for detecting rare coding variants in GBA1

To measure the reliability of calling rare GBA1 coding variants, we performed two types of comparison. Rare variants were defined here as variants with a minor allele frequency (MAF) < 1% in the European population. We compared the results from the PacBio, WGS, and NeuroChip data for a subset of samples (n = 72). We then compared the PacBio and NeuroChip data as they both covered the majority of samples (n = 1568). We considered true positives to be the GBA1 variants validated by Sanger sequencing. False-positive variants were those identified by the analysis method but not confirmed by Sanger sequencing. False-negative variants were not called by the analysis method but were later validated with Sanger sequencing (Supplementary Table 2). First, we evaluated 72 samples screened by all three methods (Fig. 2). Using the GBA1-targeted PacBio DNA sequencing method and WGS in combination with the Gauchian11 tool implemented in Dragen v4 (GBA caller option), we detected six individuals carrying GBA1 variants (p.E365K (n = 3), p.T408M (n = 1), p.N409S (n = 1), RecNciI (n = 1)). The RecNciI allele combines the three variants p.L483P, p.A495P, and p.V499V in one haplotype. All variants detected were confirmed by Sanger sequencing (true positive rate (TPR) of 100%). We did not identify any false-positive variant calls. However, using the Dragen v4 pipeline without the GBA caller, relying only on the GATK best practices pipeline, the WGS method failed to detect the RecNciI recombinant allele in one individual (TPR of 83.3% (5/6)). Using NeuroChip, we detected three potential GBA1 variant carriers (p.T408M (n = 1), p.N431S (n = 1), p.A215D (n = 1)), but only one variant (p.T408M) was subsequently confirmed by Sanger sequencing (TPR of 16.6% (1/6)), resulting in a false discovery rate (FDR) of 66.6% (2/3).
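The rates quoted above follow directly from the definitions of true and false positives. As a minimal sketch (the function and variable names are illustrative and do not come from the study's code), the two metrics can be computed from the counts in the 72-sample comparison:

```python
def true_positive_rate(true_positives: int, sanger_confirmed_total: int) -> float:
    """Fraction of Sanger-confirmed carriers that a method detected."""
    return true_positives / sanger_confirmed_total


def false_discovery_rate(false_positives: int, total_calls: int) -> float:
    """Fraction of a method's calls that Sanger sequencing did not confirm."""
    return false_positives / total_calls


# Counts taken from the 72-sample comparison described in the text.
confirmed_carriers = 6
wgs_without_gba_caller_tpr = true_positive_rate(5, confirmed_carriers)
neurochip_tpr = true_positive_rate(1, confirmed_carriers)
neurochip_fdr = false_discovery_rate(2, 3)

print(f"WGS without GBA caller, TPR: {wgs_without_gba_caller_tpr:.3f}")
print(f"NeuroChip, TPR: {neurochip_tpr:.3f}, FDR: {neurochip_fdr:.3f}")
```

Here the TPR denominator is the number of Sanger-confirmed carriers, while the FDR denominator is the number of calls made by the method itself.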

Fig. 2: Comparison of variant calls from PacBio, WGS and NeuroChip genotyping data using 72 matched samples for the GBA1 gene and validated by Sanger sequencing.

a *RecNciI (p.L483P; p.A495P; p.V499V); Sanger sequencing results: TP, true positive; FP, false positive. Sample count gives the total number of samples carrying the variant found by each method. b Comparative study of GBA1 variant detection by the GBA1-targeted PacBio DNA sequencing and NeuroChip array methods in the Luxembourg Parkinson’s study. Because some variants were overrepresented with the NeuroChip array, we applied a study-wide threshold of 1% to the detected variants in our cohort.

Next, we compared the results from 1568 samples screened with both the GBA1-targeted PacBio DNA sequencing method and the NeuroChip array (Fig. 3). Using the GBA1-targeted PacBio DNA sequencing method, we detected 135 GBA1 variant carriers, of which 100% were validated by Sanger sequencing. Using the NeuroChip array, we detected 47 potential GBA1 variant carriers, among which only 36 were validated by Sanger sequencing (TPR of 26.7% (36/135)), resulting in an FDR of 23.4% (11/47).

Fig. 3: Comparative study of GBA1 variants detection by the GBA1-targeted PacBio DNA sequencing and NeuroChip array methods in the Luxembourg Parkinson’s study.

Because some variants were overrepresented with the NeuroChip array, we applied a study-wide threshold of 1% to the detected variants in our cohort.

Classification of GBA1 variants

Of the 1568 individuals sequenced using the GBA1-targeted PacBio DNA sequencing method, we identified 135 carriers of at least one GBA1 variant (Supplementary Tables 3, 4). Based on the classification of Höglinger et al.12, 25 were carriers of severe variants, 10 of mild variants, 72 of risk variants and 22 of VUS. The most common GBA1 variants in PD patients were the risk variants p.E365K (n = 23; 3.5%) and p.T408M (n = 17; 2.6%).

GBA1 variants were mostly heterozygous missense variants; one patient carried a heterozygous stop-gain variant, p.R398* (rs121908309), and two PD patients carried the homozygous missense variant p.E365K/p.E365K (rs2230288). We identified two HC who carried a pathogenic LRRK2 variant together with a GBA1 variant (p.E365K (GBA1) with p.R1441C (LRRK2); p.K13R (GBA1) with p.G2019S (LRRK2)). We also detected nine different synonymous variants in exonic regions (Supplementary Table 4). The variant p.T408T (rs138498426) is a splice site variant (located within 2 bp of the exon boundary) and is classified as VUS12. The remaining synonymous variants were not further analyzed. Additionally, we identified 69 variants in intronic and UTR regions of GBA1 (Supplementary Table 5) with unclear pathogenic relevance, 35 of which were rare, with MAF < 1% in gnomAD for the Non-Finnish European population10.

We classified the following combinations of multiple variants per individual as severe based on the classification of the respective associated pathogenic variants (Table 1): p.N409S-p.L483P, p.K13R-p.L483P, p.F252I-p.T408M, p.Y61H-p.T408M.

Table 1 Distribution of GBA1 variants in the Luxembourg Parkinson’s study.

GBA1 variant frequency

To calculate the GBA1 frequency in our study, we considered the individuals remaining after the exclusion step (735 patients and 675 HC). We detected 12.1% (n = 77) GBA1 variant carriers among 637 PD patients and 5% (34/675) in HC individuals. We found a frequency of 10.5% (67/637) of pathogenic variants in PD patients (severe, mild, risk) and 4.3% (29/675) in controls (Table 2). Four patients with parkinsonism had GBA1 variants. Carriers of severe GBA1 variants (n = 21; 3.2%; OR = 11.4; 95% CI = [2.6, 49]; p = 0.0010) have a high risk of developing PD as defined by the indicated OR.
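To make the link between carrier counts and an odds ratio concrete, the sketch below computes an unadjusted OR with a Wald 95% confidence interval from the pathogenic-variant counts given above (67/637 PD patients, 29/675 HC). This is an illustrative calculation under the assumption of a simple 2×2 table; the study's reported estimates come from its own statistical models, so this sketch need not reproduce them exactly. The function name is hypothetical.

```python
import math


def odds_ratio_wald(case_carriers, case_total, control_carriers, control_total, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval from a 2x2 table."""
    a = case_carriers                    # carriers among cases
    b = case_total - case_carriers       # non-carriers among cases
    c = control_carriers                 # carriers among controls
    d = control_total - control_carriers # non-carriers among controls
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Pathogenic (severe, mild, risk) carrier counts from the text.
pathogenic_or, ci_lower, ci_upper = odds_ratio_wald(67, 637, 29, 675)
print(f"OR = {pathogenic_or:.2f}, 95% CI = [{ci_lower:.2f}, {ci_upper:.2f}]")
```

The Wald interval is only one of several common choices; exact or regression-based intervals can differ, especially when a cell count is small.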

Table 2 Frequency of GBA1 variants in the Luxembourg Parkinson’s study.

Genotype–phenotype associations in GBA1-PD patients

We characterized the clinical phenotype of severe (n = 21), mild (n = 7) and risk (n = 39) GBA1 carriers and non-carriers (n = 554) in unrelated PD patients remaining after the filtering step, excluding carriers with only a synonymous variant or VUS. The AAO was similar between GBA1 carriers (61.6 ± 11.5 years) and non-carriers (62.6 ± 11.6 years). Severe PDGBA1 variant carriers showed a trend towards younger AAO compared to mild and risk carriers (severe: 58.6 ± 13.1 vs mild: 65.4 ± 17 vs risk: 62.5 ± 9.3 years; p = 0.29) (Table 3), with a significantly increased risk of developing early-onset PD (OR = 4.02; p = 0.0098). In contrast to non-carriers, we also observed that carriers of pathogenic variants have a strong family history of PD (OR = 0.74; p = 0.0401).

Table 3 Demographic data for the PD patients in the Luxembourg Parkinson’s study separated by GBA1 variant status.

We compared clinical features between PD patients carrying pathogenic GBA1 variants and PD patients without GBA1 variants (Supplementary Table 6). Carriers had a more strongly impaired sense of smell (uncorrected p = 0.0210) and a higher rate of hallucinations (uncorrected p = 0.0415). Next, we compared patients carrying variants from each category (severe, mild or risk) separately with PD patients without GBA1 variants (Table 4). Compared to non-GBA1 carriers, carriers of severe GBA1 variants showed more severe non-motor symptoms, as reflected by MDS-UPDRS Part I (uncorrected p = 0.0074) and hallucinations (uncorrected p = 0.0099), and an impaired sense of smell as assessed by the Sniffin’ Sticks test (uncorrected p = 0.0405). To show the deleterious impact of the severe variants, we compared carriers of severe variants with patients carrying either mild or risk GBA1 variants (Table 5). Carriers of severe variants had more severe gait disorder (uncorrected p = 0.0188) and depression (uncorrected p = 0.0074) and worse MDS-UPDRS Part I (uncorrected p = 0.0019) and PDQ-39 (uncorrected p = 0.0422) scores. For all clinical features, no association remained significant after correction for multiple comparisons using FDR adjustment.

Table 4 Clinical characteristics of PD classified by GBA1 variant status.
Table 5 The deleterious impact of severe GBA1-PD carriers in comparison with mild and risk and their clinical characteristics.

VUS and the glucosylceramidase structure

We detected nine previously reported VUS (p.K13R, p.Y61H, p.R78C, p.L213P, p.E427K, p.A495P, p.H529R, p.R534C, p.T408T) and three new VUS (p.A97G, p.A215D and p.R434C).

Following our strategy for the VUS classification of GBA1 variants, in which pathogenicity is assigned based on the REVEL, CADD and dbscSNV scores as well as the presence or absence of the variants in patients, we propose to subclassify the VUS p.Y61H, p.L213P, p.A215D, and p.R434C as probably pathogenic severe variants (Supplementary Table 7). The variant p.L213P changes a leucine into a proline, which is known as the “helix breaker” amino acid that induces a bend in the protein structure13 (Supplementary Fig. 1). The p.L213P and p.A215D variants are in the catalytic site of the enzyme in the triose-phosphate isomerase (TIM) barrel structure. The p.Y61H variant (Fig. 4a) is located next to the residue position of the known severe variant p.C62W, and the patient carrying this variant had an AAO of 38 years, indicating an early-onset, probably severe form of PD. This patient has a family history of PD and reported that the paternal uncle and aunt were diagnosed with PD at the ages of 60 and 70, respectively. The p.R434C variant is close to known severe (p.V433L) and mild (p.W432R, p.N435T) PD variants in the 3D structure. We compared the clinical scores obtained from carriers of known severe variants with those of the four carriers of probable severe VUS (p.Y61H, p.L213P, p.A215D, and p.R434C) (Supplementary Table 7). The z-score was used to determine the number of standard deviations (SD) from the mean for each clinical score. We observed that the PD patient carrying the p.L213P variant had a z-score that was significantly different for MDS-UPDRS II (z-score = 3.05) and MDS-UPDRS III (z-score = 2.94), supporting its classification as a severe variant.
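The z-score comparison described above can be sketched as follows. The reference scores and the VUS carrier's score are hypothetical placeholders, not study data, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def z_score(value: float, reference_values: list) -> float:
    """Number of sample standard deviations by which value departs from the reference mean."""
    return (value - mean(reference_values)) / stdev(reference_values)


# Hypothetical clinical scores of carriers of known severe variants.
reference_scores = [8, 10, 12, 14, 16]
# Hypothetical score of a single VUS carrier.
vus_carrier_score = 24

vus_z = z_score(vus_carrier_score, reference_scores)
print(f"z-score = {vus_z:.2f}")
```

A large positive z-score on a scale where higher values mean worse symptoms would place the VUS carrier at the severe end of the reference group.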

Fig. 4: Sub-classification of VUS found in the Luxembourg Parkinson’s study.

a GBA1 missense and stop-gain variants mapped onto the three-dimensional structure of GCase. Domain I is shown in dark yellow, domain II in blue, and domain III in pink. Domain I begins at residue 40 after the signal peptide sequence. Variants classified as severe are colored red, mild are colored orange, risk in yellow and VUS are colored purple. The 3D structure of GCase (PDB code 1ogs) was generated using PyMOL (http://www.pymol.org). b Proposed sub-classification of identified VUS with their scores in known databases. GBA1 glucocerebrosidase gene, GD Gaucher’s disease, PD Parkinson’s disease, AAO age at onset, AAA age at assessment at visit 1. HGMD The Human Gene Mutation Database, REVEL Rare Exome Variant Ensemble Learner, CADD Combined Annotation Dependent Depletion, gnomAD The Genome Aggregation Database. DM Disease-causing mutation, D Deleterious, T Tolerated. Variants classified as severe are colored red, mild are colored orange, risk in yellow and VUS are colored purple.

We propose to subclassify the variants p.H529R and p.R534C as probably mild variants, as both were found only in PD patients. The variants p.K13R, p.R78C, p.E427K, and p.A495P are subclassified as probable risk variants. The variant p.K13R is located in the signal peptide region. The variant p.R78C was annotated as “PD susceptibility” in HGMD, with a deleterious impact in CADD. The variant p.E427K was annotated as associated with “parkinsonism” in ClinVar and “reduced activity” in HGMD. We propose to classify the variant p.A97G as probably benign because it is localized in a coil-bend structure and is not close to any known pathogenic variants.

The synonymous variant p.T408T was found in two cases and one healthy control individual. Two established splice-site prediction scores (dbscSNV: ada_score 0.9797 and rf_score 0.85) agreed in their prediction that the variant is likely to affect splicing. HGMD classified the variant as disease mutation (DM) (Supplementary Table 4). Therefore, we propose to classify the variant as a risk variant.

Overall, we propose to classify four VUS as probably severe pathogenic variants (p.Y61H, p.L213P, p.A215D, and p.R434C), two as probably mild pathogenic variants (p.H529R and p.R534C), five as probably pathogenic risk variants (p.K13R, p.R78C, p.E427K, p.A495P, and p.T408T) and one as a probably benign variant (p.A97G) (Fig. 4b).

Discussion

Our study demonstrated in a large cohort the utility of targeted PacBio DNA sequencing for GBA1 as a highly sensitive and specific method to identify known and novel GBA1 variants and to overcome the problems posed by the GBAP1 pseudogene by avoiding its amplification. The effectiveness of the targeted PacBio DNA sequencing method in investigating relevant genes with homologous pseudogenes has also been demonstrated in several other studies13,14,15,16. Both the PacBio method and the WGS method combined with the new Gauchian tool showed very high accuracy, with 100% of the detected variants validated as true positives. The comparative study that we performed with the different screening technologies to detect GBA1 variants will help researchers to get a more accurate and comprehensive overview of GBA1 variants. This implies a more critical evaluation of the results obtained by NeuroChip, which revealed a high proportion of false-positive and false-negative results, and of those obtained by WGS, which depend on the detection tool used for the complex GBA1 region. Our study still has the limitation that we cannot fully exclude variants missed by all three methods (false negatives). Long-read DNA sequencing excels in the detection of structural variants. However, the method employed in this study relies on a single amplicon, which limits its efficiency in detecting structural variants because only amplicons of specific sizes are generated and purified. Nevertheless, the PacBio-based method can be a cost-effective (30€/sample) alternative for the high-fidelity calling of GBA1 variants. GBA1 variants have been identified as the most common genetic risk factor for the development of PD. GBA1 variants have typically been observed in 4%–12% of PD patients in different populations worldwide, with the highest prevalence of 20% described in Ashkenazi Jewish PD patients17,18.
Large differences in prevalence were observed depending on the ethnicity of the cohort, the variants studied, and the sequencing method used. Previous studies looking only at coding regions reported frequencies of 14.3% in Italians19 (n = 874), 11.7% in southern Spanish20 (n = 532), 9.2% in New Zealanders of European descent5 (n = 229) and 8.3% in Irish21 (n = 314) (Supplementary Table 8). Our study describes the landscape of GBA1 carriers in the Luxembourgish population, showing a high prevalence (12.1%) of GBA1 variants that could be the major genetic risk factor of PD in Luxembourg. Moreover, we observed a significantly higher proportion of pathogenic (severe, mild and risk) GBA1 variants in PD patients compared to HC (10.5% vs 4.3%; OR = 2.6; CI = [1.6,4.1], p = 0.0001). Using the new PacBio sequencing method, the Luxembourg Parkinson’s study showed a frequency of PDGBA1 carriers comparable to those reported so far in similarly sized Italian19 and Spanish20 cohorts (Supplementary Table 8). When comparing previous reports of GBA1 variants in different populations, we want to highlight the fact that only cohorts that used full Sanger sequencing were able to detect the RecNciI recombinant allele so far. This once more emphasizes the accuracy of the PacBio sequencing method for detecting rare and complex GBA1 variants. Additionally, we confirmed that severe variants showed a higher OR than risk variants, which supports the concept of graded risk for different GBA1 variants in PDGBA1 carriers20.

The most prevalent GBA1 variant in the Luxembourg Parkinson’s study was p.E365K, and the frequency of this variant was similar to what was described in the Irish21, Spanish20, and New Zealand5 populations. It is interesting to note that homozygous carriers of the p.E365K variant do not develop GD22. This variant is associated with PD, and multiple studies have found enrichments varying from 1.60 to 3.3423,24,25. Furthermore, carriers of the risk variants p.E365K and p.T408M could be associated with atypical parkinsonism, as these variants were the only ones also present in patients with DLB and PSP in our cohort. Whether this is simply related to the higher frequency of these risk variants in the general population or does have a specific impact on the phenotype needs to be determined in larger studies focusing on GBA1 variants in atypical parkinsonism26.

We present a concept for classifying VUS in the GBA1 gene according to the localization in relation to known variants in sequence and 3D structure, which may help to provide access to future targeted therapies for these patients. Here additional in vitro and ex vivo studies are needed to functionally validate the impact of these VUS on GCase function in neurons derived from stem cells or in enzyme-activity assays in cerebrospinal fluid of affected carriers of these VUS.

Additionally, we observed that the average AAO in PD was about four years younger in severe GBA1 carriers compared to non-GBA1 carriers. This was also observed in previous studies, which showed that PDGBA1 patients generally have an earlier AAO compared to non-carriers with a median onset in the early fifties27,28.

Recent studies have shown that PDGBA1 carriers have a higher prevalence of cognitive impairment19,29,30 and non-motor symptoms including neuropsychiatric disturbances19,20, autonomic dysfunction29, and sleep disturbances such as RBD31. Although not significant after p value adjustment, we found a similar trend: motor symptoms such as gait disorder and non-motor symptoms such as depression and hallucinations were associated with a more aggressive clinical phenotype in severe GBA1 carriers, supporting the effect of differential GBA1 variant severity20,32.

In conclusion, this study showed the utility of targeted PacBio DNA sequencing to identify known and novel GBA1 variants with high accuracy. These findings offer important access to variant-specific counseling. Furthermore, our study describes the full landscape of GBA1-related PD in the current Luxembourgish population showing the high prevalence of GBA1 variants as the major genetic risk in PD.

Methods

Clinical cohort

At the time of analysis, the Luxembourg Parkinson’s study comprised 1568 participants (760 patients with parkinsonism and 808 healthy controls (HC)) in the frame of the National Centre for Excellence in Research on Parkinson’s disease program (NCER-PD).

All patients complied with the diagnostic criteria of typical PD or atypical parkinsonism as assessed by neurological examination following the United Kingdom Parkinson’s Disease Society Brain Bank (UKPDSBB) diagnostic criteria33: 660 fulfilled the criteria for PD, 60 for progressive supranuclear palsy (PSP) including corticobasal syndrome as a subtype of PSP (PSP-CBS), 25 for Dementia with Lewy Body (DLB), 14 for Multiple System Atrophy, and one for Fronto-temporal dementia with parkinsonism. All patients and HC underwent a comprehensive clinical assessment of motor and non-motor symptoms, neuropsychological profile and medical history along with comorbidities. The clinical symptoms assessed, and scales applied are defined in the Supplemental Information34. All individuals provided written informed consent. The patients were reassessed at regular follow-up visits every year and the HC every 4 years. We considered early-onset PD patients those with AAO equal to or younger than 45 years35. The genotype-phenotype analysis was based on the assessment of the first visit. The final diagnosis was taken according to the last visit. The study has been approved by the National Research Ethics Committee (CNER Ref: 201407/13 and 202304/03).

NeuroChip array

Genotyping was carried out on the Infinium® NeuroChip Consortium Array9 (v.1.0 and v1.1; Illumina, San Diego, CA USA). For rare variant analysis, standard quality control (QC) procedures were conducted using PLINK v1.936 to remove variants with a low genotyping rate (<95%) or a Hardy-Weinberg equilibrium p value < 1 × 10⁻⁶. As an additional quality filter, we applied a study-wide allele frequency threshold of <1% in our cohort for rare variants. For further statistical analysis, we excluded individuals of non-European ancestry: we merged our data with the 1000 genomes dataset37 and performed multidimensional scaling with PLINK 1.9. We selected only samples of European ancestry, excluding those more than ±3 SD away based on the first and the second principal components.
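The ±3 SD ancestry filter can be sketched as below. This is a minimal illustration under two assumptions not stated in the text: that the reference distribution is that of European reference samples, and that a sample is excluded when it exceeds the threshold on either component. All coordinates are hypothetical.

```python
from statistics import mean, stdev


def within_reference_range(sample_pcs, reference_pcs, n_sd=3.0):
    """Return True if the sample lies within n_sd SD of the reference mean on PC1 and PC2."""
    for axis in range(2):
        reference_axis = [pcs[axis] for pcs in reference_pcs]
        center = mean(reference_axis)
        spread = stdev(reference_axis)
        if abs(sample_pcs[axis] - center) > n_sd * spread:
            return False
    return True


# Hypothetical (PC1, PC2) coordinates of European reference samples.
european_reference = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1), (0.05, 0.05), (-0.05, -0.05)]

inlier_kept = within_reference_range((0.02, 0.03), european_reference)
outlier_kept = within_reference_range((0.5, 0.0), european_reference)
print(inlier_kept, outlier_kept)
```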

GBA1-targeted PacBio DNA long-read amplicon sequencing

The targeted GBA1 gene screening was performed by single-molecule real-time (SMRT) long-read sequencing8 using a Sequel II instrument (PacBio). The targeted GBA1 gene coordinates were chr1:155,232,501-155,241,415 (UCSC GRCh38/hg38). Long-distance PCR was performed using GBA1-specific primer sequences (Forward: 5′-GCTCCTAAAGTTGTCACCCATACATG-3′ and Reverse: 5′-CCAACCTTTCTTCCTTCTTCTCAA-3′)38, which avoid GBAP1 pseudogene amplification, and the 2x KAPA HiFi Hot Start ReadyMix (Roche). For sample multiplexing, dual asymmetric barcoding was used based on a different 16-bp long index sequence upstream of each of the reverse and forward primers to allow the generation of uniquely barcoded amplicons in one-step PCR amplification. QC was performed prior to pooling. Pools of amplicons were purified with AMPure PacBio beads. A total of 1700 ng of purified amplicon pool was used as input for the SMRTbell library using the SMRTbell Express Template Prep Kit 2.0 (PacBio). Binding of the polymerase and diffusion loading on the SMRT Cell 8M were prepared according to SMRT Link instructions with CCS reads as sequencing mode (SMRT Link version 9.0.0.92188). We generated high-quality consensus reads using the PacBio Sequel II sequencer in Circular Consensus Sequencing (CCS) mode using the pbccs (v6.0.0) tool. The method replicates both strands of the target DNA39. We demultiplexed and mapped reads from each sample to the human reference genome GRCh38 using minimap240 from the pbmm2 package (v1.4.0) (https://github.com/PacificBiosciences/pbmm2). We used the MultiQC41 tool and selected samples with more than 30-fold coverage. For variant calling, we used DeepVariant42 (v1.0) with models optimized for CCS reads. Finally, we selected variants with quality above 30 (QUAL > 30).
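The two thresholds at the end of this pipeline (more than 30-fold coverage per sample, QUAL > 30 per variant) can be illustrated with a small filter over VCF data lines. This is a sketch only: the helper names are hypothetical, the example records are made up, and the study's actual filtering commands are not given in the text.

```python
def passes_qual_filter(vcf_line: str, min_qual: float = 30.0) -> bool:
    """Return True when a VCF data line has QUAL strictly above min_qual."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]  # VCF columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
    if qual == ".":   # a missing QUAL value cannot pass a quality threshold
        return False
    return float(qual) > min_qual


def passes_coverage_filter(mean_coverage: float, min_coverage: float = 30.0) -> bool:
    """Return True when a sample's mean coverage exceeds min_coverage."""
    return mean_coverage > min_coverage


# Hypothetical records inside the amplicon; positions and alleles are made up.
example_lines = [
    "chr1\t155235000\t.\tA\tG\t45.2\tPASS\t.",
    "chr1\t155236000\t.\tC\tT\t12.0\tPASS\t.",
]
kept = [line for line in example_lines if passes_qual_filter(line)]
print(len(kept), "of", len(example_lines), "records kept")
```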

Whole genome sequencing

The TruSeq Nano DNA Library Prep Kit (Illumina, San Diego, CA, USA) and MGIEasy FS DNA Prep kit (BGI, China) were used according to the manufacturer’s instructions to construct the WGS library. Paired-end sequencing was performed with the Illumina NovaSeq 600043 and on the MGI G400 sequencers. A QC of the raw data was performed using FastQC (version 0.11.9: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). To call the variants, we used the Bio-IT Illumina Dynamic Read Analysis for GENomics (DRAGEN) DNA pipeline44 v445 with standard parameters and with or without the ‘GBA caller’ option, which uses the Gauchian tool. To select the high-quality variants, we annotated and selected variants using VariantAnnotator and SelectVariants modules of the Genome Analysis Toolkit (GATK 4)46 pipeline and applied the following additional filtering steps: VariantFiltration module for SNVs (QD < 2, FS > 60, MQ < 40, MQRankSum < −12, ReadPosRankSum < −8, DP < 10.0, QUAL < 30, VQSLOD < 0, ABHet > 0.75 or <0.25, SOR > 3 and LOD < 0), and insertions-deletions (QD < 2, FS > 200, QUAL < 30, ReadPosRankSum < −20, DP < 10 and GQ_MEAN < 20).

Variant annotation and validation

Variant annotation was done with ANNOVAR47, using the Genome Aggregation Database (gnomAD r2.1)48, the Human Gene Mutation Database (HGMD)49 and ClinVar50, and the Combined Annotation Dependent Depletion (CADD)51 and Rare Exome Variant Ensemble Learner (REVEL)52 to score the pathogenicity of missense variants53. For variants in splice sites, we used the ada_score and rf_score from dbscSNV (version 1.1)54. Ada_score ≥ 0.6 or rf_score ≥ 0.6 indicate that the variant is likely to affect splicing.
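The splicing rule stated above is a simple either-or threshold on the two dbscSNV scores. A minimal sketch, with a hypothetical function name, applied to the p.T408T scores reported in the Results:

```python
def likely_affects_splicing(ada_score: float, rf_score: float, threshold: float = 0.6) -> bool:
    """Apply the rule: ada_score >= threshold or rf_score >= threshold."""
    return ada_score >= threshold or rf_score >= threshold


# dbscSNV scores reported in the text for the synonymous variant p.T408T.
t408t_flagged = likely_affects_splicing(ada_score=0.9797, rf_score=0.85)
print(t408t_flagged)
```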

Rare variants were selected according to MAF < 1% in gnomAD for the Non-Finnish European (NFE) population in the ‘non-neuro’ gnomAD subset. Then, exonic and splicing variants (±2 bp from the exon boundary) were selected for autosomal dominant (LRRK2, SNCA, VPS35, GBA1) and autosomal recessive (PRKN, PINK1, PARK7, ATP13A2) PD genes. Rare variants within these genes were then confirmed by Sanger sequencing55.

CNVs in PD genes

To detect the presence of copy number variants (CNVs) in six selected PD genes (PARK7, ATP13A2, PINK1, SNCA, GBA1, and PRKN), we used the PennCNV tool (v1.0.5)56 on the NeuroChip array data, applying the same filtering steps as previously described for CNV calls in PD11. The multiplex ligation-dependent probe amplification method, which exclusively targets the selected genes, was used to validate the CNVs. Six patients, each with one CNV in one of the six PD genes, were found; no CNV was found in GBA1. To detect CNVs within the GBA1 gene in the PacBio data, we employed the pbsv tool (version 2.9.0) (https://github.com/PacificBiosciences/pbbioconda), which is specifically designed for PacBio long-read data. This tool successfully identifies 59.46% of structural variants with precision57,58.

GBA1 variant nomenclature

All variants in GBA1 were annotated based on GRCh37 and were numbered according to the current variant nomenclature guidelines (http://varnomen.hgvs.org), based on the primary translation product (NM_001005742), which includes the 39-residue signal peptide.
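Because the numbering used here includes the 39-residue signal peptide, positions are offset by 39 from the older mature-protein numbering found in much of the GBA1 literature (for example, p.N409S corresponds to the historical N370S). The helper below is a hypothetical sketch of that conversion for simple single-letter substitutions; it is not part of the study's pipeline.

```python
import re

SIGNAL_PEPTIDE_LENGTH = 39  # residues, as stated in the nomenclature section


def to_legacy_numbering(protein_change: str) -> str:
    """Convert e.g. 'p.N409S' to mature-protein numbering by subtracting 39."""
    match = re.fullmatch(r"p\.([A-Z])(\d+)([A-Z*])", protein_change)
    if match is None:
        raise ValueError(f"unsupported protein change: {protein_change}")
    ref, position, alt = match.groups()
    legacy_position = int(position) - SIGNAL_PEPTIDE_LENGTH
    if legacy_position <= 0:
        # Residues inside the signal peptide have no positive mature-protein number.
        raise ValueError(f"{protein_change} lies within the signal peptide")
    return f"p.{ref}{legacy_position}{alt}"


for change in ["p.N409S", "p.L483P", "p.E365K", "p.T408M"]:
    print(change, "->", to_legacy_numbering(change))
```

Variants in the signal peptide, such as p.K13R, fall outside the mature protein, so the sketch raises an error for them rather than inventing a number.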

GBA1 variant classification

GBA1 variant classification was done according to the PD literature, based on the work of Höglinger and colleagues in 202212. Exonic or splice-site variants that are not mentioned in that paper were subclassified as “severe” GBA1 variants if they were annotated as pathogenic in ClinVar; otherwise, they were subclassified as variants of unknown significance (VUS)51.

Statistical analysis

To assess the frequency of different GBA1 variant types and to analyze the genotype–phenotype associations in the Luxembourg Parkinson’s Study, we considered only unrelated individuals and kept only one proband per family. For cases, we kept the patient with the earliest AAO. To account for age-dependent penetrance, we excluded HC with an AAA of less than 60 years who had first-degree relatives (parents, siblings, and offspring) with PD. This reduced the age difference between cases and HC. We also excluded carriers of rare variants or CNVs in PD-associated genes (except GBA1) and individuals of non-European ancestry. Thus, 1410 unrelated individuals (735 patients and 675 HC) were selected for the statistical analysis.

We used regression models to assess the effect of PDGBA1 carrier status on the clinical variables. In these models, the dependent variable was the clinical outcome, while the predictor was GBA1 carrier status. We excluded individuals carrying only VUS or synonymous variants. To this aim, we performed three types of association tests: (1) all PDGBA1 pathogenic variant carriers (severe, mild and risk) vs. PDGBA1-non-carriers, (2) for each sub-group of PDGBA1 pathogenic variant carriers vs. PDGBA1-non-carriers, (3) severe PDGBA1 pathogenic variant carriers vs combined mild and risk PDGBA1 pathogenic variant carriers. The effect of each factor was expressed as the Beta (β) regression coefficient. The odds ratio (OR) along with a 95% confidence interval (CI) was used to assess whether a given exposure was a risk factor for a given outcome. Regression models were adjusted for AAA, sex, and disease duration. FDR-adjusted p value < 0.05 was considered as statistically significant.
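The text specifies an FDR adjustment without naming the procedure. A minimal sketch, assuming the widely used Benjamini-Hochberg step-up method, is shown below; the p-values are hypothetical and the function name is illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical uncorrected p-values from four association tests.
raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

With more tests, each uncorrected p-value is multiplied by a larger factor, which is how nominally significant associations can lose significance after adjustment.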

Structure-based evaluation of VUS

To evaluate VUS, we implemented a method to assign pathogenicity based on the REVEL53 and CADD51 scores for missense variants and the dbscSNV scores (ada_score and rf_score) for splice variants according to the dbNSFP54 definition, as well as on whether the variants were present in patients. We reclassified a VUS (1) as “severe” if the variant was present only in patients and with deleterious effect in all scores or present only in patients with early-onset PD, (2) as “mild” if the variant was present only in patients and with tolerated effect in all scores, (3) as “risk” if present in patients and HCs or with tolerated and deleterious effect in either score or annotated as “PD susceptibility” in HGMD, and (4) as “benign” if present only in HC.
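The four rules above can be written as a small decision function. This is a sketch under explicit assumptions: the rules are evaluated in the order listed, score predictions are reduced to "D" (deleterious) or "T" (tolerated) labels, and any variant matching no rule stays a VUS. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VusEvidence:
    in_patients: bool
    in_controls: bool
    only_early_onset_patients: bool
    score_calls: tuple  # one "D" or "T" label per in-silico score
    hgmd_pd_susceptibility: bool = False


def reclassify_vus(evidence: VusEvidence) -> str:
    """Apply rules (1) to (4) in the listed order; precedence is an assumption."""
    only_patients = evidence.in_patients and not evidence.in_controls
    only_controls = evidence.in_controls and not evidence.in_patients
    all_deleterious = all(call == "D" for call in evidence.score_calls)
    all_tolerated = all(call == "T" for call in evidence.score_calls)
    mixed_scores = not all_deleterious and not all_tolerated

    if only_patients and (all_deleterious or evidence.only_early_onset_patients):
        return "severe"
    if only_patients and all_tolerated:
        return "mild"
    if ((evidence.in_patients and evidence.in_controls) or mixed_scores
            or evidence.hgmd_pd_susceptibility):
        return "risk"
    if only_controls:
        return "benign"
    return "VUS"


example = VusEvidence(in_patients=True, in_controls=False,
                      only_early_onset_patients=False, score_calls=("D", "D"))
print(reclassify_vus(example))
```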

We mapped the known pathogenic missense variants and newly identified VUS in our cohort together with all reported population variants from gnomAD onto the GBA1 protein sequence and the 3D structure. We used the X-ray structure of GCase at 2.0 Å resolution (PDB structure accession code 1ogs; https://www.rcsb.org/) (Supplementary Fig. 2). Analysis of the 3D structure was carried out using PyMOL (http://www.pymol.org). VUS were evaluated as a risk variant if they were 2 bp positions away in sequence or had a C-alpha distance of less than 5 Å in 3D from another known pathogenic variant similar to the approach used by Johannesen et al.59.
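The proximity criterion can be sketched as a check on sequence distance or C-alpha distance. The coordinates below are hypothetical rather than taken from PDB entry 1ogs, reading the sequence criterion as "within two positions" is an assumption, and residue positions are assumed to use one consistent numbering scheme.

```python
import math


def is_near_known_variant(vus_position, vus_ca, known_variants,
                          max_sequence_gap=2, max_ca_distance=5.0):
    """Check sequence or 3D proximity of a VUS to known pathogenic residues.

    known_variants maps residue position -> C-alpha (x, y, z) in angstroms.
    """
    for position, ca in known_variants.items():
        close_in_sequence = abs(position - vus_position) <= max_sequence_gap
        close_in_space = math.dist(vus_ca, ca) < max_ca_distance
        if close_in_sequence or close_in_space:
            return True
    return False


# Hypothetical C-alpha coordinates for two known pathogenic residues.
known = {62: (0.0, 0.0, 0.0), 433: (10.0, 0.0, 0.0)}

near_by_sequence = is_near_known_variant(61, (20.0, 20.0, 20.0), known)
near_by_structure = is_near_known_variant(300, (12.0, 0.0, 0.0), known)
not_near = is_near_known_variant(200, (50.0, 50.0, 50.0), known)
print(near_by_sequence, near_by_structure, not_near)
```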