Curated incidence of lysosomal storage diseases from the Taiwan Biobank

Lysosomal storage diseases (LSDs) are a group of metabolic disorders resulting from a deficiency in one of the lysosomal hydrolases. Most LSDs are inherited in an autosomal or X-linked recessive manner. As LSDs are rare, their true incidence in Taiwan remains unknown. In this study, we used high-coverage whole-genome sequencing data from 1,495 Taiwanese individuals obtained from the Taiwan Biobank. We found 3,826 variants in 71 genes responsible for autosomal recessive LSDs. We first excluded benign variants by allele frequency and other criteria. As a result, 270 variants were considered potentially disease-causing. We curated these variants using published guidelines from the American College of Medical Genetics and Genomics (ACMG). Our results revealed a combined incidence rate of 13 per 100,000 (conservative estimation based on pathogenic and likely pathogenic variants; 95% CI 6.92-22.23) to 94 per 100,000 (extended estimation that also includes variants of unknown significance; 95% CI 75.96-115.03) among 71 autosomal recessive disease-associated genes. The conservative estimations were similar to those in published clinical data. No disease-causing mutations were found for 18 other diseases; thus, these diseases are likely extremely rare in Taiwan. The study results are important for designing screening and treatment methods for LSDs in Taiwan and demonstrate the importance of mutation curation to avoid overestimating disease incidence from genomic data.


Introduction
Lysosomal storage diseases (LSDs) are a group of genetic disorders resulting from a deficiency in one of the lysosomal hydrolases. 1 Most LSDs are inherited in an autosomal (most common) or X-linked recessive (mucopolysaccharidosis [MPS] type II and Fabry disease) manner. 1 Reported epidemiological data for LSDs vary across countries. In one review article, the combined birth prevalence of LSDs was reported to range from 7.5 per 100,000 live births in British Columbia to 23.5 per 100,000 in the United Arab Emirates (UAE). 2 The overall birth prevalence of 29 different LSDs studied in the Portuguese population was calculated as 25 per 100,000 live births. 3 The incidence of MPS in the US has been reported as 0.98 per 100,000 live births, with a prevalence of 2.67 per 1 million. 4 The distribution and demographic characteristics of subtypes of LSDs also vary across countries, and one report from Eastern China indicated that MPS represented 50.5% of all LSDs. 5 The true incidence of LSDs is unknown in Taiwan due to their rarity.
Understanding the incidence of rare diseases is critical for designing screening and treatment methods.
The clinical diagnosis of rare diseases is often delayed, and patients miss the opportunity to receive treatment. Clinical variants can be missed entirely in the diagnostic process. Therefore, screening methods, such as newborn and high-risk screening, are often considered for rare diseases. Newborn screening allows early diagnosis and treatment, even in the presymptomatic period, 6 as shown in the screening of Pompe disease and spinal muscular atrophy in Taiwan. 7,8 However, the design of screening approaches can be challenging when the estimates of disease incidence are incorrect. New treatments for rare diseases have recently emerged. These treatments are often expensive, and insurance companies or public health agencies must estimate the treatment cost before making them available, which is impossible without accurate incidence data. 9
Next-generation sequencing (NGS) has generated a large amount of genomic data from patients and normal populations. These data provide an excellent opportunity to directly estimate disease incidence. 10 For rare diseases, the number of patients with genomic data may be insufficient to accurately estimate disease incidence. Fortunately, for recessive diseases, the frequency of carriers is much higher than that of affected individuals. Thus, disease incidence can be calculated from the carrier rate. However, disease incidence is often overestimated in studies employing population genomic data because benign variants or variants of unknown significance are counted as disease-causing. 11 In this study, we explored the incidence of LSDs using carrier data obtained from the Taiwan Biobank (TWB). We curated all possible disease-causing variants and made a reasonable estimation of the incidence of LSDs in Taiwan.

Results
A total of 51,963 variants in 74 genes related to LSDs were identified in data from 1,495 Taiwanese individuals obtained from the TWB, with an average of 34 variants per gene (range 0 to 147). We retained the 71 genes associated with autosomal recessive LSDs and the variants with an allele frequency (AF) < 0.05 located in or near an exon; 1,003 variants remained for the subsequent analyses. A total of 270 variants in 53 genes were either reported in ClinVar or the HGMD as "pathogenic" or were unreported but had a high predicted severity score (over 7 across the 13 prediction tools available from ANNOVAR). No variants were found in the following 18 genes: NPC1, NPC2, ATG5, ATG7, BLOC1S3, CLCN5, OCRL, CLN4, CLN7, CTSK, DTNBP1, FIG4, GM2A, MCOLN1, SUMF1, mTORC1, SLC38A9, and SLC9A5. We then curated the pathogenicity of these 270 variants according to the 2015 ACMG criteria (Supplementary Table 1). 12 Twenty-seven variants were classified as pathogenic, 48 as likely pathogenic, 131 as variants of unknown significance (VUS), and 64 as benign or likely benign (Fig. 1, Fig. 2). Overall, 15 of the 53 genes carried VUS but no known pathogenic or likely pathogenic variants. Another 3 carried only benign or likely benign variants.
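The filtering steps described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline actually used: the `Variant` record, its field names, the placeholder gene names, and the reading of the "severity score" as the number of prediction tools (out of 13) that call a variant damaging are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

AF_CUTOFF = 0.05      # allele-frequency threshold stated in the text
SEVERITY_CUTOFF = 7   # "severity score over 7" across 13 prediction tools


@dataclass
class Variant:
    """Hypothetical record; field names are illustrative only."""
    gene: str
    allele_frequency: float
    in_or_near_exon: bool
    database_label: Optional[str]  # ClinVar/HGMD label, None if unreported
    severity_score: int            # assumed range 0..13


def passes_first_filter(v: Variant) -> bool:
    """Step 1: rare variants located in or near an exon."""
    return v.allele_frequency < AF_CUTOFF and v.in_or_near_exon


def is_candidate(v: Variant) -> bool:
    """Step 2: reported as pathogenic, or unreported with a high score."""
    if not passes_first_filter(v):
        return False
    if v.database_label is not None:
        return v.database_label == "pathogenic"
    return v.severity_score > SEVERITY_CUTOFF


def select_candidates(variants: List[Variant]) -> List[Variant]:
    """Variants that would go forward to manual ACMG curation."""
    return [v for v in variants if is_candidate(v)]


# Made-up example records (not data from the study).
examples = [
    Variant("GENE_A", 0.001, True, "pathogenic", 10),  # kept: reported
    Variant("GENE_A", 0.200, True, "pathogenic", 10),  # dropped: common
    Variant("GENE_B", 0.002, True, None, 9),           # kept: high score
    Variant("GENE_B", 0.002, True, None, 4),           # dropped: low score
    Variant("GENE_C", 0.002, False, None, 12),         # dropped: not exonic
]
kept = select_candidates(examples)
print(len(kept))
```

Variants that pass this automated step are only candidates; in the study each one was then curated manually against the ACMG criteria.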
We calculated the estimated conservative disease incidence by including only pathogenic and likely pathogenic variants (P + LP) and the estimated extended incidence by incorporating VUS (P + LP + VUS).
We compared the current data with previously published prevalence data (Fig. 3) derived from retrospective clinical observation and newborn population screening. Regarding MPSs, our conservative incidence estimation was 2.21 per 100,000 (95% CI: 0.056-12.313), which did not differ significantly (p = 0.549) from the reported combined birth prevalence of MPS excluding MPS II, which was 0.97 per 100,000 live births. 13 The extended incidences of total MPS and MPS types IIIA, IIIC and IVA were significantly higher than the incidences reported by Lin et al. (Fig. 3). 13 However, compared with the incidence observed in newborn screening for MPS types I, IIIB and IVA, 24 our conservatively calculated incidence was significantly lower (Fig. 3).
The incidence of Pompe disease (glycogen storage disease type II) was also assessed using newborn screening data. 14 The incidence (prevalence at birth) was reported to be 55 in 994,975 from 2005 to 2018 (5.53 per 100,000). 15 Our conservative estimation for Pompe disease was 4.23 per 100,000 and thus did not significantly differ from the reported incidence (Fig. 3).

Discussion
Biobank data are valuable for estimating the burden of rare genetic diseases
Here, using genetic data obtained from the TWB, we estimated that the combined incidence of 71 autosomal recessive LSDs is between 13 per 100,000 (pathogenic and likely pathogenic variants) and 94 per 100,000 (pathogenic and likely pathogenic variants plus variants of unknown significance). This incidence range is considerably higher than the reported prevalence among clinical cases but similar to that obtained through newborn screening. LSDs are very rare, and diagnoses are often delayed or missed; therefore, an accurate estimation of the incidence of these diseases in Taiwan from clinical cases is challenging, if not impossible. Estimation from genome-wide sequencing databases, as conducted in this study, and unbiased population screening are alternative methods for understanding the true incidence of these diseases. These approaches can assist in the development of policies that address the burden of rare diseases.

Clinical and prediction incidence comparison: underdiagnosis, cases not registered, variant definitions
The conservative estimation data in the current study were more similar to the incidence rates observed in the clinic than the extended estimation data. For example, regarding MPS I, the conservative incidence estimate (0.03; 95% CI: 0.001-0.17) is similar to the published incidence in Taiwan (0.11; 95% CI: 0.003-0.61), 13 confirming that this is an extremely rare disease. However, the extended and newborn screening incidences demonstrated a wider estimation range, implying that a milder or late-onset phenotype may exist that is not easily recognized by clinicians. Further understanding of the pathogenicity of VUS, either with functional data or with long-term follow-up data obtained through newborn screening, may further elucidate the true incidence of MPS I.
Although we analyzed limited genomic data in this study, the general incidence obtained is similar to that obtained in a previously published large-scale biobank study of the same population. 16 The carrier rate of Krabbe disease (GALC gene) in the previous study was estimated to be 1.67%, similar to the current study's estimate (0.2-2.18%). Regarding mucolipidosis type II/III (GNPTAB), the previous estimate was 0.44% and the current estimate is 0.3-1%; the difference between the two estimates is not significant.
The comparison of Pompe disease and GAA carrier incidence between the previous study and ours is more indirect, as the previous group calculated only the allele frequency (0.38%) of GAA variants causing infantile-onset Pompe disease among 103,106 individuals in Taiwan; 16 however, we included late-onset and infantile-onset Pompe disease, yielding a conservative allele frequency of 0.65%. Overall, our data support the use of a small dataset for validation instead of a large dataset such as the full biobank. The use of TWB 2.0 may decrease the ability to detect rare variants in rare diseases. However, the current study demonstrated no differences when using larger SNP chip datasets versus comprehensive WGS data from a small population. Further validation will be required when more WGS data become available.
In this study, the allele frequency in Taiwanese individuals was too low to calculate the variant incidence for 18 genes, and an additional 18 genes without pathogenic variants were recorded. For example, for NPC1 and NPC2, which cause Niemann-Pick disease type C, no NPC1 variants were identified in the WGS data from the 1,495 individuals in the TWB, and only one NPC2 variant was identified. The NPC2 variant was excluded because its severity score was 4 out of 13. The published prevalence of Niemann-Pick disease type C is 0.25 per 100,000 in the United Arab Emirates and 2.2 per 100,000 in Portugal, 2 which converts to a carrier rate of at least 1 in 400. This range indicates that the variants should have been present among the 1,495 individuals studied here. Our current data suggest an even lower incidence of Niemann-Pick disease type C in Taiwan, although clinical cases have been reported. 7 The existence of selection bias, that is, the prevalence of diseases only in specific populations, requires further study. Selection bias is less likely in our study because of Taiwan's relatively homogeneous Han Chinese population. 12 We are not aware of any clustering of such LSDs in specific populations in this country.
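The conversion from a published prevalence to a carrier rate can be checked with a short calculation. The sketch below assumes Hardy-Weinberg equilibrium and, as a simplification, treats Niemann-Pick disease type C as a single-locus recessive disease even though it is caused by variants in either NPC1 or NPC2; the function name is illustrative.

```python
import math

SAMPLE_SIZE = 1495  # individuals with WGS data in this study


def carrier_rate_from_incidence(incidence: float) -> float:
    """Invert incidence = q^2, then return the carrier rate 2 * (1 - q) * q."""
    q = math.sqrt(incidence)
    return 2.0 * (1.0 - q) * q


# Published prevalences quoted in the text, per 100,000.
for label, per_100k in (("UAE", 0.25), ("Portugal", 2.2)):
    rate = carrier_rate_from_incidence(per_100k / 100_000)
    expected = SAMPLE_SIZE * rate
    print(f"{label}: 1 carrier in {1 / rate:.0f}, "
          f"about {expected:.1f} expected among {SAMPLE_SIZE}")
```

Under these assumptions, a prevalence of 0.25 per 100,000 corresponds to roughly 1 carrier in 317, or about 4 to 5 expected carriers among 1,495 individuals, which is consistent with the statement that carriers should have been observed.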

Different biobank prediction comparison
Many biobanks, such as the Global Biobank Meta-analysis Initiative (GBMI), UK Biobank, Estonian Biobank and China Biobank, have been established worldwide as a result of improvements in NGS techniques. Many researchers have tried to use data from different biobanks to predict the risk or prevalence of different diseases. Most select likely pathogenic variants 16,17 and use the Hardy-Weinberg equation to calculate the disease incidence, as in this study. In addition, we estimated both conservative and extended disease incidences because of the uncertainty of VUS curation and to better bound the disease incidence range. Nevertheless, because biobanks are generated using different types of omics data, such as genotype arrays and WGS, additional caution should be taken when applying the resulting datasets to estimate disease prevalence. Furthermore, population-based biobanks may not represent the general population regarding sociodemographic or health-related characteristics 18 and may not be a suitable resource for determining disease prevalence and incidence rates. UK Biobank has released 50,000 exomes 19 and will add an additional 200,000 exomes to become the largest open-access resource of WES data linked to health records. A better understanding of rare disease incidence is expected in the future following analyses of a larger WES dataset.

Limitations
Our study did not assess X-linked LSDs because the equation for X-linked disorders requires different interpretation methods, especially for those diseases with late-onset phenotypes. For example, newborn screening for Fabry disease by enzyme assay revealed an incidence rate of 1 in 1,250 among Taiwanese males, 20 most of whom had the GLA IVS4+919G>A variant. The incidence rate of the GLA IVS4+919G>A variant is estimated to be 1 in 600 among newborns; 21 however, it is unknown whether those individuals participated without bias in the small WGS dataset used in this study. Therefore, we did not include X-linked LSDs.
Another limitation of our study is associated with the short-read WGS method. For example, the GBA gene recombines with its pseudogene; thus, it is challenging to accurately determine where the variants are located using WGS. Therefore, we could only roughly estimate the incidence of Gaucher disease. However, in the newborn screening data, the incidence range was similar to that estimated from the dataset. 22 Thus, we consider that our results provide useful information for estimating the burden of autosomal recessive LSDs, although further clarification, such as improved methods or data from biochemical screening, may be warranted for specific conditions.
Finally, although we used the WGS database, mutations in deep introns not regarded as critical for splicing may have been missed, and copy number changes were not reported. However, we could not demonstrate a significant difference in incidence when comparing our data to available NGS data regarding biochemical and protein levels, implying a minimal impact of using such genomic data for estimation. When calculating incidence, the possibility of in-cis variants was not considered; thus, the incidence may have been overestimated. The increasing acceptance of preconception carrier screening could also influence the clinical incidence because prenatal diagnosis, and abortion if the fetus is found to be affected, are permitted in Taiwan. Such incidence drift has been observed in thalassemia and spinal muscular atrophy carrier screening, which are performed widely in Taiwan. 23
In conclusion, the current study generated useful incidence data regarding LSDs in Taiwan. Our curated, conservative estimation of incidence could guide public health measures in calculating disease or drug burdens. Our extended estimation could also facilitate newborn and high-risk screening. Incidence estimation from genomic data will improve further as the clinical significance of variants becomes better understood.

Methods
The Taiwan Biobank (TWB)
The TWB is a government-supported database that facilitates biomedical genetic research in the Taiwanese population (https://www.twbiobank.org.tw). The TWB was initiated as an ongoing prospective study in 2012 with a target sample size of 200,000 individuals aged 20-70 with no prior cancer diagnosis. As of November 30th, 2022, 184,577 volunteers have participated in biobanking, and 45,439 have completed the first follow-up round. Whole-genome genotyping was performed for all participants; additional data were obtained from a subset of individuals, including whole-genome sequencing (WGS), DNA methylation, and human leukocyte antigen typing information. The TWB regularly releases deidentified data to the scientific community. In a previous genetic profile study employing TWB data, 21.2% of the population were carriers of mutations for autosomal recessive diseases. 16 In the current study, we used high-coverage WGS data obtained from 1,495 Taiwanese individuals in the TWB to calculate the incidence of LSDs.

LSD gene selection
We selected the genes from the lysosomal disorder and mucopolysaccharidosis panel from Blueprint Genetics (https://blueprintgenetics.com/). Mutations in a total of 74 genes are known to cause LSDs. X chromosome variants, including those in the IDS, LAMP2, and GLA genes, were not included.
The pathogenicity of the variants was determined according to American College of Medical Genetics and Genomics (ACMG) guidelines. 12 Risk alleles were defined as pathogenic (P), likely pathogenic (LP), or variants of unknown significance (VUS). The gene-specific risk allele frequency (q) was defined as the sum of the frequencies of all risk variants in the indicated gene. Linkage between variants within a gene was not assessed. Therefore, the probability of having a risk allele for a disease in the haploid genome of a population was q, and that of not having a risk allele was Q = 1 − q. The carrier rate was then calculated as 2 × Q × q based on Hardy-Weinberg equilibrium. The disease incidence was calculated as q². The calculated LSD incidences were then compared with real-world epidemiological data.
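The calculation described above can be written out as a short sketch. The function names and the per-variant frequencies in the first example are hypothetical; only the formulas (q as the sum of variant frequencies, carrier rate 2 × Q × q, incidence q²) and the 0.65% conservative GAA allele frequency quoted in the Discussion come from the text.

```python
from typing import Iterable


def gene_risk_allele_frequency(variant_frequencies: Iterable[float]) -> float:
    """q: sum of the allele frequencies of all risk variants in one gene."""
    return sum(variant_frequencies)


def carrier_rate(q: float) -> float:
    """Heterozygote frequency under Hardy-Weinberg: 2 * Q * q, Q = 1 - q."""
    return 2.0 * (1.0 - q) * q


def disease_incidence(q: float) -> float:
    """Frequency of individuals with two risk alleles: q^2."""
    return q ** 2


def per_100k(rate: float) -> float:
    """Express a proportion as a rate per 100,000."""
    return rate * 100_000


# Hypothetical per-variant frequencies for a single gene (illustration only).
q_example = gene_risk_allele_frequency([0.0010, 0.0007, 0.0003])
print(f"q = {q_example:.4f}, carrier rate = {carrier_rate(q_example):.4%}")

# Conservative GAA allele frequency quoted in the Discussion (0.65%).
q_gaa = 0.0065
print(f"carrier rate = {carrier_rate(q_gaa):.2%}")
print(f"incidence = {per_100k(disease_incidence(q_gaa)):.2f} per 100,000")
```

With q = 0.65%, q² is about 4.2 per 100,000, in line with the conservative Pompe estimate given in the Results. Because q is a simple sum, two risk variants on the same haplotype are counted twice, which is the in-cis overestimation noted in the Limitations.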

Statistics
The statistical analyses were performed using MedCalc® Statistical Software version 20.2 (MedCalc Software Ltd, Ostend, Belgium; https://www.medcalc.org; 2022). Comparisons of two rates were used to calculate the 95% confidence interval (95% CI) and p value between the estimated and the reported incidence. A p value < 0.05 was considered to indicate significance.
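As an illustration of how an interval for a rate can be obtained, the sketch below computes an exact (Garwood-type) Poisson 95% confidence interval for a count using only the standard library. It is an assumption that the reported rates can be treated as Poisson counts per 100,000; the sketch is not a description of MedCalc's internal method, and the helper names are hypothetical.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam)."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def bisect(f, lo: float, hi: float, tol: float = 1e-10) -> float:
    """Root of f on [lo, hi]; f(lo) and f(hi) must differ in sign."""
    f_lo = f(lo)
    for _ in range(200):
        mid = (lo + hi) / 2.0
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2.0


def exact_poisson_ci(count: int, alpha: float = 0.05):
    """Exact two-sided confidence interval for a Poisson mean."""
    if count == 0:
        lower = 0.0
    else:
        # Lower limit: P(X >= count | lam) = alpha / 2
        lower = bisect(
            lambda lam: (1.0 - poisson_cdf(count - 1, lam)) - alpha / 2,
            0.0, float(count))
    # Upper limit: P(X <= count | lam) = alpha / 2
    search_max = count + 10.0 * math.sqrt(count + 1.0) + 10.0
    upper = bisect(
        lambda lam: poisson_cdf(count, lam) - alpha / 2,
        float(count), search_max)
    return lower, upper


lower_13, upper_13 = exact_poisson_ci(13)
print(f"13 per 100,000: 95% CI {lower_13:.2f}-{upper_13:.2f}")
```

For a count of 13, this gives approximately 6.92 to 22.23, which appears consistent with the interval quoted for the conservative combined incidence; the same function applied to 94 gives roughly 75.96 to 115.03.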

Figure 3 is available in the Supplementary Files section.