Introduction

Lysosomal storage diseases (LSDs) are a group of genetic disorders resulting from a deficiency in one of the lysosomal hydrolases1. Most LSDs are inherited in an autosomal (most common) or X-linked recessive (mucopolysaccharidosis type II and Fabry disease) manner1. Reported epidemiological data for LSDs vary across countries. In one review article, the combined birth prevalence of LSDs was reported to range from 7.5 per 100,000 live births in British Columbia to 23.5 per 100,000 in the United Arab Emirates (UAE)2. The overall birth prevalence of 29 different LSDs studied in the Portuguese population was calculated as 25 per 100,000 live births3. The incidence of mucopolysaccharidosis (MPS) in the United States has been reported as 0.98 per 100,000 live births, with a prevalence of 2.67 per 1 million4. The distribution and demographic characteristics of LSD subtypes also vary across countries, and one report from Eastern China indicated that MPS represented 50.5% of all LSDs5. The true incidence of LSDs in Taiwan is unknown due to their rarity.

Understanding the incidence of rare diseases is critical for designing screening and treatment methods. The clinical diagnosis of rare diseases is often delayed, and patients miss the opportunity to receive treatment. Clinical variants can be missed entirely in the diagnostic process. Therefore, screening methods, such as newborn and high-risk screening, are often considered for rare diseases. Newborn screening allows early diagnosis and treatment, even in the presymptomatic period6, as shown in the screening of Pompe disease (glycogen storage disease II) and spinal muscular atrophy in Taiwan7,8. However, the design of screening approaches can be challenging when the estimates of disease incidences are incorrect. New treatments for rare diseases have recently emerged. These treatments are often expensive, and insurance companies or public health agencies must estimate the treatment cost before making it available, which is impossible without accurate incidence data9.

Next-generation sequencing (NGS) has generated a large amount of genomic data from patients and normal populations. These data provide an excellent opportunity to directly estimate disease incidences10. For rare diseases, the number of patients with genomic data may be insufficient to accurately estimate disease incidence. Fortunately, for recessive diseases, the frequency of carriers is much higher than that of affected individuals. Thus, disease incidence can be calculated from the carrier rate. However, disease incidence is often overestimated in studies employing population genomic data because benign variants or variants of unknown significance are counted as disease-causing11. In this study, we explored the incidence of LSDs using carrier data obtained from the Taiwan Biobank (TWB). We curated all possible disease-causing variants and made a reasonable estimation of the incidence of LSDs in Taiwan.

Results

The overall curated incidence of autosomal recessive LSDs

A total of 51,963 variants in 74 genes related to LSDs were identified in data from 1495 Taiwanese individuals obtained from the TWB, with an average of 34 variants per gene (range 0–147). We retained the 71 genes associated with autosomal recessive LSDs and the variants with an allele frequency <0.05 located in or near the exon region; 1003 variants remained for the subsequent analyses. A total of 270 variants in 53 genes were either reported in ClinVar or the HGMD as “pathogenic” or unreported but with a high predicted severity score (severity score over 7 in the 13 prediction tools available from ANNOVAR). No variants were found in the following 18 genes: NPC1, NPC2, ATG5, ATG7, BLOC1S3, CLCN5, OCRL, CLN4, CLN7, CTSK, DTNBP1, FIG4, GM2A, MCOLN1, SUMF1, mTORC1, SLC38A9, and SLC9A5. We then curated the pathogenicity of these 270 variants according to the 2015 ACMG criteria (Supplementary Table 1)12. Twenty-seven variants were classified as pathogenic, 48 as likely pathogenic, 131 as VUS, and 64 as benign or likely benign (Figs. 1 and 2). Overall, 15 of the 53 genes carried VUS but no known pathogenic or likely pathogenic variants. Another three carried only benign or likely benign variants. A total of 21 unreported variants were identified in this cohort (Table 1). We calculated the estimated conservative disease incidence by including only pathogenic and likely pathogenic variants (P + LP) and the estimated extended incidence by also incorporating VUS (P + LP + VUS). The calculated incidence for each LSD is listed in Fig. 3. The total incidence of autosomal recessive LSDs was 13 per 100,000 (95% CI 6.92–22.23) by conservative estimation and 94 per 100,000 (95% CI 75.96–115.03) by extended estimation.

Fig. 1: Flowchart of this study.

A total of 51,963 variants in 74 genes related to LSDs were identified from the TWB. We retained the 71 genes associated with autosomal recessive LSDs and the variants with an allele frequency <0.05 located in or near the exon region; 1003 variants remained for the subsequent analyses. A total of 270 variants in 53 genes were either reported in ClinVar or the HGMD as “pathogenic” or unreported but with a high predicted severity score. We then curated the pathogenicity of these 270 variants according to the 2015 ACMG criteria. LSD lysosomal storage disease, TWB Taiwan Biobank, SNV single nucleotide variant, HGMD Human Gene Mutation Database, ACMG American College of Medical Genetics.

Fig. 2: Interpretation of 270 LSD variants.

A Types of mutations in these 270 variants. B Summary of curated ACMG interpretation. The X-axis shows the pathogenicity interpretation by ClinVar, and the color shows the pathogenicity interpretation by ACMG (blue: pathogenic, pink: likely pathogenic, green: uncertain, yellow: likely benign, and purple: benign).

Table 1 The unreported variants identified in Taiwanese individuals and the related disorders.
Fig. 3: Comparison of the conservative and extended incidence with the known prevalence data.

X-axis: incidence per 100,000. *p value < 0.05.

The incidence of MPSs

We compared the current data with previously published prevalence data (Fig. 3) derived from retrospective clinical observation and newborn population screening. Regarding MPSs, our conservative incidence estimate was 2.21 per 100,000 (95% CI: 0.056–12.313), which did not differ significantly (p = 0.549) from the reported combined birth prevalence of MPSs other than MPS II, 0.97 per 100,000 live births13. The extended incidences of total MPS and of MPS types IIIA, IIIC, and IVA were significantly higher than the incidences reported by Lin et al. (Fig. 3)13. However, compared with the incidence observed in newborn screening for MPS types I, IIIB, and IVA13, our conservatively calculated incidence was significantly lower (Fig. 3).

The incidence of Pompe disease

The incidence of Pompe disease was also assessed using newborn screening data14. The incidence (prevalence at birth) was reported to be 55 in 994,975 from 2005 to 2018 (5.53 per 100,000)15. Our conservative estimate for Pompe disease was 4.23 per 100,000 and thus did not differ significantly from the reported incidence (Fig. 3).

Discussion

Here, using genetic data obtained from the TWB, we estimated that the combined incidence of 71 autosomal recessive LSDs is between 13 per 100,000 (pathogenic and likely pathogenic variants) and 94 per 100,000 (pathogenic and likely pathogenic variants plus variants of unknown significance). This incidence range is considerably higher than the reported prevalence among clinical cases but similar to that obtained through newborn screening. LSDs are very rare, and diagnoses are often delayed or missed; therefore, an accurate estimation of the incidence of these diseases in Taiwan is challenging, if not impossible. Estimation from genome-wide sequencing databases, as conducted in this study, and unbiased population screening are alternative methods for understanding the true incidence of these diseases, and both can assist in the development of policies that address the burden of rare diseases.

The conservative estimation data in the current study were more similar to the incidence rates observed in the clinic than the extended estimation data. For example, regarding MPS I, the conservative incidence estimate (0.03; 95% CI: 0.001–0.17) is similar to the published incidence in Taiwan (0.11; 95% CI: 0.003–0.61)13, confirming that this is an extremely rare disease. However, the extended and newborn screening incidences demonstrated a wider estimation range, implying that a milder or late-onset phenotype may exist that is not easily recognized by clinicians. Further understanding of the pathogenicity of VUS, either with functional or long-term follow-up data obtained through newborn screening, may further elucidate the true incidence of MPS I.

Although we analyzed limited genomic data in this study, the general incidence obtained is similar to that obtained in a previously published large-scale biobank study of the same population16. The carrier rate of Krabbe disease (GALC gene) in the previous study was estimated to be 1.67%, similar to the current study’s estimate (0.2–2.18%). Regarding mucolipidosis type II/III (GNPTAB), the previous estimate was 0.44% and the current estimate is 0.3–1%; the difference between the two estimates is not significant. The comparison of Pompe disease and GAA carrier incidence between the previous study and ours is more indirect, as the previous group calculated only the allele frequency (0.38%) of GAA variants causing infantile-onset Pompe disease among 103,106 individuals in Taiwan16, whereas we included both late-onset and infantile-onset Pompe disease, yielding a conservative allele frequency of 0.65%. Overall, our data support validation using a small dataset instead of a large dataset such as the biobank. Since TWB 2.0 only contained 179 known disease-relevant regions16, the use of TWB 2.0 may decrease the ability to detect rare variants in rare diseases. However, the current study demonstrated no differences between larger SNP chip datasets and comprehensive whole-genome sequencing (WGS) data from a small population. This may be because only exonic and nearby intronic variants were analyzed. Further validation will be required when more WGS data become available.

In this study, the allele frequency in Taiwanese individuals was too low to calculate the variant incidence for 18 of the 71 genes associated with autosomal recessive LSDs, and an additional 18 of the remaining 53 genes had no pathogenic or likely pathogenic variants. For example, for NPC1 and NPC2, which cause Niemann-Pick disease type C, no NPC1 variants were identified in the WGS data from the 1495 individuals in the TWB, and only one NPC2 variant was identified. The NPC2 variant was excluded because its severity score was 4 out of 13. The published prevalence of Niemann-Pick disease type C is 0.25 per 100,000 in the United Arab Emirates and 2.2 per 100,000 in Portugal2, which converts to a carrier rate of at least 1 in 400. This range indicates that such variants should have been present among the 1495 individuals studied here. Our current data therefore suggest an even lower incidence of Niemann-Pick disease type C in Taiwan, although clinical cases have been reported7. The possibility of selection bias, that is, the concentration of a disease in specific populations, requires further study. Selection bias is less likely in our study because of Taiwan’s relatively homogeneous Han Chinese population12. We are not aware of any clustering of such LSDs in specific populations in this country.
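The conversion from a published prevalence to a carrier rate in the paragraph above can be checked with a short calculation. The sketch below assumes Hardy–Weinberg equilibrium and treats the published prevalence as the birth incidence q²; the function name is hypothetical and only illustrates the arithmetic.

```python
import math

def carrier_rate_from_incidence(incidence):
    """Carrier rate 2 * (1 - q) * q, where q = sqrt(incidence) under Hardy-Weinberg."""
    q = math.sqrt(incidence)
    return 2.0 * (1.0 - q) * q

COHORT_SIZE = 1495  # number of TWB individuals with WGS data

# Published prevalences of Niemann-Pick disease type C, per 100,000
for label, per_100k in [("United Arab Emirates", 0.25), ("Portugal", 2.2)]:
    carrier = carrier_rate_from_incidence(per_100k / 100_000)
    expected = carrier * COHORT_SIZE
    print(f"{label}: carrier rate about 1 in {1 / carrier:.0f}, "
          f"expected carriers in cohort about {expected:.1f}")
```

Under these assumptions, even the lower prevalence implies that several carriers would be expected among 1495 individuals, which is the reasoning behind the statement that such variants should have been observed.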

Many biobanks, such as the Global Biobank Meta-analysis Initiative (GBMI), UK Biobank, Estonian Biobank and China Biobank, have been established worldwide as a result of improvements in NGS techniques. Many researchers have tried to use data from different biobanks to predict the risk or prevalence of different diseases. Most select likely pathogenic variants16,17 and use the Hardy–Weinberg equation to calculate the disease incidence, as in this study. Currently, most studies rely on biobank data to determine disease incidence and identify genetic and non-genetic factors contributing to various common chronic diseases. However, to date, no additional omics studies have been conducted. In the future, it would be highly valuable to organize further omics studies to delve deeper into the underlying mechanisms and molecular aspects of these diseases. Such studies could provide a more comprehensive understanding of the diseases’ complexities and potentially lead to more targeted and effective interventions. In addition, we estimated conservative and extended disease incidences due to the uncertainty of VUS curation and to better estimate the disease incidence range. Nevertheless, because biobanks are generated using different types of omics data, such as genotype arrays and WGS, additional caution should be taken when applying the resulting datasets to estimate disease prevalence. Furthermore, those biobanks, although population-based, may not represent the general population regarding sociodemographic or health-related characteristics18 and may not be a suitable resource for determining disease prevalence and incidence rates. UK Biobank has released 50,000 exomes19 and will add an additional 200,000 exomes to become the largest open-access resource of WES data linked to health records. A better understanding of rare disease incidence is expected in the future following analyses of a larger WES dataset.

Our study did not assess X-linked LSDs because the equation for X-linked disorders requires different interpretation methods, especially for diseases with late-onset phenotypes. For example, newborn screening for Fabry disease by enzyme assay revealed an incidence rate of 1 in 1250 among Taiwanese males20, most of whom had the GLA IVS4+919G>A variant. The incidence rate of the GLA IVS4+919G>A variant is estimated to be 1 in 600 among newborns21; however, it is unknown whether those individuals participated without bias in the small WGS dataset used in this study. Therefore, we did not include X-linked LSDs.

Another limitation of our study is associated with the short-read WGS method. For example, the GBA gene recombines with its pseudogene; thus, it is challenging to accurately determine where the variants are located using WGS. Therefore, we could only roughly estimate the incidence of Gaucher disease. However, the incidence range in the newborn screening data was similar to that estimated from the dataset22. Thus, we consider that our results provide useful information for estimating the burden of autosomal recessive LSDs, although further clarification, such as improved methods or data from biochemical screening, may be warranted for specific conditions.

Finally, although we used the WGS database, mutations in deep introns not regarded as critical for splicing may have been missed, and copy number changes were not reported. However, we could not demonstrate a significant difference in incidence when comparing our data to available NGS data regarding biochemical and protein levels, implying a minimal impact of using such genomic data for estimation. When calculating incidence, the possibility of variants in cis was not considered; thus, the incidence may have been overestimated. The increasing acceptance of preconception carrier screening could also influence the clinical incidence, since prenatal diagnosis, and abortion if the fetus is found to be affected, are permitted in Taiwanese culture. Such incidence drift has been observed in thalassemia and spinal muscular atrophy carrier screening, which are performed widely in Taiwan23.

In conclusion, the current study generated useful incidence data regarding LSDs in Taiwan. Our curated, conservative estimation of incidence could guide public health measures in calculating disease or drug burdens. Our extended estimation could also facilitate newborn and high-risk screening. Incidence estimation from genomic data will improve further as the clinical significance of variants becomes better understood.

Methods

The Taiwan Biobank

The TWB is a government-supported database that facilitates biomedical genetic research in the Taiwanese population (https://www.twbiobank.org.tw). The TWB was initiated as an ongoing prospective study in 2012 with a target sample size of 200,000 individuals aged 20–70 with no prior cancer diagnosis. At recruitment, participants provided written informed consent and had their baseline data collected. As of 30 November 2022, 184,577 volunteers had participated in biobanking, and 45,439 had completed the first follow-up round.

Whole-genome sequencing

In the current study, we used WGS data obtained from 1495 Taiwanese individuals in the TWB. The WGS data were generated using Illumina platforms, and the experiments and analyses were conducted by Genomics BioSci & Tech Co., Ltd. DNA was extracted from blood samples. Sequencing was performed on the Illumina HiSeq 2500 (2 × 150 bp paired-end) with an output of 90 GB and an average coverage depth of 30x. Raw reads were mapped to the hg38 reference genome with BWA-MEM2, and variants were called with the Genome Analysis Toolkit (GATK) HaplotypeCaller. Subsequently, we employed the WGS data to calculate the incidence of LSDs.

LSD gene selection

We selected the genes from the lysosomal disorder and mucopolysaccharidosis panel of Blueprint Genetics (https://blueprintgenetics.com/). Mutations in a total of 74 genes are known to cause LSDs. Variants on the X chromosome, such as those in the IDS, LAMP2, and GLA genes, were excluded.

Curation of variants and estimation of incidence

We first included single nucleotide variants (SNVs) in the exon and exon/intron border and small indel variants; the allele frequency of all variants in the TWB was ≤0.05. We then included variants according to the following criteria: (1) Reported in ClinVar as pathogenic or in the Human Gene Mutation Database (HGMD) as disease-causing mutations (DM) or possible/probable disease-causing mutations (DM?); or (2) unreported in ClinVar or the HGMD with a severity score exceeding 7 in the 13 prediction tools [Sorting Intolerant From Tolerant (SIFT), PolyPhen-2 (Polymorphism Phenotyping v2) HDIV, PolyPhen-2 HVAR, LRT (Likelihood Ratio Test), Mutation Taster, Mutation Assessor, FATHMM (Functional Analysis through Hidden Markov Models), FATHMM-MKL, Provean (Protein Variation Effect Analyzer), CADD (Combined Annotation–Dependent Depletion), MetaSVM, MetaLR, Mendelian Clinically Applicable Pathogenicity (M-CAP)].
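The two inclusion criteria above can be expressed as a simple filter. The sketch below is illustrative only: the record layout and function names are hypothetical, the severity score is assumed to be the number of the 13 tools that predict a damaging effect, and "unreported" is assumed to mean absent from both ClinVar and the HGMD.

```python
HGMD_DISEASE_CLASSES = {"DM", "DM?"}  # HGMD classes accepted by criterion (1)

def severity_score(predictions):
    """Count the tools that call the variant damaging (assumed definition)."""
    return sum(1 for damaging in predictions.values() if damaging)

def is_candidate(variant, max_allele_frequency=0.05, score_threshold=7):
    """Apply the allele-frequency cutoff and then criterion (1) or (2)."""
    if variant["allele_frequency"] > max_allele_frequency:
        return False
    clinvar = variant.get("clinvar")
    hgmd = variant.get("hgmd")
    # Criterion (1): reported as pathogenic in ClinVar or as DM / DM? in the HGMD
    if clinvar == "pathogenic" or hgmd in HGMD_DISEASE_CLASSES:
        return True
    # Criterion (2): unreported in both databases, severity score exceeding 7 of 13
    if clinvar is None and hgmd is None:
        return severity_score(variant["predictions"]) > score_threshold
    return False

def make_predictions(n_damaging, n_tools=13):
    """Hypothetical helper: n_damaging of n_tools tools predict a damaging effect."""
    return {f"tool_{i}": i < n_damaging for i in range(n_tools)}

known = {"allele_frequency": 0.001, "clinvar": "pathogenic", "hgmd": None,
         "predictions": make_predictions(0)}
novel_high = {"allele_frequency": 0.0003, "clinvar": None, "hgmd": None,
              "predictions": make_predictions(9)}
novel_low = {"allele_frequency": 0.0003, "clinvar": None, "hgmd": None,
             "predictions": make_predictions(4)}

for name, record in [("known", known), ("novel_high", novel_high), ("novel_low", novel_low)]:
    print(name, is_candidate(record))
```

An unreported variant with a score of 4 out of 13 fails criterion (2) in this sketch, whereas one with a score of 9 passes.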

The pathogenicity of the variants was determined according to the American College of Medical Genetics and Genomics (ACMG) guidelines12. Risk alleles were defined as pathogenic (P), likely pathogenic (LP), or variants of unknown significance (VUS). Gene-specific risk allele frequency (q) was defined as the sum of the frequencies of all variants in the indicated gene. Linkage between variants within a gene was not assessed. Therefore, the probability of having a risk allele for a disease in the haploid genome of a population was q and that of not having a risk allele was Q = 1 − q. The carrier rate was then calculated as 2 × Q × q based on Hardy–Weinberg equilibrium. The disease incidence was calculated as q². The calculated LSD incidences were then compared with real-world epidemiological data.
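The calculation described above can be sketched in a few lines. The example assumes each variant is stored as a pair of allele frequency and curated ACMG class; the variant frequencies, class labels, and function name are hypothetical and chosen only to illustrate the conservative (P + LP) and extended (P + LP + VUS) estimates.

```python
CONSERVATIVE = {"P", "LP"}        # pathogenic and likely pathogenic
EXTENDED = {"P", "LP", "VUS"}     # additionally includes VUS

def gene_estimates(variants, classes):
    """Return (q, carrier_rate, incidence) for one gene under Hardy-Weinberg."""
    # q: sum of allele frequencies of all risk variants in the gene
    q = sum(freq for freq, label in variants if label in classes)
    big_q = 1.0 - q                    # Q = 1 - q
    carrier_rate = 2.0 * big_q * q     # 2 x Q x q
    incidence = q ** 2                 # q squared
    return q, carrier_rate, incidence

# Hypothetical variants for a single gene: (allele frequency, ACMG class)
example_gene = [(0.001, "P"), (0.002, "LP"), (0.0005, "VUS"), (0.01, "B")]

for name, classes in [("conservative", CONSERVATIVE), ("extended", EXTENDED)]:
    q, carrier, incidence = gene_estimates(example_gene, classes)
    print(f"{name}: q = {q:.4f}, carrier rate = {carrier:.4%}, "
          f"incidence = {incidence * 100_000:.3f} per 100,000")
```

Because linkage between variants is not assessed, summing frequencies treats every risk variant as lying on a separate haplotype, which can overestimate q when two variants occur in cis.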

Statistics

The statistical analyses were performed using MedCalc® Statistical Software version 20.2 (MedCalc Software Ltd, Ostend, Belgium; https://www.medcalc.org; 2022). Comparisons of two rates were used to calculate the 95% confidence interval (95% CI) and p value between the estimated and the reported incidence. A p value < 0.05 was considered to indicate significance.
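As an illustration of how a 95% CI for a rate can be obtained, the sketch below computes an exact Poisson interval for an event count by bisection, using only the standard library. Whether MedCalc uses this exact procedure is an assumption here, and the function names are hypothetical.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def _solve(g, lo, hi, iterations=200):
    """Bisection for a root of g on [lo, hi], where g(lo) and g(hi) differ in sign."""
    g_lo = g(lo)
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        g_mid = g(mid)
        if (g_mid > 0) == (g_lo > 0):
            lo, g_lo = mid, g_mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def exact_poisson_ci(count, alpha=0.05):
    """Exact two-sided confidence interval for a Poisson mean given an observed count."""
    half = alpha / 2.0
    if count == 0:
        lower = 0.0
    else:
        # Lower limit: P(X >= count | lam) = alpha / 2
        lower = _solve(lambda lam: (1.0 - poisson_cdf(count - 1, lam)) - half,
                       0.0, float(count))
    # Upper limit: P(X <= count | lam) = alpha / 2
    upper_bracket = count + 10.0 * math.sqrt(count + 1.0) + 10.0
    upper = _solve(lambda lam: poisson_cdf(count, lam) - half,
                   float(count), upper_bracket)
    return lower, upper

low, high = exact_poisson_ci(13)
print(f"13 per 100,000: 95% CI {low:.2f}-{high:.2f}")
```

For a count of 13 per 100,000, this yields an interval of approximately 6.92 to 22.23, which agrees with the interval reported for the conservative estimate in Results.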

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.