INTRODUCTION

Peroxisome biogenesis disorders (PBDs) are a spectrum of conditions caused by defective peroxisome assembly and function.1 There are two subtypes of PBDs, which are distinguished by measurements of plasma very long chain fatty acid (VLCFA) levels and erythrocyte membrane plasmalogens. The first, rhizomelic chondrodysplasia punctata (RCDP), results from variants in the PEX7 gene. The second, Zellweger spectrum disorder (ZSD), is autosomal recessive and results from variants in 13 genes: PEX1, PEX2, PEX3, PEX5, PEX6, PEX10, PEX11β, PEX12, PEX13, PEX14, PEX16, PEX19, and PEX26.1,2 ZSD has different severities that used to be described as separate disorders: Zellweger syndrome (ZS, severe), neonatal adrenoleukodystrophy (NALD, intermediate), and infantile Refsum disease (IRD, mild).1 Symptoms are present at birth in severe cases or can manifest later in childhood in less severe cases. Because peroxisomes are tied to many processes in the body, ZSD has a wide range of symptoms. They include craniofacial abnormalities, hypotonia, seizures, blindness, deafness, enlarged liver, renal cysts, and myelin degradation.1,2,3

A related condition, X-linked adrenoleukodystrophy (X-ALD), was recently added to the recommended uniform newborn screening panel.4 Because the two conditions share biochemical features, screening for X-ALD might also identify some ZSD cases through elevated VLCFAs, the markers used for X-ALD screening. Increased detection of ZSD will probably lead to demand for more comprehensive information about the condition, including the carrier rate for recurrence risk assessment. A better understanding of the carrier frequency will help in the genetic counseling of extended family members following positive newborn screening results.

Approximately 1 in 50,000 births is thought to be affected by PBDs.1,2,5,6,7 This estimate is based on a combination of observations, and often it is unclear which level of cases (ZS, ZSD, or PBD) is included in the incidence. In 1975, Danks et al. reported on eight cases of ZS in the state of Victoria, Australia, that occurred over the course of 13 years.8 They used these cases and the number of births in Victoria (882,765) over the same period to estimate the incidence of ZS at 1 in 100,000. In a 1987 review of ZS, Hans Zellweger stated that he believed the incidence to be higher than that reported by Danks et al. because in the Netherlands the disease “is more likely to affect one in 25,000 to 50,000 newborns,”7 but no data were reported to support this claim. Lazarow and Moser noticed that the Kennedy Krieger Institute accounted for 0.79 in 100,000 (1 in 126,582) incident ZS cases in the United States between 1985 and 1995.9 In certain subpopulations, the incidence is much higher, such as the French-Canadian population in the Saguenay-Lac-St-Jean region, where the estimated frequency of ZS is 1 in 12,191.10 Subsequent literature regarding ZSD has cited a middle figure of 1 in 50,000.1,5,6 The range of estimates reported in the literature indicates a need for a systematic approach to obtain a more accurate carrier frequency of ZSD in a large population that relies on a genetic definition of ZSD.

To date, there are few reports on genetic disease carrier frequency assessment using a combination of large population-based data and variant analysis tools to assess pathogenicity.11,12 In this study, we aimed to demonstrate the process of estimating the carrier frequency and associated incidence rate of ZSD using the Exome Aggregation Consortium (ExAC) database and several bioinformatics tools that assess pathogenicity of genetic variants. ExAC is a compilation of high-quality exome data from approximately 60,000 individuals, which have been filtered to contain unrelated adults without a history of severe childhood disease.13 Bioinformatics tools used include Sorting Intolerant from Tolerant (SIFT), Polymorphism Phenotyping v2 (PolyPhen-2), Combined Annotation-Dependent Depletion (CADD), and Deleterious Annotation of genetic variants using Neural Networks (DANN), all of which evaluate missense variants based on evolutionary conservation, protein structure, or a combination of both.14,15,16,17 Also used was the conservation-based Phylogenetic Analysis with Space/Time Models tool (PhastCons), which evaluates nucleotides based on conservation.18 Of the tools listed here, SIFT and PolyPhen-2 are among the most used for clinical assessment of missense variants.19 We also assessed these variants according to the American College of Medical Genetics and Genomics (ACMG) criteria to see how they would be evaluated in the current clinical practice setting.19 To our knowledge, this study is the first to report estimates of ZSD carrier frequency and incidence rates based on a large consortium database and a genetic definition of the disease.

MATERIALS AND METHODS

Databases and genetic variants

As described below, our study used three lists of variants from different databases in the process of (1) selecting the variant analysis tools that we would use to assess variant deleteriousness, (2) establishing a threshold for each of those tools, and (3) evaluating the carrier frequency of ZSD (Fig. 1).

Fig. 1 Analysis outline. Workflow to estimate ZSD carrier frequency and incidence.

To evaluate which variant analysis tools are most informative in assessing variant deleteriousness, we used genome-wide missense variants (i.e., not limited to variants in the 13 PEX genes) reported as either pathogenic (15,406 variants) or benign (3932 variants) in ClinVar.20 ClinVar is a repository of human variants with a reported clinical significance, such as benign or pathogenic.

To establish a threshold for deleteriousness for variant analysis tools, we used known ZSD-causing missense variants from OMIM21 and dbPEX (PEX Gene Database).22 OMIM provides information on publications that report pathogenic variants, and dbPEX is a PBD-specific database of variants. Thresholds can vary depending on the disease, and setting a disease-specific deleteriousness threshold was the method of choice in a previous study.11,16 The following inclusion criteria were applied to ZSD variants: first, they were present in the ExAC population; second, they had a deleteriousness score for each variant analysis tool that we used (scores were not available for some variants); and third, they met the baseline deleteriousness thresholds of each variant analysis tool. From the over 200 variants in dbPEX and OMIM, 34 were present in the ExAC database. Of those, 15 were nonsynonymous variants and therefore had scores for each variant analysis tool. Of these, 6 did not meet the standard deleteriousness cutoffs provided by the variant analysis tools. This resulted in 9 variants used to establish deleteriousness thresholds (Table 1).

Table 1 Variants known to cause Zellweger spectrum disorder found in OMIM or dbPEX databases

To assess the carrier frequency of ZSD, we extracted allele frequencies from the ExAC database in the following 13 genes: PEX1, PEX2, PEX3, PEX5, PEX6, PEX10, PEX11β, PEX12, PEX13, PEX14, PEX16, PEX19, and PEX26. We then analyzed whether these variants were potentially pathogenic using the deleteriousness scores from the selected tools and our criteria for pathogenicity, as described below. A variant was excluded if at least one individual in the ExAC population was homozygous for it. There were six homozygous variants. In addition to the 9 variants used to establish a deleteriousness threshold, a total of 2104 variants from ExAC were assessed for the carrier frequency calculation.
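The homozygote exclusion described above can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the `Variant` record, its field names, and the example values are hypothetical and do not come from ExAC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    # Hypothetical record layout; field names are assumptions for illustration.
    variant_id: str
    gene: str
    allele_count: int      # number of alternate alleles observed
    allele_number: int     # number of chromosomes genotyped at the site
    homozygote_count: int  # individuals homozygous for the alternate allele

    @property
    def allele_frequency(self) -> float:
        return self.allele_count / self.allele_number


def exclude_homozygous(variants):
    """Keep only variants with no homozygous individual in the population."""
    return [v for v in variants if v.homozygote_count == 0]


# Made-up example values, not study data.
variants = [
    Variant("var_A", "PEX1", 3, 120_000, 0),
    Variant("var_B", "PEX6", 900, 120_000, 4),
    Variant("var_C", "PEX12", 1, 120_000, 0),
]
retained = exclude_homozygous(variants)
print([v.variant_id for v in retained])
```

The rationale in the text is that a variant seen in the homozygous state in a cohort without severe childhood disease is unlikely to cause ZSD on its own, so it is dropped before allele frequencies are summed.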

Variant analysis tool selection and bioinformatics variant pathogenicity assessment

ANNOVAR23 was used to annotate all variants, including ClinVar variants used for the variant analysis tool selection, known ZSD-causing variants from OMIM and dbPEX used to establish the threshold for deleteriousness, and the ExAC variants included in the carrier frequency calculation. ANNOVAR also extracted allele frequencies from the ExAC database. Annotations used the GRCh37 human reference sequence and RefSeq gene definitions. Variant deleteriousness and frequency in the ExAC population were analyzed with SAS (v. 9.4) and R (v. 3.2.1). We use the term “deleteriousness” when referring to one bioinformatics tool’s assessment of a variant and “pathogenicity” when referring to a composite score of all five bioinformatics tools.

To select the most informative variant analysis tools, we imported all ClinVar variants identified as pathogenic (15,406) and benign (3932) into ANNOVAR, which provided scores for 16 different variant analysis tools. Then we conducted further evaluation as described below.

First, we qualitatively examined the spread, shape, and overlap of the distributions of pathogenic and benign ClinVar variant scores from each tool. We preferred tools that had narrow distributions of scores, especially for pathogenic variants, and tools where the overlap between the benign and pathogenic variant distributions was minimized. Second, we selected tools that represent various approaches in determining the deleteriousness of variants, as stated in the ACMG guidelines for variant assessment.19 These categories include evolutionary conservation of amino acids, protein structure and function, and nucleotide conservation. Third, we conducted a literature search to select tools that are widely cited in peer-reviewed journals. We chose SIFT, PolyPhen-2, and PhastCons because they each represented at least one of the three mentioned approaches to determining deleteriousness, they were widely cited, and the comparison of benign and pathogenic variant distributions had the qualities we described above (Fig. 2). We chose CADD and DANN because they represented a fourth approach in determining deleteriousness, which is based on a comparison of variants that survived natural selection to simulated variants. The distributions of pathogenic and benign variants for the two scores met our outlined criteria. CADD was widely referenced in the literature while DANN is a relatively new tool.

Fig. 2 Violin plots comparing scores for 15,406 pathogenic and 3932 benign variants from the ClinVar database for the five variant analysis tools used to assess deleteriousness. In each plot, benign variants are on the left and pathogenic variants are on the right. The deleteriousness scores are along the y-axis. Higher values indicate a higher probability that the variant is damaging for all scores except SIFT, where a low score is associated with deleteriousness. The y-axis for CADD is a logarithmically transformed score, which is why the CADD plot appears different; the other y-axes are linear probabilities. The x-axis represents the probability density of variants along the range of scores.

The SIFT algorithm predicts whether an amino acid substitution is deleterious using evolutionary conservation of amino acids.14 SIFT generates a probability for every amino acid in the protein based on how often that amino acid is observed in alignments with homologous sequences. The lower the probability of an amino acid substitution, the higher the likelihood is for that substitution to be deleterious. PolyPhen-2 uses a combination of 11 tools based on amino acid sequence and protein structure to predict if an amino acid substitution is deleterious.15 PolyPhen-2 (HumVar-trained model) generates a score that estimates the probability of a variant being damaging. High scores indicate variants that are more likely to be damaging. CADD compares derived alleles with simulated de novo variants and ranks each one relative to the rest based on how likely it is that the allele is derived or simulated.16 It is based on the principle that there are fewer derived than simulated deleterious variants because of natural selection. We used a Phred-like scaled version of the resulting C-score, which is equivalent to −10 × log₁₀(rank / total number of substitutions). A variant with a scaled C-score between 20 and 29 is in the top 1% of the “most deleterious substitutions that you can do to the human genome.”24 A score between 30 and 39 means that it is in the top 0.1% of the most deleterious substitutions. DANN is similar to CADD except that it uses a deep neural network instead of a linear kernel support vector machine to compare derived and simulated alleles.17 The higher the score, the more likely a variant is to be damaging. PhastCons (20-way mammalian score) is based on phylogenetic hidden Markov models and generates a conservation score for each variant using a cross-species alignment.18 A high score indicates that the variant has a higher probability of being in an evolutionarily conserved element and that changing it would be deleterious.
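The Phred-like scaling of the CADD C-score mentioned above can be illustrated with a short sketch. The function name and the total of one million substitutions are hypothetical and chosen only to make the arithmetic easy to follow; this is not CADD's own code.

```python
import math


def phred_scaled_score(rank: int, total: int) -> float:
    """Return -10 * log10(rank / total), where rank 1 is the most deleterious."""
    if not 0 < rank <= total:
        raise ValueError("rank must be between 1 and total")
    return -10.0 * math.log10(rank / total)


# Hypothetical total number of ranked substitutions (illustration only).
total = 1_000_000

# A substitution at the boundary of the top 1% scores about 20,
# and one at the boundary of the top 0.1% scores about 30.
top_1_percent = phred_scaled_score(10_000, total)
top_0_1_percent = phred_scaled_score(1_000, total)
print(round(top_1_percent, 3), round(top_0_1_percent, 3))
```

Each increase of 10 in the scaled score corresponds to a tenfold smaller fraction of substitutions ranked as at least that deleterious.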

After selecting these five tools, we established a threshold of deleteriousness for each one using the nine missense PEX gene variants that were reported in OMIM and dbPEX as disease-causing. We calculated the mean scores of this set of variants for SIFT (mean = 0.007, SD = 0.017), PolyPhen-2 (mean = 0.999, SD = 0.002), CADD (mean = 33.167, SD = 2.677), DANN (mean = 0.999, SD = 0.000), and PhastCons (mean = 0.988, SD = 0.014) and used those as the cutoff between deleterious and nondeleterious for each tool when evaluating variants in the ExAC database. If a missense variant in ExAC had a score equal to or above (below for SIFT) the mean, it was considered deleterious.
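As a minimal sketch of this calibration step, the cutoff for each tool can be taken as the mean score of the known disease-causing variants, with the comparison reversed for SIFT. The function names and the example scores below are hypothetical; they are not the nine study variants.

```python
from statistics import mean

# SIFT is the only tool here where a lower score indicates deleteriousness.
LOWER_IS_DELETERIOUS = {"SIFT"}


def calibrate_thresholds(scores_by_tool):
    """Use the mean score of known disease-causing variants as each tool's cutoff."""
    return {tool: mean(scores) for tool, scores in scores_by_tool.items()}


def is_deleterious(tool, score, thresholds):
    """Equal to or below the cutoff for SIFT; equal to or above it otherwise."""
    cutoff = thresholds[tool]
    if tool in LOWER_IS_DELETERIOUS:
        return score <= cutoff
    return score >= cutoff


# Made-up scores for three known pathogenic variants (illustration only).
known_variant_scores = {
    "SIFT": [0.0, 0.0, 0.03],
    "CADD": [30.0, 33.0, 36.0],
}
thresholds = calibrate_thresholds(known_variant_scores)

print(is_deleterious("SIFT", 0.005, thresholds))  # below the SIFT cutoff
print(is_deleterious("CADD", 25.0, thresholds))   # below the CADD cutoff
```

Calibrating on a small set of variants makes the cutoff sensitive to each individual score, which is one reason the resulting thresholds can be far stricter than a tool's default.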

To evaluate which missense variants in the 13 PEX genes from the ExAC database were potentially pathogenic, we categorized them by the number of variant analysis tools that classified them as deleterious. An allele was classified as pathogenic under three levels of stringency: if it was deemed deleterious by at least three of the five tools (3/5), by at least four of the five tools (4/5), or by all five tools (5/5). Each tool carried the same weight in the composite scores.
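The composite classification can be sketched as an equal-weight vote across the five tools. The cutoffs below are the mean scores reported in the Methods; the function names and the example variant scores are hypothetical.

```python
# Cutoffs are the mean scores of the nine known ZSD-causing variants reported above.
THRESHOLDS = {
    "SIFT": 0.007,
    "PolyPhen-2": 0.999,
    "CADD": 33.167,
    "DANN": 0.999,
    "PhastCons": 0.988,
}
LOWER_IS_DELETERIOUS = {"SIFT"}


def count_deleterious_calls(scores):
    """Count how many of the five tools call the variant deleterious."""
    calls = 0
    for tool, cutoff in THRESHOLDS.items():
        score = scores[tool]
        if tool in LOWER_IS_DELETERIOUS:
            calls += int(score <= cutoff)
        else:
            calls += int(score >= cutoff)
    return calls


def pathogenic_categories(scores):
    """Report membership in each stringency level; the levels are nested."""
    n = count_deleterious_calls(scores)
    return {"3/5": n >= 3, "4/5": n >= 4, "5/5": n == 5}


# Made-up scores for a single missense variant (illustration only).
example = {"SIFT": 0.0, "PolyPhen-2": 1.0, "CADD": 35.0, "DANN": 0.9995, "PhastCons": 0.5}
print(pathogenic_categories(example))
```

In this example four tools exceed their cutoffs and PhastCons does not, so the variant counts toward the 3/5 and 4/5 categories but not the 5/5 category.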

Frameshift insertions and deletions (indels), stop-loss, stop-gain, and splice-site variants were all considered potentially pathogenic. The splice-site variants that ANNOVAR annotates as “splicing” are by default in the +1, +2, −1, and −2 positions (personal communication with Dr. Wang of ANNOVAR). These are well-conserved positions, and changes to these nucleotides are recognized to affect pre-mRNA splicing. We included these variants in the “other variants” category. The frequencies of the known ZSD-causing missense variants are also included in this category.

Variant assessment with ACMG criteria

Variants were categorized with the clinical interpretation software Cartagenia (Alissa Interpret), which evaluates each variant using a series of databases, allele frequency information, and functional predictions. Pathogenicity was categorized according to the standards and guidelines set forth by ACMG.19 Evidence for and against pathogenicity was weighted as strong (previously described function, loss of function), moderate (loss of initiation, premature stop codon, disruption of stop codon, whole-gene deletion, frameshifting indel, and disruption of splicing), or supporting (nonsynonymous substitution, in-frame indel, support from multiple functional prediction algorithms). Each variant was interpreted based on the cumulative evidence supporting its categorization as pathogenic or likely pathogenic.

Carrier frequency and disease incidence estimation

To date, variants in 13 PEX genes have been linked to ZSD. We calculated the ExAC population carrier frequency for every PEX gene in each of the three pathogenicity categories described above. We summed the allele frequencies from the ExAC population within those categories along with the frequencies of the frameshift indels, stop loss, stop gain, splice-site, and known ZSD-causing variants. We then estimated the incidence rates for each gene based on those frequencies and the Hardy–Weinberg equilibrium principle: p² + 2pq + q² = 1. The p represents the frequency of the major (nondisease) allele, which we assume to be approximately 1. The q represents the minor allele frequency, q² the frequency of affected individuals (including compound heterozygotes), and 2pq the carrier frequency. Then, we summed the estimated carrier frequencies and incidence rates across all genes. As an example, the carrier frequency (2pq) for PEX1 in the 3/5 pathogenicity category is 0.00635682. To solve for q, we divide that number by 2 (assuming p is 1) and get 0.00317841. We then square that to get a gene-level incidence (q²) of 1.01 × 10⁻⁵. We also estimated the carrier frequency and incidence for the African, non-Finnish European, Finnish, admixed American, South Asian, and East Asian genetic ancestry groups within the consortium. The same process for carrier frequency and incidence estimates was followed for the variants categorized according to ACMG criteria as pathogenic or likely pathogenic.
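The Hardy–Weinberg arithmetic above can be reproduced in a few lines. This is a sketch under the same assumption stated in the text (p ≈ 1); the function names are hypothetical, the PEX1 carrier frequency is the value quoted above, and the two-gene example uses made-up frequencies.

```python
def gene_incidence(carrier_frequency: float) -> float:
    """Gene-level incidence q^2 from carrier frequency 2pq, assuming p is about 1."""
    q = carrier_frequency / 2.0
    return q * q


def total_incidence(carrier_frequency_by_gene) -> float:
    """Sum gene-level incidences across genes, as described in the text."""
    return sum(gene_incidence(c) for c in carrier_frequency_by_gene.values())


def one_in_n(incidence: float) -> int:
    """Express an incidence as '1 in N' births."""
    return round(1.0 / incidence)


# PEX1 carrier frequency in the 3/5 category, as given in the text.
pex1_incidence = gene_incidence(0.00635682)
print(f"{pex1_incidence:.3g}")

# Made-up carrier frequencies for two hypothetical genes (illustration only).
hypothetical = {"GENE_A": 0.002, "GENE_B": 0.002}
print(one_in_n(total_incidence(hypothetical)))
```

Because incidence is computed per gene and then summed, individuals carrying one pathogenic allele in each of two different genes are not counted as affected, which matches the autosomal recessive model used here.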

RESULTS

Bioinformatics assessment of carrier frequency and incidence

To estimate ZSD carrier frequency, we assessed 1953 missense variants and 151 additional variants, which include frameshift indels, stop gain or loss variants, and splice-site variants. There were 9 known ZSD-causing variants available, for a total of 160 variants in the “other” category that were counted as pathogenic. There were 231, 82, and 24 missense variants in the 3/5, 4/5, and 5/5 pathogenicity categories. The 3/5 category is inclusive of the 4/5 and 5/5 categories, and the 4/5 category is inclusive of the 5/5 category. In the subpopulation assessments, the same variants were assessed in each population except for one variant in the “other” category, which was not available for the Finnish population. Overall, the top three genes that had the most pathogenic variants across all composite scores were PEX1, PEX6, and PEX12, in descending order. According to dbPEX, the top three genes with the highest number of unique variants associated with ZSD are PEX6, PEX1, and PEX12, in descending order. However, the number of recorded cases with PEX1 variants is high relative to cases with variants in other genes.

The estimated incidence of ZSD in the entire ExAC population using the 3/5, 4/5, and 5/5 thresholds is 1 in 83,841, 1 in 121,749, and 1 in 139,557 births, respectively (Table 2). In the ExAC subpopulations, using the lowest stringency level of 3/5 (at least three of five variant analysis tools classified variant as deleterious) the incidence ranges from 1 in 31,165 births in the East Asian population to 1 in 263,531 births in the admixed American population (Table 2). At the highest stringency level where all five variant analysis tools classified a variant as deleterious, the incidence ranged from 1 in 76,630 births in the non-Finnish European population to 1 in 2,702,703 births in the Finnish population. No individuals in this latter subpopulation had missense variants that were in the 5/5 category, which could be due to the low probability of observing these variants in the small population of only 3307 people, the smallest of the ExAC populations. The incidence estimated in the entire ExAC population is mainly reflective of the non-Finnish European subpopulation because 33,370 people out of the total 60,706 are in this population (Table 2).

Table 2 Carrier frequency and estimated incidence of Zellweger spectrum disorder (ZSD)

ACMG assessment of carrier frequency and incidence

The same variants assessed using the bioinformatics criteria discussed above were also assessed using ACMG criteria.19 Of the variants extracted from ExAC, 11 were classified as pathogenic, and 33 were classified as likely pathogenic for a total of 44 variants that factored into the carrier frequency estimate. There are four missense variants in the likely pathogenic category and none in the pathogenic category. The remaining 40 variants are frameshift indels, stop gain or loss variants, and splice-site variants.

The estimated incidence of ZSD in the entire ExAC population including variants classified as pathogenic and likely pathogenic according to ACMG criteria is 1 in 3,275,751 births (Table 3). The estimate decreases to 1 in 10,413,631 births if variants classified as pathogenic are the only ones included. For ExAC subpopulations, the total incidence ranged from 1 in 1,230,228 births in the non-Finnish European group to 1 in 94,886,541 births in the South Asian group (Table 3).

Table 3 Carrier frequency and estimated incidence of Zellweger spectrum disorder estimated with variants that pass ACMG criteria to classify sequence variants

DISCUSSION

Our study estimates the ZSD carrier frequency and incidence rates using a large consortium database. Recent advancements in bioinformatics tools for variant assessment, and efforts to create large databases of human genomic information, have generated possibilities to estimate carrier frequencies based on large population data. One challenge with these bioinformatics tools is to develop a procedure for their use that represents biological or pathogenic processes. We outline an approach for selecting informative variant analysis tools that uses the entire ClinVar repository of “benign” and “pathogenic” nonsynonymous variants. These variants are then leveraged to select tools that discern between reported benign and pathogenic variants. Combining this with our other described criteria, we have more confidence in the reliability of the tools we use for variant evaluation than if we had selected tools based on convenience or familiarity. Instead of using default deleteriousness thresholds, we calibrated each tool with ZSD-causing variants. Then we evaluated missense variants based on a combination of the five tools, setting three thresholds to determine whether variants were pathogenic. Together with the other variant types assumed to be pathogenic, these missense variants were used to calculate the carrier frequency and estimate the associated incidence.

Our bioinformatically estimated incidence of ZSD in the whole ExAC population of 1 in 83,841 births is similar to recent estimates from newborn screening in New York of approximately 1 in 90,000 births (calculated from 12 ZSD cases in 1.08 million births, personal communication with Dr. Joseph Orsini, August 2018), and lower than the figure of 1 in 50,000 births that is often cited.1,2,6 Our analysis was limited to PEX variants present in the ExAC database, which did not include all variants, such as large indels that are known to cause ZSD. Therefore, the frequency of those could not be included in the estimate. In addition, if ANNOVAR did not provide SIFT, PolyPhen-2, CADD, DANN, or PhastCons scores for a missense variant in ExAC, it could not be analyzed. There were 102 missense variants that did not have these scores. The combination of these limitations means that some pathogenic variants may not have been included in the estimates, which means the incidence in the ExAC population could be higher than what we estimated. Another factor that could lead to an underestimate of the incidence is that we may have excluded hypomorphic alleles that result in the disease only when paired with a more deleterious allele. For example, we excluded at least one: PEX6-R601Q, a homozygous variant in ExAC. When this allele is paired with a null allele it results in ZSD.25 Conversely, we may have overestimated the carrier frequency because in the absence of clinical information about the variants, we may have falsely classified some variants as pathogenic.

Under the current clinical setting, using ACMG guidelines, the whole ExAC population incidence would be estimated at 1 in 3,275,751 births, which is much lower than the bioinformatics assessment or the observed estimate. The design of ACMG criteria aims to greatly limit the possibility of falsely assessing a nonpathogenic variant as pathogenic. The criteria focus on combining clinical, bioinformatics, variant type, and other lines of evidence about a variant to make a definitive judgment about its pathogenicity. If a variant’s pathogenicity has not been assessed in multiple ways, it cannot be classified as pathogenic even if it is potentially pathogenic. The low incidence estimate using ACMG criteria likely reflects this lack of information. ACMG criteria are necessary for clinical diagnosis but may not be suitable for estimating the disease incidence.

Our bioinformatically estimated incidence supports that ZSD is a rare disease. Besides the constraints associated with data availability, the incidence is dependent on the deleteriousness cutoffs for each score and the pathogenicity thresholds we set. We took a conservative approach to estimating the carrier frequency and set deleteriousness thresholds using the mean scores of known ZSD-causing variants. Our thresholds were more stringent than the default thresholds for each variant analysis tool. For example, the default threshold for SIFT is 0.05, and our threshold was more stringent at 0.007. SIFT has a 20% false positive rate at the 0.05 level.26 Another limitation is that the variants we used to set the deleteriousness thresholds did not represent all PEX genes. We also had the added stringency of a compound score for pathogenicity. The five tools we selected for our compound score provided results that were similar to those obtained from a large newborn screening cohort. Additional replicable studies in other disorders are needed to evaluate whether the utility of the same five tools can be generalized. Currently, there are no agreed-upon guidelines for computationally evaluating variant pathogenicity.

Despite the described limitations, one interesting finding from the ExAC subpopulations suggests that the incidence of ZSD varies by population composition. This range in ZSD incidence highlights that it would be important to investigate whether different subgroups are more heavily impacted by ZSD.

Our study, together with other recent research, provides a starting point for calibrating bioinformatics approaches to disease carrier frequency estimation.11,12 An opportunity for refinement of this method comes with the recent expansion of the recommended newborn screening panel to include X-ALD, which can also detect ZSD. Our estimates appeared close to currently available newborn screening data for ZSD in New York, and these estimates will be further verified as new newborn screening data emerge. However, newborn screening is ongoing, and at any given point the observed incidence rate may not be as close to our estimate as it currently appears. In addition, future analyses could work with the larger gnomAD data set, which is in early beta mode but includes 126,216 exomes and 15,136 genomes.27 Bioinformatics approaches to carrier frequency estimations are an important resource when other methods for assessing variant pathogenicity are limited and when population-based gene variant testing is implausible.