Estimating the occurrence of primary ubiquinone deficiency by analysis of large-scale sequencing data

Primary ubiquinone (UQ) deficiency is an important subset of mitochondrial disease that is caused by mutations in UQ biosynthesis genes. To guide therapeutic efforts we sought to estimate the number of individuals who are born with pathogenic variants likely to cause this disorder. We used the NCBI ClinVar database and literature reviews to identify pathogenic genetic variants that have been shown to cause primary UQ deficiency, and used the gnomAD database of full genome or exome sequences to estimate the frequency of both homozygotes and compound heterozygotes within seven genetically defined populations. We used known population sizes to estimate the number of afflicted individuals in these populations and in the mixed population of the USA. We then performed the same analysis on predicted pathogenic loss-of-function and missense variants that we identified in gnomAD. When including only known pathogenic variants, our analysis predicts 1,665 affected individuals worldwide and 192 in the USA. Adding predicted pathogenic variants, our estimate grows to 123,789 worldwide and 1,462 in the USA. This analysis predicts that there are many undiagnosed cases of primary UQ deficiency, and that a large proportion of these will be in developing regions of the world.

acting as a hetero-tetramer) assembles an isoprenoid tail from precursors produced by the mevalonate pathway. COQ2 joins this isoprenoid tail to a tyrosine-derived benzoquinone ring precursor, and COQ3, COQ5, COQ6 and COQ7 are responsible for various methylation and hydroxylation reactions affecting the benzoquinone ring. COQ8 appears to play a regulatory role by modulating phosphorylation of COQ3, COQ5 and COQ7. COQ8 has two human homologues, COQ8A (also known as ADCK3 or CABC1) and COQ8B (ADCK4), both of which can independently result in UQ deficiency 12,13. The roles of COQ4 and COQ9 are not well defined, although COQ4 appears to play a role in the assembly of COQ2-COQ7 into a complex and COQ9 is required for COQ7 function. ARH1 (human homologue FDX1L) and YAH1 (FDXR) transfer electrons to COQ6, while also participating in other pathways. There are two modification steps of the UQ benzoquinone ring that have yet to be assigned an enzyme.
To date, pathogenic variants in nine of these proteins (PDSS1, PDSS2, COQ2, COQ4, COQ6, COQ7, COQ8A, COQ8B and COQ9) have been shown to cause UQ deficiency in human patients 7,9. We sought to leverage the recent availability of exome or genome sequences of very large numbers of individuals in order to estimate the frequency of known pathogenic variants in these genes. We used the NCBI ClinVar database 14 and conducted a literature search to identify variants in the known UQ biosynthesis genes that result in illness and UQ deficiency. The gnomAD exome and genome database 15, with sequences for 138,632 individuals divided into seven genetically distinct populations, was used to estimate the frequencies of these variants. Using these frequencies, we estimated the birth prevalence of individuals homozygous or compound heterozygous for known or predicted pathogenic genetic variants for primary UQ deficiency (assuming Hardy-Weinberg equilibrium) on a population-by-population basis, and used known population sizes and distributions to estimate the actual numbers of afflicted individuals due to each variant worldwide, as well as in a population with the particular size and mix of the USA. Importantly, calculating the number of afflicted individuals on a per-variant, per-population basis eliminates a potential confounding factor when working with large numbers of variants present at very low frequencies: many individual variants may be too rare to result in any homozygous or compound heterozygous individuals, yet the traditional method of summing their frequencies could yield a combined frequency high enough to artificially suggest that individuals are affected.
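The difference between per-variant counting and the summed-frequency approach described above can be illustrated with a minimal sketch. The function names, the toy frequencies and the rule of rounding each per-variant expectation down to whole individuals are assumptions made for illustration; the text does not specify the exact rounding rule used.

```python
from math import floor

def expected_homozygotes(allele_freq, population_size):
    """Hardy-Weinberg expected number of homozygotes: q^2 * N."""
    return allele_freq ** 2 * population_size

def per_variant_total(allele_freqs, population_size):
    """Count whole individuals one variant at a time.

    Assumed rule: each per-variant expectation is rounded down, so a
    variant too rare to yield even one homozygote contributes nothing.
    """
    return sum(floor(expected_homozygotes(q, population_size))
               for q in allele_freqs)

def pooled_frequency_total(allele_freqs, population_size):
    """'Traditional' approach: pool the allele frequencies, then square.

    (sum q)^2 also includes the 2*q_i*q_j compound-heterozygote terms.
    """
    return sum(allele_freqs) ** 2 * population_size

# Hypothetical toy data: 50 variants, each at q = 1e-5, N = 100 million.
toy_freqs = [1e-5] * 50
toy_population = 100_000_000

per_variant = per_variant_total(toy_freqs, toy_population)
pooled = pooled_frequency_total(toy_freqs, toy_population)

print(f"per-variant count:  {per_variant}")
print(f"pooled expectation: {pooled:.1f}")
```

In this toy case every variant is individually expected to produce about 0.01 homozygotes, so the per-variant count is zero, while pooling the same frequencies suggests roughly 25 affected individuals.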
It is likely that many pathogenic variants simply have not been clinically documented at this relatively early stage in our awareness of primary UQ deficiency. To account for this, we also estimated the number of individuals who would be homozygous or compound heterozygous for variants observed in gnomAD but that have not yet been observed in the clinic, focusing on predicted loss-of-function (LoF) or pathogenic missense mutations.
There are many challenges to making estimates of this nature. For example, it is not possible to conclusively determine the pathogenicity of missense variants based on sequence information alone. We attempt to address this by conservatively including only those variants independently predicted to be pathogenic by two separate bioinformatic algorithms (see Methods). There is also extreme variability in the severity of primary UQ deficiency, ranging from neonatal lethality (with mouse studies suggesting that embryonic lethality is a possible outcome for null alleles for some genes [16][17][18][19]) to mild disease that becomes apparent only in later decades of life. This makes it challenging to accurately predict disease prevalence from allelic frequencies extrapolated from public databases of genomic variants, which is why our results are best interpreted as the birth prevalence of individuals homozygous or compound heterozygous for variants likely to cause disease. Actual disease prevalence would be expected to diverge from these estimates. We discuss these issues in greater detail below.
We found that the carrier frequencies for most previously identified pathogenic variants were low (averaging 1/6,420 for the populations in which they were present), and given known population sizes we estimated they would result in a total of 1,016 individuals worldwide due to homozygosity and an additional 649 due to compound heterozygosity, with a total of 192 in the USA. The addition of all predicted loss-of-function and pathogenic missense variants results in a predicted total of 123,789 individuals worldwide and 1,462 in the USA.

Methods
Identification of known pathogenic variants. We identified pathogenic variants of UQ biosynthesis genes (PDSS1, PDSS2, COQ2-COQ7, COQ8A/ADCK3, COQ8B/ADCK4 and COQ9) using the NCBI ClinVar database and via PubMed literature searches. ClinVar is a public archive (https://www.ncbi.nlm.nih.gov/clinvar/) describing human genetic variants and their relationship to human health 14. Variants are extracted from the peer-reviewed literature or directly reported by CLIA-certified or ISO 15189-accredited clinical testing laboratories. Variant pathogenicity is reported by the submitter according to the ordinal scale recommended by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology ("pathogenic", "likely pathogenic", "uncertain significance", "likely benign" or "benign") 20. Note that ClinVar results cannot be used to directly estimate birth prevalence, and the database does not include fields for incidence frequency. Each ClinVar entry describes a unique variant, and may be derived from multiple submissions.
We queried ClinVar (search conducted on 2017-03) for each gene (e.g., 'COQ2[gene]') and identified pathogenic variants using the following inclusion criteria: (i) At least one submission describes the variant as "pathogenic" or "likely pathogenic". (ii) No submitter assigns a significance as "benign" or "likely benign". (iii) Variant only affects one gene (i.e., no multi-gene deletions or duplications).
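The three inclusion criteria can be expressed as a simple predicate. The sketch below is illustrative only: the `VariantRecord` structure and its field names are hypothetical and do not correspond to the ClinVar data format or API, and the example records are simplified.

```python
from dataclasses import dataclass

PATHOGENIC_TERMS = {"pathogenic", "likely pathogenic"}
BENIGN_TERMS = {"benign", "likely benign"}

@dataclass
class VariantRecord:
    """Hypothetical, simplified view of one ClinVar variant entry."""
    name: str
    genes: list        # gene symbols affected by the variant
    assertions: list   # one clinical-significance term per submission

def meets_inclusion_criteria(record):
    terms = {a.lower() for a in record.assertions}
    has_pathogenic = bool(terms & PATHOGENIC_TERMS)   # criterion (i)
    has_benign = bool(terms & BENIGN_TERMS)           # criterion (ii)
    single_gene = len(record.genes) == 1              # criterion (iii)
    return has_pathogenic and not has_benign and single_gene

# Illustrative records (assertion lists are simplified assumptions).
conflicting = VariantRecord("COQ8A p.Phe331=", ["COQ8A"],
                            ["pathogenic", "benign"])
multi_gene = VariantRecord("hypothetical multi-gene deletion",
                           ["COQ3", "GENE_X"], ["pathogenic"])
accepted = VariantRecord("hypothetical single-gene variant", ["COQ2"],
                         ["likely pathogenic", "uncertain significance"])

for rec in (conflicting, multi_gene, accepted):
    print(rec.name, "->", meets_inclusion_criteria(rec))
```

Records passing this automated step would still require the manual review described in the text.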
Complete records for all variants meeting our inclusion criteria were manually reviewed, including confirming that the record matches the description in any cited studies. To ensure as complete a record of known variants as possible, we also conducted a systematic literature search via PubMed (search conducted on 2017-01), in which we reviewed all clinical studies in the search results for each gene name.

Identification of predicted pathogenic variants.
To identify predicted pathogenic variants in the gnomAD database, we first excluded variants that did not pass quality-control filters and those in non-canonical transcripts (as defined by gnomAD, the canonical transcript is the longest consensus coding sequence translation with no stop codons). To identify LoF variants, we extracted those annotated as "stop gained", "frameshift", "splice donor" or "splice acceptor" and excluded variants which gnomAD had flagged as low-confidence LoF. To identify the missense variants that were most likely to be pathogenic, we extracted only those variants for which gnomAD reported an assessment of "probably damaging" and "deleterious" by PolyPhen2 and SIFT respectively. To reduce the risk of obtaining false positives, we excluded variants with high minor allele frequencies (MAF). Although a MAF cut-off of 0.5% has been suggested 22, we chose a more conservative approach, instead using the highest observed MAF in the list of "known" pathogenic variants as a threshold: thus variants with a global MAF greater than 0.019% or a MAF for any population greater than 0.31% were excluded.
Data Availability. The datasets analyzed during the current study are available at www.ncbi.nlm.nih.gov/clinvar/ and http://gnomad.broadinstitute.org/.
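The filtering steps used to select predicted pathogenic variants can be sketched as a single predicate. The dictionary keys below are hypothetical field names chosen for illustration and are not the column names of any gnomAD export; only the annotation terms and the MAF thresholds are taken from the text.

```python
LOF_CONSEQUENCES = {"stop gained", "frameshift",
                    "splice donor", "splice acceptor"}
GLOBAL_MAF_MAX = 0.019 / 100      # 0.019 %, from the text
POPULATION_MAF_MAX = 0.31 / 100   # 0.31 %, from the text

def is_predicted_pathogenic(variant):
    """Apply the QC, transcript, MAF and annotation filters in turn.

    `variant` is a dict with hypothetical keys: passes_qc, canonical,
    global_maf, population_mafs, consequence, low_confidence_lof,
    polyphen, sift.
    """
    if not (variant["passes_qc"] and variant["canonical"]):
        return False
    highest_pop_maf = max(variant["population_mafs"].values(), default=0.0)
    if variant["global_maf"] > GLOBAL_MAF_MAX:
        return False
    if highest_pop_maf > POPULATION_MAF_MAX:
        return False
    if variant["consequence"] in LOF_CONSEQUENCES:
        return not variant["low_confidence_lof"]
    if variant["consequence"] == "missense":
        # Both predictors must agree at their most severe category.
        return (variant["polyphen"] == "probably damaging"
                and variant["sift"] == "deleterious")
    return False

# Hypothetical example variant.
example = {
    "passes_qc": True, "canonical": True,
    "global_maf": 0.00002, "population_mafs": {"EAS": 0.0003},
    "consequence": "missense", "low_confidence_lof": False,
    "polyphen": "probably damaging", "sift": "deleterious",
}
print(is_predicted_pathogenic(example))
```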

Results
Through ClinVar, we identified 552 reported genetic variants affecting UQ biosynthesis genes (for complete listing, see File S1). Of these, 143 were deletions or duplications affecting multiple genes (all 17 variants reported for COQ3 and COQ5 fell into this category), and were not considered further because the pathogenicity of these variants could potentially be related to the activity of multiple genes. Of the remainder, 315 were excluded because submitters did not assess them as pathogenic (in only one case, ClinVar variation 3645, COQ8A p.Phe331=, were there both pathogenic and benign interpretations; in this case, a MAF of 1.57% supports the benign interpretation). Fourteen of those remaining were subsequently excluded because close inspection of the records and cited works revealed a number of problems, including single-copy variants not consistent with the typically recessive nature of UQ deficiencies (4 records), duplicate records (4), risk factors mis-categorized as causative pathogenic variants (2), an incomplete ClinVar entry (1), a multi-variant haplotype not testable in gnomAD (1), reliance on a secondary, unreferenced, source (1), and one variant present in the untranscribed N-terminal region of COQ2 (see Methods).
Of the remaining 80 records, the majority (49) had been extracted from the peer-reviewed literature, 22 were from the genetic testing company GeneDx (MD, USA), with the remainder from 6 other testing labs (see Table S2 for detailed information). GeneDx and five other testing labs provided detailed assertion criteria for the determination of variant pathogenicity, all adhering to established standards.
To account for the possibility that not all known pathogenic variants are included in ClinVar, we carried out an independent review of the literature, identifying 17 additional pathogenic variants: 2 affecting COQ2, 3 COQ4, 1 COQ6, 9 COQ8A, and 2 COQ8B (see Table S2 for literature references).
In total, we identified 97 pathogenic variants. Of these, 57 resulted in a single residue substitution, 21 in frameshifts, 10 premature stop codons, 7 variants altering splice-site donor or acceptor regions in ways predicted to be pathogenic, and 3 single-residue indels (see Table S2 for a complete listing of all identified known pathogenic variants). COQ8A was most frequently affected, with 40 variants.
To better understand the birth prevalence of these variants we queried the gnomAD exome and genome database. We found 441 carriers, with 49 of 97 pathogenic variants present (Table 1). No variants were present in homozygous form. All missense variants were predicted to be damaging by PolyPhen2, SIFT, or both, and all premature stop, frameshift or splice site-disrupting variants were predicted to be high-confidence loss-of-function.
3 Predicted number of homozygotes in a population of a size equivalent to that of the given ethnicity globally.
4 Predicted number of homozygotes in a population of a size equivalent to that of the given ethnicity in the USA.
All of these findings are fully consistent with the reported pathogenic nature of these variants. Global allele frequencies ranged from 4.1 × 10^-6 to 1.7 × 10^-4, yielding a combined frequency of 1.76 × 10^-3, implying that 1/321,368 individuals will be homozygous for pathogenic variants at birth. Through casual observation it was apparent that several of the known pathogenic variants were not distributed evenly within the different populations. For example, the COQ8A p.Met555Ile variant was observed in 39 European or Finnish individuals, but in no other population, and the COQ8B p.Glu483* variant was observed in 10 individuals from South Asia but only in 1 European, despite the almost 4-fold greater number of European alleles genotyped. Indeed, the six variants with the greatest numbers of carriers had frequencies that were distributed unevenly between populations (Pearson's chi-squared 24.3 to 537.5, p < 0.001) (Figure S1; statistically significant differences were rarer among the variants with lower allele counts, potentially due to the decreased statistical power inherent in a lower sample size). Because of this unevenness, subsequent analysis was conducted on a population-by-population basis.
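As an illustration of the kind of test reported above, the following sketch computes a Pearson chi-squared statistic for a 2 × k table of variant versus non-variant alleles across k populations. The allele counts are hypothetical and the exact table layout used in the analysis is an assumption; the p-value step is omitted to keep the example dependency-free.

```python
def pearson_chi_squared(variant_counts, total_alleles):
    """Pearson chi-squared statistic for a 2 x k contingency table.

    variant_counts[i]: variant alleles observed in population i
    total_alleles[i]:  all alleles genotyped in population i
    Requires at least one variant allele overall (expected counts > 0).
    """
    total_variant = sum(variant_counts)
    grand_total = sum(total_alleles)
    statistic = 0.0
    for observed_variant, n in zip(variant_counts, total_alleles):
        expected_variant = n * total_variant / grand_total
        expected_other = n - expected_variant
        observed_other = n - observed_variant
        statistic += (observed_variant - expected_variant) ** 2 / expected_variant
        statistic += (observed_other - expected_other) ** 2 / expected_other
    return statistic

# Hypothetical counts: a variant seen only in the first of two populations.
uneven = pearson_chi_squared([30, 0], [1000, 2000])
# Hypothetical counts: the same frequency in both populations.
even = pearson_chi_squared([10, 20], [1000, 2000])

print(f"uneven: {uneven:.2f}, even: {even:.2f}")
```

The statistic has k - 1 degrees of freedom. If all seven populations were compared (six degrees of freedom, an assumption), the critical value for p = 0.001 is about 22.46, so the reported range of 24.3 to 537.5 would be consistent with p < 0.001.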
Each pathogenic variant was observed in an average of 1.9 populations (not counting 'Other'), with an average allele frequency of 1.56 × 10^-4 (Table 2). Combined estimates of Hardy-Weinberg homozygosity for all variants for each of the 7 populations averaged 1/5,492,983, ranging from 1/12,021,014 (Latin Americans) to 1/60,113 (Ashkenazi Jews). Predicted homozygous frequency for individual variants averaged 1/5.4 M, with the variant found at the greatest frequency being COQ4 p.Arg240Cys (with a 1/162 carrier frequency among Ashkenazi Jews, which would result in the birth of homozygotes at a frequency of 1/104,733). With an estimated worldwide population of 10 M, this would imply 95 afflicted Ashkenazi Jews susceptible to UQ deficiency due to homozygosity for this one variant alone. Considering all variants across all populations, we can predict 1,016 homozygous-at-birth individuals globally, or 122 in the USA (Table 2).
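The COQ4 p.Arg240Cys figures can be checked with a short calculation. This is a sketch using the rare-allele approximation (carrier frequency 2pq ≈ 2q), which is an assumption; it is not necessarily the exact calculation behind the reported values.

```python
# Values taken from the text.
CARRIER_DENOMINATOR = 162              # carrier frequency of 1/162
REPORTED_HOMOZYGOTE_DENOM = 104_733    # homozygote frequency of 1/104,733
POPULATION = 10_000_000                # estimated worldwide population

# Rare-allele approximation: carriers are 2pq, which is close to 2q.
q_approx = 1 / (2 * CARRIER_DENOMINATOR)
approx_homozygote_denom = 1 / q_approx ** 2   # 1 / q^2

# Expected homozygotes using the reported homozygote frequency.
expected_homozygotes = POPULATION / REPORTED_HOMOZYGOTE_DENOM

print(f"q (approx.): {q_approx:.5f}")
print(f"homozygotes (approx.): 1/{approx_homozygote_denom:,.0f}")
print(f"expected homozygotes in 10 M: {expected_homozygotes:.1f}")
```

The approximation gives 1/104,976, close to the reported 1/104,733; the small difference presumably reflects rounding of the carrier frequency. Dividing 10 M by 104,733 gives about 95.5, in line with the 95 individuals stated.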
The presence of multiple variants within the same populations is consistent with the numerous reports of compound heterozygosity in patients with primary UQ deficiency 7. When estimating birth prevalence of compound heterozygosity, pathogenic variants in COQ8A again exhibited a greater prevalence relative to other genes. In fact, the birth prevalences of compound heterozygotes for COQ8A among Ashkenazi Jews (1/725,578), Finns (1/1.4 M) or non-Finnish Europeans (1/1.6 M) alone were all individually greater than the combined prevalence of all other genes (1/17.1 M) (Table 3, Table S3). We can estimate that 649 individuals worldwide are born compound heterozygous for pathogenic genetic variants causing UQ deficiency, with 70 in a USA-like population.
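Under Hardy-Weinberg assumptions, the expected frequency of compound heterozygotes for a set of pathogenic alleles in one gene and one population is the sum of 2·q_i·q_j over all distinct pairs. The sketch below illustrates this with hypothetical allele frequencies; it assumes the variants are independent and lie in the same gene, and is not taken from the analysis code.

```python
from itertools import combinations

def compound_heterozygote_frequency(allele_freqs):
    """Sum of 2 * q_i * q_j over all unordered pairs of distinct variants."""
    return sum(2 * qi * qj for qi, qj in combinations(allele_freqs, 2))

def homozygote_frequency(allele_freqs):
    """Sum of q_i^2 over all variants."""
    return sum(q ** 2 for q in allele_freqs)

# Hypothetical allele frequencies for three variants in one gene.
freqs = [0.001, 0.002, 0.003]

comp_het = compound_heterozygote_frequency(freqs)
homozygous = homozygote_frequency(freqs)

# The two terms together equal the square of the pooled frequency.
pooled_squared = sum(freqs) ** 2

print(f"compound heterozygotes: 1/{1 / comp_het:,.0f}")
print(f"homozygotes:            1/{1 / homozygous:,.0f}")
```

With many rare variants in a gene, the pairwise terms can outweigh the homozygous terms, which is one way compound heterozygotes can account for a large share of predicted cases.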

Premature stop codons, frameshifts or the disruption of canonical splice sites (LoF) or critical protein residues (via missense mutations) are all expected to result in significant impairments to protein function. Although we can expect an unknown proportion of these predicted pathogenic variants to result in embryonic lethality, those that do allow survival to birth are likely to result in clinically significant illness. We therefore determined the birth prevalence of all predicted pathogenic variants in UQ biosynthesis genes, as described in Methods. Across all UQ biosynthesis genes there were a total of 782 predicted pathogenic variants (including all known pathogenic variants), and 618 possible compound heterozygote combinations (summarized in Table 4, complete variant list in Table S4 and Table S5). The two genes with the highest frequency of predicted pathogenic variants (combining homozygotes and compound heterozygotes) were COQ8A and COQ8B, with cross-population average incidences of 1/193,621 and 1/198,391, resulting in a predicted 27,321 and 44,727 afflicted individuals worldwide, respectively, and 391 and 398 afflicted individuals in the USA. The gene with the lowest frequency was COQ3 (1/57 M), with only 146 predicted affected individuals worldwide, and none predicted in the USA. The population with the greatest total frequency of pathogenic variants was that of East Asia (1/20,170), with a predicted 79,423 afflicted individuals worldwide. The variant with the greatest prevalence in any population was COQ4 p.Arg240Cys in the Ashkenazi Jewish population, with a MAF of 0.0001719 (Table S4).
Considering the occurrence of both homozygotes and compound heterozygotes averaged across all populations, our results predict a global birth prevalence of 1/52,092. However, not all the populations considered are of equal size, and the predicted number of afflicted individuals worldwide was 41,555 due to homozygosity and 85,581 due to compound heterozygosity, for a total of 123,789 (1/48,495). In the USA, our analysis predicts 1,462 afflicted individuals (1/211,917).
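The step from per-population prevalence to a count in a mixed population can be sketched as a weighted sum. The population labels, prevalences and sizes below are hypothetical placeholders, not values from the tables; only the final consistency check uses figures from the text.

```python
def mixed_population_cases(prevalence, sizes):
    """Expected cases in a mixed population.

    prevalence[pop]: birth prevalence in that population (as a fraction)
    sizes[pop]:      number of individuals of that population in the mix
    """
    return sum(prevalence[pop] * n for pop, n in sizes.items())

# Hypothetical two-population mix.
prevalence = {"pop_a": 1 / 50_000, "pop_b": 1 / 200_000}
sizes = {"pop_a": 1_000_000, "pop_b": 4_000_000}

cases = mixed_population_cases(prevalence, sizes)
overall_denominator = sum(sizes.values()) / cases

print(f"expected cases: {cases:.0f}")
print(f"overall prevalence: 1/{overall_denominator:,.0f}")

# Consistency check on the reported USA figures:
# 1,462 individuals at 1/211,917 implies the population size used.
implied_usa_population = 1_462 * 211_917
print(f"implied USA population: {implied_usa_population:,}")
```

The implied USA population of roughly 310 million shows why the overall prevalence in a mixed population depends on its composition and need not equal the unweighted cross-population average.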

Discussion
Overall, our results predict a worldwide total of 123,789 individuals suffering from primary UQ deficiency, and 1,462 in a population with a composition similar to the USA. Of these, 1,665 and 192 respectively are due to variants that are known to be pathogenic, with the remainder due to predicted LoF and pathogenic missense variants (summarized in Fig. 1A and B). However, the extent to which known pathogenic variants contributed to the total varied between populations. The addition of predicted pathogenic variants had less impact for Western populations (Ashkenazi Jews, Finnish and non-Finnish Europeans: blue in Fig. 1), resulting in an average 3.5-fold increase in the number of afflicted individuals relative to known pathogenic variants only (Fig. 1C). In contrast, in populations from non-Western, developing regions (South and East Asians, Latin Americans and Africans: red in Fig. 1), the addition of predicted pathogenic variants resulted in an average 122-fold increase in the number of afflicted individuals (Fig. 1C). The increased likelihood of pathogenic variants having been identified in Western populations is consistent with their relatively higher clinical coverage compared to non-Western populations, where the expense of clinical sequencing has limited the genetic characterization of patients suffering from mitochondrial disease. Our results imply that primary UQ deficiency is substantially under-diagnosed in Latin American, African and Asian populations.
There are several factors that could introduce error into our predictions. For example, LoF variants may be so harmful that a homozygous individual is not viable in the first place. That this is possible is supported by the embryonic lethality of the complete genetic ablation of PDSS2, COQ2, COQ3, COQ6 and COQ7 in mice [16][17][18][19], with COQ4 exhibiting pre-weaning lethality 17. In contrast, COQ8A-null 23 and COQ9-null 24 mice have been reported as viable. Indeed, among patients with pathogenic variants in the UQ biosynthesis genes likely to be necessary for life (PDSS1-COQ7), very few are homozygous or compound heterozygous for severe variants expected to result in significant LoF. Among the severe variants (nonsense, frameshift, splice site affecting) for these genes described in the literature we reviewed, only COQ2 p.Asn401Ilefs*15 was present in homozygous form, resulting in multi-organ failure and death in an infant patient 25, and there was only one patient compound heterozygous for LoF variants (COQ6 p.Trp447* and p.Gln461fs*478 26). In all three variants the region affected was close to the C-terminus (closer than any other known pathogenic variant for these genes), implying that these patients may have retained some partially functional protein, and that other severe variants may have resulted in complete LoF and embryonic lethality.
It is therefore likely that some of the LoF variants that contribute to our final totals may not actually contribute to disease rates due to embryonic or pre-natal lethality. Variant severity is not easy to predict: for example, COQ9 R239X mice that express a partial protein have a much more severe phenotype than COQ9 Q95X mice with no measurable protein expression, presumably due to the destabilization of a multiprotein UQ biosynthesis complex by the truncated protein 27. However, homozygous or compound heterozygous severe variants in PDSS1 through COQ7 account for only 6,142 out of 123,789 predicted individuals worldwide, and 219 out of 1,462 in the USA. This suggests that our predictions are not greatly inflated by the inclusion of embryonically lethal allelic combinations.
Our predictions may also suffer from the opposite problem: missense variants identified as damaging by SIFT or PolyPhen2 may, in fact, not have deleterious physiological effects. We attempted to address this by requiring our "predicted pathogenic" variants to be rated as highly likely to be deleterious by both PolyPhen2 and SIFT, but such prediction algorithms are clearly not infallible. For example, COQ4 p.Arg145Gly was rated as "tolerated" by SIFT, yet was reported in homozygous form in a neonate who died 4 h after birth, and it also failed to rescue Δcoq4 yeast 28. It is therefore reasonable to expect a certain proportion of predicted-pathogenic missense variants to result in asymptomatic individuals. Interestingly, missense variants seem to be responsible for a lesser proportion of COQ8A and COQ8B-deficient individuals, with patients homozygous or compound heterozygous for LoF variants being relatively common [29][30][31][32]. Given that COQ8A alone can rescue COQ8-null yeast 33, and COQ8A patients with truncating nonsense mutations shown to result in nonsense-mediated decay remained viable in their mid-20s 32, it is likely that these genes may be relatively insensitive to some borderline-pathogenic missense variants. This has the potential to greatly impact our predictions, with homozygous or compound heterozygous
COQ8A and COQ8B are also noteworthy in that most of the known patients have relatively well-defined, gene-specific, pathologies. Specifically, symptoms of ataxia (often associated with cerebellar atrophy or other neurological abnormalities) are found with 26 of the 29 known pathogenic variants of COQ8A, and all 13 of the published COQ8B pathogenic variants exhibited nephrotic syndrome (citations provided in Table S2). It would therefore be tempting to claim that our predicted patients would exhibit similar clinical conditions, with, for example, all predicted COQ8B patients suffering from nephrotic syndrome 34.
However, it is likely (and our results support) that only a subset of primary UQ deficiency patients have been identified at this point, and they may be non-representative of the actual patient population. Of note, many of the known pathogenic variants were identified in studies where clinicians screened cohorts of patients with specific subsets of well-defined symptoms. For example, our knowledge of COQ8B variants largely comes from two studies in which large numbers of patients with nephrotic syndrome were subjected to sequencing of either whole exomes or multi-gene panels designed for nephrotic syndrome 29,35. A similar issue can be raised for the ataxic nature of COQ8A variants. For example, two studies described how, after identifying pathogenic COQ8A variants in ataxic patients, they proceeded to sequence COQ8A in other ataxic patients, identifying additional novel pathogenic variants 30,32. Additional pathogenic variants were found in later studies in which COQ8A, alone or in combination with other UQ biosynthesis genes, was specifically sequenced in ataxic patients 31,36. We hypothesize that future COQ8A or COQ8B patients identified via less targeted methodologies may present with more diverse clinical phenotypes, as is characteristic of other UQ biosynthesis genes such as COQ2 or COQ4.
There are also several factors that could increase the number of afflicted individuals beyond our estimates. For example, we conservatively assumed that primary UQ deficiency is always recessive; however, haploinsufficiency of COQ4 has been shown to cause clinically significant primary UQ deficiency 37. Also, violations of Hardy-Weinberg equilibrium (e.g., consanguinity or populations with a large degree of endogamy) could increase the likelihood of an individual being born with two pathogenic variants.
It is also noteworthy that 6 of the 29 missense variants known to be pathogenic would not have met our criteria for inclusion as "predicted" pathogenic variants, since they were not assigned the highest level of confidence for pathogenicity by both SIFT and PolyPhen2 (Table 1). This supports the conservative nature of our selection criteria.
Furthermore, there are several reasons why truly pathogenic variants may not appear on our list of known variants. Some variants may have been identified in clinics without being formally described in the literature. For example, COQ2 p.Met128Val and p.Arg387* have been cited as pathogenic in the secondary literature 38, but without a formal research citation they would not have met our inclusion criteria as known pathogenic variants. Furthermore, although the latter variant was included as a predicted pathogenic variant, the former was assessed as 'benign' and 'tolerated' by PolyPhen2 and SIFT respectively, excluding it from our list of predicted pathogenic variants. In addition, many predicted pathogenic variants were more common in non-Western populations, meaning that they are less likely to have been identified in the existing clinical reports, which have focussed on Western populations. Additionally, our list of known pathogenic variants may not have included variants detected as part of recent large-scale studies [39][40][41], and the fact that some UQ biosynthesis genes were found to be associated with disease earlier than others (e.g., COQ2 was first found in 2006 42, vs. COQ8B in 2013 29 and COQ7 in 2015 43) could have delayed the introduction of some genes into widely used genetic screening panels 44, meaning that more patients were screened for some genes compared to others. Finally, after the literature review phase of our analysis was concluded, novel pathogenic variants have continued to be described in the clinical literature (e.g., COQ4 45, COQ6 46, COQ7 47, ADCK4 48,49), indicating that many remain to be reported.
Several aspects of our results point towards their general reliability. For example, there have been no reports of pathogenic variants in COQ3 or COQ5, which is consistent with our prediction of few individuals with primary UQ deficiency due to pathogenic variants in these genes (fewer than 2,000 individuals worldwide, and only 4 in the USA). Conversely, more patients with defects in COQ8A and COQ8B have been described than for any other UQ biosynthesis gene 8, which corresponds to our finding that pathogenic variants in these genes make the greatest contribution to the number of individuals worldwide predicted to suffer from primary UQ deficiency, together accounting for more than half of the predicted 127,136 patients worldwide.
In conclusion, we have made the first estimates of the worldwide and within-population birth prevalence of individuals who are homozygous or compound heterozygous for pathogenic variants causing primary UQ deficiency by combining a decade's worth of clinical genetics with the recently available large-scale exome/genome sequencing. Our calculations suggest a minimum of 1,665 afflicted individuals worldwide or 192 in the USA (using only variants clinically shown to be pathogenic), up to a maximum of 123,789 worldwide or 1,462 in the USA (with all variants predicted to be pathogenic). Notably, the gap between predictions made using "known" vs. "predicted" pathogenic variants appears smallest in populations expected to have the greatest access to the modern methodologies of clinical genetics. This implies that healthcare providers have already made substantial headway in identifying individuals suffering from this disorder. However, it remains likely that the bulk of patients worldwide suffering from primary UQ deficiency have yet to be recognized.