Introduction

Diseases that can be attributed to mutations in single genes are often referred to as Mendelian, and their phenotypic spectrum essentially spans all structural and functional aspects of the human body.1,2 Despite the potential of these diseases to reveal the medical relevance of individual human genes, their rarity makes them less familiar in clinical practice and more difficult to study epidemiologically compared to their common multifactorial counterparts.

The rarity of Mendelian diseases is the consequence of a strong negative selection that operates on causative alleles because many result in phenotypes that are detrimental to the reproductive success of the affected individuals.3 In the case of dominant disorders, the negative selection operates conspicuously; i.e., the defective allele is eliminated by the failure of the patient to reproduce. Dominant lethal alleles are thereby replenished through their introduction anew in the genetic pool, i.e., de novo. Recessive alleles, however, tend to persist in the genetic pool even when they result in lethal phenotypes because carriers typically have normal reproductive health.4

Although individual Mendelian diseases tend to be rare, some can achieve a relatively high frequency.5 This may be due to an unusual “heterozygote advantage” that endows health benefits to carriers, leading to reproductive success, as in the classic case of carriers of HBB mutations and resistance to malaria.6 Founder effect is another major factor that can inflate the frequency of recessive diseases because it increases the proportion of carriers; classic examples include Finnish and Ashkenazi Jewish diseases.7,8 Consanguinity is a special case of founder effect whereby the carrier frequency is not increased per se but the very limited genetic pool within any given family increases the probability that a mate is a carrier of the same recessive allele.9 It should be noted that, although individually rare, Mendelian diseases collectively account for significant morbidity and mortality, especially in the pediatric population.10

Countrywide estimates of burden of diseases—an essential metric that informs public health policies—are routinely obtained for a wide variety of disease conditions.11 Although Mendelian diseases are not rare in aggregate, estimating their disease burden at a national level is cumbersome and faces many challenges. This is particularly concerning when one considers that most Mendelian diseases are incurable and many rank highly in clinical severity, thus implying prevention as the best health-care policy to combat them.12 In the absence of reasonable estimates of the overall burden that these disorders pose to a country’s state of health, well-informed prevention strategies will be difficult to design. A major obstacle in estimating disease burden for Mendelian diseases is diagnosis, not only because they are unfamiliar to many physicians but also because their presentation can vary greatly such that they may not be diagnosed clinically, even by trained medical geneticists.13 Although this can be mitigated by adopting genomic molecular approaches that are agnostic to clinical judgment,14 the logistic challenge of a national registry that logs all molecularly diagnosed cases can be formidable.

In this study, we show that estimating the disease burden for recessive Mendelian diseases can be informed by clinical genomics, a growing trend in medical genetics whereby an individual patient is tested for all or many genes at once.15 In a large data set of exomes and gene panels performed on 7,101 patients from all regions of Saudi Arabia, we were able to calculate the carrier frequency for confirmed pathogenic pediatric-onset recessive conditions. The overall disease burden inferred from this set of data reveals surprising patterns that we hope will inform policymakers locally and encourage others to explore this promising application of clinical genomics.

Materials and Methods

As part of our ongoing effort to characterize autosomal recessive diseases in Saudi Arabia, we have genomically analyzed 7,101 patients: 1,549 by whole-exome sequencing and the rest by targeted gene panel sequencing. The assignment of a panel to a patient was based on the phenotype and diagnosis, as explained before in our report of the “Mendeliome” study.15 It is important to mention that in this study we used more samples than in the Mendeliome study.

The sequencing work was carried out using the setup of the Saudi Genome Project based on Ion Torrent sequencing technology. The content and design of the multigene panels were described previously,15 as was the commercial Inherited Disease Panel of LifeTech, together comprising 3,085 unique known disease genes across all panels. Whole-exome sequencing was performed using primer-based sequencing, targeting all the genes in the genome (approximately 25,000 genes and open-reading frames). The sequencing and data analysis procedures are the same as described previously.15

Our analysis of this large number of patients with suspected Mendelian diseases revealed a large number of likely causal variants, but we chose a subset of 618 autosomal recessive variants from 357 genes that can be assigned pathogenic status with the highest confidence for subsequent analysis (Supplementary Tables S1 and S2 online). Supplementary Table S3 online includes gene-per-gene sequencing quality information in terms of depth and coverage. The overall average coverage of all genes is 98%, and the overall average depth is 701×.

Obviously, the homozygous or compound heterozygous state of these alleles is highly biased in our cohort and cannot be used to directly estimate disease frequency. However, we reasoned that their presence in the heterozygous state is unrelated to the disease state of the tested individual. For example, the probability of being a carrier for a hemoglobinopathy mutation is not influenced by being homozygous for a Bardet-Biedl disease mutation. Therefore, we screened our cohort for these 618 variants but counted their occurrence only in the heterozygous state (excluding compound heterozygosity). The minimum number of samples screened for carrier status per variant is 1,549 (1,549 exomes + 0 multigene panels), the maximum is 4,975 (1,549 exomes + 3,426 multigene panels), and the average is 2,322 samples tested per variant. The number of samples screened for each variant is shown in Supplementary Tables S1 and S2 online.

The carrier frequency for variant X, CF(X), is computed according to the following equation:

CF(X) = (number of samples heterozygous for X) / (number of samples screened for X)
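As a minimal sketch of this calculation, assuming the carrier frequency is simply the number of heterozygous samples divided by the number of samples screened for that variant (the function name and the counts below are hypothetical, not study data):

```python
def carrier_frequency(heterozygous_count: int, samples_screened: int) -> float:
    """Fraction of screened samples that carry the variant in the heterozygous state."""
    if samples_screened <= 0:
        raise ValueError("samples_screened must be positive")
    if not 0 <= heterozygous_count <= samples_screened:
        raise ValueError("heterozygous_count must lie between 0 and samples_screened")
    return heterozygous_count / samples_screened


# Hypothetical counts for illustration only (not taken from the study).
example_cf = carrier_frequency(heterozygous_count=10, samples_screened=2000)
print(f"CF = {example_cf:.4f}")
```

Because the number of samples screened differs per variant (exome-only versus exome plus panels), the denominator would need to be looked up per variant rather than fixed for the whole cohort.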

Results

Founder mutations account for a minority of recessive mutations in Saudi Arabia

We have previously suggested that the remarkable genetic and allelic heterogeneity we observe in Saudi Arabia is due, at least in part, to the effect of consanguinity, which theoretically can render de novo recessive mutations homozygous in a span of three generations only. In other words, consanguinity can inflate the contribution of young mutations to the mutational spectrum of autosomal recessive diseases.16 Consistent with this, we noticed that 58% (359/618) of the disease-causing mutations encountered did not appear to have a founder effect beyond the family in which they were identified, as inferred by their absence in the heterozygous state in any of more than 7,000 samples tested; i.e., they were “private” mutations (Figure 1 and Supplementary Table S2 online).

Figure 1

Relative distribution of founder and private variants.

Wide range of carrier frequencies of founder mutations

We identified 259 autosomal recessive mutations for which at least one carrier was encountered in our cohort of more than 7,000 unrelated patients (excluding parents and other relatives), which enabled us to calculate the corresponding carrier frequency (Supplementary Table S1 online). Supplementary Table S3 online shows the combined carrier frequency for each screened gene. The highest carrier frequency for a single mutation was 0.0218 for the known founder mutation CYP1B1:NM_000104.3:c.1103G>A:p.Arg368His, which causes congenital glaucoma, a highly endemic disease in Saudi Arabia, albeit with reduced penetrance.17 The combined carrier frequency for the three founder mutations in CYP1B1 was 0.047. The lowest carrier frequency was 0.0002 for ADA:NM_000022.2:c.385G>A:p.Val129Met, which causes severe combined immunodeficiency due to lack of adenosine deaminase. However, the combined carrier frequency for all immunodeficiency founder mutations was high, approaching 0.01.

To calculate the overall carrier frequency for all recessive mutations without resorting to computationally challenging combinatorial probability statistics, we queried the exomes of 1,549 unrelated patients with various Mendelian disorders and ignored the homozygous calls to avoid bias (see Materials and Methods). The probability of being a carrier for at least one of the founder mutations was approximately 0.214 (331/1,549); the corresponding probabilities were 0.022 for at least two and 0.002 for at least three mutations.
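The direct counting approach described above can be sketched as follows. This is an illustration only: it assumes each sample has already been reduced to the number of listed founder variants it carries in the heterozygous state, and both the function name and the toy cohort are hypothetical.

```python
def fraction_with_at_least(het_counts: list[int], k: int) -> float:
    """Fraction of samples heterozygous for at least k of the screened variants."""
    if not het_counts:
        raise ValueError("het_counts must not be empty")
    return sum(1 for count in het_counts if count >= k) / len(het_counts)


# Hypothetical toy cohort: per-sample count of heterozygous founder variants
# (homozygous calls are assumed to have been dropped beforehand).
toy_cohort = [0, 0, 1, 0, 2, 0, 1, 0, 0, 3]

for k in (1, 2, 3):
    print(f"at least {k}: {fraction_with_at_least(toy_cohort, k):.2f}")
```

Counting per sample sidesteps the need to combine per-variant frequencies, which is the combinatorial step the text says was avoided; with the reported figures, 331 of 1,549 exomes gives 0.214.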

High burden of recessive diseases in Saudi Arabia

Although Saudi Arabia is known to be enriched for recessive diseases due to high rates of consanguinity, it has not been possible to quantify this.18 Using the Hardy–Weinberg equation, we calculated the disease frequency (q²) based on the carrier frequency (2pq). According to this method, we found that the most common group of diseases comprises those that can present with psychomotor delay/intellectual disability (3 per 1,000), followed by retinal dystrophies (1.7 per 1,000) and congenital glaucoma (1 per 4,500). The estimated burden of the disease categories is shown in Table 1 and Figure 2. However, the Hardy–Weinberg equation assumes random mating, a condition clearly violated by the >50% rate of consanguinity in Saudi Arabia.18 In the setting of consanguinity, the probability that a mutant allele of frequency q coexists with another mutant allele is not q × q (q²) but q × F, where F is the inbreeding coefficient. For example, although the carrier frequency of sickle cell disease (0.008) predicts a disease frequency of 1 per 62,500, the high inbreeding coefficient in Saudi Arabia (estimated at 0.0241 (ref. 19)) puts the disease frequency at 1 per 10,300 (Figure 3). Because the latter is much closer to the estimated disease prevalence based on a previously published community-based study,20 we opted to present the disease burden based on both q² and qF (Table 1). To provide a regularly updated estimate of disease burden based on newly identified mutations, a freely available database was established that can be accessed at http://shgp.kacst.edu.sa/dbm/.
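The sickle cell example can be reproduced with a short calculation. The sketch below assumes the rare-allele approximation q ≈ carrier frequency / 2 (that is, 2pq ≈ 2q), which is what yields the quoted 1 per 62,500; the function names are hypothetical.

```python
def allele_frequency(carrier_freq: float) -> float:
    """Approximate mutant allele frequency q from carrier frequency 2pq (rare-allele assumption, p ≈ 1)."""
    return carrier_freq / 2


def disease_frequency_q2(carrier_freq: float) -> float:
    """Expected disease frequency under random mating (Hardy-Weinberg): q squared."""
    return allele_frequency(carrier_freq) ** 2


def disease_frequency_qf(carrier_freq: float, inbreeding_coeff: float) -> float:
    """Expected disease frequency using the q x F form described in the text."""
    return allele_frequency(carrier_freq) * inbreeding_coeff


SICKLE_CELL_CF = 0.008   # carrier frequency quoted in the text
SAUDI_F = 0.0241         # inbreeding coefficient quoted in the text

q2 = disease_frequency_q2(SICKLE_CELL_CF)
qf = disease_frequency_qf(SICKLE_CELL_CF, SAUDI_F)
print(f"q^2: 1 per {round(1 / q2):,}")
print(f"qF:  1 per {round(1 / qf):,}")
```

The qF result comes out at roughly 1 per 10,400, in line with the rounded 1 per 10,300 quoted above. For reference, the standard inbreeding-adjusted homozygote frequency is q² + pqF; the qF form used in the text is the term that dominates when F is well above q.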

Table 1 Estimated burden of disease categories based on derived and observed combined carrier frequency

Figure 2

Top 10 diseases with respect to their burden based on q². For abbreviation key, see the Table 1 footnote.

Figure 3

Top 10 diseases with respect to their burden based on qF. For abbreviation key, see the Table 1 footnote.

Discussion

Genomic sequencing (whole-genome, whole-exome, and gene panels) is a new trend in diagnostics that was driven by the need to develop diagnostic tools that can overcome the limited sensitivity and specificity of clinically based diagnosis of genetic diseases.21 We have previously published the results of applying this approach to more than 2,300 patients with various suspected Mendelian disorders.15 Through that study and others, we were able to compile a large list of point mutations, the majority of which are recessive, consistent with the predicted effect of the high consanguinity rate in Saudi Arabia. Although these samples come from all regions in Saudi Arabia, they could not be used to estimate the disease frequency directly because of the obvious bias in sampling. However, we reasoned that each of these unrelated patients can contribute to the estimate of carrier frequency of mutations other than their causative mutations. Using this approach, we calculated the first countrywide estimate of disease burden of recessive diseases in the Middle East without resorting to independent genomic sequencing of healthy controls.

The overall disease-burden estimate we present in this work is a minimal estimate for several reasons. First, our estimate does not take into account the regional variation in carrier frequency. One good example is sickle cell disease, for which previous studies have shown a remarkably wide range of carrier frequency across Saudi provinces, from 0 in the north to 0.25 in the eastern province, where the disease is endemic.20,22 Second, we relied on founder mutations to calculate carrier frequency, which account for only a minority of pathogenic mutations (most of the mutations we identified are private). Novel disease-causing mutations that have yet to be identified are obviously not included in our estimate. Finally, mutations that are not readily picked up by next-generation sequencing were not included (e.g., the common founder mutation GORAB NM_152281.2:c.306dupA:p.P103Tfs*20 (ref. 23) was consistently missed by the next-generation sequencing platform used in this study).

This limitation also applies to large deletions/duplications and deep intronic mutations. Perhaps a good demonstration of the conservative nature of our estimate can be gleaned from local experience at the national newborn screening program at KFSHRC, where 1 out of 1,043 newborns is confirmed positive for 1 of the 17 tested conditions (phenylketonuria, maple syrup urine disease, argininosuccinase deficiency, citrullinemia, HMG-CoA lyase deficiency, isovaleric acidemia, methylmalonic acidemia, propionic acidemia, β-ketothiolase deficiency, methylcrotonyl-CoA carboxylase deficiency, glutaric acidemia type I, medium-chain acyl-CoA dehydrogenase deficiency 13, very-long-chain acyl-CoA dehydrogenase deficiency, congenital hypothyroidism, congenital adrenal hyperplasia, galactosemia, and biotinidase deficiency) (Alfadhel M and Al-Odaib A, personal communication) compared with the 1 out of 3,951 predicted by our conservative q² estimate, or even compared with the less conservative 1 out of 2,608 qF estimate.

Even with this minimal estimate, the figure obtained for disease burden is alarmingly high. With the probability of being a carrier for any of the founder pathogenic mutations at 0.214, we can estimate that the probability of a first-cousin union conceiving a child homozygous for the corresponding mutation is ~0.7% (6.7 per 1,000), which in itself is an underestimate given the large degree of inbreeding; i.e., F is higher than the expected 0.0625 for first-cousin unions. This raises several issues that are of public health concern. For instance, Saudi Arabia has implemented two public health initiatives for primary and secondary prevention of genetic diseases. The first is a premarital screening program that focuses on hemoglobinopathies; the other is a newborn screening program for 17 inborn errors of metabolism.24,25 Although these diseases are relatively common, the disease-burden estimate we calculate shows that other disease categories are much more common, such as autosomal recessive intellectual disability. With the increasing interest in early diagnosis of pediatric-onset diseases as a justification for inclusion in screening programs even when effective therapy is lacking,26 a neonatal screening program based on these established pathogenic mutations is an appealing choice. Such genotyping-based screening will not only be less expensive but also circumvent many other controversies surrounding neonatal genomic sequencing, such as variants of unknown significance and incidental findings.27 Similarly, the premarital screening program, its controversial mandatory statute notwithstanding, that is currently based on hemoglobin electrophoresis can benefit greatly from switching to a genotyping-based platform. Again, the established pathogenic nature of the mutations we report will greatly facilitate genetic counseling surrounding this test. 
For comparison, the probability among Ashkenazi Jews to be a carrier for one of the “Ashkenazi Jewish mutations” is 0.3, and such a high carrier frequency was the impetus for many preventive screening programs.28,29
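The first-cousin figure quoted above (6.7 per 1,000) can be reproduced with a simple product of probabilities. The decomposition below is an assumption made for illustration and is not spelled out in the text: one partner is a carrier with probability 0.214, a first cousin carries the same allele by descent with probability 1/8, and a child of two carriers is homozygous with probability 1/4. The constant names are hypothetical.

```python
P_CARRIER = 0.214              # probability of carrying at least one founder mutation (from the text)
P_COUSIN_SHARES_ALLELE = 1 / 8  # chance a first cousin inherited the same allele by descent
P_CHILD_HOMOZYGOUS = 1 / 4      # Mendelian risk when both parents are carriers

risk = P_CARRIER * P_COUSIN_SHARES_ALLELE * P_CHILD_HOMOZYGOUS
print(f"risk per first-cousin conception: {risk:.4f} ({risk * 1000:.1f} per 1,000)")
```

This ignores the chance that the cousin carries the same mutation independently of shared ancestry, as well as background inbreeding beyond the first-cousin relationship, so under these assumptions it is a lower bound, consistent with the text describing the figure as an underestimate.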

Our work highlights an interesting benefit of performing large-scale clinical genomics, in which each tested patient serves as a source of information on genetic variation and carrier frequency in the general population. We hope the data we present from the largest Middle Eastern country can inform the design of next-generation prevention strategies for genetic diseases locally and beyond.

Disclosure

The authors declare no conflict of interest.