Introduction

Characterising the patterns of genetic variation within and among human populations is crucial to understanding human evolutionary history and the genetic basis of disorders1. Many global genome-wide genotyping and whole-genome sequencing studies (such as the Human Genome Diversity Project1,2, the 1000 Genomes Project (1KGP)3,4 and the UK10K project5) have been undertaken to catalogue genetic variation. Coding exonic regions, though estimated to encompass only approximately 1–2% of the genome, harbour the most functional variation and contain almost 85% of the known disease-causing pathogenic variants6,7; therefore, several global whole-exome sequencing studies have also been undertaken8,9,10. Such large-scale global projects have revealed that human populations harbour a large number of rare variants that are seldom shared between diverged populations3,9,10,11,12,13,14,15,16,17. Mendelian and rare genetic disorders are often associated with rare coding variants. Likewise, common markers associated with complex disorders can also vary in frequency across populations18. Considering that population-specific differences in allele frequencies are of clinical importance, it is fundamental to catalogue them in diverse ethnic populations19.

The Arabian Peninsula holds a strategic place in the early human migration routes out of Africa20,21,22. The Peninsula was instrumental in shaping the genetic map of current global populations because the first Eurasian populations were established here23. The ancestry of indigenous Arabs can largely be traced back to ancient lineages of the Arabian Peninsula23,24. The Arab population is heterogeneous but well-structured3,24,25,26. For example, the Kuwaiti population comprises three genetic subgroups, namely KWP (largely of West Asian ancestry representing Persians with European admixture), KWS (city-dwelling Saudi Arabian tribe ancestry) and KWB (tent-dwelling nomadic Bedouins characterised by the presence of 17% African ancestry)24. Further, the Qatari population also comprises similar subgroups with the third group displaying a much higher African ancestry25. The Greater Middle Eastern Variome study26 detected several ancient founder populations and continental & sub-regional admixture in the extended region of Greater Middle East (comprising the Gulf region, North Africa and Central Asia); the study further stated that the ancestral Arab population from Arabian Peninsula could be observed in nearly all of the GME regions possibly as a result of the Arab conquests in the seventh century.

Consanguinity in the Arab region has made the population vulnerable to a plague of recessive genetic disorders. An increased burden of runs of homozygosity has been observed in populations from Kuwait24 and the extended region of Greater Middle East26. An overwhelming proportion (63%) of the disorders documented in the Catalogue for Transmission Genetics in Arabs (CTGA)27 follows a recessive mode of inheritance28. Studying consanguineous populations has led to the identification of causal mutations for Mendelian disorders29,30 and rare familial (monogenic) forms of common complex disorders31. These studies have also paved the way to evaluate the role of consanguinity and environmental factors in complex lifestyle disorders, such as obesity and type 2 diabetes, cases of which are rapidly increasing in the Arabian Peninsula32,33. Thus, studying consanguineous populations is important to human medical genetics research26,34,35.

Despite consanguinity, diversity and admixture in its populations, the region is poorly represented in global genomic surveys. Even larger databases, such as the Exome Aggregation Consortium (ExAC)8 and the Genome Aggregation Database (gnomAD)8, are deficient in representing Middle Eastern populations. Although the Greater Middle East (GME) Variome project26 provides whole exome data of 1,111 individuals from six GME regions, the region of Arabian Peninsula is represented by only 214 samples and the sub-region of Kuwait by only 45 samples.

In our previous studies, we sequenced and analysed thirteen exomes from the KWS group36 and representative whole genomes from each of the three subgroups of the Kuwaiti population36,37,38. In this study, we extended that work by sequencing whole exomes of 291 native Kuwaiti Arab individuals representing the three population subgroups. We further analysed the data to infer the extent of exome variability in the Kuwaiti population and to delineate its impact on population substructures of Kuwait and medical genetics of the region.

Results

Exome variants discovered in the Kuwaiti population

The 291 exomes were sequenced to a median coverage of 45X, with an average of 80% of the target base pairs having at least 15X coverage. A ‘missingness’ rate (the percentage of samples in which genotype information was missing) of 1.8% was obtained, corresponding to a genotyping call rate of 98.2%. In total, 173,849 variants (including 2,626 non-autosomal variants) were identified (Table 1 and Supplementary Table S1), 12.16% of which were novel. The call set included 170,508 single-nucleotide variants (SNVs) and 3,341 insertions and deletions (indels); 11.85% of the SNVs and 28% of the indels were novel. The observed aggregate transition/transversion (Ti:Tv) ratio of 3.22 was within the acceptable range for whole-exome sequencing variants39,40. The ratio of heterozygous to homozygous variant genotypes was 0.63, indicating that the population skews towards homozygosity, consistent with its inbreeding nature.
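The two summary ratios quoted above can be illustrated with a minimal sketch. It assumes that SNVs are available as (reference, alternate) base pairs and that genotypes are coded as alternate-allele counts; the function names and the toy data are hypothetical, and the exact definition of the heterozygous-to-homozygous ratio used in the study may differ.

```python
# Minimal sketch: Ti:Tv ratio and het:hom ratio from toy data.
# Function names and data are hypothetical, not the study's pipeline.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Return True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def ti_tv_ratio(snvs) -> float:
    """Ti:Tv ratio over a collection of (ref, alt) single-base pairs."""
    snvs = list(snvs)
    ti = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")


def het_hom_ratio(genotypes) -> float:
    """Ratio of heterozygous (1) to homozygous-alternate (2) calls.

    Genotypes are alternate-allele counts; None marks a missing call.
    """
    genotypes = list(genotypes)
    het = sum(1 for g in genotypes if g == 1)
    hom = sum(1 for g in genotypes if g == 2)
    return het / hom if hom else float("inf")


# Hypothetical toy data: three transitions and one transversion.
toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
toy_genotypes = [0, 1, 1, 2, None, 2, 2]

print(ti_tv_ratio(toy_snvs))
print(het_hom_ratio(toy_genotypes))
```

A ratio below 1 in the second function means that homozygous-alternate calls outnumber heterozygous calls, which is the direction reported for the Kuwaiti exomes.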

Table 1 Statistics of variants observed in Kuwaiti exomes.

Validation of SNP calls

The validity of the SNP calls was confirmed using an in-house genome-wide genotype data set on 269 (of the 291 sequenced) samples derived using the Illumina HumanOmniExpress BeadChip (Illumina Inc, USA). On average, 13,175 variants could be compared per sample. The concordance rate of the SNP calls between the exome sequencing data and the genome-wide genotype data was >99.7% (see Supplementary Table S2). The concordance rate observed in our study is on par with those reported in the literature: Kenna et al.40 reported a genotype concordance rate of 98.9% when comparing genotypes inferred using Illumina high-throughput sequencing platforms with genotypes ascertained using Illumina BeadChips. Disagreements in the SNP calls were seen more often with heterozygous SNPs than with homozygous SNPs. As is the practice41, we chose not to remove the inconsistent calls.
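A per-sample concordance rate of the kind described above can be sketched as follows. This is an illustration only: it assumes genotypes from both platforms are coded as alternate-allele counts keyed by variant ID, and the function name and toy calls are hypothetical.

```python
# Minimal sketch: genotype concordance between two platforms for one sample.
# Names and data are hypothetical.

def concordance(seq_calls: dict, chip_calls: dict):
    """Fraction of identical genotypes among sites called on both platforms.

    Each dict maps a variant ID to an alternate-allele count (0, 1, 2)
    or None for a missing call. Returns None if no site is comparable.
    """
    shared = [
        site for site in seq_calls
        if site in chip_calls
        and seq_calls[site] is not None
        and chip_calls[site] is not None
    ]
    if not shared:
        return None
    matches = sum(1 for site in shared if seq_calls[site] == chip_calls[site])
    return matches / len(shared)


# Hypothetical toy calls: four comparable sites, one discordant (rs4).
toy_seq = {"rs1": 0, "rs2": 1, "rs3": 2, "rs4": 1, "rs5": None}
toy_chip = {"rs1": 0, "rs2": 1, "rs3": 2, "rs4": 2, "rs6": 1}

print(concordance(toy_seq, toy_chip))
```

Sites missing on either platform are excluded from the denominator, so the rate reflects only the variants that could actually be compared.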

Principal component analysis of variants in the merged set of exome variants from Kuwait and global populations

The scatter plot of the first two principal components of the merged data set of exome variants from Kuwait, 1KGP global populations, and Qatar is presented in Fig. 1. The plot confirmed that the Kuwaiti Arab population is heterogeneous and comprises three substructures24, and indicated its regional affinity.

Figure 1

Scatter plot of the first two principal components of the merged data set of exome variants from the three Kuwaiti substructures and from regional (Qatar) and 1KGP global populations.

Classifications of observed SNVs

Of the identified SNVs, 50.7% were ‘rare’, 12.5% were ‘low-frequency’ and 15.1% were ‘common’. Up to 21.7% of the SNVs were ‘personal’ (found in only one Kuwaiti exome and not seen in the 1KGP Phase 3 and GME data sets). The alternate allele was the major allele in 4.2% of the identified SNVs; 0.22% of the SNVs were fixed for the alternate allele, having a non-reference frequency of 100%. Among the identified SNVs, 53.5% were missense, 41.61% were synonymous and approximately 1% were loss-of-function (LoF). A total of 55,644 of the identified SNVs were ‘population-specific’, 60% of which were missense. Of these 55,644 population-specific variants, 9,429 were polymorphic (seen in ≥2 exomes from the study cohort and not seen in 1KGP), most of which were ‘rare’ (8,408 out of 9,429); of the remaining 46,215 variants, 37,044 were ‘personal’ and 9,171 were seen in one exome from the study cohort and also in the GME data set. On average, 14,557 SNVs and 210 indels were seen in every Kuwaiti individual. The average number of ‘personal’ variants per individual was 129. Population-specific missense variants per individual outnumbered synonymous changes (184 versus 126). The average number of LoF variants per Kuwaiti individual was 73, of which 4.5 were specific to the Kuwaiti population.
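The frequency classes used above can be sketched as a small classifier. The thresholds are assumptions chosen for illustration (rare below 1%, low-frequency from 1% to below 5%, common at 5% or above); the boundaries actually applied in the study are defined in its Methods and may differ. The function names are hypothetical.

```python
# Minimal sketch: assigning an SNV to a frequency class.
# Thresholds are ASSUMED for illustration, not taken from the study's Methods.
RARE_MAX = 0.01       # assumed upper bound (exclusive) for 'rare'
LOW_FREQ_MAX = 0.05   # assumed upper bound (exclusive) for 'low-frequency'


def minor_allele_frequency(alt_count: int, n_individuals: int) -> float:
    """MAF at a biallelic autosomal site from the alternate-allele count."""
    af = alt_count / (2 * n_individuals)
    return min(af, 1.0 - af)


def classify_snv(alt_count: int, n_individuals: int,
                 n_carriers: int, seen_in_reference: bool) -> str:
    """Return 'personal', 'rare', 'low-frequency' or 'common'.

    'personal' follows the text: seen in exactly one exome of the cohort
    and absent from the external reference data sets.
    """
    if n_carriers == 1 and not seen_in_reference:
        return "personal"
    maf = minor_allele_frequency(alt_count, n_individuals)
    if maf < RARE_MAX:
        return "rare"
    if maf < LOW_FREQ_MAX:
        return "low-frequency"
    return "common"


# Hypothetical examples for a cohort of 291 individuals.
print(classify_snv(1, 291, 1, False))
print(classify_snv(12, 291, 12, True))
print(classify_snv(100, 291, 90, True))
```

With 291 individuals, a singleton heterozygote has an allele frequency of 1/582 (about 0.17%), so every ‘personal’ variant would otherwise fall in the ‘rare’ class; checking the carrier count first keeps the two classes separate.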

Homozygous LoF variants and “inactivated genes”

We observed 1,645 putative LoF SNVs (Table 1) in Kuwaiti exomes from 291 healthy individuals of Arab ethnicity. Of these 1,645 LoF SNVs, 186 were homozygous and were harboured in 179 genes (see Supplementary Table S3). Of the 186 homozygous LoF SNVs, 27 had MAF <1% and another 9 had MAF ≥1% and <2%. Sulem et al.42, by performing whole-genome sequencing of 2,636 Icelanders and chip-imputing a further 101,584 Icelanders, identified a set of rare (MAF <2.0%) homozygous LoF variants in 1,171 genes. Similarly, the Exome Aggregation Consortium (ExAC) data set of 60,706 sequenced individuals identified 2,068 genes that were inactivated8. The GME26 consortium, by analyzing 354 exomes of healthy individuals, identified 301 genes with rare homozygous LoF variants, of which 50 genes overlapped the Icelandic gene list and 94 overlapped the ExAC list of inactivated genes. Upon comparing the homozygous LoF variants from the Kuwaiti exomes with the above-mentioned data sets of inactivated genes, we found 23 genes (PNPLA1, ULBP3, OR8K3, RAD52, APOBEC1, PDIA2, WDR87, SIGLEC1, COL9A2, OTOF, SULT1C3, COQ2, MROH2B, FAM81B, UNC93A, DNAH11, PXDNL, OR4D10, SLC22A24, RNASE9, C17orf77, CARD14 and SLC5A4) in common with the Icelandic data set, three genes (EML1, WWTR1 and PPFIA1) in common with the ExAC data set and six genes (COL9A2, SLC5A9, FAM81B, GGT6, EFCAB13 and SLC5A4) in common with the GME data set. Upon considering only those LoFs with <2% MAF in Kuwaiti exomes, the number of genes in common with the Icelandic data set was reduced to eight: PNPLA1 (MAF_KWT of the LoF: 1.03%), ULBP3 (0.52%), OR8K3 (1.5%), RAD52 (1.3%), APOBEC1 (0.34%), PDIA2 (1.3%), WDR87 (0.34%) and SIGLEC1 (0.34%). Upon considering only the rare (MAF <1%) homozygous LoF variants in Kuwaiti exomes, only one gene (EML1; 0.69%) was in common with the ExAC data set, and none with the GME data set.
The GME work reported more genes in common with the Icelandic and ExAC data sets because it considered indels along with SNVs to derive the list of LoFs, whereas we considered only SNVs.

Comparison with Greater Middle East (GME) Variome data

Results of comparing the variants observed in our study with those reported in GME populations26 are presented in Table 2. Up to 64% of the SNVs identified in our study were also seen in GME; the remaining 36% of variants, not seen in GME, are expected to enlarge the variome of the GME region. GME provided supporting evidence to designate up to 25% of Kuwaiti population-specific singleton mutations (seen in only one exome from the study cohort) as genuine SNVs. As many as 58% of the population-specific polymorphic variants observed in Kuwaiti exomes were also seen in the GME variome.

Table 2 Comparing the Kuwaiti Arab whole-exome variants with Greater Middle East (GME) whole-exome Variome.

Extent of variability in Kuwaiti exomes

In each of the categories of ‘all’, ‘known’, ‘novel’ and ‘Kuwaiti population-specific’ variants, the observed number of variants increased linearly with increasing number of sequenced exomes and did not reach a plateau (Fig. 2). A similar trend was observed when the three subgroups were examined individually (Supplementary Fig. S1). However, when the population-specific variants were divided into ‘personal’ and ‘population-specific polymorphic’ variants, population-specific variants shared by more than one individual reached a plateau.

Figure 2

Distribution of total number of single-nucleotide variants (SNVs) upon step-wise addition of exomes. The red line represents the number of all variants found as the number of sequenced exomes increased. The green line represents the number of known variants among all variants found. The orange line represents the number of novel variants among all variants found. The blue line represents the number of population-specific variants among all variants found. Population-specific ‘personal’ variants observed in only one Kuwaiti exome and not seen in either 1KGP or GME are represented by the dotted line; population-specific ‘polymorphic’ variants observed in more than one Kuwaiti exome are represented by the dashed line.
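The step-wise addition of exomes behind such accumulation curves can be sketched as follows. This is a hedged illustration rather than the procedure used in the study: it assumes each exome is represented as a set of variant IDs added in a fixed order (the study may have used a different ordering or averaged over orderings), and the function names and toy data are hypothetical.

```python
# Minimal sketch: cumulative variant counts as exomes are added one by one.
# Names and data are hypothetical.
from collections import Counter


def accumulation_curve(exomes):
    """Number of distinct variants seen after each added exome."""
    seen = set()
    curve = []
    for variants in exomes:
        seen |= set(variants)
        curve.append(len(seen))
    return curve


def shared_variant_curve(exomes):
    """Number of variants seen in two or more exomes after each addition."""
    carriers = Counter()
    curve = []
    for variants in exomes:
        carriers.update(set(variants))
        curve.append(sum(1 for n in carriers.values() if n >= 2))
    return curve


# Hypothetical toy exomes.
toy_exomes = [{"v1", "v2"}, {"v2", "v3"}, {"v1", "v4"}]

print(accumulation_curve(toy_exomes))
print(shared_variant_curve(toy_exomes))
```

A curve that keeps rising with each exome corresponds to the non-saturating behaviour described for the ‘all’, ‘known’, ‘novel’ and population-specific categories, whereas a flattening curve corresponds to a plateau.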

Variants significantly differentiating the three population subgroups of Kuwait

Results of pFst likelihood ratio tests for allele frequency differences between the three subgroups based on 142,626 autosomal variants are presented in Supplementary Fig. S2. Three variants significantly distinguished KWP from KWB: rs2289043_A > G (UNC5C) (pFst = 3.28 × 10⁻⁶), mostly prevalent in admixed Americans (75%) and Europeans (71%); rs3739310_T > G (KIAA1456) (pFst = 4.40 × 10⁻⁵), frequently found in East Asians (78%) and Europeans (77%); and rs764374986_G > A (AKAP12) (pFst = 5.21 × 10⁻⁵), a rare variant occurring mostly in Africans (0.01%) in the gnomAD data set (the variant is absent from the 1KGP data set). Three variants significantly distinguished KWS from KWB: rs1150360_A > G (FAM76B) (pFst = 3.89 × 10⁻⁵), frequent in Africans (93%); rs138408584_G > C (PHRF1) (pFst = 6.97 × 10⁻⁵), rare in Europeans (~1%); and rs1043730_G > T (TRAF3IP2) (pFst = 9.57 × 10⁻⁵), present at 99% frequency in Africans and East Asians. Two variants significantly distinguished KWP from KWS: rs35840170_C > T (FBN3) (pFst = 7.63 × 10⁻⁵), present at ~20% frequency in East Asians and admixed Americans; and rs7956133_G > T (FAM216A) (pFst = 9.51 × 10⁻⁵), present at a frequency of 15% in Africans.
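The idea of a likelihood ratio test for an allele frequency difference between two subgroups can be sketched with a simple binomial model. This is not the pFst implementation used in the study; it is a simplified stand-in that assumes hard allele counts per group and a chi-square reference distribution with one degree of freedom. The function names are hypothetical.

```python
# Minimal sketch: binomial likelihood ratio test for a difference in allele
# frequency between two groups. A simplified stand-in, NOT the pFst tool.
import math


def _binom_loglik(k: int, n: int, p: float) -> float:
    """Binomial log-likelihood without the constant combinatorial term."""
    loglik = 0.0
    if k > 0:
        loglik += k * math.log(p)
    if n - k > 0:
        loglik += (n - k) * math.log(1.0 - p)
    return loglik


def allele_freq_lrt(k1: int, n1: int, k2: int, n2: int):
    """Return (test statistic, p-value).

    k1, k2: alternate-allele counts in each group.
    n1, n2: number of chromosomes sampled in each group.
    Null hypothesis: both groups share one allele frequency.
    """
    p1, p2 = k1 / n1, k2 / n2
    p0 = (k1 + k2) / (n1 + n2)
    ll_alt = _binom_loglik(k1, n1, p1) + _binom_loglik(k2, n2, p2)
    ll_null = _binom_loglik(k1, n1, p0) + _binom_loglik(k2, n2, p0)
    stat = max(0.0, 2.0 * (ll_alt - ll_null))
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: 5 of 100 chromosomes versus 50 of 100 chromosomes.
print(allele_freq_lrt(5, 100, 50, 100))
```

Because many variants are tested at once, any p-value from such a test would need a multiple-testing threshold before a variant is called significantly differentiating.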

SAFD variants with significant allele frequency differences between the Kuwaiti and 1KGP global populations; and analysis of their population-wide occurrence

Examination of the SNVs shared between Kuwaiti exomes and 1KGP phase 3 exome data for significant allele frequency differences identified 6,186 variants with significant allele frequency differences (SAFD variants). Functional characterization of these variants is presented in Supplementary Table S4. Of these 6,186 SAFD variants, 2,960 were missense, 2,913 were synonymous, 20 were stop-gain and 26 were LoF. LoF variants made up only 0.4% of the SAFD SNVs, compared with 1.7% of ‘all’ SNVs. Population-wide occurrence of the identified SAFD variants was investigated to determine which global populations paired with each Kuwaiti population subgroup in terms of maximum allele frequency (Supplementary Fig. S3 and Fig. 3). (a) Analysis of the 5,140 SAFD variants, derived using gnomAD populations: The numbers of variants showing maximum allele frequency in the KWS, KWP and KWB subgroups were 2,885, 543 and 1,712, respectively. In KWS, 38% of the 2,885 variants showed maximum allele frequency in Ashkenazi Jews and 19% in Africans. In KWP, 35% had maximum allele frequency in Ashkenazi Jews and 33% in South Asians. In KWB, 61% of the variants showed maximum allele frequency in Africans. (b) Analysis of the 6,186 SAFD variants, derived using 1KGP populations: The coupling observed with South Asians and Africans was confirmed. KWB paired with Africans in 21% of 1,056 variants; KWP and KWS paired with South Asians in 31% of 1,355 and 17% of 3,775 variants, respectively. Furthermore, coupling with Europeans, which was not seen in the analysis using gnomAD populations, was observed in 53% of KWS variants, 52% of KWP variants and 41% of KWB variants.
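The pairing analysis described above can be illustrated with a short sketch: for each variant, find the Kuwaiti subgroup and the global population with the highest allele frequency, then tally the pairs. The data layout, function name and frequencies below are hypothetical, and ties are resolved by list order, which is an arbitrary choice for this illustration.

```python
# Minimal sketch: tallying which global population shares the maximum allele
# frequency with which Kuwaiti subgroup. Names and frequencies are hypothetical.
from collections import Counter


def max_af_pairing(variants, kuwaiti_groups, global_pops) -> Counter:
    """Count (Kuwaiti subgroup, global population) pairs of maximum frequency.

    variants: list of dicts mapping a population label to an allele frequency.
    Ties go to the label listed first.
    """
    pairs = Counter()
    for freqs in variants:
        top_kuwaiti = max(kuwaiti_groups, key=lambda g: freqs[g])
        top_global = max(global_pops, key=lambda p: freqs[p])
        pairs[(top_kuwaiti, top_global)] += 1
    return pairs


KUWAITI = ["KWS", "KWP", "KWB"]
GLOBAL_1KGP = ["AFR", "AMR", "EAS", "EUR", "SAS"]

# Hypothetical allele frequencies for three variants.
toy_variants = [
    {"KWS": 0.10, "KWP": 0.05, "KWB": 0.30,
     "AFR": 0.40, "AMR": 0.10, "EAS": 0.02, "EUR": 0.08, "SAS": 0.12},
    {"KWS": 0.25, "KWP": 0.20, "KWB": 0.10,
     "AFR": 0.03, "AMR": 0.15, "EAS": 0.05, "EUR": 0.30, "SAS": 0.22},
    {"KWS": 0.12, "KWP": 0.35, "KWB": 0.15,
     "AFR": 0.02, "AMR": 0.10, "EAS": 0.08, "EUR": 0.20, "SAS": 0.33},
]

for pair, count in max_af_pairing(toy_variants, KUWAITI, GLOBAL_1KGP).items():
    print(pair, count)
```

Dividing each pair count by the number of variants assigned to the corresponding Kuwaiti subgroup would give percentages of the kind reported above.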

Figure 3

The occurrence pattern of pairing Kuwaiti populations with (a) the gnomAD or (b) the 1KGP global populations as populations with maximum allele frequency. X-axis: percentage of pairing occurrence of Kuwaiti populations and gnomAD (a) or 1KGP (b) global populations as populations with maximum allele frequency; Y-axis: Kuwaiti populations (KWT, all Kuwaitis; KWB, Bedouins; KWP, Persians; KWS, Saudi Arabian tribe). gnomAD global populations: AFR, Africans/African Americans; AMR, admixed Americans; ASJ, Ashkenazi Jewish; EAS, East Asians; FIN, Finnish; NFE, Non-Finnish Europeans; OTH, Other population not assigned; SAS, South Asians. 1KGP global populations: AFR, African; AMR, Admixed American; EAS, East Asian; EUR, European; SAS, South Asian. (c) Considers only the variants with minor allele frequency (MAF) of >5% (n = 3887) and pairing with 1KGP global populations.

Validation of the genetic relatedness implied by analysis for population-wide occurrence of SAFD variants

To further explore the observed coupling in maximum allele frequency between Kuwaitis and other populations (including the Ashkenazi Jews), the Kuwaiti exome data were merged with the data sets from Ashkenazi Jews43, Qatar44 and 1KGP phase 3. Upon applying quality control steps and LD-pruning the combined data set of coding-region variants, a total of 896 variants from 3,336 individuals was obtained. Genetic differentiation of the Kuwaiti subpopulation groups from regional and continental populations was assessed by calculating mean pairwise FST (Supplementary Fig. S4, Supplementary Table S5). The lowest degree of differentiation was observed between Kuwaiti subpopulation groups and Qataris (KWB FST = 0.0005, KWP FST = 0.0027, KWS FST = 0.0023), followed by Ashkenazi Jews (KWB FST = 0.0103, KWP FST = 0.0071, KWS FST = 0.0104) and Europeans (KWB FST = 0.0143, KWP FST = 0.0093, KWS FST = 0.0155). Scatter plots resulting from principal component analysis (PCA) of the merged data set are presented in Figs 4 and 5. Consistent with the FST analysis, the Kuwaiti populations were seen dispersed over the Qataris, Ashkenazi Jews and Europeans (Fig. 4). A clear dispersal of these populations was seen in the three-dimensional PCA plot (Fig. 5 and the interactive three-dimensional plot available at http://dgr.dasmaninstitute.org/exome_pca/).
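A pairwise FST between two populations can be sketched from allele frequencies as follows. The estimator used in the study is not stated in this section; the sketch below assumes Hudson's estimator combined across variants as a ratio of averages, which is one common choice and may differ from the method actually applied. The function name and frequencies are hypothetical.

```python
# Minimal sketch: pairwise FST from per-variant allele frequencies.
# ASSUMPTION: Hudson's estimator (ratio of averages); the study's estimator
# is not given here and may differ.

def hudson_fst(freqs1, freqs2, n1: int, n2: int) -> float:
    """Hudson's FST across variants for two populations.

    freqs1, freqs2: allele frequencies of the same variants in each population.
    n1, n2: number of chromosomes sampled in each population (each > 1).
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs1, freqs2):
        numerator += (
            (p1 - p2) ** 2
            - p1 * (1.0 - p1) / (n1 - 1)
            - p2 * (1.0 - p2) / (n2 - 1)
        )
        denominator += p1 * (1.0 - p2) + p2 * (1.0 - p1)
    return numerator / denominator if denominator else float("nan")


# Hypothetical frequencies for three variants in two populations.
pop_a = [0.10, 0.50, 0.80]
pop_b = [0.12, 0.45, 0.85]

print(hudson_fst(pop_a, pop_b, n1=200, n2=200))
```

Values near zero indicate little differentiation, as reported between the Kuwaiti subgroups and Qataris; small negative estimates can occur when two samples have nearly identical frequencies.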

Figure 4

Two-dimensional principal component analysis (PCA) plots showing the dispersal of Kuwaitis over the Qataris, Ashkenazi Jewish and Europeans.

Figure 5

Three-dimensional principal component analysis (PCA) plots showing the dispersal of Kuwaitis over the Qataris, Ashkenazi Jewish and Europeans. The interactive three-dimensional plot is available at http://dgr.dasmaninstitute.org/exome_pca/.

‘Rare and deleterious’ variants and their clinical significance

The analysis pipeline that examined the Kuwaiti exomes for deleterious variants that are rare in both the 1KGP and ExAC data sets yielded a list of 46 variants (associated with 41 unique disorders, of which 20 were reported in the CAGS database), comprising 35 pathogenic variants (for rare disorders), 1 drug response variant and 10 risk factor variants (1 corresponding to a rare but multifactorial disorder and the remaining 9 to complex and common disorders) (Table 3). Of these 46 variants, 43 remained rare in Kuwaiti exomes, and three reached an MAF characteristic of low-frequency variants (rs1800553/ABCA4/Risk-Factor:2.41%, rs61742245/VKORC1/Drug-response:1.04%, rs11909217/LIPI/Risk-Factor:1.72%). Pathogenic variants: The 35 pathogenic variants mapped to 32 genes and to 32 unique single-gene disorders; 28 of the 35 variants follow an autosomal recessive (AR) mode of inheritance and the remaining follow an autosomal dominant (AD) mode. For 16 of these 35 pathogenic variants, the disorders have been observed in the Arab population (as annotated in the CAGS database27). Drug response variant: The VKORC1 variant was associated with warfarin resistance in AD mode. Risk factor variants: The 10 risk factor variants mapped to 9 genes and to susceptibility to 8 unique disorders. The inheritance patterns were mostly autosomal dominant (in three instances, inheritance can be AR as well as AD). For 4 of the 10 risk factor variants, the disorders have been observed in the Arab population as per the CAGS database.

Table 3 46 rare and deleterious variants (pathogenic and risk factors) seen in Kuwaiti Exomes.

Pathogenic variants and high MAF in Kuwaiti exomes

Five of the identified “rare & deleterious” variants that were annotated “pathogenic” for clinical significance in ClinVar were seen to possess risk allele frequencies of ≥1% in Kuwaiti exomes as opposed to <1% in 1KGP populations. ClinVar defines “pathogenic” variants as those that are interpreted for Mendelian disorders; or as those that have low penetrance. It is also possible that a variant in ClinVar can have an erroneous or conflicting classification. Cassa et al.45 examined 81,432 “pathogenic” variants from HGMD7 in a data set of whole-genome sequences of 1,092 individuals from the 1KGP project and found that 4.62% of the tested variants had an MAF of ≥1% and 3.5% had an MAF of ≥5%; they concluded that many of these variants are probably erroneous findings or have lower penetrance than previously expected. It is also possible that such high-frequency pathogenic variants are of the type “increased susceptibility” rather than “causal”; that the disorders with such high-MAF “pathogenic” variants are not really “rare” but are either “common” or “more prevalent in the study population”; or that the frequent variants have evidence to cause a disease when inherited in the compound heterozygous state but insufficient evidence to lead to a disease in homozygotes. The five “pathogenic” variants that were seen in Kuwaiti exomes with an MAF of ≥1% are as follows:

(a) Four variants retained as pathogenic for rare disorders:
(i) rs79204362 (MAF_KWT:1.03% and MAF_1KGP: 0.42%), associated with early-onset glaucoma: ClinVar annotated this variant as pathogenic based on evidence from literature studies and as of uncertain significance based on clinical testing. The disorder is considered rare (1 in 10,000) in European-based populations and more frequent in the Middle East, at 51–100 per 100,000 (i.e. roughly 5–10 in 10,000); CTGA reported a high incidence rate of 1 in 2,500 in the Saudi Arabian population. Thus, the marginally higher MAF of 1.03% was acceptable.
(ii) rs61732874_C > A (MAF_KWT:1.55% and MAF_1KGP: 0.18%), associated with Familial Mediterranean fever (FMF): ClinVar annotated this variant as pathogenic/likely pathogenic based on both literature evidence and clinical testing. FMF is a rare disorder in the European population; however, it is no longer a rare disorder in certain populations such as Japan (see Table 7). CAGS also listed the incidence as 51–100 per 100,000 in the Arab population; CAGS further mentioned that estimates of the incidence of FMF in specific eastern Mediterranean populations ranged from 1 in 2000 to 1 in 100, depending on the population studied. Thus, the observed MAF of 1.55% was acceptable.
(iii) rs61757294 (MAF_KWT:15.19% and MAF_1KGP:5.3%), associated with Corticosterone Methyloxidase Type II Deficiency, a rare genetic disorder: ClinVar annotated this variant as pathogenic based on evidence from literature publication and as benign based on clinical testing records. The variant was found in patients of Iranian Jewish ancestry. It has a DR mode of inheritance, in which both this variant and another one, rs289316, need to be homozygous; thus, the observed high MAF was acceptable. In fact, the variant was a common variant in 1KGP as well.
(iv) rs12021720 (MAF_KWT:13.47% and MAF_1KGP:10.9%), associated with Maple syrup urine disease, intermediate, type II (a rare genetic disorder): ClinVar annotated this variant as pathogenic based on literature evidence and as benign with clinical testing as the source of annotation. Though the incidence rate world-wide is 1 in 185,000, CAGS reported an incidence rate of 2 in 10,000 in Bahrain; even so, incidence alone does not justify the high MAF. This mutation was seen in one of the three patients from the study46, and the patient was a compound heterozygote for a C to G transversion at nucleotide 309 in exon 4 [rs121965001] and a G to A transition at nucleotide 1165 in exon 9 [rs12021720], causing an Ile-to-Met substitution at amino acid 37 and a Gly-to-Ser substitution at amino acid 323, respectively. Thus, a high MAF at one of the two variants of the compound heterozygote was acceptable for a pathogenic variant; in fact, the variant was a common variant in 1KGP as well.
(b) One variant retained as pathogenic for a rare disorder but with a suggestion that it may be “likely benign”: rs61751507 (MAF_KWT:7.47% and MAF_1KGP:2.7%), associated with Carboxypeptidase N deficiency, which is possibly a complex disorder. ClinVar annotated this variant as pathogenic with evidence from literature publication and as benign based on clinical testing information. The study47 found this pathogenic variant in just one patient, and the evidence may therefore be considered insufficient. Hence, this variant can be considered “likely benign”.

Missense variants rare within global populations but common within Kuwaiti population

A total of 170 SNVs were identified as rare in global populations but common in Kuwaiti exomes; 85 of these were missense variants (Supplementary Table S6). The 85 variants fell into two categories: (a) A set of 20 variants harboured in genes annotated for disorders in OMIM: These 20 variants were not of any pathogenic value, as ClinVar annotated them as either ‘benign’ or ‘conflicting interpretation’. Not surprisingly, the REVEL scores in these instances were low at ≤0.3 (except in 2 instances: the GLDC variant at around 0.8 and the DPYD variant at 0.4). (b) A set of 65 variants harboured in genes NOT annotated for any disorder in OMIM: Association with phenotypes was seen with only one of these 85 variants; the TTC38 variant rs117135869 (REVEL = 0.621; MAF_KWT: 5.0%; MAF_1KGP: 0.58%) has recently been identified as a novel metabolic quantitative trait locus (mQTL) in a cohort from the Middle Eastern population48; this variant was seen in 29 of the Kuwaiti exomes in the heterozygous form.
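The reported frequency of the TTC38 variant can be checked against its carrier count with a short calculation. The sketch assumes that all 291 individuals were genotyped at this site and that there were no homozygous carriers, in line with the statement that the variant was seen in the heterozygous form; the function name is hypothetical.

```python
# Minimal sketch: allele frequency from carrier counts at a biallelic site.
# ASSUMPTIONS: all 291 individuals genotyped; no homozygous-alternate carriers.

def allele_frequency(n_het: int, n_hom_alt: int, n_individuals: int) -> float:
    """Alternate-allele frequency at an autosomal biallelic site."""
    return (n_het + 2 * n_hom_alt) / (2 * n_individuals)


# 29 heterozygous carriers among 291 exomes: 29 / 582 chromosomes.
frequency = allele_frequency(n_het=29, n_hom_alt=0, n_individuals=291)
print(f"{frequency:.2%}")
```

Under these assumptions the frequency is 29/582, which agrees with the reported MAF_KWT of 5.0% after rounding.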

Missense variants mapping to drug-binding domains and of pharmacogenomic relevance

We identified 21 missense SNVs that mapped to a set of 130 drug-binding domains reported in literature49 and were annotated in PharmGKB50 (Table 4). These 21 variants affect the efficacy of drugs used largely for treating common disorders (such as heart failure, hypertension, chemotherapy, neoplasms, diabetes, nephrosclerosis, rheumatoid arthritis, asthma, pulmonary diseases, schizophrenia, tobacco use disorder, heroin dependence, sickle-cell anemia, and HIV). Furthermore, a literature survey revealed that 7 of these 21 pharmacogenomic variants were associated with complex disorders in Arab studies (Table 5).

Table 4 21 missense variants mapping to drug-binding domains and of pharmacogenomic relevance (efficacy, dosage, toxicity).
Table 5 Subset of 7 of the identified pharmacogenomic variants (from Table 4) that were also reported in Arab studies as relating to complex disorders.

SAFD variants and their clinical relevance

For 230 of the 6,186 SAFD variants, the ClinVar database provided annotation relating to clinical significance (Supplementary Fig. S5 and Supplementary Table S7). These 230 variants were from 186 unique genes, for 162 of which inheritance patterns were known; 91 were AR and 63 were AD. Of these 230 variants, 206 were benign or likely benign. The disorders related to the genes harbouring the benign variants were often single-gene disorders and familial, hereditary and congenital. The 24 non-benign variants (Table 6) were from 21 unique genes associated with 20 unique disorders.

Table 6 List of 24 SAFD variants annotated for clinical significance in ClinVar and OMIM.

Seven of these 24 non-benign SAFD variants were annotated in ClinVar as “Pathogenic”; however, either the associated disorder was common/complex or more prevalent in the study population, or the patient carrying the variant was annotated in OMIM as susceptible to the disorder (which is usually a common disorder). Following the practice that “pathogenic” variants are related to Mendelian disorders, we considered the variants associated with common disorders as risk factors.
(i) rs1800435_G > C (MAF_KWT:12.41%; MAF_1KGP:6.4%), associated with “Aminolevulinate dehydratase, alad*1/alad*2 polymorphism susceptibility to lead poisoning ALAD porphyria”. ClinVar annotated this variant as pathogenic based on literature evidence and as likely benign based on clinical testing. It increases the risk for lead poisoning. ALAD porphyria is a very rare genetic metabolic disease; however, according to the CDC report on lead poisoning (“There are approximately half a million U.S. children ages 1–5 with blood lead levels above 5 micrograms per deciliter (µg/dL), the reference level at which CDC recommends public health actions be initiated”), lead poisoning is no longer a rare disorder. Susceptibility is the keyword, and we reannotated this variant as a risk factor.
(ii) rs5030737 (MAF_KWT:7.90%; MAF_1KGP:2.8%), associated with Mannose-binding lectin deficiency, which is a complex trait. ClinVar annotated the variant as pathogenic based on literature reference; since we associate “pathogenic” with Mendelian disorders, we reannotated this variant as a risk factor.
(iii) rs121918530 (MAF_KWT:1.03%; MAF_1KGP:0.04%), associated with coronary artery disease/myocardial infarction, which is a complex multifactorial disorder. ClinVar annotated this variant as pathogenic based on literature evidence and as likely benign based on clinical testing. Since we associate “pathogenic” with Mendelian disorders, we reannotated this variant as a risk factor.
(iv) rs5030739 (MAF_KWT:8.42%; MAF_1KGP:2.32%), associated with “Prostate cancer hereditary 2, susceptibility to” (complex trait). ClinVar annotated this variant as pathogenic based on literature evidence and as benign based on clinical testing. The cited literature suggested an increased risk of prostate cancer; ‘susceptibility to’ was the keyword. In addition, this variant has to appear in compound heterozygosity with the next listed variant, rs4792311. We reannotated this variant as a risk factor.
(v) rs4792311 (MAF_KWT:35.52%; MAF_1KGP:21.5%), associated with “Prostate cancer hereditary 2, susceptibility to” (complex trait). ClinVar annotated this variant as pathogenic based on literature evidence and as benign based on clinical testing. The cited literature suggested an increased risk of prostate cancer; ‘susceptibility to’ was the keyword. In addition, this variant has to appear in compound heterozygosity with the previously listed variant, rs5030739. We reannotated this variant as a risk factor.
(vi) rs1801483 (MAF_KWT:3.78%; MAF_1KGP:0.42%), associated with Diabetes mellitus type 2, non-insulin dependent (a multifactorial complex disorder). ClinVar annotated the variant as pathogenic (based on literature evidence). Since we associate “pathogenic” only with Mendelian disorders, we reannotated the variant as a risk factor.
(vii) rs34719006 (MAF_KWT:2.58%; MAF_1KGP:0.18%), associated with Cholestasis of pregnancy, which is the most common liver disease unique to pregnancy (complex trait). ClinVar annotated this variant as pathogenic based on evidence from a literature study and as having conflicting evidence between likely benign (clinical testing) and uncertain significance (clinical testing). Since we associate “pathogenic” only with Mendelian disorders, we reannotated this variant as a risk factor.

This set of 24 SAFD variants with clinical significance comprised (a) a set of 2 pathogenic variants (rs61757294 and rs61751507) with an AR mode of inheritance; the MAFs of these two variants in Kuwaiti exomes were uncharacteristic of pathogenic variants (see above for more details); (b) a set of 4 drug response variants (one of which was AR); (c) a set of 14 risk variants and 2 protective variants for complex traits (3 were AR); and (d) a set of two variants associated with phenotype traits through GWAS. Five of the disorders associated with the SAFD variants were annotated in CAGS as observed in Arab countries (see Table 6).

Assessing the Loss-of-Function SAFD Variants for clinical significance

We identified 26 LoF SAFD variants (Supplementary Table S4); 15 of these were stop-gain, seven were start-loss and the remaining four were splice-site mutations. None of these 26 SAFD LoF variants was annotated for a disorder in OMIM; however, the GWAS Catalog51 listed one of these variants, namely rs2228015-C from the CCR7 gene, as associated with the complex phenotype trait of lymphocyte count (at a genome-wide significant p-value of 6E-09).

CAGS disorders for which the OMIM-listed causal variants were seen in Kuwaiti exomes

We further examined the CAGS database for disorders observed in Kuwait at any incidence rate and for disorders seen in any of the Arab countries at incidence rates of ≥11 per 100,000. The CAGS database provided the Phenotype MIM number, which we used to retrieve the OMIM-reported causal variants and to check for their occurrence in Kuwaiti exomes. For 25 disorders, the OMIM-reported variants were seen in Kuwaiti exomes (Table 7); eight of these 25 disorders had already been seen in the analysis for functional variants. Except in one instance (rs1800858), all the variants were missense. 13 of these variants were “pathogenic” and the remaining 12 were “risk factor” variants. 18 of these disorders were observed in Kuwait and the remainder in other Arab countries.

Table 7 Arab disorders (annotated in CAGS) for which the OMIM-listed risk variants were seen in Kuwaiti exomes.

Scrutinization of the identified variants against Arab mutations reported in Arab studies

Analyses performed so far in the study indicated that disorders relating to the following variants were seen in Arab populations: 20 instances of rare & deleterious variants, 16 of which were pathogenic variants for rare disorders and 4 were risk factor variants for complex disorders (see Table 3); 7 instances of pharmacogenomic variants that were associated with complex disorders in Arab studies (see Table 5); 5 instances of SAFD variants (see Table 6); and 17 additional instances from the analysis of CAGS disorders (see Table 7). During the analysis, we also found in Kuwaiti exomes two recessive mutations (namely rs1801133 & rs1801131 from MTHFR; see Table 8) associated with recessive early onset of susceptibility to Type 2 diabetes in the Arab population. We set out to identify which of these variants were also reported in Arab studies for the corresponding disorder. Upon performing a literature survey and a manual examination of the bibliography data presented in the CAGS database, these variants could be classified into the following categories (Table 8): (a) 16 instances where the OMIM-listed variants identified in Kuwaiti exomes were also reported as Arab mutations in Arab studies. 9 of these were pathogenic variants for rare disorders; 1 was a drug response variant; and 6 were risk factors for complex disorders. (b) 7 instances where the identified pharmacogenomic variants in Kuwaiti exomes were also observed as associated with disorders in Arab studies. These were drug response variants for complex disorders. (c) 12 instances of disorders where the genetic basis at SNV level had not been reported in Arab studies. 7 of these were pathogenic variants for rare disorders and 5 were risk factors for complex disorders. (d) 10 instances where the Arab studies reported variants different from the OMIM-listed variants observed in Kuwaiti exomes; however, the Arab-reported variants were from the same gene. The Arab variants were generally seen in OMIM but not in our exomes. Eight of these variants were pathogenic for rare disorders and two were risk factors for complex disorders. (e) 7 instances of variants (from 4 disorders) where the Arab studies reported variants from genes different from those of the OMIM-listed variants observed in Kuwaiti exomes. In general, the different genes and the different mutations were listed in OMIM for the disorder but were not seen in Kuwaiti exomes. All 7 variants were risk factors for complex disorders.

Table 8 Evaluation of the identified variants for observation as Arab mutations in Arab studies.

Identified variants and the associated complex disorders

Examining our in-house genome-wide association study (GWAS) data (on 1351 native Kuwaiti Arab individuals genotyped on Illumina HumanOmniExpress BeadChip and 1900 native Kuwaiti Arab individuals genotyped on Illumina HumanCardio-Metabo BeadChip) for the presence of the variants identified through exome analysis revealed that 27 of the identified OMIM-listed causal variants present in Kuwaiti exomes were also seen in the GWAS data. Allele frequencies and carrier distributions in the exome and GWAS data sets are presented in Table 9. 13 of these 27 variants were associated with disorders observed in Arab studies. The 27 variants were pharmacogenomic (11), SAFD (9) and CAGS (5 + 2) variants for complex disorders. The allele frequencies in the Kuwaiti exome, Kuwaiti GWAS and GME data sets were comparable with each other, and the carrier distributions were similar between the Kuwaiti exome and GWAS data sets.

Table 9 Comparison of allele frequencies & carrier distribution of the reported variants from Kuwaiti exome data set with a larger data set of our in-house genome-wide genotype data.

Discussion

In this study, exomes from 291 healthy, unrelated native Kuwaiti Arabs were analysed to identify 170,508 SNVs and 3,341 indels. 12% of SNVs and 28% of indels were novel. One-third of the identified SNVs were population-specific, and 21.7% were ‘personal’ (observed in only one Kuwaiti exome and not seen in GME or 1KGP), consistent with the results of other studies on ethnic populations, including those from Qatar44, Spain17 and Denmark12. 53% of the identified SNVs were missense, and an average of 1.3% of the 14,557 SNVs that each person carried were predicted to affect protein function. Allele frequencies in 6,186 SAFD variants were significantly different from those observed in 1KGP populations.

Recent population genetic analyses have demonstrated that humans harbour an abundance of rare & deleterious variations, with >80% of all coding variants having a frequency of ≤1%10,14,52. In this study, a majority (51%) of the identified SNVs in Kuwaiti exomes were rare. Of the identified 55,644 population-specific SNVs, only 138 were ‘common’, and the rest were ‘rare’ or ‘low-frequency’. Up to 60% of the population-specific variants were missense changes, and 51% of LoF variants were population-specific (some of which were polymorphic). These observations support the notion that coding variants with allele frequency of <1% show increased population-specificity and are enriched for functional variants13. Human populations have experienced recent explosive growth, expanding by at least three orders of magnitude over the past 400 generations; such a rapid recent growth along with weak purifying selection has increased the load of rare variants, many of which are deleterious and relevant for understanding disease risks14,16.

On average, nearly 10.4% of the Kuwaiti population-specific variants found in each Kuwaiti individual were homozygous; this extent of homozygosity, which is higher than that observed in other ethnic populations (such as the value of 7.05% in Spanish17), reflects the higher rate of consanguinity practised among the Kuwaiti Arab population. The GME study26 demonstrated an increased burden of runs of homozygosity (ROH) in Greater Middle East populations; our previous work had shown that the Kuwaiti population is heterogeneous (placed between populations that have a large amount of ROH and those with low ROH), with the KWS subgroup being highly endogamous24. An average of 73 LoF variants (of which 4.5 were Kuwaiti-specific) were seen per individual. The failure of observed disease-causing mutations to cause disease in at least a proportion of the individuals who carry them has been extensively discussed53. On average, only 4.67% of Kuwaiti-specific LoF variants per individual were homozygous (as opposed to the expected 10.4%), and such reduced homozygosity among LoF variants may explain the reduced penetrance.

Rare homozygous loss-of-function variants are expected to exhibit strong signs of selective pressure. Of the genes harbouring the 36 rare (MAF <2.0%) homozygous putative LoF variants observed in Kuwaiti exomes, only 8 were in common with the published list of inactivated genes from Icelanders42 and only 1 was in common with the list from ExAC8. These findings suggest that the set of non-clinically relevant loss-of-function variants is far from complete26, and that considering ethnic populations with consanguinity, as in the GME study and our study, can augment the list of human knock-out events.

We previously catalogued36 exome variants from 15 native Kuwaiti individuals of the KWS subgroup (city-dwelling Saudi Arabian tribe ancestry24) and postulated that further samples were needed to capture the full spectrum of exome variability. The present study indicated that our previous work captured only a portion (22%) of the variability. The repertoire of ‘all’ SNVs and ‘population-specific’ variants increased with the number of samples sequenced and did not reach a plateau (Fig. 2). However, once population-specific variants were divided into personal and genuine polymorphic variants, the latter reached a plateau. These data suggested that most of the Kuwaiti-specific polymorphisms within coding regions were restricted to approximately 10,000 positions (Fig. 2, dashed blue line).

The use of whole-exome data in population structure analysis produces results congruent with those obtained using genome-wide genotype data54. In this study, principal component analysis of the merged data set of exome variants from Kuwait, 1KGP global populations, and Qatar confirmed the existence of the three subgroups (Fig. 1) previously derived from genome-wide genotype data24 in the Kuwaiti population. The KWB subgroup showed greater genetic affinity towards African populations, and the other two clearly demarcated subgroups (namely KWP and KWS) were between South Asian and European populations. Furthermore, the three substructures of the Qatari population25 lay close to those of Kuwaitis. These results were supported by evidence from pFST likelihood ratio tests, which identified variants that differentiated the subgroups. The population-wide occurrence of Kuwaiti SAFD variants in the context of maximum allele frequency populations indicated pairing of Kuwaiti individuals mostly with European and Ashkenazi Jewish populations from the 1KGP phase 3 and gnomAD data sets. In line with previous studies43, this genetic relatedness among Middle Eastern, European and Ashkenazi Jewish populations was further confirmed by population genetic analyses (FST and PCA) that also included genotype data from Ashkenazi Jews (Figs 4 and 5).

The Kuwaiti exomes presented in this study included 46 clinically significant deleterious variants that are rare in global populations and, except for three, in Kuwaiti exomes (pathogenic: 35; drug response: 1; risk factor: 10). 28 of the 35 pathogenic variants followed AR mode of inheritance and 7 of the 10 risk factor variants followed AD mode of inheritance. Disorders associated with 20 of the 46 variants were seen in Arab populations. Three of the 46 variants reached an MAF value characterizing low-frequency variants in Kuwaiti exomes; two of these three were risk factor variants and one was a drug response variant; and the allele frequencies were comparable with those in the GME data set (see Table 3): rs61742245-VKORC1: 1.04%, 1.37%; rs1800553-ABCA4: 2.41%, 2.1%; rs11909217-LIPI: 1.72%, 1.31%. These three variants, which indicated high risk ratios in Kuwaiti exomes for disease pathogenesis and response to medication, were the VKORC1 variant, associated with warfarin resistance (AD) (heterozygous in four individuals and homozygous recessive in one individual); the ABCA4 variant, associated with susceptibility to age-related macular degeneration (AD) (heterozygous in 12 individuals and homozygous recessive in one individual); and the LIPI variant, associated with susceptibility to hypertriglyceridemia (AD) (heterozygous in 10 individuals). Mendelian and rare genetic disorders as well as monogenic forms of common complex diseases are often associated with rare coding variants. The rare coding variants can have remarkably different allele frequencies in different ethnic populations compared with the 1KGP populations10,55. The data presented above include rare variants associated not only with rare Mendelian disorders but also with complex disorders. This observation is in agreement with literature reports of many examples of rare and low-frequency variants associated with complex phenotype traits and common disorders (some of the relevant studies are listed in Table 1 of Schork et al.56). An interesting example of a rare & deleterious “risk factor” variant associated with increased risk for complex disorders was the GHRL variant (rs34911341-C/T; Arg51Gln), which OMIM associated with susceptibility to the complex disorder of obesity (along with genes such as POMC, SDC3 and ADRB2); the variant was originally seen in 6.13% of 96 unrelated Swedish female subjects with morbid obesity (BMI 42.3 ± 3.4 kg/m2)57. The variant had been seen in the GME data set and in one individual from our study cohort; incidentally, the individual was a morbidly obese female with a BMI of 44.3 kg/m2. Though our study cohort included 48 morbidly obese female individuals, only one of them carried this GHRL allele.

The study identified in Kuwaiti exomes a set of 21 missense SNVs (predominantly ‘common’ in Kuwaiti exomes, in 1KGP populations and in GME) that mapped to drug-binding domains and were of pharmacogenomic relevance (relating to complex disorders, such as sickle-cell anaemia, hypertension, diabetes, asthma, cancer and chemical dependence). 7 of these 21 variants were also observed as Arab mutations associated with complex disorders in Arab populations (Table 5). Of the 21 pharmacogenomic SNVs, the CYP2C8*3 variants encoding two linked amino acid substitutions46 were particularly evident (Table 4) in Kuwaiti exomes; risk allele frequencies at these two variants were 12% in Kuwaiti exomes, 10.4% in GME and 4.6% in 1KGP; the risk alleles co-segregated in 33 individuals in our study cohort. CYP2C8 has emerged as a significant pharmacogene58,59 and is responsible for the biotransformation of 5% of currently used drugs that undergo phase 1 hepatic metabolism60. The CYP2C8*3 variants regulate the dosage of the diabetes drugs rosiglitazone and repaglinide61,62. The minor alleles of the CYP2C8*3 variants were also associated with decreased metabolism of paclitaxel59. The ADRB2 variant (occurring with an MAF of 1.2% in Kuwaiti exomes, 1.6% in GME and 0.4% in 1KGP) regulates the efficacy of the asthma drug terbutaline and of beta-blocking agents used to treat heart failure63. The ADRB2 variant had also been correlated with the risk of type 2 diabetes, obesity and hypertension63; six individuals from our study cohort carried the risk allele at this variant.

The identified 24 SAFD variants with clinical significance (all of which were missense variants; see Table 6) included (a) two ‘pathogenic’ variants (with AR mode of inheritance) associated with the rare disorders of corticosterone methyloxidase type 2 deficiency and carboxypeptidase N deficiency; (b) four ‘drug response’ variants associated with toxicity to the drugs cisplatin or cyclophosphamide and with response to anti-coagulation drugs; (c) sixteen ‘risk/protective’ variants associated with complex traits (e.g. asthma, Parkinson’s disease, obesity, nephrolithiasis, melanoma 6, and alcohol dependency); and (d) two ‘associated’ variants relating to the traits of FPG levels and skin/hair/eye pigmentation. The gnomAD populations that showed the highest MAFs at these 24 variants were Ashkenazi Jews (15 instances) and Europeans including Finnish (6 instances). The 1KGP populations that showed the highest MAFs were Europeans (14) and South Asians (7). As expected, a majority of these variants were ‘common’ (20 in Kuwaiti exomes and 14 in 1KGP data). The associated disorders are common in the region: cholestasis of pregnancy (associated with the ATP8B1 variant), the most common AD disorder in pregnant women, has an incidence rate of 0.8%–1.46% in South Asian populations64; hereditary prostate cancer (associated with the two ELAC2 variants) is one of the most prevalent cancers in Kuwait65; corticosterone methyloxidase type 2 deficiency (associated with the CYP11B2 variant) is more common in people of Iranian Jewish ancestry66; and coronary artery disease (associated with the MEF2A variant) has an incidence rate of approximately 6% in the Saudi Arabian population67.

In addition, two other SNVs from our analysis of functional variants were associated with quantitative traits in GWA studies: an SAFD LoF variant, rs2228015/CCR7, associated with the complex haematological trait of lymphocyte count in people of European ancestry68, and a missense variant (rare in 1KGP but common in Kuwaiti exomes), rs117135869/TTC38, associated with a novel complex metabolic quantitative trait locus (mQTL) in a cohort from a Middle Eastern population48. A further search of Kuwaiti exomes for OMIM-listed causal variants relating to CAGS disorders led to a list of 17 additional variants (see Table 7); 7 of these variants were “pathogenic” and the remaining 10 were “risk factor” variants. The analysis identified a total of 25 CAGS disorders for which the OMIM-listed causal variants were seen in Kuwaiti exomes; such a low yield of only 25 was probably due to the small size of the cohort.

Of the 112 variants of clinical significance discussed so far, as many as 44 were ‘common’ variants in Kuwaiti exomes. Very often these common variants related to complex disorders. In this study, we did not ourselves delineate the variants associated with complex disorders; rather, we examined whether and which of the functional variants identified in our study were annotated in OMIM, ClinVar, PharmGKB and the literature as associated with complex (or rare) disorders. A question arose as to whether the study cohort of 291 exomes had enough power. Of the 112 variants, 27 variants (comprising 1 rare, 3 low-frequency and 23 common variants) were also seen in our in-house GWAS data set of larger sample size; the set of 27 variants comprised 11 pharmacogenomic, 9 SAFD and 5 CAGS variants and the two MTHFR variants associated with susceptibility to T2DM (Table 9). The MAFs at these variants were comparable among the Kuwaiti exomes, the GWAS data and the GME data set; the carrier distributions were also comparable between the Kuwaiti exomes and the GWAS data set.

Finally, disorders relating to 52 of the identified variants were observed in the Arab population. Inheritance modes associated with these 52 variants were: 28 autosomal recessive, 15 autosomal dominant, and 9 ambiguous. 25 of these variants related to ‘rare’ disorders and 27 related to complex disorders. This study (based on 291 exomes) provided data on 23 known Arab mutations for 23 disorders seen in Arab populations, data on 12 putative mutations for 12 disorders observed but not yet characterized for genetic basis in Arab populations, and data on 17 additional putative mutations for disorders characterized for genetic basis in Arab populations. These data are useful for testing in future case-control studies.

Genetic variation in the Middle East region is poorly represented in global studies. However, the Greater Middle Eastern (GME) Variome Consortium26 has recently made a notable effort to address this concern by capturing genetic variation from exomes of 1,111 unrelated and supposedly healthy individuals from Northwest and Northeast Africa, Turkish peninsula, Syrian desert, Arabian Peninsula and Persia & Pakistan. The GME data set included 214 exomes from the Arabian Peninsula (AP), of which 45 are from the Kuwaiti population. Our study of 291 Kuwaiti samples, sourced from the 3 Kuwaiti population subgroups, complements and augments the GME genetic variation data by presenting a higher number of exomes representing a single state of the AP, namely Kuwait. Furthermore, the GME study discovered and presented the variegated genetic architecture in GME populations; this is complemented by the population genetics results from our study, which are based on a relatively larger sample set of native Arabs living in a single state of the Peninsula. The GME study demonstrated the utility of the GME exome data set in discovering the genetic basis of Mendelian disorders in Greater Middle Eastern populations; our study provides data on Arab mutations for 23 disorders and points to 31 OMIM-listed variants relating to disorders seen in Arab populations for testing in future case-control studies.

A potential limitation of this study arises from the number of exomes sequenced. Though the number of population-specific polymorphic variants seemed to saturate with 291 exomes, the total number of “all” identified variants did not saturate (Fig. 2); this indicates that further samples need to be sequenced to sufficiently represent the Arab population of Kuwait. Furthermore, the reported Kuwaiti exome data contained variants associated with only a small set of the disorders observed in the region.

In conclusion, the presented assessment of 291 exomes of unrelated healthy individuals unveiled the prevalence of rare as well as common variants related to various Mendelian disorders and common complex diseases that are predominantly inherited as recessive. The inclusion of different genome data sets in our analyses highlighted similarities in allele frequencies among Arabs and Jews, and among nomadic Bedouins and Africans. Furthermore, our data corroborate the Kuwaiti population substructures previously determined by genome-wide genotype data; the results on population structure from Kuwait are generally in agreement with the variegated genetic architecture seen in Greater Middle Eastern populations26. The striking occurrence of pharmacogenomic variants relating to common complex disorders underlines the importance of, and need for, cataloguing genetic variants in similar Arab populations of the Middle East region. This study is a significant addition to regional data resources (such as GME26) and global resources (such as 1KGP3,4) on human exome variability; however, a wide range of similar studies in the region is warranted to support genomic discoveries in medical and population genetics at the regional and global levels26.

Methods

Ethics Statement

The protocols used in the study were approved by the International Scientific Advisory Board and the Ethical Review Committee at Dasman Diabetes Institute, Kuwait. Written informed consent was obtained from participants before collecting blood samples. Identities of the participants were protected from public exposure, and samples/data were processed anonymously. All methods were performed in accordance with the relevant guidelines and regulations.

Selection of subjects for whole-exome sequencing

To capture the extent of exome variation in the entire Kuwaiti population, 291 healthy, unrelated native Kuwaiti individuals from the study cohorts used in our earlier studies were selected24,36,69,70. At the time of recruitment, all participants in this study were healthy and deemed free of Mendelian or rare genetic disorders, cognitive or physical disability, mental retardation or chronic disorders, such as cancer. The distribution of the selected participants in the three subgroups of the Kuwaiti population24 was as follows: 109 in KWS (Saudi Arabian tribe ancestry), 126 in KWP (Persian ancestry) and 34 in KWB (nomadic Bedouin ancestry).

Whole-exome sequencing

High-quality DNA samples were enriched for exomes using the TruSeq Exome Enrichment kit and the Nextera Rapid Capture Exome kit (Illumina Inc. USA). The captured libraries were then clustered using the TruSeq Paired Cluster Kit V3 (Illumina Inc. USA) and sequenced on a HiSeq 2000 using Illumina’s Sequencing by Synthesis technology as 100-bp paired-end reads.

Exome data analysis

The HugeSeq71 computational pipeline was used to automate the variant discovery process. Sequence reads were aligned to the reference human genome build hg19 using BWA72. Prior to variant calling, alignment files were processed using the Genome Analysis Toolkit (GATK)73. Post-alignment procedures included PCR duplicate removal, local realignment around known indels and base quality recalibration. Best practices for the GATK workflow were followed, and standard hard filtering parameters74,75 were used for variant discovery from the processed alignment files. Variant calling on each sample’s BAM file was performed using HaplotypeCaller, followed by joint genotyping analysis of the resultant gVCFs to create raw SNV and indel VCFs. Variants called in the sequenced exomes were restricted to intervals covered by both the TruSeq (163 samples) and Nextera (128 samples) Exome Enrichment kits. To improve the quality of the data set, the resulting variant call sets were filtered by setting sample variant thresholds at ≥10X depth, <180X depth and genotype quality of >20. Variants with allele balance of <30% were removed to filter out sites where the fraction of non-reference reads was too low. Hardy–Weinberg equilibrium was assessed using an exact test, as defined by Wigginton et al.76, and sites with p-values of <1E-05 were excluded. Lastly, all variants with a call rate of <90% were excluded. Thus, after the variant quality filtering steps, only the consensus of variants determined using both kits appeared in the final VCFs.
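The quality thresholds described above can be sketched as simple predicates. This is a minimal illustration only and not the pipeline used in the study: the function names are hypothetical, allele balance is assumed to be the fraction of non-reference reads at a called genotype, and the Hardy–Weinberg exact-test p-value is assumed to have been computed elsewhere.

```python
def genotype_passes(depth, genotype_quality):
    """Keep a genotype call with depth >= 10X and < 180X and genotype quality > 20."""
    return 10 <= depth < 180 and genotype_quality > 20


def allele_balance_passes(alt_reads, total_reads, min_balance=0.30):
    """Keep a call only if the non-reference read fraction is at least 30%.

    Assumption: allele balance = alt_reads / total_reads.
    """
    if total_reads <= 0:
        return False
    return alt_reads / total_reads >= min_balance


def site_passes(call_rate, hwe_p_value):
    """Keep a site with call rate >= 90% and HWE exact-test p-value >= 1e-5."""
    return call_rate >= 0.90 and hwe_p_value >= 1e-5
```

Keeping the three checks separate mirrors the order in the text: genotype-level thresholds first, then allele balance, then site-level exclusions.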

Classifying the variants

The Ensembl genome database build 75 was used as reference for gene annotation. SNP Variation Suite (SVS) v8.7.1 from Golden Helix Inc77 was used to derive functional classifications of the identified variants. The identified SNVs and indels were categorised as ‘known’ and ‘novel’ based on the content of the single-nucleotide polymorphism database of dbSNP14678. Variants already reported in dbSNP146 were annotated as ‘known’, and the others were annotated as ‘novel’. Variants observed in only a single exome from the study cohort and not seen in 1KGP or GME data sets were annotated as ‘personal’. Variants (excluding the ‘personal’) that were not observed in 1KGP phase 3 data were annotated as ‘population-specific’, and population-specific variants observed in more than one exome from the study cohort were annotated as ‘population-specific polymorphic’ variants. Variants leading to stop gain, stop loss, frameshift and damage in splice sites were annotated to cause LoF (loss of function). Variants were classified as ‘rare’ if MAF was <1% (personal variants were not considered as rare), as ‘low-frequency’ if MAF was 1–5% and as ‘common’ if MAF was ≥5%.
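The classification rules above can be expressed as two small functions. This is a sketch under stated assumptions: the function names and the label "shared" (for variants also present in 1KGP) are hypothetical, and the 'low-frequency' class is taken as 1% ≤ MAF < 5%, since the text leaves the upper boundary implicit.

```python
def frequency_class(maf, is_personal=False):
    """Label a variant by minor allele frequency (MAF given as a fraction)."""
    if is_personal:
        return "personal"        # personal variants were not counted as rare
    if maf < 0.01:
        return "rare"
    if maf < 0.05:
        return "low-frequency"   # assumed boundary: 1% <= MAF < 5%
    return "common"


def origin_class(n_exomes, in_1kgp, in_gme):
    """Label a variant by where it was observed.

    n_exomes: number of study exomes carrying the variant.
    in_1kgp / in_gme: whether it is present in 1KGP phase 3 / GME.
    """
    if n_exomes == 1 and not in_1kgp and not in_gme:
        return "personal"
    if not in_1kgp:
        if n_exomes > 1:
            return "population-specific polymorphic"
        return "population-specific"
    return "shared"              # hypothetical label, not used in the study
```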

Principal component analysis of the merged set of exome variants from Kuwaiti and global populations

The 1KGP phase 3 exomes of 2,504 individuals from 26 populations, covering the four continents of Asia, Africa, America and Europe and 100 exomes of Qatari individuals44 were considered along with Kuwaiti exomes. An LD-pruned (LD threshold of 0.5) data set of 20,215 variants (having MAF of ≥5%) observed in all three Kuwaiti subgroups and the regional and global populations was created. Golden Helix SVS software v8.7.1 was used to perform principal component analysis with the merged data set.

pFST likelihood ratio tests: Comparison of allele frequency distribution among the Kuwaiti population subgroups

Reference alleles and alternate alleles were binned to set the standard for ‘Kuwaiti exome’. In order to detect alleles driving differentiation among the three Kuwaiti subpopulation groups of KWP, KWS and KWB, pFST likelihood ratio tests79 for allele frequency differences in autosomal variants (filtered for missingness rate and deviation from Hardy–Weinberg equilibrium) were performed.

Identification of SNVs with significant differences (SAFD variants) in allele frequencies between Kuwaiti and global populations

Autosomal SNVs observed in both Kuwaiti exomes and 1KGP phase 3 exomes4 were identified, and SNVs for which minor alleles were not observed in Kuwaiti exomes were excluded. SAFD variants that exhibited significant allele frequency differences were identified by performing one-sided binomial exact tests (allele frequencies in 1KGP global populations were considered as ‘expected’), followed by Bonferroni correction. A p-value threshold of 0.05 was used to assess the significance of allele frequency differences. ClinVar80 data resource was used to assess the clinical significance of the identified SAFD variants. In the context of population structure analyses, populations from gnomAD data set8 were also used to compare allele frequency distributions. The comprehensive scrutinization of population-wide occurrence was performed by considering the paired incidence of populations with maximum allele frequency.
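The test described above can be sketched with the standard library alone. This is an illustration under assumptions and not the authors' implementation: the direction of the one-sided test is not stated in the text, so the upper tail (Kuwaiti allele count at least as large as observed) is assumed, and the function names are hypothetical.

```python
import math


def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    log_n_fact = math.lgamma(n + 1)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (log_n_fact - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests whose Bonferroni-adjusted p-value is below alpha."""
    m = len(p_values)
    return [min(1.0, p * m) < alpha for p in p_values]
```

As a usage example, for rs5030739 (MAF 8.42% in Kuwaiti exomes, 2.32% in 1KGP), and assuming all 582 alleles from the 291 exomes were called, the observed minor allele count is about 49, so the raw p-value would be `binom_upper_tail(49, 582, 0.0232)`; the number of tests used for the Bonferroni correction is the number of SNVs tested, which is not restated here.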

Principal Component Analysis of the merged set of Kuwaiti exomes, Ashkenazi Jews, 1KGP phase 3 and Qatar and FST analysis

We combined Kuwaiti exomes with the data sets from Ashkenazi Jews43, 1KGP phase 34 and Qatar44. The combined data set of coding-region variants was cleaned and LD-pruned to obtain a total of about 896 variants and 3,336 individuals representing world populations. Principal component analysis (PCA) was performed using smartpca in the EIGENSOFT software package (v 6.1.4)81,82. Two-dimensional and three-dimensional PCA scatter plots were created using RStudio83 (v 1.1.423). Mean pairwise FST values and the matrix between populations were generated using PLINK84 (v 1.9). The FST heatmap was created using RStudio (v 1.1.423).

Examining OMIM and ClinVar annotations for inferring clinical significance of SNVs

The following criteria were applied. OMIM and ClinVar had to mention the Kuwaiti exome SNV, with literature evidence and citation reference, as an associated variant for a disorder; the dbSNP identifier of the SNV and the observed risk allele had to be mentioned as such in the OMIM and ClinVar annotation80,85. The clinical significance of the variant had to be stated consistently with the same term (such as ‘pathogenic’ or ‘risk factor’ or ‘protective’ or ‘drug response’) in all the records for the disorder; that is, it could not be the case that some records listed the significance as ‘pathogenic’ while others listed it as ‘benign’ or ‘conflicting interpretation’ for the disorder. ClinVar records listing “not specified” for the data item of ‘conditions’ were not considered. As is the practice86, in cases of ClinVar variants with conflicting annotation for clinical significance, evidence from a peer-reviewed publication and manual curation (OMIM) took precedence over evidence from clinical testing submissions. ClinVar defines “Pathogenic” variants as those that are interpreted for Mendelian disorders, or as those that have low penetrance; “Drug response” variants as those that affect drug response, and not a disease; “Risk factor” variants as those that are interpreted not to cause a disorder but to increase the risk; “Association” variants as those that were identified in a GWAS study and further interpreted for their clinical significance; “Protective” variants as those that decrease the risk of a disorder, including infections; and “Susceptibility to” variants as those that increase the risk of a disorder. In those instances wherein ClinVar annotated a variant as “pathogenic” but the associated disorder was “complex or common or more prevalent in the study population”, or the patient carrying the variant was annotated in OMIM as susceptible to the disorder (which is often a common disorder), we reannotated the variant as “Risk factor by inference”.
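The precedence and reannotation rules above can be summarised in a short sketch. It is illustrative only: the function and parameter names are hypothetical, and the decision as to whether a disorder is complex or whether OMIM lists susceptibility is assumed to be made manually and passed in as a boolean.

```python
def resolve_conflict(literature_call, clinical_testing_call):
    """Literature or manually curated (OMIM) evidence takes precedence
    over clinical testing submissions when both are available."""
    if literature_call is not None:
        return literature_call
    return clinical_testing_call


def reannotate(clinvar_call, disorder_is_complex, omim_lists_susceptibility):
    """Downgrade 'pathogenic' to 'risk factor by inference' for complex or
    common disorders, or when OMIM describes susceptibility only."""
    if clinvar_call == "pathogenic" and (disorder_is_complex
                                         or omim_lists_susceptibility):
        return "risk factor by inference"
    return clinvar_call
```

For example, a variant called pathogenic in the literature and benign by clinical testing, for a complex trait, resolves to "pathogenic" in the first step and to "risk factor by inference" in the second.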

Classifying disorders as ‘rare’ or ‘complex’

Various resources were examined to ascertain whether a disorder is rare or common: Catalogue for Transmission Genetics in Arabs (available at http://cags.org.ae/ctga/), Genetic and Rare Diseases (GARD) Information Center (available at https://rarediseases.info.nih.gov/diseases/), Genetics Home Reference (available at https://ghr.nlm.nih.gov/), Medscape (available at https://geneaware.clinical.bcm.edu/GeneAware/AboutGeneAware/DiseaseSearch.aspx) and literature.

Examining the Kuwaiti exomes for rare, deleterious and pathogenic variants

‘Known’ missense and LoF SNVs with an MAF of <1% in the 1KGP phase 3 data4 and the ExAC database8 were catalogued. Of these, only the variants annotated as damaging by both the SIFT87 and PolyPhen-288 tools were retained. The Kuwaiti exomes were examined for such variants. As an additional step, the resulting variants were filtered based on their Combined Annotation-Dependent Depletion (CADD) score89 to prioritise functional, deleterious and pathogenic variants across many functional categories, effect sizes and genetic architectures. A scaled score cut-off of ≥20 was applied to retrieve only those variants that were predicted to be among the top 1% of deleterious variants in the human genome. The resulting set of variants was screened for clinical significance using the OMIM85 and ClinVar80 databases.
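The successive filters described above can be expressed as a single predicate. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Variant` fields and the prediction labels are hypothetical, and treating both "probably damaging" and "possibly damaging" PolyPhen-2 calls as damaging is an assumption, since the text does not specify it.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Variant:
    """Pre-annotated SNV (hypothetical field names)."""
    variant_id: str
    consequence: str    # "missense", "lof" or another class
    maf_1kgp: float     # MAF in 1KGP phase 3
    maf_exac: float     # MAF in ExAC
    sift: str           # assumed labels: "deleterious" / "tolerated"
    polyphen: str       # assumed labels: "probably_damaging" / "possibly_damaging" / "benign"
    cadd_scaled: float  # scaled (PHRED-like) CADD score


MAF_CUTOFF = 0.01    # rare: MAF < 1% in both reference data sets
CADD_CUTOFF = 20.0   # scaled score >= 20, i.e. predicted top 1% most deleterious
# Assumption: both PolyPhen-2 "damaging" classes count as damaging.
POLYPHEN_DAMAGING = {"probably_damaging", "possibly_damaging"}


def is_rare_deleterious(v: Variant) -> bool:
    """True if the variant passes every filter described in the text."""
    return (
        v.consequence in {"missense", "lof"}
        and v.maf_1kgp < MAF_CUTOFF
        and v.maf_exac < MAF_CUTOFF
        and v.sift == "deleterious"
        and v.polyphen in POLYPHEN_DAMAGING
        and v.cadd_scaled >= CADD_CUTOFF
    )


def filter_variants(variants: Iterable[Variant]) -> List[Variant]:
    """Keep only the rare, putatively deleterious variants."""
    return [v for v in variants if is_rare_deleterious(v)]
```

The variants returned by `filter_variants` would then be looked up in OMIM and ClinVar, which is outside the scope of this sketch.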

Variants found to be ‘rare’ within global populations but ‘common’ within Kuwaiti population

A data set of SNVs that are rare in 1KGP phase 3 populations but common within Kuwaiti exomes was created. For the missense variants in this data set, scores predicting their pathogenicity were calculated using the REVEL90 software.
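The selection of variants that are rare globally but common locally can be sketched as a frequency comparison. The sketch below rests on several assumptions that this section does not state: "rare in 1KGP phase 3 populations" is read as an MAF below 1% in every 1KGP population, variants with no 1KGP frequency data are excluded, and the threshold for "common" is left as a required argument. The function name and input layout are hypothetical.

```python
from typing import Mapping


def rare_globally_common_locally(
    global_mafs: Mapping[str, float],
    local_maf: float,
    *,
    common_cutoff: float,
    rare_cutoff: float = 0.01,
) -> bool:
    """Flag a variant that is rare in every reference population but common locally.

    global_mafs maps a 1KGP population label to the MAF in that population.
    common_cutoff has no default because the text does not state the value used.
    """
    if not global_mafs:
        # Assumption: variants without reference frequencies are not classified.
        return False
    rare_everywhere = max(global_mafs.values()) < rare_cutoff
    return rare_everywhere and local_maf >= common_cutoff
```

For example, with an illustrative `common_cutoff=0.05`, a variant at 0.1% to 0.4% in the reference populations and 8% in the Kuwaiti exomes would be flagged.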

Examining the pharmacogenomic relevance of Kuwaiti exome variants

Variants of pharmacogenomic relevance were delineated using resources built upon the concept of the druggable genome, originally formulated by Hopkins and Groom49, and PharmGKB50. From the data set of variants derived for Kuwaiti exomes, missense SNVs (with an MAF of >1%) that are not deleterious (i.e. SIFT and PolyPhen-2 scores were outside the deleteriousness range) were mapped to protein domains (using InterPro91) and checked for inclusion in the list of 130 domains reported by Hopkins and Groom. From the resulting set of variants mapping to drug-binding domains, only those for which pharmacogenomic annotation was available in the PharmGKB database were retained.
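The pharmacogenomic filter can be sketched as a sequence of set-membership checks. This is an illustration under stated assumptions rather than the study's code: the `Snv` fields are hypothetical, "not deleterious" is read as neither SIFT nor PolyPhen-2 calling the variant deleterious, and the druggable-domain list and the set of PharmGKB-annotated identifiers are assumed to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import AbstractSet, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Snv:
    """Pre-annotated SNV from the exome data set (hypothetical field names)."""
    variant_id: str
    maf: float                     # MAF in the Kuwaiti exomes
    is_missense: bool
    sift_deleterious: bool         # SIFT call already thresholded upstream
    polyphen_damaging: bool        # PolyPhen-2 call already thresholded upstream
    interpro_domains: FrozenSet[str] = frozenset()  # domains the residue maps to


def pharmacogenomic_candidates(
    snvs: Iterable[Snv],
    druggable_domains: AbstractSet[str],
    pharmgkb_ids: AbstractSet[str],
    maf_cutoff: float = 0.01,
) -> List[Snv]:
    """Apply the four filters from the text in order."""
    kept = []
    for v in snvs:
        # 1. Missense SNVs with an MAF of >1%.
        if not v.is_missense or v.maf <= maf_cutoff:
            continue
        # 2. Not deleterious (assumption: neither tool flags the variant).
        if v.sift_deleterious or v.polyphen_damaging:
            continue
        # 3. Maps to at least one domain in the druggable-domain list.
        if not (v.interpro_domains & druggable_domains):
            continue
        # 4. Has a pharmacogenomic annotation in PharmGKB.
        if v.variant_id not in pharmgkb_ids:
            continue
        kept.append(v)
    return kept
```

Passing the domain list and the annotated identifiers as plain sets keeps the sketch independent of any particular InterPro or PharmGKB download format.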