Assessment of coding region variants in Kuwaiti population: implications for medical genetics and population genomics

Consanguineous populations of the Arabian Peninsula have been underrepresented in global efforts that catalogue human exome variability. We sequenced 291 whole exomes of unrelated, healthy native Arab individuals from Kuwait to a median coverage of 45X and characterised 170,508 single-nucleotide variants (SNVs), of which 21.7% were ‘personal’. Up to 12% of the SNVs were novel and 36% were population-specific. Half of the SNVs were rare and 54% were missense variants. The study complemented the Greater Middle East Variome by way of reporting many additional Arabian exome variants. The study corroborated Kuwaiti population genetic substructures previously derived using genome-wide genotype data and illustrated the genetic relatedness among Kuwaiti population subgroups, Middle Eastern, European and Ashkenazi Jewish populations. The study mapped 112 rare and frequent functional variants relating to pharmacogenomics and disorders (recessive and common) to the phenotypic characteristics of Arab population. Comparative allele frequency data and carrier distributions of known Arab mutations for 23 disorders seen among Arabs, of putative OMIM-listed causal mutations for 12 disorders observed among Arabs but not yet characterized for genetic basis in Arabs, and of 17 additional putative mutations for disorders characterized for genetic basis in Arab populations are presented for testing in future Arab studies.

Statistics of variants observed in Kuwaiti exomes. Ti:Tv, transition/transversion ratio. @ Variants from our cohort that were not seen in 1KGP were termed 'Kuwaiti population-specific' variants. # Personal SNVs are those observed in only a single exome from the study cohort and not seen in the 1KGP or GME data sets. These are "private mutations" and remain so until they are observed in further exomes/genomes sequenced in future studies. & Loss-of-function (LoF) variants represent the sum total of stop-gain, stop-loss, frameshift and splicing variants. LoF variants are expected to correlate with complete loss of function of the affected transcripts, and include stop codon-introducing (nonsense) or splice site-disrupting single-nucleotide variants (SNVs), insertion/deletion (indel) variants predicted to disrupt a transcript's reading frame, and larger deletions removing either the first exon or >50% of the protein-coding sequence of the affected transcript. $ These were calculated using all the identified SNVs, including the personal variants.
Comparison with Greater Middle East (GME) Variome data. Results of comparing the variants observed in our study with those reported in GME populations 26 are presented in Table 2. Up to 64% of the SNVs identified in our study were also seen in GME; the remaining 36%, not seen in GME, are expected to enlarge the variome of the GME region. GME provided supporting evidence to designate up to 25% of Kuwaiti population-specific singleton mutations (seen in only one exome from the study cohort) as genuine SNVs. As many as 58% of the population-specific polymorphic variants observed in Kuwaiti exomes were also seen in the GME variome.
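The comparison described above amounts to intersecting two sets of variants. The following is a minimal sketch, assuming each variant is keyed by a (chromosome, position, reference allele, alternate allele) tuple; the function name and the toy data are hypothetical and are not taken from the study.

```python
def shared_fraction(study_variants, reference_variants):
    """Return the fraction of study variants that also occur in the reference set."""
    study = set(study_variants)
    if not study:
        raise ValueError("study set is empty")
    return len(study & set(reference_variants)) / len(study)


# Toy variant keys (chrom, pos, ref, alt); illustrative only.
kuwaiti = {("1", 100, "A", "G"), ("1", 250, "C", "T"),
           ("2", 75, "G", "A"), ("3", 10, "T", "C")}
gme = {("1", 100, "A", "G"), ("2", 75, "G", "A"), ("5", 900, "C", "G")}

print(f"{shared_fraction(kuwaiti, gme):.0%} of the toy study variants are in the reference set")
```

The complement of this fraction corresponds to the variants that would be new to the reference resource.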

Extent of variability in Kuwaiti exomes.
In each of the categories of 'all', 'known', 'novel' and 'Kuwaiti population-specific' variants, the observed number of variants increased linearly with the increasing number of sequenced exomes and did not reach a plateau (Fig. 2). A similar trend was observed when the three subgroups were examined individually (Supplementary Fig. S1). However, when the population-specific variants were divided into 'personal' and 'population-specific polymorphic' variants, the population-specific variants shared by more than one individual reached a plateau.
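An accumulation curve of this kind can be sketched as follows. This is an illustrative sketch only: it assumes each exome is given as a set of variant keys, and it treats a variant as "shared" once it has been seen in at least two of the exomes processed so far. The study's definition of 'personal' additionally requires absence from 1KGP and GME, which is not modelled here; the function name is hypothetical.

```python
from collections import Counter


def accumulation_curves(exomes):
    """For each prefix of the exome list, count distinct variants overall and
    distinct variants carried by at least two of the exomes seen so far.

    exomes: list of sets of variant keys, one set per individual.
    Returns a list of (n_exomes, n_distinct, n_shared) tuples.
    """
    carriers = Counter()
    shared = 0
    curve = []
    for n, variants in enumerate(exomes, start=1):
        for v in variants:
            carriers[v] += 1
            if carriers[v] == 2:  # variant has just become shared
                shared += 1
        curve.append((n, len(carriers), shared))
    return curve


# Toy input: three individuals with overlapping variants.
toy = [{"a", "b"}, {"a", "c"}, {"a", "b", "d"}]
for n, total, shared in accumulation_curves(toy):
    print(n, total, shared)
```

A plateau in the third column while the second keeps rising would correspond to the behaviour described above, where new exomes keep adding singletons but few new shared variants.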

SAFD variants with significant allele frequency differences between the Kuwaiti and 1KGP global populations; and analysis of their population-wide occurrence. Examination of the SNVs seen in common between Kuwaiti exomes and 1KGP phase 3 exome data for significant allele frequency differences led to identifying 6,186 SAFD variants. Functional characterization of these variants is presented in Supplementary Table S4. Of these 6,186 SAFD variants, 2,960 were missense, 2,913 were synonymous, 20 were stop-gain and 26 were LoF. Extent of LoF variants among the SAFD SNVs was only 0.4% while it was 1.

Validation of the genetic relatedness implied by analysis for population-wide occurrence of SAFD variants.
In order to further explore the observed coupling in maximum allele frequency between Kuwaitis and other populations (including the Ashkenazi Jews), Kuwaiti exome data were merged with the data sets from Ashkenazi Jews 43 , Qatar 44 and 1KGP phase 3. Upon applying quality control steps and LD-pruning to the combined data set of coding-region variants, a total of 896 variants from 3,336 individuals was obtained. Genetic differentiation of Kuwaiti subpopulation groups in terms of regional and continental populations was assessed by calculating mean pairwise F_ST (Supplementary Fig. S4 27 ).

Drug response variant: The VKORC1 variant was associated with warfarin resistance in AD mode. Risk factor variants: The 10 risk factor variants mapped to 9 genes and to susceptibility to 8 unique disorders. The inheritance patterns were mostly autosomal dominant (in three instances, the pattern can be AR as well as AD). For 4 of the 10 risk factor variants, the disorders were observed in the Arab population as per the CAGS database.
Pathogenic variants and high MAF in Kuwaiti exomes. Five of the identified "rare & deleterious" variants that were annotated "pathogenic" for clinical significance in ClinVar were seen to possess risk allele frequencies of ≥1% in Kuwaiti exomes as opposed to <1% in 1KGP populations. ClinVar defines "pathogenic" variants as those that are interpreted for Mendelian disorders, or as those that have low penetrance. It is also possible for a variant in ClinVar to have an erroneous or conflicting classification. Cassa et al. 45 examined 81,432 "pathogenic" variants from HGMD 7 in a data set of whole-genome sequences of 1,092 individuals from the 1KGP project and found that 4.62% of the tested variants possessed an MAF of ≥1% and 3.5% possessed an MAF of ≥5%; they concluded that many of these variants are probably erroneous findings or have lower penetrance than previously expected. It is also possible that such high-frequency pathogenic variants are of the type "increased susceptibility" rather than "causal"; that the disorders with such high-MAF "pathogenic" variants are not really "rare" but are either "common" or "more prevalent in the study population"; or that the frequent variants have evidence to cause a disease when inherited in compound heterozygous state but insufficient evidence to lead to a disease in homozygotes. The five "pathogenic" variants that were seen in Kuwaiti exomes with an MAF of ≥1% are as follows:

(a) Four variants retained as pathogenic for rare disorders:

(i) rs79204362 (MAF_KWT: 1.03% and MAF_1KGP: 0.42%) associated with early-onset glaucoma: ClinVar annotated this variant as Pathogenic based on evidence from literature studies and as of uncertain significance based on clinical testing. The disorder was supposed to be rare (1 in 10,000) in European-based populations and of higher frequency in the Middle East at 51-100 per 100,000 (i.e. 5 in 10,000); CTGA reported a high incidence rate of 1 in 2,500 in the Saudi Arabian population. Thus, the marginally higher MAF of 1.03% was acceptable.

(ii) rs61732874_C > A (MAF_KWT: 1.55% and MAF_1KGP: 0.18%) associated with Familial Mediterranean fever (FMF): ClinVar annotated this as Pathogenic/likely-pathogenic based on both literature evidence and clinical testing. FMF is a rare disorder in the European population; however, it is no longer a rare disorder in certain populations such as Japan (see Table 7). CAGS also listed the incidence as 51-100 per 100,000 in the Arab population; CAGS further mentioned that estimates of the incidence of FMF in specific eastern Mediterranean populations ranged from 1 in 2000 to 1 in 100, depending on the population studied. Thus, the MAF that is seen at

ClinVar annotated this variant as pathogenic based on literature evidence and as benign with clinical testing as the source of annotation. Though the incidence rate world-wide is 1 in 185,000, CAGS reported an incidence rate of 2 in 10,000 in Bahrain; still, the higher MAF is not justified. This mutation was seen in one of the three patients from the study 46 and the patient was a compound heterozygote for a C to G transversion at nucleotide 309 in exon 4 [rs121965001] and a G to A transition at nucleotide 1165 in exon 9 [rs12021720], causing an Ile-to-Met substitution at amino acid 37 and a Gly-to-Ser substitution at amino acid 323, respectively. Thus, the high MAF at one of the two variants of the compound heterozygote was acceptable for a pathogenic variant; in fact, the variant was a common variant in 1KGP as well.

(b) One variant retained as pathogenic for a rare disorder but with a suggestion that it can be "likely benign": rs61751507 (MAF_KWT: 7.47% and MAF_1KGP: 2.7%) associated with Carboxypeptidase N deficiency, which is possibly a complex disorder. ClinVar annotated this variant as pathogenic with evidence from a literature publication and as benign based on information from clinical testing. The study 47 found this pathogenic variant in just one patient, and hence the evidence may be considered insufficient. This variant can therefore be considered "Likely Benign".
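The reasoning above weighs an observed MAF against a reported disease incidence. One way to make that comparison explicit is through Hardy-Weinberg proportions. The sketch below is an assumption-laden illustration, not the method used in the study: it assumes a fully penetrant autosomal recessive disorder, a single causal locus and random mating (an assumption that consanguinity violates, since inbreeding raises the proportion of homozygotes). The function names are hypothetical.

```python
import math


def hwe_genotype_freqs(q):
    """Expected genotype frequencies for allele frequency q under Hardy-Weinberg."""
    p = 1.0 - q
    return {"hom_ref": p * p, "carrier": 2 * p * q, "hom_alt": q * q}


def max_allele_freq_from_incidence(incidence):
    """Combined pathogenic allele frequency implied by the incidence of a fully
    penetrant recessive disorder (incidence = q**2, so q = sqrt(incidence))."""
    return math.sqrt(incidence)


# An incidence of 1 in 2,500 implies a combined allele frequency of about 2%.
print(f"{max_allele_freq_from_incidence(1 / 2500):.4f}")

# Expected carrier frequency for a variant with an MAF of 1.55%.
print(f"{hwe_genotype_freqs(0.0155)['carrier']:.4f}")
```

Under these assumptions, a single variant with an MAF above the implied combined frequency would be hard to reconcile with full penetrance, which is the kind of check applied informally in the text.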

Missense variants rare within global populations but common within Kuwaiti population. 170 SNVs were identified as rare in global populations but common in Kuwaiti exomes; 85 of these were missense variants (Supplementary Table S6). The 85 variants were of two categories: (a) A set of 20 variants harboured in genes annotated for disorders in OMIM: However, these 20 variants were not of any pathogenic value as ClinVar annotated them as either 'benign' or 'conflicting interpretation'. Not surprisingly, the REVEL scores in these instances (except in 2 instances: the GLDC variant at around 0.8 and the DPYD variant at 0.4) were low at ≤0.3. (b) A set of 65 variants harboured in genes NOT annotated for any disorder in OMIM: Association with phenotypes was seen with only one of these 85 variants; the TTC38 variant rs117135869 (REVEL = 0.621; MAF_KWT: 5.0%; MAF_1KGP: 0.58%) has been recently identified as a novel metabolic quantitative trait loci

Missense variants mapping to drug-binding domains and of pharmacogenomic relevance. We identified 21 missense SNVs that mapped to a set of 130 drug-binding domains reported in literature 49 and were annotated in PharmGKB 50 (Table 4). These 21 variants had an impact on the efficacy of drugs used largely for treating common disorders (such as heart failure, hypertension, chemotherapy, neoplasms, diabetes, nephrosclerosis, rheumatoid arthritis, asthma, pulmonary diseases, schizophrenia, tobacco use disorder, heroin dependence, sickle-cell anemia, and HIV). Furthermore, a literature survey revealed that 7 of these 21 pharmacogenomic variants were associated with complex disorders in Arab studies (Table 5).

Table S7). These 230 variants were from 186 unique genes, for 162 of which inheritance patterns were known; 91 were AR and 63 were AD. 206 of these 230 variants were benign or likely benign. The disorders related to the genes harbouring the benign variants were often single-gene disorders and familial, hereditary and congenital.
The 24 non-benign variants (Table 6) were from 21 unique genes associated with 20 unique disorders. Seven of these 24 non-benign SAFD variants were annotated in ClinVar as "Pathogenic"; however, either the associated disorder was common/complex or more prevalent in the study population, or the patient carrying the variant was annotated in OMIM as susceptible to the disorder (which is usually a common disorder). Going by the practice that "pathogenic" variants are related to Mendelian disorders, we considered the variants associated with common disorders as risk factors.

(i) rs1800435_G > C (MAF_KWT: 12.41%; MAF_1KGP: 6.4%) associated with "Aminolevulinate dehydratase, alad*1/alad*2 polymorphism susceptibility to lead poisoning ALAD porphyria". ClinVar annotated this as pathogenic based on literature evidence and likely benign based on clinical testing. It increases the risk for lead poisoning. ALAD porphyria is a very rare genetic metabolic disease; however, lead poisoning is no longer a rare disorder, as the CDC report on lead poisoning notes: "There are approximately half a million U.S. children ages 1-5 with blood lead levels above 5 micrograms per deciliter (µg/dL), the reference level at which CDC recommends public health actions be initiated". Susceptibility is the keyword, and we reannotated this variant as risk factor.

(ii) rs5030737 (MAF_KWT: 7.90%; MAF_1KGP: 2.8%) associated with Mannose-binding lectin deficiency, which is a complex trait. ClinVar annotated the variant as pathogenic based on literature reference; since we associate "pathogenic" with Mendelian disorders, we reannotated this variant as risk factor.

(iii) rs121918530 (MAF_KWT: 1.03%; MAF_1KGP: 0.04%) associated with coronary artery disease/myocardial infarction, which is a complex multifactorial disorder. ClinVar annotated this variant as pathogenic based on literature evidence and likely benign based on clinical testing. Since we associate "pathogenic" with Mendelian disorders, we reannotated this variant as risk factor.

(iv) rs5030739 (MAF_KWT: 8.42%; MAF_1KGP: 2.32%) associated with "Prostate cancer hereditary 2, susceptibility to" (complex trait). ClinVar annotated this variant as pathogenic based on literature evidence and benign based on clinical testing. The cited literature suggested increased risk of prostate cancer; 'susceptibility to' was the keyword. It was also the case that this variant has to appear in compound heterozygosity with the next listed variant, rs4792311. We reannotated this variant as risk factor.

(v) rs4792311 (MAF_KWT: 35

This set of 24 SAFD variants with clinical significance was distributed into (a) a set of 2 pathogenic variants (rs61757294 and rs61751507) with AR mode of inheritance; the MAFs of these two variants in Kuwaiti exomes were uncharacteristic of pathogenic variants (see above for more details); (b) a set of 4 drug response variants (one of which was AR); (c) a set of 14 risk variants and 2 protective variants for complex traits (3 were AR); and (d) a set of two variants associated with phenotype traits through GWAS studies. Five of the disorders associated with the SAFD variants were annotated in CAGS as observed in Arab countries (see Table 6).

SAFD variants (Supplementary Table S4); as many as 15 of these were stop-gain, seven were start-loss and the remaining four were splice site mutations. None of these 26 SAFD LoF variants was annotated for a disorder in OMIM; however, the GWAS Catalog 51 listed one of these variants, namely rs2228015-C from the CCR7 gene, as associated with the complex phenotype trait of lymphocyte counts (at a genome-wide significant p-value of 6E-09).
CAGS disorders for which the OMIM-listed causal variants were seen in Kuwaiti exomes. We further examined the CAGS database for disorders observed in Kuwait at any incidence rate and for disorders seen in any of the Arab countries at incidence rates of ≥11 per 100,000. The CAGS database provided the Phenotype MIM number, using which we retrieved the OMIM-reported causal variants and checked for their occurrence in Kuwaiti exomes. For 25 disorders, the OMIM-reported variants were seen in Kuwaiti exomes (Table 7); eight of these 25 disorders had already been seen in the analysis for functional variants. Except in one instance (rs1800858), all the variants were missense. 13 of these variants were "pathogenic" and the remaining 12 were "risk factor" variants. 18 of these disorders were observed in Kuwait and the remaining in other Arab countries.

Scrutiny of the identified variants against Arab mutations reported in Arab studies. Analyses performed so far in the study indicated that disorders relating to the following were seen in the Arab population: 20 instances of rare & deleterious variants, 16 of which were pathogenic variants for rare disorders and 4 of which were risk factor variants for complex disorders (see Table 3); 7 instances of pharmacogenomic variants that were associated with complex disorders in Arab studies (see Table 5); 5 instances of SAFD variants (see Table 6); and 17 additional instances from the analysis of CAGS disorders (see Table 7). During the analysis, we also found in Kuwaiti exomes two recessive mutations (namely rs1801133 and rs1801131 from MTHFR; see Table 8) associated with recessive early onset of susceptibility to Type 2 diabetes in the Arab population. We set out to identify which of these variants were also reported in Arab studies for the corresponding disorder. Upon performing a literature survey and manual examination of the bibliography data presented in the CAGS database, these variants could be classified into the following categories (Table 8).

Allele frequencies and carrier distributions as seen in the exome data set and the GWAS data set are presented (Table 9). 13 of these 27 variants were associated with disorders observed in Arab studies. The 27 variants were pharmacogenomic (11), SAFD (9) and CAGS (5 + 2) variants for complex disorders. The allele frequencies among the data sets of Kuwait exome, Kuwait GWAS and GME were comparable with each other, and the carrier distributions were similar between the Kuwaiti exome and GWAS data sets.

Discussion
In this study, exomes from 291 healthy, unrelated native Kuwaiti Arabs were analysed to identify 170,508 SNVs and 3,341 indels. 12% of SNVs and 28% of indels were novel. One-third of the identified SNVs were population-specific, and 21.7% were 'personal' (observed in only one Kuwaiti exome and not seen in GME or 1KGP), consistent with the results of other studies on ethnic populations, including those from Qatar 44 , Spain 17 and Denmark 12 . 53% of the identified SNVs were missense, and an average of 1.3% of the 14,557 SNVs that each person carried were predicted to affect protein function. Allele frequencies in 6,186 SAFD variants were significantly different from those observed in 1KGP populations.
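The statistical test used to call an allele frequency difference significant is not described in this section. As a hedged illustration of how such a comparison can be made for a single variant, the sketch below applies a two-sided Fisher's exact test to a 2x2 table of allele counts. The function name and the example counts are hypothetical, and a real analysis over thousands of variants would also need a multiple-testing correction.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    The p-value is the total hypergeometric probability of all tables with the
    same margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / denom
             for x in range(lo, hi + 1)}
    p_obs = probs[a]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9)))


# Hypothetical allele counts: alternate and reference alleles among
# 2 * 291 = 582 study chromosomes versus 5,008 reference chromosomes.
p_value = fisher_exact_two_sided(72, 582 - 72, 320, 5008 - 320)
print(f"p = {p_value:.3g}")
```

Exact integer arithmetic through `math.comb` keeps the sketch dependency-free; a statistics library would be the more usual choice in practice.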
Recent population genetic analyses have demonstrated that humans harbour an abundance of rare & deleterious variations, with >80% of all coding variants having a frequency of ≤1% 10,14,52 . In this study, a majority (51%) of the identified SNVs in Kuwaiti exomes were rare. Of the identified 55,644 population-specific SNVs, only 138 were 'common' , and the rest were 'rare' or 'low-frequency' . Up to 60% of the population-specific variants were missense changes, and 51% of LoF variants were population-specific (some of which were polymorphic). These observations support the notion that coding variants with allele frequency of <1% show increased population-specificity and are enriched for functional variants 13 . Human populations have experienced recent explosive growth, expanding by at least three orders of magnitude over the past 400 generations; such a rapid recent growth along with weak purifying selection has increased the load of rare variants, many of which are deleterious and relevant for understanding disease risks 14,16 .
On average, nearly 10.4% of Kuwaiti population-specific variants found in every Kuwaiti individual were homozygous; this extent of homozygosity, which is higher than that observed in other ethnic populations (such as the value of 7.05% in Spanish 17 ), reflects the higher rate of consanguinity practised among the Kuwaiti Arab population. The GME study 26 demonstrated an increased burden of runs of homozygosity in Greater Middle East populations; our previous works had shown that Kuwaiti population is heterogeneous (placed between populations that have large amount of ROH and the ones with low ROH) with the KWS subgroup as highly endogamous 24 . An average of 73 LoF variants (of which 4.5 were Kuwaiti-specific) were seen per individual. Observed disease-causing mutations failing to cause disease in at least a proportion of the individuals who carry them has been extensively discussed 53 . On an average, only 4.67% of Kuwaiti-specific LoF variants per individual were seen homozygous (as opposed to the expected 10.4%) and such a reduced homozygosity among LoF variants may explain the reduced penetrance.
Rare homozygous loss-of-function variants are supposed to exhibit strong signs of selective pressure. Of the genes harboring the identified 36 rare (MAF <2.0%) homozygous putative LoF variants observed in Kuwaiti exomes, only 8 were in common with the published list of inactivated genes from Icelanders 42 and only 1 was in common with the list from ExAC 8 . These findings suggest that the set of non-clinically relevant loss-of-function variants is far from being complete 26 and that consideration of ethnic populations with consanguinity, as in the GME study and our study, can augment the list of human knock-out events.

We previously catalogued 36 exome variants from 15 native Kuwaiti individuals of the KWS subgroup (city-dwelling Saudi Arabian tribe ancestry 24 ) and postulated that further samples were needed to capture the full spectrum of exome variability. The present study indicated that our previous work captured only a portion (22%) of variability. The repertoire of 'all' SNVs and 'population-specific' variants increased with the number of samples sequenced and did not reach a plateau (Fig. 2). However, once population-specific variants were divided into personal and genuine polymorphic variants, the latter reached a plateau. These data suggested that most of the Kuwaiti-specific polymorphisms within coding regions were restricted to approximately 10,000 positions (Fig. 2, dashed blue line).

The use of whole-exome data in population structure analysis produces results congruent with those obtained using genome-wide genotype data 54 . In this study, principal component analysis of the merged data set of exome variants from Kuwait, 1KGP global populations, and Qatar confirmed the existence of three subgroups (Fig. 1) previously derived from genome-wide genotype data 24 in the Kuwaiti population.
The KWB subgroup showed greater genetic affinity towards African populations, and the other two clearly demarcated subgroups (namely KWP and KWS) lay between South Asian and European populations. Furthermore, the three substructures of the Qatari population 25 lay close to the Kuwaiti subgroups. These results were supported by evidence from pFst likelihood ratio tests, which identified variants that differentiated the subgroups. The population-wide occurrence of Kuwaiti SAFD variants in the context of maximum allele frequency populations indicated pairing of Kuwaiti individuals mostly with European and Ashkenazi Jewish populations from the 1KGP phase 3 and gnomAD data sets. Such genetic relatedness among Middle Eastern, European and Ashkenazi Jewish populations was further confirmed through population genetic analyses (F_ST and PCA) that also included genotype data from Ashkenazi Jews (Figs 4 and 5), in line with previous studies 43 .
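The F_ST values referred to above quantify allele frequency differentiation between pairs of populations. The estimator used in the study is not stated in this section; the sketch below uses Hudson's estimator, combined across variants as a ratio of sums, purely as an illustration. The function name and the input layout (alternate allele count and number of sampled chromosomes per variant) are assumptions.

```python
def hudson_fst(pop1, pop2):
    """Hudson's F_ST between two populations, combined across variants as a
    ratio of summed numerators to summed denominators.

    pop1, pop2: lists of (alt_allele_count, n_chromosomes) tuples, one per
    variant and in the same variant order; each n_chromosomes must be >= 2.
    """
    num = den = 0.0
    for (a1, n1), (a2, n2) in zip(pop1, pop2):
        p1, p2 = a1 / n1, a2 / n2
        # Squared frequency difference, corrected for finite sample size.
        num += ((p1 - p2) ** 2
                - p1 * (1 - p1) / (n1 - 1)
                - p2 * (1 - p2) / (n2 - 1))
        # Probability that two alleles drawn from different populations differ.
        den += p1 * (1 - p2) + p2 * (1 - p1)
    if den == 0:
        raise ValueError("no variant is polymorphic across the two populations")
    return num / den


# Toy allele counts for three variants in two populations.
pop_a = [(12, 40), (3, 40), (30, 40)]
pop_b = [(25, 60), (20, 60), (44, 60)]
print(round(hudson_fst(pop_a, pop_b), 4))
```

Averaging such pairwise values over a set of comparison populations would give one possible reading of "mean pairwise F_ST"; the sample-size correction means that the estimate can fall slightly below zero for populations with near-identical frequencies.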

C. 6 Pathogenic variants that already appeared in the list of RARE & DELETERIOUS variants (see
The Kuwaiti exomes, presented in this study, included 46 clinically significant deleterious variants that are rare in global populations and in Kuwaiti exomes (except for three) (pathogenic: 35; drug response:1; risk factor: 10). 28 of the 36 pathogenic variants followed AR mode of inheritance and the 7 of the 10 risk factor variants followed AD mode of inheritance. Disorders associated with 20 of the 46 variants were seen in Arab populations. Three of the 46 variants reached an MAF value characterizing low-frequency variants in Kuwaiti exomes; two of these three were risk factor and one was drug response variant; and the allele frequencies were comparable with GME data set (see Table 3)-(rs61742245-VKORC1:1.04%,1.37%; rs1800553-ABCA4:2.41%,2.1%; rs11909217-LIPI:1.72%,1.31%). The three variants indicating high risk ratios in Kuwaiti exomes for disease pathogenesis and response to medication were: the VKORC1 variant was associated with warfarin resistance (AD) (heterozygous in four individuals and homozygous recessive in one individual), the ABCA4 variant was associated with susceptibility to age-related macular degeneration (AD) (heterozygous in 12 individuals and homozygous recessive in one individual), and the LIPI variant was associated with susceptibility to hypertriglyceridemia (AD) (heterozygous in 10 individuals). Mendelian and rare genetic disorders as well as monogenic forms of common complex diseases are often associated with rare coding variants. The rare coding variants can have remarkably different allelic frequencies in different ethnic populations compared with the 1KGP populations 10,55 . The data presented above reported rare variants associated with not only rare Mendelian disorders but also with complex disorders. 
This observation is in agreement with literature reports of many examples of rare and low-frequency variants associated with complex phenotype traits and common disorders (some of the relevant studies are listed in Table 1 of Schork et al. 56 ). An interesting example of a rare & deleterious "risk factor" variant associated with increased risk for complex disorders was the GHRL variant (rs34911341-C/T; Arg51Gln), which OMIM associated with susceptibility to the complex disorder of obesity (along with genes such as POMC, SDC3 and ADRB2); the variant was originally seen in 6.13% of 96 unrelated Swedish female subjects with morbid obesity (BMI 42.3 ± 3.4 kg/m²) 57 . The variant had been seen in the GME data set and in one individual from our study cohort; incidentally, the individual was a morbidly obese female with a BMI of 44.3 kg/m². Though our study cohort included 48 morbidly obese female individuals, only one of them carried this GHRL allele.
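The allele frequencies quoted above for the ABCA4 and LIPI variants can be reproduced from the reported carrier counts. The sketch below assumes that all 291 individuals were genotyped at each site, which the text does not state; the function name is hypothetical.

```python
def allele_frequency(n_het, n_hom_alt, n_individuals):
    """Alternate allele frequency in a diploid cohort from genotype counts.

    Each heterozygote contributes one alternate allele, each homozygote two,
    out of 2 * n_individuals chromosomes.
    """
    return (n_het + 2 * n_hom_alt) / (2 * n_individuals)


COHORT = 291  # assumed number of genotyped individuals per site

abca4 = allele_frequency(12, 1, COHORT)  # 12 heterozygotes, 1 homozygote
lipi = allele_frequency(10, 0, COHORT)   # 10 heterozygotes
print(f"ABCA4: {abca4:.2%}, LIPI: {lipi:.2%}")
```

Under the same assumption, the VKORC1 counts (four heterozygotes and one homozygote) give about 1.03%, marginally below the 1.04% reported, so the number of genotyped individuals at that site may have differed slightly.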
The study identified in Kuwaiti exomes a set of 21 missense SNVs (that were predominantly 'common' in both Kuwaiti exomes and 1KGP populations as well as in GME) mapping to drug-binding domains and were of pharmacogenomic relevance (relating to complex disorders, such as sickle-cell anaemia, hypertension, diabetes, asthma, cancer and chemical dependence). 7 of these 21 variants were also observed as Arab mutations associated with complex disorders in Arab populations (Table 5). Of the 21 pharmacogenomic SNVs, the CYP2C8*3 variants encoding two linked amino acid substitutions 46 were particularly evident (Table 4) in Kuwaiti exomes; risk allele frequencies at these two variants were 12% in Kuwaiti exomes, 10.4% in GME and 4.6% in 1KGP; the risk alleles co-segregated in 33 individuals in our study cohort. CYP2C8 has emerged as a significant pharmacogene 58,59 and is responsible for biotransformation of 5% of currently used drugs that undergo phase 1 hepatic metabolism 60 . The CYP2C8*3 variants regulate the dosage of the diabetes drugs rosiglitazone and repaglinide 61,62 . The minor alleles of the CYP2C8*3 variants were also associated with decreased metabolism of paclitaxel 59 . The ADRB2 variant (occurring with an MAF of 1.2% in Kuwaiti exomes, 1.6% in GME and 0.4% in 1KGP) regulates the efficacy of the asthma drug terbutaline and beta-blocking agents used to treat heart failure 63 . ADRB2 variant had also been correlated with the risk of type 2 diabetes, obesity and hypertension 63 ; six individuals from our study cohort carried risk allele at this variant.
The identified 24 SAFD variants with clinical significance (all of which were missense variants; see Table 6) included (a) two pathogenic variants (with AR mode of inheritance) associated with the rare disorders of Corticosterone methyloxidase type 2 deficiency and Carboxypeptidase N deficiency; (b) four 'drug response' variants associated with toxicity to the drugs cisplatin or cyclophosphamide and with response to anti-coagulation drugs; (c) sixteen 'risk/protective' variants associated with complex traits (e.g. asthma, Parkinson's disease, obesity, nephrolithiasis, melanoma 6, and alcohol dependency); and (d) two 'associated' variants relating to traits of FPG levels and skin/hair/eye pigmentation. The gnomAD populations that showed the highest MAFs at these 24 variants were Ashkenazi Jews (15 instances) and Europeans including Finnish (6 instances). The 1KGP populations that showed the highest MAFs were Europeans (14) and South Asians (7). As expected, the majority of these variants were 'common' (20 in Kuwaiti exomes and 14 in 1KGP data). The associated disorders are common in  67 .
In addition, two other SNVs from our analysis for functional variants were seen associated with quantitative traits in GWA studies -an SAFD LoF variant rs2228015/CCR7 associated with the complex hematological trait of lymphocyte count in European-ancestry people 68 and a missense variant (rare in 1KGP but common in Kuwaiti exomes) rs117135869/TTC38 associated with a novel complex metabolic quantitative trait loci (mQTLs) in a cohort from Middle Eastern population 48 . Further search for presence in Kuwaiti exomes of OMIM-listed causal variants relating to CAGS disorders led to a list of additional 17 variants (see Table 7); 7 of these variants were "pathogenic" and the remaining 10 were "risk factor" variants. The analysis identified a total of 25 CAGS disorders  for which the OMIM-listed causal variants were seen in Kuwaiti exomes; such a poor turnout of only 25 was probably due to the small size of the cohort. Of the 112 variants of clinical significance discussed so far, as many as 44 were 'common' variants in Kuwaiti exomes. Very often these common variants were relating to complex disorders. In this study, we did not ourselves delineate the variants associated with complex disorders; we rather just examined whether and which of the functional variants identified in our study were annotated in OMIM, ClinVar, PharmGKB, and literature as associated with complex (or rare) disorders. A question arose as to whether the study cohort of 291 exomes had enough power. Of the 112 variants, 27 variants (comprising 1 rare, 3 low-frequency and 23 common variants) were also seen in our in-house GWAS data set of larger sample size; the set of 27 variants comprised 11 pharmacogenomic, 9 SAFD, 5 CAGS and the two MTHFR variants associated with susceptibility to T2DM ( Table 9). 
The MAF at these variants were comparable among the Kuwaiti exomes, GWAS data, and the GME data set; the carrier distributions were also comparable with one another among the Kuwaiti exomes and GWAS data set.
Finally, disorders relating to 52 of the identified variants were observed in Arab population. Inheritance modes associated with these 52 variants were: 28 autosomal recessive, 15 autosomal dominant, and 9 ambiguous. 25 of these variants were relating to 'rare' and 27 were relating to complex disorders. This study (based on 291 exomes) provided data on 23 known Arab mutations for 23 disorders seen in Arab populations, data on 12 putative mutations for 12 disorders observed but not yet characterized for genetic basis in Arab population, and data on 17 additional putative mutations for disorders characterized for genetic basis in Arab populations. This data is useful for testing in future case-control studies.
The extent of genetic variation in the Middle East region is poorly represented in global studies. However, the Greater Middle Eastern (GME) Variome Consortium 26 has recently made a notable effort to address this concern by capturing genetic variation from the exomes of 1,111 unrelated and supposedly healthy individuals from Northwest and Northeast Africa, the Turkish peninsula, the Syrian desert, the Arabian Peninsula, and Persia and Pakistan. The GME data set included 214 exomes from the Arabian Peninsula (AP), of which 45 are from the Kuwaiti population. Our study, consisting of 291 Kuwaiti samples sourced from the 3 Kuwaiti population subgroups, complements and augments the GME genetic variation data by presenting a higher number of exomes representing a single state of the AP, namely Kuwait. Furthermore, the GME study discovered and presented the variegated genetic architecture in GME populations; this is complemented by the population genetics results from our study, which are based on a relatively larger sample set of native Arabs living in a single state of the Peninsula. The GME study demonstrated the utility of the GME exome data set in discovering the genetic basis of Mendelian disorders in Greater Middle Eastern populations; our study provides data on Arab mutations for 23 disorders and points to 31 OMIM-listed variants relating to disorders seen in Arab populations for testing in future case-control studies.
A potential limitation of this study arises from the number of exomes sequenced. Though the number of population-specific variants seemed to saturate with 291 exomes, the total number of "all" identified variants did not saturate (Fig. 2); this indicates that more samples need to be sequenced to sufficiently represent the Arab population of Kuwait. Furthermore, the reported Kuwaiti exome data contained variants associated with only a small set of the disorders observed in the region.
In conclusion, the presented assessment of 291 exomes of unrelated healthy individuals unveiled the prevalence of rare as well as common variants related to various Mendelian disorders and common complex diseases that are predominantly inherited as recessive. The inclusion of different genome data sets in our analyses highlighted similarities in allele frequencies among Arabs and Jews, and among nomadic Bedouins and Africans. Furthermore, our data corroborate the Kuwaiti population substructures previously determined by genome-wide genotype data; the results on population structure in Kuwait are generally in agreement with the variegated genetic architecture seen in Greater Middle Eastern populations 26. The striking occurrence of pharmacogenomic variants relating to common complex disorders underlines the importance of, and need for, cataloguing genetic variants in similar Arab populations of the Middle East region. This study is a significant addition to regional data resources (such as GME 26) and global resources (such as 1KGP 3,4) on human exome variability; however, a wide range of similar studies in the region are warranted to support genomic discoveries in medical and population genetics at the regional and global levels 26.

Methods
Ethics Statement. The protocols used in the study were approved by the International Scientific Advisory Board and the Ethical Review Committee at Dasman Diabetes Institute, Kuwait. Written informed consent was obtained from participants before collecting blood samples. Identities of the participants were protected from public exposure, and samples/data were processed anonymously. All methods were performed in accordance with the relevant guidelines and regulations.
Selection of subjects for whole-exome sequencing. To capture the extent of exome variation in the entire Kuwaiti population, 291 healthy, unrelated native Kuwaiti individuals from the study cohorts used in our earlier studies were selected 24,36,69,70 . At the time of recruitment, all participants in this study were healthy and deemed free of Mendelian or rare genetic disorders, cognition or physical disability, mental retardation or chronic disorders, such as cancer. Distribution of the selected participants in three subgroups of Kuwaiti population 24
Exome data analysis. The HugeSeq 71 computational pipeline was used to automate the variant discovery process. Sequence reads were aligned to the reference human genome build hg19 using BWA 72 . Prior to variant calling, alignment files were processed using the Genome Analysis Toolkit (GATK) 73 . Post-alignment procedures included PCR duplicate removal, local realignment around known indels and base quality recalibration. Best practices for the GATK workflow were followed, and standard hard filtering parameters 74,75 were used for variant discovery from the processed alignment files. Variant calling on each sample's BAM file was performed using HaplotypeCaller followed by joint genotyping analysis of the resultant gVCFs to create raw SNV and indel VCFs. Variants called in the sequenced exomes were restricted to intervals covered by both TruSeq (163 samples) and Nextera (128 samples) Exome Enrichment kits. To improve the quality of the data set, the resulting variant call sets were filtered by setting sample variant thresholds at ≥10X depth, <180X depth and genotype quality of >20.
Variants with allele balance of <30% were removed to filter out sites where the fraction of non-reference reads was too low. Hardy-Weinberg equilibrium was assessed using an exact test, as defined by Wigginton et al. 76, and sites with p-values of <10⁻⁵ were excluded. Lastly, all variants with a call rate of <90% were excluded. Thus, after the variant quality filtering steps, only the consensus of variants determined using both kits appeared in the final VCFs.
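The quality filters described above can be sketched in a few lines of Python. This is an illustration only, not the pipeline used in the study: the function names are hypothetical, applying allele balance to heterozygous calls only is an assumption, and the Hardy-Weinberg routine is a plain re-implementation of the exact test in the style of Wigginton et al., with the two-sided p-value taken as the sum of probabilities no larger than that of the observed heterozygote count.

```python
def genotype_passes(depth, gq, ref_reads, alt_reads, is_het):
    """Genotype-level thresholds from the text: 10 <= depth < 180, GQ > 20, AB >= 30%."""
    if not (10 <= depth < 180) or gq <= 20:
        return False
    if is_het:  # assumption: allele balance is checked on heterozygous calls only
        total = ref_reads + alt_reads
        if total == 0 or alt_reads / total < 0.30:
            return False
    return True


def hwe_exact_p(n_het, n_hom1, n_hom2):
    """Exact Hardy-Weinberg test p-value from genotype counts at one site."""
    n = n_het + n_hom1 + n_hom2
    if n == 0:
        return 1.0
    rare = 2 * min(n_hom1, n_hom2) + n_het      # copies of the rarer allele
    probs = [0.0] * (rare + 1)                  # unnormalised P(het count = i)
    mid = rare * (2 * n - rare) // (2 * n)      # start near the expected het count
    if mid % 2 != rare % 2:                     # het count must share parity with `rare`
        mid += 1
    probs[mid] = 1.0
    total = 1.0
    hets, hom_r = mid, (rare - mid) // 2
    hom_c = n - hets - hom_r
    while hets > 1:                             # recurrence towards fewer heterozygotes
        probs[hets - 2] = probs[hets] * hets * (hets - 1) / (4.0 * (hom_r + 1) * (hom_c + 1))
        total += probs[hets - 2]
        hets, hom_r, hom_c = hets - 2, hom_r + 1, hom_c + 1
    hets, hom_r = mid, (rare - mid) // 2
    hom_c = n - hets - hom_r
    while hets <= rare - 2:                     # recurrence towards more heterozygotes
        probs[hets + 2] = probs[hets] * 4.0 * hom_r * hom_c / ((hets + 2) * (hets + 1))
        total += probs[hets + 2]
        hets, hom_r, hom_c = hets + 2, hom_r - 1, hom_c - 1
    observed = probs[n_het]
    return min(1.0, sum(p for p in probs if p <= observed) / total)


def site_passes(n_het, n_hom1, n_hom2, n_samples):
    """Site-level thresholds from the text: call rate >= 90% and HWE p >= 1e-5."""
    called = n_het + n_hom1 + n_hom2
    return called / n_samples >= 0.90 and hwe_exact_p(n_het, n_hom1, n_hom2) >= 1e-5
```

For example, a site with 100 heterozygotes and no homozygotes among 100 samples fails the Hardy-Weinberg check, whereas counts of 25, 50 and 25 pass it.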
Classifying the variants. The Ensembl genome database build 75 was used as reference for gene annotation.
SNP Variation Suite (SVS) v8.7.1 from Golden Helix Inc 77 was used to derive functional classifications of the identified variants. The identified SNVs and indels were categorised as 'known' and 'novel' based on the content of the single-nucleotide polymorphism database of dbSNP146 78. Variants already reported in dbSNP146 were annotated as 'known', and the others were annotated as 'novel'. Variants observed in only a single exome from the study cohort and not seen in 1KGP or GME data sets were annotated as 'personal'. Variants (excluding the 'personal') that were not observed in 1KGP phase 3 data were annotated as 'population-specific', and population-specific variants observed in more than one exome from the study cohort were annotated as 'population-specific polymorphic' variants. Variants leading to stop gain, stop loss, frameshift and damage in splice sites were annotated to cause LoF (loss of function  in all three Kuwaiti subgroups and the regional and global populations was created. Golden Helix SVS software v8.7.1 was used to perform principal component analysis with the merged data set.
pFST likelihood ratio tests: Comparison of allele frequency distribution among the Kuwaiti population subgroups. Reference alleles and alternate alleles were binned to set the standard for 'Kuwaiti exome'. In order to detect alleles driving differentiation among the three Kuwaiti subpopulation groups of KWP, KWS and KWB, pFST likelihood ratio tests 79 for allele frequency differences in autosomal variants (filtered for missingness rate and deviation from Hardy-Weinberg equilibrium) were performed.
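The annotation rules given under 'Classifying the variants' ('known', 'novel', 'personal', 'population-specific' and 'population-specific polymorphic') can be expressed as a small decision function. The sketch below is illustrative; the function name and its boolean inputs are hypothetical, and the membership look-ups against dbSNP146, 1KGP phase 3 and GME are assumed to have been done beforehand.

```python
def classify_variant(in_dbsnp, n_carrier_exomes, in_1kgp, in_gme):
    """Return the set of labels for one variant, following the rules in the text."""
    labels = {"known" if in_dbsnp else "novel"}
    # 'personal': seen in exactly one study exome and absent from both 1KGP and GME
    if n_carrier_exomes == 1 and not in_1kgp and not in_gme:
        labels.add("personal")
    # 'population-specific': not personal, and absent from 1KGP phase 3
    elif not in_1kgp:
        labels.add("population-specific")
        if n_carrier_exomes > 1:
            labels.add("population-specific polymorphic")
    return labels
```

Note that, under these rules, a variant carried by a single exome that is absent from 1KGP but present in GME is 'population-specific' without being 'personal' or 'polymorphic'.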

Identification of SNVs with significant differences (SAFD variants) in allele frequencies between Kuwaiti and global populations. Autosomal SNVs observed in both Kuwaiti exomes and 1KGP phase 3 exomes 4 were identified, and SNVs for which minor alleles were not observed in Kuwaiti exomes were excluded. SAFD variants that exhibited significant allele frequency differences were identified by performing one-sided binomial exact tests (allele frequencies in 1KGP global populations were considered as 'expected'), followed by Bonferroni correction. A p-value threshold of 0.05 was used to assess the significance of allele frequency differences. The ClinVar 80 data resource was used to assess the clinical significance of the identified SAFD variants. In the context of population structure analyses, populations from the gnomAD data set 8 were also used to compare allele frequency distributions. Population-wide occurrence was examined by considering the paired incidence of populations with maximum allele frequency.
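A one-sided binomial exact test with Bonferroni correction, as described above, can be sketched as follows. The direction of the one-sided test is not stated in the text, so testing for a higher-than-expected allele count (upper tail) is an assumption here, as are the function names and the example counts in the usage note.

```python
import math


def binom_log_pmf(k, n, p):
    """log P(X = k) for X ~ Binomial(n, p), computed in log space for stability."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binom_upper_tail(k, n, p):
    """Exact one-sided p-value P(X >= k); p is the 'expected' (1KGP) allele frequency."""
    if not 0.0 < p < 1.0:
        raise ValueError("expected frequency must lie strictly between 0 and 1")
    return min(1.0, sum(math.exp(binom_log_pmf(i, n, p)) for i in range(k, n + 1)))


def safd_flags(sites, alpha=0.05):
    """sites: list of (observed_allele_count, total_alleles, expected_frequency).

    Returns (bonferroni_adjusted_p, is_significant) per site.
    """
    m = len(sites)  # Bonferroni: multiply each p-value by the number of tests
    results = []
    for k, n, p in sites:
        adjusted = min(1.0, binom_upper_tail(k, n, p) * m)
        results.append((adjusted, adjusted < alpha))
    return results
```

With 291 diploid exomes there are 582 alleles per autosomal site; a hypothetical site with 60 copies of an allele whose expected frequency is 1% would be flagged, whereas one with 6 copies would not.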
Principal Component Analysis of the merged set of Kuwaiti exomes, Ashkenazi Jews, 1KGP phase 3 and Qatar and FST analysis. We combined Kuwaiti exomes with the data sets from Ashkenazi Jews 43, 1KGP phase 3 4 and Qatar 44. The combined data set of coding-region variants was cleaned and LD-pruned to obtain a total of about 896 variants and 3,336 individuals representing world populations. Principal component analysis (PCA) was performed using smartpca in the EIGENSOFT software package (v 6.1.4) 81,82. Two-dimensional and three-dimensional PCA scatter plots were created using RStudio 83 (v 1.1.423). Mean pairwise FST values and the matrix between populations were generated using PLINK 84 (v 1.9). The FST heatmap was created using RStudio (v 1.1.423).
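To make the pairwise FST computation concrete, the sketch below uses Hudson's estimator combined across SNVs as a ratio of averages. This is a deliberately simple stand-in and is not the calculation performed by PLINK in the study; the function name is hypothetical, and n1 and n2 are assumed to be the numbers of sampled alleles in each population.

```python
def hudson_fst(freq_pairs, n1, n2):
    """Hudson's FST for two populations, combined over SNVs as a ratio of averages.

    freq_pairs: iterable of (p1, p2) sample allele frequencies per SNV.
    n1, n2: number of sampled alleles in each population (assumption).
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in freq_pairs:
        # squared frequency difference, corrected for finite sample size
        numerator += (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
        # probability that two alleles drawn from different populations differ
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator if denominator > 0 else float("nan")
```

A fixed difference between two populations gives a value of 1, and identical sample frequencies give a value slightly below 0 because of the sample-size correction.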

Examining OMIM and ClinVar annotations for inferring clinical significance of SNVs. OMIM and ClinVar should mention the Kuwaiti exome SNV, with literature evidence and citation reference, as an associated variant for a disorder; the dbSNP identifier of the SNV and the observed risk allele should be mentioned as such in the OMIM and ClinVar annotation 80,85. The clinical significance of the variant should be stated consistently with the same term (such as 'pathogenic', 'risk factor', 'protective' or 'drug response') in all the records for the disorder; it should not be the case that some records list the significance as 'pathogenic' and others list it as 'benign' or 'conflicting interpretation' for the disorder. ClinVar records listing "not specified" for the data item of 'conditions' were not considered. As is the practice 86, in cases of ClinVar variants with conflicting annotation for clinical significance, evidence from a peer-reviewed publication and manual curation (OMIM) takes precedence over evidence from clinical testing submissions. ClinVar defines "Pathogenic" variants as those that are interpreted for Mendelian disorders, or as those that have low penetrance; "Drug response" variants as those that affect drug response, and not a disease; "Risk factor" variants as those that are interpreted not to cause a disorder but to increase the risk; "Association" variants as those that were identified in a GWAS study and further interpreted for their clinical significance; "Protective" variants as those that decrease the risk of a disorder, including infections; and "Susceptibility to" variants as those that increase the risk of a disorder. In those instances wherein ClinVar annotated a variant as "pathogenic" but the associated disorder was "complex or common or more prevalent in the study population", or the patient carrying the variant was annotated in OMIM as susceptible to the disorder (which is often a common disorder), we reannotated the variant as "Risk factor by inference".
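Two of the rules above, consistency of the clinical significance term across records and reannotation as "Risk factor by inference", lend themselves to a short sketch. The record layout, function names and boolean inputs below are hypothetical, and the precedence of peer-reviewed and OMIM evidence over clinical testing submissions is not modelled.

```python
def consensus_significance(records):
    """records: list of (condition, significance) pairs for one variant and one disorder.

    Records whose condition is 'not specified' are ignored. Returns the shared
    significance term, or None when the remaining records disagree or are absent.
    """
    terms = {sig.strip().lower() for cond, sig in records
             if cond.strip().lower() != "not specified"}
    return terms.pop() if len(terms) == 1 else None


def final_annotation(records, disorder_is_complex_or_common, omim_lists_susceptibility):
    """Apply the 'Risk factor by inference' reannotation described in the text."""
    term = consensus_significance(records)
    if term == "pathogenic" and (disorder_is_complex_or_common or omim_lists_susceptibility):
        return "risk factor by inference"
    return term
```

A variant whose records all say 'Pathogenic' for a common disorder is therefore returned as 'risk factor by inference', whereas a mix of 'Pathogenic' and 'Benign' records yields no annotation.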
Examining the Kuwaiti exomes for rare, deleterious and pathogenic variants. 'Known' missense and LoF SNVs having MAF of <1% in the 1KGP phase 3 data 4 and ExAC database 8 were catalogued. Of these, only the variants annotated as damaging by both SIFT 87 and PolyPhen-2 88 tools were retained. The Kuwaiti exomes were examined for such variants. As an additional step, the resulting variants were filtered based on their Combined Annotation-Dependent Depletion (CADD) score 89 to prioritise functional, deleterious and pathogenic variants across many functional categories, effect sizes and genetic architectures. A scaled score threshold of ≥20 was applied to retrieve only those variants that were predicted to be among the top 1% of deleterious variants in the human genome. The above set of variants was screened for clinical significance using the OMIM 85 and ClinVar 80 databases.
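The filtering cascade above can be sketched as a single predicate. The record type and field names are hypothetical, and the SIFT and PolyPhen-2 calls are taken as precomputed boolean flags because the text does not give score cut-offs. The helper at the end shows why a scaled CADD score of 20 corresponds to the top 1%: the scaled score is a PHRED-like transform of a variant's rank.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    variant_id: str          # hypothetical identifier
    consequence: str         # "missense" or "lof"
    in_dbsnp: bool           # 'known' variants only
    maf_1kgp: float
    maf_exac: float
    sift_damaging: bool
    polyphen_damaging: bool
    cadd_scaled: float


def is_rare_deleterious(v):
    """Known missense/LoF, MAF < 1% in both resources, damaging by both tools, CADD >= 20."""
    return (v.in_dbsnp
            and v.consequence in {"missense", "lof"}
            and v.maf_1kgp < 0.01 and v.maf_exac < 0.01
            and v.sift_damaging and v.polyphen_damaging
            and v.cadd_scaled >= 20.0)


def cadd_top_fraction(scaled_score):
    """Fraction of variants ranked at least this deleterious: 10 ** (-score / 10)."""
    return 10 ** (-scaled_score / 10.0)
```

By the same transform, a scaled score of 30 would correspond to the top 0.1%.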
Variants found to be 'rare' within global populations but 'common' within Kuwaiti population. A data set of SNVs that are rare in 1KGP phase 3 populations but common within Kuwaiti exomes was created. For such missense variants, scores predicting their pathogenicity were calculated using the REVEL 90 software.
Examining the pharmacogenomic relevance of Kuwaiti exome variants. Variants of pharmacogenomic relevance were delineated using the resources built upon the concept of the druggable genome originally formulated by Hopkins and Groom 49 and PharmGKB 50. From the data set of variants derived for Kuwaiti exomes, missense SNVs (with MAF of >1%) that are not deleterious (i.e. SIFT and PolyPhen-2 scores were outside the deleteriousness range) were mapped to protein domains (using InterPro 91) and checked for inclusion in the list of 130 domains reported by Hopkins and Groom. From the resulting set of variants mapping to drug-binding domains, only those for which pharmacogenomic annotation was available in the PharmGKB database were retained.
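The pharmacogenomic selection steps can be sketched as a chain of filters ending in two set look-ups. The dictionary keys, domain identifiers and variant identifiers below are placeholders invented for illustration; in practice the domain set would hold the 130 Hopkins and Groom domains and the identifier set would come from PharmGKB.

```python
def pharmacogenomic_candidates(variants, druggable_domains, pharmgkb_ids):
    """Return IDs of common, non-deleterious missense SNVs that fall in a
    druggable domain and carry a PharmGKB annotation."""
    kept = []
    for v in variants:
        if v["consequence"] != "missense" or v["maf"] <= 0.01:
            continue                      # missense SNVs with MAF > 1% only
        if v["sift_damaging"] or v["polyphen_damaging"]:
            continue                      # both tools must call it non-deleterious
        if not set(v["domains"]) & druggable_domains:
            continue                      # must map to at least one druggable domain
        if v["id"] in pharmgkb_ids:
            kept.append(v["id"])
    return kept


# Usage with placeholder data (all identifiers are hypothetical)
example = [
    {"id": "varA", "consequence": "missense", "maf": 0.12,
     "sift_damaging": False, "polyphen_damaging": False, "domains": ["DOM1"]},
    {"id": "varB", "consequence": "missense", "maf": 0.12,
     "sift_damaging": True, "polyphen_damaging": False, "domains": ["DOM1"]},
    {"id": "varC", "consequence": "missense", "maf": 0.05,
     "sift_damaging": False, "polyphen_damaging": False, "domains": ["DOM9"]},
]
selected = pharmacogenomic_candidates(example, {"DOM1", "DOM2"}, {"varA", "varC"})
```

Here only the first placeholder variant is selected: the second is called damaging by one tool and the third lies outside the druggable domain set.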

Data Availability
The 291