Introducing the first whole genomes of nationals from the United Arab Emirates

Whole Genome Sequencing (WGS) provides an in-depth description of genome variation. In the era of large-scale population genome projects, the assembly of ethnic-specific genomes combined with mapping human reference genomes of underrepresented populations has improved the understanding of human diversity and disease associations. In this study, for the first time, whole genome sequences of two nationals of the United Arab Emirates (UAE) at >27X coverage are reported. The two Emirati individuals were predominantly of Central/South Asian ancestry. An in-house customized pipeline using BWA and Picard, followed by the GATK tools, was used to map the raw data from the whole genome sequences of both individuals. A total of 3,994,521 variants (3,350,574 Single Nucleotide Polymorphisms (SNPs) and 643,947 indels) were identified for the first individual, the UAE S001 sample. A similar number of variants, 4,031,580 (3,373,501 SNPs and 658,079 indels), were identified for UAE S002. Variants that are associated with diabetes, hypertension, increased cholesterol levels, and obesity were also identified in these individuals. These whole genome sequences have provided a starting point for constructing a UAE reference panel, which will lead to improvements in the delivery of precision medicine and quality of life for affected individuals, and to a reduction in healthcare costs. The information compiled will likely lead to the identification of target genes that could potentially lead to the development of novel therapeutic modalities.

The two whole genome sequences reported here are the first ever described for Emiratis and add to other Middle Eastern data in the 4 WGS from the Kuwait genome project [28][29][30] and 104 WGS from Qatar 31.

Results
Information on subjects and alignment statistics. Two citizens of the United Arab Emirates (UAE) were sequenced in this study. The first (UAE S001) participant was a male aged 87 years. He was diagnosed with hypertension, dyslipidemia, diabetes mellitus and psoriasis. His sample was analyzed using Principal Component Analysis (PCA) and supervised admixture analysis in which all 51 populations from the Human Genome Diversity Project (HGDP) database were used as possible ancestral populations 32 . This analysis showed an admixture ratio of 2.78% (Sub-Saharan Africa), 0.001% (North Africa), 36.96% (Middle East), 54.31% (Central/South Asia), 0.001% (East Asia), 0.001% (Oceania), 5.93% (Europe) and 0.001% (America).
The second sample (UAE S002) was from an 87-year-old Emirati female diagnosed with hypertension. Results from the PCA and supervised admixture analysis showed an admixture ratio of 3.28% (Sub-Saharan Africa), 2.69% (North Africa), 35.93% (Middle East), 51.31% (Central/South Asia), 2.97% (East Asia), 3.77% (Oceania), 0.001% (Europe) and 0.001% (America). Figure 1 shows the principal components, with the admixture ratios of the two Emirati samples shown as pie charts. These two individuals are shown in the context of genotyping data of other UAE citizens from the Emirates Family Registry and data compiled through the Human Genome Diversity Project (HGDP), which includes individuals of African, Central/South Asian, Eastern Asian, Native American, European and Oceanian descent.
Table 1 summarizes the alignment and genome coverage data for the whole genome sequences of UAE S001 and UAE S002. Read lengths of 151 and 152 base pairs (bps) were generated, covering the whole genome at 27X and 31X for UAE S001 and UAE S002, respectively. The total number of reads that passed quality control (QC) exceeded 839,000,000 for both individuals. In total, 712,659,088 (83.7%) of reads were mapped or aligned properly to the reference genome, hg19 33,34, for UAE S001. The total number of reads mapped to the reference was higher for UAE S002 at 826,900,438 (98.5%). The proportion of reads mapped in proper pairs was 83.7% and 98.5% in UAE S001 and UAE S002, respectively. There were 857,112 singletons in UAE S001 and 3,887,602 in UAE S002.
Y-chromosome and mitochondrial haplogroups of the participants. The Y haplogroup was determined for UAE S001 using AMY-tree and yHaplo. Both tools indicated that this individual belonged to the Y haplogroup Q1a2b2 (Q-L933). The Q haplogroup was found to have originated in Central Asia and Southern Siberia, subsequently migrating toward Eurasia, and arriving in the Arabian Peninsula [35][36][37].
Annotation of SNPs and indels. Through the annotation process, variants were classified based on their impact, functional class, and by type within the different genomic locations. These classifications were defined based on SnpEff annotation. Table 5 provides a summary of variants that were categorized into high, low, moderate, and modifiers based on their genomic impact. From UAE S001 and UAE S002 respectively, 99.43% and 99.44% of the total variants were modifiers. The number of total variants with low impact was almost 24 times the number of total variants with high impact in both samples.

Variant counts by category:
Class     Category       UAE S001    UAE S002
Variants  Total          3,994,521   4,031,580
Variants  'true'         3,835,491   3,865,759
Variants  'not listed'   159,030     165,821
SNPs      Total          3,350,574   3,373,501
SNPs      'true'         3,283,240   3,302,437
SNPs      'not listed'   67,334      71,064
Indels    Total          643,…

Table 6 presents variants of the two genomes classified into four functional classes. The number of total variants of each functional class (missense, nonsense, silent, or none identified) for UAE S001 was similar to that of UAE S002. Tables 7 and 8 are summaries of variants classified into 23 groups according to genomic location. Furthermore, the two tables summarize (in brackets) the number of 'true' and 'not listed' variants that overlap with poorly resolved or low complexity regions, which include segmental duplications, rDNA chromosome arms, centromeric and telomeric regions, large retrotransposable elements, and others, as provided by the UCSC Table Browser 41 for samples UAE S001 and UAE S002.
Most of the 'true' and 'not listed' variants lie in intergenic regions (52.58% of the total variants for UAE S001, and 52.71% of the total variants for UAE S002), followed by those that lie in the introns. It is also worth noting that >50% of the SNPs and >68% of the indels that are intergenic variants are located in the low complexity regions. Table 9 summarizes the variants of UAE S001 and UAE S002 (listed or not) with respect to GnomAD, showing a significant increase in the true variants in comparison to dbSNP 138. Additionally, Table 10 is a summary of the genic variants that are not listed with respect to GnomAD for both samples.

Variants associated with specific diseases. It is important to delineate the genotype-disease association for personal genomes by relating the variants to potential susceptibility for certain disorders. The 23 genomic classes were further annotated according to the clinical significance of the variant (pathogenic, likely pathogenic, drug-response, risk-factor, affection, and association) with reference to the ClinVar and OMIM databases (Table S1). Figure 2 shows the clinical significance classification based on the databases used and the number of variants identified in each class for the two UAE participants.
Concordance in SNP calls between the deep sequencing experiment and genotyping experiment using Bead Chip array. Next Generation Sequencing (NGS) results for the UAE S001 sample were compared to genotyping data obtained for the subject using the Illumina Omni 5 Exome bead chip technology. After applying quality control, the intersection of the remaining SNP positions and the single nucleotide variant calls from UAE S001 NGS yielded 226,007 SNPs. Of these, 275 (or 0.12%) were not concordant. Similarly for UAE S002, the comparison of NGS and array data yielded 160,608 SNPs. Of these, 111 (or 0.069%) were not concordant.
www.nature.com/scientificreports
Comparing the sequenced genomes with individual genomes from other continents. A phylogenetic tree comparing subjects UAE S001 and UAE S002 with Human Genome Diversity Project (HGDP) data and additional available data of a Kuwaiti genome 29,42 was constructed using the neighbor-joining method and is shown in Figure 3. The two local samples cluster with genome data from the Kuwaiti study and near the population representing Central/South Asia. All populations fall into their respective clades. However, European and Middle Eastern subjects fall into the same cluster. The fact that they are not in entirely separate subclades can possibly be attributed to the limited number of common variants available for analysis, with only 20,658 common variants used.
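The discordance percentages quoted in the concordance comparison above follow directly from the reported counts. The sketch below is illustrative only: `count_discordant` and `discordance_rate` are hypothetical helper names, the toy genotype dictionaries are invented for the example, and the study's own comparison script is not reproduced here.

```python
def count_discordant(ngs_calls, array_calls):
    """Compare genotype calls at SNPs present in both data sets.

    Both arguments map a SNP identifier to a pair of alleles.
    Allele order is ignored, so ("A", "G") matches ("G", "A").
    Returns (number of shared SNPs, number of discordant SNPs).
    """
    shared = ngs_calls.keys() & array_calls.keys()
    discordant = sum(
        1 for snp in shared
        if sorted(ngs_calls[snp]) != sorted(array_calls[snp])
    )
    return len(shared), discordant


def discordance_rate(n_compared, n_discordant):
    """Percentage of compared SNPs whose calls disagree."""
    if n_compared <= 0:
        raise ValueError("n_compared must be positive")
    return 100.0 * n_discordant / n_compared


# Toy data (hypothetical identifiers, for illustration only).
ngs = {"snp1": ("A", "G"), "snp2": ("C", "C"), "snp3": ("T", "T")}
array = {"snp1": ("G", "A"), "snp2": ("C", "T"), "snp4": ("A", "A")}
toy_shared, toy_discordant = count_discordant(ngs, array)

# Counts as reported in the text: (SNPs compared, SNPs not concordant).
reported = {"UAE S001": (226_007, 275), "UAE S002": (160_608, 111)}
for sample, (compared, mismatched) in reported.items():
    print(f"{sample}: {discordance_rate(compared, mismatched):.3f}% discordant")
```

Applied to the reported counts, the helper gives values that round to the 0.12% and 0.069% stated in the text.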
Further, the number of variants identified for both the UAE S001 (3,994,521) and UAE S002 (4,031,580) genomes was comparatively higher than the total of around 3.4 million identified from the whole genome sequence of an Indian individual 43 when aligned to hg19. Additionally, it was slightly higher than that seen in the sequenced individual (3,977,914) from the Persian subgroup of the Kuwaiti population (KWP1) 28. Figure 4 shows a Venn diagram of the total identified variants in the two UAE samples and KWP1, in which 1,729,424 variants were shared by all three samples.

Discussion
There is an intolerable gap in the human genome landscape. Despite the best efforts of the Human Genome Organization (HUGO), Haplotype Map (HapMap) and other international consortia, genome data from ethnic groups of the Arabic-speaking world are underrepresented. In a recent audit of genome data in the public domain, genome data from populations of the Middle East accounted for less than 1% 44.
Here, the whole genome sequences of two Emiratis obtained using next-generation sequencing (NGS) technology are presented. We report around four million genome variants, some of which are 'not listed' in the dbSNP 138 dataset. Furthermore, to determine the actual continental or population contributions for the two studied samples, ADMIXTURE was run in supervised mode with reference populations from HGDP. Figure 1 shows the principal component analysis and supervised admixture results for the two samples, both of which have contributions from Central/South Asian populations.
The Y-chromosome haplogroup (Q1a2b2 (Q-L933)) for the male sample (UAE S001) is consistent with the individual having origins in Central/Southern Asia. Furthermore, the mitochondrial DNA lineages of both individuals also indicate a maternal line from Central/Southern Asia.
The whole genomes of the two Emirati samples were sequenced at a coverage depth of greater than 27X. The distributions of variants were almost the same in the two Emiratis when compared with the human reference genome (hg19) 33,34. In UAE S001, 41.21% of genome-wide (gw) and 39.05% of autosomal (auto) variants were homozygous, and 58.79% (gw) and 58.67% (auto) were heterozygous. In UAE S002, 39.10% (gw) and 37.89% (auto) of variants were homozygous, and 60.90% (gw) and 59.21% (auto) were heterozygous. These proportions of homozygosity/heterozygosity were broadly in agreement with those reported from sequencing 100 Malay genomes using NGS 45.
SNPs and indels were checked against the dbSNP 138 database 40. Up to 96% of the SNPs that were identified were classified as 'true'. Of the total number of variants in UAE S001, 16.12% were indels. The proportion of indels in UAE S002 was similar, at 16.32%. Approximately 4% of the total variants identified in the two Emiratis were 'not listed': 3.98% for UAE S001 and 4.15% for UAE S002. Novel variants were as low as 0.01% when compared to GnomAD. Most of the 'true' and 'not listed' variants were localized to intergenic regions (52.58% of the total variants for UAE S001 and 52.71% of the total variants for UAE S002), followed by those that were in introns. It is also worth noting that >50% of the SNPs and >68% of the indels that were intergenic in nature were found in the low complexity regions (Tables 7 and 8).
This is consistent with the observations made in a Kuwaiti study.

Table fragment (values as extracted; column headers not recoverable):
Codon change plus codon deletion    54  0  52  0  2  56  0  53  0  3
Codon change plus codon insertion   26  0  25  0  1  30  0  28  0  2
Codon deletion                      23  0  21  0  2  25  0  25  0  0
Codon insertion                     49  0  48  0  1  42  0  40  0

The 'true' variants with high impact on protein coding in UAE S001 included 70 nonsense and 24 missense variants. In UAE S002, there were 63 nonsense and 24 missense variants. In addition, among the total coding variants identified as stop-gained or stop-lost, 14 in UAE S001 and 12 in UAE S002 were 'not listed' variants. Moreover, 'true' variants identified with loss of function (LOF) from the coding regions in UAE S001 and UAE S002 were categorized and are presented in Table S2. A set of 467 protein coding variants (384 'true' variants and 83 'not listed' variants) were annotated as loss of function in UAE S001. There were 451 loss of function variants (376 'true' variants and 75 'not listed' variants) in UAE S002. Two hundred and nineteen variants in UAE S001 and 220 variants in UAE S002 were homozygous, leading to complete loss of function. Of the 'true' variants annotated as loss of function, the majority were identified in splice site regions (119 in UAE S001 and 130 in UAE S002), followed by frameshift variants (75 in UAE S001 and 66 in UAE S002). On the other hand, only 2 homozygous modifier insertions were identified in the 3' untranslated region (3' UTR) in each of the genomic sequences.
To identify known and novel variants in the two samples, dbSNP 138 was used 40, with known and novel corresponding to 'true' and 'not listed' variants, respectively. Since more recent databases such as dbSNP 151 40 and the GnomAD 46 database are now available, these were also used as the basis for identifying those variants that are novel. For example, far fewer of the called variants from UAE S001 and UAE S002 were not listed in dbSNP 151 than were 'not listed' in dbSNP 138: for UAE S001 the number fell from 159,030 variants (dbSNP 138) to 55,489 variants (dbSNP 151), and for UAE S002 from 165,821 variants (dbSNP 138) to 57,734 variants (dbSNP 151). When compared with GnomAD, the number of unlisted variants decreased further (UAE S001: 45,087 variants; UAE S002: 47,339 variants), so that only around 28.35% and 28.5% of the 'not listed' variants for UAE S001 and UAE S002, respectively, were called 'novel' (not reported in GnomAD). This indicates that most of the previously 'not listed' variants were indeed genuine variants, as they were subsequently identified in GnomAD; a subset of these is classified by type within the different genomic locations in Table 9. When the genic regions of the two genomes were compared with GnomAD, 14,520 variants for UAE S001 and 15,102 variants for UAE S002 were not listed; these are summarized in Table 10.
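The shrinking set of unlisted variants described above can be checked with a few lines of arithmetic. This is a minimal sketch using only the counts quoted in the text; the dictionary layout and the `percent_remaining` helper are illustrative choices, not part of the study's pipeline.

```python
# Number of called variants not listed in each database, as quoted in the text.
UNLISTED = {
    "UAE S001": {"dbSNP 138": 159_030, "dbSNP 151": 55_489, "GnomAD": 45_087},
    "UAE S002": {"dbSNP 138": 165_821, "dbSNP 151": 57_734, "GnomAD": 47_339},
}


def percent_remaining(baseline, remaining):
    """Share of the baseline set that is still unlisted, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * remaining / baseline


for sample, counts in UNLISTED.items():
    baseline = counts["dbSNP 138"]
    for database in ("dbSNP 151", "GnomAD"):
        share = percent_remaining(baseline, counts[database])
        print(f"{sample}: {share:.2f}% of 'not listed' variants absent from {database}")
```

The GnomAD shares computed this way round to the roughly 28.35% and 28.5% reported for UAE S001 and UAE S002.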
The Transition/Transversion (Ti/Tv) ratio is commonly used as a quality measure for called variants and was calculated for both genome-wide and autosomal variants (Table 4). The Ti/Tv ratio of the 'true' variants was 2.069 for both individuals, in agreement with the expected range of 2.0 to 2.1 for whole genome sequencing 47. The values for 'not listed' variants were 1.258 and 1.356 for UAE S001 and UAE S002, respectively, which are lower than the expected ratio of 2. This could be because the VQSR target truth sensitivity in the variant calling pipeline was set at 99.9, which could have been excessively permissive. According to Cai et al. (2017), a VQSR target truth sensitivity of 90 was found to optimize the balance between the Ti/Tv ratio of the novel variants and retaining as many potential novel variants as possible 48. Therefore, the data were reanalysed using the more stringent VQSR target truth sensitivity of 90. The Ti/Tv ratio of the 'not listed' variants indeed increased to 1.619 and 1.88 for UAE S001 and UAE S002, respectively. Other reasons for the low ratio could include one or a combination of different factors, such as sequencing errors resulting in residual false positives, a relative deficit in transitions due to sequencing context bias, or a higher transition ratio that can result from low frequency variants 49. Furthermore, the autosomal values were found to be, as expected, lower than the genome-wide values, but the Ti/Tv ratios were not significantly different.
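For readers unfamiliar with the Ti/Tv measure discussed above, the following sketch shows how the ratio is obtained from a list of biallelic SNPs. It assumes variants are supplied as (reference allele, alternate allele) pairs; the function names are hypothetical and this is not the code used to produce Table 4.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True for purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snps):
    """Compute the transition/transversion ratio.

    `snps` is an iterable of (ref, alt) pairs. Indels, identical alleles
    and non-ACGT symbols are skipped rather than counted.
    """
    transitions = transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; ratio undefined")
    return transitions / transversions


# Two transitions (A>G, C>T), one transversion (A>C); the indel is ignored.
example = [("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A")]
example_ratio = ti_tv_ratio(example)
```

Because each base has one possible transition and two possible transversions, random errors push the ratio toward 0.5, which is why a low value among novel calls is read as a sign of residual false positives.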
In this study, several methods were used to estimate the genetic ancestry and to understand the admixture of the two samples that were chosen from the UAE population. The two samples were not chosen to represent all ethnic groups of the UAE population. Principal Component Analyses were performed on both UAE S001 and UAE S002 genomes to estimate their ethnic composition by correlating their genetic polymorphisms with data of different populations in the HGDP. The principal component based method is the most commonly used method for many large dense genotype datasets 50. The results of the genetic ancestry analysis illustrate the different ethnic backgrounds of the two individuals, with an influence from the Central/Southern region of Asia.
Genetic ancestry can also be deduced from mtDNA and Y chromosome haplogroups or by using multiple unlinked autosomal markers 51. To confirm the genealogical ancestry of the UAE S001 sample, the Y-chromosome haplogroup was determined using AMY-tree and yHaplo. The Q1a2b2 (Q-L933) haplogroup of the male subject, UAE S001, is a member of the Q haplogroup, which is most frequent among the Amerind 35. However, a study of 471 individuals with subclades of the Q haplogroup by Huang et al. (2018) concluded that the Q haplogroup originated from Central Asia and Southern Siberia and dispersed to the Amerind and subsequently to the whole of Eurasia and part of Africa 37. The Q haplogroup was found to have arrived in the Arab Gulf region, across Iran, from central Southern and Southeast Asia and was found to be abundant in the UAE, Iran and Pakistan 36.
Mitochondrial (mtDNA) haplogroups were determined for both samples using Haplogrep. The R2 + 13500 haplogroup was identified in UAE S001, a lineage which is mostly concentrated in Southern Pakistan and India 38,52. A study that focused on human mtDNA variation in Southern Arabia identified the presence of the R2 clade in Arabia and nearby regions 53. As for the UAE S002 sample, the G2a1 haplogroup that was identified is a lineage found mainly in Central Asia, with some overflow at low frequencies in adjacent regions including Iran and Southwest Asia 54.
The extent of variability in the two Emirati genomes, UAE S001 and UAE S002, was determined by comparison to genomes from different world populations. The two Emirati genomes cluster with a Kuwaiti genome. Additionally, both Emirati genomes clustered with the Central Asian group in reference to the HGDP dataset on the phylogenetic tree (Fig. 3), which is consistent with the rest of the analyses performed here. As elucidated earlier, migration and population movement were common events that occurred widely throughout the region spanning from Southern Asia across the Levant and the Arabian Peninsula to North Africa, supporting the likelihood of the admixtures found in the two genomes that were studied.
Disease susceptibility and many inherited traits are affected by interactions between different variants located in multiple genes spread across the genome 55. A total of 213 variants were identified in the splice site acceptor and splice site donor regions in UAE S001, with three variants of clinical significance. These include a known homozygous SNP (rs2004640) in the IRF5 gene that has been shown to be associated with Rheumatoid Arthritis, a heterozygous deletion (rs1799759) in the A2M gene that is a risk factor for susceptibility to Alzheimer's disease, and a heterozygous SNP (rs10774671) known to result in the loss of function of the OAS1 gene, a high impact risk factor for susceptibility to Type 1 Diabetes. As for UAE S002, 209 variants were identified in the splice site acceptor and splice site donor regions, of which only one is clinically significant. This heterozygous SNP (rs10774671) is known to cause loss of function in the OAS1 gene and is a high impact risk factor for susceptibility to Type 1 Diabetes.
Sixty-nine variants in the intronic regions of the sequence data of UAE S001 may have specific clinical relevance to the individual's reported medical history, such as diabetes, obesity and cholesterol. For example, two genotypes were linked with susceptibility to Type 2 Diabetes Mellitus (T2DM): the rs7903146 SNP in the TCF7L2 gene [OMIM: 125853], a heterozygous modifier affecting drug response, and the rs4402960 SNP in the IGF2BP2 gene [OMIM: 125853], a heterozygous risk factor modifier. Two other heterozygous genotypes in the WFS1 gene (the pathogenic SNPs rs10010131 and rs6446482) have also previously been shown to be associated with Type 2 Diabetes Mellitus. An obesity-linked protein coding variant, rs1421085 in the FTO gene, has previously been defined as a heterozygous risk factor modifier. A heterozygous protein coding variant, rs326 in the LPL gene, is a modifier known to be associated with high density lipoprotein cholesterol level quantitative trait locus 11. As for UAE S002, two heterozygous risk factors were related to susceptibility to Type 2 Diabetes Mellitus, specifically rs3792267 [OMIM:125853] and rs4402960 [OMIM:125853]. Another two heterozygous variants located within the WFS1 gene, rs10010131 and rs6446482, were associated with Non-insulin Dependent Diabetes Mellitus.
There were six clinically significant variants in the downstream region of the UAE S001 whole genome sequence. Of these, only two were of particular interest as they were heterozygous risk factors for Type 2 Diabetes Mellitus. Both rs11196205 [OMIM:125853] and rs122555372 [OMIM:125853] are variants located in the TCF7L2 gene, which has been widely studied as a marker for Type 2 Diabetes Mellitus.
There were 84 non-synonymous coding variants with missense function in the whole genome sequence of the UAE S001 participant. Of these, four variants were associated with Type 1 Diabetes (rs2476601, rs231775, rs237025, rs1131454), two with Maturity Onset Diabetes of the Young (rs5219, rs1169288), two with Type 2 Diabetes Mellitus (rs13266634, rs5219), and two with microvascular complications of diabetes (rs4880, rs854560). Moreover, three cholesterol-related variants were identified: the rs6180 variant in the GHR gene [OMIM:143890], a heterozygous risk factor for familial hypercholesterolemia; the rs5370 variant in the EDN1 gene, identified with heterozygous association with High Density Lipoprotein (HDL) cholesterol levels; and the rs5882 variant in the CETP gene [OMIM:143470], a heterozygous SNP associated with Hyperalphalipoproteinemia. Additionally, the variant rs1042714 located in the ADRB2 gene was identified as a risk factor for obesity with moderate impact. Another locus of particular interest was rs33980500 [OMIM: 614070] in the TRAF3IP2 gene, as it has been identified as a risk factor for Psoriasis, a skin-related condition. A hypertension-related variant was also identified as a protein coding risk factor residing in the NOS3 gene. Upon closer inspection of the whole genome sequence data of UAE S002, genetic variants related to diabetes, hypertension, cholesterol and obesity were present. In particular, two hypertension-related mutations were identified: a homozygous risk factor at the rs699 locus, which was found to have a missense functional class causing an amino acid change (M268T), and a heterozygous risk factor at the rs1799983 locus in the NOS3 gene causing an amino acid change (D298E).
It is important to note that these genetic variations alone do not provide a definitive diagnosis of a specific disorder. It is a challenging process to describe the genetic underpinnings and the genome architecture of common complex traits and multifactorial chronic diseases, as these are influenced by multiple loci and genetic factors 56, with contributions from the environment. Nevertheless, sequencing of whole genomes in the UAE will continue as it will give access to all variants, including 'true' and 'not listed' variants, which can be used to initiate functional studies to identify the contribution of causal variants to human phenotypes 57.
This study is a step that adds to the efforts in neighboring countries to address the deficiency in genomic data on populations of the Middle East. Importantly, a review of the literature in the PubMed and Science Direct databases has revealed a lack of information on the UAE. Despite smaller populations in Qatar and Kuwait, whole genome sequences are available 28,31. However, there have been no studies published on the whole genome sequence of the UAE population. Therefore, this presentation of the first ever whole genome sequences in the UAE is important as it is expected to lead to greater initiatives in genome-based medicine, including improved understanding of chronic disease among its populace and the development of new paradigms in medicine, specifically the establishment of precision, personalized and P4-type strategies 58.

Materials and Methods
Sample and DNA extraction. Prior to enrolment, the two subjects (UAE S001 and UAE S002) provided their written informed consent on a form that had been approved by the Institutional Ethics Committee (Institutional Review Board, IRB) of Mafraq Hospital in Abu Dhabi, United Arab Emirates (UAE). All experimental protocols were approved by the IRB of Mafraq Hospital in Abu Dhabi and all methods were performed in accordance with the guidelines and regulations of this IRB.
Subjects were also given a questionnaire to collect their historical and demographical information. To be included in the study, subjects had to be an adult (>18 years old) citizen of the UAE who understood their contribution to the study and was subsequently able to give consent.
Saliva samples were collected from the two subjects using the Oragene OGR-500 kit (DNA Genotek, Ottawa, Canada).
… the functional effect of a variant in the genome. Both classes of variants (SNPs and indels) were further categorized into 'true' and 'not listed'. The latter refers to variants that have not appeared or been annotated in dbSNP 138 40. The ClinVar database, which incorporates entries from the OMIM database, was used to determine the clinical significance, disease associations and linked phenotypes of the variants that were discovered. VCF miner 65, a graphical user interface, was used for sorting, filtering and querying information encoded in the VCF files. Furthermore, data files containing comprehensive information for centromeres, telomeres, short arms, segmental duplications, and repeats from UCSC Table Browser … (11,707), satellites (9,566), and others. Note that this included, in particular, repeat families such as 202 Alu families (part of the SINE repeat class), 310 L1 families and 115 L2 families (part of the LINE repeat class), and six SVA families (3,733 in total under repeat class 'other').
A filter for variants in these regions was applied in Python using an efficient interval-tree data structure.
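The region filter mentioned above can be illustrated with a small lookup structure. The sketch below is not the study's implementation: instead of an interval tree it merges overlapping regions per chromosome and answers point queries by binary search, which gives the same yes/no result for membership tests. Coordinates are assumed to be 0-based and half-open (BED style), and all function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_region_index(regions):
    """Index (chrom, start, end) regions for fast point lookups.

    Overlapping or touching intervals on the same chromosome are merged,
    so each chromosome maps to sorted, non-overlapping (starts, ends) lists.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def in_region(index, chrom, pos):
    """Return True if 0-based position `pos` lies inside any indexed region."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and pos < ends[i]


# Toy regions (invented coordinates, for illustration only).
low_complexity = build_region_index(
    [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
)
variants = [("chr1", 120), ("chr1", 300), ("chr2", 10), ("chr3", 5)]
flagged = [v for v in variants if in_region(low_complexity, *v)]
```

Since VCF positions are 1-based, a variant at VCF position `p` would be queried as `p - 1` under the coordinate convention assumed here. A true interval tree becomes preferable when the overlapping regions themselves, and not only a yes/no answer, need to be reported.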
Analyses of Y-chromosome and mitochondrial haplogroups. The Y-chromosome variants were called using yHaplo 66 and AMY-tree 67 to construct the haplogroup of the male participant (UAE S001). The default settings of the respective tools were used with the VQSR-filtered SNP set of the recalibrated VCF file; the tools place a male sample on the Y-chromosome tree based on lineage-defining marker SNPs in a top-down manner. The paired-end reads generated for the two samples were previously aligned to the reference, hg19. For the mitochondrial analyses, the mitochondrial sequence was realigned and mapped to the revised Cambridge Reference Sequence (rCRS) 68. The Haplogrep tool 69 was used to call the mtDNA haplogroups.
Genetic ancestry. For the purpose of defining the genetic ancestry of the UAE population, a cohort of 1,192 citizens of the country was genotyped using the Illumina Omni 5 Exome bead chip (Illumina Inc, San Diego, California, USA). The bead chip contains 4.6 million Single Nucleotide Polymorphisms (SNPs), and genotyping was part of a long-running project to establish an Emirates Family Registry for anthropological and disease association studies 70. The genotype data of these Emiratis were compared with the genotype data from the Human Genome Diversity Project (HGDP) using multidimensional scaling (MDS), a form of Principal Components Analysis (PCA). MDS was performed using PLINK 71 after quality control filtering of SNPs (Hardy-Weinberg equilibrium test with a significance of 0.001, minor allele frequency <1%, missingness <1%). This yielded a data set with 493 K SNPs for all samples. Subsequently, the principal components for UAE S001 and UAE S002 were plotted using Python 72 and Matplotlib 73.
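The SNP quality control thresholds listed above can be made concrete with a per-SNP check. The sketch below is an illustration under stated assumptions, not the PLINK procedure itself: it assumes that SNPs are removed when the minor allele frequency is below 1%, when more than 1% of genotypes are missing, or when the Hardy-Weinberg test p-value is below 0.001 (criteria comparable to PLINK's `--maf`, `--geno` and `--hwe` options), and it uses a one-degree-of-freedom chi-square test in place of an exact test. The exact command used in the study is not stated in the text.

```python
import math


def snp_qc_metrics(n_aa, n_ab, n_bb, n_missing=0):
    """Return (minor allele frequency, missingness, HWE p-value) for one SNP.

    n_aa, n_ab, n_bb are genotype counts; n_missing is the number of
    samples without a call. The HWE p-value comes from a chi-square
    goodness-of-fit test with one degree of freedom (an approximation).
    """
    called = n_aa + n_ab + n_bb
    if called == 0:
        raise ValueError("no called genotypes")
    missingness = n_missing / (called + n_missing)

    p = (2 * n_aa + n_ab) / (2 * called)  # frequency of allele A
    q = 1.0 - p
    maf = min(p, q)

    expected = (called * p * p, 2 * called * p * q, called * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival function, 1 d.f.
    return maf, missingness, hwe_p


def passes_qc(n_aa, n_ab, n_bb, n_missing=0,
              min_maf=0.01, max_missing=0.01, hwe_alpha=0.001):
    """Apply the three (assumed) filters; thresholds mirror those in the text."""
    maf, missingness, hwe_p = snp_qc_metrics(n_aa, n_ab, n_bb, n_missing)
    return maf >= min_maf and missingness <= max_missing and hwe_p >= hwe_alpha


balanced = passes_qc(250, 500, 250)        # in equilibrium, common, complete
no_hets = passes_qc(500, 0, 500)           # strong heterozygote deficit
sparse = passes_qc(250, 500, 250, 50)      # about 4.8% missing genotypes
```

Keeping the metric calculation separate from the pass/fail decision makes it easy to change thresholds or swap in an exact Hardy-Weinberg test without touching the rest of the filter.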
Validation of SNP calls. The Illumina Omni 5 Exome bead chip data used for the genetic ancestry analysis were reused for the concordance calculations. The variant calling file (VCF) generated after the recalibration steps for UAE S001 was converted to PLINK's ped/map file format using vcftools 74. The final comparison between the two sets was performed with a custom Python script, concordance, which accounted for deviations from the reference genome (hg19) 33,34 and multiallelic loci using dbSNP 138 40.
Calculation of intergenome distance between the two samples' genomes and genomes from world populations. In order to contextualize the genomes of UAE S001 and UAE S002 in a comprehensive phylogenetic tree, their variants were compared against subjects from all world populations sampled during the Human Genome Diversity Project (HGDP) 75 and available data from an individual of similar South/Central Asian ancestry from a neighboring country, Kuwait. Due to the comparatively small variant set of the intersection dataset, the final overlap of variants was 20,658. Subsequently, all mutual intergenome distances were calculated using PLINK's identity-by-state (IBS) distance measure, which expresses distances as genomic proportions. The resulting distance matrix was subjected to Neighbor Joining using BioPython's Phylo module 76. The phylogenetic tree was visualized using iToL2 77.
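The distance calculation described above can be sketched in plain Python. The example assumes genotypes are coded as alternate-allele counts (0, 1 or 2, with `None` for a missing call) and defines the distance as the proportion of alleles that differ at sites called in both samples; this is meant to approximate an identity-by-state distance expressed as a proportion and is not a reimplementation of PLINK. The function names and toy genotypes are hypothetical.

```python
def ibs_distance(genotypes_a, genotypes_b):
    """Proportion of differing alleles between two samples.

    Each genotype is an alternate-allele count (0, 1 or 2) or None when
    missing. Sites missing in either sample are skipped. The result lies
    between 0.0 (identical) and 1.0 (opposite homozygotes everywhere).
    """
    differing = 0
    compared = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue
        differing += abs(a - b)
        compared += 1
    if compared == 0:
        raise ValueError("no sites called in both samples")
    return differing / (2 * compared)


def distance_matrix(samples):
    """Return (names, full symmetric matrix) of pairwise IBS distances."""
    names = sorted(samples)
    matrix = [
        [ibs_distance(samples[row], samples[col]) for col in names]
        for row in names
    ]
    return names, matrix


# Toy genotypes over four shared variants (invented for illustration).
toy = {
    "sample_A": [0, 1, 2, 1],
    "sample_B": [0, 1, 2, None],
    "sample_C": [2, 1, 0, 1],
}
names, matrix = distance_matrix(toy)
```

The lower triangle of such a matrix is the usual input to a neighbor-joining routine, for example the tree constructor in the BioPython Phylo module mentioned in the text.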