Unraveling a fine-scale high genetic heterogeneity and recent continental connections of an Arabian Peninsula population

Recent studies have showed the diverse genetic architecture of the highly consanguineous populations inhabiting the Arabian Peninsula. Consanguinity coupled with heterogeneity is complex and makes it difficult to understand the bases of population-specific genetic diseases in the region. Therefore, comprehensive genetic characterization of the populations at the finest scale is warranted. Here, we revisit the genetic structure of the Kuwait population by analyzing genome-wide single nucleotide polymorphisms data from 583 Kuwaiti individuals sorted into three subgroups. We envisage a diverse demographic genetic history among the three subgroups based on drift and allelic sharing with modern and ancient individuals. Furthermore, our comprehensive haplotype-based analyses disclose a high genetic heterogeneity among the Kuwaiti populations. We infer the major sources of ancestry within the newly defined groups; one with an obvious predominance of sub-Saharan/Western Africa mostly comprising Kuwait-B individuals, and other with West Eurasia including Kuwait-P and Kuwait-S individuals. Overall, our results recapitulate the historical population movements and reaffirm the genetic imprints of the legacy of continental trading in the region. Such deciphering of fine-scale population structure and their regional genetic heterogeneity would provide clues to the uncharted areas of disease-gene discovery and related associations in populations inhabiting the Arabian Peninsula.


Samples
We included samples of 620 Kuwaiti individuals from the State of Kuwait who were part of a larger cohort collected for studies of metabolic disorders [1][2][3]. The participants were recruited through the following means: (i) from visitors to clinics and Open Day Events at our institute (ii) from visitors to our campaigns at malls, primary health centers and blood banks in each of the five governorates of the State of Kuwait, and (iii) from visitors to our campaigns at Kuwait University, where students are enrolled from all the governorates. All participants were healthy and enrolled after obtaining written informed consent. The study protocol was approved by the Ethical Review Committee of the Dasman Diabetes Institute, Kuwait. Participant recruitment, sample collection, and related procedures were conducted in accordance with the tenets of the Declaration of Helsinki, as detailed elsewhere [1][2][3]. Each of the 620 individuals were assigned to subgroups (Kuwait-P, Kuwait-S and Kuwait-B) based on their surnames that inform their respective ancestral lineage i.e., Persian, Saudi Arabian Tribe, and Bedouin. The surname lineage classification was described in detail in our previous study [1].

Genotyping and quality control
All the 620 individuals were genotyped using HumanOmniExpress arrays for 730,525 SNPs (Illumina, Inc., San Diego, CA, USA) in accordance with the manufacturers' protocol. Quality control checks and data filtering were conducted using the PLINK (version 1.9) whole genome data analysis toolset [4]. The dataset was filtered to contain only autosomal SNPs with a genotyping success rate of >90%, a minor allele frequency of >5%, and a probability (p) value of >0.001 as determined using the Hardy-Weinberg exact test. Only samples with a genotyping success rate of >90% were included for analysis. To ensure that the study individuals are unrelated, we examined relatedness among them using the "-genome" feature in PLINK with the threshold PI_HAT > 0.125 i.e., up to third degree relatives and randomly removed one sample per pair of related individuals. After QC and relatedness filtering, 583 individuals and 587,819 SNPs met the inclusion criteria for analysis ( Supplementary Fig. 1). Genotype data has been deposited at the European Genome-phenome Archive (EGA), which is hosted by the European Bioinformatics Institute (EBI) and the Centre for Genomic Regulation (CRG), under accession number EGAS00001005034.

Datasets, merging, and phasing
We merged our data with the published genome-wide SNP genotype data of global populations (Supplementary Fig. 1

Population structure analyses: FST, PCA, ADMIXTURE, and RoH
To explore the population genetic structure, we initially computed the mean pairwise FST differences between all population groups using the Weir and Cockerham method [6] implemented in PLINK version 1.9 [4]. Next, we conducted PCA of the linkage disequilibrium-pruned combined dataset using the smartpca program included with the EIGENSOFT software package version 6.1.4 [7,8]. Further, we ran an unsupervised structure-like analysis using the ADMIXTURE tool (version 1.3.0) [9] on the same dataset 25 times at different time intervals, with K values ranging from 2 to 12. Notably, K = 9 was the best supported K value as determined from the lowest cross-validation indexes. RoH was estimated using PLINK version 1.9 [4] with a sliding window of 100 SNPs (1000 kb), allowing for one heterozygous and five missing calls per window.
Analyses to test admixture events and relative allele sharing: f3 and f4 statistics We computed the f3 and f4 statistics using the qp3Pop and qpDstat programs (with f4 mode: YES) implemented in the ADMIXTOOLS software package [10]. A dataset containing 244,688 SNPs We estimated the amount of Neanderthal ancestry present in the Kuwaiti subgroups using the f4-ratio implemented in the ADMIXTOOLS package [10]. We computed the Neanderthal fraction using the f4-ratio in the form f4 (Altai, Chimp, Target, Dinka) / f4 (Altai, Chimp, Vindija, Dinka) following Petr et al. (2019) [13].

Haplotype-based fine-scale analyses: ChromoPainter and fineSTRUCTURE
We used the phased dataset with 244,688 SNPs and 2139 individuals for haplotype-based analyses.
The ChromoPainter tool, designed to identify haplotypes in sequence data, was used to "paint" each individual as a combination of all other sequences [14]. As a first step, we estimated the -n (recombination scaling constant) and -M (per site mutation rate) parameters by running the software's EM option on a small subset of populations and five randomly selected chromosomes (3, 7, 10, 17, and 22), as described elsewhere [15]. The estimated values for the two parameters were -n = 510.86 and -M = 0.00033. Then, we executed the ChromoPainter tool in the "All vs All" mode (using the -a flag), where all individuals are considered as both donors and recipients.
Next, we analyzed the resulting painted dataset using the fineSTRUCTURE algorithm [14] to identify genetically homogenous clusters. We first ran the software performing 2 million burn-in iterations and 4 million MCMC iterations thinned every 10,000. This generated an MCMC file (.xml) that we used to build the tree structure using the option --T 1.
The analyzed individuals were initially classified into 233 clusters, which we reduced to increase the interpretability of subsequent analyses. More specifically, we iteratively "climbed the tree," and the combined branches consisted of less than five clusters if at least one of them was composed of less than five individuals. The obtained tree was further refined by pooling together pairs or triplets of clusters if the pairwise total variation distance (TVD) based on the number of chunks shared among members of a branch was >0.035. After refinement, 40 clusters remained.

Non-negative least square (NNLS)
Starting from the copying vectors obtained with the ChromoPainter tool, we reconstructed the ancestry profile of each cluster or individual by applying a slight modification of the NNLS function of R software version 3.5.1 (https://www.r-project.org/), as described elsewhere [15,16].
Therefore, for each individual belonging to a Kuwait cluster and for each of these clusters, we decomposed their ancestry as a mixture (with proportions summing to 1) of five (North/East Europe, Bedouins1, Yoruba, Druze, and North Africa clusters) and three (only North/East Europe, Bedouins1, and Yoruba) putative ancestral sources.

Exploring the variability of Kuwaiti individuals via pairwise TVD analysis
To obtain a detailed picture of the variation underlying modern-day Kuwaiti population, we determined the pairwise TVD [16,17]

Estimation of admixture dates
The times of admixture events were investigated using the GLOBETROTTER software [18]. We applied an "individual" approach by analyzing each Kuwaiti individual alone [19]. First, we estimated the time of the admixture event by applying the prop.ind=1, null.ind=0 approach to the 583 target individuals. Then, we performed 20 bootstrap iterations with the following settings: prop.ind=0, bootstrap.date.ind=1, and num.admixdates.bootstrap=1. For each of the inferred admixture events, we considered only those that were characterized by bootstrap values for the time of an admixture event between 1 and 400.
We also computed a weighted LD statistic estimating the date of admixture using ALDER version 1.03 that model the decay of admixture LD [20] to validate the admixture events calculated by ALDER and GLOBETROTTER are consistent.  Supplementary Fig. 3. Principal Component Analysis plot based on allele frequency representing the first two principal components, PC1 and PC2, accounting for 2.11% and 1.72% of the variation respectively. Supplementary Fig. 4. f3 -statistics with Papuans as outgroup in the form D(Papauans; Pop1, X) where Pop1 represents a Kuwaiti subgroup and X is a present-day population, comparing the genetic affinity of the Kuwait population subgroups to other global modern populations. The group color code for the populations on the y-axis: brown-Africa; blue-Arabian Peninsula; blue violet-Middle East; deep sky blue-Europe; coral-Caucasus; chartreuse-Central Asia; dark orange-South Asia.