Population genetics and forensic utility of 23 autosomal PowerPlex Fusion 6C STR loci in the Kuwaiti population

This study evaluates the forensic utility of 23 autosomal short tandem repeat markers in 400 samples from the Kuwaiti population, of which four markers (D10S1248, D22S1045, D2S441 and SE33) are reported for the first time for Kuwait. None of the markers showed deviation from Hardy–Weinberg equilibrium or linkage disequilibrium between or within loci, indicating that these loci are inherited independently and that their allele frequencies can be used to estimate match probabilities in the Kuwaiti population. The low combined match probability of 7.37 × 10⁻³⁰ and the high paternity indices generated by these loci demonstrate the usefulness of the PowerPlex Fusion 6C kit for human identification in this population, and for strengthening the power of paternity testing. Off-ladder alleles were seen at several loci, and these were identified by examining their underlying nucleotide sequences. Principal component analysis (PCA) and STRUCTURE showed no genetic structure within the Kuwaiti population. However, PCA revealed a correlation between geographic and genetic distance. Finally, phylogenetic trees demonstrated a close relationship between Kuwaitis and Middle Easterners at a global level, and a recent common ancestry for Kuwait with its northern neighbours of Iraq and Iran at a regional level.

Before utilising this kit for criminal and relationship cases in Kuwait, population and forensic statistical data for the loci in the kit must be evaluated. In this study, we aim to increase the amount of genetic data available for the Kuwaiti population, using the 23 autosomal STRs in the PowerPlex Fusion 6C kit, of which four loci (D10S1248, D22S1045, D2S441 and SE33) have not been reported before for Kuwait. In addition, we aim to evaluate the forensic utility of these autosomal STRs in this underrepresented region, and to investigate the utility of these markers in population genetic differentiation by examining the genetic distance between the Kuwaiti population and other global populations for which data are available.

Materials and methods
Samples and genotyping. Blood samples were collected on Whatman FTA cards (GE Healthcare Life Sciences, IL, USA) from 400 unrelated Kuwaiti individuals (253 males and 147 females). DNA was amplified directly, without quantification, from a 1.2 mm FTA card punch, according to the directions in the PowerPlex Fusion 6C manual, using a SureCycler 8800 thermal cycler (Agilent Technologies, CA, USA). Separation and detection of the DNA fragments were carried out using an Applied Biosystems 3500 Genetic Analyzer (Thermo Fisher Scientific) with the internal lane standard WEN ILS 500 and the allelic ladder provided with the PowerPlex Fusion 6C kit. Genotype determination and allele calling for the 23 autosomal loci only were carried out using GeneMapper ID-X software version 1.4 (Thermo Fisher Scientific).
Statistical analysis. Data analysis was carried out for the 23 autosomal loci only (the sex chromosomes are not included in this paper). Arlequin statistical software version 3.5 was used to calculate allele frequencies, to test for linkage disequilibrium, and to test for deviation from the Hardy-Weinberg Equilibrium 6. Forensic parameters, including the random match probability (RMP), discrimination power (DP), power of exclusion (PE), typical paternity index (TPI) and polymorphic information content (PIC), were calculated using STRAF (http://cmpg.unibe.ch/shiny/STRAF/), an online tool for STR data analysis 7.
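As an illustration of the forensic parameters listed above, the following is a minimal sketch that computes them for a single locus from a list of genotypes. It uses the commonly cited textbook definitions (RMP as the sum of squared genotype frequencies, DP = 1 - RMP, PE = h²(1 - 2hH²), TPI = 1/(2H), and the Botstein form of PIC, where H and h are the observed homozygote and heterozygote proportions). These formulas are an assumption here and may differ in detail from what STRAF implements; the function name and the toy genotypes are hypothetical.

```python
from collections import Counter


def forensic_parameters(genotypes):
    """Single-locus forensic parameters from a list of (allele, allele) tuples.

    Assumes textbook formulas; not taken from the STRAF source code.
    """
    n = len(genotypes)
    # Allele frequencies: each individual contributes two alleles.
    allele_counts = Counter(a for g in genotypes for a in g)
    p = [c / (2 * n) for c in allele_counts.values()]
    # Genotype frequencies, treating (a, b) and (b, a) as the same genotype.
    geno_counts = Counter(tuple(sorted(g)) for g in genotypes)
    hom = sum(c for (a, b), c in geno_counts.items() if a == b) / n
    het = 1 - hom
    rmp = sum((c / n) ** 2 for c in geno_counts.values())
    sum_p2 = sum(x * x for x in p)
    sum_p4 = sum(x ** 4 for x in p)
    # sum over i<j of 2 * p_i^2 * p_j^2 equals (sum p^2)^2 - sum p^4.
    pic = 1 - sum_p2 - (sum_p2 ** 2 - sum_p4)
    return {
        "RMP": rmp,
        "DP": 1 - rmp,
        "PE": het ** 2 * (1 - 2 * het * hom ** 2),
        "TPI": 1 / (2 * hom) if hom > 0 else float("inf"),
        "PIC": pic,
    }


# Hypothetical toy locus with two alleles and four individuals.
toy = [(8, 8), (8, 11), (11, 8), (11, 11)]
params = forensic_parameters(toy)
```

Under the TPI definition assumed here, a TPI of 8.333 would correspond to an observed homozygote proportion of 0.06.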

Intra-population genetic structure among Kuwaitis. Countries in the Arabian Peninsula, including Kuwait, have a high rate of consanguineous marriage, which causes differential distribution of alleles among families and tribes, resulting in population genetic stratification 8,9. Newly presented markers therefore must be assessed for the presence of any population structure, to avoid calculation of forensic parameters using inaccurate allele frequencies taken from the total population, rather than the relevant subpopulation. Stratification also negatively impacts discrimination power, because the chance of random individuals possessing similar genotypes is higher within a subpopulation than within the total population 10. Two methods were therefore used to detect genetic structure in the population: principal component analysis (PCA), and a Bayesian-based method implemented in STRUCTURE version 2.3.4 [11][12][13].
In order to demonstrate whether these two methods were able to cluster the samples into their real subpopulations, each sample was categorised into one of three ancestral subgroups (K = 3) based on the donor's surname. It has previously been found that the Kuwaiti population is mainly composed of settlers coming from three different regions: the Arabian Peninsula (from Saudi Arabia), the desert (representing nomadic tribes), and Persian countries (mainly from Iran) 9, 14-17 . On this basis, the samples were categorised into three groups: KW-1 (n = 162) representing individuals originating from the Arabian Peninsula, KW-2 (n = 163), which consists of those coming from Persian countries and Iraq (north), and KW-3 (n = 75) composed of Bedouin individuals coming from nomadic tribes. PCA was carried out on allele frequencies at the 23 autosomal STR loci for the different population groups KW-1-KW-3 using R software 33 and visualised using the factoextra package 34 .
In contrast to PCA, which is an unsupervised clustering algorithm, STRUCTURE (a Bayesian-based approach) takes a range of numbers of populations (K) in order to calculate the proportion of the genome of each individual in the sample originating from each inferred population 11. STRUCTURE calculates the likelihood of the data (X) for a range of K values, and the true number of K is determined by the maximal value of Ln P(X|K). However, Evanno et al. 18 found that the maximal value does not always provide the correct number of K in the data. Instead, the maximal value of the rate of change (Delta K) in Ln P(X|K) between successive K values accurately infers the true number of genetic clusters in the data 18. As such, both Ln P(X|K) and Delta K at each K were calculated and reported. STRUCTURE was run without population information, as recommended in the STRUCTURE documentation, in order to check whether the results approximately agreed with the separation of samples into their subgroups. Thus, the predefined groups (KW-1 to KW-3) were only included as a population label rather than as prior information for the analysis. The parameters for the analysis were set as follows: 'admixture' and 'correlated allele frequencies' models using 100,000 Markov Chain Monte Carlo (MCMC) steps for each run, with the first 100,000 discarded as a burn-in, and the inferred number of K was set from 1 to 10. At each K, the analysis was repeated five times in order to test the results for consistency. The results were visualised using CLUMPAK (Clustering Markov Packager Across K, available at http://clumpak.tau.ac.il/index.html) 19, and the best K was calculated using STRUCTURE HARVESTER (available at http://taylor0.biology.ucla.edu/structureHarvester/) 20.
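The Delta K criterion described above can be sketched as follows. This is a simplified illustration, not the STRUCTURE HARVESTER implementation: it assumes Delta K at each K is the absolute second-order difference of the mean Ln P(X|K) across repeated runs, divided by the standard deviation of Ln P(X|K) at that K. The function name and the log-likelihood values are hypothetical.

```python
from statistics import mean, stdev


def delta_k(lnp_runs):
    """Delta K per K from a dict mapping K to a list of Ln P(X|K) values.

    Delta K is undefined for the smallest and largest K tested, because the
    second-order difference needs both neighbours.
    """
    ks = sorted(lnp_runs)
    m = {k: mean(lnp_runs[k]) for k in ks}
    s = {k: stdev(lnp_runs[k]) for k in ks}
    result = {}
    for k in ks:
        # Skip K values without both neighbours or with zero spread.
        if (k - 1) not in m or (k + 1) not in m or s[k] == 0:
            continue
        second_diff = abs(m[k + 1] - 2 * m[k] + m[k - 1])
        result[k] = second_diff / s[k]
    return result


# Hypothetical Ln P(X|K) values from two runs at each K (illustration only).
lnp = {1: [-100.0, -102.0], 2: [-80.0, -82.0], 3: [-78.0, -80.0], 4: [-77.0, -79.0]}
dk = delta_k(lnp)
best_k = max(dk, key=dk.get)
```

Because the denominator is the run-to-run standard deviation, K values at which repeated runs disagree strongly are penalised, which is one reason consistency across runs is reported alongside Delta K.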
Inter-population genetic structure and population relationships. To [...] CEPH) using the online forensic STR frequency browser, popSTR (http://spsmart.cesga.es/popstr.php) [21][22][23]. Data from Lebanon (LEB) and an Indian (IND) population from Madhya Pradesh typed for the 23 autosomal loci 24 were also included in the analysis. Genetic distance was also assessed at a regional level using allele frequencies for the 13 [...]), and Iraq (IRQ 32 and IRQ1 2). PCA analysis was conducted using R software 33 and visualised using the factoextra package 34.
In addition to the PCA, we studied the genetic relationship between the Kuwaiti samples and the other populations at both the continental and regional levels, using phylogenetic trees. These trees were constructed using pairwise genetic distances (D_A) based on Nei et al. 35, which were calculated from the allele frequencies of the populations using POPTREE2 software 36. Neighbour-joining (NJ) trees were constructed using MEGA X software version 10.0.5 37.
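For readers unfamiliar with the D_A distance, a minimal sketch of its calculation from allele frequencies is given here. It assumes the usual definition, D_A = 1 - (1/r) Σ_j Σ_i sqrt(x_ij · y_ij), where r is the number of loci and x_ij and y_ij are the frequencies of allele i at locus j in the two populations. The function name and the frequency values are hypothetical, and this is not the POPTREE2 code.

```python
from math import sqrt


def nei_da(pop_x, pop_y):
    """Nei's D_A between two populations.

    Each population is a dict mapping locus name to a dict of
    allele -> frequency. Only loci present in both populations are used.
    """
    loci = sorted(set(pop_x) & set(pop_y))
    if not loci:
        raise ValueError("no shared loci between the two populations")
    total = 0.0
    for locus in loci:
        fx, fy = pop_x[locus], pop_y[locus]
        # Alleles absent from one population contribute zero to the sum.
        total += sum(sqrt(fx[a] * fy[a]) for a in set(fx) & set(fy))
    return 1 - total / len(loci)


# Hypothetical allele frequencies at two loci (illustration only).
pop_a = {"TPOX": {"8": 0.5, "11": 0.5}, "SE33": {"17": 1.0}}
pop_b = {"TPOX": {"8": 0.5, "11": 0.5}, "SE33": {"18": 1.0}}
d_same = nei_da(pop_a, pop_a)
d_diff = nei_da(pop_a, pop_b)
```

The distance is 0 for identical frequency profiles and reaches 1 when the two populations share no alleles at any locus; a matrix of such pairwise values is the input to neighbour-joining tree construction.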
Ethics statement. The study was performed in accordance with the University of Strathclyde code of practice on investigations involving human beings, and ethical approval (reference number DEC18/PAC06) was granted by the Department of Pure and Applied Chemistry Ethics Committee. Written, informed consent was obtained from all participants prior to sampling.

Results and discussion
Allele frequencies and forensic performance. Full PowerPlex Fusion 6C STR profiles were recovered from blood samples taken from 400 Kuwaiti individuals. Table 1 shows the allele frequencies and forensic parameters calculated for these samples. Similar to studies of other global populations 38,39, SE33 was the most discriminative locus in the Kuwaiti population, having 45 different alleles (PIC = 0.945). In contrast, TPOX was the least discriminative locus, with only eight different alleles (PIC = 0.616). The calculated combined match probability (CMP) was 7.37 × 10⁻³⁰, meaning that the probability of observing two identical profiles for the 23 autosomal loci in the Kuwaiti population was 1 in 1.36 × 10²⁹. The TPI ranged between 1.439 (TPOX) and 8.333 (SE33), and the combined PE was > 99.9999%. These high values indicate the usefulness of the PowerPlex Fusion 6C kit for both human identification and paternity testing in the Kuwaiti population.
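The arithmetic behind the combined values can be sketched as follows, assuming the loci are independent so that per-locus match probabilities multiply and the combined PE is one minus the product of the per-locus non-exclusion probabilities. The function names and the per-locus inputs are hypothetical; only the reported CMP of 7.37 × 10⁻³⁰ is taken from the text.

```python
from math import prod


def combined_match_probability(rmps):
    """Product of per-locus random match probabilities (independent loci)."""
    return prod(rmps)


def combined_power_of_exclusion(pes):
    """One minus the product of per-locus non-exclusion probabilities."""
    return 1 - prod(1 - pe for pe in pes)


# Reported combined match probability, expressed as "1 in N".
cmp_reported = 7.37e-30
one_in = 1 / cmp_reported  # roughly 1.36e29

# Hypothetical per-locus values (illustration only).
toy_cmp = combined_match_probability([0.1, 0.05])
toy_cpe = combined_power_of_exclusion([0.5, 0.5])
```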

Statistical analysis of populations. No significant deviation from the expectations of the Hardy-Weinberg Equilibrium was detected at any locus in the Kuwaiti genotypic data. The alleles at each PowerPlex Fusion 6C autosomal STR locus can therefore be treated as independent, and expected genotype frequencies can be estimated from the allele frequencies. Association between alleles at all possible pairwise combinations of loci was evaluated using the linkage disequilibrium test. Significant linkage disequilibrium was detected between 22 (of a total of 253) pairs of loci (p < 0.05). However, after Bonferroni correction of the significance level using the number of tests (0.05/253 = 0.000198), none of the pairs of loci showed significant linkage disequilibrium, indicating that all loci are statistically independent. Therefore, their allele frequencies can be multiplied together to estimate match probabilities in the Kuwaiti population.
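The Bonferroni threshold quoted above follows directly from the number of locus pairs, as this short sketch shows. The helper function and the example p-values are hypothetical and are not taken from the study's results.

```python
from math import comb

n_loci = 23
n_pairs = comb(n_loci, 2)        # number of unordered locus pairs
alpha = 0.05
alpha_adjusted = alpha / n_pairs  # Bonferroni-corrected significance level


def significant_after_bonferroni(p_values, alpha=0.05):
    """Return the p-values that remain significant after Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p for p in p_values if p < threshold]


# Hypothetical p-values for three tests (illustration only).
kept = significant_after_bonferroni([0.04, 0.001, 0.3])
```

With 23 loci there are 253 pairs, so a pair would need p < 0.000198 to remain significant after correction.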
Off-ladder and novel alleles. Alleles that could not be identified using the GeneMapper allelic ladder for the PowerPlex Fusion 6C kit were assigned as off-ladder (OL) alleles, and were observed in 13 samples. These samples were re-amplified for confirmation and all OL alleles were confirmed. OL alleles were observed at the PentaE (5 alleles), PentaD (1 allele), D22S1045 (1 allele), SE33 (5 alleles), and D18S51 (1 allele) loci. The samples had previously been sequenced using the Verogen ForenSeq DNA Signature Prep kit (manuscript in preparation), and these data were examined to determine whether the undesignated alleles at the PentaE, PentaD, D22S1045 and SE33 loci could be identified; the repeat structure sequences from this dataset are shown in Table 2 and permitted all alleles to be identified. The D18S51 locus is not included in the ForenSeq kit; its OL allele was therefore identified using the allelic ladder bins created in GeneMapper software.
All of the identified alleles have been reported previously in the STRBase database (an online STR database created by the United States National Institute of Standards and Technology (NIST) 40 ), except for the PentaD 11.2 allele, which is a novel allele not reported before in the literature.
Intra-population genetic structure. Markers used for human identification may have weaker discrimination power in structured populations than in unstructured ones, because the presence of subpopulation groups affects the random match probability: individuals from the same subpopulation tend to possess similar alleles, so the likelihood of random individuals possessing similar genotypes increases in the presence of genetic structure 10. Although no significant deviation from the expectations of the Hardy-Weinberg Equilibrium was detected at any marker in this study, which argues against genetic stratification, it is still useful to assess whether the markers reveal any genetic clusters within the data. To achieve this, PCA was carried out on the DNA profiles obtained from the Kuwaiti samples for the 23 autosomal PowerPlex Fusion 6C markers. PCA is an unsupervised method that does not require any prior information about the ancestral origin of the samples; it groups the samples by their similarity to each other, so that homogeneous clusters of individuals can be seen on a PCA plot. As expected, the PCA plot (Fig. 1) did not show any pattern of segregation that could be related to the ancestral population of origin of the individuals in the data, indicating that there is no genetic structure within the sample.
Another widely used method to infer population structure in genetic data is the Bayesian-based model implemented in the STRUCTURE software, which calculates how likely each individual in the data is to belong to each of K populations (K being predetermined by the user), and then uses this information to assign individuals to population subgroups 18.
The analysis was run without population information, and the mean log likelihood across five repeated runs of the analysis for each value of K (from 2 to 10) was estimated. The results showed inconsistency in estimating the log likelihood at K = 5 and over, which is indicated by the high standard deviation (SD), as presented in Supplementary Figure S1A. Based on the method described in Evanno et al. 18 , the most likely inferred value of K was 7, as this is the number of populations at which the highest Delta K value was recorded (Supplementary Figure S1B).
In this study, 30% of individuals declared their origins as being from the north (Iraq and Iran), 39% from the south region (Saudi Arabia, Bedouin and Bahrain), and 24% had parents of different origin (admixed). Therefore, the closer genetic relationship of our samples to the northern region might be due to the presence of these individuals. There is no information available about the population of origin for the samples collected in the two previous Kuwaiti studies (KW1 3 and KW2 2 ). It is therefore not possible to determine whether sampling from different sub-populations could explain why, in contrast to our sample, these two Kuwaiti samples cluster more closely with the Saudi Arabian sample than the samples from Iran and Iraq. Overall, it can be seen that the allele frequencies of the 23 autosomal markers in the PowerPlex Fusion 6C kit can be successfully used to separate both geographically distant global populations and closely related populations on the basis of their genetic distance, making them a good choice for detecting genetic differentiation between populations.

Conclusion
This study evaluated the forensic utility of the 23 autosomal STR loci included in the Promega PowerPlex Fusion 6C kit for the Kuwaiti population. Among these loci, D10S1248, D22S1045, D2S441 and SE33 are reported for the first time for Kuwait. The genetic data indicate that these 23 autosomal STRs are highly polymorphic in the Kuwaiti population and are of high value for human identification and paternity testing. STRUCTURE and PCA analysis show no signature of genetic structuring of the Kuwaiti population into subpopulations. Comparison of the Kuwaiti population to other global populations indicates that Kuwait clusters with other Middle Eastern populations, and shows a close relationship with Iran and Iraq, suggesting that they may share common ancestry.