Introduction

The identification of perpetrators of crimes by DNA investigations may be hindered by the absence of a reference sample from the offenders or database hits. In such cases, there is a need for additional DNA analyses that can lead the investigation in a specific direction1,2. Analysis of ancestry informative markers (AIMs) may be used to infer the biogeographic ancestry of the individual that left the trace sample at the crime scene3. Sets of AIMs can be combined to target different geographical regions on large or small geographical scales. The resolution of ancestry at the continental level, e.g. Africa, Europe, East Asia, and the Americas has been achieved using commercially available panels including the Precision ID Ancestry Panel and the ForenSeq DNA Signature Prep Kit4,5,6,7,8,9. However, such panels have limited success with population assignment of individuals with admixed ancestries and may not be able to differentiate individuals from regions that have experienced multiple population migration events.

From a population genetic perspective, the Middle East is a particularly interesting region. The Middle East (covering Turkey in the West to Afghanistan in the East) connects Africa, South-Central Asia, and Europe and has a history of many human migration events10. Within the last 5000 years, many different powerful empires have dominated the region, fighting each other for political control and subsequently declining in power after relatively short periods of reign11. As a consequence, the borders in the Middle East changed and parts of the populations migrated each time a new empire gained power12,13,14. Migration of individuals to and from the surrounding regions further reduced the level of genetic divergence between the Middle Eastern, European, and South-Central Asian populations 15. The current political borders between the countries in the Middle East were agreed between the United Kingdom and France in 191611,16, and they do not reflect the genetics of the Middle East populations12,13,17,18. Furthermore, the Middle East is situated in the centre of an allele frequency gradient from North-Western Europe to East Asia15,19. This makes it particularly difficult to differentiate individuals from the Middle East, Europe, and South-Central Asia15.

The EUROFORGEN North Africa and Middle East (NAME) panel is a regional ancestry panel for ancestry inference of individuals from the Middle East and North Africa20. It includes 111 AIMs and was designed to improve the population assignment of individuals from the Middle East, North Africa, and Europe. The AIMs were amplified with four PCR multiplexes and analysed using the MassARRAY System. Each multiplex required 2.5–10 ng DNA, which is more than what is recovered from the majority of crime scene samples. Therefore, a custom AmpliSeq assay was developed for the Ion S5 platform21. Of the 111 loci in the EUROFORGEN NAME panel, 102 loci were successfully amplified and sequenced with as little as 0.5–1 ng DNA.

In this work, we typed 1098 individuals from 14 populations from the Middle East, North Africa, North-East Africa, South-Central Asia, and Europe for 265 AIMs using the AmpliSeq EUROFORGEN NAME panel and the Precision ID Ancestry Panel (Thermo Fisher Scientific)6. We used the method developed for the GenoGeographer software22 to assess whether the use of the combination of the EUROFORGEN NAME and Precision ID Ancestry Panel markers would improve the correctness of the assignment of individuals to their population of origin. It was further investigated if the combination of the panels increased the weight of the evidence of the assignment of the population of origin.

Results

Of the 1098 individuals analysed, 336 were typed with the MassARRAY EUROFORGEN NAME assay20, and 762 were typed with the custom AmpliSeq EUROFORGEN NAME panel21. Only the 102 loci that were included in both assays were used for the population genetic and ancestry analyses below. The information concerning the physical position and rs-numbers of the loci included in the AmpliSeq design is shown in Supplementary Table S1. All samples were also typed for the 165 AIMs of the Precision ID Ancestry Panel6,21 and in this work. Two AIMs, rs12913832 and rs4833103, were present in both the EUROFORGEN NAME panel and the Precision ID Ancestry Panel. These two AIMs performed best in the EUROFORGEN NAME panel, and the results from the Precision ID Ancestry Panel were not used. Of the 1098 individuals, 28 individuals had no genotype calls in more than 10% of the loci. The data of these individuals were excluded from further analysis. Data for the remaining 1070 individuals were used for the downstream analyses.

The data obtained with the EUROFORGEN NAME and Precision ID Ancestry panels were tested separately for Hardy–Weinberg equilibrium (HWE). For the EUROFORGEN NAME panel, the data of the AIM rs7873963 was in Hardy–Weinberg disequilibrium in five populations (Pcor = 4.9E-04). There was an excess of homozygotes of the T allele, which was caused by a deletion downstream of the locus that was associated with the C allele. Only samples typed with the MassARRAY assay were affected by the deletion; the locus was in HWE in the populations typed with the AmpliSeq EUROFORGEN NAME panel. The locus, which was also in linkage disequilibrium (LD) with another locus (see below), was excluded from further population genetic analysis.

The HWE was also assessed for the markers present in the Precision ID Ancestry panel. After Bonferroni correction, the AIM rs310644 was in Hardy–Weinberg disequilibrium in the Pakistani and Portuguese populations (Pcor = 3.07E−4). Among Portuguese individuals, 74 had the TT genotype, two had the CC genotype, while no heterozygote individual was observed. Among Pakistani individuals (N = 72), 43 individuals had the TT genotype, 13 the CC genotype, and 16 the CT genotype.

Linkage disequilibrium (LD) analysis was performed on the combined dataset including 265 AIMs with 34,980 pairs of loci. Besides LD most likely due to physical linkage, LD between alleles at different chromosomes was also observed. Supplementary Tables S4 and S5 show the pairs of loci that were in statistically significant LD in the different populations. Several loci in the EUROFORGEN NAME panel showed statistically significant LD. The HaploView software was used to evaluate if these loci could belong to haplotype blocks. The analysis showed that two groups of markers on chromosome 4 (rs4975193—rs1757928—rs337277—rs1699387, and rs17616434—rs4833103), one group on chromosome 7 (rs9649356—rs1227171), one group on chromosome 10 (rs2031581—rs2765650), and one group on chromosome 12 (rs10862511—rs10506882) seemed to form haplotype blocks. The loci rs1406045 (typed with the EUROFORGEN NAME panel) and rs4463276 (typed with the Precision ID Ancestry Panel) on chromosome 6 as well as rs621341, typed with the EUROFORGEN NAME panel, and rs6754311, typed with the Precision ID Ancestry Panel on chromosome 2 were in linkage disequilibrium (Supplementary Table S4). To ensure marker independence, one locus in each pairwise comparison was eliminated for the population genetic analyses. The performance of the loci in terms of heterozygote balance, locus balance, noise level, and the number of genotype drop-outs was evaluated and for each pair, the locus with the best performance was retained. If the loci performed equally well, preference was given to the locus with the shortest read length (Supplementary Table S6). After evaluating the LD, the final numbers of loci for further genetic analysis were 72 for the EUROFORGEN NAME panel and 161 for the Precision ID Ancestry Panel. The combined dataset included 233 SNP markers.

Genetic structure

The population variation of reference groups from Sub-Saharan Africa (N = 606), Europe (N = 604), the Middle East (N = 134), South-Central Asia (N = 689), and East Asia (N = 504) was analysed. Figure 1 shows a PCA plot of the combined data set with 233 AIMs. PCAs where each population is highlighted can be found in the supplementary materials (Supplementary Figures S1S14). The Sub-Saharan African, European, South Asian, and East Asian individuals were separated from each other by PC1 and PC2. The Middle Eastern individuals was located between the South Asian and the European individuals with a small overlap with the European individuals. The North African individuals were situated between the Sub-Saharan African and the Middle Eastern individuals, while the NE African individuals were found between the North African and Sub-Saharan African individuals. Supplementary Figure S15 shows a similar analysis based on the 72 EUROFORGEN NAME markers only. PCA analyses showed that the Middle Eastern individuals had a larger overlap with the Southern European populations from Greece and Albania than with the Danish individuals (Supplementary Figures S2, S3, and S5). There was a substantial overlap between the Middle Eastern and South-Central Asian populations mainly consisting of individuals from Afghanistan.

Figure 1
figure 1

PCA plot of the results obtained with the combined dataset of 233 AIMs included in the EUROFORGEN NAME panel and the Precision ID Ancestry Panel. The PCA were performed using a custom script written in R v. 3.5.0 using the ‘adegenet’ v. 2.1.2 and the ‘ade4’ v. 1.7-15 R packages43,44.

To evaluate the genetic structure of the populations, STRUCTURE analyses were performed using K = 3 to K = 7. Figure 2 shows the results for K = 4 to K = 6 for the 233 loci in the combined data set. The most likely number of clusters was K = 4 corresponding to the Sub-Saharan, East Asian, South-Central Asian, and European populations. Co-ancestry contribution from Sub-Saharan, European, and South-Central Asian populations was observed among individuals from North-East Africa and North Africa, whereas the Middle Eastern individuals shared cluster memberships with primarily the European populations and, to a smaller degree, South-Central Asians. With K = 6, an additional component was observed for the Middle Eastern, North-East African, and the European individuals. For the Middle Eastern individuals, the component differed from those of the North-East African and North African populations mainly due to the Sub-Saharan contribution to the latter populations, and it differed from the clusters of the European populations due to the South-Central Asian contribution to the cluster. Some variation within the European cluster was also observed at K = 6. South Europeans shared more cluster membership with the Middle Eastern, North-East African, and North African populations than the North Europeans. The STRUCTURE analysis performed with EUROFORGEN NAME markers only showed a similar pattern (Supplementary Figure S16).

Figure 2
figure 2

Diagram of the STRUCTURE analysis with runs of K = 4 to 6 of the combined dataset of 233 AIMs. The reference data are from the 1000 Genomes Project. Population abbreviations are the same as those described in Supplementary Materials Table S2. The membership proportions were plotted using CLUMPP v.1.1.22248 and Distruct v. 1.1.2349.

Population assignment based on z-score and LR

Based on the STRUCTURE and PCA results, the 14 populations typed in this work were grouped into five meta-populations: (1) a European meta-population including individuals from Albania, Denmark, Greece, Portugal, and Slovenia, (2) a Middle Eastern meta-population including individuals from Afghanistan, Iran, Iraq, Syria, and Turkey, (3) a North-East African meta-population including individuals from Eritrea and Somalia, (4) a North-African meta-population including individuals from Morocco, and (5) a South-Central Asian meta-population including individuals from Pakistan.

A z-score test was performed for each of the 1070 individuals using the GenoGeographer software and the cross-validation method22,23. This was done for the EUROFORGEN NAME panel (72 loci), the Precision ID Ancestry Panel (161 loci), and the combined dataset (233 loci). The AIM profiles were tested against both the individual’s meta-population of origin and the four other meta-populations. Table 1 shows the results of the z-score tests. The results of the test of each AIM profile against each meta-population with the three sets of AIMs were categorised as either “Accepted”, “Ambiguous”, or “Rejected” (Fig. 4).

Table 1 Effectiveness of the ancestry estimation with the Precision ID Ancestry Panel, the EUROFORGEN NAME panel, and the combined panel.

Irrespectively of the origin of the sample, the number of AIM profiles categorised as “Ambiguous” was lower with the combined set of markers than with the Precision ID Ancestry Panel. The reduction in the number of ambiguous profiles was most pronounced for individuals from the Middle East and South-Central Asia (Table 1). In both cases, the population assignments primarily changed from “Ambiguous” to “Accepted”. For example, 47.4% of the Middle Eastern individuals were classified as “Ambiguous” with the Precision ID Ancestry Panel, while only 36.7% were classified as “Ambiguous” with the combined panel. The percentage of Middle Eastern individuals in the “Accepted” category increased from 38.5% with the Precision ID Ancestry Panel to 49.9% with the combined panel. Furthermore, fewer Middle Eastern individuals, categorised as “Accepted” or “Ambiguous”, likely belonged to the European meta-population based on the genotypes generated with the combined panel (1.1% and 8.6%, respectively) compared to the genotypes generated with the Precision ID Ancestry Panel (3.3% and 15.6%, respectively).

For the North African and the North-East African meta-populations, the number of profiles assigned to the ‘Rejected’ category increased when the combined panel was used. Regarding North African individuals, four profiles classified as ‘Accepted’ and two profiles classified as ‘Ambiguous’ with the Precision ID Ancestry panel were assigned as ‘Rejected’ with the combined panel. For the North-East African individuals, three profiles (one defined as ‘Accepted’ and two as ‘Ambiguous’) were classified as ‘Rejected’ when the combined panel was used. These AIM profiles were outliers in all reference populations (z-scores > 1.64; P < 0.05) with the combined panel.

Figure 3 shows the distribution of the log LRs for all individuals with z-scores ≤ 1.64 (P ≥ 0.05) for their populations of origin. Overall, the combined panel (red distribution in Fig. 3) led to an increase in LRs compared to those of the two panels separately. The increase in LR for the combined panel was greatest when the AIM profiles of individuals from North Africa and North-East Africa were compared with those from individuals from Europe, the Middle East, and South-Central Asia, while it was smallest when the AIM profiles of individuals from (1) Europe and the Middle East and (2) the Middle East and South-Central Asia were compared.

Figure 3
figure 3

Distributions of log LRs for the individuals with z-score ≤ 1.64 (P ≥ 0.05) for the population of origin (listed in the plot headings). The colours refer to the panels used. The curves are based on smoothed kernel density estimates for the 1070 individuals. The hypothesized meta-population in the numerator of the LRs is given by the heading of each plot, while the hypothesized meta-population in the denominator is indicated to the left of the ordinate. R v. 3.5.0 and the ‘ggplot2’ v. 3.2.1 R package (https://ggplot2.tidyverse.org/index.html) was used to visualise the LRs.

Figure 4
figure 4

Diagrammatic presentation of the decisions for classification of the results of investigations of ancestry with AIMs.

Discussion

With commercial Massively Parallel Sequencing ancestry panels such as the Precision ID Ancestry Panel and the ForenSeq DNA Signature Prep Kit, a continental differentiation is now possible6,7,24. The panels work well as global ancestry panels; the purpose of which is to explore whether a DNA sample from an unidentified individual could originate from any of the major geographical regions of e.g. Sub-Saharan Africa, Europe, East Asia, and the Americas6,7,25,26,27,28,29,30. The interest is now shifting towards regional ancestry panels and their practical use for ancestry inference in forensic casework. From a Danish perspective, the ability to separate individuals of European descent from those of Middle Eastern, North African, and North-East African descent is particularly interesting due to the recent immigration events to Denmark from these regions. The effectiveness of separating individuals from the above-mentioned populations with a global ancestry panel is limited because the AIMs in these panels were selected to separate individuals from the major, continental populations31,32,33. Therefore, the use of regional ancestry panels for the separation of individuals on a finer geographical scale is relevant.

The design of custom panels with online design tools is relatively straightforward21,34. Thus, the number of custom-made panels for the assignment of individuals to specific population groups will most likely increase in the future. The EUROFORGEN NAME panel was designed for the identification of individuals of North African and Middle Eastern ancestries. The original MassARRAY version of the EUROFORGEN NAME panel included 111 loci amplified by four separate multiplex PCRs. In contrast, the AmpliSeq panel tested in this work amplified 102 loci in one multiplex PCR21. In the present work, it was assessed if the EUROFORGEN NAME AmpliSeq panel would be valuable as a supplement, i.e. a regional ancestry panel, to the Precision ID Ancestry Panel for the assessment of the ancestry of individuals most likely from either Europe, the Middle East, North Africa, North-East Africa, or South-Central Asia. Some loci of the EUROFORGEN NAME panel and/or the Precision ID Ancestry Panel were shown to be in linkage disequilibrium with each other. Therefore, 32 markers were excluded for further population genetic and ancestry analysis to maintain the a priori assumption that the genetic markers used for the analysis were independent and in HWE. Together, the Precision ID Ancestry Panel and the EUROFORGEN NAME panel include 233 different genetic SNP markers. The PCA plots and the STRUCTURE analyses (Figs. 1 and 2 as well as Supplementary Figures S1S16) confirmed that the combined panel increased the separation of the European and Middle Eastern population groups compared to that of the Precision ID Ancestry Panel alone. Although PCA and STRUCTURE analyses provide easy ways to visualise data clustering, these methods are empirical and not adequate for ancestry inference in forensic genetics, as they cannot calculate the statistical weight of the results.

For forensic genetic population assignment, sufficient population reference data and their geographical distribution play a major role4,7,23. The data from the reference populations are used to estimate the likelihood of an AIM profile. The population with the highest likelihood is often reported as the population of origin of that individual4,35. However, if the true population of origin is not present in the population reference data, this assignment is incorrect and misleading23. The method implemented in the GenoGeographer software22 includes a statistical z-score test that evaluates if an appropriate reference population is present among the reference data. Calculation of the statistical weight should not be performed if the AIM profile does not belong to any of the reference populations22. Here, we used GenoGeographer to evaluate if the combination of a global and a regional ancestry panel increased the differentiation power compared to each of the panels separately. The combination of panels affected the rates of “Accepted”, “Rejected”, and “Ambiguous” ancestry profiles in the different meta-populations and increased the LRs. The positive effect of using both panels was strongest for Middle Eastern and South-Central Asian individuals. When the combined panel was used instead of the individual panels, the number of “Accepted” ancestry profiles increased and the number of “Ambiguous” results decreased. Additionally, the rate of “Rejected” ancestry profiles increased when an individual was compared to an incorrect meta-population. For the North African meta-population (Moroccans) and the North-East African meta-population (Eritreans and Somalis), 15 AIM profiles were “Rejected” from their true meta-populations of origin with the Precision ID Ancestry Panel, while 24 AIM profiles were “Rejected” with the combined panel. These individuals may belong to populations that were not included among the reference populations, or they may have mixed ancestries. In both cases, investigations of neighbouring populations, including Sub-Saharan populations, would have been relevant.

LRs were calculated for individuals that were assigned to their assumed population of origin (“Accepted” and “Ambiguous” categories) (Fig. 3). The use of the combined panel increased the LRs for all comparisons. The reporting of the results of AIM testing in forensic genetics is based on the likelihood ratio principle that is recommended by the International Society for Forensic Genetics36,37. In the assessment of ancestry, however, there is a logic challenge in situations in which both likelihoods in the LR are based on hypotheses that are nonsense. Methods based on the likelihood principle are—on their own—not well suited for the evaluation of the plausibility of the hypotheses of the ancestry of an individual with a particular AIM profile. To avoid calculating LRs based on two nonsense-hypotheses—which would make such an LR of no use—we introduced a prior-test of the probabilities of the hypotheses based on the z-score test. If both hypotheses are rejected, no further calculation of LR is performed. In practical forensic genetic work, this means that LRs are only calculated if the AIM profile of the individual is accepted in at least one of the reference populations and the LR will give meaningful information. If an AIM profile is accepted to belong to only one population, it is still relevant to calculate the LRs based on comparison with all other relevant populations to assess the strength of the evidence. If an AIM profile is accepted to belong to two or more populations, the various LRs based on comparisons between the relevant populations can be calculated. If any LR is statistically significantly higher than any other, this will be strong evidence in favour of the individual belonging to the population resulting in the highest likelihood. Again, the strength of the evidence is given by the LR. It must, however, be taken into consideration that the pure fact that an AIM profile based on a z-score value ≤ 1.64 (P > 0.05) is accepted to belong to one—and only one—tested population does not prove that the individual belongs to that particular population. The individual may instead belong to another population that is genetically close to the tested population, or the individual may be of admixed ancestry with sufficient contribution from the tested population to allow the admixed AIM profile to be accepted into that population. The individual could also belong to a population that is not present in the reference database. Thus, in a case with a question to which of two populations an individual belongs, the first step would be to perform a z-test. If the AIM profile can belong to any of the tested populations, the LR is calculated as the weight of the evidence. If the results of the z-test indicate that the AIM profile for practical purposes cannot belong to any of the tested populations, no further calculation is performed, and the conclusion is that the contributor of the AIM profile does not belong to any of the proposed populations.

The analyses performed in this work demonstrated that the addition of the loci in the EUROFORGEN NAME panel to those of the Precision ID Ancestry Panel improved the efficiency of the population assignment of individuals from Europe, the Middle East, North Africa, North-East Africa, and South-Central Asia. However, when comparing the performance of the panels separately, the Precision ID Ancestry Panel had higher inclusion and rejection rates than the EUROFORGEN NAME panel, which was designed as a regional ancestry panel and not as a stand-alone panel.

Materials and methods

Samples, DNA extraction, and quantification

A total of 1098 samples were selected from the Biobank of the Department of Forensic Medicine, University of Copenhagen, Denmark (Danish agency for data protection, ref. no. 2002-54-1080). All samples were anonymized. According to the Danish Act on Research Ethics Review of Health Research Projects, the work did not require approval by the Ethics Committee. The authors confirm that all methods were performed in accordance with the relevant guidelines and regulations. A total of 336 individuals from Albania, Denmark, Greece, Iraq, Slovenia, Somalia, and Turkey were previously typed with the MassARRAY EUROFORGEN NAME assay20. To reach a minimum of 75 individuals per population23, samples from 220 individuals from these populations were typed with the AmpliSeq EUROFORGEN NAME panel (Supplementary Table S1)21. Furthermore, 542 individuals from Afghanistan, Eritrea, Iran, Morocco, Pakistan, Portugal, and Syria were typed with the AmpliSeq EUROFORGEN NAME panel. All samples were also typed with the Precision ID Ancestry Panel (Thermo Fisher Scientific), either in this or previous works6,23,31.

DNA was extracted from either blood samples using the QIAamp DNA Blood Mini Kit (Qiagen, Hilden, Germany) or from blood or buccal swabs on FTA cards (Whatman Inc., Clifton, NJ) with the BioRobot EZ1 Workstation (Qiagen, Hilden, Germany) and the QIAamp DNA Investigator Kit (Qiagen, Hilden, Germany).

DNA extracts were quantified with the Qubit dsDNA High Sensitivity (HS) Assay Kit (Thermo Fisher Scientific) and a Qubit 3.0 Fluorometer (Thermo Fisher Scientific).

Library preparation and DNA sequencing

DNA libraries were built using the AmpliSeq EUROFORGEN NAME panel and the Precision ID Ancestry Panel (Thermo Fisher Scientific) and the Ion AmpliSeq Library Kit. 2.0 (Thermo Fisher Scientific) according to the manufacturers’ recommendations, except for using 25 PCR cycles6 and preparing the libraries using half volume of reagents. The DNA input ranged from 0.3 to 1 ng. The DNA libraries were purified with Agencourt AMPure XP magnetic beads (Beckman Coulter Inc., CA, USA) with a Biomek 3000 Laboratory Automation Workstation (Beckman Coulter Inc., CA, USA)38,39. The barcoded libraries were quantified using a Qubit Fluorometer and the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific). The DNA libraries were subsequently pooled in equimolar concentrations (28–35 pM) using a Biomek 3000 Laboratory Automation Workstation (Beckman Coulter Inc., CA, USA)38,39. The template preparation (emulsion PCR, enrichment of Ion Sphere particles, and chip loading) was performed on an Ion Chef instrument (Thermo Fisher Scientific) using the Ion S5 Precision ID Chef Kit (Thermo Fisher Scientific). Approximately 80 libraries were loaded per Ion 530 chip. A negative control was included per chip. Sequencing was performed with an Ion S5 (Thermo Fisher Scientific) using the Ion S5 Precision ID Sequencing Kit (Thermo Fisher Scientific).

Data analysis

Sequence analysis was performed with data from BAM-files using the HID-SNP Genotyper v.5.2.2 plug-in on a Torrent Suite server. Target and hotspot BED-files were provided to specify the loci of interest in the human hg19 reference genome. The following default settings were used for the data analysis: Minimum allele frequency = 0.1, minimum coverage = 6, minimum coverage of either strand = 0, and maximum strand bias = 1. The plug-in generated CSV-files that were used for the downstream analysis. The Precision ID Ancestry Panel and AmpliSeq EUROFORGEN NAME DNA libraries were analysed separately as the datasets required different target and hotspot BED-files. To evaluate the quality of the data, the heterozygote balance (Hb) and noise level were calculated. The Hb was calculated as the number of reads for one nucleotide divided by the number of reads for the other nucleotide in the called genotype in the order A, C, G, and T. Noise was calculated as the number of reads that were different from the called genotype divided by the total number of reads. Two sets of genotype acceptance criteria were applied: (1) a minimum read depth ≥ 45 reads and 0.3 ≤ heterozygote balance ≤ 3.0; and (2) for loci with a read depth of 20 to 44, 0.7 ≤ heterozygote balance ≤ 1.3 and no noise. Genotypes that did not fulfil these criteria were named NN, i.e. no data.

Tests for Hardy–Weinberg equilibrium (HWE) of alleles were performed with the Arlequin v.3.5.2.2 software40 using 1,000,000 Markov chain steps. Pairwise tests for linkage disequilibrium (LD) between alleles was tested with Arlequin v.3.5.2.2 using an exact test. Only individuals with full profiles (no missing data) were used for the LD analysis. Since multiple tests were performed, the P values of analyses of HWE and LD were adjusted with the method of Bonferroni41 and indicated as Pcor. Data from loci that showed statistically significant LD were further investigated using the HaploView v. 4.2 software to assess whether the loci were included in haplotype blocks using the default model42.

Genetic structure

PCA and STRUCTURE analyses were performed with different reference data according to the availability of information concerning the markers. For the EUROFORGEN NAME markers, reference data included samples from the 1000 Genomes Project and HGDP-CEPH20 (Supplementary Table S2). For the combined dataset of EUROFORGEN NAME and Precision ID Ancestry Panels (referred to as ‘Combined’ in the following sections), PCA and STRUCTURE analyses were carried out using the reference data from the 1000 Genomes Project (Supplementary Table S3).

The principal component analyses (PCA) were performed using a custom-designed script written in R v. 3.5.0 using the ‘adegenet’ v. 2.1.2 and the ‘ade4’ v. 1.7-15 R packages43,44. The ancestral component of each individual was assessed using the software STRUCTURE v.2.3.445,46. The clustering analysis was carried out using 100,000 steps for burn-in followed by 100,000 MCMC steps. Three to seven clusters (K) were explored. For each K, 10 iterations were performed for the EUROFORGEN NAME panel, and three for the combined panel. We used the admixed model with correlated allele frequencies. Population information was used for reference populations to help cluster formation (POPFLAG = 1). The optimal K was evaluated using STRUCTURE HARVESTER47. The membership proportions were plotted using CLUMPP v.1.1.22248 and Distruct v. 1.1.2349. Results from the PCA and the STRUCTURE analyses were used to group the 14 populations into meta-populations.

GenoGeographer analysis

Three sets of data (the Precision ID Ancestry Panel, the EUROFORGEN NAME panel, and the combined panels) were compared to each other using the GenoGeographer software22,23. Using the z-score test, GenoGeographer first tested if the investigated AIM profile could be grouped with any of the reference populations in the database. The test included an estimation of the variance of the allele frequencies in the reference populations50. If the P value of the AIM profile belonging to a reference population was < 0.05 (z-score > 1.64), GenoGeographer rejected the hypothesis that the AIM profile belonged to that particular reference population (Fig. 4). For each AIM profile and marker panel, the z-score of the AIM profile was computed using cross-validation (out-of-sample) procedure. Here, the investigated AIM profile was compared with the reference AIM profiles by leaving out the investigated AIM profile. The z-scores were computed for all individuals against all meta-populations23. An AIM profile was included in the “Accepted” category if it was: (1) accepted in only one meta-population (z-score ≤ 1.64; P ≥ 0.05) or (2) accepted in more than one meta-population (z-score ≤ 1.64; P ≥ 0.05) and the likelihood of the AIM profile belonging to the population was statistically significantly higher than those of all other likelihoods (P < 0.05). An AIM profile was included in the “Rejected” category if it was not accepted in any meta-population (z-score > 1.64). An AIM profile was included in the “Ambiguous” category if it was (1) accepted in more than one meta-population (z-scores ≤ 1.64; P ≥ 0.05) and (2) the population likelihoods were not statistically significantly different from each other (z-scores ≤ 1.64; P ≥ 0.05). The likelihood ratios (LR) (Figs. 3, 4) were only calculated for individuals that were assigned “Accepted” or “Ambiguous”.

R v. 3.5.0 and the ‘ggplot2’ v. 3.2.1 R package (https://ggplot2.tidyverse.org/index.html) were used to visualise the LRs.