Introduction

The remarkable progress made in DNA technology in the past decade has had an enormous impact on several disciplines, including forensic science. Identification of thousands of genetic markers, particularly the short tandem repeat (STR) loci, distributed throughout the human genome, and their analysis using polymerase chain reaction (PCR) based techniques, have tremendously augmented the efficiency of individual identification and of determining genetic relationships among individuals. Based on population genetic characteristics desired in forensic analysis, such as adherence to the expectations of Hardy–Weinberg equilibrium (HWE) and independence of alleles across loci, as well as ease of laboratory typing, a set of 13 STR loci (viz., D3S1358, vWA, FGA, D8S1179, D21S11, D18S51, D5S818, D13S317, D7S820, D16S539, CSF1PO, TPOX, TH01) has been established as the core genetic markers for use in DNA forensic analysis and parentage testing.1,2 These developments, together with the recommendations of the National Research Council (NRC)3 with respect to statistical interpretation of DNA evidence, have been instrumental in the worldwide acceptance of DNA evidence in the criminal justice system. However, the databases on these 13 loci are largely restricted to broadly defined population groups, such as US White, US Black and Hispanics. Although regional data from the United States have been compiled,4,5 and allele frequency data from Europe are being made available through web site presentations (http://www.uni-Duesseldorf.de/WWW/MedFak/Serology; http://www.cstl.nist.gov/biotech/strbase) in addition to occasional reports from some worldwide populations,6 data from ethnically defined populations, particularly isolated populations with smaller effective sizes, are relatively scarce.
Therefore, our knowledge remains limited as to whether population-related evolutionary forces, such as population bottlenecks and genetic drift among others, impact the dynamics of these loci and consequently affect their forensic use in specific populations. This is particularly relevant because, with the NRC3 recommendation for computing statistical significance of a DNA match becoming standard, the need for empirical estimates of worldwide values of genetic differentiation (FST or θ)7 among ethnically defined populations has become urgent.8,9,10

With these rationales, we have studied genetic variation at 9 STR markers, which are a subset of the 13 core forensic loci named above, in over 900 individuals drawn from 20 ethnically defined populations representing five major human groups. The objectives are to: (1) generate a worldwide database of allele and genotype frequencies; (2) test independence of alleles within and across loci in the examined populations; (3) estimate the coefficient of co-ancestry at the global as well as major group levels of population differentiation; (4) examine the genetic relationships amongst the sampled populations; and (5) evaluate average match probabilities with and without adjustments for population substructure effects in the studied populations. Additionally, we present the statistical power of using these loci for parentage testing by evaluating exclusion probabilities for each population database. Our data indicate that, in general, alleles across the nine studied loci are mutually independent in all populations, and pooling of population data by major geographic groupings introduces a coefficient of co-ancestry no larger than 3.5%, even for small isolated populations (eg, Native Americans). Consequently, even with population substructure adjustment,3 the estimated match probabilities do not increase by more than 10-fold compared to the ones predicted under the assumption of strict allelic independence. Evaluation of exclusion probabilities indicates that even in the small isolated populations, use of these nine loci offers an exclusionary power above 99.3%. However, in paternity testing with the mother's genotype unknown, and with paternity exclusion confirmed by at least two loci, the power of exclusion could fall below 80% in some isolated populations.
Therefore, while this worldwide database validates the use of these nine STR loci for DNA-based forensic identification and parentage testing purposes, supplementation with additional loci is recommended for parentage testing in small populations, particularly when the mother is not available for genotyping.

Materials and methods

Population samples

The 20 populations surveyed in this research include: Africans, viz. Sudanese (SUD), Nigerian (NIG), Benin (BEN), South Carolina Blacks (SCB); Caucasians, viz. German (GER), Spanish (SPN), United Arab Emirates (UAE), Brazilian White (BRA); Asians, viz. Chinese (CHN), Japanese (JAP), Kachari (KAC), Thai (THA), Kampuchean (KAM); Native Americans, viz. Dogrib (DOG), Ngöbé (NGB), Wounan (WON), Bri Bri (BRI), Pehuenche (PEH); and Oceanic, viz. Samoan (SAM), Papua New Guinea Highlanders (PNG). These populations are globally distributed and representative of large (African, Caucasian and Asian) and small, isolated (Native American and Oceanic) populations known to have undergone recent population bottlenecks. The Nigerian, Benin, Sudanese, South Carolina Black, German, Spanish, UAE Arab, Brazilian White, Chinese, Japanese, Thai and Kampuchean samples derive their names from their countries or regions of origin, and are representatives of the broadly defined ancestral groups to which they belong. The Kachari are a Tibeto-Burman speaking Mongoloid group from Northeast India. Of the American Indian groups, the Na-Dene speaking Dogrib population is distributed in the Northwest Territories of Canada; the Bri Bri from Costa Rica and the Ngöbé from Panama are Chibcha speaking groups; the Wounan from Panama are Chocoan speakers; and the Pehuenche are a group of Araucanian Indians from Chile. Among the Oceanic groups, Samoans are a Polynesian population distributed over the independent nation of Samoa and the US territory of American Samoa; the New Guinea Highlanders are sampled from the Central Highlands of Papua New Guinea. Further details of these populations can be found elsewhere.11,12,13,14,15,16

DNA analysis

We have used the Profiler Plus kit from Applied Biosystems, which is designed for co-amplification of the nine STR loci. Multiplex PCR amplification of these loci, viz., D3S1358, HumvWA, HumFGA, D8S1179, D21S11, D18S51, D5S818, D13S317, D7S820, and of the amelogenin locus was conducted following the protocol in the AmpFlSTR Profiler Plus PCR manual,2 with the only modification being the use of a 25 μl PCR reaction volume instead of the 50 μl described in the manual. The amplified products were separated on an ABI 377 DNA sequencer. GeneScan 3.1, AmpFlSTR Profiler Plus Template and Genotyper 2.5 (Applied Biosystems) software were used for sizing and genotyping.

Statistical analysis

As the studied loci are autosomal co-dominant, allele frequencies were computed by gene counting.17 Three tests were used for testing conformity with Hardy–Weinberg proportion of genotype frequencies, viz., exact test for multiallelic loci,18 log likelihood method,19 and the homozygosity test.20 The levels of significance for each test statistic were evaluated through 10 000 replicates of permutations of the observed alleles within each database. As the results of these three tests were in general congruent, we have reported in the text only the levels of significance of the exact test, since this is the most powerful of the three test procedures.21
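The gene-counting and permutation steps described above can be illustrated with a minimal sketch. It is not the authors' software: the function names are hypothetical, genotypes are assumed to be given as pairs of allele labels, and the test statistic is simply the number of homozygotes (a one-sided check for excess homozygosity), which is a simplified stand-in for the three tests cited.

```python
import random
from collections import Counter


def allele_frequencies(genotypes):
    """Gene counting: every individual contributes two alleles at a co-dominant locus."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = 2 * len(genotypes)
    return {allele: n / total for allele, n in counts.items()}


def homozygote_count(genotypes):
    """Number of individuals carrying two copies of the same allele."""
    return sum(1 for a, b in genotypes if a == b)


def permutation_hwe_pvalue(genotypes, n_perm=10_000, seed=0):
    """One-sided permutation p-value for excess homozygosity.

    Alleles are shuffled and re-paired at random, which mimics random union
    of gametes; the p-value is the share of permutations with at least as
    many homozygotes as observed.
    """
    rng = random.Random(seed)
    alleles = [allele for genotype in genotypes for allele in genotype]
    observed = homozygote_count(genotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(alleles)
        permuted = zip(alleles[0::2], alleles[1::2])
        if sum(1 for a, b in permuted if a == b) >= observed:
            hits += 1
    return hits / n_perm
```

For example, `allele_frequencies([(14, 14), (14, 15), (15, 16), (16, 16)])` assigns frequency 0.375 to allele 14, since it is counted three times among eight genes.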

Mutual independence of alleles was tested by two test statistics, each of which utilized the nine-locus genotypes of individuals from each population. The first test statistic is the variance of the number of heterozygous loci across the individual DNA profiles in each database. This test statistic (s_k^2) detects the presence of linkage disequilibria across loci,22,23 which in the context of these unlinked loci, signifies the presence of population substructure within each database. The observed value of s_k^2 was compared with its 95% confidence limit estimate based on the assumption of mutual independence of alleles, analytically computed by the methods described in Brown et al22 and Chakraborty.23 The second test is based on the distribution of the number of shared alleles between all possible pairs of nine-locus DNA profiles of individuals within each population database. The expected distribution of allele sharing was analytically evaluated based on the theory described in Chakraborty and Jin.24 Concordance of the observed and expected distributions of allele sharing is the indicator of mutual independence of alleles, relevant for forensic application of such databases.
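As a rough illustration of the first statistic, the sketch below computes the observed variance of the number of heterozygous loci per profile and compares it with its expectation under independence, which is the sum of h_j(1 - h_j) over loci, where h_j is the heterozygosity at locus j. The function names and the input layout (one tuple of allele pairs per individual) are assumptions, and the analytical 95% confidence limit used in the study is not reproduced here.

```python
def heterozygous_loci_counts(profiles):
    """Number of heterozygous loci in each multi-locus profile."""
    return [sum(1 for a, b in profile if a != b) for profile in profiles]


def observed_variance(profiles):
    """Sample variance of the number of heterozygous loci across individuals."""
    k = heterozygous_loci_counts(profiles)
    n = len(k)
    mean = sum(k) / n
    return sum((x - mean) ** 2 for x in k) / (n - 1)


def expected_variance_under_independence(profiles):
    """Sum over loci of h_j * (1 - h_j), with h_j the observed heterozygosity.

    If loci are independent, the number of heterozygous loci is a sum of
    independent Bernoulli variables, so the per-locus variances simply add.
    """
    n = len(profiles)
    n_loci = len(profiles[0])
    total = 0.0
    for j in range(n_loci):
        h = sum(1 for p in profiles if p[j][0] != p[j][1]) / n
        total += h * (1.0 - h)
    return total
```

An observed variance well above the expected value would point to association between loci, for instance through hidden substructure; deciding what counts as "well above" requires the confidence limits of the cited methods.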

Estimates of co-ancestry measures were obtained by apportionment analysis of gene diversity and allele size variance, conducted first at the level of geographic grouping of populations (five groups, as mentioned earlier), and second, by using two level substructuring (among five groups, and between-populations within each group), using the theory of Chakraborty et al,25 which is an extension of the AMOVA analysis,26 adapted for microsatellite loci. The levels of significance of GST estimates from these computations were determined by the permutation test (10 000 replications).
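For orientation, a single-level GST can be sketched as (HT - HS)/HT, where HS is the mean within-population gene diversity and HT is the gene diversity of the averaged allele frequencies. This is a simplified, unweighted version without sample-size corrections; it is not the two-level, allele-size-variance method of Chakraborty et al,25 and the function names are hypothetical.

```python
def gene_diversity(freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), for one population at one locus."""
    return 1.0 - sum(p * p for p in freqs.values())


def gst(pop_freqs):
    """Single-locus G_ST = (H_T - H_S) / H_T across a list of populations.

    pop_freqs is a list of dicts mapping allele label -> frequency.
    Populations are weighted equally (an assumption of this sketch).
    """
    alleles = set().union(*pop_freqs)
    n_pops = len(pop_freqs)
    mean_freqs = {
        a: sum(f.get(a, 0.0) for f in pop_freqs) / n_pops for a in alleles
    }
    h_s = sum(gene_diversity(f) for f in pop_freqs) / n_pops
    h_t = gene_diversity(mean_freqs)
    return (h_t - h_s) / h_t
```

With two populations at frequencies (0.6, 0.4) and (0.4, 0.6), HS is 0.48 and HT is 0.50, giving GST = 0.04.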

Average match probability and exclusion probabilities for parentage testing were computed by using the computational formulae listed in Chakraborty et al.10 For match probability evaluation, the impact of possible population substructure effects within each database (judged to be non-significant for each individual population) was examined by computing a weighted conditional match probability (as shown in Appendix 1, since this computational formula is not explicitly available in the literature).
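To show how a substructure adjustment enters a match probability, the sketch below uses the conditional genotype probabilities commonly associated with the NRC report (often attributed to Balding and Nichols). It is an illustrative stand-in, not the weighted conditional match probability formula of this study; the function names and the example frequencies are hypothetical.

```python
def locus_match_probability(p_i, p_j=None, theta=0.0):
    """Conditional match probability at one locus with co-ancestry theta.

    Pass only p_i for a homozygote, or p_i and p_j for a heterozygote.
    With theta = 0 these reduce to p_i^2 and 2 * p_i * p_j.
    """
    denom = (1.0 + theta) * (1.0 + 2.0 * theta)
    if p_j is None:
        # Homozygous genotype A_i A_i
        return (2 * theta + (1 - theta) * p_i) * (3 * theta + (1 - theta) * p_i) / denom
    # Heterozygous genotype A_i A_j
    return 2 * (theta + (1 - theta) * p_i) * (theta + (1 - theta) * p_j) / denom


def profile_match_probability(loci, theta=0.0):
    """Product over loci, which assumes independence of alleles across loci.

    loci is a list of tuples: (p_i,) for a homozygote, (p_i, p_j) for a heterozygote.
    """
    prob = 1.0
    for freqs in loci:
        prob *= locus_match_probability(*freqs, theta=theta)
    return prob
```

For a heterozygote with allele frequencies 0.1 and 0.2, the unadjusted value is 0.04, and setting theta to 0.03 raises it to roughly 0.052, which shows the direction and approximate size of the adjustment at a single locus.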

Appendix 1

Results and discussion

The allele frequencies at the nine studied loci are presented in Appendix 2. Occasionally, some DNA samples could not be optimally amplified at some loci, and consequently, sample sizes differ to some extent from one locus to the other. One possible reason for non-amplification could be sequence variation at the primer-binding site. However, null alleles resulting from such phenomena are rare and do not affect the validity of these loci in forensic analysis.27 Nonetheless, the allele frequency distributions show that each STR locus is substantially polymorphic in the worldwide populations. This is also reflected in summary measures of genetic variation, viz., number of alleles, allele size variance and heterozygosity, which are presented in Table 1. These measures indicate that the levels of variation at the nine STR loci are high in all populations, irrespective of their effective sizes. Even though the larger continental populations (eg, the populations of African, Caucasian, and Asian descent) show a somewhat larger level of variation, the reduction of genetic diversity in the smaller isolated groups (eg, the Native Americans and the Oceanic populations) appears to be marginal. It may be argued that this reduced genetic variation could be an artifact of the small sample size of these populations, eg, only 99 Oceanic individuals were sampled compared to the 291 Asians. It is known that average heterozygosity and allele size variance are not strikingly affected by sample size differences of this order.28,29 However, the number of segregating alleles is sensitive to such sample size effects. To account for this, we computed the expected average number of alleles (as described in ref. 30) that would have been observed if 99 individuals were sampled from each of the five major geographic groups of populations.
From this analysis, we obtain the average number of alleles ranging from 7.8 (among the Native Americans and Oceanic populations) to 10.2 (among the Africans), with the Caucasians and Asians having intermediate allele numbers, 9.4 and 9.3, respectively. Thus, the somewhat reduced levels of diversity at these nine STR loci, among the Native American and Oceanic populations, are not due to their sample size differences, but are rather reflections of genetic drift operating more actively in these populations.
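One common way to put allele counts from unequal samples on the same footing is rarefaction: the expected number of distinct alleles in a random subsample of n genes drawn without replacement from the N genes observed. The sketch below implements that hypergeometric expectation; whether it matches the cited method exactly is an assumption, and the function name and example counts are hypothetical.

```python
from math import comb


def expected_alleles_in_subsample(allele_counts, n_genes):
    """Expected number of distinct alleles among n_genes sampled without replacement.

    allele_counts: observed copies of each allele (summing to N genes).
    An allele with N_i copies is missed with probability C(N - N_i, n) / C(N, n).
    """
    total = sum(allele_counts)
    if n_genes > total:
        raise ValueError("subsample cannot exceed the observed number of genes")
    denom = comb(total, n_genes)
    return sum(1.0 - comb(total - c, n_genes) / denom for c in allele_counts)
```

For 99 individuals the subsample would be 198 genes, so a call would look like `expected_alleles_in_subsample(counts, 198)` for each locus and group, followed by averaging over loci.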

Appendix 2

Table 1 Summary statistics of within population variation at nine STR loci in 20 global populations

Tests for conformity of genotype frequencies with HWE, performed by the exact test,18 showed only nine significant departures from equilibrium out of a total of 180 locus-population combinations. Of these, seven (Brazilian at D13S317, Sudanese at FGA, Chinese at D8S1179, D13S317 and D7S820, Ngöbé at D21S11 and Pehuenche at D8S1179) were at the 5% level of significance, and two (New Guinea Highlander at D13S317 and Sudanese at D21S11) were at the 1% level of significance. Overall, the proportion of discordances (9 out of 180) exactly conforms to the nominal level of significance (5%), indicating a general agreement with HW proportions of genotype frequencies in the entire dataset.

In order to examine whether HWE expectations hold at the level of geographic populations, we performed a similar analysis of exact test18 on the pooled samples within each of the five major groups. Of the 45 locus-group combinations, eight significant deviations were observed. The Africans and the Caucasians were at HWE at all loci; the Asians and the Oceanians showed departure at a single locus each, D5S818 locus (P=0.026) and D13S317 (P=0.007), respectively. However, among the Native Americans, six of the nine loci were significantly different from the HWE expectations (vWA, P=0.049; D8S1179, P=0.013; D21S11, P=0.005; D18S51, P=0.020; D5S818, P=0.004; and D7S820, P=0.015).

These results, together, suggest that for these nine STR loci, the assumption of HWE holds reasonably well for anthropologically defined populations. Further, when ethnic groups are pooled as geographic and/or broadly defined entities, in general, the larger continental and cosmopolitan populations still adhere to the expectations of HWE. However, groupings of isolated populations even of common ancestral origin, such as the Native Americans, exhibit the presence of population substructure, which could be attributed to the effects of genetic drift resulting from relative isolation and smaller population sizes.

Summary statistics of two tests of mutual independence of the nine STR loci are shown in Table 2. When each multi-locus genotype occurs only once in a sample, the summed number of heterozygous loci is a sufficient statistic for testing the hypothesis of mutual independence of loci.8 Therefore, the test statistic, s_k^2 (the variance of the number of heterozygous loci across individuals, in their nine-locus genotype profiles), used for testing the hypothesis of mutual independence of all loci contains all information in a database of multi-locus genotypes. For all 20 populations, the observed values of s_k^2 are within their respective 95% confidence limits, supporting agreement with the hypothesis of mutual independence of loci. This conclusion is also corroborated by the test of conformity of observed and expected number of alleles shared between all pairs of individuals within each population (last two columns of Table 2). Thus, there is no evidence of non-random association of alleles across loci in any of the 20 populations examined in this study. Absence of non-random association of alleles at these loci is also evident at the level of major groupings of populations (see Table 2). In addition, the allele-sharing data reveal another important population genetic characteristic, which is not readily observed in tabulations of allele frequencies. The last two columns of Table 2 show that individuals who are members of smaller isolated populations share more alleles in their multilocus genotype profiles than do individuals from larger populations. This larger sharing of alleles is, nonetheless, in line with the expectation of random combination of alleles in their genotypes (as seen from the conformity of observed and expected values). Thus, the larger allele sharing in Native Americans and Oceanic populations is consistent with their reduced genetic variation (Table 1).

Table 2 Tests of multi-locus independence of allele frequencies in 20 global populations

Tables 3 and 4 provide summary results of gene diversity analyses of the nine STR loci. For geographic populations within each of the five major groups, we have evaluated the coefficient of gene diversity GST, which is effectively equivalent to the coefficient of coancestry, θ, based on gene diversity and allele size variance separately.31 Although in the context of evolutionary relationships of populations, allele size variance based estimates are preferred, gene diversity based estimates are more relevant for forensic applications. Nevertheless, the data presented in Table 3 establish two important points. First, for all major groups of populations, estimates of θ <3% are adequate, as suggested in the NRC report. Second, the levels of significance, obtained by a permutation-based method,10 indicate that even small values of θ can be statistically significant. In other words, even when two databases from two different samples from the same population show statistically significant differences of allele frequencies, such observations do not compromise forensic calculations, because such departures can be taken into account by invoking values of θ as suggested in the NRC report.

Table 3 Estimates of coefficient of gene differentiation (GST) among populations for five major groups of humans based on nine STR loci
Table 4 Gene diversity analysis of 20 global populations sub-divided as five major groups and sub-populations within each group

Table 4 illustrates another aspect of the gene diversity analysis. The estimates of GST between populations within a major group are smaller than those among the major groups of geographic populations. This provides empirical support for the notion that establishing STR databases based on broad definitions of populations is adequate for use in forensic analysis.8,32

Figure 1 shows a neighbour-joining tree33 of the genetic affinities amongst the 20 populations based on the chord distance,34 which has been demonstrated to generate reliable tree topologies.35 We have also estimated the phylogenetic relationships based on Nei's standard genetic distance,36 which showed very similar topologies and bootstrap values compared with the chord distance (data not shown). A notable feature of the network tree is that, in general, populations within a major geographic or racial group have clustered together. For example, all of the populations of African ancestry are proximally placed, as are the populations of Caucasian/European and Asian origins, respectively. Interestingly, all of the five Native American groups are located on the same branch. An exception is the position of the Samoans, whose branch lies between the Africans and the Caucasians. Based on the known ethno-history and affinity of this population,37,38 one would expect the Samoans to cluster with other Asian populations. However, the bootstrap values supporting the Samoan branch are rather low and thus this anomalous observation is most likely due to the limited number of markers used. It should also be noted that in a previous study on South-east Asian and Oceanic populations, using a separate set of nine STRs and five Y-specific STR loci in the principal component analysis, the Samoans were an outlier compared to the majority of the South-east Asians.16

Figure 1 A neighbour-joining tree based on chord distances. The abbreviated population names are the same as those mentioned in the Materials and methods section. Bootstrap values indicate the degree of support of 1000 replicates for each branch point.

In Table 5, we illustrate the power of the battery of the nine loci for forensic and parentage testing applications. In general, the nine loci have adequate discriminatory power for forensic identification of individuals, as well as sufficient exclusionary power for parentage analysis. As expected, with adjustment for population substructure effects (ie, with non-zero values of θ), the match probabilities become somewhat less extreme. However, for all populations the average match probability remains well below the reciprocal of the respective current population size, reflecting the global rarity of nine-locus DNA profiles based on these loci. In other words, a somewhat reduced level of genetic variation in isolated populations (such as the Native Americans and the Oceanic populations) does not compromise the use of these STR loci for the purpose of human identification. As shown in the last four columns of Table 5, these loci are also adequate for parentage testing. With the criterion of exclusion based on at least one locus, and with data on the mother–child pair, the exclusion probability exceeds 99.3% in all populations. Exclusion based on at least two loci offers an exclusion probability in excess of 94.3%. However, in motherless cases, for some populations (particularly the small isolated ones), the exclusion probability falls below 80%. Thus, there may be a need to supplement these loci with additional markers for cases that involve unknown mothers, or more complicated forensic scenarios, eg, DNA mixtures involving two or more samples. In view of our observation that pooling of data from Native Americans produced a considerable degree of departure from HWE (but without generating FST/GST above 4.1%, see Table 3), a further degree of conservativeness in forensic use of our data presented in Appendix 2 may be achieved by imposing a minimum threshold allele frequency, a concept advocated in the forensic literature.3,39
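The difference between the "at least one locus" and "at least two loci" criteria can be made concrete with a short sketch. Given per-locus exclusion probabilities and assuming independence across loci, it builds the distribution of the number of excluding loci and returns the probability that this number reaches a chosen minimum. The function name and the per-locus values in the usage note are hypothetical, not figures from Table 5.

```python
def combined_exclusion_probability(per_locus_pe, min_loci=1):
    """Probability that at least min_loci loci exclude a non-father.

    per_locus_pe: exclusion probability at each locus.
    Loci are treated as independent, so the number of excluding loci
    follows a Poisson-binomial distribution, built here by dynamic programming.
    """
    dist = [1.0]  # dist[k] = probability that exactly k loci exclude so far
    for pe in per_locus_pe:
        updated = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            updated[k] += prob * (1.0 - pe)   # this locus does not exclude
            updated[k + 1] += prob * pe       # this locus excludes
        dist = updated
    return 1.0 - sum(dist[:min_loci])
```

With nine loci each at a hypothetical per-locus value of 0.5, `combined_exclusion_probability([0.5] * 9, 1)` is about 0.998 while `combined_exclusion_probability([0.5] * 9, 2)` is about 0.980, which illustrates why the two-locus criterion always yields the lower figure.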

Table 5 Match probability and paternity exclusion probability with the combined testing of nine STR loci in global populations

In summary, this report establishes a nine-locus STR database in a globally diverse set of anthropologically defined populations. Analyses of genotype and allele frequency data demonstrate that the assumptions of HWE and multi-locus independence of alleles are globally applicable for the STR loci, and that sampling designs generally employed in human genetic surveys provide adequate representations of random samples for DNA typing. Gene diversity analysis shows that when STR databases are pooled over populations by geographic groupings, population substructure effects can be accounted for with values of θ consistent with the ones recommended in the NRC report (ie, θ <1% for all cosmopolitan populations, and ≈3% for small isolated populations). Finally, with regard to the power of discrimination and exclusion probability, the data presented here also show that a reduced level of genetic variation in smaller and isolated populations does not substantially compromise the utility of these loci.