Introduction

DNA-repair systems are essential for maintaining the integrity of the genetic material. They play a key role in protecting it against deleterious mutations leading to cancer as well as to neurodegeneration or aging. Interindividual differences in DNA-repair efficiency appear to be genetically determined (Crompton and Ozsahin 1997; Mohrenweiser and Jones 1998; Pero et al. 1983; Pero et al. 1989). The presence of potentially functional polymorphisms within the coding regions of the APEX1, XRCC1, ERCC4, and ERCC2 (Shen et al. 1998), hMSH2 (Liu et al. 1995), as well as hMLH1 and hMSH3 genes (Benachenhou et al. 1998) is consistent with this view. The XRCC1 protein interacts with DNA ligase III, polymerase β, and poly(ADP-ribose) polymerase (Caldecott et al. 1996), as well as with the apurinic endonuclease APEX1 (Demple et al. 1991). The proteins hMLH1 and hMSH3 contribute to the mismatch-repair machinery (e.g., Kolodner 1996). ERCC4 corresponds to a nuclease that acts at the 5’ ends of DNA lesions (Bessho et al. 1997), while ERCC2 has a 5’-3’ ATP-dependent helicase activity (Sung et al. 1993) associated with the TFIIH protein complex required for both transcription and repair (Lehmann 1995). Amino acid replacements within these proteins might affect the efficiency of DNA repair and thus modulate an individual’s risk of developing cancer or influence the therapeutic response. Knowledge of the frequency profiles of such genetic variants across human populations is needed for genetic epidemiological or pharmacogenetic studies.

In particular, we wanted to know whether the distribution of different variants can be explained by the recent out-of-Africa expansion, separating sub-Saharan Africa from Europe and the Middle East (ME) on one side and Southeast Asia (SEA) and America on the other (Cavalli-Sforza et al. 1994; Lahr and Foley 1994). It turns out that, in a multidimensional scaling plot of their respective F ST values, different polymorphisms show different continental affinities, suggesting a role of selection, or of genetic drift confounded with demographic factors, in the partitioning of these genetic variants among human populations.

Materials and methods

DNA samples and genotyping

Genomic samples were either from earlier non-nominative DNA collections or were obtained, on a non-nominative basis, from consenting adults who provided information about their ethnicity and country of origin following a protocol approved by the Institutional Review Board. Studied groups included individuals from the Middle East and North Africa (n=23), referred to as Middle East; from Southeast Asia (n=24); from North America (Athabascan speakers from Saskatchewan, n=23, and Algonquian speakers from Quebec and Ontario, n=24); individuals of sub-Saharan African descent (n=23), referred to as Africans; as well as French Canadians from the Province of Quebec, Canada (n=323), representing individuals of European descent. PCR was carried out under standard conditions in a volume of 50 μl using 25 ng of genomic DNA. Amplification products were dot blotted onto a Hybond™-N+ membrane (Amersham) and subsequently hybridized with ASO probes (Table 1) following the protocol described earlier (Labuda et al. 1999).

Table 1 Characteristics of DNA-repair polymorphisms and conditions of PCR-ASO hybridization genotyping assays. Uppercase characters indicate polymorphic sites

Statistical analysis

F ST (see Weir and Cockerham 1984), a measure of allele frequency differences among population samples, was estimated per site or locus, for pairs of continental populations, for all of them as a group, or for combined groups of populations based on their pairwise analysis as indicated. It is defined as \( F_{ST} = \frac{H_{T} - H_{S}}{H_{T}} \), where \( H_{S} \), the average within-subpopulation heterozygosity, corresponds to \( H_{S} = \frac{1}{s}\sum_{j=1}^{s} H_{j} \) and \( H_{j} = \left( 1 - \sum_{i=1}^{k} p_{ij}^{2} \right)\frac{2n_{j}}{2n_{j} - 1} \), with s representing the number of subpopulations, k the number of alleles, \( p_{ij} \) the frequency of allele i in subpopulation j, and \( n_{j} \) the number of individuals tested in subpopulation j, whereas \( H_{T} \), the total heterozygosity, is \( H_{T} = \frac{2}{s(s - 1)}\sum_{1 \le j < l \le s} \left( 1 - \sum_{i=1}^{k} p_{ij} p_{il} \right) \). F ST was calculated with the help of the Arlequin package v.2.0 (Schneider et al. 2000). The same results were obtained with the GDA software v.1.0 (Lewis and Zaykin 2001). The French Canadian sample representing Europe was much larger than the other samples and, when used at its full size, strongly biased the resulting F ST values. For the F ST estimation, its size was therefore artificially reduced to n=46 chromosomes, a size comparable to that of all other studied populations.
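The definition above can be sketched in a few lines of Python. This is only an illustration of the formula as written in the text, not a reimplementation of Arlequin or GDA, whose estimators may differ in detail. The function names and the allele frequencies in the example are hypothetical.

```python
from itertools import combinations


def heterozygosity(freqs, n_individuals):
    """H_j = (1 - sum_i p_ij^2) * 2n_j / (2n_j - 1) for one subpopulation."""
    two_n = 2 * n_individuals
    return (1.0 - sum(p * p for p in freqs)) * two_n / (two_n - 1)


def fst(pop_freqs, pop_sizes):
    """F_ST = (H_T - H_S) / H_T, following the definitions in the text.

    pop_freqs: one list of allele frequencies per subpopulation
               (same allele order in every list).
    pop_sizes: number of individuals tested in each subpopulation.
    """
    s = len(pop_freqs)
    # H_S: mean of the within-subpopulation heterozygosities.
    h_s = sum(heterozygosity(f, n) for f, n in zip(pop_freqs, pop_sizes)) / s
    # H_T: mean over all pairs j < l of (1 - sum_i p_ij * p_il).
    h_t = 2.0 / (s * (s - 1)) * sum(
        1.0 - sum(pj * pl for pj, pl in zip(fj, fl))
        for fj, fl in combinations(pop_freqs, 2)
    )
    return (h_t - h_s) / h_t


# Hypothetical biallelic site in three subpopulations of 23 individuals each.
example_freqs = [[0.9, 0.1], [0.6, 0.4], [0.5, 0.5]]
example_sizes = [23, 23, 23]
print(round(fst(example_freqs, example_sizes), 3))
```

Because of the small-sample correction in \( H_{j} \), this estimate can be slightly negative when subpopulations have identical allele frequencies; such values are usually read as zero differentiation.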

Two-site haplotypes from the hMSH3 gene were obtained from the observed genotypes by inspection. Their estimated frequencies were subsequently confirmed using the EH linkage utility program, which was also used to assess the significance of linkage disequilibrium between the contributing polymorphisms. EH is available from ott@linkage.rockfeller.edu. \( D' = D/D_{max} \) was calculated as described in Hartl and Clark (1989). The graphical display of the pairwise F ST values by multidimensional scaling was obtained using STATISTICA v.6.
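As an illustration of the normalized disequilibrium measure, the following sketch computes D and D' = D/D max from the four two-site haplotype frequencies, using the standard textbook definition. The function name is hypothetical. The sign convention (D' keeps the sign of D) is an assumption, since the text does not state whether signed or absolute values were reported.

```python
def d_prime(f_ab_upper, f_a_upper_b_lower, f_a_lower_b_upper, f_ab_lower):
    """Return (D, D') for two biallelic sites.

    Arguments are the frequencies of haplotypes AB, Ab, aB and ab,
    which are expected to sum to 1.
    """
    p_a = f_ab_upper + f_a_upper_b_lower   # frequency of allele A
    p_b = f_ab_upper + f_a_lower_b_upper   # frequency of allele B
    d = f_ab_upper - p_a * p_b             # raw disequilibrium coefficient
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        # One of the sites is monomorphic, so D' is undefined; report 0.
        return d, 0.0
    return d, d / d_max


# Hypothetical haplotype frequencies (AB, Ab, aB, ab), for illustration only.
print(d_prime(0.5, 0.25, 0.25, 0.0))
```

When one of the four haplotypes is absent, as in this toy input, |D'| equals 1, which is the pattern expected when two mutations arose on different chromosomes and no recombinant haplotype is observed.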

Results

Population distribution of DNA-repair protein variants

Five subcontinental groups were surveyed for the presence of seven amino acid substitution polymorphisms in six proteins involved in DNA repair (Table 2). The population samples extended from sub-Saharan Africa through the ME to Europe, and further to SEA and North America. In all but ERCC4, the minor allele frequency exceeds 10% in at least one of the populations. We note that at ERCC4 there is no significant differentiation among populations, as indicated by a zero F ST. Two other sites, APEX1 and hMSH3 exon 23, also show little differentiation, with nonzero but low F ST values that are nonsignificant in pairwise population comparisons. In contrast to ERCC4, at these two sites the minor allele frequency is in the range of 0.1–0.45, thus suggesting a mechanism maintaining a relatively high and comparable heterozygosity across continental populations.

Table 2 Population frequencies of DNA-repair genetic variants. n.d. not determined. Note that for the estimation of F ST we set the European sample to be of similar size as other populations i.e. n=46

In other loci, F ST estimates ranging from 10 to 17% (Table 2) are similar to those observed in a number of other genetic systems (Cavalli-Sforza et al. 1994; Fullerton et al. 2002; Relethford 2001). Minor allele frequencies (Table 2), as well as matrices of pairwise F ST values represented by multidimensional scaling in Fig. 1, indicate different geographic distributions of the analyzed variants. Strikingly, there is no clear division between sub-Saharan Africa and the remaining populations. For all polymorphic sites, Europe and the ME cluster together, at least looking at one dimension at a time. This clustering includes Africa in the case of XRCC1, APEX1, and hMSH3 exon 23. In contrast, in hMLH1 and ERCC4 Africa clearly joins SEA. In XRCC1, when the total population is considered to be composed of two subpopulations (one comprising Africa, the ME, and Europe, and the other SEA and North America), the estimated F ST of 0.241 is close to that estimated considering the total population composed of six subpopulations (Table 2). This indicates that the greatest differentiation occurred along the line separating the above groups of populations. In hMLH1, if Africa and SEA are combined in a single subpopulation as opposed to the others, the F ST of 0.159 is almost identical to the F ST evaluated considering six independent subpopulations.

Fig. 1
Display of pairwise F ST values among studied populations by multidimensional scaling: Af Africa, Eu Europe, SEA Southeast Asia, ME Middle East, Alg and Atb Algonquian and Athabascan speakers from North America, respectively. Asterisks indicate polymorphisms where seven or more of the pairwise population F ST distances were statistically significant
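The figure was produced with STATISTICA, and the exact scaling variant is not stated. As a rough illustration of how a matrix of pairwise distances is turned into plot coordinates, the sketch below uses classical (Torgerson) multidimensional scaling with NumPy, which is a swapped-in technique and may differ from the one actually used. The labels and distance values are made-up toy numbers, not F ST estimates from this study.

```python
import numpy as np


def classical_mds(dist, n_dims=2):
    """Classical (Torgerson) MDS of a symmetric distance matrix.

    Returns an array of shape (n_points, n_dims) with the coordinates.
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to get a Gram-like matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    # Keep the n_dims largest eigenvalues; clip negatives that arise
    # when the distances are not exactly Euclidean.
    order = np.argsort(eigvals)[::-1][:n_dims]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy distance matrix for three hypothetical populations A, B and C.
labels = ["A", "B", "C"]
toy_dist = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
for label, xy in zip(labels, classical_mds(toy_dist, n_dims=2)):
    print(label, np.round(xy, 3))
```

The orientation and sign of each axis are arbitrary in any such plot, so only the relative positions of the populations carry information.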

In hMSH3, differences between sites within the locus are observed. In contrast to the site in exon 21, that in exon 23 displays little population differentiation and a high minor allele frequency. Interestingly, in spite of the increase in frequency of the Gln940 allele in Europe and the ME, the frequency of exon 23 Ala1036 remains similar across Old World continents. Indeed, if Algonquian speakers are excluded from the analysis, the F ST of this polymorphism drops to almost zero among the remaining five subpopulations. It is therefore interesting to evaluate the relationship between these two sites in hMSH3 (Table 3). Based on the linkage disequilibrium and the allelic state in different species, we can assign the origins of both polymorphisms to Africa, through mutations Arg→Gln and Ala→Thr occurring on independent chromosomes and the appearance of the Gln-Thr haplotype through recombination. The value of the F ST estimate for the hMSH3 two-site haplotype (Table 3) drops to about 5%, less than that for each site considered separately (data not shown).

Table 3 hMSH3 haplotypes involving sites G2835A (Arg940Gln) and A3124G (Thr1036Ala)

Discussion

Under neutrality and in the absence of mutation, genetic variation across populations is expected to be determined by genetic drift only, which in turn is determined by the demographic history of populations. A priori, all loci in the genome have the same expected degree of differentiation, which may be used to detect the action of natural selection (Cavalli-Sforza 1966; Fullerton et al. 2002; Lewontin and Krakauer 1973). If the allele frequency data are available for a large set of putatively neutral loci, then an empirical distribution of F ST values can be constructed to identify loci with unusual differentiation patterns.
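The outlier approach described above can be sketched as follows: given F ST values for a reference set of putatively neutral loci, one asks what fraction of them is at least as low, or at least as high, as the value at a candidate locus. The function name and all numbers below are hypothetical placeholders, and no such reference set was analyzed in this study.

```python
def empirical_tail_fractions(candidate_fst, reference_fsts):
    """Return the fractions of reference loci with F_ST <= and >= the candidate.

    A very small lower fraction points to unusually low differentiation,
    a very small upper fraction to unusually high differentiation.
    """
    n = len(reference_fsts)
    lower = sum(1 for f in reference_fsts if f <= candidate_fst) / n
    upper = sum(1 for f in reference_fsts if f >= candidate_fst) / n
    return lower, upper


# Hypothetical reference F_ST values for putatively neutral loci.
reference = [0.05, 0.10, 0.12, 0.15, 0.30]
print(empirical_tail_fractions(0.0, reference))
```

With a realistic reference set of many loci, a candidate falling in an extreme tail would be flagged for closer study; the result is suggestive rather than a formal test, since demographic history also shapes the distribution.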

Recently, Fullerton et al. (2002) compared F ST estimates in CAPN10, a candidate susceptibility locus for type 2 diabetes, with those of 86 biallelic RFLPs from earlier studies. They found that several polymorphic sites within CAPN10 had a relatively elevated F ST, which could have been interpreted as the effect of selection. These authors observed the most pronounced diversification between sub-Saharan African and non-African populations that differentiated the population risk of type 2 diabetes attributable to the susceptibility haplotypes (Fullerton et al. 2002). This diversification of allelic frequencies could have been caused by neutral demographic mechanisms, such as drift or migrations, population bottlenecks, and founder effects accompanying the out-of-Africa expansion (Cavalli-Sforza et al. 1994).

Interestingly, although our analysis of seven DNA-repair genetic polymorphisms did not show the greatest differentiation between Africans and non-Africans, certain observed continuities and discontinuities in allelic frequencies (Fig. 1) go along with two putative routes of the out-of-Africa expansion (Lahr and Foley 1994): the northern route connecting Africa to Europe through the ME and the earlier southern route linking Africa with Southeast Asia (Kivisild et al. 1999; Quintana-Murci et al. 1999).

In the case of the XRCC1 and hMSH3 exon 23 polymorphisms, the Africans show a tendency to cluster with Europe and the ME, while in hMLH1 they clearly cluster with Asia (seen also in hMSH3 exon 21 and ERCC4). In contrast, the distribution of two other polymorphisms, APEX1 and hMSH3 exon 23, does not seem to correlate with the history of human migrations. As noted before, in spite of tight linkage between the hMSH3 exon 21 and exon 23 polymorphisms, their distribution patterns appear to follow different trajectories. The APEX1 and hMSH3 exon 23 sites are characterized by an overall high heterozygosity and yet a low or not significantly different F ST, suggesting the possibility of balancing selection as a force maintaining a high and relatively even frequency distribution among populations (e.g., Bürger 2000).

The interpretation invoking selection is more plausible if the polymorphism affects the protein primary structure, if the substitution changes the nature of the amino acid residue, and if this residue appears conserved in extant species. These conditions were not clearly met in the case of the studied polymorphisms (Table 2). This is not unexpected given that at certain loci, a “favorable” allele, under certain circumstances, may become an “at risk” allele in a different context (Neel 1962), as illustrated by the effect of common polymorphisms in the MTHFR locus (Rosenberg et al. 2002). The F ST values of about 10% for the 677T and 1268C polymorphisms, based on the data of Rosenberg et al. (2002), are within the expected range (e.g., Fullerton et al. 2002). However, the functional significance of MTHFR polymorphisms was independently documented in different studies. Therefore, here a high minor allele frequency across populations seems to be the best indicator of potential functional/selective significance, as in the APEX1 or hMSH3 exon 23 polymorphisms.

The results presented here were obtained with a particular population of European origin and a relatively small population sample of non-Europeans. Certain estimates may change if more samples or different populations representing continents or linguistic groups are surveyed. However, we would not expect a major change in the overall picture, since our estimates are similar to those obtained for a variety of European populations as well as for non-Europeans for loci whose data were available (Abdel-Rahman et al. 2000; Butkiewicz et al. 2001; Duell et al. 2001; Dybdahl et al. 1999; Fan et al. 1999; Fredman et al. 2002; Lunn et al. 1999).

In conclusion, this study illustrates how population genomics can be used to provide insight into the functional significance of certain DNA variants. The question remains why none of the seven studied polymorphisms demarcates sub-Saharan Africa from other continents, as would be expected under a neutral scenario and assuming the recent out-of-Africa model of human population history. Finally, it is known that the incidence of different cancers differs between populations of different geographic and ethnic origin. It remains to be shown to what extent, besides environmental factors, genetic differences in candidate genes, such as those demonstrated here, are responsible for this variable incidence.