Introduction

The dispersals of human populations from Africa into new environments in the last 75 000 years, and the introduction of farming starting 10 000 years ago, involved profound changes in human lifestyle, including diet and exposure to pathogens and xenobiotics.1 During the Palaeolithic period, the dietary choices of modern humans must have been limited to minimally processed wild plants and animal foods. With the advent of agriculture, novel foods were introduced, with which the human genome had little evolutionary experience. Thus, farming communities were challenged by new selective pressures that led to the emergence of genetically transmitted phenotypes increasing human survival. Besides digesting new energy-rich nutrients, humans also had to detoxify a wide range of novel xenobiotics, including toxins and carcinogens.2 For example, changes in the temperature at which meat and fish are cooked modify human exposure to exogenous carcinogens (heterocyclic amines and polycyclic aromatic hydrocarbons) and can therefore increase the risk of developing colon cancer.2 This example suggests that genes involved in the detoxification of exogenous molecules might, in general, have played an important role in human adaptation during the transition from foraging to farming.

Among the enzymes involved in xenobiotic detoxification, arylamine N-acetyltransferase 2 (NAT2) is a phase II drug-metabolizing enzyme (DME), which catalyses the N-acetylation of aromatic amines.3 This DME first came to prominence as one of the earliest enzymes recognized as a cause of interindividual variation in drug metabolism. Functional polymorphisms at the NAT2 gene segregate in humans and in other mammals into rapid and slow acetylation phenotypes.4 Acetylation status has previously been shown to modify the frequency and/or the severity of drug and xenobiotic toxicity in human populations.5, 6 For example, slow acetylators are at increased risk of hepatotoxicity from isoniazid, a major antitubercular drug, because the high and prolonged circulating concentration of this drug gives rise to toxic effects.7 Slow acetylators are also at increased risk for prostate and urinary bladder cancers following exposure to aromatic amine carcinogens.8, 9 Conversely, fast acetylators are at increased risk for colon cancer (and also lung, breast and laryngeal cancers) when highly exposed to the heterocyclic amines found in well-cooked meat.9 In this context, the involvement of acetylation phenotypes in the metabolism of several xenobiotics suggests that they affect the fitness of human populations. Consequently, the changes in human exposure to toxic molecules and carcinogens introduced by the transition to agriculture might have modified the selective pressures acting on DMEs, as recently shown for the NAT2 gene.10

To test in depth whether past demography and lifestyles have influenced the selective pressures acting on NAT2, we characterized genetic diversity and the patterns of linkage disequilibrium (LD) of the NAT2 gene region in six different human populations from Central Asia. Populations inhabiting Central Asia exhibit marked differences in terms of lifestyle (sedentary agriculturalists vs nomad pastoralists), dietary choices and past demographic histories, and therefore represent an excellent model system to test the extent to which changes in human lifestyle might have influenced human adaptation to xenobiotic environments.

Materials and methods

DNA samples

In our study, we analysed a total of 138 unrelated individuals (276 chromosomes) from six populations of Central Asia: Kirghizs sampled in the south (lowland; population named KRA, 2n=94) and in the north (highland; KRM, 2n=46) of Kirghizstan, Kazakhs from Kazakhstan (KZ, 2n=52), Tajiks from the Uzbek cities of Agalic (TJA, 2n=64) and Urgut (TJU, 2n=58) and Uzbeks from Uzbekistan (UZ, 2n=62) (see Figure 1). The Kirghizs sampled in this study are thought to have always been nomad pastoralists, because they originate from southern Siberia, where the Neolithic revolution appeared late and marginally.11, 12 Their diet is dominated by meat. It is widely accepted that the Uzbeks and the Kazakhs were nomads who became sedentarized in the 14–15th century and the 20th century, respectively.13 They are now agriculturalists and have a diet based mainly on meat and dairy.14 By contrast, the Tajiks are long-term agriculturalists and may represent the descendants of people who made the Neolithic transition in this area (4000–3000 years ago15). They are sedentary agriculturalists rearing cattle. Their diet is less dominated by meat than that of the other populations cited above.16

Figure 1

Localities of the seven Central Asian samples. Population names are labelled as follows: KRA, KRM: Kirghizs; KZ: Kazakhs; TJA, TJU: Tajiks; UZ: Uzbeks; TK: Turkmen.

Genotyping and sequencing

All the individuals were typed for a total of 15 polymorphic SNPs. These SNPs correspond to the nine intergenic SNPs selected in a previous study10 and to six SNPs found in the NAT2 coding region (282C>T, 341T>C, 481C>T, 590G>A, 803A>G, 857G>A; for more details, see Patin et al10). SNP 191G>A was excluded because it is monomorphic in non-African samples. The nine intergenic SNPs, covering a region of 200 kb, were genotyped by TaqMan (ABI Prism® 7000 Sequence Detection System, Applied Biosystems, Foster City, CA, USA) assays. For the six NAT2 SNPs, a region of 1500 bp including the NAT2 coding exon (870 bp) and its flanking regions was PCR amplified and sequenced with the ‘ABI Prism® 3100 Genetic Analyser’ (Applied Biosystems). Protocols were reported in a previous study.10 Sequences were analysed using GENALYS.17

Data analysis

Allele frequencies were calculated by gene counting. Using Arlequin v.2.000,18 FST statistics19 were calculated for all pairwise population comparisons and deviations from Hardy–Weinberg equilibrium were tested. Haplotype reconstruction from unphased genotypes was performed using the Bayesian method implemented in PHASE v.2.1.1.20 The phenotype (fast/slow acetylator) proportions were calculated for each population considering that the ‘slow acetylator’ allele is codominant.
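As an illustration of gene counting and of a test for deviation from Hardy–Weinberg proportions, the following minimal Python sketch may help. It is not the Arlequin procedure: a simple chi-square goodness-of-fit statistic is used here as a stand-in, and the function names and genotype counts are hypothetical.

```python
def allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Frequency of the alternative allele by gene counting.

    Each homozygote contributes two copies and each heterozygote one copy,
    out of 2N chromosomes in total.
    """
    n_individuals = n_hom_ref + n_het + n_hom_alt
    return (2 * n_hom_alt + n_het) / (2 * n_individuals)


def hwe_chi_square(n_hom_ref, n_het, n_hom_alt):
    """Chi-square statistic comparing observed genotype counts with
    Hardy-Weinberg expectations (illustrative stand-in, 1 degree of freedom)."""
    n = n_hom_ref + n_het + n_hom_alt
    q = allele_frequency(n_hom_ref, n_het, n_hom_alt)
    p = 1 - q
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_ref, n_het, n_hom_alt)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Hypothetical genotype counts for one SNP in one population.
print(allele_frequency(10, 6, 7))
print(hwe_chi_square(10, 6, 7))
```

A statistic close to zero indicates genotype counts close to Hardy–Weinberg proportions; with small samples an exact test is generally preferable to this approximation.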

Using DnaSP v.4.10,21 we calculated the nucleotide diversity π and Watterson's estimator θW of θ and we performed the neutrality tests Tajima's DT,22 Fu and Li's F* and D*.23 These statistics are functions of the number of segregating sites in a sample of sequences, the mean pairwise difference between the sequences and the number of derived variants that are observed only once in the sample, all of which are affected by natural selection.22, 23 P-values were estimated by the use of coalescent simulations under a neutral model of evolution (n=10 000).
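To make the relationship between these statistics concrete, the sketch below computes the number of segregating sites S, the mean pairwise difference π, Watterson's θW and Tajima's D from a small alignment, using the standard formulas of Tajima's test. It is an illustrative sketch rather than the DnaSP implementation: the function name and the toy sequences are hypothetical, values are expressed per sequence rather than per site, and significance would still have to be assessed by coalescent simulation.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(seqs):
    """Return (S, pi, theta_w, tajima_d) for equal-length aligned sequences."""
    n = len(seqs)
    length = len(seqs[0])
    # Segregating sites: alignment columns showing more than one allele.
    s = sum(1 for i in range(length) if len({seq[i] for seq in seqs}) > 1)
    # pi: mean number of differences over all pairs of sequences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Watterson's estimator: S scaled by the harmonic number a1.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = s / a1
    if s == 0:
        return s, pi, theta_w, float("nan")
    # Constants of the variance of (pi - theta_w) under neutrality.
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    tajima_d = (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))
    return s, pi, theta_w, tajima_d


# Two haplotypes at equal frequency: excess of intermediate-frequency variants.
print(diversity_stats(["AA", "AA", "TT", "TT"]))
# Only singletons: excess of rare variants.
print(diversity_stats(["AAAA", "TAAA", "ATAA", "AATA"]))
```

In the first toy sample π exceeds θW and D is positive, the pattern discussed in this study; in the second, singletons inflate θW relative to π and D is negative.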

Pairwise LD between the 15 genotyped SNPs was estimated in each population. We calculated the statistics D′24 and r2 (Hill and Robertson25) using DnaSP v.4.10.26 Their statistical significance was tested using Fisher's exact test followed by Bonferroni correction for multiple testing.27 To represent the decay of LD over distance, we plotted the proportion of SNP pairs showing significant Fisher's exact tests for D′ against the physical distance between SNP pairs.
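For readers unfamiliar with these two measures, the following minimal Python sketch computes D, D′ and r2 for a pair of biallelic SNPs from phased haplotypes. It is an illustration under simple assumptions (alleles coded 0/1, haplotype phase known) and not the DnaSP implementation; the function name and example data are hypothetical.

```python
def pairwise_ld(haplotypes):
    """Return (D, D_prime, r2) for two biallelic sites.

    haplotypes: list of (allele_site1, allele_site2) tuples, alleles coded 0/1.
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n           # frequency of allele 1 at site 1
    p_b = sum(h[1] for h in haplotypes) / n           # frequency of allele 1 at site 2
    p_ab = sum(1 for h in haplotypes if h == (1, 1)) / n
    d = p_ab - p_a * p_b                              # raw disequilibrium coefficient
    # D' rescales D by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    # r2 is the squared correlation between alleles at the two sites.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Three of the four possible haplotypes observed: D' = 1 although r2 < 1.
sample = [(1, 1)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 4
print(pairwise_ld(sample))
```

The example shows why the two statistics are complementary: D′ reaches 1 whenever at most three of the four haplotypes are present, whereas r2 also depends on how well allele frequencies match at the two sites.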

We also performed the long-range haplotype (LRH) test using SWEEP v.1.028 on a core region corresponding to NAT2 coding exon.10 Then we assessed for each core haplotype its relative extended haplotype homozygosity (REHH) 200 kb apart. To test the significance of potentially selected core haplotypes, our data set was compared with both empirical (obtained from the screening of the entire chromosome 8 in European-descent populations; HapMap database) and simulated distributions10 of ‘core haplotype frequencies vs REHH’.
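The quantities behind the LRH test can be sketched as follows. EHH for a core haplotype is taken here as the probability that two randomly chosen chromosomes carrying that core haplotype are identical over the flanking interval, and REHH as the ratio of this value to the pooled EHH of all other core haplotypes. This is a minimal sketch of those definitions, not the SWEEP implementation; the data layout (one core string and one flanking string per chromosome) and the function names are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def _identical_pairs(extended_haplotypes):
    """Number of chromosome pairs sharing the same extended haplotype."""
    return sum(comb(c, 2) for c in Counter(extended_haplotypes).values())


def ehh(chromosomes, core):
    """EHH of one core haplotype.

    chromosomes: list of (core_haplotype, flanking_haplotype) string pairs.
    """
    ext = [flank for c, flank in chromosomes if c == core]
    total_pairs = comb(len(ext), 2)
    return _identical_pairs(ext) / total_pairs if total_pairs else float("nan")


def rehh(chromosomes, core):
    """EHH of `core` divided by the pooled EHH of all other core haplotypes."""
    identical = 0
    total = 0
    for other in {c for c, _ in chromosomes if c != core}:
        ext = [flank for c, flank in chromosomes if c == other]
        identical += _identical_pairs(ext)
        total += comb(len(ext), 2)
    if total == 0 or identical == 0:
        return float("inf")  # no homozygosity left on the other cores
    return ehh(chromosomes, core) / (identical / total)


# Hypothetical data: core "A" keeps one long haplotype, core "B" has two.
data = [("A", "xx")] * 4 + [("B", "xx")] * 2 + [("B", "xy")] * 2
print(ehh(data, "A"), ehh(data, "B"), rehh(data, "A"))
```

A core haplotype that is both frequent and associated with a high REHH at long distance is the signature that the test looks for.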

We inferred the population growth rate r and the age g of NAT2 non-synonymous mutations by the use of a joint maximum-likelihood estimation of these parameters, as described in a previous study.29 Since demographic events leave the same signature on the whole genome, we also estimated the growth rate and age of all the nine intergenic SNPs and considered them as references to discern the respective influences of population expansion and selection on NAT2 non-synonymous mutations. The recombination parameter required for these analyses was estimated by comparing deCODE and Marshfield genetic and physical distances in the NAT2 region (UCSC Genome Bioinformatics). We performed our estimations with the effective population size Ne fixed to 20 000 individuals. We also used the data set from Patin et al10 to assess the mutations’ ages in populations from Western Eurasia (Ashkenazi Jews, Sardinians, Swedes, Saami populations) and Eastern Eurasia (Gujarati from India, Chinese and Thai). We also added the agriculturalist Turkmen (TK) population from Uzbekistan.10

Results

Sequence variation at the NAT2 gene region

To evaluate the extent to which natural selection has shaped the patterns of variability of the NAT2 gene in Central Asian populations, we re-sequenced NAT2 in six populations presenting different lifestyles (see Materials and methods). We identified only six SNPs in the NAT2 coding region (282C>T, 341T>C, 481C>T, 590G>A, 803A>G and 857G>A), all of which have been previously reported.8, 10 All populations were in Hardy–Weinberg equilibrium at all SNPs. Haplotype reconstruction and frequency estimation were performed in each population separately (Table 1). Among haplotypes encoding an altered NAT2 protein (noted ‘slow’ in Table 1), the Tajiks and the Kazakhs exhibited the highest frequencies of haplotype NAT2*5B (TJA: 26%; TJU: 22% and KZ: 23%), which carries the non-synonymous mutation 341T>C encoding the NAT2 enzyme with the most altered activity as compared to the other NAT2 slow proteins.8, 9 The NAT2*6A haplotype, which carries the non-synonymous mutation 590G>A, was present at high frequencies, especially in Tajik populations (TJA: 39% and TJU: 45%), while it was found at the lowest frequency in Kazakhs (KZ: 13%). Finally, the Kazakhs exhibited the highest frequency (23%) of NAT2*7B, which carries the non-synonymous slow-acetylation mutation 857G>A, a variant mainly restricted to East Eurasian populations.10 Among ‘fast’ haplotypes, NAT2*4, which is the ancestral state of the gene and is defined as the reference NAT2 haplotype,30 was found at high frequency in the Kirghizs (KRA: 48% and KRM: 46%) and in the Uzbeks (UZ: 42%), at nearly two-fold lower frequencies in the Tajiks (TJA: 23% and TJU: 26%) and at intermediate frequency in the Kazakhs (KZ: 38%).

Table 1 Allelic composition and frequency of NAT2 haplotypes in Central Asia

Acetylation phenotype inference

We inferred from NAT2 genotypes the distribution of fast/slow acetylation phenotypes across populations (Table 2). Heterozygotes bearing a ‘fast’ and a ‘slow’ haplotype were considered ‘intermediate acetylators’, because they present a mean intermediate NAT2 activity significantly different from that of fast homozygotes.31 Phenotype frequencies showed strong variation among the different studied populations. Indeed, the Tajiks exhibited significantly higher proportions of slow acetylators (TJA: 63% and TJU: 55%; χ2 test, P<0.05) as compared to the Uzbeks, the Kirghizs and the Kazakhs, who presented proportions of slow acetylators ranging from 26% (KRA) to 35% (KZ).
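The inference rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: the set of slow haplotypes is limited to those named in the text (NAT2*5B, NAT2*6A and NAT2*7B), any other haplotype is treated as fast by assumption, and the diplotypes in the example are hypothetical rather than taken from Table 1.

```python
from collections import Counter

# Slow haplotypes named in the text; treating everything else as fast
# is a simplifying assumption made for this illustration.
SLOW_HAPLOTYPES = {"NAT2*5B", "NAT2*6A", "NAT2*7B"}


def acetylator_phenotype(hap1, hap2, slow=SLOW_HAPLOTYPES):
    """Classify an individual from its two NAT2 haplotypes."""
    n_slow = (hap1 in slow) + (hap2 in slow)
    return ("fast", "intermediate", "slow")[n_slow]


def phenotype_frequencies(diplotypes, slow=SLOW_HAPLOTYPES):
    """Proportion of each acetylation phenotype in a sample of diplotypes."""
    counts = Counter(acetylator_phenotype(a, b, slow) for a, b in diplotypes)
    n = len(diplotypes)
    return {k: counts[k] / n for k in ("fast", "intermediate", "slow")}


# Hypothetical sample of four individuals.
sample = [
    ("NAT2*4", "NAT2*4"),
    ("NAT2*4", "NAT2*5B"),
    ("NAT2*6A", "NAT2*5B"),
    ("NAT2*6A", "NAT2*7B"),
]
print(phenotype_frequencies(sample))
```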

Table 2 Acetylation phenotype frequencies of Central Asian populations

Population differentiation

We analysed the pattern of population differentiation by using the FST statistics, estimated from haplotype frequencies (Table 3a). When considering the NAT2 coding region, the tests of exact differentiation for all pairs of populations first indicated that both Tajik populations (TJA and TJU) were significantly differentiated from the Kirghiz and Kazakh populations (KRA, KRM and KZ) (FST=0.046–0.078; P<0.05). Conversely, Tajiks were not significantly differentiated from the Uzbek (FST=0.027; P=0.058) and Turkmen populations. In addition, the two Tajik populations were not differentiated from each other. Finally, the south Kirghiz population (KRA) was differentiated from Turkmens. When all the SNPs were considered in the analysis (Table 3b), including the genotyped intergenic SNPs, the patterns of differentiation became slightly different: the TJA and TJU populations were no longer differentiated from the KZ population; the TJA population was no longer differentiated from the KRM population; significant FST values were weaker between the Tajiks and the other populations (FST=0.021–0.038), while the Turkmen population (TK) remained differentiated from the KRA population (FST=0.011; P=0.002). In both cases, the UZ population was undifferentiated from all the other populations.
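To give an intuition for what a pairwise FST measures, the sketch below computes a simple heterozygosity-based FST, (HT − HS)/HT, from the haplotype frequencies of two populations. This is a deliberately simplified estimator and not the one implemented in Arlequin: it ignores sample-size corrections, and the function name and frequencies are hypothetical.

```python
def fst_from_frequencies(freqs1, freqs2):
    """Heterozygosity-based FST between two populations.

    freqs1, freqs2: dicts mapping haplotype name to frequency (each sums to 1).
    Populations are given equal weight; no sample-size correction is applied.
    """
    haplotypes = set(freqs1) | set(freqs2)
    # Expected heterozygosity within each population.
    hs1 = 1 - sum(freqs1.get(h, 0.0) ** 2 for h in haplotypes)
    hs2 = 1 - sum(freqs2.get(h, 0.0) ** 2 for h in haplotypes)
    hs = (hs1 + hs2) / 2
    # Expected heterozygosity of the pooled population.
    ht = 1 - sum(((freqs1.get(h, 0.0) + freqs2.get(h, 0.0)) / 2) ** 2 for h in haplotypes)
    return (ht - hs) / ht if ht > 0 else 0.0


# Hypothetical haplotype frequencies in two populations.
pop1 = {"fast": 0.5, "slow": 0.5}
pop2 = {"fast": 0.25, "slow": 0.75}
print(fst_from_frequencies(pop1, pop2))
```

Identical frequency profiles give a value of 0 and populations fixed for different haplotypes give 1; the values reported in Table 3 lie close to the lower end of this range.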

Table 3 Pairwise estimates of FST between Central Asian populations using (a) the nine intergenic SNPs (below diagonal) and the six NAT2 SNPs (above diagonal) and (b) all 15 genotyped SNPs (below diagonal) and phenotype frequencies (above diagonal)

Statistical tests of neutrality

Estimates of diversity (π and θW) at the NAT2 gene for all populations are shown in Table 4. To test whether patterns of DNA sequence variation at NAT2 fit expectations under the neutral model of evolution, we analysed the sequences by use of DT, D* and F* (Table 4). For all populations, DT was positive and ranged from 1.074 (KRM) to 2.08 (TJA). It reached significance in the Kazakhs (DT=1.657, P=0.044) and in both Tajik populations TJU and TJA (DT=1.745 and 2.08; P=0.044 and 0.02, respectively). D* ranged from 1.125 (KRA) to 1.181 (KRM) and no value was significant. In contrast, F* exhibited the same trend as DT: all values were positive, ranged from 1.327 (KRA) to 1.709 (TJA) and reached significance for the same three populations KZ, TJU and TJA (F*=1.554, 1.584 and 1.709; P=0.039, 0.035 and 0.028, respectively). The significance of these three statistics was assessed without any assumption about the demography of the populations (ie neither population growth nor bottleneck events were taken into account).

Table 4 Summary statistics and tests of neutrality based on the sequenced NAT2 coding and flanking regions in Central Asian populations

Long-range haplotype test for recent directional selection

We next investigated whether recent directional selection has influenced the NAT2 region by the use of the LRH test. This test is designed to detect recent directional selection by comparing two indicators of allele age: the allele frequency and the breakdown of LD around the allele.28 An allele under recent directional selection will typically be too frequent given its age. P-values were estimated for all core haplotypes in all populations from the simulated distribution and from empirical data. None of the haplotypes in any of the sampled populations appeared to deviate from simulated neutral and empirical genomic expectations.

Growth rate and mutation age

LD breakdown at the sites surrounding a mutation is very informative for estimating allele age, with the recombination rate acting as a ‘genetic clock’. Taking into account the results of neutrality tests indicating a possible effect of natural selection, we investigated growth rates and ages of both functional NAT2 mutations and intergenic mutations as a reference (Table 5). Our estimations showed that the non-functional (intergenic) mutations exhibited a mean growth rate of 0.012 (CI=0.008; 0.020). Conversely, we observed that the functional mutations did not all experience the same history among the different populations. The mutation 341T>C, together with mutations 481C>T and 803A>G, which are strongly associated with 341T>C (Table 1), showed a high growth rate in KRM and TK populations (mean r=0.029, CI=0.018; 0.049), while both Tajik populations TJA and TJU presented a weak growth rate (mean r=0.0115, CI=0.006; 0.021) in the range of growth rate estimates for intergenic mutations. The remaining KRA, KZ and UZ populations showed intermediate growth rates (with overlapping confidence intervals). The non-synonymous mutation 590G>A exhibited a high growth rate in the KRA, KRM and KZ populations (mean rate=0.023), while this value was weaker (of the same order as for intergenic mutations) in the four remaining populations TJA, TJU, TK and UZ (mean rate=0.013). The mutation 857G>A exhibited a different pattern with no clear tendency, KZ showing the highest growth rate (r=0.021) and KRA the weakest (r=0.012). No growth rate estimate was obtained for 857G>A in the Tajik populations because the mutation frequency in these populations was too low.

Table 5 Mutation frequency P, growth rate r and mutation age g with 95% confidence intervals for all NAT2 functional mutations in Central Asian populations

Linkage disequilibrium levels

We determined pairwise LD between SNPs using D′ and r2, which are useful for modelling recombination rates and association power, respectively.32 LD generally decreased with physical distance, although large variation in the LD–distance relationship was observed between populations (Figure 2). The Tajik population TJA exhibited a steeper decrease in LD than the other populations. Conversely, the KZ population presented the slowest decrease in LD with distance.

Figure 2

Levels of linkage disequilibrium (as measured by the significance of D′) against physical distance at NAT2 region in Central Asian populations.

Discussion

The diversity survey of the NAT2 gene in Central Asians revealed a clear contrast between the Tajik populations and the remaining groups. Both Tajik groups exhibited a frequency of slow acetylators (0.55 and 0.63) about twice as high as in the other populations (KZ, UZ, KRA and KRM: 0.26–0.35). Accordingly, the frequencies of the fast haplotype NAT2*4 (0.23 and 0.26) were approximately half of the NAT2*4 frequency in the remaining four populations (0.38–0.48). Interestingly, the lifestyle of both Tajik populations strongly differs from that of the remaining populations. Indeed, Tajiks shifted to a sedentary lifestyle the earliest and were the first to practice agriculture in Central Asia.15 This early transition might therefore have implied differences in the dietary conditions of Tajiks, as compared to nomadic populations. The harsh climatic conditions of Central Asia might have discouraged the spread of agriculture, but it is likely that some fertile spots, like the Fergana Valley (between Kirghizstan and Uzbekistan) and other areas where water was available, could have sustained early farming communities, such as the Tajiks. A similar dichotomous pattern at the NAT2 gene has been observed among sub-Saharan Africans, where the Bantu-speaking agriculturalists presented a higher frequency of slow acetylators than the Pygmy hunter-gatherers (0.46 and 0.10, respectively).33 These significant differences in the frequency distribution of slow/fast acetylation phenotypes depending on lifestyle, attested by significant FST estimates, strongly suggest that being a slow acetylator has been an advantage in long-term agriculturalist populations in Central Asia.

The results of sequence-based neutrality tests give further support to this observation. Indeed, the agriculturalist Tajik populations exhibited significantly positive Tajima's DT and Fu and Li's F* values, whereas NAT2 variation in nomadic populations was compatible with neutrality. The significantly positive values observed for these neutrality tests in the two Tajik populations (TJA and TJU), as well as in the Kazakh population (KZ), are due to an excess of haplotypes at intermediate frequency. This observation can be explained by three different, non-mutually exclusive hypotheses: (1) demographic processes,34 (2) balancing selection or (3) directional selection on standing variation.35 In the context of a demographic explanation, population subdivision may lead to an excess of intermediate-frequency haplotypes36 when two isolated populations are wrongly treated as a single population. Recent bottlenecks can also generate the same diversity patterns, because they involve the loss of low-frequency haplotypes.37 Because both population subdivision and bottlenecks tend to increase allelic association over distance, we would expect LD levels in agriculturalist populations to be higher than in other populations.38 In fact, we observed the inverse pattern in the NAT2 region, the TJA and TJU populations showing the strongest decreases of LD with distance (Figure 2). In addition, if the patterns of genetic diversity at NAT2 were the sole result of demography and genetic drift, its diversity would be comparable to that inferred from neutral markers. Yet the uniparentally inherited mitochondrial DNA and Y-chromosome markers show no signs of population subdivision or reduction in Tajik populations (E Heyer, unpublished data). In this view, the hypothesis that demographic forces gave rise to the significantly positive Tajima's DT and Fu and Li's F* values at NAT2 seems unlikely.

The second hypothesis invokes the action of balancing selection in the form of heterozygote advantage. This selective regime can maintain different alleles at intermediate frequencies over time, yielding an excess of intermediate-frequency alleles. The significantly positive values of DT observed at the NAT2 gene support the action of balancing selection favouring heterozygous individuals having both a ‘fast’ and a ‘slow’ NAT2 haplotype. Indeed, we observed a frequency distribution of NAT2 fast and slow haplotypes in Kazakhs (0.38 vs 0.62) and Tajiks (0.25 vs 0.75) that could be consistent with balancing selection. However, balancing selection favours heterozygotes only if the heterozygote status represents a distinct phenotype compared to homozygotes. Depending on the substrate used to test NAT2 acetylation activity, fast/slow heterozygotes may or may not have a phenotype distinct from those of fast and slow homozygotes. In fact, when phenotypes are determined with caffeine as a test substrate, a distinct dose effect can be observed: fast/slow heterozygous individuals present an intermediate phenotype between the phenotypes of fast and slow homozygotes.31 Given that this intermediate acetylator status is observed in humans, we can easily imagine why being a fast/slow heterozygote would be advantageous: fast acetylators would be at a disadvantage when exposed to xenobiotics that become harmful when acetylated, whereas slow acetylators would be at a disadvantage when exposed to xenobiotics that become harmless when acetylated. Because the human chemical environment probably comprises both types of xenobiotic molecules, the best strategy to avoid the harmful effects of numerous toxic products would be to have an intermediate acetylator status. It is also interesting to note that fast/fast homozygotes are more susceptible to colon cancer whereas slow/slow homozygotes are more susceptible to bladder cancer. In this view, being a fast/slow heterozygote could be an advantage as compared to the two homozygous states.
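The way heterozygote advantage maintains both alleles can be illustrated with the classical one-locus model of overdominance. In the sketch below, fast/fast, fast/slow and slow/slow genotypes are assigned relative fitnesses 1−s, 1 and 1−t, and the frequency of the fast allele converges to t/(s+t) whatever its starting value. The selection coefficients are arbitrary values chosen for illustration and are not estimates from this study.

```python
def next_frequency(p, s, t):
    """One generation of selection under heterozygote advantage.

    p: frequency of the fast allele; genotype fitnesses are
    fast/fast = 1 - s, fast/slow = 1, slow/slow = 1 - t.
    """
    q = 1 - p
    mean_fitness = p * p * (1 - s) + 2 * p * q + q * q * (1 - t)
    return (p * p * (1 - s) + p * q) / mean_fitness


def equilibrium_frequency(p0, s, t, generations=10000):
    """Iterate the recursion from p0 and return the final allele frequency."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, s, t)
    return p


# Arbitrary coefficients: selection against fast/fast three times stronger
# than against slow/slow, giving an expected equilibrium of t / (s + t).
s, t = 0.03, 0.01
print(equilibrium_frequency(0.9, s, t), t / (s + t))
```

Under this toy parameterization the fast allele settles at a frequency of one quarter, which shows that an intermediate equilibrium does not require equal selection against the two homozygotes.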

Alternatively, the observed patterns of genetic diversity at the NAT2 gene can result from directional selection acting on standing low-frequency alleles. Indeed, human adaptation to new environments is more likely to have involved alleles already present at appreciable frequency, resulting eventually in an excess of intermediate-frequency alleles. For example, directional selection acting on standing variation with P=0.05, the frequency at which a given allele starts to be favoured, can lead to positive values of DT.35 In our case, the allele carrying the slow mutation 341T>C might have been selectively neutral and present at low frequency in Central Asia prior to the emergence of agriculture. Subsequently, it might have been selected with ensuing environmental changes, such as the appearance of new toxic or antibiotic molecules linked with agricultural practices (for instance, cereal storage34, 39). However, the LRH test, which is designed to directly detect the action of recent directional selection, was not significant for any of the NAT2 haplotypes, in contrast with the previous results observed in the Turkmen population.10 Several reasons can explain why this test failed to detect any signature of directional selection. First, the selective advantage conferred by NAT2 mutations might have been too weak to be detected by the LRH test. This hypothesis is likely if we assume a global advantage of being a slow acetylator. In this case, each of the three mutations altering the NAT2 activity (341T>C, 590G>A and 857G>A) would have been advantageous, which would globally weaken the individual signature of selection at each altering mutation. Second, demographic events can affect the extent of LD around a given allele. The LRH test corrects for the spurious effects of demography by dividing the extent of LD around the tested core haplotype by the extent of LD around the remaining core haplotypes, which are all equally influenced by demography and expected to evolve under neutrality. However, this assumption can be misleading in the NAT2 case, because a global advantage of being a slow acetylator would have favoured several NAT2 core haplotypes bearing an altering mutation, a situation that would undermine the test's correction for demography.

Several other aspects of NAT2 diversity are consistent with a scenario where several mutations are simultaneously targeted by directional selection. For example, growth rates and FST values at sites under directional selection are expected to be higher than at neutral sites. We observed such trends at NAT2-altering mutations (Tables 3a and 5), even if they did not reach statistical significance. In addition, a selective advantage of being a slow acetylator would make altering mutations increase slowly in frequency, which could also explain the significant excess of intermediate-frequency alleles detected by Tajima's DT and Fu and Li's F*. This scenario is further supported by the fact that the populations that show the highest and most significant values of Tajima's DT (TJA, TJU, KZ and TK) correspond to the populations with the highest proportions of slow haplotypes (correlation coefficient r2=0.95, P=0.0038). Theoretical predictions are nevertheless required to conclude if the genetic diversity at NAT2 is best explained by this particular case of ‘multiallelic’ directional selection.

In conclusion, the patterns of genetic diversity at the NAT2 gene in six Central Asian populations revealed striking differences between long-term agriculturalists and nomadic populations, in terms of both genetic variability and acetylation phenotypes. As previously reported in other geographic regions,10 these results further support the view that the acetylation status of agriculturalist populations has been the target of natural selection, in the form of balancing or directional selection. The NAT2 example, together with an increasing number of genes showing signs of recent directional selection, suggests that the selective pressures acting on our species have frequently changed due to the rapid evolution of human-driven environments. In this context, detailed evolutionary studies of populations with well-defined lifestyles will provide valuable clues to better understand the consequences that major human cultural transitions have had on genome diversity in populations, both neutral and disease-related.