Introduction

During the Neolithic Revolution (8500–2500 BC), many human populations shifted from a nomadic hunter-gatherer lifestyle to a farming or herding mode of subsistence.1 This transition was accompanied by dramatic changes in diet and likely reshuffled selective pressures acting on metabolic genes, presumably with local adaptations in populations with contrasted lifestyles (for a review, see Brown2). More recently, in many populations, the Industrial Revolution yielded new nutritional and cultural changes, which were accompanied by a marked rise in the prevalence of metabolic disorders like type 2 diabetes (T2D). T2D, a major public health problem,3 is characterized by an elevation of blood glucose levels due to prolonged impaired insulin secretion and/or insulin resistance, that is, a reduced efficiency of insulin in promoting the uptake of glucose into cells, where it is later used for energy. To date, around 40 genetic variants have been associated with polygenic T2D.4, 5 These causal variants are expected to reduce fitness even if the mean age of onset for T2D is relatively late. Indeed, as observed in other diseases,6 the variance in age of onset could lead to enough cases occurring during reproductive life to compromise the fitness of individuals carrying a risk allele. Furthermore, causal variants of T2D may also affect fitness indirectly, through their association with other syndromes occurring earlier in life, notably insulin resistance and cardiovascular diseases7 or gestational diabetes.8 As a result, one would expect these variants to be selected against by natural selection, and therefore the high prevalence of T2D as well as its variability among populations has remained a puzzle in evolutionary biology.9

Several hypotheses have been put forward to explain this paradox. Most of them hypothesized a past evolutionary advantage of carrying T2D risk variants. The ‘thrifty genotype’ hypothesis10, 11, 12 proposed that in populations facing strong food insecurity (hunter-gatherers, but also pre-industrialized farmers13, 14), insulin resistance was selected for because it ensured a more efficient use of available nutritional resources. On the other hand, the ‘carnivore connection’ hypothesis15, 16 proposed that insulin resistance was selected for in herders and hunter-gatherers, but not in farmers, as an adaptive response to a low-carbohydrate, protein-rich diet. Finally, the ‘variable disease selection’ hypothesis17 suggested that thrifty variants were selected for in response to infectious diseases,18 as the immune response has energetic costs that, in turn, affect the metabolic syndrome. Although they rely on different mechanisms, these hypotheses all predict that the variants associated with the risk of T2D were favored in pre-Neolithic hunter-gatherer populations, as well as in some, but not all, post-Neolithic populations.

On the other hand, Allen and Cheer19 proposed that milk consumption in Europe was responsible for an increased uptake of glucose in the diet, providing an opportunity for the selection of protective (non-thrifty) variants. This mechanism would lead to the positive selection of protective variants in milk-consuming populations, which might explain the low prevalence of T2D in Europeans. Hancock et al.20 proposed that genes associated with common metabolic diseases like T2D have been targeted by selection during adaptation to climate, leading to the increase in frequency of alleles with potentially opposing effects on disease susceptibility. Lastly, non-adaptationist theories have also been proposed, suggesting that drift alone might be responsible for the differences in prevalence observed between populations.21, 22

Which hypothesis do the genetic data support? A handful of studies have investigated the signatures of selection on candidate genes for T2D,23, 24, 25, 26, 27, 28, 29 but most of them did not assess whether the risk or the protective alleles were targeted by selection. In contrast, Vander-Molen et al.’s30 and Helgason et al.’s31 studies reported signatures of positive selection on protective haplotypes for T2D, the latter study providing evidence for an increase in frequency of these haplotypes 8400 years ago. These results seemingly contradict the thrifty genotype hypothesis,32 but as each study was based on the analysis of a single gene (CAPN10 and TCF7L2, respectively), they cannot be considered definitive evidence. Hancock et al.20 analyzed worldwide correlations between climate variables and allele frequencies at metabolic genes and found evidence of selection on both protective (in LEPR and PON1) and risk haplotypes (in FABP2 and EPHX2). Klimentidis et al.33 found that for three genes, selection targeted the risk haplotypes (in IGF2BP2, WFS1 and SLC30A8) and that T2D-associated loci are highly differentiated on a worldwide scale, with Sub-Saharan Africans and East Asians being particularly prone to positive selection. Given that the latter two studies did not provide estimates of the timing of the selective event, it remains unclear which evolutionary hypothesis is supported by the genetic data.

In this study, we investigated the selective pressures acting on genes associated with T2D in two neighboring populations with contrasted lifestyles and dietary habits. We concentrated on Central Asia, a region with high ethnic diversity, which has been the focus of a recent population genetics survey.34 We studied a population of ancestrally nomadic herders (Kyrgyz) and a population of long-term agriculturalists (Tajiks), given that the main hypotheses to explain the paradox of the high prevalence of T2D proposed that some lifestyles and/or dietary habits could have provided a selective advantage to individuals carrying T2D risk variants. We analyzed re-sequencing data and genotyping data in genomic regions located around 10 T2D-associated mutations and in 20 presumably neutral regions in order to (i) investigate the selective patterns in one herder and one farmer population, controlling for the unknown demographic history; (ii) identify which allele (risk or protective) has been or still is targeted by selection; and (iii) infer the timing of onset of selection (before or after the Neolithic).

Material and methods

DNA samples and re-sequencing

We collected DNA from 40 Kyrgyz individuals in Bishkek, Kyrgyzstan, as well as 39 Tajik individuals in Bukhara, Uzbekistan (see Supplementary Note 1). We chose 10 mutations associated with T2D or related phenotypes, in FABP2, PPARG, TCF7L2, LEPR, KCNJ11, SLC30A8, HHEX, CDKAL1, KCNQ1 and PON1 (see Supplementary Table S1 and Supplementary Note 2) and obtained re-sequencing data for 1.1 kb regions around them. We also sequenced 20 presumably neutral regions, designed by Patin et al.35 (average length: 1.3 kb). PCR and sequencing reactions were performed as described in Supplementary Note 1 and Supplementary Table S2.

Sequence analyses and within-population neutrality tests

We used DnaSP version 536 to estimate several genetic diversity statistics and to compute neutrality tests based on the site-frequency spectrum (Tajima’s D,37 Fu and Li’s D and F38 and Fu’s Fs39). Significance was tested by means of 10 000 coalescent simulations for each test statistic and each sequence, using DnaSP, assuming a large constant population size and a neutral infinite-sites model of mutation. The simulations were conditioned on the observed level of nucleotide diversity (θ) in each population, estimated as the average number of nucleotide differences between individuals per sequence and per population. We also computed Zeng et al.’s40 E statistic and tested its significance in the same manner as for the other tests, using the Java program kindly provided to us by K. Zeng. One-tailed P-values were computed as the probability of obtaining lower values than the ones observed, further transformed into two-tailed P-values as (1–2|P–0.5|), and then corrected for the false discovery rate (FDR) for multiple testing.41
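The P-value handling described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names are hypothetical, and the FDR step assumes the Benjamini-Hochberg procedure, which the text does not name explicitly.

```python
def two_tailed(p_lower):
    """Fold a lower-tail probability P into a two-tailed P-value: 1 - 2|P - 0.5|."""
    return 1.0 - 2.0 * abs(p_lower - 0.5)


def benjamini_hochberg(p_values):
    """Return FDR-adjusted P-values (Benjamini-Hochberg, assumed here) in input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical lower-tail probabilities for four test statistics.
lower_tail = [0.995, 0.40, 0.02, 0.75]
corrected = benjamini_hochberg([two_tailed(p) for p in lower_tail])
print(corrected)
```

The folding step makes a statistic that is extreme in either direction (for instance a very negative or a very positive Tajima’s D) yield a small P-value before the multiple-testing correction is applied.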

Between-populations neutrality tests

Indices of population differentiation (FST) were computed with Genepop v.4.7.42 We used the software package Dfdist43 to generate 1 000 000 coalescent simulations in a symmetrical 10-demes island model at migration-drift equilibrium,44 conditional on the observed level of differentiation measured at the 147 presumably neutral SNPs (FST=0.006). As Dfdist was originally designed for the analysis of bi-allelic, dominant markers (see, eg,45), we modified it in order to simulate co-dominant, bi-allelic markers (see, eg,46). The coalescent simulations were performed using θ=2nNμ=0.2 (where n is the number of demes of size N, and μ is the mutation rate), in order to match the observed overall gene diversity of the presumably neutral SNPs in the pooled sample (He=0.182). We checked that the distribution of FST conditional on heterozygosity was robust to a range of alternative values (from θ=0.02 to θ=2.0; results not shown). One-tailed P-values were computed for the 10 genic mutations (probability that the simulated FST was as large as or larger than the one observed, ie, an excess of differentiation) and corrected for the FDR in multiple testing.41
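To make the differentiation index concrete, the sketch below computes FST for a single bi-allelic SNP sampled in two populations. It uses Hudson's estimator, which is swapped in here for brevity and is not the estimator implemented in Genepop; the function name and the example frequencies are hypothetical.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's FST estimator for one bi-allelic SNP.

    p1, p2: frequency of the same allele in populations 1 and 2.
    n1, n2: number of sampled chromosomes (2 x individuals for diploids).
    """
    # Numerator: squared frequency difference, corrected for sampling noise.
    numerator = ((p1 - p2) ** 2
                 - p1 * (1 - p1) / (n1 - 1)
                 - p2 * (1 - p2) / (n2 - 1))
    # Denominator: probability that two chromosomes drawn from
    # different populations carry different alleles.
    denominator = p1 * (1 - p2) + p2 * (1 - p1)
    if denominator == 0:
        raise ValueError("SNP is fixed for the same allele in both samples")
    return numerator / denominator


# Hypothetical frequencies with 40 and 39 diploid individuals sampled.
print(hudson_fst(0.30, 0.45, 80, 78))
print(hudson_fst(0.50, 0.50, 80, 78))  # identical frequencies
```

Because of the sampling correction, the estimate can be slightly negative when the two samples have nearly identical allele frequencies, which is how small negative FST values arise in practice.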

Haplotype-based tests of selection

We used the data from a companion paper, for which all individuals were genotyped using the Illumina microarray Human-660W-Quad v1.0 (Paris, France) (see Supplementary Note 3). SNPs were phased with the software fastPhase 1.4,47 using population label information to estimate phased haplotypes. We computed the integrated haplotype scores (iHS) to compare the decay of homozygosity around the mutations of interest between the ancestral and the derived background, in each population, using the rehh package.48 The iHS estimates were standardized per allelic frequency bins, using genome-wide data.49 P-values for the 10 candidate genic mutations were corrected for the FDR in multiple testing.41
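The standardization step can be illustrated with a short sketch. It assumes the usual definition of the unstandardized score as the log ratio of the integrated haplotype homozygosity on the ancestral and derived backgrounds, and a bin width of 0.05; both the bin width and the function names are assumptions made for illustration, not details taken from the rehh package.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def raw_ihs(ihh_ancestral, ihh_derived):
    """Unstandardized score: ln(iHH_ancestral / iHH_derived).

    With this sign convention, negative values indicate unusually long
    haplotypes on the derived background.
    """
    return math.log(ihh_ancestral / ihh_derived)


def standardize_ihs(raw_scores, derived_freqs, bin_width=0.05):
    """Standardize raw scores within bins of derived allele frequency."""
    n_bins = round(1 / bin_width)
    bins = defaultdict(list)
    for idx, freq in enumerate(derived_freqs):
        # Clamp so that a frequency of exactly 1.0 falls in the last bin.
        bins[min(int(freq / bin_width), n_bins - 1)].append(idx)
    standardized = [None] * len(raw_scores)
    for indices in bins.values():
        if len(indices) < 2:
            continue  # a standard deviation needs at least two SNPs
        values = [raw_scores[i] for i in indices]
        mu, sigma = mean(values), stdev(values)
        if sigma == 0:
            continue  # all scores identical: leave as None
        for i in indices:
            standardized[i] = (raw_scores[i] - mu) / sigma
    return standardized
```

In a genome-wide setting each bin contains thousands of SNPs, so the bin means and standard deviations are stable; SNPs in bins that are too sparse are left as `None` in this sketch.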

For mutations with a significant iHS value (corrected P-value <0.05), we used Austerlitz et al.’s method50 as implemented in their Mathematica51 notebook. This method provides a maximum-likelihood estimate of the time elapsed since the appearance of the mutation and its intrinsic growth rate, using the number of copies of the mutant allele in the population and the level of allelic association between this allele and one or several closely linked markers. Nine SNPs were chosen for that purpose on each side of each target SNP, at about 20, 35, 50, 75, 100, 125, 150, 200 and 250 kb, respectively, from the target mutation. We assumed a population size of 100 000 individuals for the computations and checked that alternative choices of population size (10 000 or 1 000 000) did not affect our results.
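As a rough aid to interpreting the growth rates reported by this method, the sketch below converts a per-generation growth rate and an age in years into the expected number of copies descending from a single mutant. It is a deterministic back-of-the-envelope calculation, not the maximum-likelihood procedure itself; the generation time of 25 years and the function name are assumptions that do not come from the text.

```python
def expected_copies(growth_rate, years, generation_time=25.0):
    """Expected number of copies after `years`, starting from one copy.

    growth_rate: mean number of copies left per copy and per generation
                 (assumed to be a per-generation multiplier).
    generation_time: years per generation (assumed value, not from the text).
    """
    generations = years / generation_time
    return growth_rate ** generations


# Two hypothetical alleles of the same age but different growth rates.
for rate in (1.010, 1.027):
    print(rate, round(expected_copies(rate, 7500)))
```

Small differences in growth rate compound over hundreds of generations, which is why growth rates that look similar can correspond to very different present-day allele counts.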

Results

Genetic diversity

Genetic diversity statistics are provided in Table 1 for genic candidate sequences and in Supplementary Table S3 for presumably neutral sequences. Overall nucleotide diversity was not significantly different between presumably neutral and genic candidate sequences (on average 1 × 10⁻³ vs 1.5 × 10⁻³, respectively, Wilcoxon’s rank sum test, P-value=0.29). The highest nucleotide diversity was observed in FABP2, in both populations: 4.1 × 10⁻³ in Kyrgyz and 4.3 × 10⁻³ in Tajiks. We tested for deviation from Hardy–Weinberg equilibrium for the 208 observed polymorphisms in each population. Four mutations, located in presumably neutral sequences, departed significantly from Hardy–Weinberg equilibrium (Chi-square test, P-value <0.001). These polymorphisms were re-genotyped and confirmed by an independent PCR.
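For reference, a Chi-square test of Hardy–Weinberg equilibrium at a bi-allelic site can be sketched as below. This is a generic textbook version with one degree of freedom, not necessarily the exact implementation used for the results above; the function name and genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one bi-allelic site.

    Returns (chi-square statistic, P-value with 1 degree of freedom).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("site is monomorphic; the test is undefined")
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts in a sample of 40 individuals.
print(hwe_chi_square(12, 16, 12))
```

With samples of about 40 individuals, expected counts of the rarer homozygote are often small, so an exact test may be preferable for low-frequency variants.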

Table 1 Summary statistics of genetic diversity per genic region and per population

Neutrality tests within population

As shown in Supplementary Table S4, we found that only one genomic region departed from neutrality, namely the region around rs1799883 in FABP2 (Tajima’s D=2.88, FDR-corrected P-value=0.045) in Kyrgyz. No significantly positive values were found in the presumably neutral sequences. Given that the candidate mutations were associated with the phenotype of interest through genome-wide association studies (GWAS), they might be biased toward intermediate allelic frequencies, which might inflate Tajima’s D values.52 This is indeed visible when we compare the allelic frequency spectrum of genic and presumably neutral sequences (compare Supplementary Figure S1a and S1b). In order to test whether such a bias could result in spurious signatures of selection, we ran the same tests on a subset of presumably neutral sequences with the same allele frequency spectrum as the target mutations (see Supplementary Figure S1c). None of the so-ascertained presumably neutral sequences departed from neutrality (see Supplementary Table S5), which suggests that the observed pattern in FABP2 is unlikely to be caused by demographic history, but could rather be taken as evidence for selection.
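The reason why intermediate-frequency variants push Tajima’s D upward can be seen from a direct computation of the statistic. The sketch below follows the standard formula of Tajima (1989) and takes, for each segregating site, the number of chromosomes carrying one of the two alleles; the function name and the example counts are hypothetical, and the code is not the DnaSP implementation.

```python
import math


def tajimas_d(allele_counts, n):
    """Tajima's D from per-site allele counts in a sample of n chromosomes.

    allele_counts: for each segregating site, the number of chromosomes
                   (between 1 and n - 1) carrying one of the two alleles.
    """
    s = len(allele_counts)  # number of segregating sites
    if s == 0 or n < 4:
        raise ValueError("need at least one segregating site and n >= 4")
    # Mean number of pairwise differences (pi), summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimators of theta, scaled by its
    # estimated standard deviation.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Five sites at intermediate frequency vs five singletons, n = 10.
print(tajimas_d([5] * 5, 10))  # positive
print(tajimas_d([1] * 5, 10))  # negative
```

Sites at intermediate frequency contribute more to the pairwise diversity than to the count of segregating sites, so a region enriched in such sites, whether through balancing selection or through ascertainment, yields a positive D.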

Between-population tests

The 10 candidate mutations tended to be more differentiated on average than neutral regions (FST=0.030 vs 0.006), although not significantly (Wilcoxon’s rank sum test, P-value=0.06). Using Beaumont and Nichols’43 approach, we found two out of these 10 candidate mutations departing from neutral expectations (Figure 1, rs1137100 in LEPR; P-value=0.02 and rs2237892 in KCNQ1; P-value=0.01). This suggests that genetic variation at these mutations may have been affected by natural selection, most probably by differential local adaptation between Kyrgyz and Tajiks.

Figure 1

Genetic differentiation (FST) as a function of heterozygosity for 10 SNPs associated with the risk of type 2 diabetes. The upper and lower lines represent the 99% confidence region expected under neutrality (see Material and Methods), with the middle line showing the median genetic differentiation under neutrality. Each circle represents a SNP associated with the risk of type 2 diabetes. The names and corresponding genes of SNPs with a significant corrected P-value (≤0.05) are shown (excess of differentiation).

Haplotype-based tests

As shown in Table 2, we found a significant iHS score for LEPR in Kyrgyz (iHS=−2.7, P-value=0.01). Using Austerlitz et al.’s50 method, we found that this mutation started to increase in frequency 7500 ya (95% confidence interval (CI95%): 6500–8900) in this population, with a growth rate of 1.027 (CI95%: 1.020–1.040). The same trend was observed in Tajiks (iHS=−1.9, P-value=0.04), even though the growth rate was much lower (1.010, CI95%: 1.007–1.016, starting 14 400 ya, CI95%: 11 900–17 600). We also observed a significantly positive iHS value for HHEX in both populations: iHS=2.9 in Kyrgyz (P-value=0.01) and 2.6 in Tajiks (P-value=0.02). This selective event started around 10 500 ya (CI95%: 8700–12 700) and 10 700 ya (CI95%: 9000–13 100), respectively, in Kyrgyz and Tajiks, with respective growth rates of 1.027 (CI95%: 1.020–1.040) and 1.021 (CI95%: 1.018–1.032). We also found a signal of selection on PON1 in Tajiks: iHS=2.0 (P-value=0.04), which was not detected using Austerlitz et al.’s50 method (very low growth rate of 1.005, CI95%: 1.003–1.008, starting 43 000 ya, CI95%: 35 500–52 200). In all cases, the haplotypes targeted by positive selection carried the protective allele, which corresponded either to the derived (LEPR) or the ancestral allele (HHEX, PON1).

Table 2 Allelic frequencies, FST and iHS values for each variant in Kyrgyz and Tajiks from Central Asia, as well as in two populations from the HGDP-CEPH panel: ASN (individuals of Japanese and Chinese ancestries) and CEU (individuals of European ancestry)

The mutation rs2237892 on KCNQ1, which was more differentiated between populations than expected under neutrality, did not present a significant iHS score in either population. However, we found two SNPs around rs2237892 with significant iHS values in Kyrgyz only (see Supplementary Note 5), suggesting that the differential selection inferred on KCNQ1 with the FST-based tests likely results from recent selection acting in Kyrgyz only.

Intriguingly, we did not find signals of selection for the mutation rs7903146 in TCF7L2, although it represents the strongest signal of association with T2D to date in other populations and the most consistent signal of selection.20, 31, 33 However, the estimated growth rate of the protective allele in Kyrgyz was higher than that of the risk variant (1.027, CI95%: 1.019–1.040 vs 1.009, CI95%: 1.007–1.017) and pointed to a selective event starting 12 000 years ago. Tajiks did not show such a signal (growth rate of 1.018, CI95%: 1.013–1.026 for the protective vs 1.017, CI95%: 1.012–1.027 for the risk allele).

Discussion

Characterizing the patterns of selection at genes associated with T2D in Central Asia

Using neutrality tests based on within-population diversity, haplotype structure, and between-population differentiation, we were able to identify complementary signals of selection among candidate genes for T2D. In particular, we found evidence for differential positive selection at rs1137100 in LEPR, at which (i) the FST between Kyrgyz and Tajiks was higher than expected under neutrality and (ii) the iHS statistic provided evidence for positive selection acting in Kyrgyz and to a lesser extent in Tajiks. We also found evidence for balancing selection (or a recent partial selective sweep or selection on standing variation53) on FABP2 in both populations, where we found a high nucleotide diversity and a significantly positive Tajima’s D in Kyrgyz. Consistently, two mutations in FABP2 presented the lowest FST estimates between Kyrgyz and Tajiks of the full data set (FST=−0.014).

Identifying the targeted alleles and the timing of selection

We did not find any evidence in Central Asia of pre-Neolithic selection favoring T2D risk variants, in contradiction with the thrifty genotype,10 the carnivore connection15, 16 and the variable disease selection17 hypotheses. However, we cannot exclude that some of these variants were selected for in a distant past, as signatures of ancient selection might be difficult to detect by means of population genetic approaches. We did not find any evidence of post-Neolithic selection acting on T2D risk variants, which contradicts the thrifty genotype hypothesis (the more recent version of which considers that food insecurity is stronger in farming populations, where it should select for thrifty variants13, 14), as well as the carnivore connection hypothesis (which predicts that insulin resistance should be selected for in herders in response to their low-carbohydrate diet).

Contrastingly, we found signatures of positive selection of protective variants (either ancestral or derived) in both populations. Our analyses further showed that selection of protective variants occurred between 5500 and 12 000 years ago (depending on the gene considered), which corresponds to the earliest stages of the Neolithic Revolution. Our results therefore suggest that protective variants were selected for during and/or after this transition, which echoes previous studies that reported signals of positive selection of protective alleles for genes associated with risks of heart diseases54 and hypertension.55, 56

Possible evolutionary scenarios

We identified footprints of selection that are likely to reflect a shift in the metabolic constraints accompanying the Neolithic transition, with protective alleles becoming advantageous. Interestingly, the KCNQ1 protective haplotype (whose frequency is higher in Kyrgyz than in Tajiks, see Table 2) has been shown to be at low frequency in populations where cereals are the main dietary component57 and under recent selection in four out of seven pastoral populations from South Asia.33 It seems therefore that this protective haplotype was recently targeted by selection in response to specific pastoral dietary habits. However, we have shown in this study that signals of selection toward protective variants are found in both herders and farmers from Central Asia, suggesting that the same phenotype is favored in both lifestyles. It could be that the input of cereals in the diet of farmers at the Neolithic is responsible for reshuffling the selective pressures on genes involved in glucose metabolism, while the consumption of milk in pastoral populations may have led to similar major metabolic changes.19 Milk consumption is indeed widespread among Kyrgyz populations (even though the frequency of lactase persistence is low58), as in the Hausa population, where a signal of selection has been detected in CAPN10.23

On the other hand, evidence of recent selection for the protective haplotype at LEPR and TCF7L2 (as we found in Kyrgyz) is also documented in East Asian populations20, 31, 33, 49 (see also the high FST and iHS values in ASN, Table 2), to which the Kyrgyz are genetically closer than the Tajiks are.34 These signatures of selection are, therefore, more likely to reflect a differential adaptation between East Asians and other populations than between herders and farmers. Similarly, a strong differentiation at HHEX was found between East Asians and other groups33 (see also the high iHS value in ASN, Table 2), a gene for which we showed that the protective mutation was under recent selection in both Kyrgyz and Tajiks. This suggests that selection might act on this gene at a broader geographical scale, encompassing both ethnic groups from Central Asia. These results point to a recent selection of protective haplotypes in Asia, which might be the result of a specific type of agriculture developed in this part of the world, or of particular climatic and/or pathogenic conditions.

Perspectives

We acknowledge that our conclusions are based on a limited number of common variants, which do not necessarily represent the genetic architecture of T2D susceptibility as a whole. This complex disease is indeed only partially explained by the common variants identified so far, and further studies based on additional risk variants are now required to complete our knowledge of the evolution of T2D susceptibility.

Much effort has been devoted so far to the search for thrifty variants, and the thrifty genotype hypothesis has had a deep impact on the therapeutic diet strategies adopted in modern societies to manage chronic diseases.59, 60 Yet our results, along with those from other authors,20, 29, 30, 31, 33 support a radically different scenario, in which protective (non-thrifty) haplotypes have been and might still be under positive selection in many populations worldwide. This suggests that the biological constraints driving the evolution of genetic variants associated with T2D are still poorly understood. There is, therefore, a need to reconsider the selective pressures acting on these genes. In particular, it is crucial to analyze additional populations with contrasted lifestyles and modes of subsistence, as most populations studied so far are farmers. Furthermore, we believe that considerable progress could be made if forthcoming studies were based on the analysis of individual populations (rather than geographical groups of populations), provided that detailed investigations of their lifestyle and mode of subsistence are undertaken. It is also important that these studies infer the time of onset of selection. Only with this information will we be able to evaluate the extent to which positive selection of protective variants has occurred since the Neolithic.