Introduction

Adaptive response to geographically restricted selective pressures, such as diet, climate, and pathogen burden, is one of the drivers of population differences in frequencies of disease-associated genetic variants.1 Advantageous variants that enhance survival against these selective pressures in human history rise to high frequency, and may now increase risk for chronic diseases as a consequence of lifestyle and ecological changes. This may have contributed to differences in the prevalence of diseases such as type 2 diabetes2, 3 and malaria,4 and traits such as lactase persistence.5 Identifying genomic loci that have been the subject of selection and discerning their relationship with disease phenotypes is instructive in understanding the adaptive genetic basis of population-level differences in the prevalence of common diseases.6, 7, 8

Several genome-wide scans for recent positive selection have identified genomic loci displaying signatures of natural selection.9 However, only a few African populations have been included in these studies.10 The rich genetic, ecological, and socio-cultural diversity of African populations is favorable for the detection of locally restricted and shared selection signals, and for elucidation of putative selective forces.10

The Wolaita are one of the indigenous Ethiopian populations that have inhabited the mid- and high-land areas of southern Ethiopia for several thousand years, and predominantly speak Wolaita, a language of the Omotic branch of the Afroasiatic language family. Compared with the HapMap samples, the Wolaita (WETH) had the smallest genetic differentiation from Kenyan Maasai (MKK) and the largest differentiation from Japanese (JPT), and lay at the farthest end of the African genetic cluster, nearest to MKK and farthest from Nigerian Yoruba (YRI).11

In the present study, we performed a genome-wide analysis to identify regions displaying evidence of recent positive selection in a sample of 120 WETH individuals. Our analyses revealed strong evidence of recent positive selection in the human leukocyte antigen (HLA) locus and in genes involved in energy metabolism. We also found that loci possibly selected for survival against pathogens and food deficiency in the past overlapped with genome-wide association study (GWAS) loci linked with protection against podoconiosis,12 and with increased risk for metabolic disorders.

Methods

Data sets

We used IlluminaHap 610 Bead Chip genotype data of 120 randomly selected individuals from the Wolaita ethnic group from Southern Ethiopia (WETH; Supplementary Figure 1) who were recruited to serve as controls in a GWAS of podoconiosis.12 Of the 551 840 autosomal single-nucleotide polymorphisms (SNPs) in the raw data set, 39 948 SNPs failed data quality filters (Supplementary Methods). The remaining 511 892 SNPs that passed quality control were merged with the HapMap data set that contained 1 440 616 SNP genotypes in 1184 individuals from 11 populations. A total of 464 642 SNPs were common to both data sets in 1075 unrelated individuals (120 from WETH and 955 unrelated individuals in HapMap 3.2). The ethno-geographical breakdown of the HapMap sample has been described in Supplementary Methods (http://hapmap.ncbi.nlm.nih.gov/cgi-perl/gbrowse/hapmap3r2_B36/).13

Tests for signatures of recent positive selection

We performed the integrated haplotype score (iHS, which identifies partial sweeps)14 and cross-population extended haplotype homozygosity (XP-EHH, which identifies complete sweeps) tests.1, 15 The SNP genotypes were phased using fastPHASE v 1.4.16 The haplotype inference and missing genotype estimation error rates in WETH were similar to or less than those in other African populations16, 17 (detailed in Supplementary Methods, Supplementary Tables 1 and 2). We performed 11 XP-EHH tests comparing WETH with each HapMap sample and one iHS test for WETH (Supplementary Methods). For every SNP, we computed FST using the method of Reynolds, Weir and Cockerham,18 and assessed statistical significance using a Bonferroni-corrected permutation test P-value as described elsewhere.19
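Both iHS and XP-EHH are built on extended haplotype homozygosity (EHH), the probability that two randomly chosen chromosomes carrying the same core allele are identical over the interval from the core SNP to a given marker. The following is a minimal sketch of that quantity only, assuming phased haplotypes stored as strings of '0'/'1' alleles; the function name, input format, and toy data are illustrative assumptions and do not reproduce the software or parameters used in this study. iHS and XP-EHH additionally integrate EHH over distance and standardize the resulting log ratios, which is not shown here.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, core_allele, to):
    """EHH between the core SNP and the SNP at index `to`.

    haplotypes  : list of equal-length strings of '0'/'1' alleles (assumed format)
    core        : index of the core SNP
    core_allele : allele ('0' or '1') whose carriers are examined
    to          : index of the SNP up to which haplotypes are extended
    """
    lo, hi = min(core, to), max(core, to)
    # Keep only chromosomes carrying the chosen allele at the core SNP.
    carriers = [h[lo:hi + 1] for h in haplotypes if h[core] == core_allele]
    if len(carriers) < 2:
        raise ValueError("need at least two carrier haplotypes")
    # Probability that two randomly drawn carriers are identical over the segment.
    identical_pairs = sum(comb(c, 2) for c in Counter(carriers).values())
    return identical_pairs / comb(len(carriers), 2)


# Toy data (hypothetical): three carriers of allele '1' at index 0, one non-carrier.
toy = ["1100", "1101", "1100", "0011"]
print(ehh(toy, core=0, core_allele="1", to=1))  # 1.0
print(ehh(toy, core=0, core_allele="1", to=3))
```

In the toy data, homozygosity among carriers is complete next to the core and decays once the third carrier diverges, which is the pattern of decay with distance that the haplotype-based tests exploit.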

Pathway enrichment analysis

We identified all genes within 100 kb up- and down-stream of SNPs in the top 0.1% iHS score and performed pathway enrichment analysis using PANTHER (http://www.pantherdb.org/).20
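The gene-selection step can be sketched as an interval lookup. This is a minimal illustration, assuming SNPs are given as (chromosome, position) pairs and genes as (name, chromosome, start, end) tuples; the function name, data layout, and coordinates are hypothetical and do not describe the annotation source or tooling used in the study.

```python
def genes_near_snps(snps, genes, flank=100_000):
    """Return the names of genes lying within `flank` bp of any SNP.

    snps  : iterable of (chrom, pos) tuples
    genes : iterable of (name, chrom, start, end) tuples
    flank : window added up- and down-stream of each gene, in base pairs
    """
    hits = set()
    for chrom, pos in snps:
        for name, g_chrom, start, end in genes:
            # A SNP within `flank` bp of a gene falls inside the widened gene interval.
            if g_chrom == chrom and start - flank <= pos <= end + flank:
                hits.add(name)
    return sorted(hits)


# Hypothetical coordinates for illustration only.
genes = [
    ("GENE_A", "6", 1_000_000, 1_050_000),
    ("GENE_B", "6", 2_000_000, 2_010_000),
    ("GENE_C", "7", 1_000_000, 1_050_000),
]
snps = [("6", 1_120_000), ("6", 1_700_000)]
print(genes_near_snps(snps, genes))  # ['GENE_A']
```

The nested loop is adequate for a few hundred SNPs; for genome-wide inputs, a sorted index or interval tree would be the more usual choice.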

Results

Signatures of recent positive selection

A total of 453 SNPs were found in the 0.1% tail of the empirical distribution of iHS; SNPs in the HLA locus were overrepresented (82/453, P<0.0001). Moreover, 6 of the 10 SNPs with the highest iHS scores were in the HLA locus (Table 1; Supplementary Tables 3 and 4). The XP-EHH test found 21 SNPs that were selected for in WETH compared with each HapMap sample; the HLA locus was overrepresented with 12 out of the 21 SNPs located near HLA-DQB1 and BTNL2 genes (Table 2). Other selected loci include NCAM1, which is involved in immunity, and ASTN2, which is expressed in the brain and previously reported to be in the top 0.1% selected loci in the Ethiopian Amhara and Ari populations.21
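Selecting SNPs in the extreme tail of an empirical distribution amounts to ranking all scored SNPs by absolute score and keeping the top fraction. The sketch below is an assumption-laden illustration of that idea: the function name, the dictionary input, and the synthetic scores are hypothetical, and the exact ranking and frequency-binning conventions used in this study are described in the Supplementary Methods rather than here.

```python
def top_tail(scores, fraction=0.001):
    """Return SNP ids whose |score| lies in the top `fraction` of all scores.

    scores   : dict mapping SNP id -> standardized score (e.g. iHS)
    fraction : size of the tail to keep (0.001 corresponds to the top 0.1%)
    """
    # Rank by absolute value so that extreme negative and positive scores both count.
    ranked = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return [snp for snp, _ in ranked[:k]]


# Synthetic scores for illustration only: 2000 SNPs, one strongly negative outlier.
scores = {f"snp{i}": i / 1000 for i in range(2000)}
scores["snp0"] = -5.0
print(top_tail(scores))  # ['snp0', 'snp1999']
```

Because the threshold is empirical, the tail always contains the same proportion of SNPs regardless of whether selection has acted; the ranking identifies outliers relative to the genome-wide background rather than testing against a neutral model.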

Table 1 Number of SNPs showing evidence of recent positive selection in WETH
Table 2 Twenty-one SNPs showing evidence of recent positive selection in WETH compared with each of the 11 HapMap populations (XP-EHH>2)

The FST test identified 27 SNPs with statistically significant differentiation between WETH and all HapMap samples (Table 3, Supplementary Table 5). The list included the obesity risk locus NEGR1, previously shown to be highly differentiated in sub-Saharan Africans.2 The FST values and iHS scores of the HLA loci had overlapping distributions (Figure 1). The SNP in the HLA locus with significant FST differentiation between WETH and all HapMap populations (rs2233971, a missense mutation in C6orf15) also had the highest iHS in WETH. This nucleotide substitution (rs2233971; chr6.hg19:g.31080323G>T), which results in an arginine-to-serine substitution in C6orf15, is predicted to be ‘possibly damaging’ by PolyPhen. We also found significant FST differentiation in the ASTN2 and HLA-DRA loci, replicating the XP-EHH and iHS findings.

Table 3 SNPs that showed significant FST-based differentiation between WETH and all HapMap populations
Figure 1 Distribution of iHS and FST in the HLA locus. (a) FST WETH with non-Africans. (b) FST WETH with HapMap Africans. (c) iHS WETH. Horizontal yellow and red dotted lines represent iHS values of 2 and the top 0.1% threshold. iHS scores quickly decay with distance from the HLA loci with peak iHS scores, consistent with expectations of selective sweeps. Regions of peak iHS in WETH also overlap with those of high FST between WETH and HapMap populations.

Reconstruction of founder haplotypes using haploPS22 (details in Supplementary Methods) found 101 signals of selection (Supplementary Table 6), and several overlaps with the iHS, XP-EHH, and FST signals (Supplementary Tables 7–9). As a proof of principle, to assess whether the allele driving the selection signal was carried by the longest haplotype, we compared haplotype lengths in the HLA-DRA locus around rs3129882 using HapFinder.23 We observed that the length of the longest haplotype increased significantly when the core haplotype frequency decreased from 25 to 20%, suggesting that the rs3129882 [G] variant (frequency=22.9%) or nearby sequence variants may be driving the selection signal (Supplementary Figure 2).

Functional prediction and pathway enrichment analyses

Functional predictions of the SNPs in the top 0.1% tail of the empirical distribution of iHS are presented in Figure 2. Eleven SNPs were exonic, of which eight were missense variants in genes implicated in nervous system development (MAPT/SPPL2C), neuropsychiatric responses (ANKK1), fertilization, muscle development and neurogenesis (ADAM28), and genes in the HLA locus (HLA-DOB and ZSCAN12; Table 4).

Figure 2 Predicted function of variants in the top 0.1% iHS. Predictions are according to Ensembl as defined by the Sequence Ontology Project.

Table 4 Eight non-synonymous variants in the top 0.1% |iHS|

The ‘T-cell activation’ PANTHER pathway and the ‘mammary gland development’, ‘cellular defense response’, ‘response to stimulus’, and ‘antigen processing and presentation’ PANTHER biological processes showed Bonferroni-corrected statistically significant enrichment (Supplementary Tables 10 and 11).

We found several selection signals in the PPARA gene locus in the XP-EHH test; SNP rs5767743 with the highest iHS score in this locus also had XP-EHH>2 when comparing WETH with CEU, GIH, JPT, MEX, MKK, and TSI. PPARA was a component of the PANTHER-enriched ‘carbohydrate metabolic process’ term.

Targets of recent positive selection and common traits

To identify positively selected loci that presumably enhanced survival against pathogens and food shortage in human history, but presently increase risk for chronic diseases, we explored overlaps between the identified selection signals and loci associated with podoconiosis and type 2 diabetes. We also explored selection signals known to have a role in skin pigmentation.

Podoconiosis

We compared podoconiosis GWAS loci12 with this study’s selection signals and found that the HLA region containing the 12 XP-EHH SNPs selected for among WETH overlapped with that of the top 10 GWAS SNPs.12 Pair-wise LD calculations showed modest correlation (r2=0.56) between two SNPs with strong XP-EHH (rs9275141 and rs2856695; Table 2) and SNP rs17612858 (chr6.hg19:g.32620622A>T) that has the best GWAS signal for podoconiosis. This correlation was stronger than the average LD in the HLA locus in a 30 kb window (r2=0.20) and between adjacent SNPs (r2=0.33). The frequency of the TA haplotype that carries the positively selected T alleles of rs9275141 and rs2856695 and the podoconiosis-protective rs17612858 A allele was higher than expectation under linkage equilibrium among non-podoconiosis controls (0.61 vs 0.45, P<0.001) and podoconiosis cases (0.42 vs 0.25, P<0.001; Supplementary Tables 12 and 13).
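The two quantities used in this comparison, pairwise r2 and the haplotype frequency expected under linkage equilibrium, follow directly from allele and haplotype frequencies. The sketch below illustrates the standard definitions for two biallelic loci; the function name and the input frequencies are hypothetical and are not the values observed in this study.

```python
def ld_stats(p_ab, p_a, p_b):
    """Linkage disequilibrium summary for two biallelic loci.

    p_ab : observed frequency of the haplotype carrying allele A and allele B
    p_a  : frequency of allele A at the first locus
    p_b  : frequency of allele B at the second locus
    Returns (expected haplotype frequency under linkage equilibrium, D, r2).
    """
    # Under linkage equilibrium, alleles combine independently.
    expected = p_a * p_b
    # D is the excess of the observed haplotype frequency over that expectation.
    d = p_ab - expected
    # r2 scales D squared by the allelic variances at both loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return expected, d, r2


# Made-up frequencies for illustration only.
expected, d, r2 = ld_stats(p_ab=0.40, p_a=0.50, p_b=0.50)
print(round(expected, 2), round(d, 2), round(r2, 2))  # 0.25 0.15 0.36
```

A haplotype observed more often than the product of its allele frequencies, as in this toy case, indicates that the alleles are inherited together more often than expected by chance.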

Type 2 diabetes

We searched for overlaps between genes in the top 0.1% iHS and genes (n=79) reported to be associated with type 2 diabetes in two recent meta-analyses.24, 25 We found a novel selection signal at rs9348453, an intronic SNP in CDKAL1.

Skin pigmentation

The SLC24A5 SNPs rs1426654 (chr15.hg19:g.48426484A>G) and rs1834640 (chr15.hg19:g.48392165A>G) implicated in skin pigmentation in Eurasian populations were within the top 1 and 3% iHS in WETH, and the A alleles of both SNPs implicated in light skin pigmentation had high frequency (47.9%) in WETH.

Discussion

We found that the HLA locus and genes involved in immune response and metabolism are enriched for genomic signatures of selection among the Wolaita ethnic group from southern Ethiopia (WETH). The majority of the HLA selection signals found in this study were not detected in previous studies conducted in global populations including a few that involved other African populations.1, 21, 26, 27, 28 Detection of distinct HLA selection signals in WETH is consistent with data that show a high degree of HLA locus differentiation in African populations,29 and the presence of multiple, population-specific targets of selection in the HLA locus.30 These findings imply that diverse African populations should be studied to capture population-specific selection signals and to understand the effects of geographically restricted selective pressures such as diverse demographic history, pathologic, dietary, and climatic challenges.

Clustering of selection signals in the HLA loci exclusively in WETH suggests that the selective forces are locally distinctive and strong. The specific selective force(s) that have operated in this setting are not clearly understood; however, the key role of HLA in immune response against infections suggests that pathogens are plausible candidates.31 Our PANTHER analysis also revealed enhanced selection of other genes that contribute to mammary gland development and immunity, which may be related to nutrition and increased immunity contributing to defense against the high burden of pathogens in the region. It has clearly been shown that both recent positive selection and balancing selection shape the complex nucleotide diversity in the HLA locus.32 In balancing selection, the equilibrium frequency favoring different alleles can fluctuate with time and space. When environments change over time, recent positive selection can lead one allele to increase in frequency with the collective selection events ultimately maintaining multiple alleles at a locus.33 Moreover, the haplotype pattern around a locus under recent balancing selection can resemble an incomplete sweep of positive selection.8, 34, 35 Therefore, the LD-based tests we used detect recent balancing as well as positive selection, and currently available analysis methods have little power to distinguish these.35 The possible effect of long-term balancing selection on HLA diversity in Ethiopia has been shown by analysis of 61 global populations by Prugnolle et al. and analysis of 535 populations by Sanchez-Mazas et al.36, 37 These studies found that HLA genetic diversity is positively correlated with pathogen diversity of a geographic region and inversely correlated with geographic distance from Ethiopia.36, 37 This corroborates previous evidence suggesting the presence of more pathogens in Africa where humans have lived the longest, and in Ethiopia, where pathogen richness (the number of kinds of pathogens) is one of the greatest on a global scale.38 Historical accounts and archeological evidence indicate expansion of agriculture in the fertile Ethiopian highlands, and formation of urban centers at least as early as the fifth millennium BC.39 These markers of ancient civilization were linked with more settled life, high population density and poor hygienic conditions that facilitated spread of pathogens, which are still a significant cause of mortality and morbidity in the region. Taken together, these data suggest that pathogens are the strongest driving force behind the distinctive and highly enriched selection signals in the HLA locus that we found among this Ethiopian population.

We found signatures of positive selection in genes involved in metabolic processes including a novel selection signal in CDKAL1, a gene that has been implicated in type 2 diabetes, pancreatic β-cell function, and insulin secretion.24, 25, 40, 41, 42, 43 CDKAL1 inhibits the CDK5 protein leading to enhanced insulin secretion under conditions of high glucose levels.44, 45 Therefore, reduced expression of CDKAL1 inhibits insulin secretion leading to an impaired response to glucotoxicity and increased risk for type 2 diabetes.46 The CDKAL1 rs9348453 ancestral A allele in the haplotype favored by selection had high frequency (>70%) in all population groups analyzed in this study. Characterization of sequence variation produced by the 1000G Project shows that rs9348453 has potential regulatory role in human skeletal muscle myoblast cell lines that are used to study diabetes and insulin resistance. Moreover, rs9348453 is correlated (r2=0.73) with rs79915874 (chr6.hg19:g.21005146T>C) that has been predicted to disrupt binding motifs of the hepatocyte nuclear factor 4 transcription factor, which is mainly expressed in the liver and pancreatic β cells. Taken together, these findings suggest that CDKAL1 may be one of the key energy metabolism genes targeted by recent selection.5, 47, 48 Investigation of this novel locus of selection using targeted-sequencing may provide new insights in the pathogenesis of diabetes and metabolic disorders.

The carbohydrate metabolic process term that was significantly enriched in PANTHER included the PPARA (peroxisome proliferator activated receptor alpha) gene. PPARA plays a key role in lipid and carbohydrate metabolism by direct regulation of numerous genes encoding enzymes and transport proteins that are important for glucose homeostasis and lipid metabolism.49 PPARA is activated during energy deprivation and plays a key role in the management of energy stores during fasting and prolonged food deprivation.50, 51, 52, 53 Within the context of the historic dietary experiences of the Wolaita people, selection pressure acting on PPARA may be due to the long-term consumption of Ensete ventricosum. Enset is a food crop that resembles the banana plant and has high-carbohydrate and low-fat content. It is the main food source in the Wolaita area, with more than 10 million people in the highlands of southern Ethiopia depending on it for food, fiber, animal forage, construction materials, and medicines.54 The Omotic-speaking people of Ethiopia introduced enset as a food source and domesticated the plant as early as 10 000 years ago in response to a food crisis.55 A well-known feature of the enset plant is its drought tolerance; hence, it was called the ‘tree against hunger’ by European travelers to Ethiopia in the 1600s.54 Given this history, we propose that the diet-induced positively selected haplotype we found may inhibit PPARA expression, leading to a metabolic adaptation in which lipid oxidation is reduced and carbohydrate oxidation is enhanced even in times of energy deprivation. Consistent with our observation, others have reported signals of natural selection in PPARA in populations of diverse ancestry from widely ranging ecological regions.1 In all, our findings of selection signals in PPARA and NEGR1 (an obesity locus that has been shown to be highly differentiated among sub-Saharan Africans2) strengthen emerging investigations that show the impact of natural selection on genetic variants that shape metabolic processes in coping with fluctuating food availability and dietary adaptation, and dramatic shifts in food consumption in human history.2, 5, 47, 48

Along with diet, high-altitude hypoxia may have exerted an additional selective pressure on the PPARA gene in the Wolaita people who have inhabited the Ethiopian highlands for millennia. Similar to the effect of a high-carbohydrate, low-fat diet, genetic and non-genetic adaptations to hypoxia at high altitudes lead to a shift from lipid to carbohydrate oxidation, promotion of lipid storage, and preference for anaerobic glycolysis for energy expenditure.56 Consistent with this thinking, previous studies have shown that PPARA may be implicated in high-altitude adaptation in the Amhara population group from northern Ethiopia and in Tibetans.21, 57 Also, correlation between the selected PPARA haplotype and serum free fatty acid levels has been observed among Tibetans.56 In all, PPARA may have been targeted by reinforced selective forces from the nutritional content of the enset diet and high-altitude hypoxia; these observations may explain our findings of several PPARA selection signals in both the iHS and XP-EHH tests. Our finding of PPARA gene selection has relevance for cardio-metabolic diseases because genetic variants in PPARA have been found to be associated with blood lipid levels, lipoproteins, and type 2 diabetes.58 Reduction in fatty acid oxidation through PPARA inhibition results in increased levels of stored and circulating lipids, a known risk factor for cardiovascular diseases and the metabolic syndrome.59 This advantageous genetic adaptation of the past may fuel the rise in cardiovascular diseases in Ethiopians,60, 61 and perhaps other populations undergoing urbanization and transition toward lipid-rich diets and sedentary lifestyles. The cardio-metabolic effect of this genetic adaptation, which may have important clinical and public health implications, needs to be investigated further in other African and global populations.

We found selection signals in WETH (an Omotic language speaking ethnic group) around the SLC24A5 gene implicated in skin pigmentation in European and West Asian populations.1, 15 A recent study found these loci in the top 5% of selection signals for Semitic-Cushitic-speaking Ethiopian populations, but not in a combined analysis of three Omotic language speaking Ethiopian ethnic groups (Wolaita, Ari Agricultural, and Ari Blacksmith).62 Moreover, we found a higher frequency of the alleles associated with light skin pigmentation in Eurasian populations in WETH (A allele frequency of both SNPs=47.9% in WETH vs 23% in the combined Omotic sample reported62). Our study had a larger sample of the Wolaita than the previous study (n=120 vs n=8). Moreover, the genetic, geo-climatic, and demographic differences between the Wolaita and the Ari may have masked the selection signal during combined analysis of the ethnic samples in the previous study.62 The SLC24A5 gene’s derived Ala111Thr allele (rs1426654 A allele) is present at low frequencies in other sub-Saharan African populations.63 Our finding of high frequency of this allele in an Omotic-speaking indigenous southern Ethiopian ethnic population that has little shared genetic ancestry with Eurasians62 strengthens previous suggestions of the need for studies to understand whether the derived alleles underlying the adaptive response originated in East Africa or Eurasia.63

We replicated several genes reported to be under selection by at least two genome-wide scans (Supplementary Table 14). Consistent with a study which demonstrated that inflammatory-disease susceptibility loci are enriched for signatures of recent positive selection,64 we identified selection signals in loci implicated in podoconiosis. The findings suggest that positive selection has favored the haplotypes carrying the podoconiosis-protective alleles. The presence of more risk alleles in podoconiosis cases may be due to the effect of mate selection because individuals from podoconiosis affected families are subject to social stigmatization and are excluded from marriage by non-affected community members who recognize that podoconiosis is heritable.65, 66, 67

Overall, this study provides strong evidence of selection in the HLA locus and metabolism genes in an Ethiopian population. It is likely that the burden of a diverse array of pathogens and adaptations to dietary fluctuations may represent the strongest selective forces in the history of this African population. Furthermore, our findings of overlaps between several previously reported disease susceptibility GWAS loci and targets of recent positive selection demonstrate the usefulness of African population samples to elucidate the adaptive genetic basis for many complex diseases.