Infectious pathogens are arguably among the strongest selective forces that act on human populations1. Migrations and cultural changes during recent human evolutionary history (the past 100,000 years or so) exposed populations to dangerous pathogens as they colonized new environments, increased in population density and had closer contact with animal disease vectors, including both conventionally domesticated animals (for example, dogs, cattle, sheep, pigs and fowl) and those exploiting permanent human settlement (for example, rodents and sparrows)2,3. Consequently, both birth and mortality rates increased markedly4.

Host genetics strongly influences an individual's susceptibility to infectious disease5,6. Pathogens that diminish reproductive potential, either through death or poor health, drive selection on genetic variants that affect resistance; selection is likely to be most evident for pathogens with a long-standing relationship with Homo sapiens, including those that cause malaria, smallpox, cholera, tuberculosis and leprosy7 (Fig. 1). We also contend with new threats, such as AIDS and severe acute respiratory syndrome (SARS). Some pathogens cause acute illnesses such as smallpox and cholera but, once survived, pose little additional threat. Other pathogens — for example, those causing malaria, tuberculosis and leprosy, as well as parasitic worms — can be carried as chronic infections and impair nutrition, growth, cognitive development and fertility. The timing, strength and direction (that is, positive, negative or balancing) of selection shape the patterns of variation that remain in the genome. These signatures of selection will therefore vary with the age, geographical spread and virulence of the pathogen.

Figure 1: Pathogen emergence during human history.

Key events in recent human evolution (boxes outlined in black) are juxtaposed with the estimated ages of infectious disease emergence (boxes outlined in red). The fragmentation of the human lineage into genetically and geographically distinct populations (blue lines) accelerates with migration out of Africa. Later, these populations started mixing more (blue shaded regions between the populations) along trade routes (such as the Silk Road), through colonization and through high rates of global travel nowadays.

For those with access, modern medicine radically diminishes exposure to various pathogens. In developed countries, vaccination, better nutrition and improved public health have eliminated diseases that were common in the past8. Common immune-mediated diseases may be partly caused by evolutionary adaptations for resistance and symbiosis with potentially dangerous microorganisms9,10,11,12. For example, decreased gut microbiome diversity in residents of developed countries13 may alter mucosal immune responses14. Understanding host–pathogen interactions will inform the development of new therapies both to counter ongoing pathogen evolution and to better manage immune-mediated diseases15.

Here, we review how the technological revolution in genomics allows us to examine human adaptation to infectious disease in new ways. Natural selection leaves distinctive signatures in the genome, as genetic variants that improve survival and reproduction increase in frequency, and detrimental variants vanish. Hundreds of candidate regions of selection were identified in early genomic data sets, but only a few adaptive variants were identified16. High-throughput biotechnology enables large-scale surveys of genome diversity, genome-wide association studies (GWASs), next-generation sequencing, and high-throughput experimental and bioengineering approaches17. Together with expanding computational capacity18, these tools offer new power to find and functionally elucidate adaptive changes. Pathogen susceptibility and immune traits are particularly amenable to mapping approaches that combine scans for natural selection and genetic association. We consider how these new genomic analyses provide insights into human evolution and have implications for human health. We focus primarily on examples in which selection is connected to infectious disease susceptibility through additional phenotypic associations or functional investigations.

Methods and technologies

Signatures of selection. Natural selection is the tendency for traits to increase or decrease in frequency in a population depending on the reproductive success of those exhibiting them. Positive selection increases the frequency of favoured alleles, negative selection eliminates detrimental alleles, and balancing selection favours diversity. This process leaves unusual patterns of genetic diversity that mark selected loci (that is, signatures of selection) when compared with the background distribution of genetic variation in the genome, which is assumed to evolve under neutrality to a large extent19. Population events that alter genetic diversity — including bottlenecks, expansions, splits and admixture — complicate accurate detection of selected loci.

Scans for natural selection have been made possible by statistical tools to detect signatures of selection and by rapidly expanding whole-genome data sets in multiple human populations. Whereas early scans relied on single-nucleotide polymorphism (SNP) genotyping arrays, next-generation sequencing technologies now enable generation of whole-genome sequence data sets for analysis. Such data sets have several advantages. They do not suffer from ascertainment bias, which is the distortion in measures of genetic diversity and neutral variation20 created by the nonrandom sampling of SNPs on arrays. In addition, sequence data make it feasible to dissect loci with complex patterns of selection and short blocks of linkage disequilibrium (LD), such as the haemoglobin beta (HBB) gene that is associated with sickle cell anaemia21. Finally, as sequencing can detect potentially all variation throughout an individual's genome, the search for the precise causal variant driving selection is facilitated; however, as for SNP genotyping arrays, the comprehensiveness of capturing population-wide variation will depend on the number of individuals sampled.

Here, we briefly describe a few commonly used signatures that can help to elucidate human adaptations to pathogens. Various excellent resources provide more detailed background on statistical methods for detecting selection22,23,24,25,26.

Signatures of positive selection. Positive selection increases the prevalence of genetic variants that improve survival and fertility. For example, a mutation that protects against malaria by disrupting expression of the Duffy antigen gene DARC (also known as FY), which encodes the receptor used by the Plasmodium vivax malarial parasite to enter red blood cells, has reached fixation in most of sub-Saharan Africa27. Positive selection can act on new variants or on standing variation that becomes favourable owing to environmental changes28,29.

The test used to detect positive selection depends on when the selection occurred and on whether the variant is standing or new28. Very ancient selection may leave an excess of fixed, functional (for example, protein-coding) genetic changes that have been acquired over millions of years and through repeated selective sweeps. A selected variant that increased rapidly in frequency within the past ~250,000 years can be detected as an unusual reduction in genetic diversity. Recent positive selection (within the past 5,000–100,000 years) can be found with three different signals: unusually large allele frequency differences between populations, unusually high frequency of newly derived variants and unusually extended LD caused by the rapid increase in frequency of a single allele. The LD-based methods30 are particularly useful for detecting incomplete sweeps (that is, variants increasing in prevalence but not to fixation), which are more common in recent human evolution than complete sweeps31,32,33,34.
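The first of these signals, allele frequency differentiation between populations, is commonly summarized with Wright's fixation index (FST). The following is a minimal sketch for a single biallelic site and two equally weighted populations; the function name and the example frequencies are illustrative assumptions and are not taken from any cited study.

```python
def fst_two_populations(p1: float, p2: float) -> float:
    """Wright's F_ST at one biallelic site for two equally weighted populations.

    p1 and p2 are the frequencies of the same allele in each population.
    """
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity if both populations were pooled into one.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Average expected heterozygosity within each population.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:
        return 0.0  # the site is monomorphic in both populations
    return (h_total - h_within) / h_total


# Hypothetical frequencies: an allele near fixation in one population
# and rare in another gives a value close to 1.
strongly_differentiated = fst_two_populations(0.99, 0.01)
# Similar frequencies in both populations give a value close to 0.
weakly_differentiated = fst_two_populations(0.30, 0.32)
```

In a genome-wide scan, a site would be flagged only if its value is extreme relative to the empirical distribution across the genome, which is not modelled here.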

These methods have detected hundreds of loci with signatures of selection in the human genome16,28,30,35,36,37,38,39,40,41. Their sensitivity to recent events depends on the strength of selection, as more advantageous variants increase to detectable frequencies faster. The comparison of many individuals from closely related populations using tree-based methods can help to detect smaller selection-driven changes in allele frequencies42. Combining several different tests for selection improves sensitivity to a wider range of selection regimes, helps to narrow down candidate regions and pinpoints a small number of top causal candidates28,33,43.

Pathogen resistance alleles are prime candidates for discovery, as they are likely to be both recent and common, and have increased in frequency owing to the burden of disease.

Signatures of balancing selection. Balancing selection maintains multiple alleles at a locus. A variant may confer a heterozygous advantage; for example, at the HBB sickle cell locus, heterozygous carriers are malaria resistant to a large extent but otherwise healthy, which gives them a reproductive advantage in malaria-endemic environments over homozygous wild-type individuals (who are susceptible to malaria) and homozygous carriers (who have sickle cell disease and high childhood mortality27). Alternatively, the advantage conferred by a variant may depend on its prevalence. Diversifying selection favours high levels of polymorphism, as in the major histocompatibility complex (MHC). MHC diversity confers resistance to a broader range of pathogens44 and is strongly selected for45. In humans, MHC diversity correlates with local pathogen diversity, which is consistent with pathogen-driven balancing selection46. Selection at this locus is of particular interest, as it harbours more bona fide disease associations than any other region in the human genome47,48, including associations to infectious disease susceptibility (including AIDS49,50,51, leprosy52, leishmaniasis53, hepatitis B54,55,56, hepatitis C57 and human papilloma virus (HPV) infection58), autoimmune disorders, cancers and neuropsychiatric diseases.
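The heterozygous advantage described for the HBB sickle cell locus can be illustrated with the standard single-locus overdominance model. The sketch below assumes relative fitnesses of 1 − s for wild-type homozygotes, 1 for heterozygotes and 1 − t for sickle homozygotes; the values of s and t are hypothetical and chosen only to show the behaviour, not estimated from data.

```python
def next_frequency(q: float, s: float, t: float) -> float:
    """Advance the sickle-allele frequency q by one generation of selection."""
    p = 1.0 - q
    w_aa, w_as, w_ss = 1.0 - s, 1.0, 1.0 - t  # assumed relative fitnesses
    mean_fitness = p * p * w_aa + 2.0 * p * q * w_as + q * q * w_ss
    # Frequency of the sickle allele among the surviving gene copies.
    return q * (p * w_as + q * w_ss) / mean_fitness


def simulate(q0: float, s: float, t: float, generations: int) -> float:
    """Iterate the recursion from a starting frequency q0."""
    q = q0
    for _ in range(generations):
        q = next_frequency(q, s, t)
    return q


def equilibrium(s: float, t: float) -> float:
    """Analytical equilibrium frequency under heterozygote advantage."""
    return s / (s + t)


# Hypothetical selection coefficients: a modest malaria cost (s) and a
# severe sickle cell disease cost (t).
q_final = simulate(q0=0.001, s=0.1, t=0.8, generations=2000)
q_expected = equilibrium(s=0.1, t=0.8)
```

Under these assumptions the simulated frequency converges on s / (s + t) from any starting point between 0 and 1, so neither allele is lost, which is the sense in which balancing selection maintains multiple alleles at a locus.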

Recent balancing selection can resemble an incomplete selective sweep and be detected using LD-based tests for positive selection, as at the HBB sickle cell locus59,60. Long-term balancing selection reduces the number of rare alleles in a region and causes an excess of polymorphism, an excess of intermediate-frequency variants and reduced allele frequency differences between populations61.

Cross-species comparative sequence analysis is particularly effective for detecting the subtle signals of ancient balancing selection. A comparison of humans and non-human primates found that both the MHC61 and the blood group locus ABO62,63 contain ancient multiallelic polymorphisms maintained across species, which indicates balancing selection. A comparison of whole-genome sequences for 10 chimpanzees and 59 humans found that regions with signatures of balancing selection — MHC and 125 other loci64 — were enriched for membrane glycoproteins. These are proteins exploited by a broad range of pathogens as receptors for cell invasion and to evade the host immune response, which suggests that the selection was pathogen driven65,66,67.

Signatures of negative selection and purifying selection. Negative selection eliminates existing detrimental variation from a population68. For example, when human populations in the Ganges River Delta encountered pathogenic Vibrio cholerae, individuals of blood type O had higher risk of dying from severe cholera, which put them at a strong reproductive disadvantage. Nowadays, populations in the cholera-endemic Ganges River Delta have the lowest rates of blood type O in the world, which is consistent with negative selection69,70. Purifying selection is the ongoing removal of deleterious alleles as they arise. Signatures of purifying selection include decreased overall diversity, loss of functional variation and an excess of rare alleles68. Purifying selection also manifests as a lack of substitutions between species, and this signal is used to identify functionally important, highly conserved genomic regions in cross-species comparisons71.

GWASs of adaptive traits. GWASs identify genomic variants that are significantly correlated with a phenotype of interest, typically in large sample cohorts. The sample size required is smaller if variants are at high prevalence, have strong effects and are in regions of extensive LD72,73, all of which are characteristics of positively selected loci (Fig. 2). As beneficial alleles increase in prevalence, they carry nearby variants with them (known as genetic hitchhiking); these nearby variants can thus be proxies for the causal allele and enhance power to detect association.
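The dependence of sample size on allele frequency and effect size can be sketched with the textbook two-proportion formula applied to allele counts. This is a simplified illustration and not the power simulation used for Fig. 2: it assumes that the causal variant is typed directly, that cases and controls are equal in number, that an allelic test is used and that a conventional genome-wide significance threshold of 5 × 10^−8 applies. All function names and default values are assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def case_allele_frequency(control_freq: float, odds_ratio: float) -> float:
    """Allele frequency in cases implied by a per-allele odds ratio."""
    return odds_ratio * control_freq / (1.0 - control_freq + odds_ratio * control_freq)


def cases_needed(control_freq: float, odds_ratio: float,
                 alpha: float = 5e-8, power: float = 0.8) -> int:
    """Approximate number of cases (with equal controls) for an allelic test."""
    p0 = control_freq
    p1 = case_allele_frequency(p0, odds_ratio)
    p_bar = (p0 + p1) / 2.0
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_beta * sqrt(p0 * (1.0 - p0) + p1 * (1.0 - p1))) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    return ceil(alleles_per_group / 2.0)  # two alleles per diploid individual


# Compare a low-frequency and a high-frequency allele at the same effect size.
n_low_frequency = cases_needed(control_freq=0.05, odds_ratio=1.5)
n_high_frequency = cases_needed(control_freq=0.30, odds_ratio=1.5)
```

With these assumptions, the low-frequency allele requires the larger cohort, and for a fixed frequency the required cohort shrinks as the odds ratio grows. The sketch ignores LD with tag SNPs, so it does not capture the additional loss of power on genotyping arrays.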

Figure 2: Positive selection increases power to detect associations in GWASs.

a | The variants in a population-wide sample are shown for a schematic genomic locus. The red region indicates a variant that provides selective advantage (such as a host variant that confers relative resistance to an infectious agent). Positive selection of that variant rapidly increases its prevalence in a population and also the prevalence of nearby alleles that are in linkage with it. b | Positive selection on a variant is detectable with three types of signals: high levels of differentiation (that is, when positive selection in one geographical region causes larger frequency differences between populations than those expected for neutrally evolving alleles); high frequency of the derived allele (that is, when a new allele increases to a frequency higher than that expected under genetic drift); and long haplotypes (as determined by the integrated haplotype score and cross-population extended haplotype homozygosity) that are left when the selected allele increases in frequency sufficiently quickly that long-range associations with neighbouring variants are maintained. Combining different signals of selection into a composite score can increase resolution by up to 100-fold, which facilitates identification of the causal variants. c | Variants that are of high frequency, of strong effects and in regions of extensive linkage disequilibrium (LD) — all of which are characteristics of positively selected loci — are detectable with smaller sample sizes in genome-wide association studies (GWASs). Even with full sequence data, low-frequency alleles (solid blue lines) require larger sample sizes (x axis; equal number of cases and controls) than high-frequency alleles (solid red lines) for equivalent power over a range of effect sizes (1.2, 1.5 and 2.0). 
When mapping with genotyping arrays, shorter LD in unselected regions (blue dashed lines; modelled here using power simulations for the sparser Illumina 300K array) can cause a larger loss of power (relative to full sequence data) than in selected regions with longer LD (red dashed lines; modelled as denser Illumina 1M array). FST, Wright's fixation index. Part c is adapted from Ref. 198.

The US National Institutes of Health (NIH) Catalog of published GWASs includes all strong SNP–trait associations (P < 1 × 10^−5) from 1,900 curated publications48. However, only 68 publications are related to pathogen susceptibility, and most of those (63%) are focused on AIDS (16 studies), hepatitis B (12 studies) and hepatitis C (15 studies). The remainder includes GWASs of tuberculosis (5 studies), prion diseases (3 studies), malaria (3 studies), leprosy (2 studies), smallpox (2 studies) and 9 others (see Supplementary information S1 (table)). Before GWASs, a candidate gene approach was used to find strong-effect variants that were associated with disease susceptibility in HBB, DARC and SLC4A1 (solute carrier family 4 (anion exchanger), member 1 (Diego blood group)) for malaria, and the CCR5 (chemokine (C-C motif) receptor 5) gene for AIDS74. However, candidate gene studies were generally underpowered, and beyond these textbook examples most previous associations have so far failed to replicate in GWASs75. For example, of 22 tuberculosis-associated genes, none were significantly associated in a GWAS of 11,425 Africans, nor were these genes overrepresented among nominally significant results76,77,78. Candidate gene studies of infectious disease susceptibility have been reviewed in detail elsewhere79.

Integrating selection metrics into GWASs can increase their power. A study of malaria resistance loci in Africa found that when association results are weighted on the basis of evidence of positive selection, power increases by 1–2 orders of magnitude80,81. Another approach was used to identify positively selected drug resistance loci in the malaria parasite Plasmodium falciparum; the sample was partitioned into two populations defined by phenotype (for example, resistant and susceptible), and regions of selection were identified in resistant individuals82. Furthermore, as described above, selection increases the power of GWASs themselves by driving the emergence of common alleles of strong effect. Indeed, using the Kolmogorov–Smirnov test, we found that variants in positively selected regions in the NIH Catalog of published GWASs are more significantly associated with the tested phenotype (P_KS = 2 × 10^−7) than those in the rest of the genome33.
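The Kolmogorov–Smirnov comparison mentioned above asks whether two sets of association scores are drawn from the same distribution. A minimal sketch of the two-sample statistic follows; it computes only the maximum distance between the two empirical distribution functions, not the P value, and the input scores are hypothetical −log10(P) values rather than Catalog data.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the largest absolute difference between the empirical cumulative
    distribution functions of the two samples.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # Both empirical CDFs are step functions, so the largest gap is
    # attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical -log10(P) scores for variants inside and outside
# candidate selected regions.
selected_regions = [6.1, 7.4, 8.0, 9.3, 12.5]
rest_of_genome = [5.1, 5.3, 5.8, 6.0, 6.4, 7.0]
d_observed = ks_statistic(selected_regions, rest_of_genome)
```

In practice a library routine such as `scipy.stats.ks_2samp` would usually be used instead, as it also returns the P value; the hand-written version here only makes the definition of the statistic explicit.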

Signatures of polygenic selection. Similar to many other human traits, adaptation to pathogens is likely to be polygenic33,83, as mutations in many genes emerge and increase in prevalence through selection. Pathway-based approaches scrutinize candidate regions identified in GWASs84 to elucidate functional pathways that are disrupted in complex human diseases85. The same approach can be applied to candidate selected regions; however, as natural selection acts on many traits, such analyses have less power. Only when a particular pathway is repeatedly selected for in a population, such as skin pigmentation in Europeans, do we expect significant enrichment.

Functional characterization. For an allele to be selected it must have a functional effect, but the relevant phenotype is typically unknown. Selected regions therefore present an opportunity to develop a systematic process for functionally characterizing genetic variation. In a few cases, a selected region contains well-characterized genes that offer clear functional hypotheses, for example, Toll-like receptor 5 (TLR5, which is involved in bacterial flagellin sensing33) (Fig. 3A), apolipoprotein L1 (APOL1, which encodes a serum factor that lyses Trypanosoma brucei86) (Fig. 3B) and HBB (which influences infection by P. falciparum27 (see above)). For genes with unknown or diverse functions, identification of the selected trait is more complicated. Moreover, most selection signals are non-genic and are likely to alter poorly characterized regulatory regions of the genome43,87,88.

Figure 3: Selected variants implicated in pathogen resistance.

A | A genome-wide scan for signals of positive selection in the Yoruban population of Nigeria found a strongly selected nonsynonymous single-nucleotide polymorphism (SNP) that alters the pathogen recognition protein Toll-like receptor 5 (TLR5) (part Aa) and that is predicted to disrupt TLR5 activation in response to flagellated bacteria33 (part Ab). Cell lines that carry the new TLR5 variant (Leu616Phe) had significantly reduced nuclear factor-κB (NF-κB) signalling in response to flagellin, which is potentially protective against some bacterial infections (part Ac). Error bars represent the standard error of the mean over at least three independent experiments; P values are indicated above the bar graphs. B | Two common variants of the apolipoprotein L1 (APOL1) gene (Allele G1 and Allele G2) that are strongly associated with kidney disease in African Americans (part Ba) show evidence of recent positive selection in Yorubans86 (part Bb). In vitro, the G1 and G2 variants lyse subspecies of the Trypanosoma spp. pathogen that are resistant to wild-type APOL1 (part Bc). Arrows point to the swelling lysosome. cM, centimorgan; PMA, phorbol myristate acetate. Part A reprinted from Cell, 152, Sharon R. Grossman et al., Identifying recent adaptations in large-scale genomic data, 703–713, © (2013), with permission from Elsevier. Part B from Genovese, G. et al. Association of trypanolytic ApoL1 variants with kidney disease in African Americans. Science 329, 841–845 (2010). Reprinted with permission from AAAS.

Functional elucidation remains a difficult and mostly manual process, and benefits enormously when candidate variants are narrowed down to a small number. New methods for detecting selection are remarkably effective at identifying top functional candidates among all variants in selected regions — a list that can be further reduced by incorporating GWAS results, expression quantitative trait loci (eQTLs) and functional annotations33,89. Useful resources include the 1000 Genomes Project, which aims to identify all common (>1%) human genetic variation by sequencing >1,000 individuals90, and the Encyclopedia of DNA Elements (ENCODE) Project, which aims to characterize all functional elements in the genome91. Breakthroughs in next-generation sequencing, high-throughput functional screens, single-cell genomics, microfluidics, chromosome conformation capture and genome engineering approaches now make it possible to test many variants in parallel, to investigate non-genic regions and to functionally screen the whole genome17,92,93,94,95,96,97,98,99.

Genetics of infectious disease resistance

The dynamics of host–pathogen interactions (for example, length of exposure, geographical spread, morbidity and mortality, and co-occurring environmental events) influence the genetic architecture of resistance variants in modern populations (Fig. 4). Many different modes of selection shape patterns of variation in humans (reviewed in Refs 25,26), and selection scans using current methods and data sets can only detect a subset of selected loci. The most conspicuous signals are perhaps left by positive selection in recent human evolutionary history. This is the timeframe during which many major pathogens first emerged (Fig. 1; Table 1), which suggests that this mode of selection is particularly relevant to studies of infectious disease susceptibility. Moreover, finding beneficial variants that are favoured by recent selection could suggest new medical therapies. Transforming genetic discoveries into improved healthcare will take time, but understanding natural resistance is an important first step.

Figure 4: The power offered by combining natural selection with GWASs depends on the age of selection and populations chosen.

For pathogens that predate human dispersal from Africa, ancient and complex signals of selection are shared between human populations, and there are variable implications for genome-wide association studies (GWASs). For widespread pathogens that are more recent, the range of new resistance variants will be more limited, but selection is harder to detect when it is shared between populations. For recent pathogens that affect specific populations, GWASs in the selected populations will be particularly powerful, as causal variants will have been driven to high prevalence. Selection signals from one population may help to detect resistance loci in other unselected populations, but only if resistance variants arise in the same genetic loci. For GWASs of very new diseases, supplementing these studies with methods to detect selection will add no power, unless variants that confer resistance also protect against more ancient pathogens. Examples of pathogens matching each scenario are given on the right.

Table 1 Age and geographical origin of major human pathogens

We note that some have questioned the extent of recent positive selection in humans in light of a recent paper31. In that paper, the authors estimated that during the past 250,000 years, ~0.5% of nonsynonymous substitutions have swept to fixation (that is, 100% prevalence), and such an observation has been interpreted to suggest that very few positively selected variants can be found in humans. In actuality, this corresponds to ~340 adaptive nonsynonymous mutations in the 1000 Genomes Project data. Moreover, much (probably most) adaptive evolution occurred in regulatory, non-coding regions100,101, and most recent selective sweeps are far from complete102. Thus, the actual number of positively selected loci could be much larger.

Under the framework of recent positive selection, one would anticipate that genetic variants conferring pathogen resistance that are moderately old (dating since the human migrations from Africa but at least thousands of years old), that were geographically limited in history and that exerted strong positive selective pressure will be most readily detected, provided that the studies are carried out in the population with the history of disease exposure. Many of the pathogen studies so far are imperfect fits for these criteria. Here, we review some of the prominent diseases of human history (Table 1), and discuss the strengths and limitations of investigating natural selection and association for these traits.

Malaria. Malaria is caused by obligate parasitic Plasmodium spp., which infect hundreds of millions of people and kill ~1 million children annually103. P. falciparum has afflicted humans for ~100,000 years, and a rapid upsurge of malaria ~10,000 years ago increased selective pressure on some human populations27,104. As a result, incidence of sickle cell disease and other inherited red blood cell disorders that are associated with malaria resistance (for example, α-thalassemia, glucose-6-phosphate dehydrogenase (G6PD) deficiency and ovalocytosis) coincides with the geographical distribution of malaria27.

The presence of a disease in a population may indicate that the pathogen exerts selective pressure, but the inverse — absence of disease — can also be meaningful. Although P. falciparum is common in sub-Saharan Africa, P. vivax is noticeably absent. A mutation in the human DARC gene105 that disrupts expression of the Duffy antigen receptor to prevent infection106 has become 100% prevalent. In a possible example of convergent evolution, an independent DARC mutation is prevalent in Southeast Asia where P. vivax is common107. As expected in the host–pathogen evolutionary 'arms race', the high prevalence of Duffy-negative hosts could be driving P. vivax to adapt. Strains of P. vivax have emerged in Africa that can infect Duffy-negative individuals108 and that carry new variants of the gene encoding Duffy-binding protein109.

The long history and strong selective pressure exerted by malaria may paradoxically complicate detection of resistance loci by GWASs. When selection drives multiple resistance variants to arise in a locus (for example, HBB and MHC), patterns of selection and association are more complex. Further complicating GWASs of malaria susceptibility, most cases occur in African populations, in which LD is short and highly variable between populations. Thus, causal variants are poorly tagged by SNP genotyping arrays21,110. Methods that successfully tackle these challenges are being developed, including GWASs by sequencing, improved imputation and Bayesian statistical approaches that allow heterogeneity in effect size and location110,111,112. An initial malaria GWAS in West Africa with 2,500 cases, 3,400 controls and 400,000 SNPs found no genome-wide significant associations21. A larger study with 3,500 cases, 4,300 controls and 800,000 SNPs (4 million after imputation) identified HBB, ABO and two novel loci: one is intergenic and the other is in the ATP2B4 gene, which encodes an erythrocyte calcium pump111. An international collaborative GWAS of severe malaria is likely to identify more loci113.

Leprosy. Leprosy, a chronic disease caused by the bacterium Mycobacterium leprae, causes infertility, disfiguring skin lesions, social ostracism, permanent physical disability and shortened lifespan114,115,116,117,118. M. leprae is an ancient and obligate human pathogen with a phylogeography that follows human migrations119. Although not immediately fatal, leprosy infection decreases fertility118 by up to 85%120, which poses a potent selective pressure. Consistent with selection that favours resistance variants, most people nowadays are genetically protected against M. leprae121. Indeed, host genetics has such a large role in susceptibility that, before the disease was proved to be caused by bacteria, leprosy was thought to be an inherited disease rather than an infectious one122,123.

Leprosy was endemic in Europe with a prevalence of 10–40%124 until the sixteenth century; its subsequent rapid decline is still unexplained. Currently, it remains a major public health burden in India, China and South America125. The genetics of leprosy susceptibility differs between populations. GWASs in Han Chinese implicated the bacterial pattern recognition receptor NOD2 and three other components of NOD2-mediated innate immunity52,126, but these associations did not replicate in an independent study of Indians and Malawians127. The Indian cohort instead had strong association to a functional knockout mutation in the TLR1 gene that protects against leprosy128. This variant is nearly absent in the Han Chinese population (2%) and rare in Indians (9%) but extremely common in Europeans (70%)129 owing to positive selection130. A locus associated with leprosy susceptibility in Chinese populations that is near the immune regulator gene cylindromatosis (CYLD)131,132 may also be positively selected in Europeans (Fig. 5). Selection at leprosy-associated loci in European populations, together with the evidence from Indian and Chinese studies that risk loci differ between populations, suggests that combining European selection scans with a European GWAS might offer the most power. However, as leprosy is now rare in Europe, a GWAS would require a phenotypic proxy. Alternatively, as technology improves for analysing ancient DNA samples, thousands of medieval skeletal remains in Europe, many of which have lesions indicative of leprosy124, may be usable in archaeological GWASs.

Figure 5: Signals of selection and association may differ between populations.

Two loci (represented by dashed and solid lines) associated with leprosy susceptibility in Han Chinese52,126 show no evidence of positive selection in East Asians (blue circles). The association downstream of the cylindromatosis (CYLD) gene (solid line), but not the association near the bacterial pattern recognition receptor gene NOD2 (dashed line), has signals of positive selection in Europeans (red circles). Selection scores for four different metrics were calculated using data from the 1000 Genomes Project and published in Ref. 33. FST, Wright's fixation index; NKD1, naked cuticle homologue 1; SNX20, sorting nexin 20.

Tuberculosis. Tuberculosis is an often lethal disease caused by Mycobacterium tuberculosis, which infects one-third of the human population. Similar to the pathogens causing malaria and leprosy, M. tuberculosis is an ancient and obligate human pathogen that is estimated to have emerged in East Africa ~40,000 years ago133 and to have spread around the world with ancient human migrations134. Consistent with long host–pathogen co-evolution, only 10% of infected individuals develop active disease; in most infected individuals the host immune system contains the pathogen. Host genetic factors are strongly implicated in susceptibility135. Despite this, a GWAS of tuberculosis susceptibility found only 2 significant associations in a huge data set of 11 million SNPs and, after replication, nearly 23,000 individuals76,136, compared with 490,000 SNPs and 9,200 individuals in the leprosy GWAS136.

What explains the different outcomes of the tuberculosis and leprosy GWASs? One possible factor is that M. tuberculosis is more genetically diverse than M. leprae134, and particular lineages of M. tuberculosis may have adapted to specific populations137. Because an individual's tuberculosis status depends on how host genetics interacts with the infecting M. tuberculosis strain, an accurate phenotype would include sequence data from the pathogen. In addition, the phenotype itself is ambiguous: tuberculosis resistance can be defined as either preventing or containing infection. Finally, co-infection with other pathogens increases the risk of developing active tuberculosis (for example, HIV infection increases risk by 21–34-fold138).

Phenotypic ambiguity, strain differences and strain–host interactions reduce the power of GWASs to associate host genetic factors with tuberculosis susceptibility139, but the evidence of host–pathogen co-evolution suggests that incorporating tests of selection into GWASs may be particularly powerful. The strongest association so far is found in a GWAS carried out primarily in Africa, which implicates a region downstream of the Wilms tumour 1 (WT1) gene that has signatures of positive selection in East Asians33,136.

AIDS. During the 1980s, the highly variable and fast evolving retroviral pathogen HIV-1 emerged to cause a global infectious disease pandemic that has killed >30 million people140. HIV-1 infects immune cells and causes a progressive and incurable failure of the immune system that allows other opportunistic infections, such as tuberculosis, to take hold. Both HIV-1 and the distantly related virus HIV-2 are predicted to be recent (<100 years) cross-species transmissions of simian immunodeficiency viruses (SIVs) into humans141. SIVs are mostly non-pathogenic species-specific lentiviruses carried by at least 41 African primate species142. Some SIVs have infected the same host for >30,000 years and probably much longer143. In the last century, at least ten primate-to-human SIV transmissions have been documented142. This suggests that human populations, particularly those in Africa, may have experienced ancient lentivirus epidemics that drove variants conferring resistance to modern HIV strains to high prevalence through natural selection. The role of host genetics in HIV infection has been extensively researched, and the NIH Catalog of published GWASs lists at least 15 HIV-related publications (4 of which report significant associations of P < 1 × 10^−8), but any connection to ancient selection remains unclear.

Among the first HIV resistance variants to be elucidated is a 32-base-pair deletion in the cell surface receptor gene CCR5 (known as CCR5Δ32) that prevents the expression of the receptor on T cells and confers complete HIV immunity on homozygous carriers144,145. On the basis of the high prevalence of the variant in northern Europe and the apparently high LD with nearby variants, some researchers hypothesized that CCR5Δ32 was <1,000 years old and that positive selection for resistance to Yersinia pestis (the causative agent of the plague) increased its prevalence146, which led to various follow-on studies. However, re-evaluation of the locus using newer and denser genetic maps found no evidence to support positive selection acting on this locus, which highlights the necessity of dense genome-wide data sets when identifying loci with exceptional patterns of variation147. Subsequent work on CCR5 highlights the medical relevance of identifying naturally occurring resistance variants. Patients infused with autologous CD4 T cells that are engineered with a CCR5-disrupting mutation designed to phenotypically mimic CCR5Δ32 showed partial resistance to HIV infection148.

The strongest signals of HIV resistance are in the MHC loci, which are under extreme positive and balancing selection. In a GWAS comparing individuals infected with HIV who do not develop clinical disease (that is, HIV controllers) to individuals with advanced disease, MHC variants explained 19% of the phenotypic variance49. These variants included a protective regulatory variant that is correlated with increased expression of human leukocyte antigen C (HLA-C)149. Although higher HLA-C expression protects against HIV progression, it also increases the risk of the inflammatory disorder Crohn's disease150, which highlights the potential for health repercussions of pathogen-driven selection.

Cholera. Cholera (a notoriously deadly disease with historic mortality rates as high as 50%151,152,153) is caused by the V. cholerae bacterium and is endemic to the Ganges River Delta154 of Bangladesh, where it is still prevalent155,156,157. Host genetic factors strongly influence susceptibility69,70 to this dangerous pathogen, which suggests that selection has favoured cholera resistance variants. Consistent with this hypothesis, the region has the world's lowest prevalence of blood type O, which is associated with an increased risk of severe cholera69,70.

A genome-wide scan for positive selection in a Bangladeshi population identified >300 selected regions. The most strongly selected genes were also associated with cholera susceptibility in a targeted analysis158. A gene set enrichment analysis159 found that two types of genes were statistically overrepresented in the selected regions compared with the rest of the genome: genes encoding potassium channels that are involved in cyclic AMP-mediated chloride secretion, and genes encoding components of the innate immune system that are involved in nuclear factor-κB (NF-κB) signalling. The success of the enrichment analysis suggests that cholera resistance in Bangladesh, similarly to pigmentation in Europe, provides an exceptionally strong evolutionary advantage and has driven selection at many different genomic loci.
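
To make the idea of statistical overrepresentation concrete, the sketch below runs a one-sided hypergeometric test: given how many genes lie in selected regions, how surprising is the number that belong to a particular gene set? This is a deliberately simple over-representation test and is not necessarily the method of the cited enrichment analysis159. The function name and all counts in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(genome_genes: int, set_genes: int,
                       selected_genes: int, overlap: int) -> float:
    """One-sided hypergeometric test for gene set over-representation.

    Returns P(X >= overlap), where X is the number of gene-set members seen
    when `selected_genes` genes are drawn without replacement from a genome
    of `genome_genes` genes, `set_genes` of which belong to the gene set.
    """
    if not 0 <= set_genes <= genome_genes:
        raise ValueError("set_genes must be between 0 and genome_genes")
    if not 0 <= selected_genes <= genome_genes:
        raise ValueError("selected_genes must be between 0 and genome_genes")
    upper = min(set_genes, selected_genes)
    if not 0 <= overlap <= upper:
        raise ValueError("overlap is impossible for these counts")
    total = comb(genome_genes, selected_genes)
    tail = sum(
        comb(set_genes, k) * comb(genome_genes - set_genes, selected_genes - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 20,000 genes in the genome, 80 in the gene set,
    # 1,000 in selected regions, 12 of which belong to the gene set.
    print(enrichment_p_value(20_000, 80, 1_000, 12))
```

A test of this kind treats genes as independent draws, which ignores gene length and clustering of related genes in the genome. Published enrichment analyses of selection scans typically correct for both.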

The history of selection could make GWASs of cholera susceptibility in Bangladesh particularly powerful. As cholera is still common in Bangladesh, unlike leprosy in Europe, such a study is feasible. The results could help to elucidate the power for mapping positively selected resistance variants that protect against other pathogens with geographical disparity and high mortality.

Norovirus. The single-stranded RNA viruses of the genus Norovirus are the leading cause of extremely contagious viral gastroenteritis outbreaks worldwide160; they are particularly dangerous to young children and cause up to 200,000 deaths each year in developing countries161. The origin of Norovirus is obscured by the rapid evolution of these viruses162,163,164, but complex signals of selection in humans suggest that they could be very old. Individuals who are homozygous for null mutations of the fucosyltransferase 2 (FUT2) gene do not secrete ABO antigens and are protected against some strains165,166,167. Non-secretors are common worldwide (for example, in 20% of Caucasians). The underlying FUT2 mutational spectrum is unexpectedly complex: it comprises multiple independent mutations that vary in frequency between populations and that have diverse evolutionary signatures, from long-term balancing selection to recent positive selection168.

Influenza. Great pandemics inflict massive mortality and are of particular interest to evolutionary geneticists. The most striking modern example is the 1918 influenza pandemic, which was caused by an unusually lethal strain of influenza A that killed 50–100 million people, including many previously healthy young adults169,170. Influenza is almost certainly very old: Hippocrates described a flu-like illness ~2,400 years ago171. Observational data suggest that host genetics influences susceptibility to severe illness172; for example, cases of the highly pathogenic H5N1 strain show strong familial aggregation173. Recently, interferon-induced transmembrane proteins (IFITMs) have been implicated in resistance to influenza A. IFITMs inhibit in vitro replication of some pathogenic viruses174, and IFITM3 expression protects against infection by multiple strains of influenza A in vitro and in vivo175. Hospitalized patients with severe influenza were significantly more likely to carry a splice acceptor site variant in IFITM3 that reduces its ability to restrict influenza virus replication in vitro175. Versions of IFITM3 that protect against influenza seem likely to confer a selective advantage, and selection scans show signals of recent positive selection in the IFITM3 region175.

Smallpox. Only a century ago, smallpox — caused by the Variola virus — ravaged human societies with mortality rates of up to 30%. It was an ancient and widespread scourge, and was described in historical records thousands of years old from China, India and Egypt. It is now gone and represents the only infectious disease in humans that has been eradicated by modern medicine. Variola virus has a highly conserved (>99.6% across 45 isolates) 186-kb double-stranded DNA genome176. Its extremely low mutation rate, simple genetic makeup and reliance on humans as its only host limited its ability to adapt and facilitated its eradication.

The age and phylogeography of smallpox are unresolved despite efforts to integrate historical records with sequence data from 45 viral isolates176,177. We noted that, for 32 viral isolates with documented mortality, death rates are lower in Africa (0.4–12%) than elsewhere (4–38%), even though all isolates were from a single phylogenetic clade178. This is consistent with selection for resistance in Africa, where the smallpox virus is predicted to have evolved from a rodent-borne ancestor tens of thousands of years ago and where outbreaks of other poxviruses continue nowadays. Human evolutionary history may help to clarify the origins of smallpox.

With smallpox eradicated, vaccine response is used as a crude phenotypic proxy for studying host resistance. Two GWASs that included European, African American and Hispanic populations identified 37 SNPs associated with cytokine response to vaccination179,180 (P < 1 × 10^−8). Most of the significant associations (65%) were found in African Americans, even though their sample size was half that of the European cohorts, which is consistent with a larger effect due to selection in Africa. These results are preliminary: the studies had relatively small sample sizes (~200 African Americans), no overlap in their results and no replication. Incorporating tests for natural selection could add power for detecting true associations.

Infectious disease selection and common disease

The hygiene hypothesis proposes that autoimmune disorders are partly caused by differences between the pathogen-rich environment in which our immune system evolved and the more sterile modern world. In the absence of diverse pathogens from which to defend ourselves, our immune responses may turn on us12. Loci associated with common inflammatory disorders are enriched for signals of positive selection1,181,182,183, and GWASs have proved particularly powerful for this class of diseases184. Elucidating the effect of ancient selection for pathogen resistance should help to decipher the aetiology of autoimmune diseases89 but will require more data on immune responses to common pathogens. In cases in which selected variants have pleiotropic effects, pathogen-driven selection may even underlie diseases with no apparent immune component.

Inflammatory bowel disease. Inflammatory bowel disease (IBD) is a group of disorders, including Crohn's disease and ulcerative colitis, that are caused by autoimmune attacks on the gastrointestinal system. One hundred and sixty-three distinct loci have been significantly associated with IBD risk using meta-analyses of up to 75,000 European cases and controls, and these loci are strongly enriched for signatures of selection11,117. Moreover, seven of the eight leprosy susceptibility loci52,126 are also associated with increased IBD risk54,125. Risk allele frequencies at some IBD loci correlate with local pathogen diversity, which is consistent with pathogen-driven selection185.

These observations broadly support the hygiene hypothesis and connect autoimmunity to ancient evolution for pathogen resistance. However, the relationship is not straightforward. At the four leprosy risk loci that precisely overlap IBD association peaks11, the IBD risk variant is associated with decreased leprosy risk at only two; at the other two, it is associated with increased leprosy risk. Further complicating the story, one of those two (the variant in NOD2) is associated with both an increased risk of Crohn's disease and protection against ulcerative colitis186.

One potential source of the seemingly discrepant GWAS results is population differences. Whereas the 163 IBD loci were identified in cohorts of European ancestry, the leprosy GWASs were carried out in East Asian populations. The NOD2 pathway association with leprosy resistance in East Asians has not been replicated in other populations127. In addition, East Asians rarely carry the functional knockout mutation in TLR1 that is common in Europe. Experimental data suggest that TLR1 and NOD2 activate distinct pathways in response to leprosy infection187; thus, correlating East Asian pathogen resistance variants with autoimmune disease risk in Europeans may not be informative.

A second factor is simply the lack of data on many pathogens. The remarkable overlap between GWAS loci of leprosy and IBD may reflect a bias in the available data, as leprosy is one of only a few pathogens for which a susceptibility GWAS has been completed. Not a single GWAS has been carried out, for example, on susceptibility to parasitic worms, which are potentially of great relevance to gastrointestinal disorders such as IBD.

Coeliac disease. Coeliac disease is a strongly heritable188 (~80%) inflammatory intestinal disorder triggered by gluten consumption. Despite severely affecting nutritional intake, coeliac disease occurs at a prevalence of 1–2% in Europe189 and up to 6% in North African Sahrawi190. Loci associated with coeliac disease have signatures of positive selection181. A functional analysis of one selected locus in the SH2B adaptor protein 3 (SH2B3) gene found that individuals who are homozygous for the coeliac risk allele (~22% of the European population) have stronger activation of the NOD2 pathway and a 3–5-fold higher pro-inflammatory cytokine response to lipopolysaccharide191. Better protection against bacterial infection may have conferred a selective advantage that outweighed the increased risk of coeliac disease. Inferring selection pressure is problematic, as gluten consumption and thus the selection against coeliac disease probably changed with agriculture. A crude estimate of the age of the SH2B3 variant based simply on haplotype length suggested that it was very recent191 (<2,000 years old). However, simulations suggest that the long haplotype tests used to detect the selection at SH2B3 are sensitive to selection events ~5,000–50,000 years ago28, which implies that the SH2B3 selection could date to either before or after the spread of agriculture through Europe ~10,000 years ago192. Accurately dating selected variants is challenging and requires methods that can both estimate multiple parameters and test various ancestry models, such as Approximate Bayesian Computation193.
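
The logic of Approximate Bayesian Computation can be sketched with a toy rejection sampler: draw a candidate allele age from a prior, simulate data under that age and keep the candidate only if the simulated data resemble the observed data. The example below is a minimal sketch under strong assumptions and is not the approach of any cited study. It assumes a recombination-only model in which a haplotype spanning c Morgans stays unbroken for t generations with probability e^(−ct), a uniform prior on age and hypothetical observed counts. All function names and parameter values are illustrative.

```python
import math
import random


def simulate_intact_count(age: float, c: float, n: int, rng: random.Random) -> int:
    """Simulate how many of n carrier chromosomes still have the intact
    ancestral haplotype, assuming each survives with probability exp(-c * age)."""
    p_intact = math.exp(-c * age)
    return sum(rng.random() < p_intact for _ in range(n))


def abc_allele_age(observed_intact: int, n: int, c: float, max_age: float = 1000,
                   draws: int = 5000, tolerance: int = 2, seed: int = 1) -> list:
    """ABC rejection sampler for allele age in generations.

    Ages are drawn from a uniform prior on [1, max_age]; a draw is accepted
    when the simulated intact count is within `tolerance` of the observed one.
    The accepted ages approximate the posterior distribution of the age.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(draws):
        age = rng.uniform(1, max_age)
        simulated = simulate_intact_count(age, c, n, rng)
        if abs(simulated - observed_intact) <= tolerance:
            accepted.append(age)
    return accepted


if __name__ == "__main__":
    # Hypothetical data: 60 of 100 carrier chromosomes keep an intact
    # ancestral haplotype across a 0.5 cM (0.005 Morgan) interval.
    posterior = sorted(abc_allele_age(observed_intact=60, n=100, c=0.005))
    point_estimate = -math.log(60 / 100) / 0.005  # closed-form check, in generations
    print(len(posterior), posterior[len(posterior) // 2], point_estimate)
```

In this toy case the answer is available in closed form, so ABC is unnecessary. Its value lies in models with demography, selection and several unknown parameters, where the likelihood cannot be written down but data can still be simulated.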

Non-autoimmune disease: kidney disease. African Americans suffer from kidney disease, including focal segmental glomerulosclerosis (FSGS) and hypertension-attributed end-stage kidney disease (H-ESKD), at higher rates than European Americans. A region around the myosin heavy chain gene MYH9 was associated with FSGS and H-ESKD in African Americans, but no causal variants were found194,195. One study86 expanded the search to include an adjacent signal of African positive selection at the APOL1 gene. Using data from the 1000 Genomes Project, the authors tested all polymorphisms with large frequency differences between Africans and Europeans in this expanded interval and identified two independent coding variants in APOL1 that are strongly associated with FSGS (odds ratio = 10.5) and H-ESKD (odds ratio = 7.3). In vitro assays showed that the kidney disease-associated variants lyse T. brucei rhodesiense, which is the trypanosome parasite that causes the most acute, virulent form of sleeping sickness (Fig. 3B). The authors propose that ancient selection for resistance to sleeping sickness or a related pathogen in Africa contributes to the high rates of kidney disease in African Americans.
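
For readers unfamiliar with the odds ratios quoted above, the sketch below shows how an odds ratio and an approximate 95% confidence interval (Woolf's logit method) are derived from a 2 × 2 table of carriers and non-carriers among cases and controls. The counts are hypothetical and are not taken from the cited study, which may also have used a different genetic model. The function name is illustrative.

```python
import math


def odds_ratio_with_ci(case_carriers: int, case_noncarriers: int,
                       control_carriers: int, control_noncarriers: int,
                       z: float = 1.96):
    """Odds ratio and Woolf (logit) confidence interval from a 2x2 table.

    Returns (odds_ratio, lower, upper). z = 1.96 gives an approximate
    95% interval. All four cell counts must be positive.
    """
    cells = (case_carriers, case_noncarriers, control_carriers, control_noncarriers)
    if any(count <= 0 for count in cells):
        raise ValueError("all cell counts must be positive")
    a, b, c, d = cells
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio) under Woolf's approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts: 30 of 100 cases and 10 of 100 controls carry the variant.
    print(odds_ratio_with_ci(30, 70, 10, 90))
```

An odds ratio of 1 means that carriers and non-carriers have the same odds of disease. An interval that excludes 1 indicates an association at roughly the 5% level, before any correction for multiple testing.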


One of the oldest topics in genetics, natural selection for pathogen resistance, is being transformed by high-throughput biotechnology that offers unprecedented power to examine genome evolution. New research finds that both the most ancient signals of balancing selection (on cell surface glycoproteins) and some of the clearest signals of recent positive selection (on TLRs) implicate pathogens as the strongest selective pressure to drive the evolution of modern humans. Incorporating ancient history into disease susceptibility studies will help to identify functional variants, elucidate cellular mechanisms and facilitate the development of new therapies for a surprisingly wide range of human illnesses. The range of diseases affected by ancient pathogen-driven selection extends beyond infectious diseases and immune-mediated disorders. Inflammatory and immunity genes are associated with psychiatric diseases, including schizophrenia196 and autism197, and the role of host genetics in maintaining a healthy microbiome is only starting to be examined14. At the dawn of genomic medicine, our ancient evolutionary history is one of our most powerful resources for understanding human biology and improving human health.