Given the importance of Africa to studies of human origins and disease susceptibility, detailed characterization of African genetic diversity is needed. The African Genome Variation Project provides a resource with which to design, implement and interpret genomic studies in sub-Saharan Africa and worldwide. The African Genome Variation Project represents dense genotypes from 1,481 individuals and whole-genome sequences from 320 individuals across sub-Saharan Africa. Using this resource, we find novel evidence of complex, regionally distinct hunter-gatherer and Eurasian admixture across sub-Saharan Africa. We identify new loci under selection, including loci related to malaria susceptibility and hypertension. We show that modern imputation panels (sets of reference genotypes from which unobserved or missing genotypes in study sets can be inferred) can identify association signals at highly differentiated loci across populations in sub-Saharan Africa. Using whole-genome sequencing, we demonstrate further improvements in imputation accuracy, strengthening the case for large-scale sequencing efforts of diverse African haplotypes. Finally, we present an efficient genotype array design capturing common genetic variation in Africa.
Extended data figures and tables
Extended Data Figures
- Extended Data Figure 1: Allele sharing between sequenced populations in the AGVP.
a, The overlap of SNPs between 4×WGS data from Zulu, Ugandan and Ethiopian individuals (subsampled to 100 samples each). b, The overlap of novel variants (those not in the 1000 Genomes Project phase I integrated call set, ‘1000G’) between the three populations. c, d, The allele frequency spectra of variants in different portions of the Venn diagrams depicted in a and b, respectively. A large proportion of variants in each population appears to be unshared (private): between 10% and 23% of the total number of variants in a given population. The proportion of novel variants was high, with Ethiopia showing the greatest proportion of novel variation. Most of the novel variation appears to be unshared and rare.
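The private and novel proportions described in this caption reduce to set operations on variant identifiers. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `sharing_summary`, the dictionary layout and the toy variant IDs are all hypothetical, and a real analysis would key variants by chromosome, position and alleles rather than by short strings.

```python
def sharing_summary(callsets, reference):
    """Summarise private and novel variation per population.

    callsets:  dict mapping population name -> set of variant IDs (hypothetical layout)
    reference: set of variant IDs already present in a reference call set
    Returns a dict mapping population name -> fractions of its variants that are
    private (seen in no other population) and novel (absent from the reference).
    """
    summary = {}
    for name, variants in callsets.items():
        # Union of every other population's variants.
        others = set().union(*(v for n, v in callsets.items() if n != name))
        private = variants - others
        novel = variants - reference
        total = len(variants)
        summary[name] = {
            "private_fraction": len(private) / total if total else None,
            "novel_fraction": len(novel) / total if total else None,
        }
    return summary


# Toy data only; these IDs and counts are illustrative, not from the study.
toy_callsets = {
    "Zulu": {"a", "b", "c", "d"},
    "Uganda": {"b", "c", "e"},
    "Ethiopia": {"c", "f", "g", "h"},
}
toy_reference = {"a", "b", "c", "e"}
toy_summary = sharing_summary(toy_callsets, toy_reference)
```

Keeping "private" (relative to the other study populations) and "novel" (relative to an external reference) as separate fractions mirrors the distinction the caption draws between panels a and b.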
- Extended Data Figure 2: The first ten principal components for the African data set.
PC1 shows a cline among several African populations, most likely representing Eurasian gene flow (n = 1,481). PC2 shows a clear separation between West and South/East Africa. Subsequent PCs show more detailed structure between and within African populations.
- Extended Data Figure 3: The first ten principal components for the global data set, including populations from the 1000 Genomes Project.
PC1 shows a cline among several African populations extending towards European populations, most likely representing non-SSA gene flow (n = 2,864). PC2 shows a clear separation between European and Asian populations. Subsequent PCs show more detailed structure between populations globally, and within African populations. GBR, British in England and Scotland; ACB, African Caribbeans in Barbados; ASW, Americans of African ancestry in southwestern USA; CDX, Chinese Dai in Xishuangbanna, China; CEU, Utah residents with Northern and Western European ancestry; CHB, Han Chinese in Beijing, China; CHS, Southern Han Chinese; CLM, Colombians from Medellin, Colombia; FIN, Finnish in Finland; GIH, Gujarati Indian from Houston, Texas, USA; IBS, Iberian population in Spain; JPT, Japanese in Tokyo, Japan; KHV, Kinh in Ho Chi Minh City, Vietnam; MXL, Mexican ancestry from Los Angeles, USA; PEL, Peruvians from Lima, Peru; PUR, Puerto Ricans from Puerto Rico; and TSI, Toscani in Italy.
- Extended Data Figure 4: The first ten principal components for the global extended data set, including populations from the 1000 Genomes Project, Human Genome Diversity Project, North African and Khoe-San population groups.
PC1 shows a cline among several African populations extending towards European populations, most likely to represent non-SSA gene flow (n = 3,202). PC2 shows a clear separation between European and Asian populations. Subsequent principal components show more detailed structure between populations globally, and within African populations.
- Extended Data Figure 5: Projection of principal components to assess admixture among African populations.
a, The projection of principal components calculated on YRI and CEU from the 1000 Genomes Project onto the African populations. The AGVP populations fall on a cline between YRI and CEU, with Ethiopian populations closest to CEU. This is suggestive of Eurasian ancestry among these populations. b, The projection of principal components calculated on YRI and Ju/’hoansi onto the AGVP and other Khoe-San populations. The AGVP and Khoe-San populations fall on a cline between YRI and Ju/’hoansi, with Zulu and Sotho leading the cline among the AGVP populations. This is suggestive of hunter-gatherer (HG) gene flow among these populations.
- Extended Data Figure 6: ADMIXTURE clustering analysis for AGVP samples combined with the 1000 Genomes Project, Human Genome Diversity Project, North African and Khoe-San samples.
Cluster K = 2 shows separation of European and African ancestry, with delineation of Asian and Khoe-San ancestry in cluster K = 4. Subsequent clusters show separation of East, West, North and South African ancestral components (n = 3,202).
- Extended Data Figure 7: Dating and source of admixture in the AGVP.
a, The time and most likely sources of admixture with means and 95% confidence intervals for different AGVP populations estimated with MALDER (see Supplementary Note 5). Circular markers with a line drawn around them represent high-probability events, while those with no line around them represent low-probability events. b, The time and most likely sources of admixture estimated with MALDER for the same populations using high-quality imputed data to improve resolution.
- Extended Data Figure 8: Loci with marked allelic differentiation either globally or within Africa.
The derived and ancestral alleles are depicted in blue and red, respectively, for all loci. a, The global distribution of the non-synonymous variant rs17047661 at the CR1 locus implicated in malaria severity. This locus was noted to be among the most differentiated sites (in the top 0.1%) between Europe and Africa. b, The global distribution of the rs10216063 SNP at the AQP2 locus. The derived allele appears to be the major allele among European populations in contrast to African populations. c, The allele frequency distribution of rs10924081 at the ATP1A1 locus. Marked differentiation is observed globally, with the derived allele noted to be the major allele among European populations. d, The global distribution of the risk allele for the SNP rs1378940 in the CSK locus associated with hypertension. This locus was found to be within the top 0.1% of differentiated loci within Africa, and within the top 1% of differentiated loci globally. e, The allele frequency distribution of the rs3213419 SNP at the HP locus. f, The allele frequency distribution of the rs7313726 SNP at the CD163 locus. The HP and CD163 loci are among the top 0.1% of differentiated sites between malaria endemic and non-endemic regions in Africa.
- Extended Data Figure 9: The global distribution of biologically relevant loci used for simulation of traits to examine reproducibility of signals across AGVP populations.
a, The frequency of the sickle-cell variant (rs334) in different regions globally. The blue portion of each pie chart represents the frequency of the causal allele A. b, The distribution of the SORT1 causal SNP rs12740374, with the derived allele T depicted in blue. c–f, The distributions of the APOL1 variant rs73885319, TCF7L2 variant rs7903146, the APOE variant rs429358 and the PRDM9 variant rs6889665, respectively.
- Extended Data Figure 10: The coverage obtained across the genome for variants at different allele frequencies for a hypothetical African genotype array with one million tagging variants.
Different allele frequency bins are depicted in different colours. The lines show the coverage that can be achieved by imputation at different r² thresholds. Coverage, here, is defined as the proportion of variants within an allele frequency bin captured above a pre-defined r² threshold (along the x axis) after imputation. The solid lines represent the coverage obtained with one million variants selected using the hybrid tagging and imputation approach, while the broken lines represent the coverage obtained by using a simple pairwise tagging approach to capture one million tagging variants. The hybrid method improves the coverage obtained, particularly for common variation. Coverage for common variants (>5%) appears to be high at an r² threshold of 0.8 and above, with >80% of these variants accurately imputed.
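The coverage definition in this caption can be written as a short calculation. The sketch below is illustrative only and rests on several assumptions: the function name `coverage_by_bin`, the (allele frequency, r²) input layout, the bin edges and the toy values are all hypothetical, and a variant is counted as captured when its r² is at or above the threshold.

```python
from bisect import bisect_right


def coverage_by_bin(variants, r2_threshold, bin_edges):
    """Proportion of variants per allele-frequency bin with imputation r2 >= threshold.

    variants:     iterable of (allele_frequency, imputation_r2) pairs (hypothetical layout)
    r2_threshold: minimum r2 for a variant to count as captured (assumed inclusive)
    bin_edges:    ascending frequency edges; bins are [lo, hi), with the last bin closed
    Returns one coverage value per bin, or None for a bin with no variants.
    """
    n_bins = len(bin_edges) - 1
    captured = [0] * n_bins
    total = [0] * n_bins
    for freq, r2 in variants:
        if freq < bin_edges[0] or freq > bin_edges[-1]:
            continue  # outside the binned frequency range
        # Locate the bin; clamp so a frequency equal to the top edge joins the last bin.
        i = min(bisect_right(bin_edges, freq) - 1, n_bins - 1)
        total[i] += 1
        if r2 >= r2_threshold:
            captured[i] += 1
    return [c / t if t else None for c, t in zip(captured, total)]


# Toy values only; not taken from the study.
toy_variants = [
    (0.005, 0.40), (0.005, 0.90),          # rare
    (0.03, 0.85),                           # low frequency
    (0.20, 0.95), (0.30, 0.70), (0.50, 0.81),  # common (>5%)
]
toy_coverage = coverage_by_bin(toy_variants, r2_threshold=0.8,
                               bin_edges=[0.0, 0.01, 0.05, 0.5])
```

Evaluating the same function over a range of thresholds would trace out one curve per frequency bin, which is the form of plot the caption describes.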