The African Genome Variation Project shapes medical genetics in Africa

Journal name: Nature
Volume: 517
Pages: 327–332
DOI: 10.1038/nature13997

Abstract

Given the importance of Africa to studies of human origins and disease susceptibility, detailed characterization of African genetic diversity is needed. The African Genome Variation Project provides a resource with which to design, implement and interpret genomic studies in sub-Saharan Africa and worldwide. The African Genome Variation Project represents dense genotypes from 1,481 individuals and whole-genome sequences from 320 individuals across sub-Saharan Africa. Using this resource, we find novel evidence of complex, regionally distinct hunter-gatherer and Eurasian admixture across sub-Saharan Africa. We identify new loci under selection, including loci related to malaria susceptibility and hypertension. We show that modern imputation panels (sets of reference genotypes from which unobserved or missing genotypes in study sets can be inferred) can identify association signals at highly differentiated loci across populations in sub-Saharan Africa. Using whole-genome sequencing, we demonstrate further improvements in imputation accuracy, strengthening the case for large-scale sequencing efforts of diverse African haplotypes. Finally, we present an efficient genotype array design capturing common genetic variation in Africa.

At a glance

Figures

  1. Figure 1: Populations studied in the AGVP.

    a, 18 African populations studied in the AGVP including 2 populations from the 1000 Genomes Project. (The term ‘Ethiopia’ encompasses the Oromo, Amhara and Somali ethno-linguistic groups.) b, c, ADMIXTURE analysis of these 18 populations alone (n = 1,481) (b) and in a global context (n = 3,904) (c). Each colour represents a different ancestral cluster, with clusters 2–6 represented along the y-axis in b and clusters 2–18 represented in c. K = 6 and K = 18 were the most likely clusters on ADMIXTURE analysis. ADMIXTURE analysis suggests substructure between North, East, West and South Africa. Studying these populations in the context of Eurasian and African hunter-gatherer (HG) populations suggests extensive Eurasian and HG admixture across Africa.

  2. Figure 2: Dating and proportion of Eurasian and HG admixture among African populations.

    The proportion and distribution of Eurasian and HG admixture among different populations across Africa, with approximate dating of admixture using MALDER (code was provided by J. Pickrell; see Supplementary Information).

  3. Figure 3: Improvement in imputation accuracy with the AGVP WGS panel.

    Adding the AGVP WGS reference panel to the 1000 Genomes Project phase I reference panel (‘merged’) substantially improves imputation accuracy in some populations (Sotho) but only minimally in others (Igbo). This suggests poor representation of some haplotypes (for example, Khoe-San haplotypes in Sotho) in the 1000 Genomes Project reference panel alone (‘1000’). r² is the squared correlation coefficient between imputed and genotyped data, obtained by masking each genotyped variant during imputation. MAF, minor allele frequency.

  4. Extended Data Fig. 1: Allele sharing between sequenced populations in the AGVP.

    a, The overlap of SNPs between 4× WGS data from Zulu, Ugandan and Ethiopian individuals (subsampled to 100 samples each). b, The overlap of novel variants (those not in the 1000 Genomes Project phase I integrated call set, ‘1000G’) between the three populations. c, d, The allele frequency spectra of variants in different portions of the Venn diagrams depicted in a and b, respectively. There appears to be a large proportion of unshared (private) variants in each population: between 10% and 23% of the total number of variants in a given population. The proportion of novel variants was high, with Ethiopia showing the greatest proportion of novel variation. Most of the novel variation appears to be unshared and rare.

  5. Extended Data Fig. 2: The first ten principal components for the African data set.

    PC1 shows a cline among several African populations, most likely representing Eurasian gene flow (n = 1,481). PC2 shows a clear separation between West and South/East Africa. Subsequent PCs show more detailed structure between and within African populations.

  6. Extended Data Fig. 3: The first ten principal components for the global data set, including populations from the 1000 Genomes Project.

    PC1 shows a cline among several African populations extending towards European populations, most likely representing non-sub-Saharan African (non-SSA) gene flow (n = 2,864). PC2 shows a clear separation between European and Asian populations. Subsequent PCs show more detailed structure between populations globally and within African populations. GBR, British in England and Scotland; ACB, African Caribbeans in Barbados; ASW, Americans of African ancestry in southwestern USA; CDX, Chinese Dai in Xishuangbanna, China; CEU, Utah residents with Northern and Western European ancestry; CHB, Han Chinese in Beijing, China; CHS, Southern Han Chinese; CLM, Colombians from Medellin, Colombia; FIN, Finnish in Finland; GIH, Gujarati Indian from Houston, Texas, USA; IBS, Iberian population in Spain; JPT, Japanese in Tokyo, Japan; KHV, Kinh in Ho Chi Minh City, Vietnam; MXL, Mexican ancestry from Los Angeles, USA; PEL, Peruvians from Lima, Peru; PUR, Puerto Ricans from Puerto Rico; and TSI, Toscani in Italy.

  7. Extended Data Fig. 4: The first ten principal components for the global extended data set, including populations from the 1000 Genomes Project, Human Genome Diversity Project, North African and Khoe-San population groups.

    PC1 shows a cline among several African populations extending towards European populations, most likely representing non-sub-Saharan African (non-SSA) gene flow (n = 3,202). PC2 shows a clear separation between European and Asian populations. Subsequent principal components show more detailed structure between populations globally and within African populations.

  8. Extended Data Fig. 5: Projection of principal components to assess admixture among African populations.

    a, The projection of principal components calculated on YRI and CEU from the 1000 Genomes Project onto the African populations. The AGVP populations are seen to fall on a cline between YRI and CEU, with Ethiopian populations closest to CEU. This is suggestive of Eurasian ancestry among these populations. b, The projection of principal components calculated on YRI and Ju/’hoansi onto the AGVP and other Khoe-San populations. The AGVP and Khoe-San populations are seen to fall on a cline between YRI and Ju/’hoansi, with Zulu and Sotho leading the cline among the AGVP populations. This is suggestive of HG gene flow among these populations.

  9. Extended Data Fig. 6: ADMIXTURE clustering analysis for AGVP samples combined with the 1000 Genomes Project, Human Genome Diversity Project, North African and Khoe-San samples.

    Cluster K = 2 shows separation of European and African ancestry, with delineation of Asian and Khoe-San ancestry in cluster K = 4. Subsequent clusters show separation of East, West, North and South African ancestral components (n = 3,202).

  10. Extended Data Fig. 7: Dating and source of admixture in the AGVP.

    a, The time and most likely sources of admixture with means and 95% confidence intervals for different AGVP populations estimated with MALDER (see Supplementary Note 5). Circular markers with a line drawn around them represent high-probability events, while those with no line around them represent low-probability events. b, The time and most likely sources of admixture estimated with MALDER for the same populations using high-quality imputed data to improve resolution.

  11. Extended Data Fig. 8: Loci with marked allelic differentiation either globally or within Africa.

    The derived and ancestral alleles are depicted in blue and red, respectively, for all loci. a, The global distribution of the non-synonymous variant rs17047661 at the CR1 locus implicated in malaria severity. This locus was noted to be among the most differentiated sites (in the top 0.1%) between Europe and Africa. b, The global distribution of the rs10216063 SNP at the AQP2 locus. The derived allele appears to be the major allele among European populations in contrast to African populations. c, The allele frequency distribution of rs10924081 at the ATP1A1 locus. Marked differentiation is observed globally, with the derived allele noted to be the major allele among European populations. d, The global distribution of the risk allele for the SNP rs1378940 in the CSK locus associated with hypertension. This locus was found to be within the top 0.1% of differentiated loci within Africa, and within the top 1% of differentiated loci globally. e, The allele frequency distribution of the rs3213419 SNP at the HP locus. f, The allele frequency distribution of the rs7313726 SNP at the CD163 locus. The HP and CD163 loci are among the top 0.1% of differentiated sites between malaria-endemic and non-endemic regions in Africa.

  12. Extended Data Fig. 9: The global distribution of biologically relevant loci used for simulation of traits to examine reproducibility of signals across AGVP populations.

    a, The frequency of the sickle-cell variant (rs334) in different regions globally. The blue portion of each pie chart represents the frequency of the causal allele A. b, The distribution of the SORT1 causal SNP rs12740374, with the derived allele T depicted in blue. c–f, The distributions of the APOL1 variant rs73885319, the TCF7L2 variant rs7903146, the APOE variant rs429358 and the PRDM9 variant rs6889665, respectively.

  13. Extended Data Fig. 10: The coverage obtained across the genome for variants at different allele frequencies for a hypothetical African genotype array with one million tagging variants.

    Different allele frequency bins are depicted in different colours. The lines show the coverage that can be achieved by imputation at different r² thresholds. Coverage, here, is defined as the proportion of variants within an allele frequency bin captured above a pre-defined r² threshold (along the x-axis) after imputation. The solid lines represent the coverage obtained with one million variants selected using the hybrid tagging and imputation approach, while the broken lines represent the coverage obtained by using a simple pairwise tagging approach to capture one million tagging variants. The hybrid method improves the coverage obtained, particularly for common variation. Coverage for common variants (>5%) appears to be high at an r² threshold of 0.8 and above, with >80% of these variants accurately imputed.
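
The two quantities used in the captions for Figure 3 and Extended Data Fig. 10 (per-variant imputation r², and coverage as the share of variants in an allele frequency bin that reach an r² threshold) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy data, the half-open bin convention and the use of "greater than or equal to" for the threshold are all assumptions made for illustration.

```python
def squared_correlation(true_genotypes, imputed_dosages):
    """Squared Pearson correlation (r^2) between masked genotypes (0/1/2)
    and the dosages imputed for the same individuals."""
    n = len(true_genotypes)
    if n == 0 or n != len(imputed_dosages):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        # Assumption: a variant with no variation is treated as not captured.
        return 0.0
    return (cov * cov) / (var_t * var_d)


def minor_allele_frequency(genotypes):
    """MAF from diploid genotypes coded as counts (0, 1 or 2) of one allele."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)


def coverage_by_bin(variants, bins, r2_threshold):
    """Fraction of variants per MAF bin whose imputation r^2 meets the threshold.

    variants: list of (maf, r2) pairs.
    bins: list of (low, high) pairs, interpreted as low < maf <= high
          (a hypothetical convention chosen for this sketch).
    Returns a dict mapping each bin to its coverage, or None for an empty bin.
    """
    coverage = {}
    for low, high in bins:
        in_bin = [r2 for maf, r2 in variants if low < maf <= high]
        if in_bin:
            captured = sum(1 for r2 in in_bin if r2 >= r2_threshold)
            coverage[(low, high)] = captured / len(in_bin)
        else:
            coverage[(low, high)] = None
    return coverage


# Toy (maf, r2) values, invented purely for illustration.
toy_variants = [(0.10, 0.90), (0.20, 0.70), (0.30, 0.85), (0.02, 0.50)]
toy_bins = [(0.0, 0.05), (0.05, 0.5)]
toy_coverage = coverage_by_bin(toy_variants, toy_bins, r2_threshold=0.8)
```

In this toy data, two of the three common variants (MAF above 5%) reach r² of at least 0.8, so coverage in that bin is two thirds, while the single rare variant is not captured. Plotting such coverage values against a range of thresholds, one curve per bin, would give a figure of the kind the caption describes.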

Author information

  1. These authors contributed equally to this work.

    • Deepti Gurdasani,
    • Tommy Carstensen,
    • Fasil Tekola-Ayele,
    • Luca Pagani &
    • Ioanna Tachmazidou
  2. These authors jointly supervised this work.

    • Chris Tyler-Smith,
    • Charles Rotimi,
    • Eleftheria Zeggini &
    • Manjinder S. Sandhu

Affiliations

  1. Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge CB10 1SA, UK

    • Deepti Gurdasani,
    • Tommy Carstensen,
    • Luca Pagani,
    • Ioanna Tachmazidou,
    • Konstantinos Hatzikotoulas,
    • Savita Karthikeyan,
    • Louise Iles,
    • Martin O. Pollard,
    • Graham R. S. Ritchie,
    • Yali Xue,
    • Jennifer Asimit,
    • Elizabeth H. Young,
    • Cristina Pomilla,
    • Katja Kivinen,
    • Dominic Kwiatkowski,
    • Chris Tyler-Smith,
    • Eleftheria Zeggini &
    • Manjinder S. Sandhu
  2. Department of Public Health and Primary Care, University of Cambridge, 2 Wort’s Causeway, Cambridge, CB1 8RN, UK

    • Deepti Gurdasani,
    • Tommy Carstensen,
    • Savita Karthikeyan,
    • Louise Iles,
    • Elizabeth H. Young,
    • Cristina Pomilla &
    • Manjinder S. Sandhu
  3. Centre for Research on Genomics and Global Health, National Human Genome Research Institute, National Institutes of Health, 12 South Drive, MSC 5635, Bethesda, Maryland 20891-5635, USA

    • Fasil Tekola-Ayele,
    • Ayo P. Doumatey,
    • Adebowale Adeyemo &
    • Charles Rotimi
  4. Department of Biological, Geological and Environmental Sciences, University of Bologna, Via Selmi 3, 40126 Bologna, Italy

    • Luca Pagani
  5. Department of Archaeology, University of York, King’s Manor, York YO1 7EP, UK

    • Louise Iles
  6. Sydney Brenner Institute of Molecular Bioscience (SBIMB), University of the Witwatersrand, The Mount, 9 Jubilee Road, Parktown 2193, Johannesburg, Gauteng, South Africa

    • Ananyo Choudhury &
    • Michele Ramsay
  7. Vertebrate Genomics, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

    • Graham R. S. Ritchie
  8. Medical Research Council/Uganda Virus Research Institute, Plot 51-57 Nakiwogo Road, Uganda

    • Rebecca N. Nsubuga,
    • Anatoli Kamali,
    • Gershim Asiki,
    • Janet Seeley &
    • Pontiano Kaleebu
  9. Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Headington, Oxford OX3 7BN, UK

    • Kirk Rockett &
    • Dominic Kwiatkowski
  10. Medical Research Council Unit, Atlantic Boulevard, Serrekunda, PO Box 273, Banjul, The Gambia

    • Fatoumatta Sisay-Joof,
    • Muminatou Jallow &
    • Kalifa Bojang
  11. Medical Research Council/Wits Rural Public Health and Health Transitions Unit, School of Public Health, Education Campus, 27 St Andrew’s Road, Parktown 2192, Johannesburg, Gauteng, South Africa

    • Stephen Tollman
  12. INDEPTH Network, 38/40 Mensah Wood Street, East Legon, PO Box KD 213, Kanda, Accra, Ghana

    • Stephen Tollman
  13. Institute of Biotechnology, Addis Ababa University, Entoto Avenue, Arat Kilo, 16087 Addis Ababa, Ethiopia

    • Ephrem Mekonnen
  14. Department of Genetics Evolution and Environment, University College, London, Gower Street, London WC1E 6BT, UK

    • Rosemary Ekong
  15. University of Haramaya, Department of Biology, PO Box 138, Dire Dawa, Ethiopia

    • Tamiru Oljira
  16. Henry Stewart Group, 28/30 Little Russell Street, London WC1A 2HN, UK

    • Neil Bradman
  17. Division of Human Genetics, National Health Laboratory Service, C/O Hospital and de Korte Streets, Braamfontein 2000, Johannesburg, South Africa

    • Michele Ramsay
  18. School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Braamfontein 2000, Johannesburg, South Africa

    • Michele Ramsay
  19. Department of Microbial, Cellular and Molecular Biology, College of Natural Sciences, Arat Kilo Campus, Addis Ababa University, PO Box 1176, Addis Ababa, Ethiopia

    • Endashaw Bekele
  20. Department of Diabetes and Endocrinology, University of KwaZulu-Natal, 719 Umbilo Road, Congella, Durban 4013, South Africa

    • Ayesha Motala &
    • Fraser Pirie
  21. Department of Paediatrics, University of Witwatersrand, 7 York Road, Parktown 2198, Johannesburg, Gauteng, South Africa

    • Shane A. Norris

Contributions

Overall project coordination: D.G., C.P., M.S.S. (Project Chair), E.H.Y. and E.Z. coordinated the project.

Analysis and writing: C.P. coordinated sample collation, genotyping, quality control and data generation for the study. J.A., T.C., D.G. and C.P. carried out quality control and curation of data. R.N. and Y.X. undertook quality control for the MalariaGEN and Ethiopian population sets, respectively. M.O.P. carried out quality control and BAM (sequencing reads file format) improvement of sequence data at all depths. T.C. curated and generated all sequence data, and carried out comparisons with genotype array data and with higher coverage data. D.G. carried out the population structure and admixture analyses. A.C., D.G., S.K. and L.P. carried out analysis of positive selection and population differentiation. L.P. and I.T. carried out analysis of linkage disequilibrium decay. T.C., K.H. and I.T. carried out imputation-based analyses. T.C. developed an efficient tagging algorithm and carried out analysis for coverage of tagging variants for the design of the African genotype array. D.G. and F.T.-A. carried out fine mapping analyses. C.R., M.S.S., C.T.-S. and E.Z. critically appraised and commented on the manuscript. D.G., T.C., L.P. and M.S.S. prepared the manuscript and the Supplementary Information. C.P. and L.I. contributed to the writing of the Supplementary Information. All authors commented on the interpretation of results, and reviewed and approved the final manuscript.

Management, fieldwork, laboratory analyses and coordination of contributing cohorts: K.B., M.J., K.K., D.K., K.R. and F.S.-J. (the Gambian cohorts, MalariaGEN); G.A., P.K., A.K., M.S.S. and J.S. (The General Population Cohort Study); A.M. and F.P. (the South African Zulu cohort); A.A., A.P.D., C.R. and F.T.-A. (the Kenyan, Ghanaian and Nigerian cohorts); A.C., S.N., M.R. and S.T. (the South African Sotho cohort); and E.B., N.B., R.E., E.M., T.O., L.P. and C.T. (the Ethiopian cohort).

Competing financial interests

The authors declare no competing financial interests.

The ADMIXTURE code is available at https://www.genetics.ucla.edu/software/admixture/download.html. The MALDER software is available from J. Pickrell (jkpickrell@nygenome.org). All other source code can be obtained by contacting D.G. (dg11@sanger.ac.uk). See Supplementary Methods for details.

Supplementary information

PDF files

  1. Supplementary Information (17.5 MB)

    This file contains Supplementary Methods.

  2. Supplementary Data (17.1 MB)

    This file contains Supplementary Tables 1-8 and Supplementary Figures 1-18.
