The Simons Genome Diversity Project: 300 genomes from 142 diverse populations

Journal name: Nature
Volume: 538
Pages: 201–206
DOI: 10.1038/nature18964

Abstract

Here we report the Simons Genome Diversity Project data set: high quality genomes from 300 individuals from 142 diverse populations. These genomes include at least 5.8 million base pairs that are not present in the human reference genome. Our analysis reveals key features of the landscape of human genome variation, including that the rate of accumulation of mutations has accelerated by about 5% in non-Africans compared to Africans since divergence. We show that the ancestors of some pairs of present-day human populations were substantially separated by 100,000 years ago, well before the archaeologically attested onset of behavioural modernity. We also demonstrate that indigenous Australians, New Guineans and Andamanese do not derive substantial ancestry from an early dispersal of modern humans; instead, their modern human ancestry is consistent with coming from the same source as that of other non-Africans.

Figures

  1. Figure 1: Genetic variation in the SGDP.

    a, Neighbour-joining tree of relationships based on pairwise divergence. b, Plot of autosomal heterozygosity against the X-to-autosome heterozygosity ratio, showing the reduction in this ratio in non-Africans and pygmies. c, Estimate of Neanderthal ancestry with a heat map scale of 0–3%. d, Estimate of Denisovan ancestry with a heat map scale of 0–0.5% to bring out subtle differences in mainland Eurasia (Oceanian groups with as much as 5% Denisovan ancestry are saturated in bright red).

  2. Figure 2: Cross-coalescence rates and effective population sizes for selected population pairs.

    a–c, Cross-coalescence rates as a function of time in thousands of years ago (kya) estimated using MSMC, with four haplotypes per pair. In each subfigure legend, we give the point estimate of the date at which 25%, 50% and 75% of lineages in the pair of populations have coalesced into a common ancestral population. We generated these plots using data phased with the 1000 Genomes reference panel (method PS1 described in Supplementary Information section 9), but only show pairs of populations for which the cross-coalescence rates are relatively insensitive to the phasing approach. a, Selected African cross-coalescence rates. b, Central African rainforest hunter–gatherer cross-coalescence rates. c, Ancient non-African cross-coalescence rates. d–f, Effective population sizes inferred using PSMC, using one diploid genome per population, for the same populations that we used in a–c.
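
    The 25%, 50% and 75% point estimates described in this caption can be illustrated by interpolating along a cross-coalescence curve. The following is a minimal sketch, not the MSMC procedure itself: it assumes the date is taken as the first time the relative cross-coalescence rate reaches the threshold, and the function name and toy numbers are hypothetical.

```python
def crossing_time(times_kya, rates, threshold):
    """Linearly interpolate the first time at which `rates` reaches `threshold`.

    times_kya must be increasing (going back in time); rates are the
    relative cross-coalescence rates at those times (assumed input format).
    """
    if rates[0] >= threshold:
        return times_kya[0]
    points = list(zip(times_kya, rates))
    for (t0, r0), (t1, r1) in zip(points, points[1:]):
        if r0 < threshold <= r1:
            return t0 + (threshold - r0) * (t1 - t0) / (r1 - r0)
    return None  # threshold never reached in the sampled range


# Toy curve with made-up values (not results from the paper).
times = [10, 50, 100, 150, 200]
rates = [0.0, 0.2, 0.6, 0.9, 1.0]
dates = {q: crossing_time(times, rates, q) for q in (0.25, 0.5, 0.75)}
```

    Linear interpolation is chosen here only for simplicity; a real curve is piecewise constant over time segments, so the choice of interpolation is itself an assumption.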

  3. Figure 3: Present-day populations have negligible ancestry from an early dispersal of modern humans out of Africa.

    Best-fitting admixture graph model of relationships among Australians, New Guineans, Andamanese and other diverse populations. Present-day populations are shown in blue, ancient samples in red, and select inferred ancestral nodes in green. Dotted lines indicate admixture events, all of which involve archaic humans. All f-statistic relationships are accurately fit to within 2.1 standard errors. Inset, results of adding putative early dispersal admixture to the graph model for different assumptions about when the early lineage split off. We specify the split time in terms of the genetic drift above the ‘Non-African’ node, with 0.01 units of drift representing on the order of ten thousand years. The (approximate) model likelihood is maximized with zero early dispersal ancestry, and no more than a few per cent is consistent with the data.

  4. Extended Data Fig. 1: Heat map of fraction of heterozygous sites missed in the 1000 Genomes project.

    For each sample, we examine all heterozygous sites passing filter level 1, and compute the fraction included as known polymorphisms in the 1000 Genomes project.

  5. Extended Data Fig. 2: Worldwide variation in human short tandem repeats.

    a, Mean STR length is reported as the average of the length difference (in base pairs) from the GRCh37 reference for each genotype. Bubble area scales with the number of calls compared at each point. b, c, The first two principal components after performing principal component analysis on tetranucleotide and homopolymer genotypes, respectively. Colours represent the region of origin of each sample. d, Pairwise FST values between populations computed using only SNPs versus using combined SNP + STR loci. e, Block jackknife standard errors for the SNP versus SNP + STR FST analysis. The red dashed lines give the best-fit line, described by the formula in red. The black dashed line denotes the diagonal.

  6. Extended Data Fig. 3: ADMIXTURE analysis.

    We carried out unsupervised ADMIXTURE 1.23 (refs 8, 43) analysis over the 300 SGDP individuals in 20 replicates with randomly chosen initial seeds, varying the number of ancestral populations between K = 2 and K = 12 and using default fivefold cross-validation (--cv flag). We used genotypes of at least filter level 1, and restricted analysis to sites where at least two individuals carried the variant allele (as singleton variants are non-informative for population clustering). After further filtering of sites with at least 99% completeness and performing linkage-disequilibrium-based pruning in PLINK 1.9 (refs 44, 45) with parameters (--indep-pairwise 1000 100 0.2), a total of 482,515 single nucleotide polymorphisms remained. This figure shows the highest likelihood replicate for each value of K. We found that log likelihood monotonically increases with K, while the value K = 5 minimizes cross-validation error (not shown). The solution at K = 5 corresponds to major continental groups (Sub-Saharan Africans, Oceanians, East Asians, Native Americans, and West Eurasians), but we show the full range of K here as the solutions at other values of K illustrate finer-scale population structure that may be useful to users of the data.
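
    The two site filters described in this caption (variant allele carried by at least two individuals, and at least 99% completeness) can be sketched on a toy genotype vector. This is an illustrative sketch only: the genotype coding (0, 1 or 2 variant alleles, None for missing) and the function name are assumptions, and the linkage-disequilibrium pruning and ADMIXTURE runs are not reproduced.

```python
def keep_site(genotypes, min_carriers=2, min_completeness=0.99):
    """Return True if a site passes both filters.

    genotypes: one entry per individual, giving the count of variant
    alleles (0, 1 or 2), or None when the genotype is missing.
    """
    called = [g for g in genotypes if g is not None]
    completeness = len(called) / len(genotypes)
    # Count individuals carrying the variant allele, not allele copies,
    # so a single homozygous carrier still counts as one individual.
    carriers = sum(1 for g in called if g > 0)
    return completeness >= min_completeness and carriers >= min_carriers


# Toy sites for 100 individuals (made-up data).
site_two_carriers = [1, 1] + [0] * 98              # kept
site_single_carrier = [2] + [0] * 99               # one carrier: dropped
site_incomplete = [1, 1, None, None] + [0] * 96    # 98% complete: dropped

kept = [keep_site(s) for s in
        (site_two_carriers, site_single_carrier, site_incomplete)]
```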

  7. Extended Data Fig. 4: Principal component analysis and neighbour joining tree.

    a, Principal component analysis. b, Neighbour-joining tree based on FST values for all populations with at least two samples.

  8. Extended Data Fig. 5: Fewer accumulated mutations in Africans than in non-Africans confirmed by mapping to chimpanzee.

    We compute a statistic D (Population A, Population B, Chimp), measuring the difference in the rate of matching to chimpanzee in Population A compared to Population B. The evidence of differential matching to chimpanzee is also seen when we restrict to the male X chromosome, which eliminates possible effects due to differences in heterozygosity across populations, and map to the chimpanzee genome, which is phylogenetically symmetrically related to all present-day humans. We find that in 78 randomly chosen pairs of males (Population A = African, Population B = non-African), transversion substitutions show no consistent skew from zero, but transition substitutions do.
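
    A statistic of this kind can be sketched for two haploid sequences (such as male X chromosomes) aligned to chimpanzee. The exact normalization and sign convention of D are not given in this caption, so the form below, D = (n_a - n_b) / (n_a + n_b) over sites where exactly one of the two sequences matches chimpanzee, is an assumption, as are the function name and the toy sequences.

```python
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def d_statistic(seq_a, seq_b, seq_chimp, kind=None):
    """Assumed form: D = (n_a - n_b) / (n_a + n_b).

    n_a counts sites where A matches chimpanzee and B does not;
    n_b counts sites where B matches chimpanzee and A does not.
    kind: None for all substitutions, or "transition" / "transversion"
    to restrict by the type of difference between A and B.
    """
    n_a = n_b = 0
    for a, b, c in zip(seq_a, seq_b, seq_chimp):
        # Skip sites where A and B agree or any base is not A/C/G/T.
        if a == b or not {a, b, c} <= set("ACGT"):
            continue
        if kind is not None:
            is_transition = frozenset((a, b)) in TRANSITIONS
            if is_transition != (kind == "transition"):
                continue
        if a == c:
            n_a += 1
        elif b == c:
            n_b += 1
    total = n_a + n_b
    return (n_a - n_b) / total if total else float("nan")


# Toy alignment (made-up sequences, not data from the paper).
chimp = "ACGTAC"
pop_a = "ACGTGA"
pop_b = "GTGTAC"
d_all = d_statistic(pop_a, pop_b, chimp)
d_ts = d_statistic(pop_a, pop_b, chimp, kind="transition")
d_tv = d_statistic(pop_a, pop_b, chimp, kind="transversion")
```

    Under this assumed convention, a positive value means sequence A matches chimpanzee more often than sequence B does, that is, fewer accumulated differences on the lineage leading to A.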

  9. Extended Data Fig. 6: 3P-CLR scan for positive selection.

    The red line denotes the 99.9% quantile cut-off. The genes in the top five regions are labelled. a, Scan for selection on the San terminal branch. b, Scan for selection on the non-San terminal branch. c, Scan for selection on the ancestral modern human branch.

  10. Extended Data Fig. 7: Scan for genomic locations where the great majority of present-day humans share a recent common ancestor.

    We carried out PSMC analysis on 40 pairs of haploid genomes chosen to sample some of the most deeply divergent present-day human lineages. We recorded the time since the most recent common ancestor (TMRCA) at each position, and rescaled to obtain an estimate of absolute time (Supplementary Information section 12). a, Distribution across the genome of the fraction of TMRCAs below specified date cut-offs. For the 100 kya cut-off, the maximum fraction observed anywhere in the genome is 68%. b, Distribution across the genome of the date T at which specified fractions of sample pairs are inferred to have a TMRCA less than T. c, Percentile points of the cumulative distribution function of B.
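
    The quantity plotted in panel a can be illustrated as follows: for each position, compute the fraction of sample pairs whose TMRCA falls below a cut-off, then look at how that fraction is distributed across positions. This is a minimal sketch with made-up numbers; the input layout (one list of pairwise TMRCAs, in kya, per position) and the function name are assumptions.

```python
def fraction_below(tmrcas_by_position, cutoff_kya):
    """For each position, the fraction of pairs with TMRCA below the cut-off."""
    return [
        sum(t < cutoff_kya for t in tmrcas) / len(tmrcas)
        for tmrcas in tmrcas_by_position
    ]


# Toy data: 3 positions, 4 sample pairs each (not values from the paper).
toy_tmrcas = [
    [50, 120, 300, 800],
    [90, 95, 99, 400],
    [500, 600, 700, 900],
]
fractions = fraction_below(toy_tmrcas, cutoff_kya=100)
max_fraction = max(fractions)
```

    Taking the maximum over positions corresponds to the kind of summary quoted in the caption for the 100 kya cut-off.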

Tables

  1. Extended Data Table 1: Fewer accumulated mutations in Africans than in non-Africans

Accession codes

Primary accessions

European Nucleotide Archive

References

  1. Abecasis, G. R. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012)
  2. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010)
  3. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010)
  4. Li, H. FermiKit: assembly-based variant calling for Illumina resequencing data. Preprint at http://arxiv.org/abs/1504.06574 (2015)
  5. Sudmant, P. H. et al. Global diversity, population stratification, and selection of human copy-number variation. Science 349, aab3761 (2015)
  6. Gymrek, M. & Erlich, Y. Profiling short tandem repeats from short reads. Methods Mol. Biol. 1038, 113–135 (2013)
  7. Gymrek, M., Golan, D., Rosset, S. & Erlich, Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. 22, 1154–1162 (2012)
  8. Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009)
  9. Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006)
  10. Keinan, A., Mullikin, J. C., Patterson, N. & Reich, D. Accelerated genetic drift on chromosome X during the human dispersal out of Africa. Nat. Genet. 41, 66–70 (2009)
  11. Keinan, A. & Reich, D. Can a sex-biased human demography account for the reduced effective population size of chromosome X in non-Africans? Mol. Biol. Evol. 27, 2312–2321 (2010)
  12. Verdu, P. et al. Sociocultural behavior, sex-biased admixture, and effective population sizes in Central African Pygmies and non-Pygmies. Mol. Biol. Evol. 30, 918–937 (2013)
  13. Joiris, D. V. The framework of central African hunter-gatherers and neighbouring societies. African Study Monographs Suppl. 28, 57–79 (2003)
  14. Green, R. E. et al. A draft sequence of the Neandertal genome. Science 328, 710–722 (2010)
  15. Meyer, M. et al. A high-coverage genome sequence from an archaic Denisovan individual. Science 338, 222–226 (2012)
  16. Wall, J. D. et al. Higher levels of Neanderthal ancestry in East Asians than in Europeans. Genetics 194, 199–209 (2013)
  17. Reich, D. et al. Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468, 1053–1060 (2010)
  18. Prüfer, K. et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43–49 (2014)
  19. Skoglund, P. & Jakobsson, M. Archaic human ancestry in East Asia. Proc. Natl Acad. Sci. USA 108, 18301–18306 (2011)
  20. Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011)
  21. Schiffels, S. & Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46, 919–925 (2014)
  22. Gronau, I., Hubisz, M. J., Gulko, B., Danko, C. G. & Siepel, A. Bayesian inference of ancient human demography from individual genome sequences. Nat. Genet. 43, 1031–1034 (2011)
  23. Schlebusch, C. M. et al. Genomic variation in seven Khoe-San groups reveals adaptation and complex African history. Science 338, 374–379 (2012)
  24. Veeramah, K. R. et al. An early divergence of KhoeSan ancestors from those of other modern humans is supported by an ABC-based analysis of autosomal resequencing data. Mol. Biol. Evol. 29, 617–630 (2012)
  25. Labuda, D., Zietkiewicz, E. & Yotova, V. Archaic lineages in the history of modern humans. Genetics 156, 799–808 (2000)
  26. Pickrell, J. K. et al. The genetic prehistory of southern Africa. Nat. Commun. 3, 1143 (2012)
  27. Patin, E. et al. Inferring the demographic history of African farmers and pygmy hunter-gatherers using a multilocus resequencing data set. PLoS Genet. 5, e1000448 (2009)
  28. Fu, Q. et al. Genome sequence of a 45,000-year-old modern human from western Siberia. Nature 514, 445–449 (2014)
  29. Groucutt, H. S. et al. Rethinking the dispersal of Homo sapiens out of Africa. Evol. Anthropol. 24, 149–164 (2015)
  30. Reyes-Centeno, H., Hubbe, M., Hanihara, T., Stringer, C. & Harvati, K. Testing modern human out-of-Africa dispersal models and implications for modern human origins. J. Hum. Evol. 87, 95–106 (2015)
  31. Rasmussen, M. et al. An Aboriginal Australian genome reveals separate human dispersals into Asia. Science 334, 94–98 (2011)
  32. Patterson, N. et al. Ancient admixture in human history. Genetics 192, 1065–1093 (2012)
  33. Liu, W. et al. The earliest unequivocally modern humans in southern China. Nature 526, 696–699 (2015)
  34. Fu, Q. et al. An early modern human from Romania with a recent Neanderthal ancestor. Nature 524, 216–219 (2015)
  35. Do, R. et al. No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans. Nat. Genet. 47, 126–131 (2015)
  36. Harris, K. Evidence for recent, population-specific evolution of the human mutation rate. Proc. Natl Acad. Sci. USA 112, 3439–3444 (2015)
  37. Ségurel, L., Wyman, M. J. & Przeworski, M. Determinants of mutation rate variation in the human germline. Annu. Rev. Genomics Hum. Genet. 15, 47–70 (2014)
  38. Klein, R. G. & Edgar, B. The dawn of human culture. (Wiley, 2002)
  39. Racimo, F. Testing for ancient selection using cross-population allele frequency differentiation. Genetics 202, 733–750 (2015)
  40. Turchin, M. C. et al. Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Nat. Genet. 44, 1015–1019 (2012)
  41. McBrearty, S. & Brooks, A. S. The revolution that wasn’t: a new interpretation of the origin of modern human behavior. J. Hum. Evol. 39, 453–563 (2000)
  42. Renfrew, C. Prehistory: the Making of the Human Mind. (Modern Library, 2009)
  43. Alexander, D. H. & Lange, K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12, 246 (2011)
  44. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015)
  45. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007)

Author information

  1. These authors contributed equally to this work.

    • Heng Li,
    • Mark Lipson &
    • Iain Mathieson
  2. Present addresses: Department of Computer Science, University of California at Los Angeles, California 90095, USA and Department of Human Genetics Science, University of California at Los Angeles, California 90095, USA (S.S.); Genome Foundation, Hyderabad 500076, India (L.S.).

    • Sriram Sankararaman &
    • Lalji Singh

Affiliations

  1. Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA

    • Swapan Mallick,
    • Mark Lipson,
    • Iain Mathieson,
    • Mengyao Zhao,
    • Niru Chennagiri,
    • Susanne Nordenfelt,
    • Arti Tandon,
    • Pontus Skoglund,
    • Iosif Lazaridis,
    • Sriram Sankararaman,
    • Qiaomei Fu,
    • Nadin Rohland &
    • David Reich
  2. Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA

    • Swapan Mallick,
    • Heng Li,
    • Melissa Gymrek,
    • Mengyao Zhao,
    • Niru Chennagiri,
    • Susanne Nordenfelt,
    • Arti Tandon,
    • Pontus Skoglund,
    • Iosif Lazaridis,
    • Sriram Sankararaman,
    • Qiaomei Fu,
    • Nadin Rohland,
    • Nick Patterson &
    • David Reich
  3. Howard Hughes Medical Institute, Harvard Medical School, Boston, Massachusetts 02115, USA

    • Swapan Mallick,
    • Mengyao Zhao,
    • Niru Chennagiri,
    • Susanne Nordenfelt &
    • David Reich
  4. Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA

    • Melissa Gymrek
  5. Harvard-MIT Division of Health Sciences and Technology, MIT, Cambridge, Massachusetts 02139, USA

    • Melissa Gymrek
  6. New York Genome Center, New York, New York 10013, USA

    • Melissa Gymrek,
    • Yaniv Erlich,
    • Thomas Willems &
    • William Klitz
  7. Department of Integrative Biology, University of California, Berkeley, California 94720-3140, USA

    • Fernando Racimo
  8. Key Laboratory of Vertebrate Evolution and Human Origins of Chinese Academy of Sciences, IVPP, CAS, Beijing 100044, China

    • Qiaomei Fu
  9. Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, D-04103 Leipzig, Germany

    • Gabriel Renaud,
    • Svante Pääbo &
    • Janet Kelso
  10. Department of Computer Science, Columbia University, New York, New York 10027, USA

    • Yaniv Erlich
  11. Center for Computational Biology and Bioinformatics, Columbia University, New York, New York 10032, USA

    • Yaniv Erlich
  12. Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA

    • Thomas Willems
  13. Laboratorios de Investigación y Desarrollo, Facultad de Ciencias y Filosofía, Universidad Peruana Cayetano Heredia, Lima 15102, Perú

    • Carla Gallo &
    • Giovanni Poletti
  14. Computational Biology Graduate Group, University of California, Berkeley, California 94720, USA

    • Jeffrey P. Spence
  15. Computer Science Division, University of California, Berkeley, California 94720, USA

    • Yun S. Song
  16. Department of Statistics, University of California, Berkeley, California 94720, USA

    • Yun S. Song
  17. Department of Mathematics and Department of Biology, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA

    • Yun S. Song
  18. Genetics Institute, University College London, Gower Street, London WC1E 6BT, UK

    • Francois Balloux
  19. Institute of Linguistics, University of Bern, Bern CH-3012, Switzerland

    • George van Driem
  20. Department of Human and Clinical Genetics, Postzone S5-P, Leiden University Medical Center, 2333 ZA Leiden, Netherlands

    • Peter de Knijff
  21. School of Biological Sciences, Nanyang Technological University, 637551 Singapore

    • Irene Gallego Romero
  22. Lee Kong Chian School of Medicine, Nanyang Technological University, 636921 Singapore

    • Irene Gallego Romero
  23. Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA

    • Aashish R. Jha,
    • Anna Di Rienzo &
    • Choongwon Jeong
  24. Estonian Biocentre, Evolutionary Biology group, Tartu 51010, Estonia

    • Doron M. Behar,
    • Hovhannes Sahakyan,
    • Ene Metspalu,
    • Jüri Parik,
    • Richard Villems,
    • Sergey Litvinov,
    • Toomas Kivisild &
    • Mait Metspalu
  25. Laboratorio de Genética Molecular Poblacional, Instituto Multidisciplinario de Biología Celular (IMBICE), CCT-CONICET La Plata/CIC Buenos Aires/Universidad Nacional de La Plata, La Plata B1906APO, Argentina

    • Claudio M. Bravi
  26. Department of Zoology, University of Oxford, Oxford OX1 3PS, UK

    • Cristian Capelli
  27. Department of Clinical Science, University of Bergen, Bergen 5021, Norway

    • Tor Hervig
  28. National Laboratory of Genomics for Biodiversity (LANGEBIO), CINVESTAV, Irapuato, Guanajuato 36821, Mexico

    • Andres Moreno-Estrada
  29. Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk 630090, Russia

    • Olga L. Posukh
  30. Novosibirsk State University, Novosibirsk 630090, Russia

    • Olga L. Posukh
  31. Research Centre for Medical Genetics, Moscow 115478, Russia

    • Elena Balanovska &
    • Oleg Balanovsky
  32. Vavilov Institute for General Genetics, Moscow 119991, Russia

    • Oleg Balanovsky
  33. Moscow Institute for Physics and Technology, Dolgoprudniy 141700, Russia

    • Oleg Balanovsky
  34. Department of Medical Genetics, National Human Genome Center, Medical University Sofia, Sofia 1431, Bulgaria

    • Sena Karachanak-Yankova &
    • Draga Toncheva
  35. Laboratory of Ethnogenomics, Institute of Molecular Biology, National Academy of Sciences of Armenia, Yerevan 0014, Armenia

    • Hovhannes Sahakyan &
    • Levon Yepiskoposyan
  36. The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK

    • Chris Tyler-Smith &
    • Yali Xue
  37. RIPAS Hospital, Bandar Seri Begawan, Brunei

    • M. Syafiq Abdullah
  38. Department of Genetics, Evolution and Environment, University College London WC1E 6BT, UK

    • Andres Ruiz-Linares
  39. Department of Anthropology, Case Western Reserve University, Cleveland, Ohio 44106-7125, USA

    • Cynthia M. Beall
  40. Laboratory of Human Molecular Genetics, Institute of Molecular and Cellular Biology, Siberian Branch of Russian Academy of Sciences, Novosibirsk 630090, Russia

    • Elena B. Starikovskaya,
    • Stanislav Dryomov &
    • Rem Sukernik
  41. Department of Evolutionary Biology, University of Tartu, Tartu 51010, Estonia

    • Ene Metspalu &
    • Richard Villems
  42. Estonian Academy of Sciences, Tallinn 10130, Estonia

    • Richard Villems
  43. Department of Ecology and Evolution, Stony Brook University, Stony Brook, New York 11794, USA

    • Brenna M. Henn
  44. NextBio, Illumina, Santa Clara, California 95050, USA

    • Ugur Hodoglugil
  45. Gladstone Institutes, San Francisco, California 94158, USA

    • Robert Mahley
  46. Department of Forensic Medicine, University of Helsinki, Helsinki 00014, Finland

    • Antti Sajantila
  47. Department of Medicine, Division of Medical Genetics, University of Washington, Seattle, Washington 98195, USA

    • George Stamatoyannopoulos
  48. National Cancer Centre Singapore, 169610 Singapore

    • Joseph T. S. Wee
  49. Institute of Biochemistry and Genetics, Ufa Research Centre, Russian Academy of Sciences, Ufa 450054, Russia

    • Rita Khusainova,
    • Elza Khusnutdinova &
    • Sergey Litvinov
  50. Department of Genetics and Fundamental Medicine, Bashkir State University, Ufa 450074, Russia

    • Rita Khusainova,
    • Elza Khusnutdinova &
    • Sergey Litvinov
  51. Jaramogi Oginga Odinga University of Science and Technology, Bondo 40601, Kenya

    • George Ayodo
  52. Institut de Biologia Evolutiva (CSIC-UPF), Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, Barcelona 08003, Spain

    • David Comas
  53. ARL Division of Biotechnology, University of Arizona, Tucson, Arizona 85721, USA

    • Michael F. Hammer
  54. Division of Biological Anthropology, University of Cambridge, Fitzwilliam Street, Cambridge CB2 1QH, UK

    • Toomas Kivisild
  55. Basic Research Laboratory, Center for Cancer Research, NCI, Leidos Biomedical Research, Inc., Frederick National Laboratory, Frederick, Maryland 21702, USA

    • Cheryl A. Winkler
  56. CHU Sainte-Justine, Pediatrics Departement, Université de Montréal, Québec H3T 1C5, Canada

    • Damian Labuda
  57. Department of Pediatrics, University of Washington, Seattle, Washington 98119, USA

    • Michael Bamshad
  58. Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, Utah 84112, USA

    • Lynn B. Jorde
  59. Departments of Genetics and Biology, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA

    • Sarah A. Tishkoff
  60. Department of Human Genetics, Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah 84112, USA

    • W. Scott Watkins
  61. Department of Paleolithic Archaeology, Institute of Archaeology and Ethnography, Siberian Branch of Russian Academy of Sciences, Novosibirsk 630090, Russia

    • Stanislav Dryomov
  62. Altai State University, Barnaul 656000, Russia

    • Rem Sukernik
  63. CSIR-Centre for Cellular and Molecular Biology, Hyderabad 500 007, India

    • Lalji Singh &
    • Kumarasamy Thangaraj

Contributions

S.M., Y.E., Y.S.S., S.P., J.K., N.P. and D.R. supervised the study. S.N., N.R., C.G., G.P., F.B., G.D., I.G.R., A.R.J., P.D., D.M.B., C.M.B., C.C., T.H., A.M.-E., O.L.P., E.B., O.B., S.K.-Y., H.S., D.T., L.Y., C.T.-S., Y.X., M.S.A., A.R.-L., C.B., A.D.R., C.J., E.B.S., E.M., J.P., R.V., B.M.H., U.H., R.W.M., A.S., G.S., J.T.S.W., R.K., E.K., S.L., G.A., D.C., M.H., T.K., W.K., C.A.W., D.L., M.B., L.B.J., S.A.T., W.S.W., M.M., S.D., R.S., L.S., K.T. and D.R. assembled samples. S.M., H.L., M.L., I.M., M.G., F.R., J.P.S., M.Z., N.C., A.T., P.S., I.L., S.S., Q.F., G.R., Y.S., N.P. and D.R. performed analyses. S.M., H.L., M.L., I.M., M.G., F.R., M.Z., N.P. and D.R. wrote the manuscript with help from all co-authors.

Competing financial interests

U.H. is employed by NextBio, a division of Illumina Ltd.

Corresponding authors

Correspondence to:

Raw data for 279 genomes for which the informed consent documentation is consistent with fully public data release are available through the EBI European Nucleotide Archive under accession numbers PRJEB9586 and ERP010710. For the remaining 21 genomes (designated by code ‘Y’ in the seventh column of Supplementary Data Table 1), data are deposited at the European Genome-phenome Archive (EGA), which is hosted by the EBI and the CRG, under accession number EGAS00001001959. Data for these 21 genomes can be obtained by submitting to the EGA Data Access Committee a signed letter containing the following text: “(a) I will not distribute the data outside my collaboration; (b) I will not post the data publicly; (c) I will make no attempt to connect the genetic data to personal identifiers for the samples; and (d) I will not use the data for any commercial purposes.” Compact versions of the SGDP dataset and software for accessing it are available at (http://genetics.med.harvard.edu/reichlab/Reich_Lab/Datasets.html). The short tandem repeat (STR) genotypes are available through dbVar under accession number nstd128 (http://www.ncbi.nlm.nih.gov/dbvar).

Reviewer Information Nature thanks P. Bellwood and S. Ramachandran and the other anonymous reviewer(s) for their contribution to the peer review of this work.

Supplementary information

PDF files

  1. Supplementary Information (8.4 MB)

    This file contains Supplementary Text and Data, Supplementary Tables, Supplementary Figures and additional references (see Contents for details).

Excel files

  1. Supplementary Table 1 (124 KB)

    This file shows the data for each sample studied.

  2. Supplementary Table 2 (71 KB)

    This table shows the top hits for the 3P-CLR run.
