Abstract
Here we report the Simons Genome Diversity Project data set: high quality genomes from 300 individuals from 142 diverse populations. These genomes include at least 5.8 million base pairs that are not present in the human reference genome. Our analysis reveals key features of the landscape of human genome variation, including that the rate of accumulation of mutations has accelerated by about 5% in non-Africans compared to Africans since divergence. We show that the ancestors of some pairs of present-day human populations were substantially separated by 100,000 years ago, well before the archaeologically attested onset of behavioural modernity. We also demonstrate that indigenous Australians, New Guineans and Andamanese do not derive substantial ancestry from an early dispersal of modern humans; instead, their modern human ancestry is consistent with coming from the same source as that of other non-Africans.
Accession codes
Data deposits
Raw data for 279 genomes for which the informed consent documentation is consistent with fully public data release are available through the EBI European Nucleotide Archive under accession numbers PRJEB9586 and ERP010710. For the remaining 21 genomes (designated by code ‘Y’ in the seventh column of Supplementary Data Table 1), data are deposited at the European Genome-phenome Archive (EGA), which is hosted by the EBI and the CRG, under accession number EGAS00001001959. Data for these 21 genomes can be obtained by submitting to the EGA Data Access Committee a signed letter containing the following text: “(a) I will not distribute the data outside my collaboration; (b) I will not post the data publicly; (c) I will make no attempt to connect the genetic data to personal identifiers for the samples; and (d) I will not use the data for any commercial purposes.” Compact versions of the SGDP dataset and software for accessing it are available at (http://genetics.med.harvard.edu/reichlab/Reich_Lab/Datasets.html). The short tandem repeat (STR) genotypes are available through dbVar under accession number nstd128 (http://www.ncbi.nlm.nih.gov/dbvar).
References
Abecasis, G. R. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012)
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010)
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010)
Li, H. FermiKit: assembly-based variant calling for Illumina resequencing data. Preprint at http://arxiv.org/abs/1504.06574 (2015)
Sudmant, P. H. et al. Global diversity, population stratification, and selection of human copy-number variation. Science 349, aab3761 (2015)
Gymrek, M. & Erlich, Y. Profiling short tandem repeats from short reads. Methods Mol. Biol. 1038, 113–135 (2013)
Gymrek, M., Golan, D., Rosset, S. & Erlich, Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. 22, 1154–1162 (2012)
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009)
Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006)
Keinan, A., Mullikin, J. C., Patterson, N. & Reich, D. Accelerated genetic drift on chromosome X during the human dispersal out of Africa. Nat. Genet. 41, 66–70 (2009)
Keinan, A. & Reich, D. Can a sex-biased human demography account for the reduced effective population size of chromosome X in non-Africans? Mol. Biol. Evol. 27, 2312–2321 (2010)
Verdu, P. et al. Sociocultural behavior, sex-biased admixture, and effective population sizes in Central African Pygmies and non-Pygmies. Mol. Biol. Evol. 30, 918–937 (2013)
Joiris, D. V. The framework of central African hunter-gatherers and neighbouring societies. African Study Monographs Suppl. 28, 57–79 (2003)
Green, R. E. et al. A draft sequence of the Neandertal genome. Science 328, 710–722 (2010)
Meyer, M. et al. A high-coverage genome sequence from an archaic Denisovan individual. Science 338, 222–226 (2012)
Wall, J. D. et al. Higher levels of neanderthal ancestry in East Asians than in Europeans. Genetics 194, 199–209 (2013)
Reich, D. et al. Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468, 1053–1060 (2010)
Prüfer, K. et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43–49 (2014)
Skoglund, P. & Jakobsson, M. Archaic human ancestry in East Asia. Proc. Natl Acad. Sci. USA 108, 18301–18306 (2011)
Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011)
Schiffels, S. & Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46, 919–925 (2014)
Gronau, I., Hubisz, M. J., Gulko, B., Danko, C. G. & Siepel, A. Bayesian inference of ancient human demography from individual genome sequences. Nat. Genet. 43, 1031–1034 (2011)
Schlebusch, C. M. et al. Genomic variation in seven Khoe-San groups reveals adaptation and complex African history. Science 338, 374–379 (2012)
Veeramah, K. R. et al. An early divergence of KhoeSan ancestors from those of other modern humans is supported by an ABC-based analysis of autosomal resequencing data. Mol. Biol. Evol. 29, 617–630 (2012)
Labuda, D., Zietkiewicz, E. & Yotova, V. Archaic lineages in the history of modern humans. Genetics 156, 799–808 (2000)
Pickrell, J. K. et al. The genetic prehistory of southern Africa. Nat. Commun. 3, 1143 (2012)
Patin, E. et al. Inferring the demographic history of African farmers and pygmy hunter-gatherers using a multilocus resequencing data set. PLoS Genet. 5, e1000448 (2009)
Fu, Q. et al. Genome sequence of a 45,000-year-old modern human from western Siberia. Nature 514, 445–449 (2014)
Groucutt, H. S. et al. Rethinking the dispersal of Homo sapiens out of Africa. Evol. Anthropol. 24, 149–164 (2015)
Reyes-Centeno, H., Hubbe, M., Hanihara, T., Stringer, C. & Harvati, K. Testing modern human out-of-Africa dispersal models and implications for modern human origins. J. Hum. Evol. 87, 95–106 (2015)
Rasmussen, M. et al. An Aboriginal Australian genome reveals separate human dispersals into Asia. Science 334, 94–98 (2011)
Patterson, N. et al. Ancient admixture in human history. Genetics 192, 1065–1093 (2012)
Liu, W. et al. The earliest unequivocally modern humans in southern China. Nature 526, 696–699 (2015)
Fu, Q. et al. An early modern human from Romania with a recent Neanderthal ancestor. Nature 524, 216–219 (2015)
Do, R. et al. No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans. Nat. Genet. 47, 126–131 (2015)
Harris, K. Evidence for recent, population-specific evolution of the human mutation rate. Proc. Natl Acad. Sci. USA 112, 3439–3444 (2015)
Ségurel, L., Wyman, M. J. & Przeworski, M. Determinants of mutation rate variation in the human germline. Annu. Rev. Genomics Hum. Genet. 15, 47–70 (2014)
Klein, R. G. & Edgar, B. The dawn of human culture. (Wiley, 2002)
Racimo, F. Testing for ancient selection using cross-population allele frequency differentiation. Genetics 202, 733–750 (2015)
Turchin, M. C. et al. Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Nat. Genet. 44, 1015–1019 (2012)
McBrearty, S. & Brooks, A. S. The revolution that wasn’t: a new interpretation of the origin of modern human behavior. J. Hum. Evol. 39, 453–563 (2000)
Renfrew, C. Prehistory: the Making of the Human Mind. (Modern Library, 2009)
Alexander, D. H. & Lange, K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12, 246 (2011)
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015)
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007)
Acknowledgements
We thank the volunteers who donated samples. We thank H. Blanche, N. Boivin, H. Cann (deceased), E. Eichler, H. Greely, M. Petraglia, K. Prüfer, A. Rogers, M. Steinrücken, U. Stenzel and P. Sudmant for comments, critiques, discussions, or advice on assembling samples. We thank S. Fan for uploading 21 genomes to the European Genome-phenome Archive. The sequencing was funded by the Simons Foundation (SFARI 280376) and the US National Science Foundation (BCS-1032255). I.M. was supported by a Long Term Fellowship grant LT001095/2014 from the Human Frontier Science Program. P.S. was supported by the Wenner-Gren Foundation and the Swedish Research Council (VR grant 2014-453). T.W. and M.G. were supported by NIJ grant 2014-DN-BX-K089. Y.E. was supported by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund and by NIJ grant 2014-DN-BX-K089. D.L. was supported by the Natural Sciences and Engineering Research Council of Canada. T.K. was supported by ERC Starting Investigator grant FP7 - 261213. R.S. received support from the Russian Foundation for Basic Research (#15-04-02543). S.D. received support from the Russian Foundation for Basic Research (#16-34-00599). R.K., E.K. and S.L. were supported by the Russian Foundation for Basic Research (11-04-00725-a). E.B. was supported by the Russian Foundation for Basic Research (16-06-00303). O.B. was supported by the Russian Scientific Fund (14-04-00827) and by the Russian Foundation for Basic Research (16-04-00890). D.M.B., H.S., E.M., R.V. and M.M. were supported by Institutional Research Funding from the Estonian Research Council IUT24-1 and by the European Regional Development Fund (European Union) through the Centre of Excellence in Genomics to Estonian Biocentre and University of Tartu. D.C. was supported by the Spanish MINECO grant CGL-44351-P. L.B.J. and W.S.W. were supported by NIH grant GM59290. S.A.T. was supported by NIH grants 5DP1ES022577 05, 1R01DK104339-01, and 1R01GM113657-01. C.T.-S. and Y.X. were supported by The Wellcome Trust grant 098051. C.M.B. was supported by NSF grants 0924726 and 1153911. K.T. was supported by CSIR Network Project grant (GENESIS: BSC0121). J.P.S. and Y.S.S. were supported in part by NIH grant R01-GM094402 and a Packard Fellowship for Science and Engineering. G.R., J.K. and S.P. were funded by the Max Planck Society. N.P. and D.R. were supported by NIH grant GM100233 and D.R. is a Howard Hughes Medical Institute investigator.
Author information
Contributions
S.M., Y.E., Y.S.S., S.P., J.K., N.P. and D.R. supervised the study. S.N., N.R., C.G., G.P., F.B., G.D., I.G.R., A.R.J., P.D., D.M.B., C.M.B., C.C., T.H., A.M.-E., O.L.P., E.B., O.B., S.K.-Y., H.S., D.T., L.Y., C.T.-S., Y.X., M.S.A., A.R.-L., C.B., A.D.R., C.J., E.B.S., E.M., J.P., R.V., B.M.H., U.H., R.W.M., A.S., G.S., J.T.S.W., R.K., E.K., S.L., G.A., D.C., M.H., T.K., W.K., C.A.W., D.L., M.B., L.B.J., S.A.T., W.S.W., M.M., S.D., R.S., L.S., K.T. and D.R. assembled samples. S.M., H.L., M.L., I.M., M.G., F.R., J.P.S., M.Z., N.C., A.T., P.S., I.L., S.S., Q.F., G.R., Y.S., N.P. and D.R. performed analyses. S.M., H.L., M.L., I.M., M.G., F.R., M.Z., N.P. and D.R. wrote the manuscript with help from all co-authors.
Ethics declarations
Competing interests
U.H. is employed by NextBio, a division of Illumina Ltd.
Additional information
Reviewer Information: Nature thanks P. Bellwood and S. Ramachandran and the other anonymous reviewer(s) for their contribution to the peer review of this work.
Extended data figures and tables
Extended Data Figure 1 Heat map of fraction of heterozygous sites missed in the 1000 Genomes project.
For each sample, we examine all heterozygous sites passing filter level 1, and compute the fraction included as known polymorphisms in the 1000 Genomes project.
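The per-sample quantity in this legend can be sketched as a simple set comparison. The following is a minimal illustration, not the authors' pipeline: the function name `fraction_missed`, the representation of sites as (chromosome, position) tuples, and the toy data are all assumptions made for the example.

```python
def fraction_missed(het_sites, known_sites):
    """Fraction of a sample's heterozygous sites that are absent from a catalogue.

    het_sites: iterable of (chromosome, position) tuples passing the filter.
    known_sites: iterable of (chromosome, position) tuples in the catalogue
                 (here standing in for known 1000 Genomes polymorphisms).
    """
    het = set(het_sites)
    if not het:
        raise ValueError("sample has no heterozygous sites passing the filter")
    included = het & set(known_sites)
    # "Missed" is the complement of the fraction included in the catalogue.
    return 1.0 - len(included) / len(het)


# Hypothetical toy data, for illustration only.
sample_hets = [("1", 100), ("1", 250), ("2", 40), ("X", 7)]
catalogue = {("1", 100), ("2", 40), ("3", 9)}
missed = fraction_missed(sample_hets, catalogue)
```

In this toy case two of the four heterozygous sites are in the catalogue, so half are counted as missed.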
Extended Data Figure 2 Worldwide variation in human short tandem repeats.
a, Mean STR length is reported as the average of the length difference (in base pairs) from the GRCh37 reference for each genotype. Bubble area scales with the number of calls compared at each point. b, c, The first two principal components after performing principal component analysis on tetranucleotide and homopolymer genotypes, respectively. Colours represent the region of origin of each sample. d, Pairwise FST values between populations computed using only SNPs versus using combined SNP + STR loci. e, Block jackknife standard errors for the SNP versus SNP + STR FST analysis. The red dashed lines give the best-fit line, described by the formula in red. The black dashed line denotes the diagonal.
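As a rough illustration of the quantity plotted in panel a, the sketch below averages STR length differences from the reference over called genotypes. It assumes each genotype is stored as a pair of allele offsets in base pairs relative to the reference and that missing calls are `None`; the function name and the data layout are hypothetical and may differ from how the authors summarized their calls.

```python
from statistics import mean


def mean_str_length_offset(genotypes):
    """Average length difference from the reference, in base pairs.

    genotypes: list of (allele1_offset, allele2_offset) tuples, where each
    offset is allele length minus reference length; None marks a missing call.
    Each called genotype contributes the mean of its two allele offsets.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    return mean((a + b) / 2 for a, b in called)


# Hypothetical calls for one sample at four tetranucleotide loci.
calls = [(0, 4), (-2, -2), None, (8, 0)]
offset = mean_str_length_offset(calls)
```

Here the three called genotypes contribute 2, -2 and 4 bp, so the sample's mean offset is 4/3 bp.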
Extended Data Figure 3 ADMIXTURE analysis.
We carried out unsupervised ADMIXTURE 1.23 (refs 8, 43) analysis over the 300 SGDP individuals in 20 replicates with randomly chosen initial seeds, varying the number of ancestral populations between K = 2 and K = 12 and using default fivefold cross-validation (--cv flag). We used genotypes of at least filter level 1, and restricted analysis to sites where at least two individuals carried the variant allele (as singleton variants are non-informative for population clustering). After further restricting to sites with at least 99% completeness and performing linkage-disequilibrium-based pruning in PLINK 1.9 (refs 44, 45) with parameters (--indep-pairwise 1000 100 0.2), a total of 482,515 single nucleotide polymorphisms remained. This figure shows the highest likelihood replicate for each value of K. We found that log likelihood monotonically increases with K, while the value K = 5 minimizes cross-validation error (not shown). The solution at K = 5 corresponds to major continental groups (Sub-Saharan Africans, Oceanians, East Asians, Native Americans, and West Eurasians), but we show the full range of K here as it illustrates finer-scale population structure that may be useful to users of the data.
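The two site-level filters and the choice of K described in this legend can be sketched in a few lines. This is an illustrative sketch only: the functions `keep_site` and `best_k`, the genotype encoding (variant allele counts 0, 1, 2, with `None` for a missing call) and the cross-validation numbers are assumptions for the example, and linkage-disequilibrium pruning (done in PLINK in the original analysis) is not reproduced.

```python
def keep_site(genotypes, min_carriers=2, min_completeness=0.99):
    """Apply the carrier-count and completeness filters to one site.

    genotypes: per-individual variant allele counts (0, 1 or 2), None if missing.
    A site is kept when at least `min_carriers` individuals carry the variant
    allele and at least `min_completeness` of individuals have a call.
    """
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    completeness = len(called) / len(genotypes)
    carriers = sum(1 for g in called if g > 0)
    return carriers >= min_carriers and completeness >= min_completeness


def best_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


# Hypothetical toy genotypes for four individuals at three sites.
sites = {
    "site_a": [0, 1, 1, 2],     # three carriers, fully called
    "site_b": [0, 0, 1, 0],     # singleton: one carrier only
    "site_c": [1, None, 2, 0],  # two carriers, but only 75% complete
}
kept = [name for name, g in sites.items() if keep_site(g)]

# Hypothetical cross-validation errors keyed by K (not values from the paper).
chosen_k = best_k({2: 0.61, 5: 0.52, 8: 0.55})
```

With the default thresholds only `site_a` survives; the singleton fails the carrier filter and `site_c` fails the completeness filter.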
Extended Data Figure 4 Principal component analysis and neighbour joining tree.
a, Principal component analysis. b, Neighbour-joining tree based on FST values for all populations with at least two samples.
Extended Data Figure 5 Fewer accumulated mutations in Africans than in non-Africans confirmed by mapping to chimpanzee.
We compute a statistic D (Population A, Population B, Chimp), measuring the difference in the rate of matching to chimpanzee in Population A compared to Population B. The evidence of mismatching to chimpanzee is seen when we restrict to the male X chromosome, to eliminate possible effects due to differences in heterozygosity across populations, and map to the chimpanzee genome, which is phylogenetically symmetrically related to all present-day humans. We find that in 78 randomly chosen Population A = African and Population B = non-African pairs of males, transversion substitutions show no consistent skew from zero, but transition substitutions do.
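One way to picture a statistic of this kind is as a normalized difference in chimpanzee-matching counts at sites where two haploid sequences differ. The sketch below assumes that form; the exact definition used by the authors is given in their supplementary material and may differ. The function name, the sign convention (positive when A matches chimpanzee more often than B) and the toy sequences are all assumptions.

```python
TRANSITIONS = {frozenset("AG"), frozenset("CT")}
VALID_BASES = set("ACGT")


def d_statistic(seq_a, seq_b, seq_chimp, kind="all"):
    """Normalized difference in chimpanzee matching between two haploid sequences.

    Only sites where A and B differ and chimpanzee matches exactly one of them
    are informative. kind selects "all", "transition" or "transversion" sites.
    """
    if kind not in ("all", "transition", "transversion"):
        raise ValueError("kind must be 'all', 'transition' or 'transversion'")
    a_matches = b_matches = 0
    for a, b, c in zip(seq_a, seq_b, seq_chimp):
        if not {a, b, c} <= VALID_BASES:
            continue  # skip missing or ambiguous bases
        if a == b or c not in (a, b):
            continue  # uninformative site
        is_transition = frozenset((a, b)) in TRANSITIONS
        if kind == "transition" and not is_transition:
            continue
        if kind == "transversion" and is_transition:
            continue
        if a == c:
            a_matches += 1
        else:
            b_matches += 1
    total = a_matches + b_matches
    if total == 0:
        raise ValueError("no informative sites")
    return (a_matches - b_matches) / total


# Hypothetical aligned toy sequences (male X chromosomes are haploid).
pop_a = "ACGTACGTA"
pop_b = "ACATATGTC"
chimp = "ACATACGTC"
d_all = d_statistic(pop_a, pop_b, chimp)
d_ts = d_statistic(pop_a, pop_b, chimp, kind="transition")
d_tv = d_statistic(pop_a, pop_b, chimp, kind="transversion")
```

Under this convention a value near zero indicates that both sequences match chimpanzee equally often, and splitting by substitution class mirrors the transition versus transversion comparison in the legend. A real analysis would also need a standard error, for example from a block jackknife, which is omitted here.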
Extended Data Figure 6 3P-CLR scan for positive selection.
The red line denotes the 99.9% quantile cut-off. The genes in the top five regions are labelled. a, Scan for selection on the San terminal branch. b, Scan for selection on the non-San terminal branch. c, Scan for selection on the ancestral modern human branch.
Extended Data Figure 7 Scan for genomic locations where the great majority of present-day humans share a recent common ancestor.
We carried out PSMC analysis on 40 pairs of haploid genomes chosen to sample some of the most deeply divergent present-day human lineages. We recorded the time since the most recent common ancestor (TMRCA) at each position, and rescaled to obtain an estimate of absolute time (Supplementary Information section 12). a, Distribution across the genome of the fraction of TMRCAs below specified date cut-offs. For the 100 kya cut-off, the maximum fraction observed anywhere in the genome is 68%. b, Distribution across the genome of the date T at which specified fractions of sample pairs are inferred to have a TMRCA less than T. c, Percentile points of the cumulative distribution function of B.
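The summaries in panels a and b can be illustrated for a single genomic position, given the TMRCA inferred for each sample pair at that position. This is a minimal sketch under stated assumptions: the function names are hypothetical, the TMRCA values are invented, and the quantile rule (the smallest observed value reached by at least the requested fraction of pairs) is one of several reasonable conventions.

```python
import math


def fraction_below(tmrcas, cutoff):
    """Fraction of sample pairs whose TMRCA at this position is below `cutoff`."""
    if not tmrcas:
        raise ValueError("no TMRCA estimates at this position")
    return sum(1 for t in tmrcas if t < cutoff) / len(tmrcas)


def date_for_fraction(tmrcas, fraction):
    """Smallest observed date T such that at least `fraction` of pairs have TMRCA <= T."""
    if not tmrcas:
        raise ValueError("no TMRCA estimates at this position")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(tmrcas)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


# Hypothetical TMRCAs (thousands of years ago) for five pairs at one position.
position_tmrcas = [80, 120, 95, 400, 60]
frac_100kya = fraction_below(position_tmrcas, cutoff=100)
median_date = date_for_fraction(position_tmrcas, fraction=0.5)
```

Applying `fraction_below` at every position and taking the maximum would give the kind of genome-wide summary quoted for the 100 kya cut-off, although the real analysis works on PSMC output rather than a plain list.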
Supplementary information
Supplementary Information
This file contains Supplementary Text and Data, Supplementary Tables, Supplementary Figures and additional references (see Contents for details). (PDF 8661 kb)
Supplementary Table 1
This file contains the data for each sample studied. (XLSX 124 kb)
Supplementary Table 2
This table shows the top hits for the 3P-CLR run. (XLSX 71 kb)
About this article
Cite this article
Mallick, S., Li, H., Lipson, M. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016). https://doi.org/10.1038/nature18964