Here we report the Simons Genome Diversity Project data set: high quality genomes from 300 individuals from 142 diverse populations. These genomes include at least 5.8 million base pairs that are not present in the human reference genome. Our analysis reveals key features of the landscape of human genome variation, including that the rate of accumulation of mutations has accelerated by about 5% in non-Africans compared to Africans since divergence. We show that the ancestors of some pairs of present-day human populations were substantially separated by 100,000 years ago, well before the archaeologically attested onset of behavioural modernity. We also demonstrate that indigenous Australians, New Guineans and Andamanese do not derive substantial ancestry from an early dispersal of modern humans; instead, their modern human ancestry is consistent with coming from the same source as that of other non-Africans.
- An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012)
- Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010)
- The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010)
- FermiKit: assembly-based variant calling for Illumina resequencing data. Preprint at http://arxiv.org/abs/1504.06574 (2015)
- Global diversity, population stratification, and selection of human copy-number variation. Science 349, aab3761 (2015)
- Profiling short tandem repeats from short reads. Methods Mol. Biol. 1038, 113–135 (2013)
- lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. 22, 1154–1162 (2012)
- Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009)
- Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006)
- Accelerated genetic drift on chromosome X during the human dispersal out of Africa. Nat. Genet. 41, 66–70 (2009)
- Can a sex-biased human demography account for the reduced effective population size of chromosome X in non-Africans? Mol. Biol. Evol. 27, 2312–2321 (2010)
- Sociocultural behavior, sex-biased admixture, and effective population sizes in Central African Pygmies and non-Pygmies. Mol. Biol. Evol. 30, 918–937 (2013)
- The framework of central African hunter-gatherers and neighbouring societies. African Study Monographs Suppl. 28, 57–79 (2003)
- A draft sequence of the Neandertal genome. Science 328, 710–722 (2010)
- A high-coverage genome sequence from an archaic Denisovan individual. Science 338, 222–226 (2012)
- Higher levels of Neanderthal ancestry in East Asians than in Europeans. Genetics 194, 199–209 (2013)
- Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468, 1053–1060 (2010)
- The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43–49 (2014)
- Archaic human ancestry in East Asia. Proc. Natl Acad. Sci. USA 108, 18301–18306 (2011)
- Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011)
- Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46, 919–925 (2014)
- Bayesian inference of ancient human demography from individual genome sequences. Nat. Genet. 43, 1031–1034 (2011)
- Genomic variation in seven Khoe-San groups reveals adaptation and complex African history. Science 338, 374–379 (2012)
- An early divergence of KhoeSan ancestors from those of other modern humans is supported by an ABC-based analysis of autosomal resequencing data. Mol. Biol. Evol. 29, 617–630 (2012)
- Archaic lineages in the history of modern humans. Genetics 156, 799–808 (2000)
- The genetic prehistory of southern Africa. Nat. Commun. 3, 1143 (2012)
- Inferring the demographic history of African farmers and pygmy hunter-gatherers using a multilocus resequencing data set. PLoS Genet. 5, e1000448 (2009)
- Genome sequence of a 45,000-year-old modern human from western Siberia. Nature 514, 445–449 (2014)
- Rethinking the dispersal of Homo sapiens out of Africa. Evol. Anthropol. 24, 149–164 (2015)
- Testing modern human out-of-Africa dispersal models and implications for modern human origins. J. Hum. Evol. 87, 95–106 (2015)
- An Aboriginal Australian genome reveals separate human dispersals into Asia. Science 334, 94–98 (2011)
- Ancient admixture in human history. Genetics 192, 1065–1093 (2012)
- The earliest unequivocally modern humans in southern China. Nature 526, 696–699 (2015)
- An early modern human from Romania with a recent Neanderthal ancestor. Nature 524, 216–219 (2015)
- No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans. Nat. Genet. 47, 126–131 (2015)
- Evidence for recent, population-specific evolution of the human mutation rate. Proc. Natl Acad. Sci. USA 112, 3439–3444 (2015)
- Determinants of mutation rate variation in the human germline. Annu. Rev. Genomics Hum. Genet. 15, 47–70 (2014)
- The dawn of human culture. (Wiley, 2002)
- Testing for ancient selection using cross-population allele frequency differentiation. Genetics 202, 733–750 (2015)
- Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Nat. Genet. 44, 1015–1019 (2012)
- The revolution that wasn’t: a new interpretation of the origin of modern human behavior. J. Hum. Evol. 39, 453–563 (2000)
- Prehistory: the Making of the Human Mind. (Modern Library, 2009)
- Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12, 246 (2011)
- Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015)
- PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007)
Extended data figures and tables
Extended Data Figures
- Extended Data Figure 1: Heat map of fraction of heterozygous sites missed in the 1000 Genomes project.
For each sample, we examine all heterozygous sites passing filter level 1, and compute the fraction included as known polymorphisms in the 1000 Genomes project.
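The per-sample quantity described in this legend can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `(chromosome, position)` site keys, the `catalogue` set standing in for the 1000 Genomes polymorphisms, and the function name `fraction_known` are all hypothetical. The fraction "missed" in the figure title is taken here to be one minus the fraction found in the catalogue.

```python
def fraction_known(het_sites, known_sites):
    """Fraction of a sample's heterozygous sites that appear in a catalogue of known polymorphisms."""
    het_sites = set(het_sites)
    if not het_sites:
        raise ValueError("sample has no heterozygous sites")
    return len(het_sites & set(known_sites)) / len(het_sites)


# Hypothetical heterozygous sites for one sample, keyed by (chromosome, position).
sample_hets = [("1", 1000), ("1", 2500), ("2", 300), ("X", 42)]
# Hypothetical catalogue of previously reported polymorphic sites.
catalogue = {("1", 1000), ("2", 300), ("2", 999)}

known = fraction_known(sample_hets, catalogue)  # share already catalogued
missed = 1.0 - known                            # share absent from the catalogue
print(f"known={known:.2f} missed={missed:.2f}")
```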
- Extended Data Figure 2: Worldwide variation in human short tandem repeats.
a, Mean STR length is reported as the average of the length difference (in base pairs) from the GRCh37 reference for each genotype. Bubble area scales with the number of calls compared at each point. b, c, The first two principal components after performing principal component analysis on tetranucleotide and homopolymer genotypes, respectively. Colours represent the region of origin of each sample. d, Pairwise FST values between populations computed using only SNPs versus using combined SNP + STR loci. e, Block jackknife standard errors for the SNP versus SNP + STR FST analysis. The red dashed lines give the best-fit line, described by the formula in red. The black dashed line denotes the diagonal.
- Extended Data Figure 3: ADMIXTURE analysis.
We carried out unsupervised ADMIXTURE 1.23 (refs 8, 43) analysis over the 300 SGDP individuals in 20 replicates with randomly chosen initial seeds, varying the number of ancestral populations between K = 2 and K = 12 and using default fivefold cross-validation (--cv flag). We used genotypes of at least filter level 1, and restricted analysis to sites where at least two individuals carried the variant allele (as singleton variants are non-informative for population clustering). After further restricting to sites with at least 99% completeness and performing linkage-disequilibrium-based pruning in PLINK 1.9 (refs 44, 45) with parameters (--indep-pairwise 1000 100 0.2), a total of 482,515 single nucleotide polymorphisms remained. This figure shows the highest-likelihood replicate for each value of K. We found that log likelihood increases monotonically with K, while the value K = 5 minimizes cross-validation error (not shown). The solution at K = 5 corresponds to major continental groups (Sub-Saharan Africans, Oceanians, East Asians, Native Americans, and West Eurasians), but we show the full range of K here because the other values illustrate finer-scale population structure that may be useful to users of the data.
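The two per-site filters described in this legend (at least two carriers of the variant allele, and at least 99% completeness) can be sketched as follows. This is only an illustration under assumptions: genotypes are encoded as variant-allele counts (0, 1, 2) with `None` for a missing call, and the function name `keep_site` is hypothetical. The linkage-disequilibrium pruning step, which the legend attributes to PLINK, is not reproduced here.

```python
def keep_site(genotypes, min_carriers=2, min_completeness=0.99):
    """Decide whether a site passes the carrier-count and completeness filters.

    genotypes: one entry per individual, the variant-allele count (0, 1 or 2),
    or None when the genotype is missing (an assumed encoding).
    """
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    completeness = len(called) / len(genotypes)
    # Count individuals carrying at least one copy of the variant allele.
    carriers = sum(1 for g in called if g > 0)
    return completeness >= min_completeness and carriers >= min_carriers


# Hypothetical sites across 300 individuals.
shared_variant = [1, 2] + [0] * 298             # two carriers, fully called
singleton = [1] + [0] * 299                     # only one carrier
sparse = [1, 1] + [None] * 10 + [0] * 288       # 290/300 called, below 99%

for name, site in [("shared", shared_variant), ("singleton", singleton), ("sparse", sparse)]:
    print(name, keep_site(site))
```

Sites that pass both filters would then go on to LD pruning before clustering.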
- Extended Data Figure 4: Principal component analysis and neighbour joining tree.
a, Principal component analysis. b, Neighbour-joining tree based on FST values for all populations with at least two samples.
- Extended Data Figure 5: Fewer accumulated mutations in Africans than in non-Africans confirmed by mapping to chimpanzee.
We compute a statistic D (Population A, Population B, Chimp), measuring the difference in the rate of matching to chimpanzee in Population A compared to Population B. The evidence of mismatching to chimpanzee is seen when we restrict to the male X chromosome (to eliminate possible effects of differences in heterozygosity across populations) and map to the chimpanzee genome, which is phylogenetically symmetrically related to all present-day humans. We find that in 78 randomly chosen pairs of males (Population A = African, Population B = non-African), transversion substitutions show no consistent skew from zero, but transition substitutions do.
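One way to picture a statistic of this kind is sketched below. The legend does not give the formula, so the normalised difference used here, (sites where only A matches chimpanzee minus sites where only B matches) divided by their sum, is an assumption, as are the function name `d_statistic` and the toy aligned haploid sequences (haploid because the legend restricts to the male X chromosome). Under this convention a value near zero means both genomes have diverged from chimpanzee at equal rates.

```python
def d_statistic(seq_a, seq_b, seq_chimp):
    """Normalised difference in chimpanzee-matching between two aligned haploid sequences.

    Assumed form: (nA - nB) / (nA + nB), where nA counts sites at which only A
    matches chimpanzee and nB counts sites at which only B matches.
    """
    if not (len(seq_a) == len(seq_b) == len(seq_chimp)):
        raise ValueError("sequences must be aligned to the same length")
    a_only = b_only = 0
    for a, b, c in zip(seq_a, seq_b, seq_chimp):
        if a == b:
            continue  # uninformative: A and B agree at this site
        if a == c:
            a_only += 1
        elif b == c:
            b_only += 1
    informative = a_only + b_only
    return 0.0 if informative == 0 else (a_only - b_only) / informative


# Toy alignments (hypothetical data).
balanced = d_statistic("ACGTACGT", "ACGAACTT", "ACGTACTT")  # one site each way
skewed = d_statistic("ACGTACGT", "ACGAACTT", "ACGTACGT")    # A matches at both sites
print(balanced, skewed)
```

Splitting the informative sites into transitions and transversions before counting would give the two substitution classes compared in the legend.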
- Extended Data Figure 6: 3P-CLR scan for positive selection.
The red line denotes the 99.9% quantile cut-off. The genes in the top five regions are labelled. a, Scan for selection on the San terminal branch. b, Scan for selection on the non-San terminal branch. c, Scan for selection on the ancestral modern human branch.
- Extended Data Figure 7: Scan for genomic locations where the great majority of present-day humans share a recent common ancestor.
We carried out PSMC analysis on 40 pairs of haploid genomes chosen to sample some of the most deeply divergent present-day human lineages. We recorded the time since the most recent common ancestor (TMRCA) at each position, and rescaled to obtain an estimate of absolute time (Supplementary Information section 12). a, Distribution across the genome of the fraction of TMRCAs below specified date cut-offs. For the 100 kya cut-off, the maximum fraction observed anywhere in the genome is 68%. b, Distribution across the genome of the date T at which specified fractions of sample pairs are inferred to have a TMRCA less than T. c, Percentile points of the cumulative distribution function of b.
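The summary in panel a can be illustrated with a small sketch: for each genomic position, take the fraction of genome pairs whose TMRCA falls below a cut-off, then look at how that fraction is distributed across positions. The data layout (one list of per-position TMRCAs, in years, for each pair), the toy values, and the function name `fraction_below` are assumptions for illustration. The PSMC inference and the rescaling to absolute time are not reproduced.

```python
def fraction_below(tmrca_by_pair, cutoff):
    """Per-position fraction of genome pairs with TMRCA below the cut-off.

    tmrca_by_pair: list with one entry per pair, each a list of TMRCA values
    (in years) indexed by genomic position (an assumed layout).
    """
    n_pairs = len(tmrca_by_pair)
    n_positions = len(tmrca_by_pair[0])
    if any(len(pair) != n_positions for pair in tmrca_by_pair):
        raise ValueError("every pair needs a TMRCA at every position")
    return [
        sum(1 for pair in tmrca_by_pair if pair[i] < cutoff) / n_pairs
        for i in range(n_positions)
    ]


# Toy example: 4 pairs, 3 positions (hypothetical TMRCAs in years).
pairs = [
    [50_000, 200_000, 90_000],
    [80_000, 150_000, 300_000],
    [120_000, 95_000, 99_000],
    [60_000, 400_000, 500_000],
]
fractions = fraction_below(pairs, cutoff=100_000)
print(fractions, max(fractions))
```

The maximum of this per-position fraction is the analogue of the 68% figure quoted for the 100 kya cut-off.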
Extended Data Tables
- Supplementary Information
This file contains Supplementary Text and Data, Supplementary Tables, Supplementary Figures and additional references (see Contents for details).