High-coverage whole-genome sequence studies have so far focused on a limited number¹ of geographically restricted populations²⁻⁵, or been targeted at specific diseases, such as cancer⁶. Nevertheless, the availability of high-resolution genomic data has led to the development of new methodologies for inferring population history⁷⁻⁹ and refuelled the debate on the mutation rate in humans¹⁰. Here we present the Estonian Biocentre Human Genome Diversity Panel (EGDP), a dataset of 483 high-coverage human genomes from 148 populations worldwide, including 379 new genomes from 125 populations, which we group into diversity and selection sets. We analyse this dataset to refine estimates of continent-wide patterns of heterozygosity, long- and short-distance gene flow, archaic admixture, and changes in effective population size through time, and to search for signals of positive or balancing selection. We find a genetic signature in present-day Papuans that suggests that at least 2% of their genome originates from an early and largely extinct expansion of anatomically modern humans (AMHs) out of Africa. Together with evidence from the western Asian fossil record¹¹, and admixture between AMHs and Neanderthals predating the main Eurasian expansion¹², our results contribute to the mounting evidence for the presence of AMHs out of Africa earlier than 75,000 years ago.
European Nucleotide Archive
- Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327, 78–81 (2010)
- Evolutionary history and adaptation from high-coverage whole-genome sequences of diverse African hunter-gatherers. Cell 150, 457–469 (2012)
- Tracing the route of modern humans out of Africa by using 225 human genome sequences from Ethiopians and Egyptians. Am. J. Hum. Genet. 96, 986–991 (2015)
- A selective sweep on a deleterious mutation in CPT1A in Arctic populations. Am. J. Hum. Genet. 95, 584–589 (2014)
- Large-scale whole-genome sequencing of the Icelandic population. Nat. Genet. 47, 435–444 (2015)
- The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120 (2013)
- Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011)
- Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46, 919–925 (2014)
- Ancient admixture in human history. Genetics 192, 1065–1093 (2012)
- Revising the human mutation rate: implications for understanding human evolution. Nat. Rev. Genet. 13, 745–753 (2012)
- Climatic variability, plasticity, and dispersal: a case study from Lake Tana, Ethiopia. J. Hum. Evol. 87, 32–47 (2015)
- Ancient gene flow from early modern humans into Eastern Neanderthals. Nature 530, 429–433 (2016)
- Rethinking the dispersal of Homo sapiens out of Africa. Evol. Anthropol. 24, 149–164 (2015)
- The earliest unequivocally modern humans in southern China. Nature 526, 696–699 (2015)
- Genomic and cranial phenotype data support multiple modern human dispersals from Africa and a southern route into Asia. Proc. Natl Acad. Sci. USA 111, 7248–7253 (2014)
- Genetic and archaeological perspectives on the initial modern human colonization of southern Asia. Proc. Natl Acad. Sci. USA 110, 10699–10704 (2013)
- Geography predicts neutral genetic diversity of human populations. Curr. Biol. 15, R159–R160 (2005)
- A draft sequence of the Neandertal genome. Science 328, 710–722 (2010)
- Denisova admixture and the first modern human dispersals into Southeast Asia and Oceania. Am. J. Hum. Genet. 89, 516–528 (2011)
- Genome sequence of a 45,000-year-old modern human from western Siberia. Nature 514, 445–449 (2014)
- A revised timescale for human evolution based on ancient mitochondrial genomes. Curr. Biol. 23, 553–559 (2013)
- The genetic history of Ice Age Europe. Nature 534, 200–205 (2016)
- A high-coverage genome sequence from an archaic Denisovan individual. Science 338, 222–226 (2012)
- Visualizing spatial population structure with estimated effective migration surfaces. Nat. Genet. 48, 94–100 (2016)
- A genetic atlas of human admixture history. Science 343, 747–751 (2014)
- A model for the length of tracts of identity by descent in finite random mating populations. Theor. Popul. Biol. 64, 141–150 (2003)
- Higher levels of Neanderthal ancestry in East Asians than in Europeans. Genetics 194, 199–209 (2013)
- Pleistocene mitochondrial genomes suggest a single major dispersal of non-Africans and a late glacial population turnover in Europe. Curr. Biol. 26, 827–833 (2016)
- 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012)
- Life history trade-offs explain the evolution of human pygmies. Proc. Natl Acad. Sci. USA 104, 20216–20219 (2007)
- Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87 (2005)
- Inference of population structure using dense haplotype data. PLoS Genet. 8, e1002453 (2012)
- A “Copernican” reassessment of the human mitochondrial DNA tree from its root. Am. J. Hum. Genet. 90, 675–684 (2012)
- The archaeogenetics of Europe. Curr. Biol. 20, R174–R183 (2010)
Extended data figures and tables
Extended Data Figures
- Extended Data Figure 1: Sample diversity and archaic signals.
a, Map of location of samples highlighting the diversity/selection sets. b, Sample-level heterozygosity is plotted against distance from Addis Ababa. The trend line represents only non-African samples. The inset shows the waypoints used to arrive at the distance in kilometres for each sample. c, ADMIXTURE plot (K = 8 and 14) which relates general visual inspection of genetic structure to studied populations and their region of origin. d, Box plots were used to visualize the Denisova (red), Altai (green) and Croatian Neanderthal (blue) D distribution for each regional group of samples. Oceanian Altai D values show a remarkable similarity with the Denisova D values for the same region, in contrast with the other groups of samples where the Altai box plots tend to be more similar to the Croatian Neanderthal ones. Boxes show median, first and third quartiles, with 1.5× interquartile range whiskers and black dots as outliers.
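The distance in panel b is described as a path from Addis Ababa through waypoints to each sample. A minimal sketch of that kind of calculation follows, assuming great-circle (haversine) legs summed in order; the caption does not state the exact distance formula, and the function names, the Addis Ababa coordinates as written here, and the example waypoint and sample coordinates are illustrative assumptions rather than values from the study.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)

# Approximate (latitude, longitude) of Addis Ababa in degrees.
ADDIS_ABABA = (9.03, 38.74)


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def path_distance_km(origin, waypoints, sample):
    """Sum the great-circle legs origin -> waypoints (in order) -> sample."""
    stops = [origin, *waypoints, sample]
    return sum(haversine_km(p, q) for p, q in zip(stops, stops[1:]))


# Hypothetical waypoint and sample location, for illustration only.
example_waypoint = (30.0, 31.2)
example_sample = (59.4, 24.7)
distance = path_distance_km(ADDIS_ABABA, [example_waypoint], example_sample)
```

Routing through waypoints can only lengthen the path relative to the direct great-circle distance, which is why the choice of waypoints matters for the x axis of the heterozygosity plot.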
- Extended Data Figure 2: Data quality checks and heterozygosity patterns.
a, b, Concordance of DNA sequencing (Complete Genomics Inc.) and DNA genotyping (Illumina genotyping arrays) data (ref-ref; het-ref-alt and hom-alt-alt, see Supplementary Information 1.6) from chip (a) and sequence data (b). c, Coverage (depth) distribution of variable positions, divided by DNA source (blood or saliva) and Complete Genomics calling pipeline (release version). d, Genome-wide distribution of transition/transversion ratio subdivided by DNA source (saliva or blood) and by Complete Genomics calling pipeline. e, Genome-wide distribution of transition/transversion ratio subdivided by chromosome. f, Inter-chromosome differences in observed heterozygosity in 447 samples from the diversity set. g, Inter-chromosome differences in observed heterozygosity in a set of 50 unpublished genomes from the Estonian Genome Center, sequenced on an Illumina platform at an average coverage exceeding 30×. h, Inter-chromosome differences in observed heterozygosity in phase 3 of the 1000 Genomes Project. The total number of observed heterozygous sites was divided by the number of accessible base pairs reported by the 1000 Genomes Project.
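Two of the quality metrics named in this caption are simple ratios. The sketch below shows how they could be computed, assuming single-nucleotide variants are given as (reference, alternate) base pairs and that counts of heterozygous sites and accessible base pairs are already available; the function names and input shapes are hypothetical and do not describe the authors' pipeline.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snvs):
    """Transition/transversion ratio for a list of (ref, alt) base pairs.

    Assumes every pair is a valid SNV with ref != alt.
    """
    ts = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = len(snvs) - ts
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ts / tv


def observed_heterozygosity(n_het_sites, n_accessible_bp):
    """Heterozygous sites per accessible base pair, as described for panel h."""
    if n_accessible_bp <= 0:
        raise ValueError("number of accessible base pairs must be positive")
    return n_het_sites / n_accessible_bp
```

For example, `ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])` counts two transitions and one transversion.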
- Extended Data Figure 3: FineSTRUCTURE shared ancestry analysis.
ChromoPainter and FineSTRUCTURE results, showing both inferred populations and the underlying (averaged) number of haplotypes that an individual in a population receives (rows) from donor individuals in other populations (columns). 108 populations are inferred by FineSTRUCTURE. The dendrogram shows the inferred relationship between populations. The numbers on the dendrogram give the proportion of MCMC iterations for which each population split is observed (where this is less than 1). Each ‘geographical region’ has a unique colour from which individuals are labelled. The number of individuals in each population is given in the label; for example, ‘4Italians; 3Albanians’ is a population of size 7 containing 4 individuals from Italy and 3 from Albania.
- Extended Data Figure 4: MSMC genetic split times and outgroup f3 results.
a, The MSMC split times estimated between each sample and a reference panel of nine genomes were linearly interpolated to infer the broader square matrix. b, c, Summary of outgroup f3 statistics for each pair of non-African populations or an ancient sample using Yoruba as an outgroup. Populations are grouped by geographic region and are ordered with increasing distance from Africa (left to right for columns and bottom to top for rows). Colour bars at the left and top of the heat map indicate the colour coding used for the geographical region. Individual population labels are indicated at the right and bottom of the heat map. The f3 statistics are scaled to lie between 0 and 1, with black indicating values close to 0 and red indicating values close to 1. For a focal population X (on rows), let m and M be the minimum and maximum f3 values within its row: $m = \min_{Y \neq X} f_3(X, Y; \mathrm{Yoruba})$ and $M = \max_{Y \neq X} f_3(X, Y; \mathrm{Yoruba})$. The scaled f3 statistic for a given cell in that row is $f_3^{\mathrm{scaled}} = (f_3 - m)/(M - m)$, so that the smallest f3 in the row has value $f_3^{\mathrm{scaled}} = 0$ (black) and the largest has value $f_3^{\mathrm{scaled}} = 1$ (red). By default, the diagonal has value $f_3^{\mathrm{scaled}} = 1$ (red). The heat map is therefore asymmetric, with the population closest to the focal population at a given row having value $f_3^{\mathrm{scaled}} = 1$ (red) and the population farthest from the focal population having value $f_3^{\mathrm{scaled}} = 0$ (black). Scanning the columns of a given row therefore reveals the populations with the most shared ancestry with the focal population of that row.
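The row-wise scaling described for panels b and c can be sketched as follows. This is an illustration of the stated formula only: the nested-dictionary input, the function name and the numerical values are hypothetical, not data from the study.

```python
def scale_f3_rows(f3):
    """Min-max scale outgroup f3 values within each row (focal population).

    f3 maps focal population -> {other population: f3(focal, other; outgroup)}.
    The diagonal (focal vs itself) is excluded from min/max and set to 1.
    """
    scaled = {}
    for focal, row in f3.items():
        others = {pop: value for pop, value in row.items() if pop != focal}
        m, M = min(others.values()), max(others.values())
        if M == m:
            raise ValueError(f"row {focal!r} has no spread; scaling is undefined")
        scaled[focal] = {pop: (value - m) / (M - m) for pop, value in others.items()}
        scaled[focal][focal] = 1.0  # diagonal is 1 (red) by default
    return scaled


# Hypothetical symmetric f3 values for three populations.
example_f3 = {
    "A": {"B": 0.10, "C": 0.20},
    "B": {"A": 0.10, "C": 0.05},
    "C": {"A": 0.20, "B": 0.05},
}
example_scaled = scale_f3_rows(example_f3)
```

In this toy input the raw value for the pair (A, B) is the same in both rows, yet B is the farthest population in row A (scaled value 0) while A is the closest population in row B (scaled value 1), which is the asymmetry the caption describes.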
- Extended Data Figure 5: Geographical patterns of genetic diversity.
Isolation by distance pattern across areas of high genetic gradient, using Europe as a baseline. The samples used in each analysis are indicated by coloured lines on the maps to the right of each plot. a–d, The panels show FST as a function of distance across the Himalayas (a), the Ural mountains (b), and the Caucasus (c) as reported on the colour-coded map (d). e, Effect of creating gaps in the samples in Europe. f, g, We tested the effect of removing samples from stripes, either north to south (f) or west to east (g), to create gaps comparable in size to the gaps in samples in the dataset. h, Effective migration surfaces inferred by EEMS.
- Extended Data Figure 6: Summary of positive selection results.
a, Bar plot comparing frequency distributions of functional variants in Africans and non-Africans. The distribution of exonic SNPs according to their functional impact (synonymous, missense and nonsense) as a function of allele frequency. Note that the data from both groups were normalized for a sample size of n = 21 and that the Africans show significantly (χ² P < 1 × 10⁻¹⁵) more rare variants across all site classes. b, Result of 1,000 bootstrap replicates of the $R_{X/Y}$ test for a subset of pigmentation genes highlighted by genome-wide association studies (GWAS, n = 32). The horizontal line provides the African reference (x = 1) against which all other groups are compared. The blue and red marks show the 95th and the 5th percentile of the bootstrap distributions, respectively. If the 95th percentile is below 1, then the population shows a significant excess of missense variants in the pigmentation subset relative to the Africans. Note that this is the case for all non-Africans except the Oceanians. c, Pools of individuals for selection scans. The fineSTRUCTURE-based co-ancestry matrix was used to define twelve groups of populations for the downstream selection scans. These groups are highlighted in the plot by boxes with broken-line edges. The number of individuals in each group is reported in Supplementary Table 1:3.2-I.
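The decision rule in panel b (compare the 5th and 95th bootstrap percentiles with the reference value of 1) can be sketched generically. The caption does not define the test statistic or the resampling unit, so the sketch takes the statistic as a caller-supplied function and resamples items with replacement; the function names, the nearest-rank percentile choice and the toy values are all assumptions.

```python
import random
from statistics import mean


def bootstrap_percentiles(items, statistic, n_boot=1000, seed=0):
    """Return (5th, 95th) percentiles of a statistic over bootstrap resamples.

    items: resampling units (for example per-gene values; hypothetical here).
    statistic: function mapping a list of items to a single number.
    """
    rng = random.Random(seed)
    replicates = sorted(
        statistic([rng.choice(items) for _ in items]) for _ in range(n_boot)
    )
    # Simple rank-based percentiles of the sorted bootstrap distribution.
    p5 = replicates[int(0.05 * (n_boot - 1))]
    p95 = replicates[int(0.95 * (n_boot - 1))]
    return p5, p95


def is_below_reference(p95, reference=1.0):
    """True when the 95th percentile lies below the reference line."""
    return p95 < reference


# Toy per-item ratios standing in for the real statistic's inputs.
toy_values = [0.5, 0.6, 0.7]
toy_p5, toy_p95 = bootstrap_percentiles(toy_values, mean)
```

With the toy values every resampled mean lies between 0.5 and 0.7, so the 95th percentile is below 1 and the rule flags the group as differing from the reference.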
- Extended Data Figure 7: Length of haplotypes assigned as African by fineSTRUCTURE as a function of genome proportion.
a, 447 Diversity Panel results, showing label averages (large crosses) along with individuals (small dots). b, Relative-excluded Diversity Panel results, to check whether including related individuals affects African genome fraction. Individuals that shared more than 2% of genome fraction were forbidden from receiving haplotypes from each other, and the painting was re-run on a large subset of the genome (all run-of-homozygosity (ROH) regions from any individual). c, ROH-only African haplotypes. To guard against phasing errors, we analysed only regions for which an individual was in a long (>500 kb) run of homozygosity using the PLINK command ‘--homozyg-window-kb 500000 --homozyg-window-het 0 --homozyg-density 10’. Because there are so few such regions, we report only the population average for populations with two or more individuals, as well as the standard error in that estimate. Populations for which the 95% confidence interval passed 0 were also excluded. Note the logarithmic axis. d, Ancient DNA panel results. We used a different panel of 109 individuals which included three ancient genomes. We painted chromosomes 11, 21 and 22 and report as crosses the population averages for populations with two or more individuals. The solid thin lines represent the position of each population when only modern samples are analysed. The dashed lines lead off the figure to the position of the ancient hominins and the African samples.
- Extended Data Figure 8: Linear behaviour of MSMC split estimates in presence of admixture.
a–c, The examined Central Asian (a), East African (b), and African–American (c) genomes yielded a signature of MSMC split time (truth, left-most column) that could be recapitulated (reconstruction, second left-most column) as a linear mixture of other MSMC split times. The admixture proportions inferred by our method (top of each admixture component column) were remarkably similar to the ones previously reported in the literature. d, MSMC split times calculated after re-phasing an Estonian and a Papuan (Koinanbe) genome together with all the available West African and Pygmy genomes from our dataset to minimize putative phasing artefacts. The cross coalescence rate curves reported here are quantitatively comparable with the ones of Fig. 2a, hence showing that phasing artefacts are unlikely to explain the observed past-ward shift of the Papuan–African split time. e, Box plot showing the distribution of differences between African–Papuan and African–Eurasian split times obtained from coalescent simulations assembled through random replacement to make 2,000 sets of 6 individuals (to match the 6 Papuans available from our empirical dataset), each made of 1.5 Gb of sequence. The simulation command line used to generate each 5-Mb chromosome was as follows, where x is the variable for the divergence time used: x = 0.064, 0.4 or 0.8 for the xOoA, Denisova (Den) and Divergent Denisova (DeepDen) cases, respectively. ms0ancient2 10 1. 065.05 -t 5000. -r 3000. 5000000 -I 7 1 1 1 1 2 2 2 -en 0. 1 .2 -en 0. 2 .2 -en 0. 3 .2 -en 0. 4 .2 -es .025 7 .96 -en .025 8 .2 -ej .03 7 6 -ej .04 6 5 -ej .060 8 3 -ej .061 4 3 -ej .062 2 1 -ej .063 3 1 -ej x 1 5.
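The idea in panels a–c, that an admixed genome's split-time profile looks like a weighted average of the profiles of its source populations, can be illustrated with a two-source least-squares fit. This is a sketch under assumptions, not the authors' procedure: the caption does not specify the fitting method or the number of components, and the function names and split-time values below are hypothetical.

```python
def mixture_proportion(target, source_a, source_b):
    """Least-squares weight alpha such that
    target ~ alpha * source_a + (1 - alpha) * source_b, clipped to [0, 1].

    Each argument is a list of split times against the same reference genomes.
    """
    num = sum((t - b) * (a - b) for t, a, b in zip(target, source_a, source_b))
    den = sum((a - b) ** 2 for a, b in zip(source_a, source_b))
    if den == 0:
        raise ValueError("sources are identical; the proportion is not identifiable")
    return min(1.0, max(0.0, num / den))


def reconstruct(alpha, source_a, source_b):
    """Split-time profile predicted by the fitted linear mixture."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(source_a, source_b)]


# Hypothetical split times (arbitrary units) against three reference genomes.
profile_a = [10, 20, 30]
profile_b = [50, 40, 90]
admixed = [40, 35, 75]  # constructed as 0.25 * profile_a + 0.75 * profile_b

alpha = mixture_proportion(admixed, profile_a, profile_b)
fitted = reconstruct(alpha, profile_a, profile_b)
```

Comparing `fitted` with `admixed` corresponds to comparing the "reconstruction" and "truth" columns in the figure; extending the fit to more than two sources would need a constrained solver such as non-negative least squares.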
- Extended Data Figure 9: Modelling the xOoA components with FineSTRUCTURE.
a, Joint distribution of haplotype lengths and derived allele count, showing the median position of each cluster and all haplotypes assigned to it in the maximum a posteriori (MAP) estimate. Note that although a different proportion of points is assigned to each in the MAP, the total posterior is very close to 1/K for all. The dashed lines show a constant mutation rate. Haplotypes are ordered by mutation rate from low to high. b, Residual distribution comparison between the two-component mixture using EUR.AFR and EUR.PNG (left), and the three-component mixture including xOoA (using the same colour scale) (right). The root mean square error (RMSE) residuals without xOoA are larger (RMSE = 0.0055 compared to RMSE = 0.0018) but more importantly, they are also structured. c, Assuming a mutational clock and a correct assignment of haplotypes, we can estimate the relative age of the splits from the number of derived alleles observed on the haplotypes. This leads to an estimate of 1.5 times older for xOoA compared to the Eurasian–Africa split.
- Extended Data Figure 10: Proposed xOoA model.
A schematic illustrating, as suggested by the results presented here, a model of an early, extinct Out-of-Africa (xOoA) signature in the genomes of Sahul populations at their arrival in the region. Given the overall small genomic contribution of this event to the genomes of modern Sahul individuals, we could not determine whether the documented Denisova admixture (question marks) and putative multiple Neanderthal admixtures took place along this extinct OoA. We also speculate (question mark) that people who migrated along the xOoA route may have left a trace in the genomes of the Altai Neanderthal, as reported by Kuhlwilm and colleagues¹².