Introduction

Our understanding of European prehistory has been revolutionized by the availability of new DNA sequencing technologies1, which have allowed the unbiased characterization of sequence variation in modern and ancient human genomes. Genome-wide ancient DNA (aDNA) data have shown a clear discontinuity between Paleolithic hunter-gatherers and Neolithic farmers2,3,4,5. Patterns of diversity suggested low Paleolithic population sizes, with regional differences among Western and Scandinavian groups6. This picture has been further refined by the study of the DNA of ancient Yamnaya herders from the region of the Pontic-Caspian steppe, the apparent source of Bronze Age migrations into Europe and Asia7,8,9, and a debated region of origin for Indo-European languages10. These studies highlight an introgression into Europe at around 4.5 thousand years ago (KYA) from the East, followed by development of genetic structure in Bronze Age Europe7,8.

Ideas on European prehistory have been strongly influenced by studies not only of autosomal DNA, but also of uniparentally-inherited markers, which can provide information about sex-biased processes11. Analyses of aDNA show that today’s most frequent Y-chromosome haplogroup (R1b-M269) is very rare in Europe until 4.5 KYA5 (see summary elsewhere12), while it is present in all the Yamnaya samples8,9. This had initially suggested a major introgression of males from the Pontic-Caspian steppe; however, the R1b sublineage (R1b-L11) now common in Europe has not so far been found among Yamnaya sequences13. In contrast to Y data, patterns of ancient mtDNA reveal a period of widespread turnover of lineages in the Late Glacial period ~14.5 KYA14,15 and support the picture of a discontinuity between Paleolithic hunter-gatherers and Neolithic farmers seen in autosomal data, but contain no apparent signal of the Bronze Age steppe expansion16. Ancient DNA data on uniparentally-inherited markers therefore suggest a strong sex-bias in recent demographic changes in Europe.

Resequencing of the male-specific region of the Y chromosome (MSY)17 in modern European populations also emphasizes the importance of the Bronze Age transition12,18,19,20. Demographic reconstructions support an expansion starting ~2.1–4.2 KYA, and times-to-most-recent-common-ancestor (TMRCAs) of three major haplogroups, including R1b, are estimated between 3.5 and 7.3 KYA12. Many mtDNA studies have been undertaken in modern European samples, but most concentrate on particular segments of the mitochondrial genome21, consider the European continent as a single unit18,19, or take a phylogeographic approach, focusing on specific lineages of interest22,23,24,25,26,27. The general conclusion has been that current mtDNA variation represents re-expansions from glacial refugia. However, population-based whole mtDNA resequencing studies in Europe are lacking, so a systematic comparison of the demographic histories of males and females has not yet been possible.

Here, we carry out a population-based study, resequencing mtDNA in a set of 17 populations for which MSY resequencing data are already available. The pattern observed in mtDNA is strikingly different from that in MSY, compatible with expansion after the Last Glacial Maximum, and emphasizing the male-specific nature of the Bronze Age expansion in Europe.

Results

To allow a comparison between female and male histories, we resequenced the mitochondrial genomes of 340 European and Middle Eastern individuals belonging to 17 populations (Table S1) that were previously analyzed for MSY12.

We constructed a maximum parsimony (MP) tree (Fig. 1a; see also median-joining network in Figure S1) based on the mtDNA coding region only, which is best suited for reliable phylogenetic inference due to its relatively low content of recurrently mutating sites (Figure S2 presents a median-joining network based on the entire sequence). We also determined haplogroups from sequences (Table S1): their frequencies and geographical distributions (Fig. 1b) are broadly consistent with previous data21.

Figure 1

Phylogenies and geographical distributions of European mtDNA and MSY lineages. (a) Maximum-parsimony mtDNA tree, based on resequencing of 15,447 bp of the coding region. Branch lengths are proportional to molecular divergence among haplotypes, and colours indicate haplogroups. Point estimates of TMRCAs in KY are given in parentheses after haplogroup names; see also Table 1. (b) Map with pie-charts showing frequencies of mtDNA haplogroups (defined and coloured as in part (a)) in 17 populations from Europe and the Near East. Population abbreviations are as follows: bas: Spanish Basque country; bav: Bavaria (Germany); CEU: Utah residents with Northern and Western European ancestry from the CEPH collection (France); den: Denmark; eng: England41,42; fri: Frisia (Netherlands); gre: Greece; hun: Hungary; ire: Ireland; nor: Norway; ork: Orkney41,42; pal: Palestinians; saa: Saami (Finland); ser: Serbia; spa: central Spain; TSI: Toscani in Italia (Italy); tur: Turkey. Map from Mountain High Map Frontiers™ version 94.01 (Mountain High Maps® Copyright © 1993 Digital Wisdom®, Inc.; www.digiwis.com/dwi_frl.htm). (c) Maximum-parsimony MSY tree, based on resequencing of 3.7 Mb in each individual12. Branch lengths are proportional to molecular divergence among haplotypes, and colours indicate haplogroups. Point estimates of TMRCAs are given in parentheses after haplogroup names. (d) Map with pie-charts showing frequencies of MSY haplogroups (defined and coloured as in part (c)) in 17 populations from Europe and the Near East12. Population abbreviations are as in part (b). Map from Mountain High Map Frontiers™ version 94.01 (Mountain High Maps® Copyright © 1993 Digital Wisdom®, Inc.; www.digiwis.com/dwi_frl.htm).

In the overall dataset, the main haplogroups observed are H (34.1%), U (17.9%), T (13.5%), J (9.1%), K (7.3%), and V (5.3%). The remaining 12.6% of the dataset comprises many minor haplogroups; these include (in the Palestinian and Spanish samples) two examples of each of the lineages L2 and L3, typically found in sub-Saharan Africa28 (see also YRI data in Table S1).

Visual inspection of the distributions of haplogroups (Fig. 1b) shows the Saami to be a clear outlier with low diversity, dominated by haplogroups U5 and V, and including a single example of haplogroup D, found mostly in north and east Asia29. In the remaining populations there is no obvious geographical pattern, in agreement with previous observations based on analyses of specific segments of the mitochondrial genome21,30.

This phylogeography of mitochondrial genomes was compared with that of MSY, based on resequencing of 3.7 Mb of Y-DNA in the same set of samples12 (Fig. 1c,d). The geographical distributions of haplogroups are more localized for MSY (Fig. 1d) than for mtDNA (Fig. 1b): for example, R1b is at particularly high frequency in the northwest, R1a in north and central Europe, and J2 in the south. Although the two MP trees have very different scales, due to the much smaller number of nucleotides analyzed in mtDNA compared to MSY (with a ratio of 1:250), both show deep-rooting clades (mtDNA: haplogroups U, K and T2; MSY: E1b-M35, G2a-L31, I2-P215, J2-M172, L and T), as well as star-like clades, indicative of population expansions. For mtDNA, these star-like clades are H, J1, T1 and V, together representing 51.7% of our sample, while for MSY they are I1-M253, N1c-M178, R1a-M198 and R1b-M269 (taken together, 64%). However, the major difference is in the relative lengths of the internal branches, which indicate that the expansions of mtDNA lineages are more ancient than those of MSY lineages. This is supported by TMRCA point estimates (based on the entire mitochondrial genome) for the star-like haplogroups (Fig. 1a,c; Table 1), which are all ≤6 KYA (post-Neolithic) for the MSY12, but >13 KYA (Paleolithic) for mtDNA.

Table 1 TMRCAs of major mitochondrial haplogroups in Europe.

To consider populations, rather than lineages, we reconstructed demographic histories by using Bayesian Skyline Plots (BSPs) based on mtDNA sequences (Fig. 2). As expected from their unusual haplogroup composition, the Saami also represent an outlier here, showing a steady decline in effective population size that becomes more marked at around 5 KYA. All other populations show a signature of Paleolithic expansion, between 13 and 20 KYA. The Turkish and Palestinian samples differ from the majority in showing considerably more ancient population expansion, at >40 KYA. These patterns contrast sharply with the BSPs for MSY12 in the same populations (Fig. 2), which in most cases (13/17 populations) show demographic histories featuring a minimum effective population size around 3 KYA (late Bronze Age for many of the populations studied), followed by rapid expansion to the present. In all comparisons except those in Basques and Danish, current point estimates of effective population size are higher for mtDNA than for MSY.

Figure 2

Bayesian Skyline Plots for mtDNA and MSY. Thick lines (mtDNA: purple; MSY: orange) indicate the median for effective population size and thinner lines show 95% highest posterior density intervals. Population abbreviations are as in Fig. 1, and plots for MSY are adapted from Batini et al. (2015)12.

We also calculated diversity indices for each population (Table 2). In agreement with the observation of a limited number of haplogroups in Saami, and the corresponding BSP, this population shows the lowest value for all diversity measures. The highest values are seen in the Palestinian and Turkish samples, which again is concordant with the ancient population growth seen in the BSPs. We observe negative values of both D and FS for all populations except the Saami, which can indicate population growth; however, both values are significant for only twelve of the remaining populations. At a glance, there appears to be more diversity in southern than northern populations (Fig. 1b; Table 2). To formally test this, we carried out a correlation analysis between genetic diversity (number of polymorphic sites, and nucleotide diversity) and latitude, longitude and overland distance from the Franco-Cantabrian and Near-Eastern glacial refugia (Table S2). When all populations are included, both measures show a statistically significant correlation with latitude and distance from the Near-Eastern refugium, but not with longitude. These correlations are lost when we remove the outlier populations described above (Saami, Turkish and Palestinian), demonstrating the lack of any pattern of mtDNA diversity in most of Europe.

Table 2 Diversity parameters for the 17 populations for mtDNA coding region.

Discussion

The study of mtDNA in Europe has a long history and has been influential in developing hypotheses about the origins of modern Europeans15,31. Early population-based studies involved sequencing of the control region32, and later the typing of specific haplogroup-defining SNPs in the coding region21. Analyses of whole mitochondrial genome sequences have focused on specific haplogroups, often in the framework of the re-peopling of Europe after the last glacial maximum23,24,25. At the population scale, the only previous studies have considered Europe at the continental level18,19, with low sample sizes (n = 86 and 81, respectively) and patchy population coverage.

Here, we have carried out whole mtDNA sequencing in a European and Middle Eastern population-based sample set of 340 individuals, in which MSY resequencing12 had previously been undertaken. The spectrum of haplogroups we observe (Table S1, Fig. 1a) is compatible with previously published data21. The population-based design of this study and the unbiased nature of variant ascertainment mean that European mtDNA and MSY diversity can be compared fairly. The phylogenies and demographic reconstructions concur in showing a marked difference between female and male population histories, with Paleolithic expansions for mtDNA contrasting with Bronze Age expansions for MSY. While this is in agreement with continental-level differences observed previously18,19, here we also show that this difference holds for most of the individual populations, and reflects a lack of geographical pattern in Europe. The most ancient mtDNA expansions we detect, dating close to the early peopling of Eurasia (40–50 KYA), are in the Near and Middle East. This difference in timing of European female and male lineage expansions is mirrored in the Indian subcontinent, where a recent analysis33 shows that mtDNA expansions reflect processes in the pre-Holocene era, while MSY expansions are mostly in the last 10 KYA, with marked male-driven spread from Central Asia during the Bronze Age.

Since mtDNA analysis has forensic utility34, it is worth noting that the 340 individuals carry only 318 distinct haplotypes based on complete mitochondrial genomes (Figure S2; Table S3), emphasizing the relatively low discriminatory power of mtDNA sequencing, even at its maximum resolution. We observe 12 identical pairs of haplotypes, three trios, and one example of a haplotype found five times in the dataset. These cases represent within-population sharing (in Danish, Norwegian, Orcadian, Frisian, Basque and Saami), with one exception, a haplotype within hg V shared between two Saami and a Spanish individual (Figure S2b) – reflecting a connection previously noted35. As reflected in diversity measures (Table 2) the Saami have particularly high haplotype sharing (two pairs, two trios, one quintet). These findings emphasize the importance of large and appropriate reference databases in forensic analysis.
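The count of distinct haplotypes follows directly from the sharing pattern reported above. A minimal sketch of that arithmetic, in which the dictionary layout and variable names are illustrative only and not part of the study's pipeline:

```python
# Sharing pattern reported in the text: group size -> number of such groups.
shared_groups = {2: 12, 3: 3, 5: 1}  # 12 pairs, 3 trios, 1 haplotype seen five times
n_individuals = 340

# Individuals who carry a haplotype shared with at least one other individual.
n_sharing = sum(size * count for size, count in shared_groups.items())

# Each shared group contributes exactly one distinct haplotype.
n_shared_haplotypes = sum(shared_groups.values())

# Every remaining individual carries a haplotype unique in the dataset.
n_distinct = (n_individuals - n_sharing) + n_shared_haplotypes

print(n_sharing, n_shared_haplotypes, n_distinct)
```

Under these counts, 38 individuals fall into 16 shared haplotypes, which together with the 302 unique sequences gives the 318 distinct haplotypes stated in the text.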

The outlier status of the Saami in our dataset of 17 populations is clear not only from the high frequencies of closely related mtDNA sequences (Figure S2), but also from haplogroups for both mtDNA and MSY that are rare elsewhere in the dataset (Fig. 1b,d), as well as from examples of MSY sequences36 and Y-STR haplotypes36 that are found in more than one individual. These features are in agreement with the lack of growth in effective population size seen in the BSP (Fig. 2). Population-based genome-wide SNP analysis37 and whole-genome sequencing of a single individual38 also show the Saami to be genetically differentiated compared to Europeans, and to carry East Asian ancestry components.

Our data are consistent with ancient DNA data14,15,16 in supporting sex-biased processes in recent European demographic changes: patterns of modern mtDNA diversity show no signal of the Bronze Age expansion, while much of the modern European MSY diversity has been shaped by this process12. However, the modern data differ in showing no clear signal of the Neolithic transition that has been highlighted in ancient mitochondrial and autosomal data5,16. This could be due to drift, which is important in shaping the observed patterns of diversity in uniparental markers, and also to sampling effects.

Much progress has been made in understanding the prehistory of the European continent since the first classical genetic data were interpreted in favour of agriculturally-mediated demic diffusion39. A wealth of both modern and ancient DNA data is now available, and has highlighted previously unsuspected past migration and expansion events, with a sex-biased aspect supported by our population-based resequencing approach. However, much work remains to be done in increasing sample sizes and geographical coverage, and in fully integrating the ancient and modern data to test explicitly the complex scenarios they suggest.

Materials and Methods

DNA samples and sequencing

DNA samples from 340 individuals belonging to 17 populations (20 individuals each) across Europe and the Near East were used for analysis, as in our previous study of MSY diversity12. Populations were as follows: Greek, Serbian, Hungarian, German [Bavaria], Spanish Basque, central Spanish, French (Centre d’Etude du Polymorphisme Humain [CEPH] collection in Utah, USA, with ancestry from Northern and Western Europe40 [CEU]), Italian (Toscani in Italia40 [TSI]), Dutch [Frisia], Danish, Norwegian, Finnish [Saami], English41,42 [Herefordshire and Worcestershire, Gloucestershire, Oxfordshire, Forest of Dean], Orcadian41,42, Irish, Turkish and Palestinian. Twenty individuals from each of two additional HapMap40 population samples, CHB (Han Chinese in Beijing, China), and YRI (Yoruba in Ibadan, Nigeria) were included to provide variant validation data. All methods were carried out in accordance with relevant guidelines and regulations, and all experimental protocols were approved by the University of Leicester Research Ethics Committee. Informed consent was obtained from all subjects (University of Leicester Research Ethics Committee reference: maj4-cb66).

We generated three datasets using parallel sequencing strategies (based on Illumina HiSeq, Illumina MiSeq and Ion Torrent PGM technologies [Table S4]) and bioinformatic pipelines (Table S5), and validated variants by comparison with independent sequence and SNP-genotype data (Table S6). These three datasets were merged for all the subsequent evolutionary analyses. The 380 sequences are available in Supplementary Dataset S1.

Tree construction and haplogroup prediction

A maximum parsimony (MP) tree was constructed from coding-region sequences (positions 576-16,023) via MEGA643, using the Subtree-Pruning-Regrafting (SPR) algorithm44 with search level 0, in which the initial trees were obtained by the random addition of sequences (10 replicates). Branch lengths were calculated using the average pathway method44 and are proportional to the number of mutations. FigTree v1.4.045 was used for tree visualization. Haplogroups were predicted using HaploGrep246, and their phylogenetic coherence was verified using the tree, with manual examination of possible ‘phantom’ mutations, as inferred using HaploGrep246 (Table S7).
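The coding-region coordinates given above can be reconciled with the 15,447 bp quoted in the legend of Fig. 1. The sketch below is one possible reading and rests on an assumption not stated in the text: that coordinates follow rCRS numbering, in which position 3107 is a placeholder kept to preserve historical numbering and does not correspond to a base.

```python
# Coding-region boundaries as given in the Methods (1-based, inclusive).
START, END = 576, 16_023

# Assumption: rCRS numbering, where position 3107 is a placeholder, not a base.
PLACEHOLDER_POSITIONS = {3107}

# Number of coordinate positions spanned by the region.
n_positions = END - START + 1

# Number of actual bases once placeholder positions inside the region are excluded.
n_bases = n_positions - sum(START <= p <= END for p in PLACEHOLDER_POSITIONS)

print(n_positions, n_bases)
```

With these inputs the region spans 15,448 coordinate positions and 15,447 bases.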

Median-joining networks47 based on either coding region or whole mtDNA sequences were constructed using Network 5.0.0.0, and represented using Network Publisher 2.1.1.2. Polymorphic sites were weighted according to their evolutionary rates using the parameters suggested in the literature48.

Haplogroups were defined according to Phylotree1649 by using HaploGrep50 and their relative frequencies represented as pie-charts plotted on a geographical map.

TMRCA estimation

TMRCAs of nodes of interest were estimated51 using BEAST v1.8.0. MCMC samples were based on 50,000,000 generations, logging every 1000 steps, with the first 5,000,000 generations discarded as burn-in. Three runs were combined for analysis using LogCombiner. We used an exponential growth coalescent tree prior, HKY substitution model, and an uncorrelated relaxed clock with a lognormal distribution for mutation rate (2.21 ± 0.17 × 10−8 mutations/nucleotide/year52). TMRCAs were estimated in a single run including all 17 populations and assigning samples to specific clades in agreement with the MP tree shown in Fig. 1. For this analysis, the entire mitochondrial genome was considered: given the timeframe of interest (<40 KYA), the rate and its standard error were adjusted by using the calculator52 in Soares et al. (2009) to infer the rate at each of the nodes of interest (see Table 1). The median of this distribution of values was used for estimating TMRCAs for the haplogroups of interest.

Bayesian skyline plots

BSPs51 were generated using BEAST v1.8.0. MCMC samples were based on 30,000,000 generations, logging every 1000 steps, with the first 3,000,000 generations discarded as burn-in. We used a piecewise linear skyline model with ten groups, a HKY substitution model, and an uncorrelated relaxed clock with a lognormal distribution for mutation rate (2.21 ± 0.17 × 10−8 mutations/nucleotide/year52) and a generation time of 30 years53,54. For this analysis, the entire mitochondrial genome was considered: given the timeframe of interest (<40 KYA) the rate and its standard error were adjusted by using the calculator52 in Soares et al. (2009) to infer the rate at each of the nodes of interest (see Table 1). The median of this distribution of values was used for estimating TMRCAs for the haplogroups of interest.
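The MCMC settings quoted for the TMRCA and BSP analyses determine how many logged states contribute to each posterior summary. The following is a rough bookkeeping sketch, not a reproduction of the BEAST or LogCombiner workflow; the function name is hypothetical, burn-in is assumed to be discarded from each run before combining, and a single run is assumed for the BSPs because the number of runs is not stated.

```python
def retained_states(chain_length, log_every, burn_in, n_runs=1):
    """Logged states kept after burn-in, summed over independent runs.

    Ignores the initial state (step 0), which a sampler may also log.
    """
    return n_runs * ((chain_length - burn_in) // log_every)

# TMRCA analysis: 50M generations, 5M burn-in, logging every 1000, three runs combined.
tmrca_states = retained_states(50_000_000, 1_000, 5_000_000, n_runs=3)

# BSP analysis: 30M generations, 3M burn-in, logging every 1000; one run assumed.
bsp_states = retained_states(30_000_000, 1_000, 3_000_000)

print(tmrca_states, bsp_states)
```

In both analyses the burn-in corresponds to the first 10% of the chain.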

Intrapopulation diversity and geographical correlation

The number of polymorphic sites per population (S), nucleotide diversity, Tajima’s D 55, and Fu’s FS 56 were calculated57 using Arlequin 3.5. Correlations of S, and of nucleotide diversity, with latitude, longitude, and distances from glacial refugia were examined using the function cor.test of the package stats within R. The locations Anamur, Turkey (36.1°, 32.8°) and Fleurac, France (45.0°, 1.0°) were taken as proxies for the centres of the Near-Eastern and Franco-Cantabrian refugia respectively. Distances account for geographical barriers, and were estimated using the land transport distance tool at www.freemaptools.com.
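The correlation step can be illustrated with a short sketch. It is a simplification under stated assumptions: the great-circle (haversine) distance stands in for the overland distances used in the study and will generally be shorter; only the Pearson coefficient is computed, whereas cor.test also reports a significance test and the correlation method used is not specified above; and the diversity values are hypothetical placeholders, not data from the study.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Refugium proxies from the Methods (latitude, longitude in degrees).
ANAMUR = (36.1, 32.8)
FLEURAC = (45.0, 1.0)
refugia_separation = great_circle_km(*ANAMUR, *FLEURAC)

# HYPOTHETICAL placeholder values, not data from the study.
latitude = [38.0, 44.0, 52.0, 60.0]
n_polymorphic_sites = [120, 110, 105, 90]
r = pearson_r(latitude, n_polymorphic_sites)

print(round(refugia_separation), round(r, 3))
```

Replacing the placeholder lists with per-population values of S or nucleotide diversity, and the haversine distances with overland distances, would bring the sketch closer to the analysis described.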

Data availability

All data generated during this study are included in this article (and its Supplementary Information files).