Introduction

Demographic changes are known to leave footprints on allelic frequencies. Together with the increased availability of large genetic data sets, numerical methods stemming from Kingman’s coalescent theory1 allow inferring the past demography of populations from their present-day patterns of genetic diversity.2, 3 Several types of genetic markers can be used, each category of markers having its own specificities. For instance, as they are uni-parentally transmitted, mitochondrial sequences are informative only about maternal lineages, whereas the Y-chromosome provides information about paternal lineages. In Aimé et al.,4 it has been shown that these two types of markers sometimes provide different inferences on the demographic history of human populations, especially as females generally exhibit higher effective population sizes and higher migration rates than males.5, 6

Conversely, autosomal markers provide synthetic information about both maternal and paternal lineages. When these loci are carefully selected to avoid pairwise linkage disequilibrium, each of them can also be treated as an independent replicate. Among autosomal markers, sequence and microsatellite data exhibit contrasting properties. In particular, microsatellites have higher mutation rates than sequences.7, 8 Moreover, microsatellites are subject to strong homoplasy, which can lead to false signals under models that do not adequately account for this phenomenon. One simulation-based study9 showed that when microsatellites are included in an Approximate Bayesian Computation (ABC) analysis,10 they provide substantially better estimates for recent admixture events than sequence data. However, that study did not investigate the inference of other kinds of demographic events, such as expansions. Moreover, an empirical study on a non-human species (the Black Sea porpoise) showed that integrating microsatellites in an ABC analysis allowed inferring very recent expansion events.11 In the latter study, the recent events occurred a few generations ago, so the question remains whether contrasting analyses based on either microsatellites or sequences can help disentangle the question of the Neolithic vs Paleolithic expansion.

Furthermore, although DNA sequences used in previous studies were limited to some parts of the genome, next-generation data, including whole-genome sequence data, are now increasingly available and will certainly offer much greater power to make precise demographic inferences. However, their use is still limited by computational issues. For instance, most coalescent-based applications that infer the demographic history of populations from contemporary genetic data remain limited in the amount of data they can process (eg, five individuals for MSMC12).

The timing of demographic expansions in humans is a long-standing question historically addressed by archeologists and paleoanthropologists. In particular, during the Neolithic period, the emergence of farming and animal domestication occurred in several parts of the world (Central Africa, Middle-East, Eastern Asia and Central America13), concomitantly with the sedentarization of most nomadic hunter–gatherer populations. This transition was one of the most important cultural and technological revolutions in our history, and it affected many aspects of lifestyle (diet, technologies and social organization). According to most archaeologists and paleoanthropologists (eg, Bocquet-Appel14), the first major expansions in most Eurasian populations would have occurred as a result of this transition. Bocquet-Appel14 notably showed an increase in the number of enclosures and in the proportion of subadults in Eurasian burial sites during the Neolithic, which was interpreted as evidence of an increase in birth rate. Conversely, in some other areas, archeological data have shown traces of demographic growth preceding this transition. For instance, in Africa, radiocarbon dating suggests that a demographic expansion started about 60 000 to 80 000 YBP.15

Recently, population genetics and coalescent-based methods have also been used to address major issues about the demographic history of human populations in multiple areas. For instance, although when and how early modern humans first reached America has been debated (eg, Waters and Stafford16 and Goebel et al.17), demographic inferences using mitochondrial DNA (mtDNA) sequences indicated that an initial differentiation from Asian populations ended with a moderate bottleneck in Beringia during the last glacial maximum (LGM), around 23 000 to 19 000 years ago.18 Then, toward the end of the LGM, a strong spatial and demographic population expansion started 18 000 years ago and finished 15 000 years ago (thus well before the Neolithic transition).

Interestingly, some studies using mtDNA and/or autosomal DNA sequences inferred expansion events predating the Neolithic transition even in Eurasian populations. For instance, estimated expansion times range between 63 000 and 17 000 YBP using HVS-I data.19 In Aimé et al.,20 using nuclear and HVS-I data, we also inferred expansion events that predated the Neolithic transition in farmer and herder populations from Africa and Eurasia. Conversely, we did not find any signal of past expansion in contemporary hunter–gatherer populations from these areas. We thus suggested that previous Paleolithic demographic expansions may have promoted the emergence of farming during the Neolithic period. For Africa, several authors also inferred Paleolithic expansion events, with onsets ranging from 80 000 to 25 000 YBP.21, 22, 23, 24 However, in a study on microsatellite data,25 we did not find signals of Paleolithic expansions in either Eurasia or Africa: the observed signals of expansion events were instead consistent with the Neolithic transition. Finally, in Aimé et al.,4 we also found signals of expansion events that were consistent with the Neolithic transition for Eurasia using Y-chromosome microsatellite data.

In Aimé et al.,4, 25 we suggested that these contrasting findings may result from the specificities of each type of genetic marker and indicate two successive expansion events in the studied populations. Indeed, as explained above, the higher mutation rate of microsatellites as compared to DNA sequences may increase their sensitivity to recent events. In turn, if two successive expansions occurred in the studied population, signals of the more ancient event might be masked by more recent signals and thus be undetectable with such markers. Conversely, more ancient expansion signals may be detected using more slowly evolving markers such as sequence data. Using Simcoal version 2.1.2,26 we simulated here both DNA sequence (mitochondrial or autosomal) and microsatellite (autosomal or from the Y chromosome) data sets under several scenarios involving either one or two successive expansion events, starting at different points in time, consistent with either a Paleolithic or a Neolithic expansion. These data sets were similar to those used in Aimé et al.4, 20, 25 in terms of numbers of loci.

Then, we used the program BEAST27 to obtain posterior estimates of these dates and their 95% highest probability density (HPD) intervals, in order to compare the estimated values with the true values used to simulate the data. This allowed us to investigate (i) to what extent the expansion could be inferred for each kind of marker in the single-expansion scenarios and (ii) whether the older or the younger expansion event was detected for each kind of marker in the scenarios with two successive expansions.

Materials and methods

Data simulation

We generated a large number of simulated population genetics samples using the coalescent-based program Simcoal version 2.1.2.26 These simulations were performed under 10 different scenarios (Table 1 and Figure 1). For each scenario, a simulated population underwent either a single expansion event (‘single expansion’ scenarios) or two expansion events (‘successive expansions’ scenarios), with two possible expansion rates g (10^−3 and 10^−2 per generation) and three possible starting times t (200, 800 or 2000 generations ago) for the expansions. These starting times corresponded to 5000, 20 000 or 50 000 YBP, assuming a generation time of 25 years, as usually assumed in human population genetics studies.19, 22, 28 For all scenarios, samples of 100 individuals were simulated.

Table 1 Description of each scenario
Figure 1

Description of each scenario. (a) Single ancient expansions (1: slow, 2: rapid), (b) single intermediary expansions (1: slow, 2: rapid), (c) single recent expansions (1: slow, 2: rapid), (d) ancient (slow) + recent (1: slow, 2: rapid) expansions and (e) intermediary (slow) + recent (1: slow, 2: rapid) expansions.
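The scenario design described above can be enumerated in a few lines. The following Python sketch is an illustration only, not the Simcoal input used in the study: the names (`ONSETS`, `RATES`, `generations_to_years`) are hypothetical, and the pairing of expansions in the successive scenarios follows one reading of the Figure 1 legend (a slow older expansion followed by a slow or rapid recent one).

```python
from itertools import product

GENERATION_TIME_YEARS = 25  # generation time assumed in the text

# Onset times in generations and growth rates per generation, as given in the text.
ONSETS = {"recent": 200, "intermediate": 800, "ancient": 2000}
RATES = {"slow": 1e-3, "rapid": 1e-2}


def generations_to_years(generations, generation_time=GENERATION_TIME_YEARS):
    """Convert a number of generations into years before present."""
    return generations * generation_time


# Single-expansion scenarios: every onset time combined with every growth rate.
single = [
    {"events": [(onset, rate)]}
    for onset, rate in product(ONSETS, RATES)
]

# Successive-expansions scenarios (assumed pairing): a slow older expansion
# (ancient or intermediate) followed by a slow or rapid recent expansion.
successive = [
    {"events": [(older, "slow"), ("recent", rate)]}
    for older, rate in product(("ancient", "intermediate"), RATES)
]

scenarios = single + successive

# Print a one-line summary of each scenario with onset times in years.
for scenario in scenarios:
    parts = [
        f"{rate} {onset} ({generations_to_years(ONSETS[onset])} years ago)"
        for onset, rate in scenario["events"]
    ]
    print(" + ".join(parts))
```

Under this reading, the grid yields six single-expansion and four successive-expansions scenarios, which matches the total of 10 scenarios stated in the text.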

Four marker types were simulated for each scenario: autosomal sequences, mitochondrial sequences, autosomal microsatellites and Y-chromosome microsatellites. For the autosomal sequences, we simulated 20 unlinked diploid sequences of 1300 base pairs (bp) to be consistent with existing short neutral sequence marker sets, such as the one developed by Patin et al.28 used in Aimé et al.20 We assumed a mutation rate of 2.5 × 10^−8 per generation and per site.29 For the mitochondrial sequences, we simulated a haploid sequence of 400 bp, corresponding to the HVS-I region, assuming a mutation rate of 10^−5 per generation per site.30, 31 For the autosomal microsatellites, we simulated 20 unlinked diploid loci per individual, assuming a mutation rate of 10^−4 per locus (ie, the lower bound of the uniform distribution that is generally used in the literature32). Finally, for the Y-chromosome microsatellites, we simulated ten linked haploid loci per individual, assuming a mutation rate of 2.1 × 10^−3 for each locus.33 For all markers, we assumed a current effective population size of Ne=50 000 (or 2 Ne=100 000 for autosomal markers). This value is consistent with estimated values for the current effective population size of post-Neolithic populations (eg, African farmer populations32). One hundred replicates were performed per scenario and per marker type.
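The marker settings listed above can be gathered into a small configuration table, which makes the size of the simulation design explicit. This is a hedged sketch: the `MarkerSpec` class and its field names are hypothetical and do not correspond to any Simcoal input format, and the number of sampled gene copies simply assumes two copies per individual for diploid markers and one for haploid markers.

```python
from dataclasses import dataclass
from typing import Optional

N_INDIVIDUALS = 100   # sampled individuals per data set (from the text)
N_SCENARIOS = 10      # demographic scenarios (from the text)
N_REPLICATES = 100    # replicates per scenario and marker type (from the text)


@dataclass(frozen=True)
class MarkerSpec:
    """Hypothetical container for the simulation settings of one marker type."""
    name: str
    n_loci: int
    ploidy: int                      # gene copies per sampled individual
    mutation_rate: float             # per site (sequences) or per locus (microsatellites)
    length_bp: Optional[int] = None  # sequence length; None for microsatellites


MARKERS = [
    MarkerSpec("autosomal sequences", 20, 2, 2.5e-8, 1300),
    MarkerSpec("mitochondrial sequence", 1, 1, 1e-5, 400),
    MarkerSpec("autosomal microsatellites", 20, 2, 1e-4),
    MarkerSpec("Y-chromosome microsatellites", 10, 1, 2.1e-3),
]


def gene_copies(spec, n_individuals=N_INDIVIDUALS):
    """Number of sampled gene copies per locus (assumes ploidy copies per individual)."""
    return n_individuals * spec.ploidy


# Total number of simulated data sets across scenarios, marker types and replicates.
total_datasets = N_SCENARIOS * len(MARKERS) * N_REPLICATES

for spec in MARKERS:
    print(spec.name, spec.n_loci, "loci,", gene_copies(spec), "gene copies per locus")
print("data sets to simulate and analyse:", total_datasets)
```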

Data analysis

The simulated data were analysed using the parametric approach implemented in BEAST v1.8, following the same procedure as in Aimé et al.4, 20, 25 Four demographic models are implemented in BEAST: constant effective population size (N0; constant model), population expansion with an increasing growth rate (g) (exponential model), population expansion with a decreasing growth rate (logistic model) and the expansion model, in which N0 is the present-day population size, N1 is the population size that the model asymptotes to going into the distant past and g is the exponential growth rate that determines how fast the population moves from a size near N1 to N0. As in Aimé et al.,20 we selected the best-fitting demographic model by estimating marginal likelihoods using two methods: path sampling and stepping-stone sampling.34 The model with the largest marginal likelihood was the expansion model in all cases, consistent with the parameters used for the simulations. It was thus considered the best-fitting model. To infer the current effective population size N0 and the growth rate g from the composite parameters estimated with BEAST (N0 μ and g/μ, where μ is the mutation rate), we used the μ values used to simulate the data (see Data simulation). We then inferred the dates of expansion onsets (t) using the following formula: t=(1/g) × ln(N0/N1), that is, the time needed to grow exponentially at rate g from N1 to N0, applied to each step of the MCMC algorithm.25
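As a worked illustration of the last step, the sketch below rescales composite parameters with a known μ and computes an onset time for each MCMC step. It is a minimal sketch under stated assumptions, not the script used in the study: the function names and the three-value layout of each step (N0 μ, N1 μ, g/μ) are hypothetical, and the onset time is written as ln(N0/N1)/g so that it is positive when the present-day size N0 exceeds the ancestral size N1.

```python
import math


def rescale(n0_mu, n1_mu, g_over_mu, mu):
    """Recover N0, N1 and g from composite parameters, given the mutation rate mu."""
    return n0_mu / mu, n1_mu / mu, g_over_mu * mu


def onset_time(n0, n1, g):
    """Generations needed to grow exponentially at rate g from n1 to n0."""
    if g <= 0 or n1 <= 0 or n0 <= n1:
        raise ValueError("expected g > 0 and n0 > n1 > 0 for an expansion")
    return math.log(n0 / n1) / g


def onset_times(chain, mu):
    """Apply the onset-time formula to every step of a chain.

    `chain` is an iterable of (N0*mu, N1*mu, g/mu) tuples (hypothetical layout).
    """
    return [onset_time(*rescale(*step, mu)) for step in chain]


# Toy check with made-up values: build one step from known N0, g and t,
# then verify that the onset time is recovered.
mu = 1e-4
true_n0, true_g, true_t = 50_000, 1e-3, 800
true_n1 = true_n0 * math.exp(-true_g * true_t)
chain = [(true_n0 * mu, true_n1 * mu, true_g / mu)]
estimated_t = onset_times(chain, mu)[0]
print(round(estimated_t, 6))
```

Computing t at every step, rather than once from point estimates, gives a full posterior sample of onset times from which a mode and an HPD interval can then be derived. In this sketch, steps that do not describe an expansion raise an error; how such steps should be handled in practice is left open.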

Results

Single expansion scenarios

We found that the three parameters (Ne, g and t) of these scenarios were correctly estimated in nearly all cases. Indeed, the true values of the three parameters were almost always included in the 95% HPD interval estimated on the simulated data sets (Table 2): for each parameter set and each marker type, this was the case for at least 96 of the 100 replicates. The estimates of Ne and t were unbiased in most cases, as their mean estimated values over the 100 replicates were close to the true values (see Supplementary Table S1), whereas the estimated growth rate showed some upward or downward bias depending on the case. Moreover, consistent with the simulated conditions, we found significant signals of expansion in all cases (ie, for each scenario and each replicate), as the HPD intervals for growth rates never included the value of 0.

Table 2 Number of replicates out of 100 for which the three parameters set in the simulations (Ne, g and t) were included in the 95% HPD interval of the estimations in the single expansion scenarios
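The counts reported in Table 2 rest on two operations: deriving a 95% HPD interval from posterior samples, and checking whether the simulated value falls inside it. The sketch below shows one common way to do this, taking the HPD interval as the shortest interval that contains 95% of the sorted samples. It is an assumption-laden illustration, not the procedure used by BEAST or its companion tools, and all function names are hypothetical.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of samples the interval must contain
    # Slide a window of k consecutive sorted samples and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def covers(interval, true_value):
    """True if the true value lies inside the interval (bounds included)."""
    low, high = interval
    return low <= true_value <= high


def coverage_count(replicates, truth):
    """Count replicates whose HPD intervals include every true parameter value.

    `replicates` is a list of dicts mapping a parameter name to its posterior
    samples; `truth` maps the same names to the simulated values.
    """
    return sum(
        all(covers(hpd_interval(rep[name]), value) for name, value in truth.items())
        for rep in replicates
    )


# Toy posterior samples (made up): 95 values near 10 plus 5 low outliers.
toy = [0.0] * 5 + [10.0 + 0.01 * i for i in range(95)]
interval = hpd_interval(toy)
print(interval, covers(interval, 10.5), covers(interval, 50.0))
```

A replicate is counted only when all three parameters are covered at once, which mirrors the criterion stated in the Table 2 caption.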

Successive expansions scenarios

When performing the BEAST analysis on the simulated mtDNA or autosomal sequences, under the expansion model, only the more ancient expansion event was detected in a vast majority of cases (Figure 2 and Supplementary Table S2) for these scenarios. Indeed, in at least 87 replicates out of 100, the 95% HPD intervals of the estimates for the three parameters included only the true values of the parameters of the ancient expansion, but not those of the recent expansion event. In the majority of the other cases (between 2 and 13 replicates depending on scenarios and markers), the true values of the parameters of the ancient and the recent expansion were all included in the 95% HPD intervals. In the few remaining cases (0 to 6 replicates), neither the parameters of the ancient event nor those of the recent event were included in the 95% HPD interval. Finally, it never occurred that only the true values of the recent event were included in the 95% HPD intervals.

Figure 2

Distributions of the number of replicates over 100 for which the oldest, the most recent, both and none of the simulated expansion events were correctly estimated, for each marker and scenario. An expansion event was considered to be correctly estimated when the three parameters set in the simulations (Ne, g and t) were included in the HPD (95% highest probability density) interval of the estimations. (a) Slow intermediate + rapid recent expansion, (b) rapid intermediate + rapid recent expansion, (c) slow ancient + rapid recent expansion and (d) rapid ancient + rapid recent expansion.
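The four categories plotted in Figure 2 follow from a simple rule applied to each replicate. The sketch below illustrates that rule with made-up HPD intervals; the function names, the dictionary layout and the numerical intervals are hypothetical, and only the classification criterion itself (all three parameters inside their 95% HPD intervals) comes from the legend.

```python
from collections import Counter


def includes(hpd, truth):
    """True if every true parameter value lies inside its HPD interval."""
    return all(hpd[name][0] <= value <= hpd[name][1] for name, value in truth.items())


def classify(hpd, ancient, recent):
    """Label a replicate by which simulated expansion event its intervals recover."""
    in_ancient = includes(hpd, ancient)
    in_recent = includes(hpd, recent)
    if in_ancient and in_recent:
        return "both"
    if in_ancient:
        return "oldest"
    if in_recent:
        return "most recent"
    return "none"


# True values for one successive-expansions scenario (onsets in generations).
ancient = {"Ne": 50_000, "g": 1e-3, "t": 2000}
recent = {"Ne": 50_000, "g": 1e-2, "t": 200}

# Made-up 95% HPD intervals for four replicates, one per category.
replicates = [
    {"Ne": (40_000, 60_000), "g": (5e-4, 2e-3), "t": (1500, 2600)},
    {"Ne": (40_000, 60_000), "g": (5e-3, 2e-2), "t": (150, 300)},
    {"Ne": (40_000, 60_000), "g": (5e-4, 2e-2), "t": (100, 2500)},
    {"Ne": (10_000, 20_000), "g": (5e-4, 2e-3), "t": (1500, 2600)},
]

counts = Counter(classify(hpd, ancient, recent) for hpd in replicates)
print(dict(counts))
```

Because both events share the same current effective population size, the Ne interval cannot distinguish them; the label is driven by the intervals for g and t.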

The pattern was strikingly opposite for the Y-chromosome and autosomal microsatellite markers, where only the recent expansion event was detected in the vast majority of cases. Indeed, for these markers, in at least 81 out of 100 replicates, only the true parameters of the most recent expansion event were included in the 95% HPD intervals. In most other cases, the HPD intervals of the estimates for the three parameters included those corresponding to both ancient and recent expansions (between 5 and 16 replicates). In a few cases (two to four replicates), neither the parameters of the ancient expansion event nor those of the recent one were included in the 95% HPD interval. Finally, it never occurred for these markers that the ancient expansion was included in the 95% HPD interval but not the recent one.

We also considered the mean estimated values and the mean 95% HPD interval over the 100 replicates. For the sequence data (autosomal or mitochondrial), the mean modal estimate of the expansion onset time (t) was closer to the onset time of the ancient event and its 95% HPD interval included this onset time, but not the onset time of the more recent event. Conversely, for the autosomal and Y-chromosome microsatellite data, the mean modal estimate of t was closer to the onset time of the recent event and the mean 95% HPD interval included this onset time, but not that of the more ancient one. The mean estimated growth rate (g) was intermediate between the growth rates of the ancient and recent expansions in all cases, while its HPD interval included the values corresponding to the most recent but not those corresponding to the more ancient expansion.

Discussion

First, we showed that, using a moderate number of neutral markers, the MCMC method implemented in BEAST allowed us to efficiently detect signals of expansion events and provided reliable estimates of effective population sizes, growth rates and starting times in at least 96% of cases under simple scenarios involving a single expansion event. As these scenarios were very simple (single expansion, absence of admixture, etc.), the correct estimation of the parameters could be expected to some extent. However, we had to check that this was indeed the case with the sample sizes and numbers of markers assumed here before interpreting our main results, which were obtained under conditions involving two successive expansions.

When we simulated these two successive expansion events, we found strikingly contrasting results for the different types of genetic markers. In particular, under several demographic scenarios involving two successive expansions, we detected only the most recent expansion event in at least 81 out of 100 replicates when using Y-chromosome or autosomal microsatellite data. Conversely, only the oldest expansion event was detected in at least 87 out of 100 replicates using mtDNA or autosomal sequence data. It is noteworthy that, for the range of parameters tested in this study, these effects depended neither on growth rates nor on time intervals between the two expansions.

Considering the results of this simulation study together with those from our previous studies on real data in human populations4, 20, 25 provides interesting insights into the demographic history of our species. Indeed, using mitochondrial and autosomal DNA sequences, we detected expansion events predating the Neolithic transition in multiple African and Eurasian populations.20 These results were consistent with previous genetic studies.19, 21, 22, 23, 24 Conversely, in Aimé et al.,25 we detected signals of expansion events concomitant with the Neolithic transition in the same African and Eurasian populations using microsatellite data. In Aimé et al.,4 we also found expansion events concomitant with the Neolithic transition in Eurasia using Y-chromosome microsatellites. We suggested that these apparently contrasting results might be explained by two successive expansion events, one during the Paleolithic and one during the Neolithic. The results from the present simulation study show that this is a plausible scenario.

It is worth noting that the finding of a Paleolithic expansion event in Africa is consistent with some paleoanthropological data. Indeed, radiocarbon dating suggested a demographic expansion in Africa 60 000–80 000 YBP.15 This Paleolithic demographic expansion could be linked to a rapid environmental change towards a drier climate and/or to the emergence of new hunting technologies,15, 35 which may have increased food availability. However, for Eurasia, the idea that demographic growth in most populations started during the Neolithic period with the emergence of farming and sedentarization is largely accepted among archaeologists and paleoanthropologists.14 Nevertheless, paleoanthropological remains might be too scarce for detecting the Paleolithic expansion. Indeed, it is often hard to evaluate population densities based on archeological remains only, especially for more ancient times, as the archaeological record becomes more fragmented.

Our study also highlights the advantages of simultaneously analysing different types of genetic markers when inferring the past demographic history of populations for any species. As for microsatellites, it is well known that mutation rates are variable among microsatellite markers.36 As we chose to consider a mutation rate of 10^−4 per locus per generation, which is the lower bound of the generally used uniform distribution,32 we certainly underestimated the difference in mutation rates between microsatellites and sequence data. This conservative assumption strengthens our conclusion that microsatellite markers and sequence data provide very different insights into the demographic history of human populations. Moreover, as the aim of this study was to investigate the consequences of using one type of marker rather than another when inferring the starting dates of past human expansions, we chose here to use growth rates and starting times close to those inferred in empirical studies in humans.19, 20, 21, 22, 23, 24, 25 It would be interesting to investigate other kinds of scenarios in future studies.

As we aimed here to perform a simulation study to better understand the results of our previous empirical studies on microsatellites and short DNA sequences, we simulated these kinds of markers in order to be consistent. In future work, it will also be interesting to simulate much larger data sets such as whole-genome data, which are now becoming increasingly common and might offer sufficient power to infer several successive expansions. In this context, it will be interesting to analyze the efficiency of methods like MSMC12 or PopsizeABC,37 which assume a model in which populations go through successive instantaneous changes in population size through time. It is noteworthy, however, that these methods cannot assume a parametric model with, for example, one or two expansions with an exponential growth rate. Such models will need to be developed to detect successive expansions and to allow direct comparison with the work performed here.