Introduction

In recent years, advances in both theoretical and empirical population genetics have highlighted the importance of taking population structure into account when interpreting the variability of a species and inferring demographic parameters. Most, if not all, populations are indeed not isolated, but linked in a network of demes exchanging migrants to some extent (that is, they belong to a metapopulation). Many recent works have shown that failing to take the structure of the metapopulation into account may lead to spurious signatures of effective population size changes through time (Chikhi et al., 2010; Peter et al., 2010; Stadler et al., 2009) and to biased estimates of the mutation rate (Navascues and Emerson, 2009). Moreover, models of population structure can also leave a genetic footprint similar to episodes of ancient hybridization (Eriksson and Manica, 2014).

Of particular importance in this context are range expansions (REs). Episodes of RE have been postulated for many species (Rendine et al., 1986; Ibrahim et al., 1996; Cavalli-Sforza and Feldman, 2003; Klopfstein et al., 2006; Boessenkool et al., 2009; Brown et al., 2010) but they have been quantitatively tested in a spatially explicit framework in only a few studies (Barbujani et al., 1995; Hamilton et al., 2005; Ray et al., 2005; Francois et al., 2008; Neuenschwander et al., 2008; Gaggiotti et al., 2009; Schneider et al., 2010; Mona et al., 2013). REs leave distinct signatures in the genetic diversity of populations (Barbujani et al., 1995; Ray et al., 2003; Edmonds et al., 2004; Travis et al., 2007; Sousa et al., 2014), proceeding as a series of founder events followed by continuous migration among neighboring demes (Austerlitz et al., 1997; Ray et al., 2003). In seminal work on the genetics of metapopulations, Wakeley (1999) and Wakeley and Aliacar (2001) elegantly separated the coalescent process of genes sampled in a structured population into two phases, namely the scattering and the collecting phase. The scattering phase, proceeding backward in time, starts at the present generation and ends when all sampled genes have either coalesced or migrated to different demes. At this point, the collecting phase begins: this period is characterized by a large number of migration events and a few coalescent events, which can happen only when two lineages end up in the same deme. The coalescent process during the collecting phase in a structured population was shown to have the same properties as in an unstructured population, with time scaled by the number of demes, deme size and migration rates (Wakeley, 1999).
A similar separation of timescales was also shown to hold in more complex and realistic models taking into account geographic features (Wakeley and Aliacar, 2001), in extinction/recolonization models (Wakeley and Aliacar, 2001; Wakeley, 2004), in continuous populations (Wilkins and Wakeley, 2002; Wilkins, 2004) as well as in range expansions (Ray et al., 2003; Wegmann et al., 2006; Nullmeier and Hallatschek, 2013; Mona et al., 2014).

Despite the growing interest in the genetic properties of REs, many of their features are not yet completely understood. Most of the attention has focused so far on REs in homogeneous environments (Ray et al., 2003), where the driving parameter of the genetic diversity is Nm (effective population size times the migration rate). This result mirrors Wright’s finite island model, where Nm determines not only the diversity of the metapopulation (Wakeley, 1998), but also its genetic structure (Wright, 1931). However, REs are spatially explicit nonequilibrium models where the colonization process starts from an ancestral deme. So far, the role played by the effective population size of the ancestral deme (hereafter Nanc) in shaping the genetic variation of the metapopulation has not been thoroughly investigated. Moreover, both Ray et al. (2003) and Wegmann et al. (2006) noted that, for low Nm, demes with large carrying capacity have a higher number of haplotypes compared with demes with low carrying capacity. They explained this finding by suggesting that, when N is low, gene lineages may co-migrate into the same deme and coalesce quickly. This increases the probability of identity and decreases the number of haplotypes compared with demes where N is larger. Nevertheless, no further analyses were conducted to test this hypothesis. To better characterize the effect of different N values given a fixed Nm on the pattern of genetic diversity as well as to define the influence of Nanc, we simulated REs using SPLATCHE2 (Ray et al., 2010). We found that for a fixed Nm: (1) intra-deme genetic diversity depends on N (that is, for increasing N values we found a monotonic increase in the number of haplotypes and in the number of segregating sites) and (2) increasing N leads to a skew in the site frequency spectrum toward an excess of rare variants, which in turn leads to more significantly negative values of Tajima’s D.
We compared these results with an equilibrium stepping stone model: fixing Nm, we observed an increase in genetic diversity for higher N as in the REs, but we found no influence on the frequency of rare variants. We also identified the role played by the ancestral population size of the founding deme, particularly on the degree of population structure of the whole metapopulation. We further demonstrated that it is possible to jointly estimate N and m in a RE using an approximate Bayesian computation (ABC) framework, provided that more than one deme has been sampled and at least a few independent loci have been typed.

Materials and methods

We simulated a range expansion of haploid individuals over a two-dimensional array of demes arranged on a lattice using the software SPLATCHE2 (Ray et al., 2010). Each deme exchanges migrants with its four nearest neighbors as in a two-dimensional stepping stone model (Kimura and Weiss, 1964) with bouncing edges. Simulations are done in two consecutive steps, namely the forward (demographic) and the backward (coalescent) steps. The forward simulation starts from a founding (or ancestral) deme that sends migrants to its neighboring demes. The founding deme is characterized by an effective population size Nanc, a parameter that plays no role in the forward process but determines the rate of coalescence in the backward phase (see below). Migrations to empty demes represent new colonization events. Each deme has an intrinsic growth rate r (the same value for all demes) and its density is logistically regulated by its carrying capacity (Ray et al., 2003). We note that for this reason the effective population size in each deme varies slightly each generation, but the effect of such oscillations on the genetic variability is negligible. For this reason and for the sake of comparison with previous work and with equilibrium models, we hereafter call the carrying capacity N. After the regulation step within each deme, migrants are sent to the four neighboring demes at rate m. The process is repeated generation by generation for each non-empty deme, resulting in a wave of advance of the whole population. The second phase of the algorithm then starts at the present generation, proceeding backward in time. The actual number of individuals present in a deme is used to compute the probability of a coalescent event, and the migration rates determine the probability of each sampled gene emigrating, backward in time, to the surrounding demes. The coalescent process stops after all genes have coalesced.
The lineages that did not coalesce in the scattering or in the collecting phase will coalesce in the ancestral deme, with a rate determined by Nanc.
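
The forward (demographic) step described above can be illustrated with a minimal deterministic sketch. It is not the SPLATCHE2 implementation: the discrete logistic update, the equal split of migrants over four directions, the rule that migrants directed off the lattice stay in their deme (one reading of "bouncing edges"), a founding deme that starts at carrying capacity, and the threshold of one individual for calling a deme colonized are all assumptions made for illustration. The function names are hypothetical.

```python
def logistic_step(n, r, k):
    """One generation of discrete logistic growth toward carrying capacity k."""
    return n + r * n * (1.0 - n / k)


def simulate_expansion(size, k, r, m, generations):
    """Deterministic wave of advance from deme (0, 0) on a size x size lattice.

    Returns the final densities and the generation at which each deme first
    held at least one individual.
    """
    grid = [[0.0] * size for _ in range(size)]
    grid[0][0] = float(k)  # assumption: the founding deme starts at capacity
    colonized_at = {(0, 0): 0}
    for gen in range(1, generations + 1):
        # Step 1: logistic regulation within each deme.
        grown = [[logistic_step(n, r, k) for n in row] for row in grid]
        # Step 2: each deme sends m * n / 4 in each of the four directions.
        new = [row[:] for row in grown]
        for i in range(size):
            for j in range(size):
                share = m * grown[i][j] / 4.0
                for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if 0 <= a < size and 0 <= b < size:
                        new[i][j] -= share
                        new[a][b] += share
                    # Otherwise the migrants bounce back and stay in (i, j).
        grid = new
        for i in range(size):
            for j in range(size):
                if grid[i][j] >= 1.0 and (i, j) not in colonized_at:
                    colonized_at[(i, j)] = gen
    return grid, colonized_at


# Small 5 x 5 example with N=100, r=0.5 and m=0.1 (Nm=10).
densities, colonized_at = simulate_expansion(5, 100, 0.5, 0.1, 300)
```

In this toy version the recorded colonization times increase with distance from the origin, which is the wave of advance the text refers to. The backward coalescent step would then use the recorded densities generation by generation.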

The size of the array of stepping stones was fixed to 50 × 50 demes and the origin of the expansion was set to position <1;1> of the square lattice (North-West corner). We tested six carrying capacities (N=20, 50, 100, 200, 500 and 1000), with migration rates varying accordingly to keep Nm fixed. To cover the population differentiation observed in real data, we tested four Nm values (Nm=1, 10, 50 and 100). The resulting Fst value would range from ~0.005 to ~0.33 in a finite island model. Expansion times (Texp) were set to 50 000, 15 000, 3000, 1000 and 500 generations ago and the ancestral population size to Nanc=100, 1000 and 10 000. These parameter combinations were chosen to represent a post-glacial expansion of species with different generation times and to explore a broad range of biologically plausible demographic scenarios. We simulated DNA sequences of 1000 bp with a mutation rate of 0.001 per generation for the whole locus and performed 1000 coalescent simulations for each demographic scenario. We sampled 9 fixed demes in the array, as displayed in Figure 1. From each deme we sampled 20 lineages. All these simulations were repeated under an instantaneous expansion model as in Hamilton et al. (2005), where all the demes are colonized at the same time (hereafter, IE model). Finally, we simulated an equilibrium stepping stone model for the four Nm values and the same combinations of N and m as above using fastsimcoal2 v2.5.1 (Excoffier et al., 2013). Simulations of the equilibrium stepping stone were performed with a mutation rate one order of magnitude lower than the one used in REs in order to obtain comparable levels of variability.
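
As a small worked example of the parameter grid, the sketch below derives the migration rate m = Nm/N for each carrying capacity and the island-model Fst for each Nm. The quoted range (~0.005 to ~0.33) is reproduced by the haploid approximation Fst ≈ 1/(1 + 2Nm); that this is the formula behind the quoted values is an assumption, and the function names are hypothetical.

```python
def migration_rate(nm, n):
    """Per-generation migration rate m that gives the target Nm for deme size n."""
    return nm / n


def island_fst_haploid(nm):
    """Approximate equilibrium Fst in an island model of haploid demes."""
    return 1.0 / (1.0 + 2.0 * nm)


carrying_capacities = [20, 50, 100, 200, 500, 1000]
nm_values = [1, 10, 50, 100]

for nm in nm_values:
    rates = [migration_rate(nm, n) for n in carrying_capacities]
    print(f"Nm={nm:>3}  Fst~{island_fst_haploid(nm):.3f}  m={rates}")
```

Note that m = Nm/N exceeds 1 for the smallest demes at the largest Nm values (for example N=20 with Nm=50), which is not a valid rate; how such combinations were handled is not stated in the text.
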
We computed several statistics to characterize the genetic diversity at the intra-deme level: the number of haplotypes H (that is, the number of different sequences present in the sample), the number of segregating sites S (that is, the number of polymorphic sites in the data) and the mean pairwise difference π (that is, the mean number of differences between all pairs of sequences in the sample). To detect departures from the Wright–Fisher model (that is, an isolated population of constant effective size) caused by the range expansion as well as by the presence of metapopulation structure, we computed two summary statistics and then tested their departure from mutation–drift equilibrium: Tajima’s D, which is based on the comparison of two estimators of the scaled mutation rate θ (namely θS and θπ), and Fu’s Fs, which evaluates the probability of having a number of haplotypes similar to or smaller than the observed value given the observed mean pairwise difference. To determine the extent of genetic differentiation between demes we computed φst (based on molecular distance metrics; Excoffier et al., 1992) and Fst (based on allele frequencies). All summary statistics were calculated using Arlequin 3.5 (Excoffier and Lischer, 2010). Intra-population statistics were computed on the central deme of the array as well as on the nine pooled demes. φst and Fst were computed between the central and the bottom left deme (Figure 1), as well as among all demes.
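
The intra-deme statistics defined above can be computed directly from a set of aligned sequences. The sketch below is not Arlequin's code: it uses the textbook definitions of H, S and π and the standard formula for Tajima's D (Tajima, 1989), the function name is hypothetical, and Fu's Fs and the significance tests are omitted.

```python
from itertools import combinations
from math import sqrt


def summary_statistics(seqs):
    """Return (H, S, pi, Tajima's D) for aligned sequences of equal length.

    D needs at least four sequences; it is returned as None when there are
    no segregating sites.
    """
    n = len(seqs)
    h = len(set(seqs))  # number of distinct haplotypes
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    if s == 0:
        return h, s, pi, None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # D compares theta_pi (pi) with theta_S (S / a1).
    d = (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
    return h, s, pi, d


# A sample in which every mutation is a singleton gives a negative D.
singletons = ["AAAA", "TAAA", "ATAA", "AATA", "AAAT"]
h, s, pi, d = summary_statistics(singletons)
```

An excess of rare variants raises S faster than π, which is why D becomes negative in the toy sample.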

Figure 1

The simulated array of 50 × 50 demes. Black dots: sampled demes. Gray dot: founding deme (origin of the range expansion). Gray circles: demes sampled to compute φst and Fst presented in the text.

To test the ability to estimate the four parameters of our RE model (N, m, Nanc and Texp), as well as the Nm product, we developed an ABC framework (Beaumont et al., 2002; Beaumont, 2010; Bertorelle et al., 2010) in four steps. Step 1: we created our reference tables for the later estimation process by performing 100 000 simulations sampling from uniform prior distributions for N (U: 5–2000), m (U: 0.004–1), Texp (U: 500–20 000) and Nanc (U: 50–15 000). The resulting compound prior for Nm is very close to an exponential distribution with rate 0.002 bounded within 0–2000. We sampled 9 demes and 20 lineages per deme (Figure 1) and simulated 1000 independent loci, each 1000 bp long, with a mutation rate of 0.001 per generation for the whole locus as above. Step 2: we computed two sets of summary statistics for the ABC analysis: (1) SET1: the intra-population statistics (H, S, π, D and Fs) in the 9 pooled demes and φst (for a total of six summary statistics) and (2) SET2: the same five intra-population statistics computed in the central deme only. Each statistic was averaged over the loci. Step 3: we generated pseudo-observed data sets (pods) by randomly extracting 1000 observations from the reference table, recording the corresponding vector of prior values. Step 4: we analyzed the pods using the ABC algorithm, with a weighted local linear regression (Beaumont et al., 2002) performed on the 5% of simulations closest to each pod. For each analyzed pod we retained the mode and the median of the posterior distribution and plotted them against the ‘real’ value recorded in step 3. We also computed the scaled root mean square error (SRMSE) (Walther and Moore, 2005) for both the median and the mode, as well as the 95% coverage. A power analysis was carried out by subsampling from the total number of loci (1, 10 and 100) for both sets of summary statistics and sampling schemes.
To determine which statistics are more informative and the importance of pooling demes, we defined three other sets of summary statistics: SET3 (summary statistics computed on all demes separately plus φst, for a total of 46 statistics), SET4 (the same as SET1 without φst) and SET5 (the same as SET1 without the neutrality tests). Summary statistics used in each set are listed in Table 1.
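
The estimation step (step 4) can be sketched for the simplest possible case: one parameter and one summary statistic. The code keeps the 5% of simulations closest to the observed statistic, weights them with an Epanechnikov kernel and applies a local linear regression adjustment in the spirit of Beaumont et al. (2002). The simulator is a hypothetical stand-in, not the RE model, and the function names are illustrative only; the analysis in the text uses several statistics and parameters at once.

```python
import random


def abc_regression(reference, s_obs, tolerance=0.05):
    """Rejection ABC with local linear adjustment for one summary statistic.

    reference: list of (theta, s) pairs simulated from the prior.
    Returns the regression-adjusted parameter values of the retained simulations.
    """
    ranked = sorted(reference, key=lambda ts: abs(ts[1] - s_obs))
    kept = ranked[: max(2, int(len(ranked) * tolerance))]
    delta = abs(kept[-1][1] - s_obs)  # distance of the farthest retained point
    # Epanechnikov kernel weights: closer simulations count more.
    w = [1.0 - ((s - s_obs) / delta) ** 2 for _, s in kept]
    x = [s - s_obs for _, s in kept]
    y = [theta for theta, _ in kept]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    # Shift each retained value to where it would sit if s equalled s_obs.
    return [yi - slope * xi for xi, yi in zip(x, y)]


def toy_simulator(theta, rng):
    """Hypothetical stand-in for a simulator: a noisy summary of theta."""
    return theta + rng.gauss(0.0, 0.5)


rng = random.Random(1)
reference = []
for _ in range(20000):
    theta = rng.uniform(0.0, 10.0)  # draw the parameter from a uniform prior
    reference.append((theta, toy_simulator(theta, rng)))

posterior = sorted(abc_regression(reference, s_obs=4.0))
median = posterior[len(posterior) // 2]
```

Treating each row of the reference table in turn as the observed data and comparing the resulting median (or mode) with the value that generated it gives the cross-validation described in steps 3 and 4.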

Table 1 Summary statistics used in the approximate Bayesian computation (ABC) cross-validation procedure

Results

Unless otherwise specified, we present and discuss results for REs relative to: (1) the nine pooled demes for the statistics computed at the intra-population level (Figure 1); (2) the central deme for the neutrality tests; and (3) the central and the bottom left deme for Fst and φst (Figure 1). The other sampling schemes displayed a similar pattern. We note that the grid of demes was not fully colonized for Nm=1 and Texp≤1000, and therefore we do not report results of these simulations.

For fixed Nm we observed a monotonic increase in genetic diversity (as measured by H, S and π) for growing N (Figure 2). Both H and S showed a dependency on N for all the Texp tested as well as for the equilibrium stepping stone model (Figure 2 and Supplementary Figures S1 and S2). Conversely, we observed a different pattern for π, which is insensitive to N for low Texp but depends on it for the highest expansion time (Texp=50 000) and in the equilibrium stepping stone model (Supplementary Figure S3). The excess of private haplotypes for larger N leads to an increase of H and S that is reflected in more negative D and Fs (Figure 3 and Supplementary Tables S1–S4). The influence of N on the neutrality tests decreases for high Texp and is lost when the process reaches equilibrium (Figures 3 and 4). This result can be explained as follows. For fixed Nm more mutations will occur for higher N. The length of the internal branches of the genealogy is constrained by Texp, implying that this excess of mutations will be concentrated in the terminal branches. These new mutations are therefore at low frequencies, having almost no influence on π. Conversely, for large Texp and in the stepping stone the internal branches will be longer so that mutations will have time to occur on them. In this case, these mutations will be at intermediate frequencies, having an impact on π (and, as a consequence, on the neutrality tests). The percentage of significant outcomes (as calculated by Arlequin 3.5), both positive and negative, is also related to N for all four Nm values tested (Figure 4 and Supplementary Tables S1–S4). Low Nm can produce false signatures of bottlenecks in equilibrium models at the deme level (Chikhi et al., 2010). We found a similar result in the RE model, where for Nm=1 many data sets display a significant signature of reduction of the effective population size (Supplementary Table S3).
However, when N grows from 20 to 1000, the number of significantly positive Tajima’s D values drops considerably (Supplementary Table S3). Similarly, the power to detect a RE depends on Texp (Ray et al., 2003): here we show that even for large Texp the probability of detecting an expansion increases substantially for larger N (Figure 4), and the probability of erroneously inferring a bottleneck drops (Supplementary Tables S1–S4).

Figure 2

Distribution of the number of haplotypes (H) for Texp=3000 and Nanc =100 for four Nm values computed on the pooled demes. N values on the x axis.

Figure 3

Distribution of Tajima’s D for Nm=10 in a RE for various values of Texp and the equilibrium stepping stone model in the 9 pooled demes.

Figure 4

Proportion of significant negative outcomes of Tajima’s D for the four Nm values as a function of N in the nine pooled demes.

We present the φst and Fst distributions in REs for Nm=10 and Nanc=100 in Figure 5 and compare the results with the equilibrium stepping stone model. We observed a negative correlation between both measures of population differentiation and N when Texp≥3000 (Figure 5). This result is consistent with the monotonic increase of diversity for larger N: higher heterozygosity tends to lower the degree of population differentiation, in particular when it is measured from allele frequencies (Hedrick, 2005). However, φst does not depend on N in the equilibrium stepping stone, whereas the behavior of Fst is consistent with that of nonequilibrium models. We found a similar trend for all the Nm values we tested, ranging from 1 to 100 (Supplementary Figures S4–S6). However, the decay of population differentiation for large N is particularly strong for Nm=1, where the median value of Fst is ~0.8 for N=20, falling to ~0.35 for N=1000. This difference is weaker for φst (decreasing from ~0.8 to ~0.6), consistent with the suggestion that measures based on molecular differences can better correct for the heterozygosity excess (Hedrick, 2005). For Texp≤1000 the pattern is more complex and is not consistent between the two differentiation measures. First of all, we observed a slight increase in the mean and in the variance of the distribution of φst for higher N (Figures 5, 6, 7 and Supplementary Figures S5 and S6), whereas no dependency on N was found for Fst. Second, the monotonic increase in φst for larger N is more pronounced when Nanc=10 000. This pattern is more pronounced in IEs than in REs, where we also found similar results for Texp=3000 when Nanc=10 000 (Figure 7 and Supplementary Figure S6). Independently of Nm, no differences related to N were observed for Fst. The genome-wide distribution of measures of population differentiation is therefore dependent on the interaction between N, Texp and Nanc in the RE model.
Detecting outlier loci in genomic scans for selection will therefore require a good knowledge of all these parameters when investigating structured species.

Figure 5

Distribution of φst (on the left of the dotted line) and Fst (on the right of the dotted line) for Nm=10 in a RE for various values of Texp and the equilibrium stepping stone model, Nanc fixed to 100.

Figure 6

Distribution of φst (on the left of the dotted line) and Fst (on the right of the dotted line) for Nm=10 in a RE for two values of Texp.

Figure 7

Distribution of φst (on the left of the dotted line) and Fst (on the right of the dotted line) for Nm=10 in an IE for two values of Texp.

We performed a cross-validation ABC procedure to investigate the power to jointly estimate N and m in the RE model. We present the modes of the estimated N in Figure 8 and Supplementary Figure S7 for all the combinations of sets of summary statistics (SET1 and SET2) and number of loci (1, 10, 100 and 1000). The black line represents a perfect match: the closer the points are to the line, the closer the estimated mode is to the ‘real’ parameter. As expected, increasing the number of loci reduces the variance of the modes around the line. When we compared SET1 and SET2, we found that pooling more demes increases the performance of the ABC procedure, particularly when a small number of loci is used. A consistent pattern was found for all other parameters (Supplementary Figures S8–S10), with the exception of Nanc, which was poorly estimated in all cases (Supplementary Figure S11). We report the SRMSE averaged over the priors for both the mode and the median for all four parameters of the RE model and the Nm product in Supplementary Table S5. An SRMSE of one means that the mode (or median) deviates from the real value, on average, by an amount equal to the real value itself. SRMSE decreases with the number of loci and is always smaller in SET1 compared with SET2. We noted that the accuracy in the estimate of Nm is slightly higher than that of N, whereas the accuracy for m is considerably worse than for both of them. Adding more loci did not improve the estimate of Nanc, which was unexpectedly poor. Conversely, a few loci were enough to correctly estimate Texp with any of the sets of summary statistics considered. We further tested three more combinations of summary statistics. First, we evaluated whether pooling more demes increases the performance in the ABC estimation process. To this end, we used the intra-deme summary statistics calculated separately in the nine demes: adding φst we defined SET3, characterized by 46 summary statistics.
SRMSE values were similar to those of SET1 (Supplementary Table S5), suggesting that it is important to sample more demes but not necessarily to pool them. We note however that pooling reduces the number of predictors in the regression step of ABC and could be particularly helpful when studying large data sets (that is, composed of many demes). We then investigated the role played by φst by removing it from SET1 (thus obtaining SET4). We obtained results very similar to SET2 (Supplementary Table S5): migration rate is poorly estimated and, as expected, Nm shows an SRMSE twice as large as the one found with SET1. Finally, we obtained SET5 by removing D and Fs from SET1. We observed a decrease in the performance of the estimation of Texp, N and m, all of which showed an SRMSE twice as large as with SET1. Conversely, the compound parameter Nm was unaffected.
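
The two accuracy measures used in the cross-validation can be written compactly. The sketch below assumes that the SRMSE is the root mean square of the relative errors, (estimate − real)/real, which is consistent with the interpretation of an SRMSE of one given in the text but may differ in detail from the exact scaling used; the function names are hypothetical.

```python
from math import sqrt


def srmse(estimates, true_values):
    """Root mean square of the relative errors (estimate - true) / true."""
    rel = [((e - t) / t) ** 2 for e, t in zip(estimates, true_values)]
    return sqrt(sum(rel) / len(rel))


def coverage(intervals, true_values):
    """Fraction of (low, high) credible intervals that contain the true value."""
    hits = sum(1 for (lo, hi), t in zip(intervals, true_values) if lo <= t <= hi)
    return hits / len(true_values)


# Estimates that are always twice the real value give an SRMSE of one.
example = srmse([2.0, 4.0], [1.0, 2.0])
```

With well-calibrated posteriors, the coverage of the 95% credible intervals over the pods should be close to 0.95.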

Figure 8

Cross-validation of N. Modes (y axis) are estimated under the ABC framework using the SET1 and plotted against the real values (x axis). The analyses are performed with 1, 10, 100 and 1000 loci (see text for details).

Discussion

Empirical population genetics is now experiencing a revolution. Thanks to the spread of next-generation sequencing technology, an impressive amount of genetic data is now easy to collect in both model and nonmodel organisms. These huge amounts of data carry precious information on the demographic and selective history of populations and species. However, having a lot of data has a potential drawback: the interpretation of genetic variation can be seriously misleading when an incorrect demographic model is applied. Most (if not all) species are structured, that is, they are composed of many demes exchanging migrants to various extents. The two structured models that have been explored the most in theoretical population genetics are the island and the stepping stone model, both of which are at equilibrium. It is therefore urgent to explore nonequilibrium structured models in greater detail and to integrate spatial components. It is for this reason that we explored REs here. Previous studies have shown that in a homogeneous environment the main driver of genetic diversity after a RE is the compound Nm product (Ray et al., 2003). Here we focus on the influence of N for fixed Nm values and on the role played by Nanc in shaping the genetic diversity. Finally, we tested how well we can recover the demographic parameters after a RE using an ABC approach and how many demes need to be sampled to obtain accurate estimates. None of these issues has been investigated before and they are likely to be informative for empirical studies.

To better understand the specificity of REs, we compared our results with an equilibrium stepping stone model, fixing the Nm product and exploring various combinations of N and m. We considered both the variability within deme and the population differentiation computed among demes. At the intra-deme level, we found a monotonic increase in genetic diversity in both RE and stepping stone for increasing N. However, we found two important characteristics of REs: (1) an excess of low frequency haplotypes for increasing N and (2) a complex interaction between N, Nanc and Texp in shaping the degree of population differentiation. It is for these reasons that we could estimate N and m using an ABC approach. Below we provide some coalescent arguments to explain our findings.

Signatures of REs

The coalescent in a metapopulation can be described in terms of scattering and collecting phases (Wakeley, 1999). In RE models the number of emigrants sent out each generation is Nm, whereas the within-deme probability of coalescence is determined by N only. Therefore, for a fixed Nm the scattering phase will be longer for higher N and fewer within-deme coalescences will happen. This has three consequences for higher N: (1) the number of lineages entering the collecting phase will be higher; (2) H will be higher and there will be more private segregating sites, as in RE models mutations can also occur during the scattering phase; and (3) the ratio of external to internal branch lengths increases. Indeed, we observed a monotonic increase in both the number of haplotypes and the number of segregating sites for larger N when we fixed Nm (Figure 2 and Supplementary Figure S1). This is also true for the equilibrium stepping stone and it is therefore related to the increase of the scaled mutation rate (Supplementary Figure S2). However, the distribution of mean pairwise differences follows a different pattern: we observed a monotonic increase for larger N only when Texp=50 000 and at equilibrium. Both θπ and θS are estimators of the scaled mutation rate θ, and Tajima’s D is based on their difference, which is centered around zero under neutrality (Tajima, 1989). For a fixed Nm, if S increases monotonically with N but π remains constant, the difference between the two estimators will become negative, a signature typical of population expansion. For this reason, the power to detect a RE for fixed Nm depends heavily on N (Figure 4). This has practical implications: the signature of a RE will be lost more quickly in species characterized by small-sized demes. This is clearer when we compare the results of Nm=10 between RE and the equilibrium stepping stone, where N does not play a role in determining the shape of the genealogy (Figures 3 and 4).
Moreover, in equilibrium models, low Nm gives rise to false signatures of population bottlenecks (Chikhi et al., 2010). This is also true in REs (Ray et al., 2003), but the number of significantly positive outcomes of Tajima’s D drops when the deme size increases (Supplementary Table S3). We argue that in empirical studies, particularly those concerning endangered species, N and m must be estimated separately to fully recover the underlying demography.

Population differentiation

The collecting phase dominates the coalescent process in equilibrium models (Wakeley, 1999), whereas its length in RE/IE will depend on both N and Texp. In RE/IE, backward in time, lineages will contract toward the founding/ancestral deme: for a fixed Nm there will be fewer coalescences with higher N and more lineages entering the ancestral phase of the coalescent process. The number of lineages surviving until the ancestral phase is also influenced by Texp. Indeed, when Texp is low, there is less opportunity for the sampled lineages to coalesce in the array of demes. The surviving lineages will coalesce in the ancestral deme according to a Kingman coalescent dependent on Nanc only, accumulating mutations proportionally to Nanc and the mutation rate. Therefore, for fixed Nm larger N will increase the heterozygosity and the number of lineages reaching the founding deme, the two phenomena having opposite effects on the degree of population differentiation. The increased within-deme heterozygosity tends to lower both φst and Fst (Hedrick, 2005; Meirmans, 2006), causing a negative correlation between N and the genetic differentiation between demes (Figure 5 and Supplementary Figure S4). This is clearer for Texp≥3000 because the difference in heterozygosity among the tested N values is of higher magnitude. We found a similar behavior in the equilibrium stepping stone (Figure 5), consistent with the observed pattern of intra-deme diversity. Conversely, more lineages coming from different demes will arrive in the ancestral phase for larger N and Texp <3000. Here, the occurring mutations enhance the between-deme component of molecular variance, increasing φst but not Fst for increasing N (Figures 5, 6, 7 and Supplementary Figures S5 and S6). When Texp≤1000 generations, under our parameter combination, the influence of N on φst is remarkable (Figure 4). The larger the Nanc, the higher will be the average φst and its variance (Figures 6 and 7 and Supplementary Figures S5 and S6).
This effect is stronger for IE (Figure 7 and Supplementary Figure S6), where the number of lineages arriving in the ancestral phase is larger because the contraction phase typical of RE, in which many coalescent events happen before arriving in the ancestral deme (Excoffier et al., 2009), is absent. These findings have important implications for the analysis of real data. For example, invasive species are typical instances of young REs: the genome-wide distribution of φst will differ enormously between species characterized by small- vs large-sized demes. Many tests of selection are based on the idea of finding outliers in the distribution of some summary statistic such as φst. A good knowledge of the effective deme size is therefore needed to generate a correct null distribution and avoid false positive results in genomic scans for selection.
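
The statement that lineages reaching the ancestral deme accumulate mutations proportionally to Nanc and the mutation rate follows from standard Kingman coalescent expectations. The sketch below assumes a haploid Wright–Fisher deme of constant size Nanc, so that each pair of lineages coalesces at rate 1/Nanc per generation, and treats the number k of lineages entering the ancestral phase as given; the function name is hypothetical.

```python
def kingman_expectations(k, n_anc, mu):
    """Expected TMRCA, total tree length and mutation count for k lineages.

    While j lineages remain, the expected waiting time to the next
    coalescence is n_anc / C(j, 2) generations.
    """
    tmrca = 0.0
    length = 0.0
    for j in range(2, k + 1):
        wait = n_anc / (j * (j - 1) / 2.0)
        tmrca += wait
        length += j * wait  # j branches are extended during this interval
    return tmrca, length, mu * length


# Ten surviving lineages, per-locus mutation rate 0.001 as in the simulations.
for n_anc in (100, 1000, 10000):
    tmrca, length, mutations = kingman_expectations(10, n_anc, 0.001)
    print(f"Nanc={n_anc:>5}  E[TMRCA]={tmrca:.0f}  E[mutations]={mutations:.2f}")
```

Both the expected time and the expected number of mutations scale linearly with Nanc, so a tenfold larger ancestral deme contributes ten times more ancestral mutations for the same number of surviving lineages.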

Estimating parameters of a RE

The influence of the carrying capacity on the genetic variability at both the intra- and inter-population levels suggested that it should be possible to jointly estimate N and m using summary statistics. To this end, we developed an ABC framework and checked by cross-validation the power to estimate the parameters of a RE. We varied the number of simulated loci, the number of demes sampled and the set of summary statistics chosen. To be consistent with all the previous simulations, we computed inter- and intra-population summary statistics in both a single deme and a pool of lineages coming from all nine demes. We measured the accuracy of the estimation of all parameters by computing the SRMSE averaged over the prior distributions. For any combination of loci/demes/summary statistics, we obtained a similar accuracy in the estimation of N and Nm (Supplementary Table S5). This shows unambiguously that N affects the distribution of genetic variability, playing a role that can be easily disentangled from that of m. Several independent loci need to be typed to take into account the stochastic variability of the coalescent process; however, going from 100 to 1000 loci only slightly improves the accuracy of our estimates, suggesting that, at least under our model, a relatively small set of loci should provide consistent results. When using summary statistics computed in pooled lineages (SET1) or in the nine demes independently (SET3), the performance of the ABC procedure increases drastically compared with SET2, where statistics from a single deme are used (Figure 8 and Supplementary Figures S7–S11). The number of demes studied is therefore important to correctly characterize the parameters of REs. Summary statistics convey different information on each parameter. As expected, to estimate the migration rate m it is crucial to include φst in the ABC algorithm (Supplementary Table S5).
Indeed, even when we pooled demes, the SRMSE is higher than one for m if we do not include φst, whereas the drop in accuracy is less drastic for the compound parameter Nm (SET1 and SET4, Supplementary Table S5). Similarly, neutrality test statistics are also useful to disentangle N and m (Supplementary Table S5, SET5), consistent with the finding that N influences the power to detect a RE for fixed Nm. It should be possible to devise an optimal sampling strategy in terms of number of demes and number of loci, but this is beyond the scope of this paper. We only highlight here the importance of sampling more than one deme to estimate the parameters of a RE and of computing a measure of population differentiation to estimate the migration rate. Despite the large number of loci studied, we could not correctly estimate Nanc. Contrary to N, which plays a direct role in the intra-population diversity and in the measures of differentiation, Nanc interacts with N and Texp to shape Fst and φst. We argue that for this reason it may not be identifiable under our setting. At the same time, we note that other summary statistics (for example, combining the variance of both Fst and φst) may provide more inferential power for the estimation of this parameter, which will be the subject of further investigations.

Conclusions

We show that, for fixed Nm, the carrying capacity of demes plays an important role during a RE at both the intra- and inter-population levels. Species are unlikely to be at equilibrium. We therefore stress the importance of estimating N and m separately as well as investigating the ancestral phase of the coalescent process. Our results highlight that N and m can be disentangled provided that many demes and a consistent set of summary statistics are used in the estimation process (for example, using an ABC algorithm). This would be extremely useful in conservation studies and also when investigating invasive species. Finally, we note that species undergoing recent REs will retain the signature of the ancestral demography, which could be uncovered by focusing on the genome-wide distribution of population differentiation.

Data archiving

There were no data to deposit.