Introduction

The build-up of species neutral genetic diversity through time is driven by the balance between recombination, mutation and drift and its spatial distribution structured by dispersal and gene flow. As these processes are tightly linked, understanding of mutational processes can help describe evolutionary processes in play better. This is especially important when studying recently diverged populations or species where the relative contribution of divergence by drift or natural selection, and gene flow, needs to be assessed to predict species evolutionary trajectory. Recent developments of sequence-based microsatellite genotyping (Darby et al. 2016; De Barba et al. 2016; Vartia et al. 2016) have improved our understanding of molecular variation at microsatellite repeats and in surrounding flanking sequences (Barthe et al. 2012; Viruel et al. 2018; Lepais et al. 2020). Several studies demonstrated that jointly accounting for linked microsatellite and flanking sequence substitutions provides complementary information from polymorphism evolving at different rates and by different mechanisms (Hey et al. 2004; Ramakrishnan and Mountain 2004; Payseur and Cutter 2006). Assuming complete linkage due to short physical distance, linked polymorphisms share the same genealogical history. As a result, microhaplotypes combining fast evolving microsatellite repeat variations with the much slower evolving flanking sequence, should provide refined insights about past demographic history with improved estimation of temporal range and resolution (Ramakrishnan and Mountain 2004).

For instance, Hey et al. (2004) adapted the Isolation with Migration model (IMa) to compound markers, linking flanking sequence haplotype and microsatellite (HapSTR) with an illustration on two Cichlid species who recently diverged with gene flow. Using one such compound marker system improved the inference of the demographic parameters that were difficult to estimate using a single type of marker. They highlighted additional avenues for progress, such as releasing constrains on the stepwise mutation model for the microsatellites and the need to use several independent loci in future applications. Conversely, Ramakrishnan and Mountain (2004) defined SNPSTR as one or more Single Nucleotide Polymorphism (SNP) linked to a microsatellite and demonstrated their use to estimate the timing of recent population divergence using both simulated and real data. They found the divergence time estimated from microsatellite variation on the background of the derived SNP allele to be more accurate and precise that the estimation based on the microsatellite variation alone. Their simulation showed that 20 SNPSTR markers provided a much better estimation that 100 microsatellites, and that the inference was robust to complex demographic scenarios involving gene flow or both population expansions and bottlenecks. In addition, divergence time between African and non-African human populations inferred from microsatellite variation alone was strongly underestimated compared with the estimation based on microsatellite variation linked to the derived SNP alleles (Ramakrishnan and Mountain 2004). In this study, the use of a subset of the data only, i.e. the microsatellite alleles linked to the derived SNP alleles aiming to reduce homoplasy (alleles identical by state but not by descent), led to much better inferences. This clearly illustrates the wealth of information that can be gained from contrasted mutation rates and mechanisms. On a theoretical ground, assuming an island model at migration drift equilibrium, genetic differentiation is related to gene flow by Fst ≈ 1/ (1 + 4Ne.(m + µ)) (Crow and Aoki 1984). As a result, genetic differentiation at equilibrium is expected to be lower for fast-evolving markers (such as microsatellites) compared to slowly-evolving markers (such as SNP). In addition, correlation between genetic diversities estimated at linked markers should also inform about non-equilibrium processes (Payseur and Cutter 2006), with a decrease in correlation following a demographic event such as a bottleneck.

Despite the clear demonstrated potential of using contrasted polymorphisms, these compound marker systems have been rarely studied because their genotyping is challenging (Mountain et al. 2002; Šarhanová et al. 2018; Barthe et al. 2012). This technical limitation does not hold true anymore thanks to the development of sequence-based microsatellite genotyping that greatly improves the characterisation of microsatellite and linked flanking sequences (Bradbury et al. 2018; Viruel et al. 2018; Curto et al. 2019; Layton et al. 2020). However, now that data from numerous marker systems can be easily collected at population scale, development of statistical approaches that harness this new source of information is needed for meaningful inference (Payseur and Cutter 2006).

Trees are a model of choice to study broad-scale postglacial colonisation patterns following the end of the last glacial maximum (Hewitt 1999). Their unique life history traits, such as long generation time, and the use of uniparentally inherited chloroplast DNA markers, enable the reconstruction of glacial refugia and postglacial recolonization routes (Petit et al. 2002) or even Tertiary geological events (Magri et al. 2007). Nevertheless, studies conducted at a more local scale in regions known to harbour glacial refugia identified fine-scale population genetic structure within refugia (Gómez and Lunt 2007), highlighting the value of more spatially focused studies to understand complex evolutionary processes such as vicariance and secondary contact and gene flow at play within these relict populations throughout the Pleistocene and the Holocene (Rodríguez-Sánchez et al. 2010; Feliner 2014). While the Mediterranean basin is a major centre of biodiversity for the European biota, studies remain scarce on the North African side of the Mediterranean basin, which harbours biodiversity hotspots and potential glacial refugia (Médail and Diadema 2009; Lepais et al. 2013; Feliner 2014). Understanding evolutionary processes in the region is particularly pressing as anthropogenic pressure is rising and increases threats posed by ongoing climate change (Hampe and Petit 2005).

Quercus canariensis Willd. and Q. faginea Lam. are two oak species that are a good illustration of both the complexity of evolutionary processes that shape present-day biodiversity within species, and the threats faced by rear-edge populations. While the taxonomy of Q. canariensis is well defined, the taxonomic status of Q. faginea has been debated among taxonomists with two sub-species (broteroi and faginea) in Algeria, among the four subspecies described in Q. faginea, differing by morphological and environmental characteristics (Aissi et al. 2021). The origin of such high level of intraspecific variability is unclear and might involve a combination of genetic drift in isolated populations (Moracho et al. 2016) and interspecific gene flow. Localized at the rear-edge of the distribution range of the species in an area of putative refugia and regional hotspots of biodiversity (Médail and Diadema 2009), these populations are under the direct threat of contemporary environmental changes. A better understanding of the origin of this intra-specific variability and their recent demographic trajectory is paramount to evaluate the evolutionary value and conservation priority status of these populations.

The aim of this study is to assess the value of the joint analysis of microsatellite and flanking sequence variations for population genetic inference in the case of populations or species with a complex history of drift and gene flow. As a proof of concept, we studied molecular variation at 36 sequenced microsatellites across seven Q. canariensis and four Q. faginea rear-edge populations to decipher their taxonomic status, past evolutionary history and recent demographic trajectory.

Given long pollen dispersal distance, large effective population size and long generation time typical of wind-pollinated tree species such as oaks, we expect high local genetic diversity and low regional genetic structure at neutral nuclear loci. Alternatively, limited pollen dispersal between small remnant forest stands isolated by habitat fragmentation may have led to genetic drift and reduced effective population size, resulting in low local genetic diversity and high regional genetic differentiation typical of relictual populations (Hampe and Petit 2005). These two putative demographic histories can be further confounded by historical or contemporary interspecific gene flow that will additionally blur species limits and contribute to reshuffle genetic diversity across populations.

We first compared patterns of molecular variation and genetic structure observed from different polymorphisms (variation of microsatellite repeat number, SNP in flanking sequence, allele size, all polymorphisms combined as microhaplotype) using genetic diversity and differentiation indices to assess the correlation or complementarity and the resolution power of different polymorphisms. We expect homoplasy within microsatellite loci to blur species limits, but with a weakest effect when accounting for all polymorphisms.

To identify the most likely combination of historical and contemporary processes that shape the current population genetic structure, we used a simulation-based inference approach that is flexible enough to model specific linked-polymorphisms observed at each locus for alternative past demographic scenarios. We derived specific summary statistics, synthetizing population-level molecular variation, that account for linked microsatellite and flanking sequence polymorphisms. We then applied a Random Forest implementation of Approximate Bayesian Computation (ABC) that retrieves the information content of each summary statistic and of each type of polymorphism to differentiate between concurrent demographic models and to estimate mutational and demographic parameters for the most likely demographic scenario. For historical model selection, we expect summary statistics derived from the combination of microsatellite and flanking sequence substitutions to be less sensitive to homoplasy. This should help to better differentiate shared ancestral polymorphisms from interspecific gene flow. For demographic parameter estimations, we expect flanking sequence SNPs to be more informative about ancient population history while fast evolving microsatellites should be more informative about recent population demography.

Materials and methods

Biological model and individual selection

Q. canariensis is listed as Near Threatened at European level in the IUCN’s Red List (García Murillo and Harvey-Brown 2017) due to population decline and increasing drought within the Iberian Peninsula. Globally, the species is considered as Data Deficient because the species lacks information on population size and status across its range especially in North Africa (Gorener et al. 2017). The species is present in forest stands across Morocco, Algeria and Tunisia (Fig. 1a). In Algeria, Q. canariensis is located in either vast stands near the coastal area or in a few isolated populations in more remote southern mountain locations. Q. canariensis grows preferentially on deep, acidic substrates (calcareous and marno-calcareous, seldom frequent) in a subhumid and humid bioclimate. Q. faginea is listed as Least Concern both at the European scale (Harvey-Brown et al. 2017) and globally (Jerome and Vasquez 2018) due to widespread and stable abundance within its main range in Spain, Portugal and to a lesser extent in Morocco (Fig. 1a). In Algeria however, Q. faginea is nowadays only found in four locations (note that due to the scarcity of georeferenced occurrences Fig. 1a includes historical records that do not precisely represent contemporary location of Q. faginea populations in the country) in the form of small and isolated populations of a few hundreds of trees including scattered individuals in mountainous landscapes (Aissi et al. 2019). Q. faginea seems to be indifferent to the physico-chemical composition of the substrate and grows mainly in isolated stands on limestone substrates in a subhumid or semi-arid bioclimate and is adapted to both dry and cold environments (Aissi et al. 2021).

Fig. 1: Geographical context and population genetic structure of Quercus faginea and Q. canariensis Algerian populations.
figure 1

a distribution range of Q. faginea (green) and Q. canariensis (red) estimated using GBIF occurrence data including historical records (GBIF Secratariat 2021a, 2021b), with the studied rear-edge populations identified by the dotted rectangle; (b) EDENetwork network graph projected on a geographic map with nodes indicating spatial location of the sampled sites (site name colours reflecting morphological species as in (a)), node size is proportional to the number of connection with other nodes and colours reflect the dominant genetic cluster, edges link population with a pairwise genetic differentiation (Fst) below 0.04 with edge width and greenness inversely proportional to genetic differentiation. O: Oran, A: Alger and S: Skikda; (ch) Bayesian genetic clustering Structure results using whole haplotype information (c, e, g) or the number of repeats at microsatellite only (d, f, h), assuming two (c, d) or eight (e, f) genetic clusters with the phylogenetic tree (g, h) illustrating the allele frequency divergence (Structure’s Net nucleotide distance) among the eight genetic clusters.

Geographical coordinates of thirty randomly selected individuals were recorded in seven Q. canariensis and four Q. faginea populations across Algeria (Table 1; Fig. 1b). For each selected individual, a few fresh leaves were harvested and stored dried in silica gel until DNA extraction.

Table 1 Description of the sampled Q. faginea and Q. canariensis Algerian populations.

Microsatellite genotyping by sequencing

Total DNA for each individual was extracted from silica dried leaf material using Invisorb DNA Plant HTS 96 Kit (Invitek) and sequence-based microsatellite genotyping (SSRseq) was used to analyse polymorphism at 36 loci as described elsewhere (Lepais et al. 2020). In short, 60 genomic and EST-derived microsatellites (Kampfer et al. 1998; Durand et al. 2010) were co-amplified in a single multiplexed PCR and sequenced on an Illumina MiSeq sequencer (Supporting Material File 1). After sequence demultiplexing, a bioinformatics pipeline (Lepais et al. 2020) integrating the FDSTools analysis toolkit (Hoogenboom et al. 2016), was used to convert raw sequence into genotypic data (microhaplotypes) integrating all polymorphisms identified within the microsatellite itself (number of repeats, SNP and insertion-deletion (indel)) and in its flanking sequences (SNP and indel). The bioinformatics pipeline also compares genotypes from blind-repeated genotyping of 48 individuals to estimate allelic error rate and compute the overall missing data rate for each locus. These data quality metrics are necessary to optimise the bioinformatics analysis strategy by identifying loci that cannot be reliably genotyped or for which the genotyping accuracy is only acceptable when analysing the polymorphism within the repeat motif itself (i.e. not accounting for the polymorphism occurring in the flanking sequence (Lepais et al. 2020)). For each locus, every allele differing from the other by any polymorphism type was coded under an arbitrary three-digit scheme with a unique number assigned to each microhaplotype per locus.

Multilocus genotypic dataset quality control

Assuming Hardy-Weinberg equilibrium, we tested for heterozygote deficit or excess for each locus in each population using Genetix (Belkhir et al. 2004) to identify outlier loci presenting extreme inbreeding coefficients that could indicate presence of null allele (positive Fis) or paralogs (negative Fis).

As clonal reproduction has been reported for oaks in harsh environmental conditions (Alberto et al. 2010), we identified clones using Colony (Wang 2016) and kept only one genotype from each detected genet.

The studied loci were partly derived from expressed transcripted sequences (Durand et al. 2010), or were selected among anonymous genomic regions to differentiate Q. robur and Q. petraea (Scotti-Saintagne et al. 2004; Lepais et al. 2006), therefore, all may not behave neutrally. Bayescan v2.1 (Foll and Gaggiotti 2008) was used to identify loci departing from neutrality as most of the following analyses assume neutral genetic evolution. Bayescan was run at population level within species and at species level using the microhaplotype dataset.

Population genetic diversity and structure

Gene diversity (Nei’s unbiased heterozygosity), allelic richness after rarefaction to 16 diploid individuals, Fis and Fst were estimated for each sampled location, or pair of sampled locations, with FSTAT (Goudet 1995). Private alleles were computed using HP-RARE (Kalinowski 2005), genetic differentiation accounting for allele distance in number of repeats at microsatellite alleles (Rst) was estimated using SPAGeDi (Hardy and Vekemans 2002) and statistical significance of genetic differentiation between species tested using 10,000 permutations of individuals between species.

To estimate genetic differentiation (Fst) between populations, four different genotypic datasets were considered, accounting either for the whole haplotype, the flanking sequence haplotype, the number of microsatellite repeat or the whole allele size, to highlight potential difference in genetic structure between polymorphisms. Statistical significance for difference of genetic diversity and differentiation within species estimated from different datasets was tested by comparing locus-level statistics using a Mann–Whitney test implemented in the R package coin (Hothorn et al. 2008).

Structure v2.3.4 (Pritchard et al. 2000; Falush et al. 2003) was used to explore genetic structure between and within species. We applied the Admixture model and the Correlated Allele Frequency model with 20 replicated runs for an assumed number of genetic clusters (k) ranging from 1 to 12. To remove runs that did not converge and produced outlier results, the 15 most likely runs for each k value were kept for further inspection. Structure Harvester (Earl and vonHoldt 2012) was used to inspect likelihood trends of the runs and Clumpak (Kopelman et al. 2015) to identify and filter-out sub-optimal runs.

Using the most informative whole haplotype genotypic dataset, contemporary effective population size was estimated for each sampled location using the linkage disequilibrium method implemented in NeEstimator v2.1 (Do et al. 2013) assuming random mating, removing alleles with frequency below 0.02 (Waples and Do 2010) and using the parametric chi-squared method to estimate 95% confidence intervals.

In addition, the geographical context of the population genetic structure was illustrated using a network graph built with EDENetworks (Kivelä et al. 2015) with node location representing the geographical coordinates of sampled site locations and edges linking sites showing pairwise genetic differentiation values (Fst) lower than 0.04 (determined by automatic thresholding), with edge width inversely proportional to Fst estimates.

Population demographic history

While results from the Structure analysis show that each of the 11 studied populations can be unambiguously assigned to one of eight delineated genetic clusters (see “Results”), each population cannot be unambiguously assigned to one of the two species due to genetic continuity of the studied populations. We thus investigated the most-likely demographic history that led to the observed continuous pattern of genetic differentiation. One hypothesis is that the populations diverged separately from the main core populations and that different levels of genetic drift across populations led to a continuous gradient of genetic differentiation that blurs species delineation. A second hypothesis is that gene flow taking place after a divergence period (secondary contact) produced admixed genotypes that, following recombination and drift, have mingled to form a variety of intermediate populations. While interspecific gene flow following secondary contact has been found to be the rule for sympatric European white oak species (Leroy et al. 2017), the studied species are not sympatric because of contrasted ecological requirements. As a result, interspecific gene flow could only happen during secondary contact or through continuous levels of low interspecific gene flow over long distances. Therefore, molecular variation was simulated for a number of realistic demographic models using the coalescent implemented in fastsimcoal2 v2.6.0.3 (Excoffier et al. 2013). We simulated six concurrent demographic models consisting of population divergence without gene flow (model A), population divergence with a single interspecific gene flow event (model B) or with continuous interspecific gene flow (model C) after a period of divergence without gene flow (Fig. 2). Three additional models include long-term continuous interspecific gene flow between core populations of the two species (model Ab, Bb and Cb, respectively, Fig. 2). Gene flow between contemporary populations and unsampled core range of the other species was modelled as directional from the core to the isolated heterospecific population because we were interested in the effect of interspecific gene flow on the studied populations specifically. In addition, we allowed interspecific gene flow only for populations showing putative admixture in Structure (Fig. 1c and g). Effective population size was allowed to increase or decrease between the time at divergence and present. Prior distribution of demographic parameters was setup wide to reflect the lack of knowledge about the history of these populations (Table 4, Supporting File 1) and checked for their capacity to produce realistic genotypic data by comparing observed and simulated genotyping datasets (Cornuet, Ravigné and Estoup 2010; Leroy et al. 2017, Supporting Material File 3).

Fig. 2: Description of simulated models of Q. faginea and Q. canariensis Algerian populations demographic history.
figure 2

Present-day populations diverged from their respective species main distribution area without any additional interspecific gene flow (models A and Ab), or with interspecific gene flow represented either by an instantaneous single event of interspecific gene flow (models B and Bb) or a continuous interspecific gene flow (models C and Cb). Models Ab, Bb and Cb account for long-term interspecific gene flow between species whereas models A, B and C assume that species diverged without interspecific gene flow. Unsampled (ghost and ancestral) populations are indicated in white, sampled populations are coloured according to their main Structure’s genetic cluster. Arrow with continuous line indicates single instantaneous interspecific gene flow, arrow with dashed line indicates continuous interspecific gene flow. N: effective population size, t: time of event, tx: time of secondary contact; R and M: instantaneous and continuous interspecific migration rate, s: sampled population code. Rectangle indicates fixed effective population size, isosceles trapezoid indicates variable effective population size since time of divergence; shapes and sizes are not to scale in terms of time of event or effective population size.

Using fastsimcoal2, we simulated realistic molecular diversity that closely matches the characteristics of each of the studied loci. We focused on two kinds of variability that can be simulated: variation in repeat number at the 36 microsatellites and substitutions in the flanking sequence for the 20 loci for which the flanking sequence could be reliably analysed (called HapSTR loci thereafter) and assuming complete linkage between the microsatellites and their flanking sequences. This procedure accounts for 90% of the source of variations in the dataset (Lepais et al. 2020), leaving out additional polymorphisms (SNP in the repeated motif or indel in the flanking sequence) that are far more complex to model due to uncertainty of the mutational mechanisms and rates.

Each microsatellite was allowed to mutate at rates ranging from 10−5 to 10−2 mutation/generation/haploid genome (Lye et al. 2011) following a Generalized Mutation Model with parameter p between 0.1 and 0.5. Similarly, substitutions in the flanking sequence could mutate at rates ranging between 10−9 and 10−6 substitution/position/generation/haploid genome following neutral model of evolution. Each locus system had its own prior distributions for these parameters to model a wide range of mutational dynamics.

To compare the six alternative demographic models, 12,000 genotypic datasets were simulated for each model with demographic and mutational parameter values sampled from prior distributions. To summarize the distribution of genetic diversity among populations and loci, a total of 530 summary statistics were derived (Supporting Material File 2). We used Arlsumstat (Excoffier and Lischer 2010) to compute 294 summary statistics that take into account the different sources of molecular variation. A first set of 98 statistics were computed with microsatellites and, when available, flanking sequences combined as microhaplotypes “considered as mutationally equidistant from each other” (standard data in Arlequin nomenclature) using the 36 loci. A second set of 135 statistics were computed based on the number of repeats from all 36 loci (microsatellite data type in Arlequin). Finally, for the 20 HapSTR loci, an additional set of 61 statistics were computed based on substitution in the flanking sequences. We then computed 224 additional summary statistics using a custom R script (Supporting Material File 4). First, 136 of these statistics were locus-level variability such as variance in allele size for the microsatellite (V), number of microsatellite alleles, number of flanking haplotypes, nucleotide diversity (theta). Second, as the covariation in diversity at linked polymorphisms might include useful information about mutational and demographic processes (Payseur and Cutter 2006), we computed 88 statistics interrogating the association between the repeated motif number and the flanking haplotype, such as the correlation between theta and V (Payseur and Cutter 2006) or the mean number of microsatellite alleles per flanking sequence haplotype.

We performed model choice by comparing simulation-based summary statistics from the six models to the observed summary statistics from the real dataset within the ABC framework (Beaumont et al. 2002) and using the Random Forest approach implemented in the R package abcrf (Pudlo et al. 2016; Raynal et al. 2019).

We first built a classification random forest model using 1000 trees and a training dataset consisting of simulation-based summary statistics obtained from all models. We estimated the classification prior error rate for each model using an “out-of-bag” procedure to estimate the power of the genetic data to differentiate between the demographic models given the model and prior specifications (Pudlo et al. 2016). Then, we used the summary statistics computed based on the observed genotypic data to predict the demographic model that best fit the data using a regression forest with 1000 trees. In addition, we also repeated the model choice analysis using only the 231 summary statistics derived from the microsatellite motifs from the 36 loci.

We then used the overall most likely scenario to simulate 100,000 additional genetic datasets using parameters and prior distributions described above, to estimate demographic model parameters. We built a regression random forest model implemented in the R package abcrf based on the summary statistics using 1000 trees. The out-of-bag procedure (Raynal et al. 2019) was used to assess prediction error and inference power by comparing true (simulated) and estimated parameter values. We then estimated the posterior median, 0.05 and 0.95 quantiles of demographic and mutational parameters using a random forest regression model based on all summary statistics computed from the observed haplotypic dataset.

Finally, we assessed the value of different or combined polymorphisms for demographic inference by quantifying summary statistics importance for both model choice and parameter estimation (Estoup et al. 2018). To interpret better the significance of variable importance, we introduced 12 additional random summary statistics (“noise variables”) in the datasets (Chapuis et al. 2020). These consisted in four random floating point variables sampled between 0 and 1, four random floating variables sampled between 0 and 10 and four random natural number variable sampled between 1 and 30. As random variables, these noise variables should not contribute to the inference, and thus will provide a benchmark to assess the significance of the variable importance metrics reported for each summary statistics (Chapuis et al. 2020). Thus, summary statistics with variable importance higher than random summary statistics were considered informative.

Results

Genotyping

After filtering for missing data, genotyping error, Hardy-Weinberg equilibrium, neutrality and clonality (see Supporting Material File 2), the final genotypic dataset consisted of 36 high quality sequenced loci genotyped in 318 individuals. This includes 20 linked flanking sequence and microsatellite loci, called HapSTR thereafter, and 16 loci that could not be reliably analysed across the whole haplotype and for which only the microsatellite motif was analysed, called STR thereafter (Supporting Material File 2).

Molecular variation over loci

In addition to variability for the number of microsatellite repeats, SNP and indel within the microsatellite motif and the flanking sequences, was the rule rather than the exception (Supporting Material File 1 Table S2 and File 2). A total of 438 alleles differing in size would have been observed using traditional capillary-electrophoresis, which represent half of the number of alleles when considering all sources of molecular variation identified using sequence data (899 haplotypes in total, Supporting Material File 2). As a result, size homoplasy (alleles identical in length but differing in sequence) amounts to 51% overall (56% for HapSTR and 31% for STR). Across the 36 loci, 32 showed SNP within the microsatellite motif (for a total of 115 SNP), nine showed one indel within the microsatellite motif (for a total of 10 indel) and nine microsatellites consisted of more than one repeated motif (compound microsatellites). SNP among the flanking sequence for the 20 HapSTR were widespread with an average of 7.5 SNP per locus (min: 3, max: 14) and an average of one SNP every 6.7 nucleotides (15% of flanking sequence position showed a SNP). Half of the HapSTR showed an indel with an average of 0.9 indel per loci (min: 0, max: 4).

Molecular variation across populations

Interspecific genetic differentiation (Fst) varies from 0.030 when computed based on flanking sequence haplotype to 0.045 when computed from the whole haplotype (Table 2). Differentiation among Q. canariensis populations (average: 0.026) was lower than among Q. faginea populations (average: 0.043) irrespectively of the polymorphism considered. The comparison of Fst computed from the whole haplotypes among Q. canariensis populations (Fst = 0.026) and among Q. faginea populations (Fst = 0.049) was statistically significant (Mann–Whitney z-score = −3.74, p value = 0.0001, N = 36). The higher differentiation among Q. faginea populations is mostly driven by differentiation at microsatellite variation which translated into higher Fst compared to differentiation among Q. canariensis populations estimates based on allele size, repeat number or whole haplotype (Table 2). By contrast, differentiation at flanking sequence is not statistically different among Q. canariensis and Q. faginea populations (z-score = −0.56, p value = 0.5877, N = 20). Differentiation among populations within the two species at the microsatellite and the flanking sequence is not significantly different (respectively, 0.027 and 0.024 for Q. canariensis, Mann–Whitney z-score = 0.59, p value = 0.5687, N = 19; 0.046 and 0.030 for Q. faginea, Mann–Whitney z-score = 1.15, p value = 0.2584, N = 19, Table 2).

Table 2 Between and within species genetic diversity and differentiation parameters computed using different polymorphism types.

Genetic differentiation based on microsatellite allele distances (Rst) was not statistically significant from genetic differentiation considering only allele identity (Fst) at the intraspecific and interspecific level (all permutation tests not significant).

Genetic diversity did not differ between species (Table 2): variance of allele size at microsatellites amount to 4.31 and 4.65 and nucleotide diversity at flanking sequence to 7.3.10−3 and 6.5.10−3 for Q. canariensis and Q. faginea, respectively. Locus-level comparison showed low and not significant correlation between variance of allele size at microsatellite and nucleotide diversity at linked flanking sequence (Pearson product moment correlation: 0.15 (t = 0.62, df = 16, p value = 0.54) and 0.20 (t = 0.84, df = 17, p value = 0.41) for Q. canariensis and Q. faginea respectively).

At population level, allelic richness, gene diversity and private alleles showed similar variation between populations (Fig. 3), except for populations S02 and S03 that harbour higher and lower diversity, respectively (Fig. 3a, c and e). Flanking sequence haplotypes consistently showed lower allelic richness and genetic diversity than STR and their combination under HapSTR showed substantially more variability (Fig. 3a and c). This observation is explained by the lack of correlation between allelic richness or gene diversity between flanking sequence and linked STR (Fig. 3b and d). The mean number of private alleles per locus is generally low and similar between flanking sequence and STR (0.25 and 0.20 respectively), at the exception of S02, S06 and S09 populations where private flanking sequences were more frequent (Fig. 3e). However, the combination of the two sources of variation in HapSTR showed a much higher mean number of private alleles (0.88). Overall, these observations indicate that flanking sequence and linked STR harbour complementary information.

Fig. 3: Genetic diversity of Q. faginea (S02, S03, S08, S11) and Q. canariensis (S01, S04, S05, S06, S07, S09, S10) Algerian populations.
figure 3

a allelic richness, (b) within-loci correlation between allelic richness (Ar) at individual microsatellite and their flanking sequence, (c) gene diversity, (d) within-loci correlation between gene diversity (Gd) at individual microsatellite and their flanking sequence, (e) private alleles, (f) genetic differentiation (Fst, summarized by averaging all pairwise genetic differentiation estimates that involved the focal population). Symbols represents mean over loci (square: microsatellite variation, triangle: flanking sequence variation, circle and diamond: whole haplotype integrating both kind of polymorphisms) and the bar shows estimate variability (plus and minus two standard deviations over loci). Populations are ordered from west to east along the x-axis, colours refer to Structure results, horizontal dashed lines represent the average value across populations.

Population-specific genetic differentiation is the highest for S03, S08 and S11 populations and shows the highest variability for Q. canariensis S04, S05, S09 and S10 populations (Fig. 3f), which is consistent with Q. canariensis populations displaying low divergence from each other and Q. faginea populations displaying higher distinctiveness in the genetic differentiation network (Fig. 3b). Genetic differentiation has similar values for flanking sequence haplotypes, STR and combined information, at the exception of a few populations where flanking sequence haplotypes showed higher (S08, S07, S09) or lower (S11) genetic differentiation (Fig. 3f).

Structure analysis assuming two genetic clusters (K = 2), broadly delineate the two species using either HapSTR haplotypes or STR alleles (Fig. 1c et 1d respectively): Q. canariensis S04, S05, S09, S10 populations on the one hand and Q. faginea S03 and S08 populations on the other hand clearly fall in their own specific cluster. However, Q. canariensis S01, S06 and S07 populations and Q. faginea S02 and S11 populations showed admixture or unsorted ancient ancestry. While S01 had been assigned as Q. canariensis based on morphology, it was found to be more genetically similar to Q. faginea. Assuming a higher number of genetic clusters resulted in a more detailed view of the intraspecific genetic structure (Supporting Material File 1 Figure S2). At K = 8, using HapSTR haplotypes, results show a clear delineation of seven populations into specific clusters in addition to a genetic cluster of four Q. canariensis populations (Fig. 1e). Population delineation is less clear using STR alleles only, especially for Q. canariensis populations (Fig. 1f, Supporting Material File 1 Figure S3). Differentiation between genetic clusters do not show a bipartite hierarchical structure expected when analysing two species (Fig. 1g et 1h). Instead, five of the clusters showed gradual level of differentiation starting from the Q. canariensis side (right part of the tree in Fig. 1g), while three genetic clusters on the Q. faginea side showed high genetic divergence (left branch of the tree in Fig. 1g). Populations with intermediate ancestry coefficient assuming two genetic clusters (S02, S01, S06, S07 and S11) now constitute independent clusters that are located at intermediate positions between the two species extremities (Fig. 1e). The tree topology is similar when using STR alleles only, but its branches are shorter (Fig. 1h).

High variability in contemporary effective population size was found with estimates ranging from as low as 50 individuals for S03 and S08 population to 350 individuals for S05 (Supporting Material File 1 Figure S1). Excluding population S04 for which the sample size is too low for a point estimate (indicating a higher effective population size with the 95% lower bound confidence interval estimated at 524 individuals), the contemporary effective population size averaged 150 individuals over populations.

Demographic inferences

Model choice

When using information taken from both the microsatellite and the linked flanking sequence (HapSTR), the most likely model to explain the observed data was model Ab (posterior probability: 0.47, Table 3a) which consisted of species divergence with gene flow followed by independent divergence of populations without additional more recent interspecific gene flow (Fig. 2). The out-of-bag procedure estimated that, given the choice of the model Ab and the uncertainty in differentiating the models given the data and the prior distribution settings, there is a 49% probability that the true model is indeed model Ab, and a 18% or 17% probability that the true model is model Cb and Bb respectively (Table 3b).

Table 3 Random forest ABC Model choice (a) by using jointly microsatellite and flanking sequence information (HapSTR) or information derived from microsatellite variations only (SSR only). Power to differentiate alternative demographic models for HapSTR dataset (b) given the model and prior range specifications.

A total of 98 summary statistics had variable importance higher than randomly generated statistics, and thus were considered informative to choose between alternative scenarios (Fig. 4a). These informative summary statistics represented 44% of the total variable importance. Among them, 24% derived from linear discriminant axes, 35% from summary statistics computed from microsatellite variation, 33% from haplotypes and only 8% from the substitutions within the flanking sequences (Fig. 4a).

Fig. 4: Contribution of different types of polymorphism measured as Variable Importance (y-axis).
figure 4

a Model choice and parameter estimation, b sampled and c unsampled population effective population size, d time of event, e interspecific gene flow, HapSTR microsatellite (f) mutation rate, g mutation steps and substitution rate (h), SSR microsatellite mutation rate (i) and mutation steps (j). In each subfigure, the leftmost bar shows the proportion of total Variable Importance resulting from informative summary statistics, while the rightmost bar shows the contribution of the different groups of summary statistics derived from different types of polymorphisms.

It should be noted that when considering microsatellite variation alone, the most likely scenario was model Cb (posterior probability: 0.40, Table 3b) which consisted of species divergence with gene flow followed by independent divergence of populations followed by interspecific gene flow after a period of strict isolation (Fig. 2). However, since HapSTR integrates more molecular information, this strategy provides a higher resolution (posterior probability of 0.48 and 0.40 for HapSTR and STR respectively) and power (overall classification error of 42.7 and 44.4% for HapSTR and STR respectively). Therefore, we selected scenario Ab to estimate its demographic parameters using the HapSTR dataset.

Demographic and mutational parameter estimations

Microsatellite mutation and substitution rates were accurately estimated (all posterior MNSE lower than 0.11; Supporting Material File 2) with an average of 65 and 68 informative summary statistics representing 89 and 88% of the total variable importance for microsatellite mutation rates at HapSTR loci and SSR loci respectively (Fig. 4f and i) and an average of 16 informative parameters for substitution rate that represents 47% of the total variable importance (Fig. 4h). On the contrary, the parameter p controlling the number of mutational steps at microsatellite was difficult to estimate (posterior MNSE between 2.6 and 7.6; Supporting Material File 2) and had only an average of three informative summary statistics per parameters that represented 7.6% of the total variable importance for all loci (Fig. 4g, j).

Estimated microsatellite mutation rates varied from 2.2 × 10−5 to 2.1 × 10−3 mutation per generation per haploid genome (mean: 3.1 × 10−4, 95% CI: 4.5 × 10−5 – 2.4 × 10−3) and flanking sequence substitution rates varied from 4.5 × 10−8 to 4.8 10−7 (mean: 2.5 × 10−7, 95% CI: 2.8 × 10−8 – 9.0 × 10−7, Supporting Material File 2).

Posterior effective population size parameters showed restricted ranges compared to prior, as well as reduced estimation error (posterior NMAE below 0.32, Table 4). Contemporary and ancestral effective population sizes of sampled populations could be estimated, with an average of 39 informative summary statistics (Fig. 4b) representing 64% of the total variable importance. Effective population size of unsampled populations (i.e. Q. canariensis and Q. faginea core range, and their ancestor) could be estimated with an average of 46 informative summary statistics (Fig. 4c) representing 61% of the total variable importance.

Table 4 Demographic parameter prior specification and posterior estimations for the most likely model Ab.

Effective population size estimates showed some variation between populations (Table 4). Effective population size at divergence time were generally higher than recent effective population size, with clear sign of effective population size reduction for S08 (ratio N5a/N5 = 9.3, N5 = 3908 haploid genomes), S11 (ratio of N7a/N7 = 10.3, N7 = 1898 haploid genomes) and S3 (ratio of N2a/N2 = 24.3, N2 = 1667 haploid genomes). Populations that did not experience significant effective population size change had relatively moderate (~10,000 for S09, ~20,000 for S01, S06 and S07) or high (~40,000 for S02) effective population sizes.

Overall, only 36 out of the 518 summary statistics were not informative for the estimation of demographic or mutational parameters. Most of them were associated with flanking sequence diversity or divergence and correlation between diversity at microsatellite and flanking sequence (Supporting Material File 2).

Discussion

The complex demographic history of rear-edge populations

A first aim of this study was to assess how the informative content of microsatellite sequence can help us better understand the role of interspecific gene flow and recent demographic dynamics in structuring genetic diversity in complex population dynamics, using Q. faginea and Q. canariensis at the rear-edge of their distribution as a case study. A genetic clustering analysis assuming two genetic clusters showed highly admixed populations, which can be interpreted as evidence for recent interspecific gene flow. However, these admixed populations differentiated into specific genetic clusters when assuming eight genetic clusters. Contrary to expectations, the genetic distance between the eight genetic clusters does not clearly delineate species boundary, but instead shows a continuous gradient in genetic distance that suggests either ancient hybridization or retention of ancestral polymorphism. Notably, within-species genetic differentiation estimates (Fst) of 0.028 and 0.047 for Q. canariensis and Q. faginea respectively, are intermediate between typical estimates of microsatellite genetic differentiation among European white oaks core range populations (Muir et al. 2004; Alberto et al. 2010; Neophytou et al. 2015) and isolated marginal populations (Buschbom et al. 2011; Moracho et al. 2016). Results herein therefore suggest that interspecific gene flow occurred before population divergence in isolation. Divergence by drift was found to be the driving factor shaping genetic structure of small isolated Q. faginea and Q. canariensis populations. Recent drift in isolation was exacerbated in declining Q. faginea populations and affected to a lesser extent the remaining Q. faginea and some Q. canariensis populations. This has led to the formation of independent genetic clusters in a context of constant effective population size, allowing them to retain some admixed characteristics from ancient hybridization events as shown by intermediate genetic characteristics of their respective genetic clusters. Indeed, populations which experienced stable environments for extended periods of time are expected to retain ancestral characteristics (Petit et al. 2005), and thus in our context, the footprint of ancient interspecific gene flow. As a result of the recent evolution by drift in isolation, each population constitutes an original and significant contribution to the total genetic diversity of the species and provides a good illustration that rear-edge populations effectively act as reservoirs of evolutionary history (Petit et al. 2005; Médail and Diadema 2009; Lepais et al. 2013), as suggested by the description of subspecies in the region (Aissi et al. 2021). Given their small population size, these populations are vulnerable to rapid decline. This is supported by our simulations showing a demographic decline that has not yet led to a reduction in population genetic diversity. However, ABC-based estimates of historical declines in some Q. faginea populations was not always congruent with linkage-disequilibrium based estimates of small contemporary effective population size, thus indicating that all populations might not follow the same recent demographic trends. Nevertheless, most of the studied populations have effective population size confidence intervals below 500 individuals, a critical level for the long term maintenance of evolutionary potential (Jamieson and Allendorf 2012; Hoban et al. 2020), and estimates of effective population sizes as low as 50 individuals in isolated Q. faginea populations suggest short-term risk of detrimental effect of inbreeding depression on population persistence (Jamieson and Allendorf 2012). While these populations are probably more likely to be in immediate threat posed by demographic stochasticity rather than generic risks, long term retention of evolutionary potential is however paramount for adaptation to environmental changes (Allendorf and Luikart 2007) and is clearly challenged by low effective population size of rear-edge Q. faginea populations. Additional studies at the species distribution range level, including core and marginal populations should provide a better understanding of the history and evolutionary significance of intraspecific genetic diversity within these Mediterranean oak species.

The value of compound markers for population history reconstruction

Sequence information around microsatellite loci can uncover molecular variations caused by different mutation dynamics and mechanisms that generate complementary patterns across populations and species (Hey et al. 2004; Barthe et al. 2012). If the dataset combining all polymorphisms under microhaplotypes gave the most clear-cut results to depict the geographical distribution of the genetic clusters, we also observed different patterns of diversity distribution across populations according to the polymorphisms considered. For instance, Fst estimated from microsatellites are higher in Q. faginea compared to Q. canariensis, but Fst estimated from flanking sequence SNP haplotype are not significantly different. Moreover, if genetic differentiation at fast evolving markers is expect to be lower than genetic differentiation at slowly-evolving markers under an island model of migration at the mutation-drift equilibrium, no statistical difference was found. In addition, geographic patterns of microsatellite differentiation are not linked to newly acquired mutations because the estimates of genetic differentiation accounting for phylogenetic distance between microsatellite alleles (Rst) are not significantly higher than the estimates of genetic differentiation computed from allele identity (Fst) (Hardy et al. 2003). All these results, combined with the low correlation between genetic diversity estimates using linked markers, suggests a non-equilibrium situation where other factors than mutation, such as drift or interspecific gene flow, are involved in the increased differentiation at microsatellites for Q. faginea. However, highly polymorphic multi-allelic microsatellites should be more sensitive to the recent effect of genetic drift. Remarkably, estimates of genetic differentiation among Q. faginea populations were similar to estimates of genetic differentiation between Q. faginea and Q. canariensis, suggesting that interspecific gene flow blurs species boundaries with morphologically undetected hybridization and directional introgression introducing heterospecific alleles within Q. faginea populations. A combination of interspecific gene flow and recent demographic events could account for this pattern of molecular variation, highlighting the limits of estimates of genetic diversity and population genetic structure to resolve complex demographic history. Nevertheless, contrasted variability between differently evolving polymorphisms suggests complementary information about past population history. We showed that simulation-based inferences, that allow for realistic modelling of molecular evolution at linked loci, leverage information complementarily for improved historical inferences.

We found contrasted most likely demographic scenarios when considering either microsatellites alone or in conjunction with variation in their flanking sequences. Recent interspecific gene flow was inferred when using microsatellite variation alone while divergence without interspecific gene flow was the most likely scenario when also accounting for flanking sequence variation. Homoplasy (identical allelic state derived from different ancestor alleles) at microsatellite loci explains this discrepancy: with the underestimation of interspecific divergence due to homoplasy (Estoup et al. 2002) interspecific gene flow needs to be invoked to explain the observed pattern of microsatellite variation between species. Indeed, homoplasy at microsatellite loci is expected to overestimate the estimation of hybridization rate and underestimates the estimation of the timing of interspecific gene flow events (Dickey et al. 2013; Henriques et al. 2016).

However, contrary to traditional implementation of ABC, Random Forest ABC is not sensitive to the number and relevance of the summary statistics (Marin et al. 2018) and can assess the weight of each summary statistics in the inference (Estoup et al. 2018; Raynal et al. 2019). Therefore, we took advantage of these new properties to compare the value of microsatellite variations, flanking sequence substitutions, or the combination of both, to inform biological inferences. For model choice, all types of polymorphisms provide some information to differentiate the concurrent demographic models. For population demographic parameters, population-level diversity and pairwise difference between populations computed from haplotype and microsatellites inform about effective population size of sampled populations. However, the level of genetic diversity within the flanking sequence brings some additional information for ancient events such as effective size of ancestral populations. While population-level correlation between genetic diversity at flanking sequence and microsatellite was suggested to be sensitive to past demographic events (Payseur and Cutter 2006), nine out of 32 of those summary statistics were among the 33 uninformative summary statistics out of the total of 518 summary statistics derived in this study. This indicates that not all summary statistics are informative with respect to parameter inference, irrespectively of the type of molecular variability they summarized. Deeper assessment of correlation between diversity statistics (Gaggiotti et al. 2018) at contrasted polymorphisms, or the direct use of raw haplotypes without summary with deep learning approaches (Flagel et al. 2019) seem promising avenues to extract more information from multi-polymorphisms marker sequences.

Last but not least, we found that microsatellite mutation rates at HapSTR and SSR loci and substitution rates at their flanking sequence are informed overall by population level summary statistics, while locus-specific mutation parameters are informed specifically by the corresponding locus-level summary statistics. These results confirm the relevance of inference at the locus level in order to capture a realistic picture of mutational dynamic variability across loci (Payseur and Cutter 2006; Chapuis et al. 2020). Given the renewed interest in the functional role of microsatellite variation (Xie et al. 2019; Press et al. 2019) and the availability of novel technologies to sequence thousands of targeted microsatellite loci in empirical populations (Press et al. 2018), locus-specific inference contrasting flanking sequence and microsatellite mutation rates could help identifying functionally relevant microsatellite variation showing unusually low or high rate of evolution. Integrating different sources of polymorphism will thus further our understanding of the role of molecular variations in shaping species adaptation.