Divergence time estimation using ddRAD data and an isolation-with-migration model applied to water vole populations of Arvicola

Molecular dating methods of population splits are crucial in evolutionary biology, but they present important difficulties due to the complexity of the genealogical relationships of genes and past migrations between populations. Using the double digest restriction-site associated DNA (ddRAD) technique and an isolation-with-migration (IM) model, we studied the evolutionary history of water vole populations of the genus Arvicola, a group of complex evolution with fossorial and semi-aquatic ecotypes. To do this, we first estimated mutation rates of ddRAD loci using a phylogenetic approach. An IM model was then used to estimate split times and other relevant demographic parameters. A set of 300 ddRAD loci that included 85 calibrated loci resulted in good mixing and model convergence. The results showed that the two populations of A. scherman present in the Iberian Peninsula split 34 thousand years ago, during the last glaciation. In addition, the much greater divergence from its sister species, A. amphibius, may help to clarify the controversial taxonomy of the genus. We conclude that this approach, based on ddRAD data and an IM model, is highly useful for analyzing the origin of populations and species.


Scientific Reports
| (2022) 12:4065 | https://doi.org/10.1038/s41598-022-07877-y www.nature.com/scientificreports/ classes (K = 2), which was the best supported model, separated A. scherman and A. amphibius. The clustering outcome for K = 4 showed a coherent subdivision in the four populations considered here (Fig. 2B). Increasing the K value revealed new components of very low frequency. An F-statistics analysis confirmed pronounced and significant levels of population differentiation between the four populations defined (Supplementary Table S3). The genome phylogeny based on the average genetic distances between the 3361 loci grouped individuals according to the population of origin ( Supplementary Fig. S1), in agreement with the PCA and Structure analyses. In addition, the Cantabrian and Pyrenean populations were closely related, with the central European sample being external to this group. A. amphibius appeared as the most external group and highly divergent from A. scherman. At the fine-scale level, specimens were grouped according to their locality, indicating the low overall dispersal of the species and the high resolving power of the ddRAD genomic data. C a n t a b r i a n P y r e n e a n C . E u r o p e a n A . a m p h i b i u s www.nature.com/scientificreports/ Estimation of ddRAD loci mutation rates in a phylogenetic framework. In a first step of the dating process, we estimated the specific mutation rates of the ddRAD loci. To do so, we applied a pipeline to find orthologues from each locus in Muroidea, as depicted in Supplementary Fig. S2. Sequences from this superfamily were selected as it includes both the Muridae family, necessary to set the Mus-Rattus calibration point, and Cricetidae, which contains the Arvicola genus. The starting point in the pipeline was the set of variable ddRAD loci present in all the Arvicola samples. One sequence per locus was used to perform a BLAST search 40 against the house mouse (Mus musculus) genome. To try to ensure a 1:1 orthology, we considered only sequences with a single hit and an E-value of less than 10 -40 , generating 118 orthologues. Using the sequence coordinates from the mouse sequences detected, 114 mammalian orthologues were downloaded from the ENSEMBL database 41 .
Only alignments that included house mouse (M. musculus) and brown rat (Rattus norvegicus) were kept, leaving a total of 86 orthologues. From these, we selected the available sequences from Muroidea (Mus pahari, Mus caroli, Mus spretus, Mus musculus, Microtus ochrogaster, Cricetulus griseus, Rattus norvegicus, and Peromyscus maniculatus). We then added the corresponding A. scherman sequence to each set of rodent orthologues and realigned them. After filtering out one alignment that was invariant, alignments of 85 loci across 9 rodent species remained ( Supplementary Fig. S2).
Using the 85 rodent alignments, we constructed a tree with the BEAST2 program 21 . As a calibration point, we included the mouse-rat split at 10.4-14.0 million years (Myr), based on fossil data 20,42 . The calibrated phylogeny indicated that all the species diverged 18 Myr ago ( Supplementary Fig. S3). The mutation rate was obtained for each locus and it was the same for the set of all species as a strict clock was chosen (see "Methods"). The minimum and maximum rates were 0.3 × 10 -9 and 5.3 × 10 -9 mutations/site/year, respectively. The average rate of all loci was 2.6 × 10 -9 and the geometric mean was 2.32 × 10 -9 mutations/site/year (Supplementary Table S4).

Application of an isolation-with-migration model to the ddRAD loci.
We applied an IM model implemented in IMa3 43 to estimate the divergence times between the populations and species of Arvicola, as well as the population sizes and migration rates, using the ddRAD loci and the mutation rates of the 85 loci previously estimated (Supplementary Table S4). The mean number of polymorphic positions per locus was 2.29 ( Supplementary Fig. S4), a relatively low number, meaning that it was necessary to use a large number of loci for the estimations. IMa3 was tested with different numbers of loci, from less than 100-300, of which 85 were the calibrated loci and the rest were randomly selected from the total pool of variable ddRAD loci. We found that 300 loci were adequate to obtain sufficiently narrow confidence intervals (C.I.) for most parameters while maintaining a good convergence and mixing, as indicated by the similar values for most population size mutation rates and population migration rates obtained from the first and second half of the sampled genealogies (Supplementary Table S5). Probability distributions of divergence times in years (Fig. 3A) as well as of population sizes in demographic units and significant migration rates ( Supplementary Fig. S5) were generally continuous and showed a well-defined peak. However, some distributions were noisier and C.I. were relatively wide for some parameters, specifically, the most ancestral time and some of the migration rates. The divergence time between the two Iberian populations of A. scherman was 34 thousand years (Kyr) ago, with a 95% C.I. of 17-57 Kyr  Table S6). Five significant migration rates involving all present and ancestral populations were found, all with values of an effective number of migrant genes per generation (2 Nm) of ≪ 1 (Fig. 3B).
To test whether the loci with estimated mutation rates were more conserved, as these were selected as having 1:1 orthologues in Muroidea, we used the scalars or relative rates that IMa3 estimates for all loci. The geometric mean of all relative rates was 1.34, being 1.19 and 1.41 for the calibrated and non-calibrated loci, respectively. Thus, the rates were slightly lower for the calibrated loci, as expected, but the distributions mostly overlapped ( Supplementary Fig. S6).

Discussion
Suitability of ddRAD loci for IMa3 analysis. To study the divergence time between the populations of Arvicola, we applied an IM model to the ddRAD data. The length of the markers (145 bp) and the mean number of polymorphic positions (2.29) were small compared to other loci generally used with IM models 44,45 . This resulted in a small number of loci generating parameters with large confidence intervals or not converging properly, as we observed in initial runs. After increasing this number in successive IMa3 runs, we found that a set of 300 ddRAD loci resulted in reasonably good mixing and convergence (Supplementary Table S5), while the distributions were adequate for most of the demographic parameters in the model ( Fig. 3A and Supplementary  Fig. S5). However, some distributions had relatively wide confidence intervals, probably due to a lack of variability in the loci. Increasing the number of loci slowed the analysis speed, making it impractical. Other ddRAD datasets may require different number of loci and run parameters for achieving adequate convergence and precision, so initial runs are necessary to find the best conditions for each case.
To introduce mutation rates for the divergence-time estimation, we calibrated the ddRAD loci within a phylogenetic framework, obtaining a mean mutation rate for 85 loci of 2.6 × 10 -9 mutations/site/year. This value is similar to the average rate found for mammals of 2.2-2.6 × 10 -9 mutations/site/year using the fourfold degenerate sites of genes in different mammalian lineages 46 . However, the germline mutation rates estimated in two different works for the mouse genome using a pedigree approach were 5.4 × 10 -947 and 6.85 × 10 -9 mutations/ site/generation 48 , respectively. Considering a generation time of 1 year or less, the per-year mutation rate of the whole mouse genome would be more than double or triple that which we determined for the Arvicola ddRAD Scientific Reports | (2022) 12:4065 | https://doi.org/10.1038/s41598-022-07877-y www.nature.com/scientificreports/ loci. A possibility to explain this discrepancy is that the mouse lineage could be more accelerated than Arvicola, although our initial analyses showed that rate variation was low in the rodents' phylogeny (see "Methods"). A more likely explanation for this difference between mutation rates of ddRAD and the whole genome could be in the way in which the ddRAD data is obtained and assembled. On the one hand, we only used and calculated rates from the variable loci, which are 86% of all ddRAD loci in our dataset. On the other hand, it has been shown that ddRAD data do not incorporate the most variable regions of the genome 49,50 due to the fact that repetitive regions or polymorphic loci for the restriction enzyme sites are not assembled 51,52 . For these reasons, the estimation of specific mutation rates for ddRAD data may be more convenient than extrapolating germline mutation rates from whole genome data. The need to know the generation time to convert per-generation mutation rates to per-year rates is also a major source of uncertainty of the germline mutation rates to estimate divergence times 53 . However, the estimation of mutation rates in a phylogenetic framework may also be affected by other problems such as the availability and suitability of the calibration points 20 and the delineation of orthology 41,54 . In addition, ddRAD loci for which orthologues were found showed slightly lower relative rates (1.19) compared to the relative rates of all loci (1.34) due to the difficulties in detecting orthologues for the most accelerated loci. Accordingly, the demographic population sizes and split times should be ~ 11% lower than the values estimated here. Moreover, we do not have a reference genome in this group of species, so it was not possible to efficiently filter loci of sex chromosomes or with linkage disequilibrium 55,56 . All these factors in the estimation of mutation rates and their application to the IM model should be taken into account and make it necessary to be cautious with the estimates of split times and other demographic parameters. In this sense, it is convenient that the hypothesis being tested does not refer to a very specific time point but rather to broader time periods.

A . a m p h i b i u s
A . s c h e r m a n C e n t r a l E u r o p e a n A . s c h e r m a n C a n t a b r i a n A . s c h e r m a n P y r e n e a n www.nature.com/scientificreports/ Evolutionary history and taxonomy of Arvicola. Despite the above-mentioned uncertainties in the estimations, the results obtained with an IM model and the ddRAD data were robust enough to help advance our understanding in some aspects of the evolutionary history of a genus of complex evolution like Arvicola. Thus, the ddRAD data allowed us to determine the most likely periods of population and speciation splits from the last 2 million years of Pleistocene glaciations (Fig. 3A,B). The population size was smaller in A. amphibius than in A. scherman as a whole (Supplementary Table S6), as also reflected in the individual heterozygosity, but both were of the same order of magnitude as those observed for widespread rodents 44 . It is also worth noting the significant migration rates detected between most branches in the tree (Fig. 3B). Although gene flow values were relatively small, some of these migrations could have been important for sharing novel diversity between lineages.
Regarding the two Iberian populations of A. scherman, the IM analysis indicated that they diverged 34 Kyr ago (C.I.: 17-57 Kyr) (Fig. 3). These results are consistent with the hypothesis that the split happened during the last glaciation, dated as being between 115 and 20 Kyr ago, in the Late Pleistocene 57,58 . It can then be hypothesized that allopatry could have been initiated within separate refugia during the Last Glacial Maximum or close to this period 58 . Furthermore, the IM model indicated that there was significant gene flow from the Cantabrian to the Pyrenean population, i.e., the two populations exchanged migrants since the initial split, probably as a consequence of range expansion from the original refugia during the Holocene. However, migration rates were small (2 Nm ≪ 1), indicating a low homogenization between the two populations 59 , in line with the strong structure found (Fig. 2) and the morphological differences in both populations, which supports their subspecific status 32,33 . These results are also in agreement with the fundamental role of the refugia-within-refugia hypothesis to explain the generation of biodiversity in the Iberian Peninsula 34 .
As for the other nodes of the population tree, it should be taken into account that we have a much smaller number of samples to properly resolve them. Only one specimen was available for the central European population of A. scherman and, for A. amphibius, we only had samples from the semi-aquatic populations 24 ; we are therefore likely not to have fossorial ecotypes of A. amphibius. Hence, the estimated divergence dates and other model parameters as well as the conclusions should be taken with caution. Using the samples available in this work, the IM model allowed us to infer that the Iberian and central European populations of A. scherman diverged 145 Kyr ago (C.I.: 105-207). If we assume that A. scherman is mainly fossorial, this date marks the minimum age for the origin of this ecotype 37 . It is also interesting that a small amount of gene flow was found from the A. scherman ancestral populations (shown in grey in Fig. 3B) to A. amphibius, which makes it tempting to speculate on the possibility that these migrations help explain the presence of fossorial ecotypes in A. amphibius 25,35 . However, these hypotheses should be tested with a good representation of semi-aquatic and fossorial ecotypes of A. amphibius to further analyze the evolution of the ecological forms of Arvicola.
The genome-wide data obtained here may also shed some light on the controversy about the species status of A. scherman and A. amphibius 24,25,28 , assuming that our samples are representative of both. The two taxa appeared clearly separated in the PCA ( Fig. 2A), the Structure analysis with K = 2 (Fig. 2B), and the genomic tree (Supplementary Fig. S1). In addition, the IM analysis indicated that both species split 381 Kyr ago (C.I.: 276-628), during the Middle Pleistocene (Fig. 3). We also found evidence of low but significant gene flow between both lineages, although the inclusion of specimens of the two species from overlapping areas may result in a greater amount of gene flow. In view of these results, it seems appropriate to reconsider the recent proposal, based on mitochondrial genetic distances and morphological data, that these populations correspond to a single species 24 . However, further research based on this methodology and with a wider sampling of the known populations of Arvicola is advisable to resolve the taxonomic uncertainties.

Conclusions
In this work, we demonstrate the suitability of a strategy based on ddRAD genomic data together with an advanced IM model, which takes coalescence and migration into account, to estimate the split times of recently diverged populations and closely related species, using the genus Arvicola as an example. Ultimately, all the dated nodes of the population tree as well as the estimated population sizes and migration rates provided important insights into different aspects of the evolutionary history and taxonomy of this genus. The ddRAD technique and similar genome reduction approaches are emerging as cost-effective and valid alternatives for generating population genomics data, which can facilitate the future application of this type of sequences along with robust IM dating methods to a wide range of taxa. This is especially important to determine the evolutionary context and the main drivers of population and species divergence in order to better understand the origin of biodiversity. Additionally, with sufficient comparative data, these methods can help to objectively describe different taxa as well as to define evolutionary significant units of conservation importance and, thus, lead to a more precise description of biodiversity.

Methods
Samples. Samples of different localities of A. scherman and A. amphibius were obtained through a combination of our own collections from previous studies 30,31 , loans from museums, and skull bones sampled from barn owl pellets (Supplementary Table S1).

Ethics statement.
No animal was specifically captured for this work. Therefore, this study did not require ethics approval by a specific committee. ddRAD library preparation and analysis. DNA extraction from tissues and skull samples was carried out as described in Balmori-de la Puente et al. 60 . To prepare the libraries, we followed the ddRAD protocol 12 with modifications to process samples independently 14  www.nature.com/scientificreports/ by qPCR as previously described 14 , was digested with EcoRI and MspI restriction enzymes. Fragments between 300 and 400 bp were selected in a precast EX 2% agarose gel using the E-Gel system (Invitrogen). Different P1 Illumina adapters for each sample (each with a different 5-nucleotide barcode), up to a maximum of 24, and one P2 (the same for all samples) were used. When there were more than 24 samples, different PCR indexes were used. A PCR of 20 cycles was performed with primers annealing over the adapters (or 24 cycles in the case of weak PCR products). Two more PCR reactions were performed to homogenize the coverage of loci. When there was no PCR product, the samples were removed before pooling. The three PCR products from each sample were pooled and visualized in a gel. To construct the final library, the samples were pooled with proportions that depended on the product intensity observed in the gel. After initial bioinformatic analyses to estimate the coverage of each sample, some tissues and most of the skull samples were reprocessed in subsequent libraries to increase coverage. The libraries were sequenced using the NextSeq Sequencing System (Illumina) with the 150-cycles Mid Output kit and single-read runs, at the Genomics Core Facility of the Pompeu Fabra University. The library sequences were analyzed using different programs in the Stacks 1.35 package 38 . First, process_ radtags was used to filter the sequences, assign sequences to the different samples according to the sequenced barcode, and truncate them to 145 bp with the recovery option (r). As bone samples of A. scherman contained a large proportion of exogeneous sequences, they were filtered using the tissue samples 14 . For this, a database including the sequences of the tissue samples of this species was constructed with the bowtie-build tool from Bowtie 2.3.0 61 . We then performed a Bowtie search of the bone sequences against the tissue database with parameters "--score-min L,-0.6,-0.6", retaining only bone sequences that gave at least one hit for further analysis. After this step, sample reads of the same specimen from different libraries were merged by concatenating the sequence files. Using ustacks from the Stacks package, the initial minimum coverage (m) was set to 3 and the maximum differences between stacks (M) to 6. A catalog of loci from all the samples was constructed, allowing for a number of differences between loci from different samples (n) of 6 in cstacks. After testing different values, this set of parameters was found to be optimal for loci assembly. Finally, the populations program from the Stacks package was used with a minimum coverage (m) of 6 and a minimum proportion of individuals (r) of 1, i.e., only loci that were present in all individuals were selected. The 145 bp sequences were saved in FASTA format (with the command "--fasta_strict") and the first SNP of each variable locus was saved in PLINK format ("--write_single_snp --plink"). The numbers of SNPs and variable loci in FASTA format were not exactly the same (2877 and 2874, respectively), as they were generated with different statistical models. The heterozygosity rate of each individual was calculated by counting the proportion of variable positions along all loci in the FASTA file.
With the aim of detecting any potential bias in the assembly, we performed an Fst analysis for each locus using the SNPs dataset with BayeScan 2.1 62 . The results revealed no major deviations in the assembled loci, with only 4 possible outliers (with slightly higher Fst values) out of the 2877 SNPs.
Genomic tree and population structure analysis. A genomic tree of the individuals was constructed using all loci, following Igea et al. 8 . To summarize the divergence of the two separate alleles of each locus, a pairwise distance matrix was calculated by estimating genetic distances between all possible combinations of alleles from a pair of individuals using Equation 8.1 in Freedman et al. 63 . Then, the resulting matrix of pairwise distances was used to construct a tree with the Fitch program of the Phylip package 64 . Mid-point rooting was used to represent the tree.
A PCA applied to the SNPs dataset was performed using the program SNPRelate available in R, using the genetic covariance matrix 65 .
STRU CTU RE 2.3.4 39 was applied to the same SNPs. An admixture and uncorrelated allele frequency model with the ancestry prior, recommended when the sampling is unbalanced, was used. The number of populations, K, analyzed ranged from 2 to 6, and the initial Dirichlet parameter for degree of admixture was defined for each K as 1.0 divided by the number of populations. The optimal K value was estimated using the method of Evanno 66 . The number of iterations was set to 500,000 with a 10% of burn-in. Ten independent replicas for each K were performed. Replicas with different patterns were found for K = 4, 8 of the 10 being the same. CLUMPP 1.1.2 67 was used to summarize the results.
Pairwise F st distances between the different populations were estimated using the SNPs with the Weir and Cockerman (1984) method in the genet.dist function of the hierfstat R package 68 . Using boot.ppfst from the same package, 95% confidence intervals were calculated with 100,000 replications bootstrapping over loci. Any intervals that did not overlap zero were inferred to be significant.
Estimation of specific mutation rates of ddRAD loci from rodent sequences. A pipeline of several bioinformatic steps was designed to detect orthologues of Arvicola ddRAD sequences and estimate their mutation rates, as depicted in Fig. S2. The set of 2874 ddRAD sequences used as seed in this pipeline belonged mostly to a single specimen of A. scherman (IBE-C4977; Table S1), except for a few sequences that were missing from this individual and taken from another (IBE-C4969; Table S1). Any other specimen of this or the other species would have produced the same results given the small differences between Arvicola sequences compared to the large differences with other rodent species. First, one sequence per locus was used to perform a BLAST search 40 against the mouse genome using an E-value of 1e −10 (80,352 hits). Only loci with single hits and reported E-values lower than 1e −40 were considered to ensure as much as possible 1:1 orthology (118 orthologues). Chromosome number and sequence coordinates of each hit were annotated. Mammalian orthologues of each mouse sequence were then downloaded from the EPO suite of the ENSEMBL database (114 alignments). This database includes complete genomes of vertebrate species, with one genome used as reference per species, and multiple genome alignments of these species from which orthologous regions can be obtained 41  The number of species per alignment ranged from 4 to 9; 93% of alignments had at least 6 species and 41% had all 9 Muroidea species. After adding one A. scherman sequence to each locus (the same used as seed; Fig. S2), we realigned each set of rodent orthologues using MAFFT 7.130 69 with the localpair, maxiterate and adjustdirectionaccurately parameters. Gap positions from the final alignments were removed using Gblocks 70 . After this step, invariant alignments were removed (remaining 85 alignments). If any sequence in this set was not orthologous or was too divergent, it could be detected in a phylogenetic tree as an anomalous branch length. We therefore reconstructed a maximum-likelihood phylogenetic tree from each alignment using RAxML version 8.0.19 71 .
The trees were visually inspected to ensure that no trees with large branch lengths were present in the final set. A tree was constructed from the 85 rodent alignments with BEAST version 2.5.2 21 , using the calibrating point of the mouse-rat split at 10.4-14.0 Myr 20 . Unlinked site models (HKY + I) across loci, unlinked clocks (strict clock) and linked trees (Yule) were selected. A strict clock was chosen because the analysis converged better than with a relaxed clock in initial runs, as expected when the mutation rate variation between lineages is low 72 . The calibrated node was modeled using a lognormal prior distribution with minimum and maximum constraints in real space; the offset and the soft maximum were set to 10.4 and 14.0 Myr, respectively, to coincide with the 95th percentile of the probability density distribution with a standard deviation of 1. A total of 75 million generations were run, sampling each 1000 generations. The Tracer v1.7.1 program 73 was used to check convergence and retrieve the mutation rate of each locus. The TreeAnnotator program from the BEAST package was used to calculate the consensus tree with median heights and 10% burn-in.
Isolation-with-migration analyses. To estimate divergence times in an IM analysis we used IMa3 43 .
Four populations were considered: the Cantabrian and Pyrenean populations of A. scherman from the Iberian Peninsula, a sample from the central European distribution of this species, and the A. amphibius samples. In this configuration, the population topology was, according to the genomic tree: (((A. scherman Cantabrian, A. scherman Pyrenean), A. scherman central European), A. amphibius). After different tests, the final analysis was carried out using 300 randomly selected loci that included the 85 calibrated loci. The mutation rates estimated previously using BEAST2 were included in the corresponding loci of the IMa3 input file after scaling them per alignment, as required by the program. To calculate the population size in demographic units, the generation time was set to 1 year, a value that can be considered adequate for a short-lived species like Arvicola sp. However, it should be noted that the generation time does not affect the estimation of divergence times 5 . The infinite sites model was used for all loci. Seven loci that did not pass the four gametes test were trimmed, with the longest fragment being selected 74 . The priors in the IMa3 model were adjusted to short loci and recently split populations and species, and they were selected according to their convergence in initial runs: maximum population size mutation rate 4Nµ (q) = 1.5; maximum split time tµ = 0.5; and maximum migration rate m/µ = 2. Note that all parameters are scaled by the mutation rate µ. Runs were performed with 420 chains (a large number due to the considerable number of loci) on 28 processors (the final run took 265 h). Burn-in was set to ~ 500,000 steps and 15,000 genealogies, sampled every 100 steps, were saved. To ensure proper mixing and convergence, plots of parameter trends and marginal posterior probability distributions of the parameters were checked, and estimates of the first and second halves of the sampled genealogies were compared. Parameter estimates reported were the histogram bins with the highest value and confidence intervals were the 95% highest posterior density intervals (HPD). The IMfig program 45 was used to prepare the figure with the schematic representation of the generated isolation-with-migration model. Significance of migration rates was based on the log-likelihood-ratio test of the null hypothesis of zero migration 3 .