Rolling down that mountain: microgeographical adaptive divergence during a fast population expansion along a steep environmental gradient in European beech

Forest tree populations harbour high genetic diversity thanks to large effective population sizes and strong gene flow, allowing them to diversify through adaptation to local environmental pressures within dispersal distance. Many tree populations also experienced historical demographic fluctuations, including spatial population contraction or expansions at various temporal scales, which may constrain their ability to adapt to environmental variations. Our aim is to investigate how recent contraction and expansion events interfere with local adaptation, by studying patterns of adaptive divergence between closely related stands undergoing environmentally contrasted conditions, and having or not recently expanded. To investigate genome-wide signatures of local adaptation while accounting for demography, we analysed divergence in a European beech population by testing pairwise differentiation among four tree stands at ~35k Single Nucleotide Polymorphisms from ~9k genomic regions. We applied three divergence outlier search methods resting on different assumptions and targeting either single SNPs or contiguous genomic regions, while accounting for the effect of population size variations on genetic divergence. We found 27 signals of selective signatures in 19 target regions. Putatively adaptive divergence involved all stand pairs. We retrieved signals both when comparing old-growth stands and recently colonised areas and when comparing stands within the old-growth area. Therefore, adaptive divergence processes have taken place both over short time spans, under strong environmental contrasts, and over short ecological gradients, in populations that have been stable in the long term. This suggests that standing genetic variation supports local, microgeographic divergence processes, which can maintain genetic diversity at the landscape level.

Using a reference table of the individual belonging to each stand, we produced two new files containing each the individuals of each pair of stands as well a vcf file pooling the variants of all the individuals Script for mapping each nucleotide on the scaffold Here we build a reference table from the BLAST+ output and use it to map our whole dataset to the reference genome.BLAST+ was run with the aforementioned command line: blastn -db Fsylvatica_DE_NuclearGenome -query /home/pc/Documenti/FagusLab/Fagus_pc_Turin/Fs_contigsK81wrap.fa-out Fagus_0.txt-max_hsps 6 -dust no -outfmt "0 qseqid sseqid qstart qend sstart send gapopen gaps pident nident mismatch evalue length" blastn -db Fsylvatica_DE_NuclearGenome -query /home/pc/Documenti/FagusLab/Fagus_pc_Turin/Fs_contigsK81wrap.fa-out Fagus_0.txt-max_hsps 6 -dust no -outfmt "0 qseqid sseqid qstart qend sstart send gapopen gaps pident nident mismatch evalue length"

Randomness of distribution of target sequences
To check whether the sequenced genomic regions were randomly distributed across the genome, their positions were compared to the expectation obtained following a Poisson distribution.Under a Poisson process, the ratio of variance to mean is χ 2 -distributed with one degree of freedom; when the ratio of variance to mean equal to 1.0 , the studied process that causes object to be scattered in space or time follows perfectly a Poisson distribution; lower values suggest that it scatters the object more randomly than a Poisson process, higher values suggests that it clump them together instead.This property can be used to test whether the observed distribution follows a Poisson process.To test whether the sequenced genomic regions were randomly distributed over the genome, we counted the number of sequenced regions in 1,000,000-bp genomic windows, computed the mean and variance of counts, computed their ratio and applied the above χ 2 test.

Detailed methods for the identification of the demographic model
We estimated demographic parameters for the four stands by applying FastSimCoal2 v2.7.0.5 (Excoffier and Foll, 2011) to folded joint pairwise site frequency spectra (SFS) for each pair of stands, from now on called 2D-fSFS.

2D-fSFS computation
To compute the observed 2D-fSFS both at the genomic region and the scaffold level we fed the vcfSieve() function (Scotti et al., 2023) a VCF file containing the position of the SNPs (on the genomic regions and on the scaffolds, respectively), a fasta file of the genomic reference assembly or the scaffold sequences, and a table mapping each individual to its stand.After having counted the alternative and reference allele for each population, we imputed the missing data with the custom function imputing(), which takes the counts of each allele at a locus in a population and the maximum number of valid genotypes at any locus in a population to return a vector of the counts for each allele of the imputed SNPs.Next, the global minor allele at each SNP was identified with the custom function minAlleleCounts(), which returns a vector of (global) minor allele counts for each population.In the final step, the 2D-fSFS were produced by combining the vectors of (global) minor allele counts for each pair of populations (the numbers of invariant sites in the sequences, i.e. the total length of the target regions minus the number of SNPs, were added to the [0] and the [0,0] cells columns).The R script used to produce 2D-fSFS is the n°17 available at the DataVerse repository.FastSimCoal2, version 2.7.5 (Excoffier and Foll, 2011) was used twice, to (1) estimate the demographic parameters under a given demographic scenario using the observed 2D-fSFSs, and (2) generate a simulated 2D-fSFSs based on the most-likely demographic scenarios, after a treatment of the output of (1) that is described below (see Figure 1 for the workflow).Box 1 reports the master (TPL) file coding for the demographic events in FastSimCoal (see also Figure 2 in the main document for a graphical description of the events).

Estimation of demographic model parameters
We used the six 2D-fSFSs as input to the maximum-likelihood parameter estimator of Fastsimcoal2 v2.7.5 (with the following parameters: number of independent loci 1, chromosome 0, number of linkage blocks per chromosome 1, data type FREQ, num loci 1, recombination rate 0, mut rate 10-8; -n 100000 -M -c 4).We repeated the estimation step 100 times to obtain a distribution of values for each parameter, to account for uncertainties in parameter estimation (Figure 1, Block 1).See Table 1 for the 100 maximum likelihood sets of parameters and Figure SM2-2 for their distribution.

Analysis of the covariation structure of parameter values
Prior to the sampling of sets of parameter values for the simulation of demographyinformed 2D-fSFS, we examined (a) the similarity of parameter sets obtained with the 100 estimates and (b) the covariation among parameters (Figure 1, Block 2).
To check whether the hundred parameter sets belonged to the same group (i.e., the most likely scenarios from all estimations were a random draw from a single "type" of scenario) or formed multiple groups (i.e., more than one "type" of scenario was obtained), we applied a principal component analysis, in which each estimation was an object, and each parameter a variable describing the objects, with the function dudi.pca() of Adegenet 2.1.10(Jombart, 2008).We retained the two first principal components, explaining 27.69 and 14.35 % of the variance, respectively.According to these axes, we identified two clusters of parameter sets, containing 64 and 30 parameter sets each (six parameter sets did not fall in a cluster; Figure 2).This indicated that the maximum likelihood estimator identified two peaks of likelihood, corresponding to the two groups of parameter sets.
We then proceeded to inspect the covariation among parameter values within each cluster of parameter sets, for all pairs of parameters (21*20 / 2 = 210 pairs of parameters) using the script n°11.We inspected the P-value of the slope of the correlation (after Bonferroni correction for multiple tests) and identified, in each of the two clusters of parameter sets identified by PCA, the groups of correlated parameters shown in Figure 3.
In each group of correlated parameters we identified a "core" parameter, based on the strength of correlation with all others; the "core" parameters are underlined in Figure 3.

Calculation of priors for the final simulations
In the next step, we used FastSimCoal again one hundred times, to produce one hundred joint 2D-fSFS.We drew starting sets of parameters for each simulation from the above distributions of parameter values; however, because (a) parameter sets were grouped in two clusters, and (b) at least some of the parameters were correlated within each cluster, we could not draw each parameter independently from the others, which would have lead to implausible combinations of parameter values.To obtain input sets of parameter values that took into account parameter covariation structure, we proceeded as follows (Figure 1, Block 2): (a) In each cluster of parameter sets, we draw values for all uncorrelated parameters and for the "core" parameter in each group of correlated parameters from a uniform distribution, having as boundaries the minimum and maximum of the distribution of each parameter (Table 1)     Table 1

. Sets of parameter estimates
Note that the signs of the GROWTHRATE parameters have been flipped to make them more understandable.They must be filpped again to write the TPL input file of FastSimCoal because the software package works in a coalescent framework (ie.backward in time) : hence it requires changes in population size to be written with the opposite sign.A FastSimCoal input file has demographic expansiion written with a negative sign as if they were contraction, and demographic contraction written with a positive sign as if they were expansions.
For all other parameters, values were obtained using the linear model describing their correlation with the "core" parameters: where α is the slope of the model, A 1 the "core" parameter, β is the intercept of the model; N(0, σ) is the error term, drawn from a normal distribution ~N(0, σ), where σ is the model's residual standard deviation.Each value that was drawn was checked for consistency with the structure of the demographic model (in particular for the ranking of event times: TIMEA < TIMEB < TIMEC etc.), as well as for staying within the boundaries of estimated parameter values.Draws violating such constraints were discarded, and a new value was drawn, until reaching the required numbers of parameter sets (see script n°12).
Respectively, 68 and 32 parameter sets were drawn from the distribution and the covariation structure of each cluster of parameter sets, to match the size of each cluster (respectively made of 64 and 30 parameter sets).This way, we accounted for the relative weight of each peak of maximum likelihood, for the correlation structure of parameters, and for error in parameter estimation.
The one hundred parameter sets thus computed were used in the final phase.

Computing the final simulated joint 2D-fSFS
To compute the final joint 2D-fSFS from the newly produced one hundred parameter sets (Figure 1, Block 3), we first generated 100 PAR files from a TPL written to simulate the whole beech genome (Box 2) and the parameter sets (Table 1).We then used FastSimCoal anew, with the following arguments: -i input (n° of the PAR file).par-n1 -G -g.By doing so we obtained SNP tables for six populations and 75,000 short sequences of length 1000 bp ("simulated target region data"), mimicking our empirical target regions.The GEN output files produced by FastSimCoal were then converted to VCF format using the script n°19 and then merged into a single VCF, which was treated exactly as the empirical VCF for all subsequent analyses.. //migration matrix 0.000 M01 M02 M03 M12 0.000 M12 M13 M02 M12 0.000 M23 M03 M13 M23 0.000 //historical event: time, source, sink, migrants, new size, growth rate, migr.matrix NOTE: order of event 1 and 2 to be estimated 8 historical event TIME1 0 1 1 1 GROWTHRATE0 0  TIME1 0 0 0 0 GROWTHRATE0 0  TIME2 2 1 1 1 GROWTHRATE0 0  TIME2 2 2 0 0 GROWTHRATE0 0  TIME3 3 1 1

Single-locus approach for outlier detection based on theoretical model of population differentiation
A locus is considered as an outlier whenever Bamova 1.02 (Gompert & Buerkle, 2011;Scotti et al., 2023) is a genome-based method for Φ-statistics outliers that compare the single-locus posterior distribution of the statistic to the genome-level posterior distribution.The analyses were carried over the six pairing of the four populations within an island framework.This "symmetrical" sampling scheme will allow us to reason on the outcomes by relying on the biogeographic and ecological differences within each pair.Details on the number of individuals and loci of each population pair are reported in Supplementary Table 2.
The method relies on the identification of outliers using Bayes Factors as a statistic to assess whether a given locus conforms to the null hypothesis of undergoing only the background observed genome-level divergence (Gompert & Buerkle, 2011;Scotti et al., 2023).The neutral expectations (generally speaking based on an island model divergence process) are therefore drawn from the observed data.
The pipelines used to obtain the respective input files and the detailed methods for outlier search are reported in Supplementary Methods 3.

"Crude" outlier search tests
Overall, 778 SNPs passed the significance tests of Bamova and were deemed as outliers: 535 pairwise tests and 243 tests of hierarchical schemes.When pooling hierarchical and pairwise population analyses, the number of individual outlier loci was 562: 387 were retrieved under a single grouping and 175 by more than one, respectively.The 562 outlier SNPs were retrieved in 461 target regions.Among these regions, 389 harboured a single outlier SNP and 72 more than one.
The Bamova analyses retrieved 65 SNPs whose frequencies diverged with strong evidence from the genomic background between N3 and N4 populations (Φ ST Northern slope), and 120 between S1 and S5 (Φ ST Southern slope).When comparing population from the putative refugial area, the SNPs retrieved with a frequency strongly diverging from the background were 88 between N3 and S5, and 109 SNPs between N4 and S5 (Φ ST ).When the comparison was performed for population pairs belonging to different slopes and growing at different altitudes, Bamova retrieved 79 (N3-S1) and 74 (N4-S1) SNPs.When the full dataset is analysed within a hierarchical structure, we found 18 genomic outliers the number of genomic outliers diverging between the two slopes (Φ CT slope replicates, "North vs. South"), another and 225 were divergent at different altitudes within the same slope (Φ SC replicates "altitude").

Figure
Figure SM2-2.Distribution of estimated parameter values and principal component analysis of estimated parameter sets

Figure SM2- 3 .
Figure SM2-3.Groups of correlated parameters in each of the two clusters of parameter sets TIME2 -TIME3 -TIMEC

Box 2 :
FastSimCoal master file (TPL) code to produce the PAR files //Number of population samples (demes) 4 samples to simulate //Population effective sizes (number of genes) of migration matrices : 0 implies no migration between demes 1