Introduction

Differences in genetic structure within and between populations of tree species are mainly attributable to life form and breeding system. The availability of highly variable molecular markers has facilitated the analysis of fine-scale genetic structure in natural tree populations. Fine-scale structure has been found in some forest tree species (Sork et al., 1993; Berg and Hamrick, 1995; Streiff et al., 1998; Dutech et al., 2002; Hardy et al., 2006). These tree species are characterized by either limited seed dispersal or restricted pollen and seed dispersal.

Pines are wind-pollinated and the seeds generally have wings that facilitate wind dispersal (Ledig, 1998). Together with a predominantly random mating system (Koski, 1970), these features contribute to little or no genetic structure being found in undisturbed pine forests, at both large (Gullberg et al., 1985; Karhu et al., 1996; Dvornyk et al., 2002; García-Gil et al., 2003) and fine geographic scales (Knowles, 1991; Xie and Knowles, 1991; Parker et al., 2001; Uchiyama et al., 2006; Marquardt et al., 2007). On the other hand, fragmentation and bottlenecks may give rise to genetic structure because of self-fertilization and mating among genetically related individuals (Vogl et al., 2002; Robledo-Arnuncio et al., 2004; Boys et al., 2005). Mating between genetically related individuals increases inbreeding. Inbred individuals may have lower fitness because of the expression of recessive deleterious alleles (Charlesworth and Charlesworth, 1987). Inbreeding also results in a decreased level of genetic diversity, which is of major concern in forest tree breeding and conservation programs.

Forest management practices have also been shown to increase genetic structure compared with natural forests, especially if the breeding practices imply drastic reduction of the effective population size (Young and Merriam, 1994; Finkeldey and Ziehe, 2004). Population size reduction could potentially increase the rate of self-fertilization because of the reduction in number of local compatible mates. Moreover, even under random mating, a smaller number of parent trees will increase the probability of seed cohorts with full-sib relationships (Surles et al., 1990; Muona and Harju, 1989; Robledo-Arnuncio et al., 2004).

Recently, several Bayesian clustering methods for inference of population genetic structure have been developed. These methods are generally referred to as assignment methods and use allele frequency data of molecular markers to ascertain the population membership of individuals by assuming either fixed or variable numbers of population clusters (Manel et al., 2005). In the original methods developed by Pritchard et al. (2000), Dawson and Belkhir (2001) and Corander et al. (2003), spatial information was not explicitly included in the modeling. However, some recent Bayesian assignment methods incorporate information from the geographical coordinates of individuals, by using prior distributions for the spatial distribution of individuals in a cluster (Wasser et al., 2004; Guillot et al., 2005; François et al., 2006; Corander et al., 2008). Simulation studies have shown that incorporation of geographical information into assignment methods can result in better statistical performance (Chen et al., 2007). It is well-recognized that inbreeding perturbs the Hardy–Weinberg equilibrium and can lead to spurious aggregates of population substructure (for example, Guinand et al., 2006). With the exception of the methods developed by François et al. (2006) and Gao et al. (2007), the assignment methods are based on the assumption of Hardy–Weinberg equilibrium within clusters, and may therefore yield biased estimates of the number of clusters in the presence of inbreeding.
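To illustrate how inbreeding perturbs the Hardy–Weinberg equilibrium, the following minimal sketch computes genotype probabilities at a single biallelic locus using the standard textbook parameterization with an inbreeding coefficient F. The function name and the example allele frequency are hypothetical, and whether a given assignment program uses exactly this parameterization is an assumption.

```python
def genotype_probs(p, F):
    """Genotype probabilities at a biallelic locus.

    p: frequency of allele A; F: inbreeding coefficient.
    F = 0 recovers Hardy-Weinberg proportions; F > 0 shifts
    probability mass from heterozygotes to homozygotes.
    """
    q = 1.0 - p
    return {
        "AA": p * p + F * p * q,
        "Aa": 2.0 * p * q * (1.0 - F),
        "aa": q * q + F * p * q,
    }

# Illustrative values only (not taken from the data set).
hw = genotype_probs(0.3, 0.0)       # Hardy-Weinberg proportions
inbred = genotype_probs(0.3, 0.25)  # heterozygote deficit under inbreeding
```

A method that assumes F = 0 within clusters can only explain such a heterozygote deficit by splitting the sample into additional clusters, which is one way spurious substructure may arise.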

In this study, we jointly estimated the fine-scale genetic structure and inbreeding level in a managed tree population of Scots pine using a recently developed Bayesian hidden Markov model. We analyzed 96 geographically mapped individual seed trees of Swedish Scots pine using 14 microsatellite loci. The analysis was carried out using the program GENECLUST (François et al., 2006), which provides the facility to jointly incorporate both spatial information from a geographical neighborhood structure through a Potts–Dirichlet model and account for variable degrees of inbreeding within the clusters. To evaluate whether inbreeding and spatial interaction should be included in the best-fitting statistical model for our data, we used the deviance information criterion (DIC), a weighted measure of fit that accounts for an effective number of free parameters in a model (Spiegelhalter et al., 2002; Celeux et al., 2006). We evaluated DIC statistics for several models with and without inbreeding, and with increasing levels of spatial connectivity.

Materials and methods

Scots pine material

Scots pine is a major conifer species across the northern boreal zone in Europe and Asia. Its distribution is the widest among the pine species, from southern Spain (38 °N) to northern Finland (68 °N), and from western Scotland (6 °W) to the Okhotsk Sea in eastern Siberia (135 °E) (Mirov, 1967). Within its distribution Scots pine grows at elevations from sea level to 2400 m and in many different environments in terms of temperature, soil quality and humidity. Scots pine is a keystone species, on which many other plant, insect, bird and animal species depend (Persson, 1980). Like the majority of the pine species, Scots pine has a diploid genome with a chromosome number of 2n=24 (Saylor, 1972). Scots pine is wind-pollinated and has wings on the seeds that facilitate wind dispersal, over distances that can be characterized by an exponential distribution with a tail that descends to a value close to zero within a few tens of meters. However, some long-distance animal-mediated seed dispersal cannot be ruled out (Lanner, 1998).

The trees for this experiment are situated in a population 25 km north-east of Umeå, Sweden and originate from wind-pollinated seed trees that were established in 1965. The number of seed trees was around 50 per hectare. Seedlings were allowed to establish until 1979, after which the seed trees were cut down. The population was thinned in 1989, resulting in a collection of trees with homogeneous height and age.

We sampled needles from 96 trees laid out approximately on a square lattice, marked the trees and estimated their geographic positions with a GPS system. We sampled 25 hectares out of a total managed area of 65.9 hectares. The aim was to sample trees as close as possible to 50 m apart (that is, a lattice with 50 × 50 m cells). However, the lattice deviated slightly from this ideal because the seedlings had established naturally (see the Voronoi tessellation in Figure 1). Needles were sampled within 1 day in November 2004 and stored in a −80 °C freezer.

Figure 1 Neighborhood structure of the 96 sampled Scots pine trees obtained from a Voronoi tessellation. Trees sharing a border are considered as neighbors.

DNA extraction and microsatellite amplification

DNA was extracted from needles with the DNeasy Plant Mini Kit (Qiagen, Solna, Sweden, Cat number: 69104). Twelve nuclear microsatellite primers developed for Pinus taeda (Elsik et al., 2000; Auckland et al., 2002; Liewlaksaneeyanawin et al., 2004; Chagné et al., 2004) and Pinus sylvestris (Soranzo et al., 1998) were selected to genotype all the individuals. The primers used were PtTX2146, PtTX3107, PtTX3116, PtTX4001, PtTX4011, LOP1, LOP3, SPAC 12:5, SPAG 7:14, SPAC 11:8, SsrPt_ctg64 and SsrPt_ctg4487b. Primer SsrPt_ctg64 amplified three different polymorphic microsatellite loci, namely ctg64a, ctg64b and ctg64c, giving 14 loci in total. The primers incorporated fluorescent dyes (D2, D3 and D4). The PCR volume was 25 μl and consisted of 50 ng of genomic DNA template, 0.2 mM of each primer, 0.2 mM of each dNTP, 2.5 μl of 10× Taq buffer (500 mM KCl, 100 mM Tris-HCl, 1% Triton X-100, Promega, Nacka, Sweden, Cat number: A3511), 2 mM of MgCl2 (Promega) and two units of Taq polymerase (Fermentas, Helsingborg, Sweden, Cat number: EP0405). Amplifications were carried out using a Peltier Thermal Cycler PTC-225. The amplification protocol for SPAC 11:8, SPAC 12:5 and SsrPt_ctg64 primers was 5 min at 94 °C; followed by 35 cycles of 1 min at 94 °C, 1 min at 55 °C, 1 min at 72 °C; and finally one cycle for 10 min at 72 °C. The amplification conditions for PtTX4001, PtTX3107, LOP3 and SsrPt_ctg4487b primers were 5 min at 94 °C; followed by touch-down from 55 °C down to 45 °C and 25 cycles of 1 min at 94 °C, 1 min at 45 °C, 1 min at 72 °C; and finally one cycle for 10 min at 72 °C. Primers PtTX3116, PtTX4011, PtTX2146, SPAG 7:14 and LOP1 were amplified under the same touch-down protocol, except that the temperature gradient ran from 60 °C down to 50 °C.
PCR amplifications were resolved in a Beckman Coulter CEQ-8000 using an internal size standard (400 bp size standard) and multiplexing the runs for a maximum of three different SSR loci, and allele scoring was done by using the CEQ system software.

Statistical analysis

GENECLUST (François et al., 2006) is based on the concept of Hidden Markov Random Field (HMRF), which models the spatial dependencies in cluster membership. Hidden Markov models (HMMs) assume that the data are a noisy realization of an underlying process with Markovian dependence. In other words, an HMM is a one-dimensional Markov chain observed in noise (Cappé et al., 2005). HMRFs are generalizations of HMMs to the two-dimensional plane and are therefore suitable for analysis of spatially structured observations. Markov random fields provide a statistically well-founded basis for modeling spatial autocorrelation, which is of major interest to many biological applications (Sokal and Oden, 1978). Markov random field models are motivated by the concept of conditional independence; that is, the dependence of a random variable associated with a particular site on the random variables at all the other sites can be specified by the values of random variables in the neighboring sites only (Ripley, 1981; Cressie, 1993). In population genetics, HMRFs can account for the fact that individuals from spatially continuous populations are more likely to share cluster membership with their close neighbors than with distant individuals. GENECLUST can detect geographical discontinuities in allele frequencies and estimate individual population memberships as an unobservable parameter. To account for the dependencies among cluster labels, GENECLUST uses the Potts model, whose parameter Ψ specifies the importance of spatial interactions. The value of Ψ is generally non-negative. A value of Ψ=0 indicates no spatial dependency; in that case the statistical model used by STRUCTURE (Pritchard et al., 2000) is recovered.
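The role of Ψ can be made concrete with a small sketch of the standard Potts model, in which the prior weight of a label configuration grows with the number of neighboring pairs that share a cluster label. The function names and the four-tree toy neighborhood are hypothetical, and the code illustrates the generic Potts form rather than GENECLUST's internal implementation.

```python
import math

def potts_log_weight(labels, neighbors, psi):
    """Unnormalized log prior: psi times the number of neighbor pairs
    (each counted once) whose cluster labels agree."""
    same = sum(
        1
        for i, nbrs in neighbors.items()
        for j in nbrs
        if i < j and labels[i] == labels[j]
    )
    return psi * same

def potts_conditional(i, labels, neighbors, psi, n_clusters):
    """P(label of site i = k | labels of its neighbors), k = 0..n_clusters-1.
    Only the neighbors of i matter (conditional independence)."""
    counts = [0] * n_clusters
    for j in neighbors[i]:
        counts[labels[j]] += 1
    weights = [math.exp(psi * c) for c in counts]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy example: four trees on a path 0-1-2-3 (symmetric lists).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = [0, 0, 1, 1]

no_spatial = potts_conditional(0, labels, neighbors, 0.0, 2)  # uniform
spatial = potts_conditional(0, labels, neighbors, 0.4, 2)     # favors label 0
```

With Ψ=0 every label is equally likely a priori regardless of the neighbors; with Ψ>0 a tree is pulled toward the labels of its neighbors, which is the regularizing effect described above.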

The first step in GENECLUST is to calculate a neighborhood structure from the geographical coordinates with Dirichlet tiling (also known as Voronoi tessellation). Two sampled individuals are neighbors if their Dirichlet cells share a border. The neighborhood structure for the sampled Scots pine trees is shown in Figure 1. Default priors were used for all parameters, that is, Dirichlet distributions Dir(α, …, α) with α=1 on the allele frequencies fk, a Beta(4, 40) prior on each inbreeding coefficient ϕk and fixed values of the spatial interaction parameter Ψ. GENECLUST also provides an estimate for the actual number of clusters in the data, K. For well-chosen values of Ψ, the hidden Markov model acts as a regularizer, and tends to empty spurious clusters when Kmax exceeds K. In order to avoid problems associated with specification of a single Kmax (Evanno et al., 2005), we used two values, Kmax=2 and Kmax=3.
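The neighborhood rule (two trees are neighbors if their Dirichlet cells share a border) can be sketched through the Delaunay triangulation, which is the dual of the Voronoi tessellation. The brute-force version below uses only the standard library and is meant for small illustrations; it ignores any clipping of cells to a study window, which is an assumption, and the function names and coordinates are hypothetical.

```python
from itertools import combinations

def circumcircle(a, b, c):
    """Return (center_x, center_y, radius_squared), or None if collinear."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2

def voronoi_neighbors(points, tol=1e-9):
    """Neighbor sets from the Delaunay triangulation.

    A triple of points forms a Delaunay triangle if no other point lies
    inside its circumcircle; the triangle edges join Voronoi neighbors.
    Brute force (roughly n^4 operations), for illustration only.
    """
    nbrs = {i: set() for i in range(len(points))}
    for i, j, k in combinations(range(len(points)), 3):
        cc = circumcircle(points[i], points[j], points[k])
        if cc is None:
            continue
        ux, uy, r2 = cc
        empty = all(
            (px - ux) ** 2 + (py - uy) ** 2 >= r2 - tol
            for m, (px, py) in enumerate(points)
            if m not in (i, j, k)
        )
        if empty:
            for a, b in ((i, j), (j, k), (i, k)):
                nbrs[a].add(b)
                nbrs[b].add(a)
    return nbrs

# Hypothetical coordinates in meters: four jittered corners and a center tree.
points = [(0.0, 0.0), (50.0, 3.0), (2.0, 48.0), (53.0, 52.0), (26.0, 25.0)]
nbrs = voronoi_neighbors(points)
```

In this toy layout the central tree neighbors all four corners, while diagonally opposite corners are not neighbors because the central cell separates them. For larger samples a computational-geometry library would be the practical choice.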

We fixed Ψ to different values (0, 0.2, 0.4, 0.6) and compared the DIC (Spiegelhalter et al., 2002) of models with and without inbreeding, and for two values of Kmax. The basic principle of DIC is that models with smaller values are preferred to models with larger values. For a model with parameter θ and for some genetic data, y, the DIC can be computed by adding a penalty term, pD, to the averaged deviance, D̄ = −2 Eθ[log p(y | θ) | y]. The penalty term is meant to represent an effective dimension for θ, which is estimated from the data. The penalty term is usually computed as pD = D̄ + 2 log p(y | θest), where θest represents an estimate of θ. Provided that the deviance, −2 log p(y | θ), is available in closed form, D̄ can easily be approximated from an MCMC run by taking the sample mean of the simulated values. With flat priors or when the likelihood overwhelms the priors, DIC behaves similarly to the Akaike Information Criterion (Akaike, 1974). In general, DIC contains a useful estimate of the effective number of parameters even when many of them are defined as latent variables, as is the case in many hierarchical models. We implemented the DIC for GENECLUST models, where the parameter θ comprised the set of allele frequencies, fk, the set of individual cluster labels, z, and the set of inbreeding coefficients, ϕ, when inbreeding was included in the model. We used posterior average estimates for the allele frequencies and for the inbreeding coefficients as computed by GENECLUST, and estimated the cluster configuration from the cluster membership coefficients, after re-assigning each individual to their most likely cluster.
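The DIC computation described above can be sketched in a few lines, assuming the deviance has already been evaluated at each retained MCMC sweep and at the plug-in estimate of θ. The function name and the input numbers are hypothetical placeholders, not values from this study.

```python
from statistics import fmean

def dic_from_mcmc(deviance_samples, deviance_at_estimate):
    """Compute the DIC from MCMC output.

    deviance_samples: values of -2 log p(y | theta) at each retained sweep.
    deviance_at_estimate: -2 log p(y | theta_est) at the plug-in estimate.
    """
    d_bar = fmean(deviance_samples)      # averaged deviance
    p_d = d_bar - deviance_at_estimate   # effective number of parameters
    return {"D_bar": d_bar, "pD": p_d, "DIC": d_bar + p_d}

# Placeholder numbers chosen only to make the arithmetic easy to follow.
result = dic_from_mcmc([100.0, 102.0, 104.0], 98.0)
```

Here the averaged deviance is 102, the penalty is 102 − 98 = 4, and the DIC is 102 + 4 = 106; among competing models the one with the smallest DIC would be preferred.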

Paternal genotypes were reconstructed using the Bayesian program Parentage 1.0 (Emery et al., 2001). We used two chains and the Metropolis-coupled MCMC option. Burn-in was set to 100 000 iterations, and sampling was based on the next 100 000 iterations (with a thinning of 10). The prior for the allele frequencies was the standard Dirichlet distribution, and the prior for the number of fathers and mothers was Unif(1,96). We used Model 1, in which each male is equally likely to be the father of any offspring.

Results

The number of alleles and their frequencies are available in the online supplement. The number of alleles per SSR locus ranges from 2 (ctg64a) to 47 (SPAC12:5). As expected, the SSR loci that originated from cDNA libraries (for example, ctg64) showed lower allelic richness than genomic SSR loci. As reported earlier in the literature, the SPAC12:5 locus turned out to be highly polymorphic (Soranzo et al., 1998).

For each value of Kmax=2–3, for levels of the spatial interaction parameter Ψ ranging from 0 to 0.6, and considering models with and without inbreeding, we computed an average of the DIC over the best 20 values obtained after 100 replicates of GENECLUST runs. Using a total of 5000 sweeps for the MCMC program and discarding the 2500 first sweeps as a burn-in period, we carried out a total number of 1200 runs and compared 12 models.
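The averaging step can be sketched as follows: for one model, keep the lowest DIC values among the replicate runs and report their mean and spread. The function name and the input values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import fmean, stdev

def summarize_best_runs(dic_values, n_best=20):
    """Mean and sample s.d. of the n_best lowest DIC values."""
    best = sorted(dic_values)[:n_best]
    return fmean(best), stdev(best)

# Placeholder DIC values for 100 replicate runs of a single model.
replicate_dics = [float(v) for v in range(100, 200)]
mean_best, sd_best = summarize_best_runs(replicate_dics)
```

Restricting the average to the best runs reduces the influence of replicates that failed to converge within the allotted sweeps.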

Runs with Ψ greater than zero generally converged to a single cluster. Runs with Ψ=0 generally ended with a large majority of individuals assigned to a single cluster, and with a small minority, not exceeding five individuals, sometimes assigned to a second cluster. Table 1 reports DIC values corresponding to each model. Averaged DICs ranged from 9149 to 9270, with s.d.'s ranging from 7 to 30. The smallest values of the DIC were reached for models with inbreeding and for a spatial interaction parameter around 0.2–0.4. For these models, the effective number of parameters was estimated around 505. The highest values were reached for models without inbreeding and with the spatial parameter Ψ set to 0. All models without inbreeding performed worse than those including inbreeding. Posterior estimates of the inbreeding coefficient were computed for models with Ψ=0.4 by pooling the 20 runs with the lowest DICs (DIC<9140). The posterior mean of the inbreeding coefficient was equal to 0.248 (median=0.249), with a 95% credibility interval ranging from 0.217 to 0.283. The posterior s.d. was estimated to be 0.018. Figure 2a shows a histogram for 10 000 simulated values from the posterior distribution. Using the same procedure, we computed posterior estimates from models with Ψ=0. The posterior mean of the inbreeding coefficient was equal to 0.249 (median=0.249), with a 95% credibility interval ranging from 0.215 to 0.284. The posterior distribution was not different from the one obtained by using higher values of Ψ (Figure 2b). These results may indicate that, in the case of absence of population structure, the estimation of the inbreeding coefficient is robust to the presence of a small amount of spatial autocorrelation in the data. To conclude, the values of the DIC indicated that a model with a single estimated cluster, relatively high levels of inbreeding and a moderate amount of spatial dependencies within the unique population best explains the data.
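Posterior summaries of the kind reported above (mean, median, s.d. and an equal-tailed 95% credibility interval) can be obtained directly from simulated values. In the sketch below the draws are synthetic stand-ins generated from a Beta distribution with mean 0.25; they are not the GENECLUST output, and the function name is hypothetical.

```python
import random
from statistics import fmean, median, quantiles, stdev

def posterior_summary(draws):
    """Mean, median, s.d. and equal-tailed 95% interval of posterior draws."""
    cuts = quantiles(draws, n=40, method="inclusive")  # steps of 2.5%
    return {
        "mean": fmean(draws),
        "median": median(draws),
        "sd": stdev(draws),
        "ci95": (cuts[0], cuts[-1]),  # 2.5th and 97.5th percentiles
    }

# Synthetic placeholder draws (assumption), 10 000 values with mean 0.25.
rng = random.Random(0)
draws = [rng.betavariate(144, 432) for _ in range(10_000)]
summary = posterior_summary(draws)
```

Pooling draws from several runs, as done for the 20 lowest-DIC runs, amounts to concatenating their samples before calling such a function.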

Table 1 Average values of the deviance information criterion (DIC) for spatial genetic models with Kmax=2–3 clusters, with levels of the spatial interaction parameter Ψ ranging from 0 to 0.6, and with absence (n) or presence (y) of inbreeding

Figure 2 Posterior density of the inbreeding coefficient from the 20 models with lowest values of DIC, computed from GENECLUST using Kmax=2 clusters (10 000 simulations). (a) Model with spatial parameter Ψ=0.4. (b) Model with spatial parameter Ψ=0.

Analyses of paternity were carried out in order to reconstruct the parental genotypes. The analyses supported 19 fathers and 15 mothers as the progenitors of the 96 sampled trees. The analyses identified a very small number of full- and half-sibs. By setting a probability limit at 0.9, we found the following full-sib pairs: 27–37 (P=0.973), 23–78 (P=0.999), 13–90 (P=0.906), 54–87 (P=0.916); and half-sib pairs: 27–37 (P=0.984), 47–49 (P=0.933), 13–90 (P=0.956), 24–78 (P=1.000) and 54–87 (P=0.955). Note that 27–37 can be both half-sibs and full-sibs with very high probability. Hence, we can conclude that the sampled trees do not belong to a single full-sib family, as the inbreeding level alone might suggest. Instead, the seed trees must already have shared some form of relationship before they were selected.

Discussion

The fine-scale genetic structure of 96 geographically mapped Scots pine trees from a stand in northern Sweden was analyzed using 14 SSR loci. Our sample size and number of loci have earlier been shown to be sufficient for the study of spatial genetic structure in tree populations (Cavers et al., 2005). Assignment analysis carried out with the program GENECLUST (François et al., 2006) and model comparison based on the DIC show that a model with a single estimated cluster, with high levels of inbreeding and with a moderate amount of spatial dependencies within the unique cluster (Ψ=0.2–0.4), best explains the data. Although the DIC has been used earlier to decide which runs of a Bayesian clustering program should be kept after a multiple-run analysis (François et al., 2008), its systematic use for deciding which model best fits the data is new in this context. The four versions of DIC implemented in this study were motivated by the fact that these measures could be directly and easily computed from the output of the program.

Different approaches have been used before to evaluate spatial clustering of genotypes within stands (or populations). The simplest method is to assess the degree of clustering by plotting the trees on a map (Knowles, 1991). Another method is based on dividing the stand into subplots and estimating the among-subplots differentiation by means of gene diversity (Gst or Fst) (Streiff et al., 1998). The most commonly used procedure estimates the similarity between pairs of genotypes or subplots (based on allele frequencies) within a specified distance and evaluates whether the pairs are more similar than expected by chance under random spatial arrangement (Epperson, 1992; Parker et al., 2001; Cavers et al., 2005). However, these methods are not very useful when dealing with inbred populations.

Only a few simulation studies have evaluated the effect of inbreeding on results from assignment and cluster analysis. Guinand et al. (2006) carried out a simulation study that investigated how different levels of inbreeding (F=0, 0.05 and 0.15) influenced the accuracy of assignment analysis. They concluded that inbreeding had no effect on the accuracy of the assignments. However, it should be noted that Guinand et al. (2006) used a version of STRUCTURE that does not allow for proper modeling of inbreeding. Gao et al. (2007) presented a method (implemented in a program called InStruct) that extends the algorithm in STRUCTURE by eliminating the assumption of the Hardy–Weinberg equilibrium within clusters. Based on extensive simulations with various levels of selfing, they showed that their approach could avoid spurious signals of population substructure that could lead to biased assignments. However, the differences in assignment bias compared with STRUCTURE were mostly relatively small, and they did not evaluate how estimation of K was influenced by inbreeding. In addition, the program InStruct does not allow inference based on spatially explicit priors, as does GENECLUST, and is therefore less appropriate for analysis of our data.

Our results indicate a single estimated cluster and a relatively high overall inbreeding coefficient of 0.250, which corresponds to the co-ancestry of one full-sib family or of a mixture of half-sibs and full-sibs established from already related seed trees. Based on the parentage analysis, the results of which support a total of 19 fathers and 15 mothers and only nine pairs of trees that were either full- or half-sibs, we can conclude that the trees do not form a single full-sib family. The high overall inbreeding coefficient contrasts with the efficient mechanisms for purging inbreds described in Scots pine (Muona et al., 1987; Kärkkäinen and Savolainen, 1993). On the other hand, this apparent contradiction could be explained if some of the trees are the result of mating between already related parent trees, as supported by the parentage analysis.
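The correspondence between an inbreeding coefficient of 0.25 and full-sib co-ancestry follows from standard pedigree arithmetic, sketched below under the textbook assumption that the founders are unrelated and non-inbred. The symbols F_X (inbreeding coefficient of an individual X) and θ_PQ (co-ancestry of its parents P and Q) are introduced here for illustration only.

```latex
% Inbreeding of an individual equals the co-ancestry of its parents:
F_X = \theta_{PQ}

% Co-ancestry for common relationships (non-inbred, unrelated founders):
\theta_{\text{half-sibs}} = \tfrac{1}{8} = 0.125, \qquad
\theta_{\text{full-sibs}} = \tfrac{1}{4} = 0.25, \qquad
\theta_{\text{self}} = \tfrac{1}{2} = 0.5

% Hence offspring of full-sib matings are expected to have
F_X = 0.25
```

An estimate near 0.25 is therefore what mating among full sibs would produce, whereas a mixture of half-sib matings (0.125) with closer matings or with parents that were themselves related could give the same average.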

In Scots pine, increased selfing is generally not a problem in natural stands because of heavy selection against inbreds at the seed and seedling stages (Muona et al., 1987; Kärkkäinen and Savolainen, 1993). However, in low-density stands (partially harvested forests), the remaining inbred seedlings may be eliminated less efficiently due to lower competition. Natural regeneration from a few trees can potentially affect the population structure and mating system because of the reduced initial reproductive population size. Studies carried out in managed pine populations support changes in the level of inbreeding after natural regeneration from partially removed forests, but the degree and direction of the disturbance vary among reports. Although some studies indicate an increase in the inbreeding level (Rudin et al., 1977; Farris and Mitton, 1984), others support absence (Yazdani et al., 1989), or even decreased inbreeding (Marquardt et al., 2007). Yazdani et al. (1989) concluded that seed trees had a low genetic contribution to regeneration compared with seeds from felled trees and surrounding trees. The discrepancies among studies may be because of factors such as percentage of tree removal (final forest density), level of gene inflow from the surrounding forest and level of genetic relatedness between the trees left after harvesting.

Spatial analyses have shown some level of clustering of genotypes within populations of other conifers (Knowles et al., 1992; Cavers et al., 2005). However, cluster sizes are quite small (5–50 m across), suggesting that they are primarily a result of limited seed dispersal, and that the clusters are made up of close relatives. Knowles et al. (1992) compared spatial genetic clustering in two stands of Larix laricina. One stand had naturally regenerated after clear-cutting, presumably from a few remnant individuals scattered within the stand, and showed significant spatial clustering of genotypes. No clustering, however, was observed in a nearby old-field stand.

Molecular markers can be combined with powerful Bayesian statistical methods to jointly estimate spatial genetic structure and inbreeding in tree populations. The resulting genetic parameters could potentially be used for monitoring forest-management practices, but results from comparative experiments with managed and non-managed stands are needed before we can draw a final conclusion. Results from this kind of study would be of special relevance in forestry, wherein the observation of long-term effects of forest management on genetic structures is delayed by the long rotation cycles.