Plant species develop strong genetic structure, that is, a nonrandom distribution of genotypes (Vekemans and Hardy, 2004), at a variety of spatial scales due to their sedentary nature (Silvertown, 2001). In certain circumstances, for example, colonisation of new habitat, spatial genetic structure (SGS) may develop very quickly (<10 generations) and be highly persistent (Epperson, 1990), although patterns may be dynamic, changing with population age as phenomena such as dispersal-independent selection, self-thinning and succession begin to act (Hamrick et al, 1992, 1993; Epperson, 1993; Epperson and Alvarez-Buylla, 1997; Chung et al, 1998; Jensen et al, 2003).

The strength and spatial magnitude of population structuring may influence and be influenced by a variety of factors, including historical processes (vicariance, dispersal) and selection (Epperson and Li, 1996, 1997). At a population scale, interspecific differences in the partitioning of variation are due largely to life form and breeding system, and several syntheses (Hamrick et al, 1992; Degen et al, 2001a; Vekemans and Hardy, 2004; Ward et al, 2005) have identified generalisable trends. For example, selfing species generally maintain strong genetic structure, while among outcrossing species, animal-mediated pollen and gravity-mediated seed dispersal mechanisms create stronger patterns. However, in general, at the population level, despite the potential influence of highly localised factors such as spatial variation in the distribution of species and selection for microhabitat variation (Levin and Kerster, 1974; Epperson, 1993; Doligez et al, 1998; Degen et al, 2001a), SGS is predominantly a consequence of limited seed and pollen dispersal (Epperson and Li, 1997; Doligez et al, 1998; Degen et al, 2001a; Epperson, 2004; Vekemans and Hardy, 2004).

Conservation of forest genetic resources and the development of forest management plans that account for intraspecific genetic diversity are of significant contemporary interest, as part of global efforts to preserve biodiversity and ensure environmental sustainability (Lowe et al, 2005; UN, 2000; Kanashiro et al, 2002). Natural forests that come under management for production, sustainable or otherwise, are likely to experience considerable disruption of SGS (Young and Merriam, 1994; Degen et al, 2001a; Lowe et al, 2003). It should be a key aim for management plans to tailor extraction such that this disturbance is minimised and that remnant genetic structure is sufficient to promote regeneration and maintenance of genetic diversity (Lowe et al, 2005). To advance these efforts, several recent studies have taken advantage of new, highly variable genetic markers to conduct detailed analysis of tree populations and explore the link between limitations to seed and pollen dispersal and patterns of spatial genetic variation observed on the ground (Doligez and Joly, 1997; Geburek et al, 1998; Strieff et al, 1998; Degen et al, 2001a; Cottrell et al, 2003; Latouche-Halle et al, 2003).

Most commonly, analysis of SGS is approached using spatial autocorrelation methods (Sokal and Oden, 1978), comparing patterns of genetic variation with geographical distribution. In contrast to population genetic estimators (FST and related statistics), which require averaging across populations or hierarchical levels, spatial autocorrelation uses data from all pairs of individual locations across the sample surface and therefore accesses much more of the available information at the population scale (Epperson and Li, 1997). In addition, spatial autocorrelation makes no assumptions about the spatial scale of structuring in populations (Epperson, 1989; Heywood, 1991; Chung et al, 1998).

Multilocus measures using genetic distances have been shown to be very sensitive in detecting SGS (Smouse and Peakall, 1999). However, the statistical power of the technique depends on actual population structure, size of sample, and aspects of the scale, orientation and distribution of locations across the population surface (Kremer et al, 2005; Epperson and Li, 1996). In other words, the pattern and magnitude of sampling relative to the population are critical. For population genetic questions, the choice of molecular marker is also of great importance. In this analysis we aim to determine, for a dominant (amplified fragment length polymorphism, AFLP) and a codominant (microsatellite) marker, an optimal sampling strategy, that is, the numbers of markers and individuals to be sampled, for reliable estimation of SGS. We use a simulated population based on actual field data to determine, for a variety of sampling strategies and for dominant and codominant markers, limits for meaningful estimation of SGS, and use these limits to explore and criticise some recent data sets.


The model Eco-Gene (Degen et al, 1996; Degen and Roubik, 2004) was used to generate two artificial data sets (dominant and codominant) from field data. Using diameter distribution and density data for the neotropical tree species, Symphonia globulifera, at a permanent sample plot at Paracou, French Guiana (Figure 1), a population of 1900 trees in a 1200 m × 1200 m area (144 ha) was simulated. Initial codominant (microsatellite) and dominant (AFLP) data sets were created by distributing genotypes across this population. Each tree was given an artificial genotype of (a) 100 microsatellite loci and (b) 100 AFLP loci. Microsatellite genotypes were generated based on actual allele frequencies of three tropical tree species (Symphonia globulifera, Dicorynia guianensis and Sextonia rubra; Degen et al, 2001a). For the AFLP data set we created 100 loci with two alleles (1 and 2). The frequency of allele 1 was evenly distributed over the 100 loci from 5 to 95% (5% intervals). Initially, for both data sets, genotypes were in Hardy–Weinberg proportions and there was no SGS.
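As an illustration of the starting conditions described above, the following minimal sketch draws AFLP-type genotypes in Hardy–Weinberg proportions with no spatial structure. It is not the Eco-Gene implementation. Two points are assumptions: the way 100 loci are mapped onto the 19 frequency steps from 5 to 95% (here, by cycling through them) and the uniform random tree coordinates (the study used plot-based density data). All function names are hypothetical.

```python
import random

def aflp_allele_frequencies(n_loci=100):
    """Frequency of allele 1 per locus, cycling through 5%, 10%, ..., 95%.

    Assumption: the text does not say how 100 loci map onto 19 steps.
    """
    steps = [k / 100 for k in range(5, 100, 5)]  # 19 frequency values
    return [steps[i % len(steps)] for i in range(n_loci)]

def random_genotype(freqs, rng):
    """Draw two alleles independently per locus (Hardy-Weinberg proportions)."""
    genotype = []
    for p in freqs:
        alleles = sorted(1 if rng.random() < p else 2 for _ in range(2))
        genotype.append(tuple(alleles))  # (1, 1), (1, 2) or (2, 2)
    return genotype

def simulate_population(n_trees=1900, side_m=1200.0, n_loci=100, seed=1):
    """Return a list of ((x, y), genotype) with no spatial genetic structure.

    Assumption: coordinates are uniform random over the square plot.
    """
    rng = random.Random(seed)
    freqs = aflp_allele_frequencies(n_loci)
    trees = []
    for _ in range(n_trees):
        xy = (rng.uniform(0, side_m), rng.uniform(0, side_m))
        trees.append((xy, random_genotype(freqs, rng)))
    return trees
```

Because alleles are drawn independently of position, any spatial pattern in such a population arises only by chance, which matches the stated starting point of the simulation.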

Figure 1

Diameter distribution of Symphonia globulifera in 144 ha of forest, from the experimental trial Paracou, French Guiana.

Eco-Gene was then used to simulate the SGS that would develop in this population after 1000 years given limited pollen and seed dispersal (for details of model functions see Degen et al, 1996; Degen and Roubik, 2004). Pollen and seed dispersal curves were based on data for relatively abundant tropical tree species, as measured at an experimental plot at Paracou, French Guiana (Figure 2). The SGS of this population at the end of the simulations was used as the ‘real’ SGS for comparison in subsequent analyses. Using the same input data sets, four repetitions of the 1000-year run were carried out, allowing between-repetition comparison of the pattern of SGS established at the end of the simulation. For each repetition, mean pairwise genetic distance was calculated for trees distributed in 10 distance classes of 50 m each (50–500 m, Figure 3a, b). A distance-based approach was selected for the analysis of SGS as it can be applied to both dominant and codominant multilocus data, with the qualities/limitations of the marker type taken into account through selection of appropriate distance measures. For the microsatellite data sets, genetic distance was estimated using Gregorius' distance, DG (Gregorius, 1978):

$$D_G = \frac{1}{2} \sum_{k=1}^{n} \left| p_{ik} - p_{jk} \right|$$

where i and j represent two populations, n is the number of alleles or haplotypes, and pik is the relative frequency of the kth allele or haplotype in population i.
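A minimal sketch of Gregorius' distance applied to a pair of diploid individuals may help make the definition concrete. Each individual is treated as a "population" of two alleles per locus, so within-individual allele frequencies are 0, 0.5 or 1. Averaging the single-locus values over loci is an assumption made here for illustration, and the function names are hypothetical.

```python
from collections import Counter

def gregorius_locus(geno_i, geno_j):
    """D_G at one locus: half the summed absolute allele-frequency differences."""
    count_i, count_j = Counter(geno_i), Counter(geno_j)
    alleles = set(count_i) | set(count_j)
    return 0.5 * sum(
        abs(count_i[a] / len(geno_i) - count_j[a] / len(geno_j))
        for a in alleles
    )

def gregorius_multilocus(ind_i, ind_j):
    """Mean of single-locus distances (averaging over loci is an assumption)."""
    per_locus = [gregorius_locus(a, b) for a, b in zip(ind_i, ind_j)]
    return sum(per_locus) / len(per_locus)

# Two individuals scored at two microsatellite loci (allele sizes are made up).
tree_a = [(120, 124), (88, 88)]
tree_b = [(120, 120), (90, 92)]
print(gregorius_multilocus(tree_a, tree_b))  # 0.75
```

At the first locus the individuals share one allele (distance 0.5); at the second they share none (distance 1.0), giving a mean of 0.75.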

Figure 2

Simulated limited pollen and seed dispersal.

Figure 3

Distogram of the spatial genetic structure at (a, top) 100 microsatellite loci for four repetitions after 1000 simulated years; (b, bottom) 100 AFLP loci for four repetitions after 1000 simulated years.

For AFLP data sets, allele 1 was assumed to be dominant over allele 2, hence the genotypes 11 and 12 were transformed to 1 and the genotype 22 to 0, creating a binary matrix of 1 and 0. Genetic distance was then estimated using Tanimoto's distance, Dij (Degen et al, 2001b):

$$D_{ij} = 1 - \frac{v_{ij}}{v_{ij} + y_i + y_j}$$

where vij represents the number of loci scored as 1 in both individuals i and j, and yi and yj are the numbers of loci that score 1 in only individual i or only individual j, respectively.
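The dominance transformation and Tanimoto's distance can be sketched as follows, using the quantities vij, yi and yj as defined in the text and the usual form of the distance (one minus shared bands over all bands present in either individual). The function names are hypothetical, and returning 0 when neither individual has any band is an assumption to avoid division by zero.

```python
def to_dominant(genotype):
    """Score a locus 1 if allele 1 is present (genotypes 11, 12), else 0 (22)."""
    return [1 if 1 in locus else 0 for locus in genotype]

def tanimoto_distance(bands_i, bands_j):
    """Tanimoto distance between two binary band profiles of equal length."""
    v_ij = sum(1 for a, b in zip(bands_i, bands_j) if a == 1 and b == 1)
    y_i = sum(1 for a, b in zip(bands_i, bands_j) if a == 1 and b == 0)
    y_j = sum(1 for a, b in zip(bands_i, bands_j) if a == 0 and b == 1)
    total = v_ij + y_i + y_j
    if total == 0:
        return 0.0  # assumption: two empty profiles are treated as identical
    return 1.0 - v_ij / total

# Four loci; genotypes use alleles 1 (dominant) and 2 (recessive).
tree_a = to_dominant([(1, 1), (1, 2), (2, 2), (1, 2)])  # [1, 1, 0, 1]
tree_b = to_dominant([(2, 2), (1, 2), (2, 2), (1, 1)])  # [0, 1, 0, 1]
print(tanimoto_distance(tree_a, tree_b))  # 1 - 2/3
```

Note that loci scored 0 in both individuals do not contribute, which is one reason monomorphic or low-frequency bands add little information.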

Using the simulated population to determine ‘real’ SGS, a series of sampling strategies (ie variations of the numbers of individuals and loci used) were then tested for both codominant and dominant data sets. For the tests, the program SGS v1.0c (Degen et al, 2001b) was used to analyse spatial autocorrelation in the data sets. Again, mean pairwise genetic distances for microsatellite and AFLP data sets were computed using Gregorius' and Tanimoto's distances, respectively. For each marker type, random samples of 50, 100, 150 and 200 individuals were drawn from the simulated population. At each sample size, a series of data sets were generated with increasing numbers of loci and used to estimate SGS (1, 5, 10, 20, 50, 100 loci). In each case, the estimated SGS, as determined from the sampled data set, was correlated with the ‘real’ SGS as determined for the full simulated population (Figure 4a,b). Each sampling strategy (number of individuals, number of loci) was repeated 100 times and a mean correlation coefficient calculated.
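The subsampling procedure can be sketched roughly as below. This is an illustrative outline, not the SGS program. The distogram uses 10 classes of 50 m, each subsample's distogram is correlated (Pearson) with that of the full population, and the coefficient is averaged over repetitions. The `mismatch` distance is a placeholder standing in for Gregorius' or Tanimoto's distance, and all names are hypothetical.

```python
import math
import random
from itertools import combinations

def mismatch(g_i, g_j):
    """Placeholder genetic distance: share of loci with different scores."""
    return sum(a != b for a, b in zip(g_i, g_j)) / len(g_i)

def distogram(coords, genos, dist, class_width=50.0, n_classes=10):
    """Mean pairwise genetic distance per spatial distance class."""
    sums, counts = [0.0] * n_classes, [0] * n_classes
    for i, j in combinations(range(len(coords)), 2):
        k = int(math.dist(coords[i], coords[j]) // class_width)
        if k < n_classes:
            sums[k] += dist(genos[i], genos[j])
            counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

def pearson(x, y):
    """Pearson correlation over classes that are populated in both series."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("constant distogram; correlation undefined")
    return sxy / math.sqrt(sxx * syy)

def mean_correlation(coords, genos, n_ind, n_loci, reps=100, seed=0, dist=mismatch):
    """Average correlation between the full and the subsampled distograms."""
    rng = random.Random(seed)
    real = distogram(coords, genos, dist)
    total = 0.0
    for _ in range(reps):
        ind = rng.sample(range(len(coords)), n_ind)
        loci = rng.sample(range(len(genos[0])), n_loci)
        sub_xy = [coords[i] for i in ind]
        sub_g = [[genos[i][l] for l in loci] for i in ind]
        total += pearson(real, distogram(sub_xy, sub_g, dist))
    return total / reps
```

Calling `mean_correlation` over a grid of `n_ind` in (50, 100, 150, 200) and `n_loci` in (1, 5, 10, 20, 50, 100) would reproduce the layout of the comparison, although the numerical results depend on the population and distance measure supplied.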

Figure 4

Mean correlation between ‘real’ distogram and the distogram drawn from series of subsamples for (a, top) microsatellites (number of sampled loci=1, 5, 10, 20, 50, 100; number of sampled individuals=50, 100, 150, 200), and (b, bottom) AFLPs (number of sampled loci=1, 5, 10, 20, 50, 100; number of sampled individuals=50, 100, 150, 200).

For both codominant and dominant data, the simulated results were used to make recommendations on the minimum sample size and number of loci necessary for meaningful determination of SGS. For a number of data sets drawn from published and new studies (Table 1), the relationship between the number of individuals sampled and number of loci used in the simulated data was explored using a resampling approach. SGS was analysed in subsamples of loci or individuals and distograms for each subsample were correlated with that for the full data set. Each subsampling was repeated 100 times and the mean correlation reported. While such resampling of data sets inevitably introduces error, the trends revealed are informative and permit an evaluation of the effort used during collection of the data set. Variation of correlation according to the numbers of loci used was examined using two data sets for Mahogany (Swietenia macrophylla): an AFLP data set of 215 markers, N=46 (Lowe et al, 2003) and a microsatellite data set of eight loci, N=93 (Lemes et al, 2003). The Mahogany AFLP data set contained a high proportion of low frequency or monomorphic markers, such that only 44 of the 215 loci were polymorphic (frequency of >0.05). The full data set was used as published (Lowe et al, 2003) to examine variation in locus numbers but, for analysis of SGS, the data set was reduced to include only polymorphic loci. In addition, variation in correlation according to the numbers of individuals sampled was examined for four microsatellite data sets: S. macrophylla (8 loci, N=93; Lemes et al, 2003), Sextonia rubra (4 loci, N=184; Hardy unpublished), Dicorynia guianensis (6 loci, N=154; Degen et al, 2001a) and Symphonia globulifera (3 loci, N=148; Degen et al, 2004) and a single AFLP data set: Eugenia uniflora (109 loci, N=278; Salgueiro, unpublished).

Table 1 Real data sets used for resampling, including seed and pollen dispersal characteristics, data set size and sampling area and density as determined from supplied coordinates

For each real microsatellite and AFLP data set, the pattern of SGS was determined using the program SGS v1.0c, in all cases using 1000 permutations of the data set to obtain 95% confidence intervals. These were assessed in the light of the results from the simulations and resamplings.
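A permutation envelope of this kind can be sketched as follows. The sketch assumes that the null distribution is built by randomly reassigning genotypes to tree positions, which is a common approach but is not confirmed here as the exact procedure used by SGS v1.0c. The `mismatch` distance is a placeholder, the percentile rule is a simple index-based approximation, and all names are hypothetical.

```python
import math
import random
from itertools import combinations

def mismatch(g_i, g_j):
    """Placeholder genetic distance: share of loci with different scores."""
    return sum(a != b for a, b in zip(g_i, g_j)) / len(g_i)

def permutation_envelope(coords, genos, dist=mismatch, n_perm=1000,
                         class_width=50.0, n_classes=10, seed=0):
    """Observed distogram plus lower/upper 95% limits under random structure."""
    rng = random.Random(seed)
    n = len(coords)
    # Spatial class of every pair is fixed; only genotype labels are shuffled.
    pairs = []
    for i, j in combinations(range(n), 2):
        k = int(math.dist(coords[i], coords[j]) // class_width)
        if k < n_classes:
            pairs.append((i, j, k))
    # Genetic distances are computed once and looked up for each permutation.
    gd = [[dist(genos[a], genos[b]) for b in range(n)] for a in range(n)]

    def class_means(order):
        sums, counts = [0.0] * n_classes, [0] * n_classes
        for i, j, k in pairs:
            sums[k] += gd[order[i]][order[j]]
            counts[k] += 1
        return [s / c if c else None for s, c in zip(sums, counts)]

    order = list(range(n))
    observed = class_means(order)
    null = []
    for _ in range(n_perm):
        rng.shuffle(order)  # assumption: genotypes permuted over positions
        null.append(class_means(order))

    lower, upper = [], []
    for k in range(n_classes):
        if observed[k] is None:  # no pairs in this distance class
            lower.append(None)
            upper.append(None)
            continue
        values = sorted(m[k] for m in null)
        lower.append(values[int(0.025 * (n_perm - 1))])
        upper.append(values[int(0.975 * (n_perm - 1))])
    return observed, lower, upper
```

A distance class would then be read as showing significant structure where the observed value falls outside the interval between `lower` and `upper`.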


Repeated simulations of the development of SGS over 1000 years in the model population produced highly consistent patterns (Figure 3a, b). In addition, the scale of SGS observed, that is, the distances at which significant spatial autocorrelation was detected, was similar to experimentally determined values observed for other tropical tree species (Degen et al, 2001a).

The sampling strategies evaluated indicate some clear patterns (Figure 4a, b; Table 2). Firstly, in our simulations, AFLP data required much greater sampling effort compared to microsatellite data. For any given sample size many more AFLP loci and greater numbers of individuals were required to achieve the same degree of correlation as for microsatellites. The pattern is clearly evident in comparison of the trends observed for variation in the estimates from the 100 repetitions carried out (Figure 5). Estimates derived from microsatellite data achieve consistency much more rapidly than those from AFLP data sets. The contrast is a consequence of the lower information content and lower allele numbers per locus of dominant markers as compared to codominant markers (Lynch and Milligan, 1994).

Table 2 Mean correlation for microsatellites and AFLPs between the real distogram and the distogram drawn from different samples

Figure 5

Variation in standard deviation of mean correlation between ‘real’ SGS and SGS estimated from series of subsamples (see Figure 4a, b). Each subsampling was repeated 100 times. Top: values for microsatellite subsample data sets. Bottom: values for AFLP subsample data sets.

For both marker types, it is a logical expectation that the more markers and individuals sampled, the better the correlation with ‘real’ SGS. However, for both data types, target sampling ranges are evident, with lower limits below which meaningful estimates of real SGS cannot be made but with upper limits above which the information gain per unit sampling effort declines rapidly.

If a mean correlation of at least 0.9 between real and sampled distogram is taken as a minimum target then, for microsatellites, this can be achieved with a sample of 100 individuals and 10 loci, although close to 0.9 is achievable with five loci. With 200 individuals sampled, five loci can provide 0.95 correlation and little is gained from increasing either locus or individual numbers. For AFLPs, >0.9 correlation can be achieved with a sample of 150 individuals and 100 loci. With 200 individuals sampled, 100 loci provide >0.95 correlation and, again, little is gained for greater effort. It should be noted, however, that this means 100 polymorphic loci (no fixed loci were included in this analysis). With a sample of only 100 individuals, more than 100 polymorphic AFLP loci would be required to approach correlation of 0.9.

In this analysis we have assessed correlation by calculating, in each distance class, the mean Pearson correlation coefficient between the genetic distance in the real population and the values in the sampled population. This is a conservative approach in that the shape of the whole distogram is considered, so the sampling ranges we identify can be considered stringent. In other words, sampling within the ranges identified is likely to allow efficient and accurate estimation of ‘real’ SGS. There is a limitation on the extent to which our results can be considered general, in that correlation between samples and the real population depends, to some extent, on the selection of distance classes. As greater numbers of distance classes are used, the number of data pairs per distance class decreases, introducing more stochastic variation and reducing correlation. Our analysis of the simulated data consistently used 10 distance classes so, within this framework, the patterns of correlation should be robust. As a natural outcome of the balance between physical sampling effort and ensuring that the number of data pairs per distance class is sufficient, selection of around 10 distance classes is commonplace and as such our results should be broadly applicable.

The pattern of correlation observed in the simulations was mirrored in the resampling studies of real data sets (Figures 6 and 7). For both AFLP and microsatellite markers in S. macrophylla, >0.9 correlation is achieved with fewer markers than used in the published analyses (Figure 6; Lemes et al, 2003; Lowe et al, 2003). In the case of the AFLP study, >0.9 correlation with the final data set is achieved with 75 markers. As noted above, the AFLP data set used here was as published (Lowe et al, 2003) and contained a high proportion of low frequency/monomorphic loci (44 polymorphic loci). So, the correlation rapidly approaches 1 as the number of sampled loci approaches the number of polymorphic loci present. Comparing this data set with the simulations, all of the polymorphic loci would have to be included to make any estimation of SGS, and this would still have low correlation with the ‘real’ SGS. For microsatellites, estimation with six loci matched that made with the eight loci used in the published analysis (Lemes et al, 2003). For these data sets, if the scale of real SGS is of the same order as that observed for the simulated population, then even the full AFLP sample will only achieve a correlation of around 0.5 with the ‘real’ SGS, while for the microsatellite data, six loci and N=93 would achieve nearly 0.9 correlation.

Figure 6

Resampling of real data for variation over number of loci. Mean correlation for SSR and AFLP markers between SGS from full data set and that derived from resampled data sets (no. of sampled loci: microsatellites – 1, 2, 3, 4, 5, 6, 7; AFLPs – 25, 50, 75, 100, 125, 150, 175). Note that full AFLP data set is used (no of polymorphic loci=44).

Figure 7

Resampling of real data for variation over number of individuals. Mean correlation for four SSR data sets and one AFLP data set, between SGS from full data set and that derived from resampled data sets (no of sampled individuals: 50, 75, 100, 125, 150).

For both the AFLP and microsatellite data sets that were resampled for numbers of individuals (Figure 7), high levels of correlation with the ‘real’ data set were attained quickly when the number of loci was high. For microsatellites, the rate at which correlation was attained was not strictly dependent on the number of loci (S. globulifera, with three loci, approached full correlation faster than S. rubra, four loci), perhaps suggesting that qualities of the individual locus may become important (eg level of polymorphism), although we did not explore the relationship between allelic richness and SGS calculation efficiency.

Patterns of SGS for each species are shown in Figure 8. The mean pairwise genetic distance in each distance class is plotted together with the level at which genetic structure is random and upper and lower 95% confidence intervals generated by 1000 permutations of the data sets. Significant spatial structure is observed where the observed mean genetic distance is above or below the confidence interval, that is, where observed genetic distance is significantly greater or less than that expected from a random distribution respectively.

Figure 8

Distograms for real data sets used for resampling. All data sets were analysed using the program SGS. Solid central line indicates value at which there is no spatial autocorrelation. Solid line with filled circles indicates observed level of genetic distance. Dotted lines indicate 95% significance levels as determined using 1000 permutations of the actual data set: hollow squares – upper 95% confidence interval, filled squares – lower 95% confidence interval.


Our simulations indicate that, where moderate SGS exists, there are clear target sampling ranges, of both numbers of individuals and loci, within which the effectiveness of a molecular marker type for estimating SGS is maximised. The sampling effort required (of both individuals and loci) is much greater for AFLP markers than for microsatellite markers. Using microsatellites and for species with SGS of the same magnitude as that simulated, once five loci are available, it is much more effective to focus on increasing individual sample numbers than increasing numbers of loci. With five loci and 100 individuals, a correlation of close to 0.9 is achievable. Using dominant markers, the number of both loci and individuals required is much higher: at least 100 loci and 150 individuals. These recommendations are somewhat lower than previous predictions (for example, Geburek and Tripp-Knowles (1994) recommend sampling of 300–400 trees), but it is important to bear in mind that the sampling scheme required will depend strongly on the particular characteristics of the species studied. The key question is how closely the SGS estimated from the analysed sample reflects the actual SGS in the population. To successfully estimate real SGS, the scale and distribution of sampled individuals and the number and type of molecular loci must be carefully considered.

Sampling of individuals

Theoretical expectations are that where the spatial scale of sampling is similar to the spatial scale of the pattern of SGS, the ability to make inferences based on autocorrelation statistics is limited (Slatkin and Arter, 1991). When the spatial scale of sampling is smaller than that of the SGS, autocorrelation can be powerful (Sokal and Oden, 1978; Epperson, 1990, 1993; Sokal et al, 1997). However, the number and distribution of individuals sampled from a population must be carefully considered, with respect to local species distribution, spatial density and expectations of SGS based on life form and breeding system (Hamrick et al, 1992). For example, species with effective long-distance dispersal mechanisms (eg wind-dispersed pollen or animal-dispersed seed) should be expected to show only weak SGS (Vekemans and Hardy, 2004). In this case it is likely to be more efficient to put effort into sampling individuals than increasing the number of markers. In general, if the SGS of a study species is expected to be weak, then sample sizes should be increased above our recommendations.

The orientation of sampling with respect to the distribution of individuals on the ground is also important. Our simulations used a random sampling of individuals across the whole population. In reality for tropical tree species, it is likely that sampling will be biased towards clustered individuals, due to the difficulty of locating low-density target species in species-rich forest. To counter bias due to distribution, a mixed sampling strategy is probably best, balancing high-density local sampling with wider scale coverage, for example, using transects in multiple dimensions (Vekemans and Hardy, 2004). At the same time, however, the sampling strategy must ensure that sufficient numbers of pairwise comparisons are produced in each distance class to achieve statistical significance (a minimum of 30 pairs per class is recommended; Degen et al, 2001b). A further consideration for sample distribution is the age structure of the population. It is notoriously difficult to successfully estimate age in tree populations, particularly tropical tree populations, but diameter at breast height (DBH) measurements are often used (but see case of Eugenia uniflora below). Where possible these should be taken and incorporated into the data set. In natural stands, SGS is likely to be influenced by age: as populations age, self-thinning and succession will lead to changes in SGS, most likely increasing the spatial extent of patterns (eg Jensen et al, 2003). Therefore, it will be important to account for age when estimating SGS, particularly where comparative analysis is to be attempted.
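Whether a planned sampling layout yields enough pairwise comparisons per distance class can be checked before any genotyping is done. The sketch below counts pairs per class from coordinates alone and flags classes that fall under the 30-pair guideline. The class width and number of classes follow the simulation settings (10 classes of 50 m), and the function names are hypothetical.

```python
import math
from itertools import combinations

def pairs_per_class(coords, class_width=50.0, n_classes=10):
    """Number of pairwise comparisons falling in each distance class."""
    counts = [0] * n_classes
    for a, b in combinations(coords, 2):
        k = int(math.dist(a, b) // class_width)
        if k < n_classes:
            counts[k] += 1
    return counts

def undersampled_classes(coords, min_pairs=30, class_width=50.0, n_classes=10):
    """Indices of distance classes with fewer than min_pairs comparisons."""
    counts = pairs_per_class(coords, class_width, n_classes)
    return [k for k, c in enumerate(counts) if c < min_pairs]

# A single 20-tree transect with 12 m spacing (illustrative layout only).
transect = [(12.0 * i, 0.0) for i in range(20)]
print(pairs_per_class(transect))       # [70, 54, 38, 22, 6, 0, 0, 0, 0, 0]
print(undersampled_classes(transect))  # [3, 4, 5, 6, 7, 8, 9]
```

In this example a single short transect covers only the first three distance classes adequately, which illustrates why local sampling needs to be combined with wider-scale coverage.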

Marker properties

Different molecular markers yield significantly different amounts of information and all require a critical minimum effort to provide a statistically meaningful picture of true SGS. Our simulations have clearly demonstrated the consequences of the lower information content of dominant markers relative to codominant markers. In addition, the markers themselves need to be critically evaluated, in particular the assumption of marker neutrality. Criticism has been made of the use of traditional autocorrelation statistics (eg Moran's I) to address population genetic questions (Slatkin and Arter, 1991), due to the risk that different loci experience different evolutionary forces, rendering averaged statistics meaningless (Hardy and Vekemans, 1999). However, if it is reasonable to assume linkage equilibrium and selective neutrality, then averaging over loci should not introduce bias (Epperson, 2004), although this should be explicitly tested (Kremer et al, 2005).

A further consideration is the allele frequency distribution. We based initial allele frequencies in our simulations on those of a series of neotropical trees (codominant data; Degen et al, 2001a) and on an even distribution of marker presence across 100 loci (dominant data). In both cases, it is possible (even likely) that allele frequency distributions for other species will differ and efficiency of SGS estimation may be affected.

Real data sets

Of the real data sets reanalysed here, most would give good estimates of SGS (ie >0.9 correlation), if the ‘real’ SGS for these species is of the same order as that simulated. For the Swietenia macrophylla AFLP data set, the number of individuals sampled is low (N=46), thus, even with 215 loci, only a weak estimate of the ‘real’ SGS is achievable (likely to achieve around 0.6 correlation with ‘real’ SGS). Furthermore, there are only 44 polymorphic loci present in this data set. So while the inclusion of all loci in the original publication is justified (analysis of genetic diversity), their inclusion for analysis of SGS is not and the extent of SGS estimated here (Figure 8) is much stronger than that found previously (Lowe et al, 2003). In any case, for the purposes of estimating real SGS, the difference between the full and edited data sets is minimal as the number of individuals sampled is so low. Also, the inclusion of fixed loci in this case reduced the magnitude of the SGS detected, but the spatial pattern found was similar to that detected using only polymorphic loci. So, for the S. macrophylla AFLP data set, a greater sampling of both individuals and loci is necessary.

In the case of the Symphonia globulifera microsatellite data set, the number of loci sampled is low (three loci). However, due to the high information content of the marker a good estimate of the ‘real’ SGS is possible (>0.8 correlation with ‘real’ SGS). For this data set, the number of individuals sampled should be sufficient to successfully estimate SGS, and, if further effort were to be made, it would be most efficient to concentrate on obtaining additional loci (two more would give >0.9 correlation).

In terms of effort required to successfully estimate SGS, the microsatellite data sets for S. macrophylla (8 loci, N=93), Dicorynia guianensis (6 loci, N=154) and the AFLP data set for Eugenia uniflora (109 loci, N=278) all achieve a good balance (sufficient but not excessive). In the latter case, fewer individuals could theoretically have been sampled to achieve a successful estimate of SGS (with 100 loci, between 150 and 200 individuals should be sufficient). However, this case illustrates the necessity for considering individual species characteristics. Eugenia uniflora has a shrub form and identifying independent individuals can be difficult (Salgueiro, pers. comm.). As a result, many apparently independent samples may in fact be duplicates of single widespread individuals. Therefore, in this case, the extra sampling is justified in order to ensure a large enough sampling of independent trees. The form of E. uniflora also highlights the difficulty of using DBH to estimate age structure in populations: for this species DBH gives no real indication of individual age due to the shrub structure and high frequency of regrowing stems. For the species Sextonia rubra and Symphonia globulifera, the population sample is sufficient and additional loci would be the most efficient focus for further sampling effort.

Summary and extensions

The simulations and analysis presented provide indications of the numbers of individuals and loci for dominant and codominant markers, necessary for successful estimation of SGS in tree populations. These recommendations are qualified by the requirement that any study must consider the characteristics of its target species (mating system, seed and pollen dispersal mechanisms) and plan sampling and marker selection appropriately. In addition, SGS is a dynamic quality that changes over time with population aging, due to selection, density independent thinning and successional processes (eg Jensen et al, 2003). Such considerations are particularly pertinent for comparative analyses that seek common biological factors responsible for patterns of genetic structure; studies that are being actively pursued to identify key considerations for forest management (Lowe et al, 2005; Ward et al, 2005). For these efforts, it will be critical to ensure that sampling schemes for different species provide statistically meaningful outputs. Using the recommendations detailed here as a guide, sampling for each species can be designed such that estimates of SGS can be confidently expected to mirror real patterns.