Introduction

STRUCTURE (Pritchard et al., 2000) is the most widely used clustering software applied to detect population genetic structure, with more than 1000 citations for its first version (Pritchard et al., 2000) and more than 170 citations for its recent enhanced version (Falush et al., 2003) (source: ISI Web of Science database). STRUCTURE generates clusters based on both transient Hardy–Weinberg disequilibrium (HWD) and linkage disequilibrium (LD) caused by admixture between populations. The program works by clustering individuals in groups, where both linkage and HWD are minimized, and therefore, the presence of LD in the data improves clustering results (Falush et al., 2003). On the other hand, ‘strong’ LD or departure from Hardy–Weinberg equilibrium could lead to an overestimation of the number of clusters detected (Falush et al., 2003).

STRUCTURE deals with two kinds of LD: the first is mixture LD, which occurs across loci even if they are unlinked due to the correlation of allele frequencies ‘because individuals with a large component of ancestry in population k have an excess of alleles that are common in k’ (Falush et al., 2003). The second is admixture LD, which is ‘the correlation that arises between linked markers in recently admixed populations’ (Pritchard and Wen, 2004). This LD occurs because markers are on the same ‘chunk’ of chromosome that derives from an ancestral population. The ‘admixture model’ implemented in the latest version of STRUCTURE (STRUCTURE 2.1; Falush et al., 2003) combines admixture LD with map distances between markers to improve clustering results.

Falush et al. (2003) defined a third kind of LD: the background LD measured between syntenous loci separated by few cM. Background LD is generated by genetic drift, is expected to be strong at short distance, and can generate spurious clustering (Falush et al., 2003). However, the authors have not implemented a way to take that LD into account during clustering yet, and they advise users to avoid too closely linked markers that could be in background LD (Falush et al., 2003). Pritchard and Wen (2004) suggested that the distance between markers should not be below 1 cM for humans, although they did not provide empirical evidence in support of this advice. For many species, however, the lack of a genomic map leads to an inability to separate background LD from other types of LD. Furthermore, events other than admixture, such as population bottlenecks (Lynch and Walsh, 1998) or important demographic changes, could also generate strong LD and increase the occurrence of background LD (Jorde, 2000; Peltonen, 2000; Puffenberger et al., 1994). In this situation, users should not rely on the genomic map alone to decide which loci to use or not. A measure of the ‘strength’ of linkage between loci, such as the correlation rLD (Hill and Robertson, 1968) may be therefore more informative than the genomic map to decide which pairs of loci could be used.

Although STRUCTURE is the most widely used program to identify clusters, only few studies have tested its sensitivity to ecological or genetical constraints. For example, Berry et al. (2004) and Rosenberg et al. (2001) tested the effect of the number of loci on clustering results, whereas Evanno et al. (2005) and Waples and Gaggiotti (2006) tested the effects of variation in dispersal rates among populations on the reliability of the number of clusters detected. Furthermore, no study has tested the effect of the strength of LD on clustering results. In this paper, we use knowledge of both the complete history of a mouflon (Ovis aries) population and the genomic information about the species to study the potential bias caused by strong LD on clustering with STRUCTURE 2.1. The Kerguelen mouflon population was founded in 1957 by two individuals originating from the Vincennes Zoo (Paris). In 1958, the mouflon started to reproduce, and the population reached the size of 100 individuals at the beginning of the 1970s. The population then grew exponentially, reaching about 700 individuals in 1977, corresponding to a density of about 100 individuals per hectare. Since then, the population has been characterized by cyclical dynamics, fluctuating between 250 and 650 individuals, with winter crashes occurring every 3–5 years (Chapuis et al., 1994).

Given the high densities reached during peaks, the low genetic diversity observed (maximum of 4 alleles per loci Kaeuffer et al., 2007), the small size of the island and the young age of the population, we do not expect to detect any genetic structure. Furthermore, the strong founder effect in that population has potentially generated strong background LD. This situation favours the possibility of empirically testing for the effect of background linkage on clustering. Finally, we could calculate the distance between syntenous loci based on a genomic map of Ovis aries (Maddox et al., 2001) and link this distance to the correlation between loci.

In 1993, we sampled 106 individuals that died during the winter crash and for which we knew their exact geographical position. These individuals were genotyped at 22 microsatellite loci: 18 from five linkage groups and seven that were unlinked. We used STRUCTURE 2.1 to estimate the population genetic structure. Using different combinations of loci with variable rLD, we obtained different clustering results and compared them with geographical positions. We then assessed the relationship between the map distance between loci and the strength of LD (that is rLD, a correlation coefficient estimated between pair of loci, Hill and Robertson, 1968), and tested the effects of distance and rLD on the clustering results. We hypothesize that a high rLD value between loci could indicate a potential for the pair of loci to generate spurious clustering with STRUCTURE.

Materials and methods

Population and study site

The population is located on Haute Island, a small island (6.5 km2) of the Kerguelen archipelago. Kerguelen is a very remote Subantarctic archipelago located in the Southern Indian Ocean (49°20′S, 70°20′E). The climate is Subantarctic, with high precipitation, strong winds and average temperature ranging from 1°C during the winter to 8°C in summer. Rocky landscapes dominate Haute Island, with sparse vegetation cover (about 40%) composed of few endemic species (that is Azorella selago and Agrostis magellanica) and introduced forage species (that is Poa sp. and Festuca sp.) (Chapuis et al., 1994). The central part of the island is composed by rocky mountains, reaching 321 m high. Mouflons are restricted to shores and low-altitude prairies protected from the dominant winds. In 1993, tissues samples were collected from 106 lambs carcasses from both sexes found on the ground. Positions were recorded using a 125 m2 grid map of Haute Island.

Genotyping

The samples were kept in 95% ethanol. DNA was extracted using the QIAamp tissue extraction mini kit (Qiagen Inc., Mississauga, Ontario, Canada). PCR amplification was performed at the following 22 ungulate-derived microsatellite loci: ARO28, HEL10, MCM64, MCM152, BM3413, HUJ177, MAF64, MCM527, TGLA13, Ilsts059, TGLA176, RT1, AGLA226, Il2ra, MCM218, NRAMP, OarCP49, TEXAN4, DRBps, INRA26, oMHC1 and TGLA387 (see for details Maddox et al., 2001). Reaction conditions (see http://www.thearkdb.org/) were optimized using temperature and MgCl2 gradient PCR. PCR products were analysed with an automatic sequencer (ABI 3730; Applied Biosystems, Foster City, CA, USA) and read using GENEMAPPER 3.5 software (Applied Biosystems).

We characterized variation at each locus, and also tested for departures Hardy–Weinberg equilibrium, using Genepop (Raymond and Rousset, 1995) available at http://www.wbiomed.curtin.edu.au/genepop/genepop_op1.html. Distances between loci were estimated using the SM3 BestPositions, sex averaged on the Arkdb database (http://www.thearkdb.org/) and Maddox et al. (2001).

LD

We estimated LD between the 22 loci using the correlation coefficient rLD (Hill and Robertson, 1968). The LD correlation coefficient between all pairs of loci was computed using Linkdos software (Garniergere and Dillmann, 1992) (http://www.wbiomed.curtin.edu.au/genepop/linkdos.html). Then, we used Fisher's exact test available in Genepop (Raymond and Rousset, 1995) (http://wbiomed.curtin.edu.au/genepop/genepop_op2.html) to test if genotypes at one locus are independent from genotypes at other loci. Founder events (Lynch and Walsh, 1998) and population dynamics (Slatkin, 1994) can affect LD. We examined the nature of the relationship between rLD and the distance in cM between two loci in the mouflon population.

Population structure

We used STRUCTURE 2.1 (Pritchard et al., 2000; Falush et al., 2003) to infer Haute Island population structure. To infer the number of groups, a fully Bayesian process described by Pritchard et al. (2000) was run with different values of the number of clusters (K). STRUCTURE would attribute a probability Pr(XK) given the data (X), and the log Pr(XK) is used to determine the more likely number of clusters (Pritchard et al., 2000). However, Pr(XK) is computationally difficult to estimate, and Pritchard et al. (2000, p 948, 958) have proposed an ad hoc way to approximate the probability of K given the genotyped data. Pritchard et al. (2000, p 950) warned users that the estimated probabilities should be considered as ‘a guide to which models are consistent rather than accurate estimates of the posterior probabilities’ of K.

STRUCTURE software gives also the assignment probabilities of each individual for each cluster. We used these probabilities to infer the membership of each individual at their most probable groups.

We referred to the sheep genomic map (Maddox et al., 2001) for distances between loci for use in the linkage model. The population was young (less than 50 years since the introduction), and we expected to find no population structure or a weak structure. Allele frequencies should probably be correlated between groups, and thus we used the ‘correlated allele frequencies’ option in the linkage model. The K value that provided the maximum likelihood over the runs was retained as the most probable number of clusters (Pritchard and Wen, 2004). We first ran a series of models with K ranging from 1 to 10, using all loci. We fixed the burn-in period to 500 000 and the running length to 1 000 000 to give consistent results over runs. To verify the consistency of the results, we performed 10 independent runs for each K. Running this first series took us more than 100 h, and we limited the following models to three runs and to a K ranging from 1 to 5. We ran these analyses using three different combinations of selected loci: (1) syntenous loci, (2) non-syntenous loci and (3) syntenous loci separated by large distance (>3 cM). We finally ran STRUCTURE using all the different possible pairs of syntenous loci and estimated the number of clusters for each of these pairs.

We analysed the relationship between the distance separating two syntenous loci and their rLD. Given the non-linear link between the two variables, we fitted a non-linear regression model of the form y=b0+b1*e(−x/th), using the selfStart procedure in the package MASS in R (Venables and Ripley, 2002), where y is the rLD and x is the distance between two loci. We used a generalized linear model (logit link function and binomial distribution) to analyse the effect of rLD on the probability of detecting more than one cluster with STRUCTURE. The number of alleles can affect clustering results (Rosenberg et al., 2001), and was thus included in the model. Two groups of pairs of syntenous loci (that is with an rLD <0.3 and with an rLD>0.5, respectively) provided completely separated number of estimated clusters (that is 1 or >1 cluster, respectively). To account for the bias caused by the complete separation between these two groups, we ran a bias-robust logistic regression that uses a maximum penalized likelihood estimation (Firth, 1993; Heinze and Schemper, 2002). We used the package logistf in R (Ploner et al., 2005).

Results

Heterozygosity

The mean number of alleles per locus was 2.5 and ranged from 2 to 4. The average heterozygosity observed is 0.47±0.012 s.e. None of the 22 loci showed any departure from Hardy–Weinberg equilibrium (exact test P=0.05).

LD and the link with distance on the chromosome

Thirty-six of the 231 pairs of loci showed significant LD (P<0.05). The average rLD over the whole set of loci was 0.112 (range: 0.0005–0.878). Of these 36 pairs, eight were from syntenous loci and showed an rLD averaging 0.64, (Table 1). The 28 other pairs in LD were on different chromosomes and were characterized by an average rLD of 0.09. Only 10 pairs of loci, including the eight pairs from syntenous loci were still in significant LD after correcting for multiple testing (Benjamini and Hochberg, 1995). The average rLD of the two non-syntenous pairs in significant LD after correcting for multiple testing was 0.25. Ten other syntenous loci showed a non-significant LD (Table 1).

Table 1 Linkage groups, distance between loci (in cM) and correlations (rLD) between syntenous loci and their associate P-value (in bold significant P-values)

The correlation coefficient (rLD) between syntenous loci decreased significantly with the distance between them and reached a minimum of 0.08 for distances greater than 30 cM (Figure 2; non-linear regression; best-fitted model parameters: b0±s.e.=0.08±0.03 (t=2.444; P=0.027), b1±s.e.=0.80±0.09 (t=9.438; P<0.0001) and th±s.e.=5.26±1.83 (t=2.873; P=0.012).

Figure 2
figure 2

rLD as a function of the distance between two syntenous microsatellite loci (in cM) within a pair.

Population structure

Different combinations of loci provided different clustering results (Table 2). Using the whole set of loci, the best fitted model provided two clusters (K=2; Table 2). The estimated membership of each individual in each cluster did not correspond to any obvious geographical structure on the island (Figure 1), suggesting that the number of clusters was overestimated here. Analyses run with combination of syntenous loci alone also provided an estimate of two clusters (Table 2). In contrast, both analyses that used only non-syntenous loci or syntenous loci separated by more that 3 cM provided only one cluster.

Table 2 Number of clusters inferred by STRUCTURE for different combinations of loci. In bold: most probable number of clusters (highest log Pr(X/K))
Figure 1
figure 1

Approximate position and cluster membership of all lambs sampled on Haute Island determined with STRUCTURE software using the whole set of loci (for map readability, individuals sampled from same location were slightly shifted). Circles and squares represent membership of each individual.

The strength of the linkage correlation (that is rLD) between two loci significantly increased the probability of overestimating the number of clusters, in an analysis performed with one pair of syntenous loci at a time (Figure 2). The number of alleles in a pair of syntenous loci and the interaction between rLD and the number of alleles did not affect the probability of overestimating the number of clusters (Figure 2). Eleven out of 18 pairs of loci that estimated one cluster were characterized by a rLD <0.12 (mean±s.d.=0.09±0.05). Four analyses with pairs of loci with a rLD ranging from 0.57 to 0.70 (mean±s.d.=0.63±0.06) provided an estimate of two clusters, and three analyses with pairs of loci with a rLD higher than 0.75 (mean±s.d.=0.83±0.06) estimated three clusters. Finally, analyses using the two pairs non-syntenous loci with a significant LD (rLD=0.25) gave an estimate of one cluster.

Discussion

We detected two subpopulations in the Kerguelen mouflon population using our full dataset and the program STRUCTURE. On one hand, this result could be explained by the fact that groups of related individuals occupied restricted portions of the island. For example, observations on 18 marked individuals during their first year of life (average number of observation per lamb=13.6 and range=10–32) showed that lambs occupied small home ranges (average±s.e.=39.0±7.0 Ha) relative to the size of the island (that is 650 Ha). Furthermore, male and female mouflon in Europe are known to be philopatric (Dubois et al., 1993, 1995; Petit et al., 1997; see also Martins et al., 2002) and to reproduce in their natal area (Dubois et al., 1995). Coltman et al. (2003) have also found weak genetic structure on an insular population of sheep. On the other hand, several lines of evidence suggest this explanation is unlikely. First, the two clusters showed a completely overlapping spatial distribution (Figure 1). Such an overlap of clusters could be explained by frequent dispersal or long movements of related individuals just before the winter crash. This explanation is, however, not consistent with behavioural observations made on the population. All the lambs that died during the winter crash were found within the boundaries of their home range. Furthermore, the population was characterized by a low genetic diversity (maximum of 4 alleles Kaeuffer et al., 2007) and a high density relative to other mouflon populations (Boussès and Réale, 1996), which should increase the flow of individuals between areas. Finally, marked males moved easily from one area of the island to another during the rut (Réale et al., unpublished), suggesting an unrestricted gene flow within the population. Thus, it is unlikely for the Kerguelen mouflon population to exhibit genetic structure. As mentioned by Pritchard et al. (2000), the two clusters found in our analyses do not necessarily have a biological meaning and could be caused by LD between the markers used (Falush et al., 2003). To test for the potential bias in STRUCTURE, we used the program Geneland (Guillot et al., 2005a) implanted in the R program (R Development Core Team, 2005). This method consists of a Bayesian model implemented in a Markov Chain Monte Carlo that takes into account individual geographical positions, but that does not consider effect of LD on the genetic correlation between populations. Following Guillot et al. (2005a, 2005b) recommendations, we first run a model to determine the most probable number of populations. Distribution of posterior probabilities showed a mode at K=2 populations, although the frequency of K=1 was also very high. However, after fixing K=2 and running the model again, all the individuals were assigned to the same population. This result confirms that there is no population genetic structure in the Kerguelen mouflon population (for more discussion on Geneland vs STRUCTURE models see Coulon et al., 2006).

In support to the hypothesis that LD can generate spurious number of clusters, we showed that a high rLD between loci strongly affects the probability of detecting more than one cluster in the Kerguelen population (Figure 3). This high rLD (>0.56) was caused by pairs of syntenous loci with distance lower than 3 cM (Table 1 and Figure 2). Our results also indicate that beyond this distance rLD declined dramatically and did not bias the estimation of clusters (Figures 2 and 3). Therefore, a distance higher than 3.0 cM generates one cluster. The impact of such a high rLD between two closely linked loci support the hypothesis that background LD only is responsible for the overestimation of clusters (Falush et al., 2003). This is also supported by the fact that the two pairs of unlinked loci that showed a rLD of 0.25 provided only one cluster.

Figure 3
figure 3

Effect of rLD between two syntenic loci on the overestimation of the number of populations, using the software STRUCTURE and data from the Haute Island population. The model was run using a bias-robust logistic regression (for more information see text) including rLD, number of alleles per locus and their interaction. Only rLD had a significant effect (χ12=18.5852, P<0.0001), and the interaction and the number of alleles per locus were not significant (χ12=0.0004, P=0.984 and χ12=0.0015, P=0.9689, respectively). Number of pairs of syntenous loci=18.

Rosenberg et al. (2001) found that increasing the number of loci could improve their clustering results (but see Lecis et al., 2006). In our study, analyses using two loci differing in rLD generated different numbers of clusters. This suggests that the strength of the rLD rather than the low number of loci used is responsible for the biased clustering. Moreover, when we used the 22 pairs of loci or only portions of our data set our results indicate that the presence of pairs of loci in strong LD within a sample can generate spurious results (Table 2; comparison of the whole 22 loci vs the non-syntenous loci or loci with large distances), suggesting that STRUCTURE could be sensitive to even a rare pair of loci in strong LD. Our study could not detect an effect of the number of alleles in a pair of loci on the probability of detecting clusters. It should be noted, however, that the allelic diversity in the mouflon population is low and that a higher allelic diversity may have stronger effect.

Our results also indicate that other LD does not bias the clustering results. LD between unlinked loci following a founder event (Lynch and Walsh, 1998) could be confounded with ‘mixture’ or ‘admixture’ LD (Falush et al., 2003). This could be the case on Haute Island. The population was founded by only two individuals, and some pairs of non-syntenous loci showed stronger LD values than pairs of syntenous loci separated by a large distance. LD obtained from non-syntenous loci or from syntenous loci separated by a large distance, however, did not seem strong enough to generate spurious clustering results (see Table 2).

On the basis of a human study, Pritchard and Wen (2004) advised users against using loci separated by less than 1 cM. However, in our study, all the loci that were separated by less than 3 cM were characterized by a rLD higher than 0.55 and generated a clustering bias. Furthermore, the selection of loci based on map distance may not be the most efficient approach. First, the strength of LD is not always correlated to physical distance between loci (Jorde et al., 1994; Jorde, 1995, 2000). Second genomic maps are still lacking for many species, and the distance between loci is available only for a few species. Finally, the distance per se is not responsible for the clustering, but it affects the strength of the linkage that in turn biases the estimation of the number of clusters. LD not only depends on the distance between two loci, but also can be increased by founder events or be decreased by population dynamics (Slatkin, 1994) or number of allele at a given loci (Ott and Rabinowitz, 1997). For all these reasons, we recommend researchers to rely on the rLD to take their decision of using or not using a given pair of loci before running STRUCTURE.

We hope our results will convince researchers that a good knowledge of the LD between loci is an important step before starting analyses in STRUCTURE. As already mentioned by Manel et al. (2005) and Waples and Gaggiotti (2006), and despite recommendations made by Pritchard et al. (2000) and Falush et al. (2003), some authors did not seem to consider LD as a potential issue in the study of population genetic structure. For example, some authors do not report any measure of LD between loci used in STRUCTURE (Kusumo et al., 2006). Others used some pairs of closely linked loci (that is about 3 cM in Verardi et al., 2006; >0.5 cM in Lecis et al., 2006). We do not intend to say that these studies are biased. Results from Lecis et al. (2006), for instance, seem consistent to their expectations and the uses of very closely linked loci did not seem to affect their results, potentially because a short distance between two loci may not automatically translate into a strong LD (Peterson et al., 1995).

To conclude, our empirical study demonstrates how a strong background LD can lead STRUCTURE to overestimate the number of clusters on a population genetic structure analysis. Therefore, rather than only simply testing for the presence of LD, studies using STRUCTURE should first estimate the rLD between all the pairs of loci before running clustering analyses. This would permit to exclude pairs of loci that could potentially bias the clustering results. For example, in the presence of rLD higher than 0.5 in a sample, one should run two analyses with and without loci with strong rLD and compare the number of clusters detected by STRUCTURE. A different number of clusters may suggest a bias caused by background LD.