Introduction

Recently, growing attention is being devoted to taking landscape information into account in genetic studies (Manel et al., 2003). Among landscape features, space is most likely to influence the genetic structuring of a set of individuals or populations (Manel et al., 2004; Coulon et al., 2006). This structuring can exhibit different patterns, such as isolation by distance (Wright, 1943), clines (Haldane, 1948), metapopulations (Hanski and Simberloff, 1997; Kerth and Petit, 2005) and barriers to gene flow (Slatkin, 1985). Moreover, as technological advances have made obtaining spatial information easier, there is strong interest in including this information in the analysis of genetic data.

Exploiting the geographic dimension of genetic data is not new. Spatial information can be used a posteriori for graphical display purposes (for example, Bertranpetit and Cavalli-Sforza, 1991; Manel et al., 2004) or to measure spatial autocorrelation (for example, Sokal and Wartenberg, 1983; Sokal et al., 1986; Bertorelle and Barbujani, 1995; Smouse and Peakall, 1999). Such methods are useful descriptive tools to visualize, quantify and test spatial structure, but are not properly designed to investigate spatial patterns. For instance, ordinary ordination methods (Bertranpetit and Cavalli-Sforza, 1991) may reveal spatial patterns wherever they are obvious, but they are not constrained to do so. To investigate spatial structures other than the most evident, a method should be spatially explicit, that is, it should directly take spatial information into account as a component of the adjusted model or of the optimized criterion, thereby focusing on the part of the variability, which is spatially structured.

Such methods have been developed using different approaches. Within the analysis of molecular variance (AMOVA) framework (Excoffier et al., 1992), the spatial analysis of molecular variance (SAMOVA; Dupanloup et al., 2002) has proven useful for phylogeographic studies (Pramual et al., 2005; Tolley et al., 2006) to assess the spatial structure of a known number of populations. Within the Bayesian clustering framework, GENELAND (Guillot et al., 2005) and, more recently, a hierarchical Markov random field (HMRF) model (François et al., 2006) were proposed as improvements of STRUCTURE (Pritchard et al., 2000; Falush et al., 2003) by integrating geographic information to infer the number of populations and detect the genetic discontinuities among these populations (Coulon et al., 2006). Combining wombling and Bayesian assignment, Manel et al. (2007) proposed a method to detect genetic boundaries among multilocus genotypes. However, these approaches rely on a genetic model and require populations to meet Hardy–Weinberg equilibrium expectations (although the HMRF model allows inbreeding) and for linkage equilibrium to exist between loci (see Kaeuffer et al., 2007). This might be a problem as such expectations are unrealistic in many cases and robustness of these methods have not been evaluated yet. Another, maybe more concerning, issue with these methods resides in the clustering approach itself: assigning individuals to groups is a likely inappropriate strategy when individuals are genetically structured as a cline. A last approach would be to use a Mantel correlogram (Legendre and Legendre, 1998, pp 736–738) to assess the variation of spatial autocorrelation in allelic frequencies across scales. However, this method is not wholly satisfying as it would only allow to detect spatial structuring, but would not permit to visualize the corresponding spatial patterns.

An appealing alternative for exploring genetic data is offered by reduced space ordination methods because their utilization is not contingent on a particular genetic model. Hardy–Weinberg equilibrium or linkage equilibrium are thus no longer required. Basically, these methods aim at summarizing strongly multivariate data into a few uncorrelated components, forming the so-called ‘reduced space’. For this summary to be meaningful, the components are chosen so as to reflect most of the variability in data, as defined by an optimized criterion (for example, variance among observations). Such methods can be applied on allelic frequency data to obtain a summary of the genetic variability among individuals or populations. A great illustration of such practice was offered by Menozzi et al. (1978), who used a principal component analysis (PCA; Pearson, 1901) to investigate the spatial patterns of the genetic variability, obtaining the well-known synthetic maps of human gene frequencies. More recently, PCA proved useful to correct for population stratification (Price et al., 2006) and to infer and test the number of subpopulations represented in a set of genotypes (Patterson et al., 2006). However, PCA can be criticized when applied to reveal spatial patterns. Indeed, this method finds synthetic variables on which the variance among genotypes is maximized, but does not take spatial information into account. PCA seeks genetic variability, not spatial structures; it is not a likely optimal method for revealing cryptic spatial patterns, that is, spatial patterns that are not associated to the highest genetic variation.

In this paper, we propose a new method, the spatial principal component analysis (sPCA), as a tool to investigate cryptic spatial patterns of genetic variability using georeferenced multilocus genotypes. Our method relies on a modification of PCA so that not only the variance between the studied entities (individuals or populations), but also their spatial autocorrelation is taken into account. The main results of the analysis are maps of entities scores allowing a visual assessment of the spatial genetic structures. The obtained scores reveal two types of patterns, which we define as global and local structures (sensu Thioulouse et al., 1995). Although both types express a fair amount of genetic variability, global structures display positive spatial autocorrelation whereas local ones display negative spatial autocorrelation. In other words, a global pattern would differentiate between two spatial groups or find a cline (or any intermediate state), whereas local scores would retrieve stronger genetic differences among neighbors than among random pairs of entities. As the studied entities can be genotypes or groups of genotypes (later referred to as ‘populations’, in a broad sense), global and local structures encompass a wide range of biological situations. For instance, global patterns of genotypes could indicate population patches in an island model, as well as cline wherever isolation by distance occurs. Local structures could arise when individuals from the same genetic pool are selected to avoid each other (repulsion) or to be attracted by individuals from other genetic pools. Similarly at the population level, global and local patterns may result from stratification (Price et al., 2006) or from adaptations to environmental variables that are inherently spatially structured (‘spatial dependence’; sensu Wagner and Fortin, 2005).

First, we explain how the spatial information is modeled explicitly through a connection network (Legendre and Legendre, 1998, pp 752–756) to be used in further computations. Then, we detail the meaning of Moran's index of spatial autocorrelation (I; Moran, 1948, 1950) which is incorporated into the sPCA criterion, and show how it can identify global and local patterns in allelic frequency data. We then demonstrate how sPCA yields independent components optimizing the product of the variance and Moran's I. As an aid to choose the sPCA scores to be interpreted, we developed two multivariate tests against the absence of global and local patterns. Our approach is illustrated and compared to PCA using simulated and real datasets. We conclude by discussing the prospective applicability of this method for the analysis of genetic data. The developed methodology is implemented in the adegenet package (Jombart, 2008) of the free software R (Ihaka and Gentleman, 1996; R Development Core Team, 2008).

Methods

Modeling spatial information

The first step of a spatially explicit method is to define how spatial information is introduced in the method. In sPCA, the detection of spatial structures uses the well-known Moran's I (Moran, 1948, 1950), which relies on the comparison of the value of a quantitative variable (for example, allelic frequency) observed at one site (that is, individual or population) to the values observed at neighboring sites. This approach thus requires ‘neighboring sites’ to be defined. This is usually achieved by building a connection network (also called neighboring graph) which uses an objective criterion to define which entities are neighbors, and which are not. To simplify the definition of spatial structures provided in this paper, the term ‘neighbors’ is here restrained to immediate neighbors, that is, two vertices of the same edge of the connection network.

Several algorithms, whose review is beyond the scope of the present paper, can be used to build a connection network (Legendre and Legendre, 1998, pp 752–756). Although other spatially explicit methods, such as SAMOVA (Dupanloup et al., 2002) or GENELAND (Guillot et al., 2005), impose a specific connection network (the Delaunay triangulation; Upton and Fingleton, 1985), sPCA can use any graph. This plasticity makes sPCA adaptable to various spatial distributions. For instance, Delaunay triangulation or Gabriel graph (Gabriel and Sokal, 1969) would be adapted to evenly distributed entities, whereas distance-based neighborhood would be more appropriate to aggregated distributions. Moreover, connection networks can be refined manually to include empirical knowledge of the spatial connectivity among entities. Once the connection network is defined, the spatial information is stored in a binary connection matrix M, which is symmetrical and its lines and columns correspond to the same biological entities (as in a distance matrix). The values of M are 1 if the two considered entities are connected, and 0 otherwise. This matrix is used in the computation of Moran's I and therefore in sPCA.

Measuring spatial autocorrelation of an allelic frequency

Let us consider one allelic frequency measured on n individuals or populations. Once the binary connection matrix M is obtained, the spatial autocorrelation of this frequency can be quantified using Moran's I. The general form of this index can be written using matrix notation (Cliff and Ord, 1981, p 119), where x is the vector of n centered allelic frequencies and W is the sum of all the terms of M:

The meaning of this index depends only on its first component; the effect of the second component n/xTx (which is the inverse of the variance of x) is to scale the variable so that I only reflects its spatial structure, not its variability. In this paper we use a version of I in which M is standardized so that the rows sum to one (Cliff and Ord, 1973, p 13). Denoting by L the resulting matrix, (1) becomes:

The expected value when the frequency observed at a site is independent of its neighbors (the null value, denoted I0) equals −1/(n−1) under a nonparametric model of the n! possible permutations of the data (Cliff and Ord, 1973, pp 29 and 32). Note that if I is to be interpreted quantitatively, its range of variation, which depends on the connection network, should be taken into account (De Jong et al., 1984).

This index has a straightforward interpretation. Let i and j indicate a row and a column of L (i=1, n; j=1, n). The row i contains positive values if i and j are neighbors, and 0 otherwise. As the terms of row i sum to one, these values are weights. Hence, the lag vector Lx computes, for a given entity, the mean frequency of its neighbors (Anselin, 1996). It follows that xTLx is the scalar product of the allelic frequencies and their lag vector (xTLx=〈xLx〉): the frequency observed for any entity is multiplied by the mean frequency of its neighbors, and the obtained values are added over all entities.

Two types of spatial structuring can be observed in individuals or populations, whenever allelic frequency observed among neighbors are more similar or more dissimilar than expected in a random spatial distribution. These cases are illustrated using 20 fictitious populations (Figure 1). Patches of similar allelic frequencies (Figure 1a) lead to a highly positive I because the allelic frequency observed in a population is positively correlated to the allelic frequency of its neighbors (Figure 1c). Conversely, different neighbor to neighbor allelic frequencies (Figure 1b) lead to a highly negative I, as the value taken by any population is negatively correlated to those taken by its neighbors (Figure 1d). These two patterns are global and local structures as defined by Thioulouse et al. (1995). In the sPCA context, we define global and local patterns as entities being more genetically similar (respectively dissimilar) to their immediate neighbors than expected in a random spatial distribution.

Figure 1
figure 1

Illustration of global and local patterns of an allelic frequency for 20 fictitious populations overlying their sampling area. Each square represents the frequency of a population. Edges correspond to the connection network (Gabriel's graph). (a) Example of global structure, corresponding to I(x)>I0. (b) Example of local structure, corresponding to I(x)<I0. (c) Moran's scatterplot showing that in the global structure (a), the allelic frequency x of a population is positively correlated with the mean frequency of its neighbors, Lx. The line corresponds to the linear regression of Lx on x. (d) Conversely, the Moran's scatterplot associated with the local structure (b) shows that frequency x of a population is negatively correlated with the mean value of its neighbors, Lx.

As we have shown, Moran's I can be used to numerically detect such patterns using the frequencies of a single allele. Now we tackle the following question: how to reveal these patterns using a complete set of alleles?

Spatial principal component analysis

Two different objectives arise when analyzing a set of georeferenced allelic frequencies. On the one hand, we would like to summarize the genetic variability among the biological entities (individuals or populations) into a few informative components. On the other hand, we would also like to reveal existing spatial patterns.

A convenient solution to the first problem is to use a centered PCA (Pearson, 1901; Menozzi et al., 1978). This method analyzes a table X of p centered allelic frequencies (displayed in columns) measured on n biological entities (rows). The allelic frequencies define Euclidian distances between the n entities in Rp, the p-dimensional space of real numbers. Finding the line of closest fit through the n points (Pearson, 1901) is the same as finding an axis in Rp on which the projections of the n entities are as widely scattered as possible, that is, where the Euclidian distances between the entities are best preserved. To fulfill this property, PCA seeks a scaled vector u (u2=1) containing p loadings (one per allele) so that the entities scores onto this axis (φ=Xu) have a maximum variance. This can be reformulated as the maximization of:

where Xu1/n2=var(φ). The solution is given by the first eigenvector of , which yields scores whose variance is maximized and equates to the highest eigenvalue. The second objective can be tackled by testing the spatial autocorrelation of the PCA scores, to assess whether they display significant spatial structure (Wartenberg, 1985b). However, these scores are appropriate only to summarize genetic variability and are in no way designed to reveal spatial patterns. Thus, there is a need for a methodology summarizing the genetic diversity and revealing spatial structures at the same time.

sPCA encompasses these two objectives. This new method finds a few independent synthetic variables that no longer optimize the variance of the entities' scores (as in PCA), but the product of their variance and of Moran's I. sPCA is closely related to Wartenberg's multivariate spatial correlation (MSC; Wartenberg, 1985a), but MSC constrains all alleles to have the same variance. This has the undesirable effect of masking the variability of the most informative alleles. Our method is also linked to that of Thioulouse et al. (1995), which also focuses on global and local structures. However, their method differs from sPCA in two ways. Firstly, it introduces nonuniform row weights giving more importance to the entities with many neighbors, whereas sPCA gives equal weights to all entities. Secondly, Thioulouse et al. (1995) used a globally standardized connection matrix instead of the row-standardized matrix L, and thus lost the meaning of the lag vector Lx.

sPCA seeks scaled axes v (v2=1) in Rp so that entity scores ψ=Xv are both scattered and spatially autocorrelated. Similarly to the centered PCA (3), this relies on identifying the extreme values of a function (denoted C for ‘criterion’):

We show that the solution is given by the eigenvectors of the symmetric matrix associated with the highest and lowest eigenvalues (Supplementary Appendix A). As with PCA, other eigenvectors associated with less extreme eigenvalues display weaker structuring under the orthogonality constraint.

Although PCA and sPCA rely on a common approach, two major differences between these analyses must be underlined. Firstly, sPCA does not decompose the total variance into decreasing additive components. Instead, the product of the variance var(ψ) and the spatial autocorrelation I(ψ) is separated into positive, null and negative components. Indeed, if the variance is always positive, the spatial autocorrelation can be positive as well as negative. Hence, while PCA focuses on the scores associated to the highest eigenvalues, sPCA encompasses two types of informative scores, both reflecting an aspect of the spatial patterning of the genetic variability. On the one hand, scores with a strong variance and a highly positive spatial autocorrelation (that is, global structures) correspond to highly positive eigenvalues. On the other hand, scores with a strong variance and a highly negative spatial autocorrelation (that is, local structures) correspond to highly negative eigenvalues. Note that these negative eigenvalues are thus useful tools to detect local patterns, and should not be ignored, as it was done in MSC (Wartenberg, 1985a).

Secondly, it makes no sense to compare a sPCA eigenvalue to the sum of all eigenvalues (as done in PCA) because this sum itself has no meaning: it can be low if there is no structure at all, as well as when there are strong global and local structures. Therefore, the percentage of total criterion associated to a given eigenvalue cannot be used as a rule to choose the structures to retain. However, as in other multidimensional methods, an abrupt decrease of the eigenvalues is likely to indicate the boundary between strong and weak structures (Legendre and Legendre, 1998). The interesting patterns are displayed graphically, and their spatial autocorrelation is measured using Moran's I. Note that it is meaningless to test the I of the sPCA scores, as is done in PCA (Wartenberg, 1985b) because the sPCA scores are already optimized regarding spatial autocorrelation.

Multivariate tests to detect global and local structuring

Sometimes, the sPCA eigenvalues may not clearly indicate if global and/or local structures should be interpreted. A first aid would be to assess if there are significant global and local patterns in the data. We developed two statistical tests (a global and a local test) to answer to these questions.

These tests rely on the spectral decomposition of the row-standardized connection matrix L into Moran's eigenvector maps (MEMs; Griffith, 1996; Dray et al., 2006). These vectors are uncorrelated variables modeling different spatial structures; they were initially used in geography for spatial filtering purposes, that is, to remove spatial autocorrelation from the residuals of a statistical model (Griffith, 2000). In ecology, MEMs are used as explanatory variables in linear modeling approaches to model complex spatial patterns (Dray et al., 2006; Griffith and Peres-Neto, 2006). Each of these spatial predictors is associated to a Moran's I and can therefore be characterized either as a global or a local pattern. We denote E+ the matrix whose columns are the global MEMs of L, and E the matrix storing local MEMs. As there are always (n−1) MEMs for n locations, these vectors can fully decompose a centered allelic frequency x into global and local spatial structures using simple linear regression. Note that this decomposition is not subject to multicollinearity troubles because MEMs are orthogonal: each MEM explains a different part of the variance of x, which is measured by the corresponding coefficient of determination (R2). This can be applied to the p centered allelic frequencies of matrix X, yielding p x(n−1) coefficients of determination (one for each allele/MEM combination) which are stored separately for E+ and E (see detailed computations in Supplementary Appendix B). The R2 of alleles with vectors of E+ are used in the global test, whereas R2 computed with MEMs of E are used in the local test.

The basic idea behind our testing procedures is that if a global (respectively local) pattern exists among individuals (or populations), a large number of alleles is expected to be fairly correlated to at least one vector of E+ (respectively E). To detect this, the mean R2 across alleles is computed for each MEM. Denoting these means by tj (j=1, q), a vector t containing all tj is then obtained (t=[t1 …. tj …. tq]T). To detect an eventual MEM with which all alleles would be significantly correlated, the test statistic used in both procedures is the maximum of t values, denoted max(t). The null hypothesis (H0) is that allelic frequencies of the individuals (or populations) are distributed at random on the connection network. Alternative hypotheses are that allelic frequencies of the studied entities display at least one global (respectively local) spatial structure. The distribution of max(t) under H0 is obtained by a Monte Carlo procedure involving a large (say at least 999) number of permutations. For each permutation, the rows of X are randomized and max(t) is computed. In both tests, the P-value is defined as the relative frequency of permuted statistics equal to or higher than the initial value of max(t). We verified that the type I errors of both tests were correct using simulated datasets (see Supplementary Appendix B).

Illustrations

Simulated data: simple structures

This illustration compares the results of PCA and sPCA on three simulated datasets containing simple spatial structures: two global patterns (patches, cline) and one local structure (repulsion). For discontinuous patterns (patches and repulsion, Figures 2a, b, e and f), three populations of 500 diploid individuals were simulated using EASYPOP version 2.0.1 (Balloux, 2001), using a hierarchical island model to have different levels of genetic differentiation. Migration rate between populations 1 and 2 was set to 0.005 and to 0.002 between population 3 and the other two. Genotypes consisting in 20 microsatellite-like loci were obtained after 1000 generations using a KAM mutation model (that is, loci mutate to any new allelic state with the same probability) with a mutation rate of 0.0001 and 50 possible allelic states. For the continuous pattern (cline, Figures 2c and d), 4 populations of 500 diploid genotypes were simulated under an isolation-by-distance process. Spatial coordinates of the four populations were set in a two-dimensional space to (0, 0), (0, 2), (2, 0) and (2, 2). Dispersal distances were drawn from a negative exponential distribution with a mean of 1. EASYPOP computes the migration rates as exp(−r*dij), where dij is the distance between populations i and j, and r is the inverse of the dispersal distance. The other input parameters of this simulation were the same as in the simulation of discontinuous patterns. All analyzed data were obtained by randomly sampling genotypes from the created populations. Spatial coordinates were defined using the R software to create the various spatial structures.

Figure 2
figure 2

Analyses of simple global and local structures among 80 genotypes from three different populations by principal component analysis (PCA) and spatial PCA (sPCA). Each square represents the score of a genotype and is positioned by its spatial coordinates. The eigenvalues corresponding to the displayed scores are filled in black on the screeplots. Numbers indicate the population to which genotypes belong. (a, b) Two patches with random noise. (c, d) A cline with random noise. (e, f) Repulsion with random noise. (a, c, e) First PCA scores. (b, d) First global scores of sPCA. (f) First local scores of sPCA.

Three datasets of 80 georeferenced genotypes were created: (1) two patches of 35 individuals from populations 1 and 2 with 10 individuals from population 3 randomly distributed; (2) 40 individuals from populations 1 and 2 forming a cline with 40 individuals from population 3 and 4 randomly distributed; (3) 30 individuals from population 3 distributed in repulsion among a total of 50 individuals of populations 1 and 2.

Data were analyzed using the R software, especially the ade4 package for multivariate analysis (Chessel et al., 2004; Dray et al., 2007), spdep for spatial methods (Bivand, 2007) and adegenet for genetic data handling, sPCA and global/local tests (Jombart, 2008). The same procedure was applied to each dataset: first, data were analyzed by PCA, using Moran's I test to detect spatial structuring in the PCA scores; second, data were analyzed by sPCA using global and local tests (with 9999 permutations) as an aid to select the structures to be interpreted. All connection networks were defined using the Delaunay triangulation (Upton and Fingleton, 1985), a common graph that underlies several other methods (Dupanloup et al., 2002; Guillot et al., 2005; François et al., 2006). The patches of the first dataset were retrieved by the first PCA scores (Figure 2a), which were significantly autocorrelated (I=0.228, P=0.0005). However, these patches appeared more clearly on the first global scores of sPCA (Figure 2b). The global test confirmed the existence of global pattern (max(t)=0.0166, P=0.0011), whereas the local test did not detect any local structure (max(t)=0.0140, NS). In the second dataset, PCA overlooked the cline, showing a weak spatial pattern on the third principal component (Figure 2c; I=0.128, P=0.022), whereas sPCA completely retrieved it (Figure 2d). Note that the first global structure is indeed a cline—and not patches—because genotypes situated in the middle of the distribution have less extreme scores (smaller squares). The global test confirmed the presence of global structure (max(t)=0.0161, P=0.0038), whereas the local test detected no local pattern (max(t)=0.0117, NS). In the third dataset, the local pattern (repulsion among genotypes from population 3) was not identified by PCA (Figure 2e; I=−0.054, NS). On the contrary, the first local score of sPCA revealed this pattern: large black squares (population 3) are rarely found as neighbors and tend to be surrounded by white ones (genotypes from other populations) more often than at random. The global test did not detect any global pattern (max(t)=0.0132, NS), whereas the local test was significant (max(t)=0.0174, P=0.0008).

Simulated data: complex structures in individuals

This illustration compares the results of PCA and sPCA using a simulated dataset in which different structures are mixed. Four populations of 500 diploid individuals were simulated using EASYPOP, following a hierarchical island model. Migration rate between populations 1 and 2 was set to 0.005, and to 0.002 for other populations. Otherwise, all parameters were those used in the previous illustration. A random sample of 80 genotypes was obtained, with unequal sample sizes (from population 1 to 4, sizes were 30, 30, 10, 10). Spatial coordinates were defined so that: (1) the 60 individuals from populations 1 and 2 were structured as a cline; (2) the 10 individuals from population 3 were distributed randomly; (3) the 10 individuals from population 4 were structured in repulsion.

This dataset was analyzed as previously, first by PCA and then by sPCA. The Delaunay triangulation was employed to model the spatial connectivity among genotypes. Two axes were retained for PCA (Figures 3a and c). No clear spatial pattern was revealed by PCA. The cline between populations 1 and 2 seemed split between the first (Figure 3a) and the second scores (Figure 3c), whereas the local structure induced by individuals from population 4 does not appear clearly on either axis. The Moran's I tests did not detect significant autocorrelation in either scores (I=0.081, NS; I=−0.014, NS). On the contrary, sPCA revealed both structures. The first global and local scores were retained (Figures 3b and d). The global scores clearly differentiated populations 1 and 2, even if it was not clear whether this global structure consisted in two patches or in a cline (Figure 3b). The global test confirmed that a global structure existed (max(t)=0.0200, P=0.0005). The first local score clearly emphasized to genetic differences of individuals from population 4 (large white squares) from their immediate neighbors (other populations). The local test was consistently significant (max(t)=0.0270, P=0.0001).

Figure 3
figure 3

Analyses of complex global and local structures among 80 genotypes from four different populations by principal component analysis (PCA) and spatial PCA (sPCA). Each square represents the score of a genotype and is positioned by its spatial coordinates. The eigenvalues corresponding to the displayed scores are filled in black on the screeplots. Numbers indicate the population to which genotypes belong. (a) First PCA scores. (b) First global scores of sPCA. (c) Second PCA scores. (d) First local scores of sPCA.

Simulated data: complex structures in populations

This illustration had the same objective as the previous one, but involved populations rather than individuals. Three populations of 500 diploid individuals were simulated using EASYPOP, following an island model with a migration rate of 0.01. Other parameters of simulations were the same as in previous illustrations. Subpopulations (16) were created by taking random samples of 30 individuals from a given population; 10 subpopulations were thus obtained from population 1 and 2, and 6 were drawn from population 3. Spatial coordinates of subpopulations were defined so that: (1) the 20 subpopulations from populations 1 and 2 were structured in two patches; (2) the 6 subpopulations from population 3 were distributed following a local pattern.

After transforming the data into allelic frequencies for each subpopulation, a PCA and an sPCA were performed. The Delaunay triangulation was used to model the spatial connectivity among subpopulations. The PCA eigenvalues showed that two strongly structured axes were to be retained (Figures 4a and c). This was likely due to the fact that the variability among subpopulations was essentially an interpopulation variability: only two axes are required to differentiate three populations. The first PCA scores displayed a significant spatial structure (Figure 4a; I=0.265, P=0.0096), but it was merely as a by-product: it simply differentiated the population 1 from the two others. Similarly, the second PCA scores differentiated the population 3 from the others (Figure 4c), but these scores were not spatially structured (I=0.031, NS). The sPCA eigenvalues clearly showed that one global and one local axes were to be retained (Figures 4b and d). The first global scores (Figure 4b) found the two patches of subpopulations from populations 1 and 2, giving rather low values to the scores of population 3 (small squares). The global test detected the existence of spatial pattern (max(t)=0.131, P=0.0065). The local scores highlighted the differences between subpopulations from population 3 with the neighboring subpopulations (Figure 4d). The local test was also significant (max(t)=0.133, P=0.0053).

Figure 4
figure 4

Analyses of complex global and local structures among 16 subpopulations from three populations by principal component analysis (PCA) and spatial PCA (sPCA). Each square represents the score of a subpopulation and is positioned by its spatial coordinates. The eigenvalues corresponding to the displayed scores are filled in black on the screeplots. Numbers indicate the population to which subpopulations belong. (a) First PCA scores. (b) First global scores of sPCA. (c) Second PCA scores. (d) First local scores of sPCA.

Scandinavian brown bear data

The Scandinavian brown bear dataset illustrated other methods such as the wombling approach of Manel et al. (2007). These data contain the georeferenced genotypes of 964 brown bears sampled over 200 000 km2 and typed for 18 microsatellite markers. A more complete description of the data will be found in Waits et al. (2000). Former studies stressed the need for identifying management units (MUs) among Scandinavian brown bear for conservation purposes (Waits et al., 2000). Using density indicators, Swenson et al. (1998) suggested four different MUs. There seems to be a general agreement that the southernmost group is strongly differentiated from all the others because of different lineages (Manel et al., 2004). Nonetheless, the number of MUs to be considered is still discussed: using microsatellites, Waits et al. (2000) confirmed the four groups suggested previously (Swenson et al., 1998), whereas more recent studies found only three MUs (Manel et al., 2007), considering northern individuals as from one MU instead of two. Expressed in terms of sPCA, different MUs would appear as global structures, each global score potentially differentiating between two MUs. Thus, these data seem appropriate to illustrate how sPCA can identify several spatial groups.

First, a centered PCA was performed on the allelic frequencies of the individuals. The first three scores were significantly positively autocorrelated, but only the first two had biologically meaningful I values (I=0.647, P=0.001; I=0.125, P=0.001; I0≈0). The first PCA scores differentiated the southern MU from all the others (Figure 5a). The second PCA scores (Figure 5b) were more difficult to interpret, but seemed to correspond in part to the middle MU identified in previous studies (Swenson et al., 1998; Manel et al., 2004). The third scores displayed a very small spatial autocorrelation, and the associated map was not interpretable (result not shown).

Figure 5
figure 5

Analyses of Scandinavian brown bears data. (a) First and (b) second scores of the centered principal component analysis (PCA), displaying significant spatial structure (I=0.647, P=0.001; I=0.125, P=0.001). (c) Screeplot of spatial PCA (sPCA). The first eigenvalue (see real scale screeplot, top-right box) was truncated to better appreciate the others. Retained structures are filled in black. (df) First, second and third global scores of sPCA. Autocorrelation statistics were respectively: I=0.686, I=0.147, and I=0.122 (I0≈0).

Second, we proceeded to sPCA. We used a distance-based connection network because the spatial distribution was fairly aggregated; such a graph ensures that genotypes inside aggregates have more neighbors than outliers. The threshold distance between any two neighbors was chosen as the minimum distance so that no individual was excluded from the graph. We call the resulting graph a minimum distance neighboring graph. The first sPCA eigenvalue was strikingly large compared to the others, but with no doubt the first three eigenvalues and corresponding scores were to be retained (Figure 5c). The first scores revealed the same pattern as in PCA (Figure 5d) and separated individuals from the southern MU from all the others, like in previous studies. This pattern was associated to a strong spatial autocorrelation (I=0.686). The second sPCA scores (I=0.147) clearly differentiated individuals from the ‘middle’ subpopulation (Waits et al., 2000) from the others (Figure 5e). By combining the first two global scores we thus recovered the three subpopulations found in previous studies (for example, Manel et al., 2007). But more interestingly, our analysis retrieved an additional weaker structure: undoubtly the third global scores (I=0.122) showed an east–west differentiation among northern individuals (Figure 5f). Contrary to the first pattern (Figure 5d), this structure does not show sharp boundaries between patches, but rather progressive changes from one patch to another, suggesting an isolation-by-distance process or progressive introgression. This may be the reason why a method based on boundary detection (Manel et al., 2007) overlooked this structure. The global test confirmed the existence of at least one global pattern (max(t)=0.0533, P=0.0001) without detecting local structuring (max(t)=0.0043, NS).

Discussion

We propose a spatially explicit multivariate method, sPCA, as a new tool to explore georeferenced multilocus genotypes and, therefore, to try to understand how geographical and environmental features structure genetic information. Although ordinary centered PCA yields scores that summarize the genetic variability among considered entities (individuals or populations), sPCA adds the constraint that the provided scores should be spatially autocorrelated and, thus, focuses on the spatial pattern of genetic variability. Two types of patterns are discriminated: global and local structures, corresponding respectively to large positive and large negative eigenvalues. Maps of sPCA scores are used to visually assess these patterns. As an aid to the interpretation of sPCA results, two Monte Carlo tests are proposed to detect the existence of global and local patterns. Simulated data illustrated that sPCA can retrieve simple structures (patches, clines, repulsion) as well as more complex patterns among genotypes or populations, and performs better in this task than PCA. The global and local tests efficiently detected the existing patterns, with a reliable type I error, and can therefore be used to assess which kind of pattern should be interpreted. sPCA also retrieved already known patterns in Scandinavian brown bear dataset, as well as more cryptic structures, which were overlooked by another method (Manel et al., 2007), but were biologically expected (Swenson et al., 1998).

Several points relative to the method should be discussed. Firstly, the spatial information is integrated using a connection network. This widely used approach allows taking virtually any type of spatial information into account. Contrary to other spatially explicit methods (Dupanloup et al., 2002; Guillot et al., 2005), we do not impose a specific connection network; one would have to choose from existing algorithms, and refine it according to what is known about the ecological connectivity in the system. It is important to keep in mind that sPCA is not intended to study the spatial connectivity among the considered entities; it aims at finding spatial structuring given that connectivity.

Secondly, sPCA is proposed mainly as an exploratory tool. For this purpose, our approach seems relevant as it is a reduced space ordination method; no assumptions are made about the data model. It is thus free, for instance, from modeling constraints like Hardy–Weinberg equilibrium assumptions, which are often violated when considering markers involved in selection processes. This is in contrast to, for instance, STRUCTURE (Pritchard et al., 2000; Falush et al., 2003), which assumes both Hardy–Weinberg equilibrium and linkage equilibrium. Nonetheless, further investigations should be devoted to link sPCA to existing population genetics models. Indeed, the ability of spatial autocorrelation based methods (of which sPCA is one) for inferring genetic processes has been a controversial topic (Sokal and Wartenberg, 1983; Sokal et al., 1989; Slatkin and Arter, 1991a, 1991b), but useful studies have shown that Moran's I can be linked to population genetics models (Hardy and Vekemans, 1999). Similarly, a recent study demonstrated that the number of significant eigenvalues of PCA can be directly related to the number of subpopulations in a set of genotypes (Patterson et al., 2006). Such development with sPCA would surely enhance the interpretation of the provided results.

Thirdly, the efficiency of sPCA in different population genetics scenario remains to be investigated further, as it was done with spatial autocorrelation. For instance, we did not tackle the relative power of the analysis to reveal patterns due to directional selection (Epperson, 1990) or isolation by distance (Barbujani, 1987; Epperson, 1995). The influence of other parameters, such as the connection network or the level of genetic differentiation, should also be evaluated. These topics as well as comparisons of sPCA to other methods will be investigated using simulations in a next paper.

To conclude, we have shown that sPCA can be used and is useful at the scale of individuals as well as at a population scale. This suggests that our method could be an appropriate tool in different domains. As sPCA can be performed on data from individuals with no a priori knowledge of the studied system, our method should become a useful tool in landscape genetics studies (Manel et al., 2003), to link the revealed genetic patterns to landscape features and to explain genetic discontinuities in terms of environmental, behavioral or physiological barriers. Indeed, the sPCA scores can be correlated to other variables or included as dependent or independent variables in models, as long as their spatial autocorrelation is taken into account (Anselin, 2002). Moreover, sPCA can assess the genetic structuring of a set of fragmented populations, which seems especially relevant in conservation biology where this is common. It is particularly important to identify the most isolated populations, when introducing new individuals to maintain genetic diversity or to predict the spatial spread and maintenance of an introduced disease to control pest species. In these cases, sPCA may help to develop appropriate management and surveillance strategies for a disease. Therefore, the proposed method can be seen as a versatile tool for investigating the genetic structuring of set of individuals or populations, within different contexts.