Introduction

Evolutionary and conservation geneticists frequently rely on neutral molecular data to describe population structure. In the past 30 years, a parade of molecular markers has been used, from blood proteins to microsatellites. This progression has been motivated, at least in part, by a search for loci with more variation. Loci with many alleles, such as microsatellite loci, have unprecedented ability to detect and describe genetic differences between populations (eg Hedrick, 1999; Kalinowski, 2002a).

However, loci with scores of alleles have forced population geneticists to re-evaluate how genotypic data are analyzed and interpreted. The most fundamental result of this re-evaluation has been increased awareness that statistically significant genetic differences are not always biologically or evolutionarily significant (eg Waples, 1998; Hedrick, 1999). One question that has received little attention in the literature is how high levels of polymorphism affect study design. This may be because most geneticists using highly polymorphic microsatellite loci have already concluded that large samples are needed to estimate genetic distances at loci with many alleles. All the literature that I am aware of supports this belief. For example, Nei (1978) analyzed the sampling variances of his genetic distances, and concluded that ‘more individuals should be examined when heterozygosity is high than when it is low.’ Baverstock and Moritz (1996) concluded that ‘it is clear that large sample sizes are needed’ to describe population structure at hypervariable loci. Most recently, Ruzzante (1998) used computer simulations to show that polymorphic loci have high sampling variances when sample size is small. Each of these authors, however, examined only the relationship between sample size and sampling variance.

This focus on sampling variance has been misleading for two reasons. First, genetic distances are derived measures of genetic differentiation. They do not necessarily have high sampling variances when estimates of allele frequencies are imprecise. For example, high mutation rates usually decrease the sampling variance of FST. Second, sampling variances are not always an appropriate measure of precision to compare study design strategies. Consider the standard genetic distance, DS, between two populations that have been isolated for t generations. The sampling variance of DS will be higher at loci with high mutation rates than at loci with low mutation rates. However, the parametric genetic distance will also be higher,

DS = 2μt

(where μ is the infinite alleles mutation rate) (Nei, 1972). This must be taken into consideration when comparing variances, and the coefficient of variation is a useful statistic for doing so (see the Appendix for a mathematical discussion of why it is useful for examining study design). The purpose of this paper is to explore the relationship between sample size, polymorphism, and the coefficient of variation of genetic distances.
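
As a notational sketch, the coefficient of variation of an estimated genetic distance can be written as its standard deviation divided by its expected value. The hat symbol and the operator names below are shorthand assumed here for illustration; they are not taken from the Appendix.

```latex
\mathrm{CV}(\hat{D}) = \frac{\sqrt{\operatorname{Var}(\hat{D})}}{\operatorname{E}(\hat{D})}
```

Because the standard deviation is scaled by the mean, a distance whose sampling variance and expected value both increase with μ can still have a small coefficient of variation.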

Methods

I examined the relationship between sample size, mutation rate, and the coefficient of variation of three popular genetic distances: the standard genetic distance of Nei, DS; the chord distance of Nei (1983), DA; and the Weir and Cockerham (1984) estimator of FST, θ (see Excoffier (2001) for a discussion of the relationship between θ and FST). I chose these three genetic distances because they are commonly used and because they have substantially different evolutionary properties (see Kalinowski (2002b) for a review). I used computer simulation (see below) to see how increasing the sample size per population decreases the coefficients of variation of estimates of these genetic distances for loci with different levels of polymorphism. Because the expected amount of polymorphism at a locus increases with both mutation rate (μ) and effective population size (Ne), I explored how both μ and Ne affect estimation of genetic distances.
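
For readers who want to see how two of these distances are computed from sample allele frequencies, the following is a minimal sketch, not the program used in this study. It assumes the bias-corrected form of DS from Nei (1978) and the usual multilocus form of DA; the function names and the dictionary-per-locus data layout are hypothetical choices made for illustration. The Weir and Cockerham estimator is omitted because it requires a full variance-component calculation.

```python
import math

def nei_standard_distance(freqs_x, freqs_y, n_x, n_y):
    """Bias-corrected DS (assumed Nei 1978 form).

    freqs_x, freqs_y: one dict per locus mapping allele -> sample frequency.
    n_x, n_y: number of diploid individuals sampled per population.
    Returns None when the estimate is undefined (e.g. no shared alleles).
    """
    jx = jy = jxy = 0.0
    for fx, fy in zip(freqs_x, freqs_y):
        sx = sum(p * p for p in fx.values())
        sy = sum(q * q for q in fy.values())
        # Small-sample correction of within-population gene identity.
        jx += (2 * n_x * sx - 1) / (2 * n_x - 1)
        jy += (2 * n_y * sy - 1) / (2 * n_y - 1)
        # Between-population gene identity needs no correction.
        jxy += sum(p * fy.get(a, 0.0) for a, p in fx.items())
    if jxy <= 0 or jx <= 0 or jy <= 0:
        return None
    # The 1/L averaging over loci cancels in the ratio.
    return -math.log(jxy / math.sqrt(jx * jy))

def nei_da_distance(freqs_x, freqs_y):
    """DA: one minus the mean over loci of sum_i sqrt(x_i * y_i)."""
    total = 0.0
    for fx, fy in zip(freqs_x, freqs_y):
        total += sum(math.sqrt(p * fy.get(a, 0.0)) for a, p in fx.items())
    return 1.0 - total / len(freqs_x)
```

Returning None for an undefined DS mirrors the situation, discussed later in the Methods, in which two samples share no alleles.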

I examined the sampling properties of DS, DA, and FST in two simple evolutionary models: an isolation model of population divergence and an equilibrium model of migration. In the ‘isolation’ model, a randomly mating population of Ne individuals is instantly divided into two populations that each has the same effective size as the ancestral population. The two populations formed by this fragmentation event remain completely isolated for t generations (at which point sampling occurs). I included three population sizes (Ne=500, 5000, and 50 000) and three divergence times (t=50, 500, and 5000) in my simulations. In the equilibrium ‘migration’ model, two populations of equal and constant effective size (Ne) exchange migrants at a rate of m. I included three population sizes (Ne=500, 5000, and 50 000) and three migration rates (m=0.01, 0.001, 0.0001) in my simulations.

In both evolutionary models, I simulated data for loci with four different mutation rates (10^−6, 10^−5, 10^−4, 10^−3). All mutations were unique (infinite alleles mutation). These mutation rates (and the effective population sizes listed above) produced gene diversities within populations that ranged from approximately 0.002 to greater than 0.995. Sample size was varied from two individuals per population to 256 individuals per population. The number of loci in simulated data sets was varied from 2 to 256.
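
The range of gene diversities quoted above can be checked against the standard infinite-alleles expectation for a single randomly mating population at mutation-drift equilibrium, H = 4Neμ / (1 + 4Neμ). The sketch below is an illustration under that assumption only; the function name is hypothetical, and in the migration model the within-population diversity also depends on m, which is ignored here.

```python
def expected_gene_diversity(ne, mu):
    """Equilibrium gene diversity under infinite alleles (isolated population)."""
    theta = 4.0 * ne * mu
    return theta / (1.0 + theta)

# Parameter grid taken from the Methods.
for ne in (500, 5000, 50000):
    for mu in (1e-6, 1e-5, 1e-4, 1e-3):
        print(ne, mu, round(expected_gene_diversity(ne, mu), 4))
```

The smallest combination (Ne=500, μ=10^−6) and the largest (Ne=50 000, μ=10^−3) bracket the reported range.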

The amount of population differentiation in these models can be measured by FST. Most formulations of FST are a function of mutation rates. However, formulae based solely on demographic variables can be obtained by taking the limit as the mutation rate goes to zero (Slatkin, 1991). Weir and Cockerham's estimator of FST is then equal to

for the isolation model, and

for the migration model (Slatkin, 1991; Weir and Cockerham, 1984; Excoffier, 2001).

Genotypic data were simulated using the coalescent approach (eg, Hudson, 1990; Felsenstein, 2003) with a computer program that I wrote for this purpose. The method of Ford (1998) was used for the isolation model.
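
To make the simulation approach concrete, the following is a minimal sketch of a coalescent simulation of one locus under the isolation model with infinite alleles mutation. It is not the program used in this study and is not claimed to be the method of Ford (1998). It assumes a continuous-time approximation in which each pair of lineages coalesces at rate 1/(2Ne) per generation and each lineage mutates at rate μ; a lineage that mutates (looking backward in time) defines a new allele for all of its sampled descendants and is then dropped. The function name and argument names are hypothetical.

```python
import random

def simulate_isolation_locus(n_genes, ne, mu, t, rng):
    """Return (alleles_pop1, alleles_pop2) for n_genes gene copies per population.

    ne: diploid effective size of each population and of the ancestor.
    mu: infinite-alleles mutation rate per gene copy per generation.
    t:  generations since the populations split.
    """
    # A lineage is the list of sampled gene copies descending from it.
    pops = [[[i] for i in range(n_genes)],
            [[n_genes + i] for i in range(n_genes)]]
    alleles = [None] * (2 * n_genes)
    next_label = 0
    now, merged = 0.0, False
    while not (merged and len(pops[0]) <= 1):
        coal = [len(p) * (len(p) - 1) / (4.0 * ne) for p in pops]
        mut = [len(p) * mu for p in pops]
        total = sum(coal) + sum(mut)
        wait = rng.expovariate(total) if total > 0 else float("inf")
        if not merged and now + wait >= t:
            # Reached the split (backward in time): pool all lineages.
            now, merged = float(t), True
            pops = [pops[0] + pops[1]]
            continue
        now += wait
        u = rng.random() * total  # choose the population and event type
        for lineages, c, m in zip(pops, coal, mut):
            if u < c:  # coalescence of two random lineages
                i, j = rng.sample(range(len(lineages)), 2)
                lineages[i] = lineages[i] + lineages[j]
                lineages.pop(j)
                break
            u -= c
            if u < m:  # mutation: label descendants, drop the lineage
                for g in lineages.pop(rng.randrange(len(lineages))):
                    alleles[g] = next_label
                next_label += 1
                break
            u -= m
    for lineage in pops[0]:  # any surviving lineage carries the ancestral allele
        for g in lineage:
            alleles[g] = next_label
        next_label += 1
    return alleles[:n_genes], alleles[n_genes:]
```

Diploid genotypes can then be formed by pairing gene copies at random within each population, which is appropriate under the random mating assumed by the model. With μ set to zero, every sampled gene copy receives the same allele, which is a convenient sanity check.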

The coefficient of variation of DS, DA, and FST was estimated by calculating the standard deviation and average value of 1000 simulated estimates of DS, DA, and FST. Calculations were performed in Microsoft Access.

Estimates of DS are not defined when two samples have no alleles in common, and estimates of FST are not defined when all loci are fixed for the same allele. Therefore, such samples were removed from the analysis. The frequency of these excluded samples was recorded. If over 10% of the samples were excluded, no results are reported for the combination of effective population size, mutation rate, and sample size that produced the undefined estimates.
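
The calculation of the coefficient of variation and the exclusion rule described above can be sketched as follows. This is an illustration rather than the original Microsoft Access calculation: the function name is hypothetical, undefined estimates are assumed to be marked with None, and the sample (n − 1) standard deviation is an assumption, since the text does not say which form was used.

```python
import statistics

def coefficient_of_variation(estimates, max_excluded=0.10):
    """Return (cv, excluded_fraction) for a list of simulated estimates.

    estimates: floats, with None marking an undefined estimate.
    cv is None when more than max_excluded of the estimates are undefined,
    or when the CV cannot be computed from the remaining values.
    """
    defined = [e for e in estimates if e is not None]
    excluded = 1.0 - len(defined) / len(estimates)
    if excluded > max_excluded or len(defined) < 2:
        return None, excluded
    mean = statistics.mean(defined)
    if mean == 0:
        return None, excluded
    # Sample standard deviation (n - 1 denominator); an assumption.
    return statistics.stdev(defined) / mean, excluded

# Example: three defined estimates, none excluded.
cv, excluded = coefficient_of_variation([1.0, 2.0, 3.0])
```

In the simulations described here, the input list would hold the 1000 estimates of a given distance for one combination of parameters.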

Results and discussion

All three genetic distances (DS, DA, and FST) displayed remarkably similar statistical properties, so I present representative results (Figures 1 and 2). Four trends were observed: three well known, one novel. First, large samples had a lower coefficient of variation than small samples. Second, increasing the sample size (per population) produced diminishing returns: at some point, sampling more individuals had little effect upon the coefficient of variation of the genetic distance. Third, loci with high mutation rates produced lower coefficients of variation than loci with low mutation rates.

Figure 1

The influence of mutation rate (μ) and sample size (per population) on the coefficient of variation (CV) of estimates of DS (Nei, 1978) in an isolation model of population divergence. In this model, a population of Ne individuals is instantly and permanently split into two completely isolated populations. Sampling occurs t generations after population fragmentation. All samples have 16 loci. The parametric value of FST for the populations is shown in each graph. Results for DA and FST are qualitatively indistinguishable (not shown).

Figure 2

Influence of mutation rate (μ) and sample size (per population) on the coefficient of variation (CV) of estimates of FST in an equilibrium migration model. In this model, two populations of Ne individuals exchange migrants at a rate of m. All samples have 16 loci. The parametric value of FST for the populations is shown in each graph. Results for DS and DA are qualitatively indistinguishable.

The fourth, and novel, trend was that the rate at which increasing sample size decreased the coefficient of variation was determined solely by the amount of differentiation between the populations, and not by the mutation rate or the amount of variation at the loci. This was true for both the isolation and the migration models. Consequently, more individuals should be sampled when the amount of differentiation is small than when it is large.

FST proved to be a convenient measure of population differentiation for study design. Figures 1 and 2 suggest that 20 individuals is a reasonable maximum sample size when the parametric value of FST is 0.05 and 100 individuals is a reasonable maximum sample size when the parametric value of FST is equal to 0.01. One implication of these results is that there is more benefit to collecting large samples from large populations than from small populations. This is because, all else being equal, FST between large populations is smaller than FST between small populations.

These results extend the seminal work of Nei and collaborators (see Nei, 1987 for a review). Nei and collaborators showed that sample sizes can be small when divergence times are large, but did not examine how effective population size or mutation rate affects the sampling properties of genetic distances. Foulley and Hill (1999) recently showed that only a few individuals need to be sampled to estimate the Sanghvi genetic distance when divergence times are large, but did not relate this genetic distance to effective population size or mutation rate.

Increasing the number of individuals in a study is not the only way to decrease the coefficient of variation of estimates of genetic distance. Increasing the number of loci will also improve the precision of estimates of genetic distance (see Nei, 1987 for a review). In fact, when population differentiation is substantial (eg FST > 0.2), increasing the number of loci is the only method for improving estimates of genetic distances. However, if enough loci are available, reliable estimates of genetic distances can be obtained from very few individuals. Figures 1 and 2 depict results for 16 loci. The shape of the curves in these figures, however, was independent of the number of loci examined. If fewer than 16 loci are sampled, the lines in the figures are shifted upwards (higher coefficient of variation). If more than 16 loci are sampled, the lines in the figures are shifted downwards (lower coefficient of variation).