Bootstrap variance of diversity and differentiation estimators in a subdivided population

Petit, R J; Pons, O

doi:10.1046/j.1365-2540.1998.00282.x

Original Article
Published: 01 January 1998

Heredity volume 80, pages 56–61 (1998)

Abstract

We have recently proposed new estimators of the parameters of genetic diversity and differentiation and of their variances for a haploid locus in a population subdivided into a large number of subpopulations, with a two-stage sampling of populations and individuals. Here they are compared with bootstrap estimators. Several resampling methods are evaluated: sampling of populations only, individuals within populations only, or both. Theoretical results and a numerical example show that the most appropriate bootstrap variance estimators are obtained by resampling the populations alone and not both populations and individuals. However, some bias is apparent in the bootstrap methods, and the direct estimators proposed previously should therefore be preferred.

Introduction

Resampling techniques are becoming widely used to assess confidence in phylogenetic reconstructions as well as in population genetics (Crowley, 1992). Indeed, direct analytical derivations of the appropriate variances can be extremely complex, and resampling techniques then provide rapid assessments of the precision of the studied statistics.

In particular, a variety of resampling methods have been used to detect genetic differentiation among populations (Crowley, 1992). To test whether there is a significant genetic structure, permutation procedures can be used where individuals are shuffled at random among populations, while keeping the sample sizes the same as in the original analysis (Palumbi & Wilson, 1990; Excoffier et al., 1992; Hudson et al., 1992). These methods do not replace the need to evaluate the precision of the measures of differentiation, such as F_ST (Wright, 1951; Weir & Cockerham, 1984) or G_ST (Nei, 1973; Pons & Petit, 1995). For multilocus isozyme data, a confidence interval can be obtained by jackknifing or bootstrapping over loci, as suggested by Weir (1990). But Van Dongen (1995) recently concluded that in general, resampling over individuals should be preferred to resampling over loci, because the allele frequencies for the different loci are usually estimated from the same individuals and are therefore not independent. Resampling over individuals also has the advantage that the precision of the differentiation at each locus can be estimated, which allows for the possibility of detecting aberrant loci when several are available (McDonald, 1994) or of studying single-locus data, such as data based on mitochondrial or chloroplast DNA.

However, for estimators of diversity whose precision is affected by both sources of variation (sampling of individuals and of populations), it is unclear what sampling to use for the bootstrap. Several types of resampling have been used so far in the literature: for instance, Petit et al. (1993) used the two-stage bootstrap which mimics the original sampling, whereas Prout & Barker (1993) used a bootstrap with only the populations as units of resampling.

To decide between these alternative types of sampling, a satisfactory solution would be to compare the simulated bootstrap estimates with the analytically derived direct estimates. This seems important because, as outlined by Crowley (1992, p. 431), ‘bootstrapping may have been swept into the mainstream of ecological and particularly evolutionary research somewhat ahead of a full, balanced evaluation of its capabilities, and shortcomings’. This analytical treatment, which considers both the sampling of individuals and the sampling of populations, is now available for a haploid and a diploid locus as well as for ordered alleles (Pons & Chaouche, 1995; Pons & Petit, 1995, 1996). The usual but sometimes implicit assumption is made that the observed populations are independent, which is both a genetic and a sampling assumption (for further discussion on this topic, see Nei, 1986; Pons & Petit, 1996). The same assumption is required in the uniform resampling methods, such as the bootstrap discussed here. We also assume that the number of sampled populations is large but nevertheless much smaller than the total number of existing populations.

We will use these results to illustrate that, in complex situations, it is necessary to ascertain whether bootstrap simulations yield the required estimates or not. Although two-stage bootstrap methods have been studied in the case of a finite number of finite populations drawn with replacement (Rao & Wu, 1988; Sitter, 1992), no results are known for a large number of populations and data sampled without replacement (in the original data sampling scheme). Here, three situations will be considered: sampling of individuals only, of populations only, and of both. We will present the exact or approximate bootstrap estimators and will compare them to the direct estimators obtained analytically in Pons & Petit (1995) and which will be referred to simply as the ‘direct estimators’.

These results will be illustrated using data on isozyme polymorphism in sessile oak (Zanetto & Kremer, 1995). The comparison of the bootstrap estimators with the corresponding bootstrap simulated values will provide an evaluation of the approximations we used in the analytical derivations of the variances.

Bootstrap estimation

We consider a total population subdivided into a large number of independent populations and in which I alleles are segregating at a haploid locus. For this situation, diversity and differentiation parameters were defined in Pons & Petit (1995). In particular, the diversity h_k of the kth population is given in their eqn 1, the average within-population diversity h_S in eqn 2, the total diversity h_T in eqn 3 and the differentiation parameter G_ST in eqn 4. A two-stage random sampling is used to estimate these parameters: n independent populations are drawn with the same probability from the general population, then n_k individuals are drawn independently and uniformly from the kth population. Within the kth population, the proportion x_ki of individuals having the ith allele is observed, corresponding to an unknown frequency p_ki. Estimators of the parameters are defined in Pons & Petit (1995) as ĥ_k, ĥ_S, ĥ_T and Ĝ_ST by their eqns 5, 6, 9 and 10.
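The estimators referred to above are defined in Pons & Petit (1995) and are not reproduced in this text. The following minimal sketch shows one plausible reading of them, under stated assumptions: ĥ_k is taken as the usual bias-corrected form n_k/(n_k−1)·(1−Σ_i x_ki²), ĥ_S as the unweighted mean of the ĥ_k, ĥ_T as 1−Σ_i x_•i² plus a correction for the between-population variance of the mean frequencies, and Ĝ_ST as 1−ĥ_S/ĥ_T. The function name `diversity_estimates` and the small count matrix are hypothetical; the cited paper remains the authority for the exact expressions.

```python
import numpy as np

def diversity_estimates(counts):
    """Return (h_S, h_T, G_ST) estimates from an (n, I) matrix of allele counts.

    Assumed forms (not taken verbatim from the paper):
      h_k = n_k / (n_k - 1) * (1 - sum_i x_ki^2)
      h_S = mean_k h_k
      h_T = sum_i x_.i (1 - x_.i) + sum_i sum_k (x_ki - x_.i)^2 / (n (n - 1))
      G_ST = 1 - h_S / h_T
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]                      # number of sampled populations
    n_k = counts.sum(axis=1)                 # sample size of each population
    x = counts / n_k[:, None]                # observed frequencies x_ki
    h_k = n_k / (n_k - 1.0) * (1.0 - (x ** 2).sum(axis=1))
    h_s = h_k.mean()
    x_bar = x.mean(axis=0)                   # x_.i = n^-1 sum_k x_ki
    between = ((x - x_bar) ** 2).sum() / (n * (n - 1.0))
    h_t = (x_bar * (1.0 - x_bar)).sum() + between
    g_st = 1.0 - h_s / h_t
    return h_s, h_t, g_st

# Hypothetical data: 4 populations, 3 alleles, 10 individuals each.
counts = np.array([[8, 2, 0], [5, 5, 0], [1, 6, 3], [0, 4, 6]])
h_s, h_t, g_st = diversity_estimates(counts)
print(h_s, h_t, g_st)
```

The unweighted mean over populations reflects the sampling design described above, in which populations are drawn with equal probability.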

Here, we study three different bootstrap sampling procedures for estimating the variance of ĥ_S, ĥ_T and Ĝ_ST. We also study the within- and between-population components of these variances, which will be useful for understanding what is estimated under each resampling scheme. The first bootstrap method is a resampling of the individuals in the observed populations: in the kth population, n_k individuals are drawn uniformly and with replacement from the initial sample of the kth population. In the second resampling procedure, only the populations are drawn: a bootstrap sample is obtained by drawing n populations uniformly and with replacement from the observed set of populations, then by taking the initial observed values for each sampled population. The third bootstrap procedure corresponds to a two-stage bootstrap resampling: n populations are sampled uniformly and with replacement from the observed set of populations and, whenever the kth population is selected, n_k individuals are drawn uniformly and with replacement from its initial sample.
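The three resampling procedures described above can be sketched as follows. This is an illustration only, assuming the data are stored as an (n, I) matrix of allele counts; the function names and the example counts are hypothetical and do not correspond to the simulation programs used for the paper.

```python
import numpy as np

def resample_individuals(counts, rng):
    """Procedure 1: in each population, draw n_k individuals with replacement."""
    counts = np.asarray(counts)
    n_k = counts.sum(axis=1)
    # Drawing n_k individuals with replacement from the observed sample is a
    # multinomial draw with the observed frequencies x_ki as probabilities.
    return np.array([rng.multinomial(m, row / m) for row, m in zip(counts, n_k)])

def resample_populations(counts, rng):
    """Procedure 2: draw n populations with replacement, keep their observed counts."""
    counts = np.asarray(counts)
    n = counts.shape[0]
    idx = rng.integers(0, n, size=n)
    return counts[idx]

def resample_two_stage(counts, rng):
    """Procedure 3: resample populations, then individuals within each selected one."""
    return resample_individuals(resample_populations(counts, rng), rng)

# Hypothetical data: 4 populations, 3 alleles, unequal sample sizes.
counts = np.array([[12, 3, 0], [5, 5, 0], [1, 6, 3], [0, 4, 8]])
rng = np.random.default_rng(0)
print(resample_individuals(counts, rng))
print(resample_populations(counts, rng))
print(resample_two_stage(counts, rng))
```

Repeating any one of these draws many times and recomputing the statistic of interest on each replicate gives the simulated bootstrap distribution for that procedure.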

For the vth bootstrap resampling procedure (v=1, 2 or 3) and for a parameter θ, which will here be h_S, h_T or G_ST, the bootstrap estimator θ̂*_(v) of θ is the mean, under the bootstrap resampling distribution and conditionally on the observed variables, of the variable θ*_(v) defined in the same way as the direct estimator θ̂ but for the vth bootstrap variable (Efron & Tibshirani, 1993). The bootstrap estimator Var*_(v)(θ̂) of the variance of θ̂ is the variance of θ*_(v), under the vth bootstrap resampling distribution and conditionally on the observed variables. Because the resampling distributions are multinomials with parameters depending on the observed variables, the bootstrap estimators θ̂*_(v) and Var*_(v)(θ̂) are functions of the proportions of individuals having each allele in the different populations, when bootstrapping within the populations, and of the sample sizes n_k and n. Here, we give the expressions of these bootstrap estimators without proofs, which may be obtained from the second author (Pons, 1997). We will need the following biased estimators of h_S and h_T, where x_•i = n⁻¹ Σ_k x_ki:

According to each multinomial bootstrap distribution, we get for h_S the three bootstrap estimators:

For h_T, we have

The three bootstrap estimators of G_ST are therefore biased. If the number n of sampled populations is large, as is recommended to reduce the total variance of the estimates (Pons & Petit, 1995), ĥ_S ≈ h̃_S and ĥ_T ≈ h̃_T, and the three procedures give similar results.

Before considering the bootstrap estimation of the variances, we define E_k and Var_k as the expectation and the variance under the multinomial distribution of parameters n_k and p_ki in the kth population. For h_S, the three bootstrap variances are

In eqn (4), V̂ar_k(ĥ_k) is the estimated variance of ĥ_k within the kth population. It has the same form as Var_k(ĥ_k) given by Pons & Petit (1995) but with x_ki instead of p_ki. Eqn (4) is therefore a biased estimator of the within-population variance of ĥ_S. Comparing eqn (5) with eqn 12 in Pons & Petit (1995), it appears that the second bootstrap variance is an estimator of the total variance of ĥ_S, and not of the between-population variance as might have been expected when populations alone are drawn. In eqn (6), Ê_k(ĥ²_k) is an estimator of E_k(ĥ²_k) = Var_k(ĥ_k) + h²_k obtained by replacing p_ki with x_ki, i.e. Ê_k(ĥ²_k) = V̂ar_k(ĥ_k) + h̃²_k. This bootstrap variance is therefore a biased estimator of the sum of the within-population and total variances of ĥ_S.
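The remark that resampling populations alone yields the total variance can be checked numerically. The sketch below assumes that ĥ_S is the unweighted mean of the ĥ_k; in that case the bootstrap variance under population resampling has the standard closed form n⁻² Σ_k (ĥ_k − ĥ_S)², which is (n−1)/n times the usual sample variance of a mean. Whether this matches eqn (5) term by term is an assumption, and the ĥ_k values are hypothetical. The spread of the observed ĥ_k around ĥ_S already contains both the true differences among populations and the sampling noise within each population, which is why this quantity tracks the total rather than the between-population variance.

```python
import numpy as np

# Hypothetical per-population diversity estimates h_k (not data from the paper).
h_k = np.array([0.36, 0.56, 0.60, 0.53, 0.41, 0.48])
n = h_k.size
h_s = h_k.mean()

# Closed-form bootstrap variance of the mean when the n populations are
# resampled with replacement (standard result for the bootstrap of a mean).
closed_form = ((h_k - h_s) ** 2).sum() / n ** 2

# Usual unbiased estimate of the variance of a mean, for comparison.
sample_var_of_mean = ((h_k - h_s) ** 2).sum() / (n * (n - 1))

# Monte Carlo approximation of the same bootstrap variance.
rng = np.random.default_rng(0)
B = 20000
idx = rng.integers(0, n, size=(B, n))       # B bootstrap samples of n populations
monte_carlo = h_k[idx].mean(axis=1).var()   # variance over bootstrap replicates

print(closed_form, sample_var_of_mean, monte_carlo)
```

The factor (n−1)/n between the two closed-form values is negligible for a large number of populations but becomes noticeable when n is small.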

Closed forms of the bootstrap variance of ĥ_T are more complicated and we use the same approximations as in Pons & Petit (1995) for large n. The three bootstrap variances of ĥ_T are then approximated, up to the order n⁻², as

and

For the first bootstrap variance of ĥ_T, the right-hand side of eqn (7) is an estimator of Var_intra(ĥ_T), the within-population variance of ĥ_T given by eqn 14 in Pons & Petit (1995), but with n_k−1 in Var_intra(ĥ_T) replaced by n_k. Thus, it is biased and it may differ substantially from the direct estimator for small population sample sizes. If the sample sizes of the bootstrap populations are modified and set to n_k−1 for the kth population, the bootstrap estimator of h_T is still ĥ_T and its bootstrap variance estimates the within-population variance of ĥ_T. By comparison with our direct estimator of Var(ĥ_T), we can see that the right-hand side of eqn (8) is an estimator of the total variance of ĥ_T. This is therefore also the case for the second bootstrap variance estimator. Finally, under the third procedure, the estimated bootstrap variance is approximately the sum of the total and within-population bootstrap variances of ĥ_T.
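The role of n_k−1 can be illustrated on a single allele frequency rather than on ĥ_T itself; the following worked step is a simplified sketch under that assumption, with m denoting a hypothetical bootstrap sample size for the kth population.

```latex
% Bootstrap variance of one resampled frequency when m individuals are drawn
% with replacement from the observed sample of population k:
\operatorname{Var}^{*}\!\left(x^{*}_{ki}\right) = \frac{x_{ki}\,(1 - x_{ki})}{m}.

% Under the original multinomial sampling of n_k individuals:
\operatorname{Var}_k(x_{ki}) = \frac{p_{ki}\,(1 - p_{ki})}{n_k},
\qquad
E_k\!\left[x_{ki}\,(1 - x_{ki})\right] = \frac{n_k - 1}{n_k}\; p_{ki}\,(1 - p_{ki}).

% Hence
E_k\!\left[\operatorname{Var}^{*}\!\left(x^{*}_{ki}\right)\right]
  = \frac{n_k - 1}{m}\,\operatorname{Var}_k(x_{ki}),
% which equals Var_k(x_ki) for m = n_k - 1 and is too small by the factor
% (n_k - 1)/n_k for m = n_k.
```

For large n_k the factor (n_k−1)/n_k is close to one, so the choice between n_k and n_k−1 matters mainly for small within-population samples.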

We get similar results for the bootstrap estimators of the covariance between ĥ_S and ĥ_T, Cov*_(v)(ĥ_S, ĥ_T). Thus, the bootstrap procedures provide estimators of the variance matrix of (ĥ_S, ĥ_T) when sampling the populations alone, and of its within-population variance when sampling only the individuals. Such results also hold for Ĝ_ST, and in particular the bootstrap estimator of its variance is approximately

If n is large, this expression is close to the direct estimator of Var(Ĝ_ST) defined by Pons & Petit (1995). However, for small n, the bias of this bootstrap estimator will become apparent and the direct estimation is preferable.

Numerical example

The data set originates from a large study of gene diversity of sessile oak (Quercus petraea (Matt.) Liebl.) in Europe using several isozyme markers (Zanetto & Kremer, 1995). A total of 81 populations were sampled over most of the European range of this species. We selected a single locus (acid phosphatase, EC 3.1.3.2). Sessile oak is a diploid species and a total of five alleles and 12 genotypes were detected at this locus in the survey. For the purposes of illustrating the approach described in this paper, all the analyses are made at the genotypic level, where each genotype is equivalent to an allele in a haploid locus. Alternatively, Hardy–Weinberg equilibrium could have been assumed, to consider the data as haploid. This has no consequence in regard to the question studied here.

The mean number of individuals per population is 114.6, for a total of 9281 individuals analysed. Alleles 2 and 4 are largely predominant, and genotypes 22, 24 and 44 make up 98 per cent of all genotypes found. In Table 1, simulated bootstrap estimates of the three parameters are compared with the direct estimates (Pons & Petit, 1995). In accordance with the theoretical results presented previously, bootstrapping over individuals only provides an unbiased estimate of h_S but a biased one for h_T, and bootstrapping over populations only provides a biased estimate of h_S and an unbiased one for h_T, whereas the two-step bootstrap provides biased estimates for both h_S and h_T. All bootstrap estimates for G_ST are therefore biased.

Table 1 Direct estimates and bootstrap simulated estimates of the parameters (using 1000 bootstrap samples). The direct estimates are ĥ_S, ĥ_T and Ĝ_ST (Pons & Petit, 1995) and the biased estimates are h̃_S, h̃_T given by eqn (1) and G̃_ST = 1 − h̃_S/h̃_T

In Table 2, we computed total, inter- and intrapopulation variances for the estimates of h_S, h_T and G_ST, following the method of Pons & Petit (1995). The estimators of h_S and h_T have similar variances but the estimate of G_ST (which derives directly from the other two parameters) is less precise. These estimates are then compared with those obtained by a bootstrap procedure, either empirically (1000 bootstrap simulations) or using eqns (4)–(9). The three types of bootstrap were considered, as discussed above. Overall, the bootstrap estimates are in excellent agreement with the results obtained in the simulations (Table 2). Hence, the approximations (7)–(9) are acceptable in this example. The same approximations led to the expressions of Var_intra(ĥ_T), Var_inter(ĥ_T), Cov_intra(ĥ_S, ĥ_T) and Cov_inter(ĥ_S, ĥ_T) in Pons & Petit (1995). By analogy, these expressions must also be sufficiently precise. The comparison of the variances obtained using either bootstrap (through simulations or estimations) or direct estimates clearly shows that a two-step bootstrap yields the sum of the total and intrapopulation variances instead of the total variance, in agreement with the theory developed in the previous section. Moreover, the bootstrap over populations gives estimates of the total variances (and not of the interpopulation variances). Finally, the bootstrap over individuals does estimate the intrapopulation variance, though it appears to give an inflated estimate in the case of G_ST.

Table 2 Direct and bootstrap variance estimates (×10⁴) of gene diversity estimates for the complete data set (using 1000 bootstrap samples)

Because the population sample sizes are large, the bootstrap over individuals is not greatly modified by the sampling of n_k−1 individuals instead of n_k, as proposed above (results unchanged for Var(ĥ_T) for the bootstrap over individuals: 0.058×10⁻⁴ by both methods). Another example was studied using a subset of the complete data set where 20 individuals were selected at random and without replacement in each of the 81 populations. Direct and bootstrap variance estimates were then computed as before. The results indicate that, with this sample size, the procedure of bootstrapping over individuals no longer provides a good estimate of the intrapopulation variance of G_ST (Table 3). Hence, we recommend the use of the direct analytical estimates in these situations.
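The reduced data set described above, in which a fixed number of individuals is drawn at random and without replacement from each population, can be produced along the following lines. This is a sketch with hypothetical genotype counts, not the program used for Table 3; drawing without replacement from a vector of category counts is a multivariate hypergeometric draw.

```python
import numpy as np

def subsample_without_replacement(counts, m, rng):
    """Draw m individuals without replacement from each population.

    counts : (n, I) integer matrix of genotype (or allele) counts.
    m      : number of individuals kept per population (must not exceed any n_k).
    """
    counts = np.asarray(counts)
    return np.array([rng.multivariate_hypergeometric(row, m) for row in counts])

# Hypothetical counts for 3 populations and 3 genotype classes.
counts = np.array([[60, 40, 14], [90, 20, 5], [30, 70, 10]])
rng = np.random.default_rng(1)
sub = subsample_without_replacement(counts, 20, rng)
print(sub)
```

The direct and bootstrap variance estimates can then be recomputed on the reduced matrix exactly as on the full one.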

Table 3 Direct and bootstrap variance estimates (×10⁴) of gene diversity estimates for a subset of 20 randomly sampled individuals in each of the 81 populations (using 1000 bootstrap samples)

Conclusion

Resampling procedures, as emphasized by Crowley (1992), are often used without being validated, especially in the field of population genetics. Although seemingly appealing, intuitive resampling procedures that mimic the sampling of individuals and populations may turn out to be misleading. Here we have shown that the most appropriate bootstrap variance estimators of ĥ_S, ĥ_T and Ĝ_ST are obtained by resampling over populations only, rather than over both populations and individuals. Nevertheless, we recommend using the direct estimators (Pons & Petit, 1995), which we have indirectly validated here. These estimators do not require the computing time necessary for resampling procedures and they are unbiased.

References

Crowley, P. H. (1992). Resampling methods for computation-intensive data analysis in ecology and evolution. Ann Rev Ecol Syst. 23: 405–447.
Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.
Excoffier, L., Smouse, P. E. and Quattro, J. M. (1992). Analysis of molecular variance inferred from metric distances among haplotypes: application to human mitochondrial DNA restriction data. Genetics. 131: 479–491.
Hudson, R. R., Boos, D. D. and Kaplan, N. L. (1992). A statistical test for detecting geographic subdivision. Mol Biol Evol. 9: 138–151.
McDonald, J. H. (1994). Detecting natural selection by comparing geographic variation in protein and DNA polymorphisms. In: Golding, B. (ed.) Non-neutral Evolution. Theories and Molecular Data, pp. 88–100. Chapman and Hall, New York.
Nei, M. (1973). Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci USA. 70: 3321–3323.
Nei, M. (1986). Definition and estimation of fixation indices. Evolution. 40: 643–645.
Palumbi, S. R. and Wilson, A. C. (1990). Mitochondrial DNA diversity in the sea urchins Strongylocentrotus purpuratus and S. droebachiensis. Evolution. 44: 403–415.
Petit, R. J., Kremer, A. and Wagner, D. B. (1993). Geographic structure of chloroplast DNA polymorphisms in European oaks. Theor Appl Genet. 87: 122–128.
Pons, O. (1997). Bootstrap variance of diversity and differentiation estimators in a subdivided population: technical proofs. Unpublished manuscript available from the author.
Pons, O. and Chaouche, K. (1995). Estimation, variance and optimal sampling of gene diversity. II. Diploid locus. Theor Appl Genet. 91: 122–130.
Pons, O. and Petit, R. J. (1995). Estimation, variance and optimal sampling of gene diversity. I. Haploid locus. Theor Appl Genet. 90: 462–470.
Pons, O. and Petit, R. J. (1996). Measuring and testing genetic differentiation with ordered and unordered alleles. Genetics. 144: 1237–1245.
Prout, T. and Barker, J. S. F. (1993). F-statistics in Drosophila buzzatii: selection, population size and inbreeding. Genetics. 134: 369–375.
Rao, J. N. K. and Wu, C. F. J. (1988). Resampling inference with complex survey data. J Am Stat Ass. 83: 231–241.
Sitter, R. R. (1992). A resampling procedure for complex survey data. J Am Stat Ass. 87: 755–765.
Van Dongen, S. (1995). How should we bootstrap allozyme data? Heredity. 74: 445–447.
Weir, B. S. (1990). Genetic Data Analysis. Sinauer Associates, Sunderland, MA.
Weir, B. S. and Cockerham, C. C. (1984). Estimating F-statistics for the analysis of population structure. Evolution. 38: 1358–1370.
Wright, S. (1951). The genetical structure of populations. Ann Eugen. 15: 323–354.
Zanetto, A. and Kremer, A. (1995). Geographical structure of gene diversity in Quercus petraea (Matt.) Liebl. I. Monolocus patterns of variation. Heredity. 75: 506–517.

Acknowledgements

The authors thank an anonymous referee for his careful reading and helpful comments, which improved the presentation of the paper, Anne Zanetto for providing material from her study on oak species and Thierry Labbé for skilful computing assistance and for writing the bootstrap simulation programs.

Author information

Authors and Affiliations

Institut National de la Recherche Agronomique, Laboratoire de Génétique et Amélioration des Arbres Forestiers, B.P.45, F-33611 Gazinet, cedex, France
R J Petit
Institut National de la Recherche Agronomique, Laboratoire de Biométrie, F-78352 Jouy-en-Josas, cedex, France
O Pons

Corresponding author

Correspondence to R J Petit.

About this article

Cite this article

Petit, R., Pons, O. Bootstrap variance of diversity and differentiation estimators in a subdivided population. Heredity 80, 56–61 (1998). https://doi.org/10.1046/j.1365-2540.1998.00282.x

Received: 20 December 1996
Published: 01 January 1998
Issue Date: 01 January 1998
DOI: https://doi.org/10.1046/j.1365-2540.1998.00282.x
