Key Points
- Wright's F-statistics, and especially FST, provide important insights into the evolutionary processes that influence the structure of genetic variation within and among populations, and they are among the most widely used descriptive statistics in population and evolutionary genetics.
- FST is a property of the distribution of allele frequencies among populations. It reflects the joint effects of drift, migration, mutation and selection on the distribution of genetic variation among populations.
- FST has a central role in population and evolutionary genetics and has wide applications in fields from disease association mapping to forensic science.
- FST can be used to describe the distribution of genetic variation among any set of samples, but it is most usefully applied when the samples represent discrete units rather than arbitrary divisions along a continuous distribution.
- Statistics related to FST can be useful for haplotype or microsatellite data if an appropriate measure of evolutionary distance among alleles is available.
- Comparison of an estimate of FST from marker data with an estimate of QST from continuously varying trait data can be used to detect selection, but the estimate of FST may depend on the choice of marker and the estimate of QST may differ from neutral expectations if there is a non-additive component of genetic variance.
- Although the simple relationship between FST and migration rates in Wright's island model makes it tempting to infer migration rates from FST, caution is needed if such an approach is to be used.
- If estimates of FST from many loci are available, it may be possible to identify certain loci as 'outliers' that may have been subject to different patterns of selection or to different demographic processes.
- Case–control association-mapping studies must account for the possibility that population substructure accounts for an observed association between a marker and a disease. The genomic control method uses background estimates of FST to control for such substructure.
- In forensic applications, the probabilities of obtaining a match are sometimes calculated for subpopulations that lack specific allele frequency data. A θ correction, in which θ is FST, is used to calculate the probability of a match using allele frequency information from the broader population of which the subpopulation is part.
- The massive amount of data that is being generated by population genomics projects can be understood fundamentally as allelic variation at individual loci. We therefore expect F-statistics to be at least as useful in understanding these data sets as they have been in population and evolutionary genetics for most of the last century.
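The caution about inferring migration rates from FST can be made concrete with a small sketch. It assumes the classic equilibrium approximation for Wright's island model, FST ≈ 1/(4Nm + 1), where N is the effective size of each subpopulation and m is the migration rate; the function names are hypothetical and are not taken from any package listed in this Review.

```python
def fst_island(n_e: float, m: float) -> float:
    """Approximate equilibrium FST in Wright's island model: 1 / (4*N*m + 1)."""
    if n_e <= 0 or m < 0:
        raise ValueError("n_e must be positive and m non-negative")
    return 1.0 / (4.0 * n_e * m + 1.0)


def nm_from_fst(fst: float) -> float:
    """Invert the approximation to get N*m, the implied migrants per generation.

    Assumption: the island model holds exactly and the populations are at
    drift-migration equilibrium, which real data rarely guarantee.
    """
    if not 0.0 < fst <= 1.0:
        raise ValueError("fst must lie in (0, 1]")
    return (1.0 / fst - 1.0) / 4.0


if __name__ == "__main__":
    # Small differences in a low FST translate into large differences in Nm.
    for fst in (0.20, 0.05, 0.02, 0.01):
        print(f"FST = {fst:.2f}  ->  implied Nm = {nm_from_fst(fst):.2f}")
```

Under this approximation, an FST of 0.02 implies Nm of about 12, whereas an FST of 0.01 implies Nm of about 25. Modest estimation error at low FST can therefore roughly double the inferred migration rate, which is one reason such inferences call for caution.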
Abstract
Wright's F-statistics, and especially FST, provide important insights into the evolutionary processes that influence the structure of genetic variation within and among populations, and they are among the most widely used descriptive statistics in population and evolutionary genetics. Estimates of FST can identify regions of the genome that have been the target of selection, and comparisons of FST from different parts of the genome can provide insights into the demographic history of populations. For these reasons and others, FST has a central role in population and evolutionary genetics and has wide applications in fields that range from disease association mapping to forensic science. This Review clarifies how FST is defined, how it should be estimated, how it is related to similar statistics and how estimates of FST should be interpreted.
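As a rough illustration of FST as a property of the allele frequency distribution among populations, the sketch below uses the textbook variance form for a biallelic locus, FST = Var(p) / (p̄(1 − p̄)). It assumes that the subpopulation allele frequencies are known without sampling error and that populations are weighted equally. It is therefore not one of the estimators discussed in this Review, and the function name is hypothetical.

```python
from statistics import fmean, pvariance


def fst_from_frequencies(freqs: list[float]) -> float:
    """Variance-ratio FST for one biallelic locus.

    freqs: frequency of one allele in each subpopulation (assumed known
    exactly; sampling error is deliberately ignored in this sketch).
    """
    if len(freqs) < 2:
        raise ValueError("need at least two subpopulations")
    if any(not 0.0 <= p <= 1.0 for p in freqs):
        raise ValueError("allele frequencies must lie in [0, 1]")
    p_bar = fmean(freqs)                 # mean allele frequency
    if p_bar in (0.0, 1.0):
        raise ValueError("locus is fixed in every subpopulation")
    var_p = pvariance(freqs, mu=p_bar)   # among-population variance
    return var_p / (p_bar * (1.0 - p_bar))


if __name__ == "__main__":
    print(fst_from_frequencies([0.5, 0.5]))  # identical frequencies
    print(fst_from_frequencies([0.2, 0.8]))  # moderately divergent
    print(fst_from_frequencies([0.0, 1.0]))  # fixed for alternative alleles
```

The three calls span the range of the statistic: no differentiation when frequencies are identical, and complete differentiation when subpopulations are fixed for alternative alleles. With real samples, an estimator that accounts for sampling within and among populations would be needed instead.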
Acknowledgements
We thank R. Prunier and K. Theiss for their helpful comments on earlier versions of this Review. The work in the laboratories of the authors was supported in part by grants from the US National Institutes of Health (1 R01 GM 068449-01A1 to K.E.H.; 1 R01 GM 075091 to B.S.W.).
Related links
FURTHER INFORMATION
ABC4F (approximate Bayesian computation for F-statistics)
Arlequin (an integrated software application for population genetics data analysis)
BayeScan (BAYEsian genome SCAN for outliers)
Bayesian population genetic data analysis
GenAlEx (integrated software for analysis of genetic data with an interface to Excel)
GESTE (GEnetic STructure inference based on genetic and Environmental data)
Hickory (software for the analysis of geographic structure in genetic data)
Hierfstat (Weir & Cockerham F-statistics for any number of levels in a hierarchy)
Nature Reviews Genetics series on Fundamental Concepts in Genetics
The genetic structure of populations
Glossary
- Genetic drift: The random fluctuations in allele frequencies over time that are due to chance alone.
- Short tandem repeat loci: Loci consisting of short sequences (2–6 nucleotides) that are repeated multiple times. Alleles at short tandem repeat loci differ from one another in their number of repeats.
- Variance: A measure of the amount of variation around a mean value.
- Diversifying selection: Selection in which different alleles are favoured in different populations. It is often a consequence of local adaptation (in which genotypes from different populations have higher fitness in their home environments owing to historical natural selection).
- Hardy–Weinberg proportions: The genotype frequencies that arise when the frequency of each diploid genotype at a locus equals that expected from the random union of alleles. That is, the genotypes AA, Aa and aa will be at frequencies p², 2pq and q², respectively.
- Heterozygote advantage: A pattern of natural selection in which heterozygotes are more likely to survive than homozygotes.
- Likelihood: A mathematical function that describes the relationship between the data and the unknown parameters of a statistical distribution (for example, the mean and variance of the allele frequency distribution among populations or the allele frequency in a particular population). It is directly proportional to the probability of the data given the unknown parameters.
- Prior distribution: A statistical distribution used in Bayesian analysis to describe the probability that parameters take on a particular value before examining any data. It expresses the level of uncertainty about those parameters before the data have been analysed.
- Posterior distribution: A statistical distribution used in Bayesian analysis to describe the probability that parameters take a particular value after the data have been analysed. It reflects both the likelihood of the data given particular parameters and the prior probability that parameters take particular values.
- Markov chain Monte Carlo methods: Computational methods that are widely used for approximating complex integrals and other functions. In this context, these methods are used to approximate the posterior distribution of a Bayesian model.
- Multinomial distribution: A statistical distribution that describes the probability of obtaining a sample with a specified number of objects in each of several categories. The probability is determined by the total sample size and the probability of drawing an object from each category. The binomial distribution is a special case of the multinomial distribution in which there are two categories.
- Additive genetic variance: The part of the total genetic variation that is due to the main (or additive) effects of alleles on a phenotype. The additive variance determines the degree of resemblance between relatives and therefore the response to selection.
- Stabilizing selection: Selection in which either the same allele or the same genotype is favoured in different populations.
- Effective population size: Formulated by Wright in 1931, the effective population size reflects the size of an idealized population that would experience drift in the same way as the actual (census) population. The effective population size can be lower than the census population size owing to various factors, including a history of population bottlenecks and reduced recombination.
- Coalescent-based approaches: Approaches that use statistical properties of the genealogical relationship among alleles under particular demographic and mutational models to make inferences about the effective size of populations and about rates of mutation and migration.
- Conditional autoregressive scheme: A statistical approach developed for analysis of data in which a random effect is associated with the spatial location of each observation. The magnitude of the random effect is determined by a weighted average of the random effects of nearby positions. In most applications, the weights of the averages are inversely related to the spatial distance between two sample points.
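The Hardy–Weinberg proportions defined in the Glossary can be illustrated with a short sketch. It computes the expected genotype frequencies for a biallelic locus. It then shows the heterozygote deficit that appears when equally sized subpopulations, each assumed to be in Hardy–Weinberg proportions, are pooled and treated as one population. The function names and the equal-size assumption are illustrative choices and are not taken from the Review.

```python
def hardy_weinberg(p: float) -> tuple[float, float, float]:
    """Expected frequencies of AA, Aa and aa given allele frequency p of A."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


def heterozygote_deficit(freqs: list[float]) -> float:
    """1 - H_obs / H_exp for a pooled sample of subpopulations.

    Assumptions: each subpopulation is itself in Hardy-Weinberg proportions
    and all subpopulations contribute equally to the pooled sample.
    """
    if not freqs:
        raise ValueError("need at least one subpopulation")
    # Mean heterozygosity actually present within subpopulations.
    h_obs = sum(hardy_weinberg(p)[1] for p in freqs) / len(freqs)
    # Heterozygosity expected if the pooled sample were one random-mating unit.
    p_bar = sum(freqs) / len(freqs)
    h_exp = hardy_weinberg(p_bar)[1]
    if h_exp == 0.0:
        raise ValueError("locus is fixed in the pooled sample")
    return 1.0 - h_obs / h_exp


if __name__ == "__main__":
    print(hardy_weinberg(0.5))
    print(heterozygote_deficit([0.2, 0.8]))
```

Under these assumptions the deficit is zero for a single population and grows as allele frequencies diverge among subpopulations, which is the link between departures from Hardy–Weinberg proportions in pooled samples and population structure.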
Cite this article
Holsinger, K., Weir, B. Genetics in geographically structured populations: defining, estimating and interpreting FST. Nat Rev Genet 10, 639–650 (2009). https://doi.org/10.1038/nrg2611
DOI: https://doi.org/10.1038/nrg2611