Computer programs for population genetics data analysis: a survival guide

Excoffier, Laurent; Heckel, Gerald

doi:10.1038/nrg1904

Review Article
Published: 22 August 2006


Nature Reviews Genetics volume 7, pages 745–758 (2006)


Key Points

Computer programs are now essential for the analysis of large population genetics data sets that are increasingly being generated.
We review 24 such programs here, and list their main features, limitations and some of their underlying assumptions.
Several user-friendly programs provide methods to compute standard genetic-diversity indices from various types of marker, as well as to test genetic structure, linkage disequilibrium and selective neutrality within populations.
Several programs that use multilocus genotype information have recently focused on individuals rather than on populations, and provide ways to delineate populations, detect new immigrants and their population of origin, and estimate individual admixture coefficients.
New coalescent-based programs provide powerful methods to estimate demographic parameters; however, these methods have been developed under specific evolutionary models, and the accuracy of the results also depends on the convergence of the programs.
Most programs are based on many well-documented assumptions that their users need to take into account for a sound interpretation of the results.
Proper genetic data analyses should start with generalist packages to uncover the basic properties of the data, and be followed by the use of specialized methodologies to address more specific questions.
An important limitation of the wider use of sophisticated programs is the lack of a generic population genetics format, which would allow data to be easily exchanged between programs.

Abstract

The analysis of genetic diversity within species is vital for understanding evolutionary processes at the population level and at the genomic level. A large quantity of data can now be produced at an unprecedented rate, requiring the use of dedicated computer programs to extract all embedded information. Several statistical packages have been recently developed, which offer a panel of standard and more sophisticated analyses. We describe here the functionalities, special features and assumptions of more than 20 such programs, indicate how they can interoperate, and discuss new directions that could lead to improved software and analyses.


**Figure 1: Flow chart of possible data exchange between different population genetics programs.**



Acknowledgements

We are grateful to P. Beerli for providing an illustration from Migrate's manual. We also thank him, as well as O. Gaggiotti, J. Goudet and A. Estoup, for suggestions and comments on an early version of this manuscript. We are indebted to three reviewers for their comments. We apologize to the authors of programs which, owing to space constraints, we have not been able to cover here. The work in L.E.'s laboratory is partially supported by a grant from the Swiss National Science Foundation.

Author information

Authors and Affiliations

Computational and Molecular Population Genetics (CMPG) Laboratory, Zoological Institute, University of Berne, Baltzerstrasse 6, Berne, 3012, Switzerland
Laurent Excoffier & Gerald Heckel


Corresponding author

Correspondence to Laurent Excoffier.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Glossary

Linkage disequilibrium: The non-random association of alleles at different loci.
Gametic phase: In a diploid individual, it represents the original allelic combinations that an individual received from its parents. It is therefore a particular association of alleles at different loci on the same chromosome, which is often unknown.
Selective neutrality: Null model of evolution that assumes that all the alleles observed at a given locus are functionally equivalent.
Bayesian: Inference framework, based on the work of Thomas Bayes (1702–1761), in which the posterior probability of a parameter depends explicitly on its prior probability, reflecting some previous belief about this parameter.
Short tandem repeat (or microsatellite): A class of repetitive DNA that is made up of repeats that are 2–5 nucleotides in length. The number of these repeats is usually extremely variable in a population.
Homoplasic mutations: Mutations that lead to identical character states (identity-in-state) despite having occurred by different evolutionary processes.
Coalescent (theory): A theory that describes the structure of the genealogy of a sample of genes from present time to their most recent common ancestor. For neutral genes, this genealogy is extremely variable but only depends on the past demography (deme sizes and immigration rates) of the population.
Maximum-likelihood estimation: Inference technique in which the estimated parameters of a model are those that maximize the probability of the data under that model.
Hardy–Weinberg equilibrium (HWE): Fit between the observed frequencies of the different genotype categories and the frequencies that are expected under random mating in an ideal population. Departure from HWE can also be due to selection, migration or hidden population subdivision.
F-statistics: Statistics that measure the correlation between genes drawn at different levels of a (hierarchically) subdivided population. This correlation is influenced by several evolutionary forces, such as mutation and migration, but the statistics were originally designed to measure how far populations had gone in the process of fixation owing to genetic drift.
Hierarchical analyses of genetic variance: Analysis in which genetic diversity is hierarchically organized, with subunits nested in larger units (for example, genes in diploid individuals drawn from demes belonging to a subdivided population).
Mantel test: Test designed to measure the association between the elements of two matrices, by taking into account the autocorrelation that exists between the elements of each matrix. It is often used to test for a significant association between genetic and geographical distances.
Mismatch distribution: The distribution of the number of differences (mismatches) between pairs of DNA sequences in a sample. The exact shape of this distribution is affected by the past demography of a population.
Infinite-sites model: A mutation model according to which each new mutation occurs at a site that has not mutated before. This model was originally developed for protein- and DNA-sequence evolution, and is closely related to the infinite-allele mutation model.
Infinite-allele mutation model: A mutation model according to which each new mutation produces an allele that has not previously existed.
Summary statistics: In the current genetic context, these are descriptive statistics summarizing the pattern of genetic diversity, such as the level of heterozygosity or the number of alleles per locus.
D: A measure of linkage disequilibrium defined as the difference between the frequency of a two-locus haplotype and the product of the frequencies of its constituent alleles (D_ij = p_ij − p_i p_j).
D′: A standardized version of D that is obtained by dividing D by its maximum possible value given the allele frequencies (D′ = D/D_max).
Tajima's D: Statistic used in a selective neutrality test to decide whether the mean number of differences between pairs of DNA sequences is compatible with the observed number of segregating sites in a sample.
Likelihood (of a model): The probability of the data under a given model defined by a particular set of parameter values.
Joint posterior distribution: When a model is defined by more than one parameter, it is the posterior distribution of all possible combinations of parameter values.
Effective population size: The size of a virtual, randomly mating, stationary and isolated population that would have the same amount and type of polymorphisms as the population under study.
Finite-island model: A conceptual model for gene flow under which a finite number of demes exchange migrants with each other. The spatial location of the populations is not specified, and the constituent demes are usually assumed to have the same size and to exchange migrants at the same rate.
F_ST: A measure of the level of population genetic differentiation, which usually reflects the proportion of total genetic variability that is due to the net differences between populations (see F-statistics).
Balancing selection: A form of natural selection that maintains polymorphism within populations.
AFLP: Amplified fragment length polymorphism. A method for the selective PCR amplification of anonymous, dominant DNA polymorphisms using restriction enzymes and DNA linkers.
Ascertainment bias: Systematic bias introduced by the criteria used to select individuals and/or genetic markers to be analysed (for example, choosing SNPs with heterozygosity that is higher than a given threshold).
Selective sweep: Drastic reduction of the genetic diversity along a chromosomal segment as a consequence of the fixation of an advantageous mutation by selection in that region.
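
Several of the glossary statistics above can be computed in a few lines. The Python sketch below is an illustration only, not the implementation used by any of the reviewed programs. All function names are hypothetical, sequences are assumed to be aligned strings of equal length without missing data, D′ is reported as an absolute value, F_ST is the simple heterozygosity-based ratio (H_T − H_S)/H_T for one biallelic locus with equally weighted demes, and Tajima's D uses the constants of Tajima (1989).

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def linkage_disequilibrium(p_ij, p_i, p_j):
    """Return (D, |D'|) for the haplotype carrying alleles i and j."""
    d = p_ij - p_i * p_j
    # D_max depends on the sign of D and on the allele frequencies.
    if d >= 0:
        d_max = min(p_i * (1 - p_j), (1 - p_i) * p_j)
    else:
        d_max = min(p_i * p_j, (1 - p_i) * (1 - p_j))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d, d_prime


def hwe_expected(p):
    """Expected genotype frequencies (AA, Aa, aa) under random mating."""
    q = 1 - p
    return p * p, 2 * p * q, q * q


def fst_from_frequencies(freqs):
    """(H_T - H_S) / H_T from the frequency of one allele in each deme."""
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled demes
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)  # mean within demes
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


def mismatch_distribution(seqs):
    """Count pairs of sequences by their number of differing sites."""
    return Counter(
        sum(a != b for a, b in zip(s, t)) for s, t in combinations(seqs, 2)
    )


def tajimas_d(seqs):
    """Tajima's D from aligned sequences (needs at least 4 sequences)."""
    n = len(seqs)
    if n < 4:
        raise ValueError("the variance term is zero for fewer than 4 sequences")
    n_pairs = n * (n - 1) // 2
    # Mean number of pairwise differences.
    pi = sum(k * c for k, c in mismatch_distribution(seqs).items()) / n_pairs
    # Number of segregating sites: columns with more than one state.
    s = sum(len(set(column)) > 1 for column in zip(*seqs))
    if s == 0:
        return 0.0  # D is undefined without polymorphism; 0.0 is this sketch's convention
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
```

For example, `tajimas_d(["AAAA", "TAAA", "ATAA", "AATA"])` is negative because every polymorphic site is a singleton, whereas `tajimas_d(["AAA", "AAA", "TTT", "TTT"])` is positive because all variants are at intermediate frequency. Dedicated packages additionally handle missing data, multiple alleles and significance testing, which this sketch ignores.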


About this article

Cite this article

Excoffier, L., Heckel, G. Computer programs for population genetics data analysis: a survival guide. Nat Rev Genet 7, 745–758 (2006). https://doi.org/10.1038/nrg1904


Published: 22 August 2006
Issue Date: 01 October 2006
DOI: https://doi.org/10.1038/nrg1904

