Computer programs for population genetics data analysis: a survival guide

Key Points

  • Computer programs are now essential for the analysis of large population genetics data sets that are increasingly being generated.

  • We review 24 such programs here, and list their main features, limitations and some of their underlying assumptions.

  • Several user-friendly programs provide methods to compute standard genetic-diversity indices from various types of marker, as well as to test genetic structure, linkage disequilibrium and selective neutrality within populations.

  • Several programs that use multilocus genotype information have recently focused on individuals rather than on populations, and provide ways to delineate populations, detect new immigrants and their population of origin, and estimate individual admixture coefficients.

  • New coalescent-based programs provide powerful methods to estimate demographic parameters; however, these parameters have been developed under specific evolutionary models, and the accuracy of the results also depends on the convergence of the programs.

  • Most programs are based on many well-documented assumptions that need to be integrated by their users for a sound interpretation of the results.

  • Proper genetic data analyses should start with generalist packages to uncover the basic properties of the data, and be followed by the use of specialized methodologies to address more specific questions.

  • An important limitation of the wider use of sophisticated programs is the lack of a generic population genetics format, which would allow data to be easily exchanged between programs.

Abstract

The analysis of genetic diversity within species is vital for understanding evolutionary processes at the population level and at the genomic level. A large quantity of data can now be produced at an unprecedented rate, requiring the use of dedicated computer programs to extract all embedded information. Several statistical packages have been recently developed, which offer a panel of standard and more sophisticated analyses. We describe here the functionalities, special features and assumptions of more than 20 such programs, indicate how they can interoperate, and discuss new directions that could lead to improved software and analyses.

Figure 1: Flow chart of possible data exchange between different population genetics programs.

Acknowledgements

We are grateful to P. Beerli for providing an illustration from Migrate's manual. We also thank him, as well as O. Gaggiotti, J. Goudet and A. Estoup, for suggestions and comments on an early version of this manuscript. We are indebted to three reviewers for their comments. We apologize to the authors of programs which, owing to space constraints, we have not been able to cover here. The work in L.E.'s laboratory is partially supported by a grant from the Swiss National Science Foundation.

Author information

Corresponding author

Correspondence to Laurent Excoffier.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

Berne's CMPG (Computational and Molecular Population Genetics) programs

Bob Griffiths's GeneTree program

Bruce Rannala's programs

Computational and Molecular Population Genetics Laboratory homepage

Gil McVean's programs

Giorgio Bertorelle's programs

Ian Wilson's programs

Jérôme Goudet's programs

Jinliang Wang's programs

Jody Hey's programs

Jonathan Pritchard's programs

Kent Holsinger's programs

Louis Bernatchez's laboratory links and programs

Mark Beaumont's programs

Matthew Stephens's programs

Montgomery Slatkin's programs

Montpellier's CBGP (Centre de Biologie et de Gestion des Populations) programs

Montpellier's Genome Populations Interactions Adaptation laboratory's programs

Noah Rosenberg's programs

Oxford Evolutionary Biology Group's programs

Oxford's Mathematical Genetics programs

Peter Andolfatto's programs

Rasmus Nielsen's programs

Richard Hudson's programs

Ziheng Yang's programs

Glossary

Linkage disequilibrium

The non-random association of alleles at different loci.

Gametic phase

In a diploid individual, it represents the original allelic combinations that an individual received from its parents. It is therefore a particular association of alleles at different loci on the same chromosome, which is often unknown.

Selective neutrality

Null model of evolution that assumes that all the alleles observed at a given locus are functionally equivalent.

Bayesian

Inference framework, based on the work of Thomas Bayes (1702–1761), in which the posterior probability of a parameter depends explicitly on its prior probability, reflecting some previous belief about this parameter.
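
As a minimal illustration of how a prior and data combine into a posterior, the sketch below estimates an allele frequency with a Beta prior and binomial sampling. The function names and the uniform Beta(1, 1) default prior are assumptions chosen for this example, not taken from any of the programs reviewed.

```python
def beta_posterior(k, n, a=1.0, b=1.0):
    """Return the Beta posterior parameters for an allele frequency.

    k: copies of the allele observed, n: gene copies sampled,
    a, b: parameters of the Beta prior (assumed uniform by default).
    """
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    # Conjugate update: Beta(a, b) prior + binomial data -> Beta(a + k, b + n - k)
    return a + k, b + (n - k)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


post_a, post_b = beta_posterior(k=3, n=10)
print(post_a, post_b, beta_mean(post_a, post_b))
```

With a stronger prior (larger a and b), the posterior mean moves less far from the prior mean for the same data, which is the explicit dependence on the prior that the definition refers to.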

Short tandem repeat (or microsatellite)

A class of repetitive DNA that is made up of repeats that are 2–5 nucleotides in length. The number of these repeats is usually extremely variable in a population.

Homoplasic mutations

Mutations that lead to identical character states (identity-in-state) despite having occurred by different evolutionary processes.

Coalescent (theory)

A theory that describes the structure of the genealogy of a sample of genes from present time to their most recent common ancestor. For neutral genes, this genealogy is extremely variable but only depends on the past demography (deme sizes and immigration rates) of the population.

Maximum-likelihood estimation

Inference technique in which the estimated parameters of a model are those that maximize the probability of the data under that model.
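
The idea can be sketched for the simplest possible case, an allele frequency estimated from a binomial sample. The grid search and the function names are assumptions made for illustration; real programs use far more elaborate likelihood functions and optimizers.

```python
import math


def log_likelihood(p, k, n):
    """Binomial log-likelihood of frequency p (constant term omitted)."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def ml_allele_frequency(k, n, steps=1000):
    """Grid-search the frequency that maximizes the likelihood of the data."""
    # Exclude 0 and 1 so that log() is always defined
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: log_likelihood(p, k, n))


# 3 copies of the allele among 10 sampled genes
print(ml_allele_frequency(3, 10))
```

For this model the maximum sits at the observed proportion k/n, so the grid search is only a way to make the definition concrete.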

Hardy–Weinberg equilibrium

(HWE). Fit between the observed frequencies of the different genotype categories and the frequencies that are expected under random mating in an ideal population. Departure from HWE can also be due to selection, migration or hidden population subdivision.
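
A rough sketch of such a fit for one biallelic locus follows. It uses a chi-square goodness-of-fit statistic, which is an assumption made here for brevity; exact tests are generally preferred for small samples, and the function name is hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing genotype counts with HWE expectations."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    # Allele frequency estimated by gene counting
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic locus)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


print(hwe_chi_square(25, 50, 25))  # genotype counts matching expectations
print(hwe_chi_square(50, 0, 50))   # complete heterozygote deficit
```

The statistic has one degree of freedom for a biallelic locus; a large value signals departure from HWE but does not by itself identify the cause.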

F-statistics

Statistics that measure the correlation between genes drawn at different levels of a (hierarchically) subdivided population. This correlation is influenced by several evolutionary forces, such as mutation and migration, but it was originally designed to measure how far populations had gone in the process of fixation owing to genetic drift.

Hierarchical analyses of genetic variance

Analysis in which genetic diversity is hierarchically organized, with subunits nested in larger units (for example, genes in diploid individuals drawn from demes belonging to a subdivided population).

Mantel test

Test designed to measure the association between the elements of two matrices, by taking into account the autocorrelation that exists between the elements of each matrix. It is often used to test for a significant association between genetic and geographical distances.
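
The permutation logic behind the test can be sketched as follows, assuming two symmetric distance matrices given as nested lists. The function names, the default of 999 permutations and the one-sided p-value are choices made for this example only.

```python
import math
import random


def _upper(m, order):
    """Upper-triangle elements of m after relabelling rows/columns by `order`."""
    n = len(m)
    return [m[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def _pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel_test(a, b, permutations=999, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(a)))
    x = _upper(a, identity)
    r_obs = _pearson(x, _upper(b, identity))
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)
        # Rows and columns of b are permuted together, which preserves
        # the dependence among elements that share a population
        if _pearson(x, _upper(b, order)) >= r_obs - 1e-12:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


geo = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
gen = [[0.0, 0.1, 0.2, 0.4], [0.1, 0.0, 0.1, 0.2],
       [0.2, 0.1, 0.0, 0.1], [0.4, 0.2, 0.1, 0.0]]
print(mantel_test(geo, gen, permutations=99))
```

Permuting whole rows and columns, rather than individual matrix cells, is what lets the test account for the non-independence of pairwise distances.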

Mismatch distribution

The distribution of the number of differences (mismatches) between pairs of DNA sequences in a sample. The exact shape of this distribution is affected by the past demography of a population.
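
A minimal sketch of how the distribution is tabulated, assuming aligned sequences of equal length supplied as strings (the function name is hypothetical):

```python
from collections import Counter
from itertools import combinations


def mismatch_distribution(seqs):
    """Map each number of pairwise differences to the number of pairs showing it."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for s1, s2 in combinations(seqs, 2):
        # Number of positions at which the two sequences differ
        counts[sum(a != b for a, b in zip(s1, s2))] += 1
    return dict(counts)


print(mismatch_distribution(["AAAA", "AAAT", "AATT"]))
```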

Infinite-sites model

A mutation model according to which each new mutation occurs at a site that has not mutated before. This model was originally developed for protein- and DNA-sequence evolution, and is closely related to the infinite-allele mutation model.

Infinite-allele mutation model

A mutation model according to which each new mutation produces an allele that has not previously existed.

Summary statistics

In the current genetic context, these are descriptive statistics summarizing the pattern of genetic diversity, such as the level of heterozygosity or the number of alleles per locus.
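
Both statistics mentioned here can be computed in a few lines. The sketch below assumes a single locus whose sampled gene copies are given as a list of allele labels; the function names and the optional n/(n − 1) sample-size correction are choices made for this example.

```python
from collections import Counter


def allele_count(alleles):
    """Number of distinct alleles observed at the locus."""
    return len(set(alleles))


def expected_heterozygosity(alleles, unbiased=True):
    """Gene diversity: 1 minus the sum of squared allele frequencies."""
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two gene copies")
    freqs = [c / n for c in Counter(alleles).values()]
    h = 1.0 - sum(f * f for f in freqs)
    # Optional small-sample correction
    return h * n / (n - 1) if unbiased else h


sample = [1, 1, 2, 2]
print(allele_count(sample), expected_heterozygosity(sample, unbiased=False))
```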

D

A measure of linkage disequilibrium defined as the difference between the frequency of a two-locus haplotype and the product of the frequencies of its constituent alleles ($D_{ij} = p_{ij} - p_i p_j$).

D′

A standardized version of D that is obtained by dividing D by its maximum possible value given the allele frequencies ($D' = D/D_{\max}$).
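
The two definitions above can be sketched together for a pair of biallelic loci. The function name and argument names are assumptions; in this sketch D′ keeps the sign of D, whereas some programs report its absolute value.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, D') for alleles A and B at two biallelic loci.

    p_ab: frequency of the A-B haplotype,
    p_a, p_b: frequencies of allele A and allele B.
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        # Largest positive D compatible with the allele frequencies
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        # Largest magnitude of negative D compatible with the allele frequencies
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, d_prime


print(linkage_disequilibrium(0.5, 0.5, 0.5))   # complete association
print(linkage_disequilibrium(0.25, 0.5, 0.5))  # linkage equilibrium
```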

Tajima's D

Statistic used in a selective neutrality test to decide whether the mean number of differences between pairs of DNA sequences is compatible with the observed number of segregating sites in a sample.
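
A compact sketch of the statistic, following the standard formulation in which the mean number of pairwise differences is compared with the number of segregating sites scaled by a sample-size constant. It assumes aligned sequences given as equal-length strings with no missing data; the function name is hypothetical.

```python
import math
from itertools import combinations


def tajimas_d(seqs):
    """Tajima's D for a list of aligned, equal-length sequences."""
    n = len(seqs)
    if n < 4:
        raise ValueError("the statistic is undefined for fewer than 4 sequences")
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    # S: number of segregating (polymorphic) sites
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if s == 0:
        raise ValueError("no segregating sites")
    # pi: mean number of differences between pairs of sequences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Constants that depend only on the sample size
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# An excess of rare variants gives a negative value
print(tajimas_d(["TAAA", "ATAA", "AATA", "AAAT"]))
```

An excess of intermediate-frequency variants gives a positive value instead; either sign can reflect demography as well as selection.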

Likelihood (of a model)

The probability of the data under a given model defined by a particular set of parameter values.

Joint posterior distribution

When a model is defined by more than one parameter, it is the posterior distribution of all possible combinations of parameter values.

Effective population size

The size of a virtual, randomly mating, stationary and isolated population that would have the same amount and type of polymorphisms as the population under study.

Finite-island model

A conceptual model for gene flow under which a finite number of demes exchange migrants with each other. The spatial location of the populations is not specified, and the constituent demes are usually assumed to have the same size and to exchange migrants at the same rate.

F_ST

A measure of the level of population genetic differentiation, which usually reflects the proportion of total genetic variability that is due to the net differences between populations (see F-statistics).
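
The "proportion of total variability due to differences between populations" can be made concrete with a heterozygosity-based sketch for one biallelic locus. It assumes equally weighted subpopulations and known allele frequencies, with no sample-size correction, so it is a simplification rather than the estimator used by any particular program; the function name is hypothetical.

```python
def fst_from_frequencies(freqs):
    """F_ST = (H_T - H_S) / H_T from per-population allele frequencies."""
    if not freqs:
        raise ValueError("need at least one subpopulation")
    # H_S: mean expected heterozygosity within subpopulations
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    # H_T: expected heterozygosity of the pooled population
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        raise ValueError("locus is monomorphic in the total population")
    return (h_t - h_s) / h_t


print(fst_from_frequencies([0.0, 1.0]))  # fixed differences between populations
print(fst_from_frequencies([0.3, 0.3]))  # identical frequencies
```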

Balancing selection

A form of natural selection that maintains polymorphism within populations.

AFLP

Amplified fragment length polymorphism. A method for the selective PCR amplification of anonymous, dominant DNA polymorphisms using restriction enzymes and DNA linkers.

Ascertainment bias

Systematic bias introduced by the criteria used to select individuals and/or genetic markers to be analysed (for example, choosing SNPs with heterozygosity that is higher than a given threshold).

Selective sweep

Drastic reduction of the genetic diversity along a chromosomal segment as a consequence of the fixation of an advantageous mutation by selection in that region.

About this article

Cite this article

Excoffier, L., Heckel, G. Computer programs for population genetics data analysis: a survival guide. Nat Rev Genet 7, 745–758 (2006). https://doi.org/10.1038/nrg1904
