Computer programs for population genetics data analysis: a survival guide

Abstract

The analysis of genetic diversity within species is vital for understanding evolutionary processes at the population level and at the genomic level. A large quantity of data can now be produced at an unprecedented rate, requiring the use of dedicated computer programs to extract all embedded information. Several statistical packages have been recently developed, which offer a panel of standard and more sophisticated analyses. We describe here the functionalities, special features and assumptions of more than 20 such programs, indicate how they can interoperate, and discuss new directions that could lead to improved software and analyses.

Key Points

  • Computer programs are now essential for the analysis of large population genetics data sets that are increasingly being generated.

  • We review 24 such programs here, and list their main features, limitations and some of their underlying assumptions.

  • Several user-friendly programs provide methods to compute standard genetic-diversity indices from various types of marker, as well as to test genetic structure, linkage disequilibrium and selective neutrality within populations.

  • Several programs that use multilocus genotype information have recently focused on individuals rather than on populations, and provide ways to delineate populations, detect new immigrants and their population of origin, and estimate individual admixture coefficients.

  • New coalescent-based programs provide powerful methods to estimate demographic parameters; however, these methods have been developed under specific evolutionary models, and the accuracy of the results also depends on the convergence of the programs.

  • Most programs are based on many well-documented assumptions that need to be integrated by their users for a sound interpretation of the results.

  • Proper genetic data analyses should start with generalist packages to uncover the basic properties of the data, and be followed by the use of specialized methodologies to address more specific questions.

  • An important limitation of the wider use of sophisticated programs is the lack of a generic population genetics format, which would allow data to be easily exchanged between programs.

Figure 1: Flow chart of possible data exchange between different population genetics programs.

Acknowledgements

We are grateful to P. Beerli for providing an illustration from Migrate's manual. We also thank him, as well as O. Gaggiotti, J. Goudet and A. Estoup, for suggestions and comments on an early version of this manuscript. We are indebted to three reviewers for their comments. We apologize to the authors of programs which, owing to space constraints, we have not been able to cover here. The work in L.E.'s laboratory is partially supported by a grant from the Swiss National Science Foundation.

Author information

Correspondence to Laurent Excoffier.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

Berne's CMPG (Computational and Molecular Population Genetics) programs

Bob Griffiths's GeneTree program

Bruce Rannala's programs

Computational and Molecular Population Genetics Laboratory homepage

Gil McVean's programs

Giorgio Bertorelle's programs

Ian Wilson's programs

Jérôme Goudet's programs

Jinliang Wang's programs

Jody Hey's programs

Jonathan Pritchard's programs

Kent Holsinger's programs

Louis Bernatchez's laboratory links and programs

Mark Beaumont's programs

Matthew Stephens's programs

Montgomery Slatkin's programs

Montpellier's CBGP (Centre de Biologie et de Gestion des Populations) programs

Montpellier's Genome Populations Interactions Adaptation laboratory's programs

Noah Rosenberg's programs

Oxford Evolutionary Biology Group's programs

Oxford's Mathematical Genetics programs

Peter Andolfatto's programs

Rasmus Nielsen's programs

Richard Hudson's programs

Ziheng Yang's programs

Glossary

Linkage disequilibrium

The non-random association of alleles at different loci.

Gametic phase

In a diploid individual, it represents the original allelic combinations that an individual received from its parents. It is therefore a particular association of alleles at different loci on the same chromosome, which is often unknown.

Selective neutrality

Null model of evolution that assumes that all the alleles observed at a given locus are functionally equivalent.

Bayesian

Inference framework, based on the work of Thomas Bayes (1702–1761), in which the posterior probability of a parameter depends explicitly on its prior probability, reflecting some previous belief about this parameter.

Short tandem repeat (or microsatellite)

A class of repetitive DNA that is made up of repeats that are 2–5 nucleotides in length. The number of these repeats is usually extremely variable in a population.

Homoplasic mutations

Mutations that lead to identical character states (identity-in-state) despite having occurred by different evolutionary processes.

Coalescent (theory)

A theory that describes the structure of the genealogy of a sample of genes from present time to their most recent common ancestor. For neutral genes, this genealogy is extremely variable but only depends on the past demography (deme sizes and immigration rates) of the population.

Maximum-likelihood estimation

Inference technique in which the estimated parameters of a model are those that maximize the probability of the data under that model.

Hardy–Weinberg equilibrium

(HWE). Fit between the observed frequencies of the different genotype categories and the frequencies that are expected under random mating in an ideal population. Departure from HWE can also be due to selection, migration or hidden population subdivision.
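
As a rough illustration of this definition, the Python sketch below compares observed genotype counts at a biallelic locus with the counts expected under random mating, using a chi-square statistic with one degree of freedom. The function name `hwe_chi_square` and the example counts are hypothetical and are not taken from any of the reviewed programs, several of which rely on exact tests instead.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for departure from HWE at a biallelic locus.

    n_aa, n_ab, n_bb are the observed counts of the three genotypes.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    # Allele frequencies estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    # Genotype counts expected under random mating.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic locus).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


if __name__ == "__main__":
    # Hypothetical sample with a deficit of heterozygotes.
    print(hwe_chi_square(40, 20, 40))
```

A large value of the statistic indicates a poor fit; as the definition notes, such a departure may reflect selection, migration or hidden subdivision rather than non-random mating alone.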

F-statistics

Statistics that measure the correlation between genes drawn at different levels of a (hierarchically) subdivided population. This correlation is influenced by several evolutionary forces, such as mutation and migration, but it was originally designed to measure how far populations had gone in the process of fixation owing to genetic drift.

Hierarchical analyses of genetic variance

Analysis in which genetic diversity is hierarchically organized, with subunits nested in larger units (for example, genes in diploid individuals drawn from demes belonging to a subdivided population).

Mantel test

Test designed to measure the association between the elements of two matrices, by taking into account the autocorrelation that exists between the elements of each matrix. It is often used to test for a significant association between genetic and geographical distances.
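
The permutation logic behind this test can be sketched as follows, assuming two symmetric distance matrices given as nested Python lists. The function names, the default of 999 permutations and the one-sided p-value are illustrative choices, not the implementation of any particular program.

```python
import random
from math import sqrt


def _pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel_test(a, b, permutations=999, seed=0):
    """Return (correlation, one-sided p-value) for matrices a and b."""
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    x = [a[i][j] for i, j in pairs]

    def corr(order):
        # Rows and columns of b are relabelled together, which preserves
        # the dependence among the elements within the matrix.
        y = [b[order[i]][order[j]] for i, j in pairs]
        return _pearson(x, y)

    observed = corr(list(range(n)))
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        if corr(order) >= observed - 1e-12:
            hits += 1
    # The observed arrangement counts as one of the permutations.
    return observed, (hits + 1) / (permutations + 1)


if __name__ == "__main__":
    # Hypothetical demes along a transect; genetic distance grows with geography.
    pos = [0, 1, 3, 6, 10]
    geo = [[abs(u - v) for v in pos] for u in pos]
    gen = [[2 * d for d in row] for row in geo]
    print(mantel_test(geo, gen))
```

Permuting whole rows and columns, rather than individual matrix cells, is what distinguishes this procedure from an ordinary permutation test of a correlation coefficient.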

Mismatch distribution

The distribution of the number of differences (mismatches) between pairs of DNA sequences in a sample. The exact shape of this distribution is affected by the past demography of a population.
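
A minimal Python sketch of how such a distribution can be tabulated from aligned sequences is given below. The function name and the toy sequences are hypothetical, and the sketch simply counts differing sites without any correction for multiple hits.

```python
from collections import Counter
from itertools import combinations


def mismatch_distribution(sequences):
    """Map each number of pairwise differences to the number of pairs showing it."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("sequences must be aligned and of equal length")
    counts = Counter(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(sequences, 2)
    )
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical aligned haplotypes.
    print(mismatch_distribution(["AAAA", "AAAT", "AATT", "TTTT"]))
```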

Infinite-sites model

A mutation model according to which each new mutation occurs at a site that has not mutated before. This model was originally developed for protein- and DNA-sequence evolution, and is closely related to the infinite-allele mutation model.

Infinite-allele mutation model

A mutation model according to which each new mutation produces an allele that has not previously existed.

Summary statistics

In the current genetic context, these are descriptive statistics summarizing the pattern of genetic diversity, such as the level of heterozygosity or the number of alleles per locus.

D

A measure of linkage disequilibrium defined as the difference between the frequency of a two-locus haplotype and the product of the frequencies of its constituent alleles ($D_{ij} = p_{ij} - p_i p_j$).

D′

A standardized version of D that is obtained by dividing D by its maximum possible value given the allele frequencies ($D' = D / D_{\max}$).
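
The two measures defined above can be computed for a pair of biallelic loci as in the following sketch. It assumes that the haplotype frequency and the two allele frequencies are already known (that is, that the gametic phase has been resolved); the function name is hypothetical, and the sign of D is kept in the standardized value.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, D') for alleles A and B at two biallelic loci.

    p_ab is the frequency of the A-B haplotype; p_a and p_b are the
    frequencies of alleles A and B.
    """
    d = p_ab - p_a * p_b
    # The largest attainable |D| depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    # D' is undefined when a locus is monomorphic; report 0 in that case.
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, d_prime


if __name__ == "__main__":
    # Hypothetical frequencies: A and B always occur together.
    print(linkage_disequilibrium(0.5, 0.5, 0.5))
```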

Tajima's D

Statistic used in a selective neutrality test to decide whether the mean number of differences between pairs of DNA sequences is compatible with the observed number of segregating sites in a sample.
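
As a sketch of how this comparison is carried out, the code below computes the statistic from a set of aligned sequences using the standard constants of Tajima (1989). The function name and the toy data are hypothetical; the sketch ignores gaps and missing data and does not compute a significance level.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D for aligned sequences of equal length (needs n >= 4)."""
    n = len(sequences)
    if n < 4:
        raise ValueError("at least four sequences are required")
    # S: number of segregating (polymorphic) sites.
    s = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if s == 0:
        raise ValueError("no segregating sites; D is undefined")
    # pi: mean number of differences between pairs of sequences.
    pairs = list(combinations(sequences, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Constants that depend only on the sample size.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two diversity estimates, scaled by its
    # estimated standard deviation.
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    # Hypothetical sample in which every variant is a singleton.
    print(tajimas_d(["TAAA", "ATAA", "AATA", "AAAT"]))
```

An excess of rare variants, as in the toy sample, gives a negative value, whereas an excess of intermediate-frequency variants gives a positive one.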

Likelihood (of a model)

The probability of the data under a given model defined by a particular set of parameter values.

Joint posterior distribution

When a model is defined by more than one parameter, it is the posterior distribution of all possible combinations of parameter values.

Effective population size

The size of a virtual, randomly mating, stationary and isolated population that would have the same amount and type of polymorphisms as the population under study.

Finite-island model

A conceptual model for gene flow under which a finite number of demes exchange migrants with each other. The spatial location of the populations is not specified, and the constituent demes are usually assumed to have the same size and to exchange migrants at the same rate.

FST

A measure of the level of population genetic differentiation, which usually reflects the proportion of total genetic variability that is due to the net differences between populations (see F-statistics).
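
The idea of a proportion of total variability due to differences between populations can be illustrated with a deliberately simple sketch for one biallelic locus, in which differentiation is computed as (HT - HS)/HT from the allele frequency in each deme. The function name is hypothetical, demes are given equal weight, and no correction for sample size is applied, so this is not one of the estimators implemented in the reviewed programs.

```python
def fst_biallelic(freqs):
    """Differentiation at a biallelic locus from per-deme allele frequencies."""
    if not freqs:
        raise ValueError("at least one deme is required")
    p_bar = sum(freqs) / len(freqs)
    # Expected heterozygosity in the pooled population.
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # locus is monomorphic overall
    # Mean expected heterozygosity within demes.
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_t - h_s) / h_t


if __name__ == "__main__":
    # Hypothetical allele frequencies in two demes.
    print(fst_biallelic([0.2, 0.8]))
```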

Balancing selection

A form of natural selection that maintains polymorphism within populations.

AFLP

Amplified fragment length polymorphism. A method for the selective PCR amplification of anonymous, dominant DNA polymorphisms using restriction enzymes and DNA linkers.

Ascertainment bias

Systematic bias introduced by the criteria used to select individuals and/or genetic markers to be analysed (for example, choosing SNPs with heterozygosity that is higher than a given threshold).

Selective sweep

Drastic reduction of the genetic diversity along a chromosomal segment as a consequence of the fixation of an advantageous mutation by selection in that region.
