How many alleles per locus should be used to estimate genetic distances?

Kalinowski, S T

doi:10.1038/sj.hdy.6800009

Research Article
Published: 24 January 2002

Heredity volume 88, pages 62–65 (2002)

Abstract

As more microsatellite loci become available for use in genetic surveys of population structure, population geneticists are able to select loci to use in population structure surveys. This study used computer simulations to investigate how the number of alleles at loci affects the precision of estimates of four common genetic distances. This showed that equivalent results could be achieved by examining either a few loci with many alleles or many loci with a few alleles. More specifically, the total number of independent alleles appears to be a good indicator of how precise estimates of genetic distance will be.

Introduction

Analysis of selectively neutral molecular markers has become a common method for inferring the evolutionary history of populations and species. Technological advances have enabled population and conservation geneticists to describe increasingly complex and subtle genetic relationships. However, molecular data remains expensive and many population level studies are limited by the amount of data that can be collected. Efficient study design remains important in order to maximize the ability of genetic data to describe genetic relationships between populations. Increasingly often, researchers have the luxury of selecting a set of loci to use in a study from a larger number of loci that have been previously characterized. This is particularly true for microsatellite loci. One of the most common criteria used to select loci is the number of alleles present. I shall call this the allele number of a locus. Loci with more alleles are generally thought to produce more precise estimates of genetic distances than loci with few alleles, especially for closely related populations. However, loci with a large number of alleles can be difficult to score and take up more space on electrophoretic gels. This space issue is usually not a problem when only one locus is run per gel, but it becomes a critical consideration when multiple loci are run on each gel. Unfortunately, there have been few guidelines for balancing these two contradictory concerns. In this investigation, I examine the relationship between allele number and the coefficient of variation of four popular genetic distances: the D_A distance (Nei et al, 1983), the chord distance, D_C (Cavalli-Sforza and Edwards, 1967), the standard genetic distance of Nei, D_S (Nei, 1972, 1978), and the Weir and Cockerham estimator of F_ST, θ (Weir and Cockerham, 1984).

Each of these genetic distances has unique evolutionary and statistical properties (see Nei, 1987 for a review). Given a few simple assumptions, such as random mating and constant population size, the genetic distance between two halves of a population instantly split into two completely isolated new populations will initially increase linearly with time. For example, D_S between two such population fragments will be equal to

$$D_S = 2\mu t$$

where t is the number of generations the populations have been isolated, and μ is the mutation rate of the loci examined. In this case, F_ST increases approximately linearly with time, provided t is small

$$F_{ST} \approx \frac{t}{2N_e}$$

where N_e is the effective size of each population. The expectations of D_A and D_C have not been expressed as functions of isolation time or other evolutionary variables; however, they will initially increase linearly with time. All of these genetic distances are equal to zero when populations have the same allele frequencies. The maximum value of D_A, D_C, and F_ST is equal to 1.0. D_S will have a value of infinity when two populations do not share any alleles. The rate at which D_A, D_C, and D_S increase with time is proportional to mutation rate. Because loci with many alleles tend to be those with high mutation rates, each of these three genetic distances is expected to be higher at loci with many alleles than at loci with fewer alleles.
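The short-time behaviour described above can be sketched numerically. The block below assumes the standard expectations D_S = 2μt and F_ST ≈ t/(2N_e) for this isolation model, and compares the linear F_ST approximation with the usual pure-drift expression 1 − (1 − 1/(2N_e))^t. The function names and the choice to ignore mutation in the drift expression are assumptions of this sketch, not part of the original study.

```python
def expected_ds(mu: float, t: int) -> float:
    """Expected standard genetic distance after t generations (assumed form 2*mu*t)."""
    return 2.0 * mu * t


def approx_fst(t: int, n_e: int) -> float:
    """Linear short-time approximation of F_ST (assumed form t / (2*N_e))."""
    return t / (2.0 * n_e)


def drift_fst(t: int, n_e: int) -> float:
    """F_ST under pure drift with complete isolation, ignoring mutation (assumption)."""
    return 1.0 - (1.0 - 1.0 / (2.0 * n_e)) ** t


if __name__ == "__main__":
    # Parameter grid taken from the Methods: three population sizes, three divergence times.
    for n_e in (500, 5000, 50000):
        for t in (50, 500, 5000):
            print(n_e, t, approx_fst(t, n_e), drift_fst(t, n_e))
```

The two F_ST columns agree closely when t/N_e is small and diverge as t/N_e approaches 1.0 or more, which is the regime where the linear approximation is no longer appropriate.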

Methods

An analytical evaluation of the coefficient of variation for most genetic distances would be formidably complex, so I have relied on computer simulation to examine a simple model of population bifurcation (a similar simulation program available for general use is described by Excoffier et al, 2000). In this model, a randomly mating population having an effective population size of N_e diploid individuals is instantaneously split into two completely isolated populations, each also having N_e individuals. The populations are assumed to remain completely isolated until samples are collected t generations after fragmentation. Gene trees for each locus were simulated using standard coalescent techniques (see Hudson, 1990 for review). The mutation rate for each locus was obtained by selecting a number, u, from the interval [−7, −2] and using 10^u as the mutation rate for that locus. Two mutational models were used: infinite alleles and single step mutation.
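As a minimal sketch of the mutation-rate step, the block below draws an exponent u from [−7, −2] and returns 10^u. The text says only that a number was selected from the interval, so the uniform draw is an assumption here, as are the function name and the use of a seeded generator.

```python
import random


def sample_mutation_rate(rng: random.Random, low: float = -7.0, high: float = -2.0) -> float:
    """Draw u from [low, high] (uniform draw is an assumption) and return 10**u."""
    u = rng.uniform(low, high)
    return 10.0 ** u


if __name__ == "__main__":
    rng = random.Random(1)  # seeded only so that repeated runs give the same rates
    rates = [sample_mutation_rate(rng) for _ in range(5)]
    print(rates)
```

Drawing the exponent rather than the rate itself spreads loci evenly across orders of magnitude, so both slowly and rapidly mutating loci are well represented.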

The infinite alleles model of mutation assumes that each mutation is unique. The single step model of mutation assumes that each allele can be represented as an integer number of repeat units, and that each mutation either adds one to or subtracts one from that number. Mutation increasing the number of repeat units was assumed to be as likely as mutation decreasing the number of repeat units. All mutations were assumed to change the number of repeat units by a single step and no bounds were placed on the number of repeat units possible at simulated loci. Large numbers of loci were simulated and loci with 2, 3, 4, 8, 16, and 33 alleles were selected for analysis. Loci not having one of these numbers of alleles were dropped from analysis. All samples consisted of 100 diploid individuals from each population. The number of loci in the samples was varied from 2 to 32. Three effective population sizes (N_e = 500, 5000, and 50000) and three divergence times (t = 50, 500 and 5000) were examined. All combinations of these four parameters (number of loci, number of alleles, N_e, t) were examined. Four commonly used genetic distances were estimated from the data: the D_A distance (Nei et al, 1983), the chord distance, D_C (Cavalli-Sforza and Edwards, 1967), the standard genetic distance of Nei, D_S (Nei, 1978), and the Weir and Cockerham estimator of F_ST, θ (Weir and Cockerham, 1984).
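To make two of these distances concrete, the sketch below computes D_A and D_S from per-locus allele frequencies. It uses the simple frequency-based forms (D_A as one minus the mean over loci of Σ√(x_i y_i), and the Nei 1972 form of D_S), not the small-sample corrected estimator of Nei (1978) used in the study. The data layout, one dictionary of allele frequencies per locus, is an assumption made for illustration.

```python
import math


def nei_da(pop_x, pop_y):
    """D_A: one minus the per-locus mean of sum_i sqrt(x_i * y_i)."""
    total = 0.0
    for fx, fy in zip(pop_x, pop_y):
        shared = set(fx) & set(fy)  # alleles absent from either population contribute zero
        total += sum(math.sqrt(fx[a] * fy[a]) for a in shared)
    return 1.0 - total / len(pop_x)


def nei_ds(pop_x, pop_y):
    """Nei (1972) standard distance, without the 1978 small-sample correction."""
    jx = jy = jxy = 0.0
    for fx, fy in zip(pop_x, pop_y):
        jx += sum(p * p for p in fx.values())
        jy += sum(p * p for p in fy.values())
        jxy += sum(fx[a] * fy[a] for a in set(fx) & set(fy))
    if jxy == 0.0:
        return math.inf  # no shared alleles at any locus
    return -math.log(jxy / math.sqrt(jx * jy))


if __name__ == "__main__":
    x = [{"a": 0.7, "b": 0.3}, {"c": 0.5, "d": 0.5}]
    y = [{"a": 0.6, "b": 0.4}, {"c": 0.2, "d": 0.8}]
    print(nei_da(x, y), nei_ds(x, y))
```

Both functions return zero for identical frequencies; with no shared alleles D_A reaches 1.0 and D_S is infinite, matching the properties stated in the Introduction.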

The coefficients of variation for each of these genetic distances were estimated from the data by dividing the standard deviation of the estimates by the average estimate. Contour plots showing the coefficient of variation for data sets with different numbers of loci and varying numbers of alleles per locus were created with SigmaPlot 2000.
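The calculation described above is small enough to sketch directly. The text does not say whether the sample or population standard deviation was used; the sample form is assumed here, and the function name is illustrative.

```python
import statistics


def coefficient_of_variation(estimates):
    """Standard deviation of the estimates divided by their mean.

    Uses the sample standard deviation (n - 1 denominator), which is an assumption.
    """
    mean = statistics.mean(estimates)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined when the mean is zero")
    return statistics.stdev(estimates) / mean


if __name__ == "__main__":
    # Hypothetical replicate estimates of a genetic distance, for illustration only.
    replicates = [0.040, 0.055, 0.048, 0.061, 0.046]
    print(coefficient_of_variation(replicates))
```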

Results and Discussion

Simulated data showed that highly polymorphic loci provided better estimates of genetic distances than less polymorphic loci (Figure 1). This trend was evident for the entire range of parameters examined (N_e = 500, 5000, 50000, t = 50, 500, 5000, number of alleles = 2, 4, 8, 16, 33). When the amount of population divergence was small, t/N_e ⩽ 0.1, the total number of independent alleles examined appeared to be a good indicator of the coefficient of variation of estimates of genetic distances (the number of independent alleles at a locus is equal to the total number of alleles at the locus minus one, and the total number of independent alleles is the sum of the number of independent alleles at each locus). For example, 16 loci each having two independent alleles had approximately the same coefficient of variation for each of the genetic distances as two loci having 16 independent alleles each. The ratio t/N_e appeared to determine the relationship between number of alleles, number of loci, and the coefficient of variation. For example, the coefficient of variation of genetic distances for populations of 500 individuals separated for 50 generations was identical to the coefficient of variation of estimates of genetic distances between populations of 5000 individuals separated for 500 generations.
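The definition of independent alleles given in parentheses above reduces to a one-line sum. The sketch below applies it to the two designs in the example; the function and variable names are illustrative.

```python
def total_independent_alleles(alleles_per_locus):
    """Sum over loci of (number of alleles at the locus - 1)."""
    return sum(k - 1 for k in alleles_per_locus)


if __name__ == "__main__":
    many_small_loci = [3] * 16   # 16 loci, each with 3 alleles (2 independent)
    few_large_loci = [17] * 2    # 2 loci, each with 17 alleles (16 independent)
    print(total_independent_alleles(many_small_loci))
    print(total_independent_alleles(few_large_loci))
```

Both designs total 32 independent alleles, which is why they are expected to give similar precision when t/N_e is small.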

These results are in good agreement with an analysis of the Sanghvi (1953) genetic distance made by Foulley and Hill (1999). The Sanghvi distance is not used often for describing population structure, but it has tractable mathematical properties and has been shown to estimate phylogenies effectively (Takezaki and Nei, 1996). Foulley and Hill showed analytically that the coefficient of variation of the Sanghvi distance is approximately proportional to the sum of the number of independent alleles at each locus in the sample.

When divergence time was substantial, ie t/N_e was 1.0 or greater, the relationship between the coefficient of variation of genetic distances and the number of independent alleles observed at small to moderate divergence times broke down, especially for highly polymorphic loci. This is not especially problematic, for the utility of these genetic distances to quantify genetic differences between highly differentiated populations is limited. Keep in mind that both genetic drift and mutation lead to differentiation. Both D_A and D_C approach their maximum value of 1.0 when mutation rate and divergence time are high. This results in these statistics having a low coefficient of variation when divergence time and polymorphism are high (Figure 1), but decreases their ability to describe the length of population separation. D_S loses its utility when polymorphism is high, divergence time is long, and few loci are scored. In this case, samples from each population often share no alleles and D_S is undefined. For example, about half of the loci having 33 alleles had no alleles in common in populations of 5000 individuals after 5000 generations. Lastly, θ has two undesirable properties when divergence time and polymorphism are high. It asymptotically approaches a maximum value, and this value is inversely proportional to the amount of polymorphism present in the populations (eg, Hedrick, 1999).

Of the four distances examined, D_A and D_C exhibited the strongest equivalence of alleles within and between loci (Figure 1). D_S and θ did not fit this trend as closely. For both of these distance measures, adding more loci decreased the coefficient of variation faster than increasing the number of alleles per locus. For example, eight loci with two independent alleles produced better estimates of D_S than two loci with eight independent alleles.

Mutation mechanism did not appear to have a strong effect on the coefficient of variation when divergence time was short. When the length of population isolation was short, t/N_e ⩽ 0.01, the coefficient of variation of estimates of genetic distances appeared to be a function of the total number of independent alleles examined for both types of mutation mechanisms. When the length of isolation was longer, t/N_e = 0.1, this relationship was approximately true, but broke down to some extent at loci with stepwise mutation and high mutation rates. When the length of population isolation was quite long, t/N_e ⩾ 1.0, the contrast between mutation mechanisms was greatest (Figure 1). However, for both mutation mechanisms, the equivalence of alleles observed at low divergence times broke down at loci with high mutation rates (ie, those with greater than eight alleles) but not at loci with lower mutation rates.

Increasing allele number was associated with decreased coefficients of variation for each of the four genetic distances studied. The standard error of these statistics, however, behaved differently. The standard error of D_S, D_A, and D_C increased as allele number increased (Figure 2). The coefficient of variation of these statistics decreased only because the average value of these genetic distances increased faster than the standard error (Figure 2). The expected value of θ is relatively insensitive to how much variation is present at loci and its standard error decreased as allele number increased.

The equivalent utility of alleles within and across loci for estimating genetic distances described here is significant because it demonstrates that study design for estimating genetic distances is flexible as long as the amount of divergence is not great. There is no requirement to examine either highly polymorphic loci or large numbers of loci. The only requirement is that a sufficient number of alleles be examined.

References

Cavalli-Sforza, LL, Edwards, AWF (1967). Phylogenetic analysis: models and estimation procedures. Evolution, 21: 550–570.
Excoffier, L, Novembre, J, Schneider, S (2000). SIMCOAL: a general coalescent program for the simulation of molecular data in interconnected populations with arbitrary demography. J Hered, 91: 506–508.
Foulley, J, Hill, WG (1999). On the precision of estimation of genetic distance. Genet Selection Evol, 31: 457–464.
Hedrick, PW (1999). Perspective: highly variable loci and their interpretation in evolution and conservation. Evolution, 53: 313–318.
Hudson, RR (1990). Gene genealogies and the coalescent process. Oxford Surveys Evolution Biol, 7: 1–44.
Nei, M (1972). Genetic distance between populations. Am Naturalist, 106: 283–292.
Nei, M (1978). Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics, 89: 583–590.
Nei, M (1987). Molecular Evolutionary Genetics. Columbia University Press: New York.
Nei, M, Tajima, F, Tateno, Y (1983). Accuracy of estimated phylogenetic trees from molecular data. J Molec Evol, 19: 153–170.
Sanghvi, LD (1953). Comparison of genetical and morphological methods for a study of biological differences. Am J Phys Anthropol, 11: 385–404.
Takezaki, N, Nei, M (1996). Genetic distances and reconstruction of phylogenetic trees from microsatellite DNA. Genetics, 144: 389–399.
Weir, BS, Cockerham, CC (1984). Estimating F-Statistics for the analysis of population structure. Evolution, 38: 1358–1370.

Acknowledgements

I thank Mike Ford, Phil Hedrick, Paul Moran, Robin Waples, and an anonymous reviewer for comments that improved this manuscript.

Author information

Authors and Affiliations

Conservation Biology Division, Northwest Fisheries Science Center, National Marine Fisheries Service, N.O.A.A., Seattle, 98112, WA, USA
S T Kalinowski

Corresponding author

Correspondence to S T Kalinowski.

About this article

Cite this article

Kalinowski, S. How many alleles per locus should be used to estimate genetic distances?. Heredity 88, 62–65 (2002). https://doi.org/10.1038/sj.hdy.6800009

Received: 03 April 2001
Accepted: 06 August 2001
Published: 24 January 2002
Issue Date: 01 January 2002
DOI: https://doi.org/10.1038/sj.hdy.6800009
