Introduction

Hybridization between genetically distinct taxa is a complex ecological process that has important implications for a diverse array of population genetic questions. For example, medical geneticists study admixed human populations to identify and map genes causing diseases (for example, Mao et al., 2007; Cheng et al., 2010; Winkler et al., 2010). Ecological geneticists study admixed populations to understand how outbreeding affects survival and reproduction (for example, Hogg et al., 2006; Johnson et al., 2010; Vander Wal et al., 2012). Conservation geneticists study admixed populations to manage the spread of introgressive hybridization (for example, Rhymer and Simberloff, 1996; Hedrick, 2009; Muhlfeld et al., 2009). Agricultural geneticists and environmental activists monitor wild populations to detect genes from genetically modified organisms (for example, Watrud et al., 2004; Piñeyro-Nelson et al., 2009; Zapiola and Mallory-Smith, 2012). And finally, evolutionary biologists study hybrid zones to better understand how selection, gene flow and mate choice shape the genetic structure of natural populations (for example, Barton and Hewitt, 1985).

Hybridization often begins when non-native individuals enter a population and mate with individuals from a native taxon. At this point, the population may have a ‘bimodal’ distribution of genotypes (Harrison and Bogdanowicz, 1997). Such a population might consist of mostly genetically ‘pure’ individuals of both taxa, and, potentially, a few hybrid individuals. For some taxa, this will be as far as hybridization proceeds (for example, Steeves et al., 2010). However, if hybrids are fertile and pre-zygotic isolating mechanisms are weak, hybridization may continue until all individuals in the population are hybrids (Rhymer et al., 1994). At this point, ecologists sometimes call the population a ‘hybrid swarm’ (for example, Allendorf et al., 2001). If interbreeding continues, the distribution of non-native genes among individuals in the population will become ‘unimodal’ and will eventually approach the point at which all individuals have the same amount of non-native genes, and the mixing can be viewed as complete.

Genetic data are often used to quantify the amount of non-native genes in admixed populations and the degree to which these genes have become mixed in the population. The analysis of such data is relatively straightforward when there are fixed genetic differences between the taxa (for example, Rhymer et al., 1994), and sophisticated methods are available to study hybridization when fixed genetic differences between taxa are not present (for example, Pritchard et al., 2000; Anderson and Thompson, 2002; Hey, 2010). Therefore, it is usually relatively straightforward to use genetic data to estimate the ancestry of each individual in a hybrid population.

One challenge to interpreting such data is the potentially complex distribution of different levels of hybridity among individuals in the population. As discussed above, the amount of non-native genes present in individuals sometimes varies widely among the individuals in a population. Quantifying this variability is relevant to many analyses, but there is no widely accepted way to do this. A common practice is to report the total proportion, P, of non-native genes in a population and present a graph showing the distribution of non-native genes among all individuals (for example, Pertoldi et al., 2010). This approach conveys a lot of information about the distribution of non-native genes among individuals in a population, but makes it difficult to compare the degree of genetic mixing among different populations.

Vernesi et al. (2003) developed a parameter, which they called the ‘true hybridization index’ (THI), that solves this problem by quantifying how well the genes of multiple taxa are mixed in a population. THI has a range of 0–1, with 0 indicating that no mixing has occurred (all individuals are genetically pure), and 1 indicating that the population is thoroughly mixed (all individuals have the same amount of ancestry from each contributing taxon).

The purpose of this investigation is to extend Vernesi et al.’s (2003) work in several ways. First, we show how their genetic mixing parameter (which we call the ‘degree of genetic mixing’) can be derived in a simple, biologically meaningful way. This new definition allows us to show how the amount of genetic mixing in a population is related to the amount of gametic disequilibrium present in a population, and to show how the parameter increases in a randomly mating population. We also present a nearly unbiased formula for estimating this parameter when diagnostic loci are available and discuss how best to estimate the parameter when diagnostic loci are not available.

Methods and Results

A genetic mixing parameter, md

We seek to derive a parameter that quantifies the degree of genetic mixing among taxa in a population and has a range of 0 to 1. (We use ‘parameter’ to refer to a numeric characteristic of an entire population and ‘statistic’ to refer to a quantity calculated from a sample; Everitt and Skrondal, 2010.) We will derive such a parameter here and compare it to the parameter THI of Vernesi et al. (2003) in the Discussion.

A parameter quantifying the degree of genetic mixing among taxa can be derived as follows. As above, let P represent the overall proportion of non-native genes in a population. Furthermore, let Pi represent the proportion of non-native genes in the genome of the ith individual. Finally, let N represent the total number of individuals in the population. With this notation, $P = \frac{1}{N}\sum_{i=1}^{N} P_i$. The variance of Pi, Var(Pi), serves as a useful measure of how well mixed native and non-native genes are in the population. If the two taxa have interbred for a long time, all individuals in the population will have similar values of Pi, and Var(Pi) will be low. Var(Pi) will take a minimum value of zero when all individuals in the population have exactly the same amount of non-native ancestry. On the other hand, if there has not been extensive interbreeding between the taxa present in the population, Var(Pi) will be high. Var(Pi) will take a maximum value when all individuals in the population are genetically pure members of either taxon (Pi for every individual is either 0 or 1). The variance of Pi for this case is P(1−P), the variance of a Bernoulli random variable. Given this, we propose a measure, which we will call md, of how well mixed non-native genes are in a population:

$$m_d = 1 - \frac{\mathrm{Var}(P_i)}{P(1-P)} \qquad (1)$$

This parameter is equivalent to THI as defined by Vernesi et al. (2003) (see below), but we call it the ‘degree of genetic mixing’ in a population because we do not believe it is appropriate to call any parameter describing the amount of hybridization in a population the ‘true hybridization index’. There are many reasonable ways of quantifying hybridization in a population, and, therefore, no single ‘true’ index. This parameter has a minimum value of 0, which occurs when a population consists of individuals from two species that have not yet interbred, that is, Var(Pi)=P(1−P). This parameter has a maximum value of 1.0, which occurs when all the individuals in the population have the same amount of non-native ancestry. Note that md is undefined if P equals 0 or 1. This is appropriate, as it does not make sense to quantify how genetically well mixed a population is when there are genes from only one taxon in the population.
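As an illustration of the two-taxon definition, the following minimal Python sketch computes md from a list of individual ancestry proportions, taking md to be one minus the ratio of Var(Pi) to P(1−P). The function name `mixing_degree` and the example populations are hypothetical and not part of the original analysis; the variance is the population variance (divisor N), which treats the listed individuals as the entire population.

```python
from statistics import fmean, pvariance


def mixing_degree(p_individuals):
    """Return m_d = 1 - Var(P_i) / (P * (1 - P)) for a whole population.

    p_individuals: proportion of non-native genes in each individual (0 to 1).
    """
    p_bar = fmean(p_individuals)  # overall proportion of non-native genes, P
    if p_bar <= 0.0 or p_bar >= 1.0:
        # Only one taxon is represented, so mixing cannot be quantified.
        raise ValueError("m_d is undefined when P equals 0 or 1")
    var_p = pvariance(p_individuals, mu=p_bar)  # population variance (divisor N)
    return 1.0 - var_p / (p_bar * (1.0 - p_bar))


# Hypothetical populations, each with P = 0.2
unmixed = [0.0] * 8 + [1.0] * 2   # only pure individuals: Var(P_i) = P(1 - P)
fully_mixed = [0.2] * 10          # every individual has the same ancestry

print(mixing_degree(unmixed))      # close to 0
print(mixing_degree(fully_mixed))  # close to 1
```

The explicit error for P equal to 0 or 1 mirrors the statement that md is undefined when genes from only one taxon are present.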

Our parameter can also be defined for hybrid populations having more than two taxa. Let Pj represent the proportion of the genes in the population that belong to the jth taxon and let Pij represent the proportion of genes in the ith individual that are from the jth taxon. With this notation, md is equal to

Genetic mixing and gametic disequilibrium

It is well known that recently hybridized populations have high amounts of gametic disequilibrium (even at unlinked loci), and that this disequilibrium declines with time. This suggests that there may be a relationship between md and the amount of gametic disequilibrium in a population. This, indeed, is the case. Let D represent the parametric amount of gametic disequilibrium present at a pair of loci in a hybrid population, and let D̄ represent the average value of D across all the pairs of loci in the population. Barton and Gale (1993; Equation 2b) have shown that for populations in Hardy–Weinberg equilibrium

If we divide both sides of Equation 3 by P(1−P), we obtain

where D′ is a popular standardized measure of gametic disequilibrium that has a maximum value of 1.0 (Lewontin, 1964; Hedrick, 2011). Combining Equations 1 and 4 shows the relationship between md and D′

This is a very useful result for two reasons. First, it relates the degree of mixing in a population to the amount of gametic disequilibrium present in the population, a quantity frequently estimated in genetic samples. Second, it allows us to make quantitative statements about how quickly genes in a population will mix when there is random mating.

Genetic mixing in randomly mating populations

Interpreting empirical estimates of md would be easier if we knew how quickly it could increase in simple evolutionary scenarios, for example, if there was random mating in a population. The relationship between md and D′ makes it easy to make some simple statements about the behavior of md in cases like this. For example, it is well known that D and D′ for unlinked loci decrease by a factor of 0.5 each generation in a randomly mating population (for example, Hedrick, 2011). This tells us that if we use unlinked loci to estimate the ancestry of individuals, 1−md will decrease by a factor of 0.5 every generation of random mating. Therefore, if a population begins with genetically pure individuals from two taxa (md=0), and the individuals in the population mate randomly for t generations, md at generation t, md(t), will equal

$$m_d(t) = 1 - \left(\tfrac{1}{2}\right)^{t} \qquad (6)$$

For the first three generations of random mating, md will equal 1/2, 3/4 and 7/8 (Figure 1). After five generations of random mating, md will be ~0.97, and mixing will be nearly complete. This relationship assumes that the population begins with genetically pure individuals from two taxa, but is independent of the proportion of non-native individuals that enter the population.
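The values quoted above follow from halving 1−md each generation, starting from md=0. The short Python sketch below tabulates them; the function names are hypothetical, and the calculation assumes unlinked loci and random mating, as in the text.

```python
def mixing_after(t):
    """m_d after t generations of random mating, starting from m_d = 0."""
    return 1.0 - 0.5 ** t


def generations_to_reach(threshold):
    """Smallest t with m_d(t) >= threshold (threshold must be below 1)."""
    t = 0
    while mixing_after(t) < threshold:
        t += 1
    return t


for t in range(6):
    print(t, mixing_after(t))
# 0 0.0
# 1 0.5
# 2 0.75
# 3 0.875
# 4 0.9375
# 5 0.96875
```

Under the same assumptions, `generations_to_reach(0.99)` returns 7, because md(6) is about 0.984 and md(7) is about 0.992.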

Figure 1

The distribution of hybridity coefficients (Pi) among individuals in a randomly mating population over five generations (t=0–5). Individuals with Pi equal to 0 or 1 are genetically pure members of alternative taxa.

Estimation

We will discuss estimation of md in two contexts: cases in which diagnostic loci are available and cases in which diagnostic loci are not available.

Estimation using diagnostic loci

Estimating md using diagnostic loci is facilitated by noting that the term Var(Pi)/P(1−P) in Equation 1 is mathematically equivalent to Wright’s FST (Wright, 1951), or more specifically, FST for one locus with two alleles. The only difference between md and FST is that Pi in Equation 1 refers to the frequency of alleles in an individual whereas Pi in Wright’s (1951) definition of FST refers to the frequency of alleles in a population. The similarity is not coincidental. FST quantifies how allele frequencies vary among populations; our parameter, md, quantifies how allele frequencies vary among individuals. The mathematical equivalence between md and FST allows us to use the well-developed literature on FST to estimate md (see below).

If diagnostic loci are available to unambiguously discriminate between native and non-native alleles, standard methods for estimating FST can be used to estimate md. For example, Weir and Cockerham’s θ̂ (1984) can be used to produce an estimate of md, m̂d:

$$\hat{m}_d = 1 - \hat{\theta} \qquad (7)$$

Weir and co-workers have presented a few alternative methods for calculating θ̂ (Weir and Cockerham, 1984; Weir and Hill, 2002; Weir, 2010). The method most appropriate for the application here is the estimator designed to compare gene pools in randomly mating populations (Weir and Cockerham, 1984, unlabeled equation at the top of page 1363; Weir and Hill, 2002, Equation 5). Using our notation, this estimator is calculated as

where Ns is the number of individuals sampled from a hybrid population, ni is the number of amplified alleles in individual i, P̂i is the estimated proportion of non-native alleles in the genome of individual i, and n̄ is the average number of alleles amplified in each genotyped individual

These equations, as noted above, are only applicable for loci having diagnostic alleles.

θ̂ can also be calculated using software designed for estimating FST (for example, GENEPOP; Rousset, 2008). Doing this requires reorganizing the data so that each individual in the sample is represented as a population and the diagnostic alleles from all loci are pooled into a single locus. The Appendix shows how this can be done to create a GENEPOP file (Rousset, 2008).
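The reorganization described above can be sketched in Python as follows. This is only one possible layout and is not taken from the Appendix: it assumes that each sampled allele copy is written as a haploid record with a two-digit code under a single locus, and that each real individual becomes one `Pop` block. The function name, the locus name `pooled_locus`, and the allele codes (01 for native, 02 for non-native) are hypothetical; the resulting file should be checked against the GENEPOP documentation and the Appendix before use.

```python
def genepop_from_diagnostic(allele_calls, title="Individuals coded as populations"):
    """Build GENEPOP-style text in which every individual is its own 'Pop'.

    allele_calls: dict mapping individual ID -> list of 0/1 calls, one per
    amplified allele copy pooled over all diagnostic loci (1 = non-native).
    """
    lines = [title, "pooled_locus"]  # title line, then the single pooled locus
    for ind_id, calls in allele_calls.items():
        lines.append("Pop")  # start a new "population" for this individual
        for k, call in enumerate(calls, start=1):
            code = "02" if call else "01"  # hypothetical allele coding
            lines.append(f"{ind_id}_{k} , {code}")  # one haploid record per allele copy
    return "\n".join(lines) + "\n"


# Hypothetical sample: two individuals, two diagnostic loci (four allele copies each)
example = {"fish1": [0, 0, 0, 1], "fish2": [1, 1, 0, 1]}
print(genepop_from_diagnostic(example))
```

Writing one record per allele copy keeps the number of amplified alleles per individual (ni) explicit, so individuals with missing genotypes simply contribute fewer records.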

Weir and Cockerham’s (1984) θ̂ is essentially unbiased when used to estimate FST. However, θ̂ may not be as unbiased when used to estimate md. Weir and Cockerham’s formula for θ̂ (Equation 8 in our paper) assumes alleles sampled from an individual are random and independent draws from a gene pool (which, in our application, is an individual’s genome). With this assumption, the estimated proportion of non-native ancestry in the ith individual, P̂i, will be binomially distributed. This will not be true when there is gametic disequilibrium in a population (for example, in the early generations of hybridization). When there is gametic disequilibrium in a population, the alleles present in diploid genotypes are not always going to be independent. This is easily seen by considering an F1 individual (that is, an individual having one parent from each hybridizing taxon). In an F1 individual, every locus will be heterozygous and Pi will equal 0.5. Because each locus is heterozygous, P̂i will also equal 0.5, no matter how few or many loci are genotyped. In other words, when an F1 individual is genotyped, there will be no sampling error in P̂i. Equation 8 assumes binomial sampling error, so terms in Equation 8 intended to eliminate sampling bias will not work as intended. This could result in a negative estimate of Var(Pi)/P(1−P), which would produce an estimate of md that is greater than 1.0. This is equivalent to an estimate of FST being less than zero.
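The contrast between an F1 individual and the binomial assumption can be illustrated with a small Python sketch. It is not the simulation used in this paper: the function names, the choice of eight diagnostic loci, and the random seed are hypothetical. The binomial case draws each of the 2L allele copies independently with probability 0.5, which is what the sampling correction assumes; the F1 case carries exactly one non-native copy at every locus.

```python
import random
from statistics import pvariance


def estimate_f1(n_loci):
    """P-hat for an F1 at diagnostic loci: every locus is heterozygous."""
    non_native_copies = n_loci  # exactly one non-native copy per locus
    return non_native_copies / (2 * n_loci)


def estimate_binomial(p_true, n_loci, rng):
    """P-hat when the 2 * n_loci allele copies are independent draws."""
    copies = 2 * n_loci
    hits = sum(rng.random() < p_true for _ in range(copies))
    return hits / copies


rng = random.Random(1)  # arbitrary seed, for repeatability only
f1_estimates = [estimate_f1(8) for _ in range(1000)]
binom_estimates = [estimate_binomial(0.5, 8, rng) for _ in range(1000)]

print(pvariance(f1_estimates))     # zero: no sampling error for an F1
print(pvariance(binom_estimates))  # near 0.5 * 0.5 / 16 under binomial sampling
```

A correction that subtracts the binomial sampling variance from a sample of F1 individuals therefore removes variance that is not there, which is how an estimate of Var(Pi)/P(1−P) can become negative.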

We used computer simulations to estimate how much bias there was in the estimates of md calculated from Equations 7 and 8 for realistic amounts of data. We simulated populations of 50, 200 and 2000 randomly mating individuals that were founded with 20% non-native individuals. We modeled these populations after cutthroat trout (Oncorhynchus clarkii), a species that frequently hybridizes with non-native rainbow trout (O. mykiss), and assumed the ratio of the genetic effective population size, Ne, to census size, N, was 0.23 (Finger et al., 2011). We achieved this ratio of Ne to N by varying the reproductive success according to the method of Anderson (2001). For each individual, 52 chromosomal arms were simulated with 10 equally spaced, species-specific diagnostic di-allelic loci. Recombination was allowed to occur once per chromosomal arm per generation (Danzmann et al., 2005). In each generation we calculated the true degree of mixing, md, for each population. Then we drew 1000 simulated samples of 10, 20, 50 or 100 individuals with genotypes at 8, 16, 48 or 96 loci. The bias in estimates of md was calculated by comparing the average estimate of md with the parametric value for the populations, that is, bias=average(m̂d)−md.

The simulations were performed using a program written by us in R (R Development Core Team, 2008).

The computer simulations showed that there was only a modest amount of bias in the first few generations of mating, and that this bias became negligible by the fifth generation (Table 1). In all cases, estimates of md tended to be slightly higher than the parametric value (that is, the bias was positive). As expected, the amount of bias was greatest in small samples, and for populations in the early stages of hybridization. The smallest samples we examined had 10 individuals genotyped at eight diagnostic loci. Even with these small samples, the bias observed in the second generation (0.0330) was only 4.4% of the parametric value (0.75). When the number of diagnostic loci was increased to 16, the bias was reduced to just over 2%. The bias present in the estimates of md for populations that had five generations of random mating (md=0.97) was much lower. Even for the smallest samples, it was only 0.6% of the parametric value.
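The bias and relative bias reported here are simple to compute from a set of replicate estimates. The Python sketch below uses hypothetical function names, and the three replicate values passed to `bias` are made up for illustration; only the figures 0.0330 and 0.75 come from the text.

```python
from statistics import fmean


def bias(estimates, parameter):
    """bias = average of the estimates minus the parametric value."""
    return fmean(estimates) - parameter


def relative_bias(bias_value, parameter):
    """Bias expressed as a fraction of the parametric value."""
    return bias_value / parameter


# Made-up replicate estimates of m_d for a population with m_d = 0.75
print(bias([0.76, 0.78, 0.80], 0.75))

# Second-generation case from the text: bias 0.0330, parametric value 0.75
print(f"{relative_bias(0.0330, 0.75):.1%}")  # 4.4%
```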

Table 1 Statistical bias of m̂d calculated from Equations 7 and 8, estimated from computer simulations for samples of varying sizes (NLoci, NIndividuals) from a population of 200 individuals after t generations of random mating (see text for details)

Estimation using non-diagnostic loci

When diagnostic loci are not available, md can be estimated by estimating the ancestry of each individual in a sample, and then inserting these estimates into Equation 1 or 2 (depending on the number of taxa involved). There are a variety of methods for estimating the ancestry (Pi) of hybrid individuals using loci that do not have diagnostic alleles (for example, Pritchard et al., 2000; Anderson and Thompson, 2002). The computer programs STRUCTURE and NEWHYBRIDS have been popular for this type of analysis. The accuracy of estimates of md obtained from Pi values calculated by these programs will depend on how well the Pi values are estimated. md is calculated from the variance of Pi, so if there is a bias in this variance, there will be a bias in estimates of md. Little is known about the error structure of STRUCTURE or similar analytic approaches, but it is likely that these errors will be affected by the amount of genetic differentiation among the taxa being studied, the number of loci genotyped and the amount of genetic variation at those loci.

We performed a series of computer simulations to explore how estimation error in Pi values affected estimates of md. A systematic investigation of the error structure present in STRUCTURE or NEWHYBRIDS is beyond the scope of this investigation, so we examined how three plausible, generic models of estimation error affected estimates of md. In all simulations, we assumed there were two hybridizing taxa. As above, we simulated populations of 50, 200 or 2000 individuals and kept track of the parametric values of Pi for each individual in the population. Estimates of Pi were obtained in the simulation by assuming one of the three statistical models of estimation error. The first model assumed that the ancestry of one of the taxa was always underestimated by a constant amount. The second model assumed that estimates of the ancestry of two taxa were biased towards 0.5 by a constant proportion. The third model assumed that estimation error was normally distributed. We tested different magnitudes of estimation error for each model and ran 10 000 simulations for each case to quantify the bias in estimates of md.
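One way to express the three error models is sketched below in Python. The exact functional forms used in the simulations are not given in the text, so everything here is an assumption made for illustration: the function names, truncating estimates to the interval from 0 to 1, the error magnitudes, and the small example population. The sketch takes md to be one minus the ratio of Var(Pi) to P(1−P).

```python
import random
from statistics import fmean, pvariance


def mixing_degree(p):
    """m_d = 1 - Var(P_i) / (P * (1 - P)), using the population variance."""
    p_bar = fmean(p)
    return 1.0 - pvariance(p, mu=p_bar) / (p_bar * (1.0 - p_bar))


def constant_underestimate(p, c):
    """Model 1 (assumed form): non-native ancestry is low by c, floored at 0."""
    return [max(0.0, x - c) for x in p]


def shrink_toward_half(p, b):
    """Model 2 (assumed form): estimates move a proportion b of the way to 0.5."""
    return [x + b * (0.5 - x) for x in p]


def normal_error(p, sd, rng):
    """Model 3 (assumed form): additive normal error, truncated to [0, 1]."""
    return [min(1.0, max(0.0, x + rng.gauss(0.0, sd))) for x in p]


true_p = [0.0, 0.25, 0.5, 0.5, 0.75, 1.0]  # hypothetical parametric P_i values
rng = random.Random(0)                     # arbitrary seed
for label, est in [
    ("constant", constant_underestimate(true_p, 0.1)),
    ("shrink", shrink_toward_half(true_p, 0.2)),
    ("normal", normal_error(true_p, 0.1, rng)),
]:
    # Error in m_d caused by the error model, for this one population
    print(label, mixing_degree(est) - mixing_degree(true_p))
```

In this example, shrinking estimates towards 0.5 reduces the variance of the estimated Pi values without changing their mean, so the estimate of md is inflated.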

The simulations showed that the bias in estimates of md was a function of how well the ancestry of each individual, Pi, was estimated (results not shown). Good estimates of Pi produced good estimates of md. There did not appear to be any thresholds affecting how estimation error in Pi was propagated to estimates of md.

Discussion

We have defined and derived a parameter, md, that quantifies the amount of genetic mixing of native and non-native genes in a hybrid population. The parameter that we developed is related to both FST and the average amount of pairwise gametic disequilibrium in a population. The value of this parameter will rapidly approach 1.0 in randomly mating populations; after five generations of random mating, md will be ~0.97. Computer simulations showed that when diagnostic loci are available, one of Weir and Cockerham’s (1984) estimators of FST estimated md with little bias. If diagnostic loci are not available, md can be estimated by inserting the estimated ancestry of each individual into Equation 1 or 2.

The genetic mixing parameter described here may be thought of as an alternative version of the ‘true hybridization index’ (THI) parameter of Vernesi et al. (2003). Both parameters have a range of 0 to 1.0 and both parameters quantify how well genes are mixed in a population. In fact, md and THI are mathematically equivalent. The main difference between md and THI is how the parameters are defined. md is defined from the variance of the amount of non-native genes across individuals in a population. As noted above, this variance will be zero when the population is genetically well mixed and all individuals have the same amount of non-native ancestry. THI is defined as the average variance of genetic ancestry within individuals. This is explained as follows. The first step in calculating THI is calculating the variance of ancestry, Vi, within each individual in the sample. If d taxa are hybridizing, and we use our notation, the variance of non-native genes in the ith individual is . Notice that this variance measures how close the native and non-native ancestries of an individual are to 1/d. This quantity, somewhat surprisingly, is related to the amount of mixing in a population. If all the individuals in a population are genetically pure, the proportion of native and non-native genes in the ith individual, Pij, will be 0 or 1, and the values of Vi for every individual will be relatively high. On the other hand, if all individuals in a population have the same amount of non-native genes, the values of Pij will be closer to 1/d and the values of Vi will be smaller. This is the principle THI uses to quantify the amount of mixing in a population. More specifically, THI is calculated by taking the average of Vi across individuals and then standardizing it to account for the minimum and maximum values of this average (given the amount of non-native genes in the population). This calculation produces a quantity with the same value as md.
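The equivalence can be checked numerically for two taxa with the Python sketch below. The expression for Vi is not reproduced here, so the sketch assumes Vi is the mean squared deviation of an individual’s ancestries from 1/d, and that the standardization is (maximum − average)/(maximum − minimum), with the maximum taken over populations of pure individuals and the minimum over populations of identical individuals having the same P. These forms, the function names, and the example values are assumptions made for illustration and may differ from those of Vernesi et al. (2003).

```python
from statistics import fmean, pvariance


def mixing_degree(p):
    """m_d = 1 - Var(P_i) / (P * (1 - P)), using the population variance."""
    p_bar = fmean(p)
    return 1.0 - pvariance(p, mu=p_bar) / (p_bar * (1.0 - p_bar))


def within_individual_variance(x, d=2):
    """Assumed V_i for two taxa: mean squared deviation of (x, 1 - x) from 1/d."""
    return ((x - 1 / d) ** 2 + ((1 - x) - 1 / d) ** 2) / d


def thi_two_taxa(p):
    """Standardized average of V_i (assumed form of the THI calculation)."""
    p_bar = fmean(p)
    v_bar = fmean(within_individual_variance(x) for x in p)
    # Largest average: only pure individuals, in proportions P and 1 - P
    v_max = p_bar * within_individual_variance(1.0) + (1 - p_bar) * within_individual_variance(0.0)
    # Smallest average: every individual has ancestry exactly P
    v_min = within_individual_variance(p_bar)
    return (v_max - v_bar) / (v_max - v_min)


p = [0.0, 0.1, 0.3, 0.3, 0.8, 1.0]  # hypothetical ancestry proportions
print(thi_two_taxa(p), mixing_degree(p))  # the two values agree
```

Under these assumptions the average of Vi equals Var(Pi) plus a term that depends only on P, so the standardization cancels that term and leaves 1 − Var(Pi)/[P(1−P)].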
md has the advantage of a simple, easy-to-interpret definition. This is important because it allowed us to relate md to the amount of gametic disequilibrium in a population, identify how md changed in randomly mating populations, and estimate md using formulae for FST.

In randomly mating populations, md will rapidly approach 1.0 at a rate specified by Equation 6. There are, however, many plausible reasons why natural populations will behave differently. For example, if mating between the two taxa is not random, genetic mixing may proceed more slowly. Alternatively, if hybrid individuals have a lower evolutionary fitness, mixing will be slower. Mixing will also appear to be slower if non-native individuals continuously enter a population. And, finally, the rate of mixing will be affected by the genotypes of the first individuals to enter a population. If these individuals have mixed ancestry, md will be higher than if the first immigrants were genetically pure non-natives.

A few comments regarding the definition of a ‘hybrid swarm’ may be useful, as several definitions are present in the literature. Rhymer and Simberloff (1996) defined a hybrid swarm as a population containing individuals with various degrees of non-native ancestry. Allendorf et al. (2001) used a slightly different definition; they specified that all individuals in a hybrid swarm must be hybrids. Finally, Allendorf and Leary (1988) provided the strictest definition; they defined a hybrid swarm as a population in which all individuals in the population have the same amount of non-native ancestry (that is, md=1). All of these definitions may be useful, but in some circumstances it may be more useful to quantify how well mixed non-native genes are in a population rather than to classify a population as being a hybrid swarm or not. This is especially likely to be true as the number of loci used to study hybridization increases. Genomic data are expected to have very high power to detect slight amounts of variation in ancestry. Therefore, when genomic data are available, it may not be useful to define a hybrid swarm as a population in which all individuals have the same amount of non-native ancestry.

We conclude this paper with a discussion of what it means for genes in a hybrid population to be ‘mixed.’ We have assumed throughout this paper that native and non-native genes are well mixed when all the individuals in the population have the same amount of non-native ancestry. This criterion should be useful for studying populations that are in the early stages of hybridization or have stable distributions of hybrid genotypes. Other mixing criteria and parameters may be more informative for examining populations that have been hybridizing for a long time. In particular, if populations have been hybridizing for a long time, it may be more useful to quantify how thoroughly recombination has broken up and shuffled the genomes of native and non-native taxa into small chromosomal segments (for example, vonHoldt et al., 2011 and references within). When genomic data are available, such an approach can be used to date the timing of admixture (Tang et al., 2006). If such data are not available, and a population is in the early stages of admixture, the mixing degree parameter presented in this paper should be informative.

Data archiving

There were no data to deposit.