Introduction

Understanding contemporary patterns of dispersal is of crucial importance in evolution, ecology and conservation biology. Since the advent of diverse polymorphic markers (for example, microsatellites), assignment methods (Paetkau et al., 1995; Rannala and Mountain, 1997; Cornuet et al., 1999; Paetkau et al., 2003) have allowed for rapid estimation of dispersal that would otherwise be difficult and time-consuming to obtain through direct observations (Berry et al., 2004). Population assignment is one such tool that aims to identify the source population for specific individuals or assign them to multiple populations in the case of recent admixture. These methods have become important in forensics for identifying the provenance of material of unknown origin, determining the frequency of hybridization between species and the degree of connectivity among recently fragmented populations (for example, Cain et al., 2000; Manel et al., 2003; Paetkau et al., 2003; Berry et al., 2004). Although several methods for population assignment exist for diploid organisms, there are currently limited options for polyploids (but see Meirmans and Van Tienderen, 2004; Falush et al., 2007). Polyploidy is a widespread phenomenon of major ecological and evolutionary importance in plants and animals (Otto and Whitton, 2000; Mable, 2004; Soltis et al., 2004; Wood et al., 2009). However, few population genetic studies of dispersal in polyploids have been conducted owing, in part, to a lack of methods that appropriately account for the complexities of polyploid data.

Although population assignment for diploids is relatively straightforward, several unique features of polyploids continue to provide significant challenges for implementing these techniques in natural populations. These partly depend on the presumed origin of whole-genome duplication (Ramsey and Schemske, 1998). Polyploids are commonly categorized broadly as either allopolyploid (derived from interspecific hybridization) or autopolyploid (derived from chromosomal doubling of the same genome). In allopolyploids, bivalents are mostly formed between pairs of homologous chromosomes (for example, A1/A2, B1/B2), resulting in disomic inheritance similar to that of diploids (Ronfort et al., 1998). In contrast, segregation patterns in autopolyploids are considerably more complex because chromosomes either pair at random or form multivalents during meiosis. Polysomic inheritance in autopolyploids can result in two alternative segregation patterns. First, random chromosome segregation (RCeS), where gametes arise from any random assortment of homologous chromosomes but sister chromatids always end up in different gametes. Alternatively, maximum equational segregation and random chromatid segregation (RCdS) may occur where sister chromatids behave independently and distribute into the same gamete, a process that can result in double reduction (Bever and Felber, 1992). For example, consider an autotetraploid individual with four distinct alleles (abcd) at a locus. There are six possible gametes where sister chromatids distribute to different gametes (ab, ac, ad, bc, bd and cd) and four derived from double reduction (aa, bb, cc and dd). It is therefore important to consider these complexities given that they can influence segregation ratios, as well as expected gametic and genotype frequencies, at the population level.
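
The autotetraploid example above can be checked by enumeration. The following is a minimal sketch, not part of AutoPoly; the function name `tetraploid_gametes` is introduced here for illustration only. It lists the gametes formed from two different chromosomes and those formed by double reduction.

```python
from itertools import combinations


def tetraploid_gametes(genotype):
    """Return (gametes without double reduction, double-reduction gametes).

    `genotype` is a string of four allele labels, e.g. "abcd".
    """
    chromosomes = list(genotype)
    # Two different chromosomes: sister chromatids go to different gametes.
    no_dr = sorted({"".join(sorted(pair)) for pair in combinations(chromosomes, 2)})
    # Double reduction: both sister chromatids of one chromosome in one gamete.
    dr = sorted({c + c for c in chromosomes})
    return no_dr, dr


no_dr, dr = tetraploid_gametes("abcd")
print(no_dr)  # six gametes carrying two different alleles
print(dr)     # four gametes carrying two copies of one allele
```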

A further challenge for polyploids is genotype ambiguity such that, for codominant markers (for example, microsatellites), allele dosage (copy number) cannot be reliably determined (Obbard et al., 2006). Molecular markers are only able to detect which alleles are present but not how many of each there are. For example, in the case of a hexaploid individual, the presence of the two alleles (a, b) at a locus could reflect five possible genotypes (aaaaab, aaaabb, aaabbb, aabbbb, abbbbb). Therefore, many genotypes are indistinguishable and require the use of phenotypes (that is, the unique alleles present, Table 1). There is a long history of theory developed for understanding the population genetics of autopolyploids that incorporate some of the complexities of polysomic inheritance, double reduction and genotype ambiguity (for example, Haldane, 1930; Mather, 1935; Geiringer, 1949; Moody et al., 1993; Ronfort et al., 1998; Wu et al., 2001; Luo et al., 2006; Stift et al., 2008; Meirmans and Tienderen, 2013). However, approaches that explicitly incorporate polysomic inheritance and double reduction to examine contemporary patterns of gene dispersal are currently unavailable for autopolyploids.
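
To make genotype ambiguity concrete, the sketch below enumerates every genotype that is compatible with an observed phenotype at a given ploidy. The helper name `compatible_genotypes` is hypothetical and is used only for illustration.

```python
from itertools import combinations_with_replacement


def compatible_genotypes(phenotype, ploidy):
    """List all genotypes that show exactly the alleles in `phenotype`."""
    alleles = sorted(set(phenotype))
    genotypes = []
    for combo in combinations_with_replacement(alleles, ploidy):
        # Keep only genotypes in which every observed allele is present.
        if set(combo) == set(alleles):
            genotypes.append("".join(combo))
    return genotypes


print(compatible_genotypes("ab", 6))   # hexaploid with two alleles detected
print(compatible_genotypes("abc", 4))  # tetraploid with three alleles detected
```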

Table 1 Genotype classes, phenotypes and general formulas for the number of possible genotypes given k codominant alleles

For diploids, population assignment is commonly achieved through frequency-based likelihood or full Bayesian approaches. Frequency-based methods such as GeneClass (for example, Cornuet et al., 1999) use a sample of reference genotypes that provides information on the allele frequencies from each of the known (fixed) candidate populations. Individuals of unknown origin are then assigned probabilistically to their most likely population of origin. In contrast, the Bayesian method implemented in Structure (Falush et al., 2007) uses an iterative algorithm (Markov Chain Monte Carlo) that randomly assigns individuals into a number of groups (predefined clusters) and converges when the assumption of Hardy–Weinberg and linkage equilibrium is fulfilled. Thus Structure simultaneously identifies the set of populations, their allele frequencies and the population membership coefficient of each individual, and these are updated until the best fit for the data is found. Currently, the only assignment approaches for polyploid data are Genodive (Meirmans and Van Tienderen, 2004) and Structure (Falush et al., 2007), although neither program accounts for double reduction. Only Structure allows for phenotype markers; however, it remains unclear how accurate the method is for performing population assignment compared with a method that allows for polysomic inheritance with double reduction for autopolyploids. In addition, information on the maternal relationship for individual offspring (if known) is not utilized in existing assignment methods. However, in many cases (for example, seed collected from individual plants), including information on the genotype of the known maternal parent could increase the power of population assignment, as only the population origin of the paternal parent requires evaluation.

Here we develop novel methods of population assignment for autotetraploid and autohexaploid species that explicitly account for polysomic inheritance with double reduction and ambiguous genotypes (implemented in the software AutoPoly). The main goal is to use allele frequency information from predefined reference genotypes sampled from a set of candidate populations and then assign a set of genotyped individuals of unknown origin (that is, offspring) to their most likely: (i) joint maternal and paternal source population, when both maternal and paternal origins are unknown (for example, seed dispersal), or (ii) paternal population of origin (for example, pollen dispersal), given the maternal parent is known (genotype and population of origin). For each of these approaches, we present methods for genotype (allele dosage known) and phenotype markers (allele dosage unknown). To assess the accuracy of these assignment methods in relation to Structure, we conducted a power analysis using simulated microsatellite (SSR) data and examined the effects of the number of loci, degree of population differentiation (FST), genotype ambiguity, maternal information, error rates and double reduction. From this, we address the following questions: (i) what is the difference in the accuracy of population assignment between genotype and phenotype data? (ii) does the inclusion of maternal information improve population assignment? (iii) how accurate is AutoPoly for providing point estimates of migration rates? and (iv) how does the accuracy of AutoPoly compare to Structure? Lastly, we test these methods using an empirical data set for the autohexaploid plant Eremophila glabra.

Methods

Likelihood model for polyploid population assignment

In our model, individuals are autopolyploid with either four (Y4) or six (Y6) sets of chromosomes (that is, 2n=4x=tetraploid; 2n=6x=hexaploid), but all populations must have the same ploidy level for any given analysis. Random mating is assumed within each reference population (both in terms of zygote and gamete dispersal) and loci are assumed to be unlinked and in linkage equilibrium. We allow segregation patterns at a given locus to follow expectations for polysomic inheritance with multivalent formation under random chromatid segregation (RCdS). To allow for any double reduction rate (DRR), we use general formulas for DRR anywhere within the theoretical bounds (for RCdS: tetraploids, 0<α<1/7; hexaploids, 0<β<3/11) (Mather, 1936; Geiringer, 1949). Here we assume the maximum double reduction follows that expected for RCdS rather than maximum equational segregation (tetraploids, 0<α<1/6; hexaploids, 0<β<3/10). We assumed RCdS because it was more tractable for calculating general formulas for segregation ratios and because the specific requirements for maximum equational segregation (that is, only one crossover event between locus and centromere) are rather restrictive. Moreover, for most empirical data sets, DRR at a given locus remains unknown (but see Stift et al., 2008) but probably lies somewhere between the theoretical minimum and maximum. By always using general formulas for α and β, we circumvent the problem of other methods that do not allow for multivalent chromosome formation and double reduction (that is, Structure and Genodive) or assume that double reduction is fixed at either the theoretical minimum or maximum (for example, Buteler et al., 1997).

We consider a set of I discrete populations that exchange zygotes (for example, seed) or gametes (for example, pollen). In each population, a representative sample of n individuals are either genotyped (allele dosage known) or phenotyped (allele dosage unknown). We let Gijm and Pijm denote the genotype and phenotype at the jth locus (j=1, 2, …, J) for the mth individual (m=1, 2, …, M) located in the ith population (i=1, 2, …, I). For example, an individual genotype (Gijm) lists the alleles detected where the total alleles recorded must equal the ploidy level (for example, tetraploid, aabc; hexaploid, aabcde), whereas a phenotype (Pijm) lists only the unique alleles present (for example, tetraploid, abc; hexaploid, abcde). We let G={Gijm} and P={Pijm} represent the matrix of genotypes or phenotypes of individuals in the sampled population. The model only allows for resolved genotypes or phenotypes for a given analysis (that is, cannot include both phenotypes and genotypes). Although the majority of empirical data sets will consist of phenotypes, we describe both approaches because beginning with unambiguous genotypes is an easier starting point.

Our assignment method builds on techniques designed for diploids for individual based population assignment using multilocus genotypes (Rannala and Mountain, 1997; Cornuet et al., 1999) and consists of five main steps that calculate: (1) allele frequencies in each candidate (reference) population, (2) expected gamete frequencies at random mating equilibrium (RME) in each population, (3) expected genotype frequencies at RME, (4) assignment probabilities, and (5) simulations to determine the confidence intervals (CIs) for assignment. For each step, we describe the methods for autotetraploids followed by autohexaploids and, when required, derivations for both genotype and phenotype data. For the assignment probabilities (step 4) we describe separately the methods for: Model I=joint maternal and paternal population assignment with a single population origin (for example, seed dispersal), Model II=joint maternal and paternal population assignment with an admixed population origin, and Model III=paternal population assignment given the known maternal genotype of each offspring (for example, pollen dispersal).

Allele frequencies

The first step in population assignment requires that allele frequencies are estimated in each of the reference populations to be evaluated as potential source populations. For genotype data, the frequency of each allele in a given population can be directly counted from information in the genotype matrix at each locus (Gij), as in diploids. In contrast, for phenotype data (Pij), only the distinct alleles that are carried by an individual are known. We use two alternative approaches: (i) Expectation-Maximization (EM)-based estimation, and (ii) marginal (weighted) allele frequency. The EM method follows the approach outlined by De Silva et al. (2005) and implemented in Polysat (Clark and Jasieniuk, 2011) and we run this approach assuming no selfing (for example, self-incompatible plants). One limitation of this method is that it assumes only RCeS occurs, meaning that double reduction under RCdS is not incorporated. To avoid the problem of using unknown priors or restricted assumptions on the nature of polysomic inheritance, we also use an alternative estimate based on the marginal allele frequency. This approach is equivalent to summing the allele counts over the set of possible genotypes for each given phenotype, which can be approximated by determining the number of individuals in each phenotypic class and weighting these proportionally to the number of alternative alleles. Here we let the vector pij={p1ij,…,pkij}, where pkij is the frequency of the kth allele at the jth locus in the ith population. Given the vector of phenotypes Pij and Y4 (tetraploid), we find the frequency of the kth allele as:

$$p_{kij} = \frac{n_{k4} + \tfrac{4}{3}\,n_{k3} + 2\,n_{k2} + 4\,n_{k1}}{4N_i} \tag{1}$$

where we denote Ni as the total number of individuals in the ith population and nk4, nk3, nk2 and nk1 are the number of quadriallele, triallele, biallele and monoallele individuals carrying the kth allele, respectively. In the case of quadriallele phenotypes (first term of the numerator; nk4), the allele counts are unambiguous as the genotype is known. For triallele phenotypes (second term of the numerator; (4/3)nk3), there are four total copies of the kth allele across three alternative genotypes. Similarly, for biallele phenotypes (third term; 2nk2), summing across possible genotypes there are a total of six copies for three genotypes. Lastly, for monoallele phenotypes (last term of the numerator; 4nk1), the counts are unambiguous as the genotype is known. Following this same procedure, for Y6 (hexaploid):

$$p_{kij} = \frac{n_{k6} + \tfrac{6}{5}\,n_{k5} + \tfrac{3}{2}\,n_{k4} + 2\,n_{k3} + 3\,n_{k2} + 6\,n_{k1}}{6N_i} \tag{2}$$

where nk6, nk5, nk4, nk3, nk2 and nk1 are the number of hexallele, pentallele, quadriallele, triallele, biallele and monoallele individuals carrying the kth allele, respectively. Compared with the EM-based estimation, the marginal allele frequency method may result in a bias towards more uniform allele frequencies, particularly when the population sample is small.
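
The marginal (weighted) allele frequency can be sketched in a few lines. The tetraploid weights (1, 4/3, 2 and 4) come from the text; treating them as ploidy divided by the number of unique alleles, and extending that rule to hexaploids, is an assumption made here for illustration. The function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def marginal_allele_frequencies(phenotypes, ploidy):
    """Estimate allele frequencies from phenotypes (unique alleles only).

    Each allele in a phenotype with j unique alleles is credited with
    ploidy / j copies, the mean dosage over all compatible genotypes.
    """
    weighted_counts = Counter()
    for phenotype in phenotypes:
        alleles = set(phenotype)
        weight = Fraction(ploidy, len(alleles))
        for allele in alleles:
            weighted_counts[allele] += weight
    total_copies = ploidy * len(phenotypes)
    return {a: c / total_copies for a, c in weighted_counts.items()}


# Four tetraploid individuals: quadri-, tri-, bi- and monoallele phenotypes.
freqs = marginal_allele_frequencies(["abcd", "abc", "ab", "a"], ploidy=4)
print(freqs)
```

Exact fractions are used so that the frequencies sum to exactly one, which makes the weighting easy to verify by hand.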

Population gametic probabilities

The next step requires the expected frequency of each gamete in each reference population under the assumption of random mating (RME), given the allele frequencies are known and a given DRR. For autopolyploids with RCdS, we must first calculate the expected frequencies of all possible gametes at equilibrium from the allele frequencies. In contrast to diploids, autopolyploids do not reach equilibrium after one generation of random mating but approach this asymptotically (Haldane, 1930; Geiringer, 1949; Bever and Felber, 1992). However, this can be approximated with general limit formulas for RME under segregation patterns intermediate between RCeS and RCdS (Geiringer, 1949).

We denote the vector gij={g1ij,…,gmij}, where gmij is the expected frequency of the mth gamete at the jth locus in the ith population at RME. Tetraploids can transmit two allele copies and thus two classes of gametes are possible, either a monoallele or a biallele which occur with the frequencies xkk and 2ykk′, respectively (where allele k≠k′). For example, in a tetraploid population with k=2 unique alleles, there are three possible gametes. For clarity, we use notation x11 in place of xkk, this gives x11, 2y12, and x22. To explicitly allow for polysomic inheritance and any DRR, we use the general limit formulas derived by Geiringer (1949) to calculate the equilibrium gamete frequencies for any given probability of double reduction (0<α<0.1428) (also see Wricke and Weber, 1986) as:

where pkij and qk′ij are the frequencies of alleles k and k′ at the jth locus in the ith population. When α=0, the general limit formulas reduce to the multinomial expansion of (p+q+…)^(Y/2), where x11=p^2, 2y12=2pq and x22=q^2.
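
As an illustration of how equilibrium gamete frequencies depend on α, the sketch below uses the fixed point of the standard tetrasomic recursion x' = αp + (1−α)(x/3 + 2p²/3), which gives x_kk = p² + F·p(1−p) and 2y_kk′ = 2pq(1−F) with F = 3α/(2+α). This is a derivation made here under the stated recursion, and its notation may differ from the limit formulas of Geiringer (1949) used in the text. The function name is hypothetical.

```python
from itertools import combinations


def tetraploid_gamete_frequencies(allele_freqs, alpha):
    """Equilibrium frequencies of diploid gametes in an autotetraploid.

    allele_freqs: dict mapping allele label -> frequency (summing to 1).
    alpha: double reduction rate, assumed to lie between 0 and 1/7 (RCdS).
    """
    if not 0.0 <= alpha <= 1.0 / 7.0:
        raise ValueError("alpha must lie between 0 and 1/7")
    f = 3.0 * alpha / (2.0 + alpha)  # excess homozygosity among gametes
    gametes = {}
    for k, p in allele_freqs.items():
        gametes[(k, k)] = p * p + f * p * (1.0 - p)       # monoallele gamete
    for (k, p), (k2, q) in combinations(sorted(allele_freqs.items()), 2):
        gametes[(k, k2)] = 2.0 * p * q * (1.0 - f)        # biallele gamete
    return gametes


print(tetraploid_gamete_frequencies({"a": 0.6, "b": 0.4}, alpha=0.0))
print(tetraploid_gamete_frequencies({"a": 0.6, "b": 0.4}, alpha=1.0 / 7.0))
```

With α=0 the output is the binomial expansion; increasing α shifts probability from biallele to monoallele gametes while the total stays at one.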

Hexaploid gametes transmit three alleles and can be classified into three classes depending on the number of unique alleles they carry. These include monoallele gametes (xkkk), biallele gametes (3ykkk' and 3ykk'k') and triallele gametes (6zkk'k''). For example, in a hexaploid population with a total of k=3 alleles, there are 10 possible gametes (x111, x222, x333, 3y112, 3y122, 3y223, 3y233, 3y113, 3y133 and 6z123). We used the general limit formulas (Equation 28; Geiringer, 1949) to calculate the equilibrium gamete frequencies for any DRR (0<β<0.2727). The probability of the ith population producing monoallele, biallele and triallele gametes in a hexaploid population is:

where pkij, qk′ij and rk′′ij represent the observed frequencies of alleles k, k′ and k′′, respectively. When β=0, the general formulas reduce to the multinomial expansion of (p+q+r+…)^(Y/2) (for example, x111=p^3, 3y112=3p^2q, 6z123=6pqr).
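
The β=0 limiting case can be checked directly. The sketch below covers only that case (no double reduction) and does not implement the general limit formulas for β>0; the function name is introduced here for illustration.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial


def hexaploid_gametes_no_double_reduction(allele_freqs):
    """Triploid gamete frequencies for a hexaploid when beta = 0.

    Each gamete is a multiset of three alleles whose probability is the
    corresponding term of the expansion of (p + q + r + ...)^3.
    """
    gametes = {}
    for combo in combinations_with_replacement(sorted(allele_freqs), 3):
        prob = float(factorial(3))
        for allele, copies in Counter(combo).items():
            prob = prob / factorial(copies) * allele_freqs[allele] ** copies
        gametes[combo] = prob
    return gametes


gametes = hexaploid_gametes_no_double_reduction({"a": 0.5, "b": 0.3, "c": 0.2})
print(len(gametes))  # number of distinct gametes for k = 3 alleles
```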

Genotype and phenotype probabilities

For tetraploids, when the mother is unknown, the probability of observing the genotype Gijm for individual m at locus j, given it is from population i, is dependent on assuming the individual is solely from population i and the vector of expected gametic frequencies from population i (gij) and polyploidy=Y4. This equates to the genotype probabilities at RME, Pr(Gijm|i,gij,Y4). Henceforth, to distinguish genotypes we denote alleles with letters and their allele copy number with subscripts. The probability of observing each genotype class follows Geiringer (1949) as

Unlike for tetraploids, as far as we are aware, there are no general formulas to calculate the expected genotype frequencies for hexaploids that take into account double reduction. Therefore, we derived the expected genotype frequencies for hexaploids at RME, which simply follow from the random union of gametes within each population (Appendix A1). We follow the same notation as for tetraploids. The probabilities of observing each genotype class, given that the individual is from population i, are:

For hexaploids, this gives 11 distinct genotype classes (Table 1). Although the number of possible genotypes increases rapidly with the number of alleles (for example, when k=6, gives 462 genotypes), the expected genotype frequencies at RME for each can be calculated on the basis of their respective genotype class.
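
The random union of gametes can be sketched generically for either ploidy. This is an illustrative sketch rather than the derivation in Appendix A1. Gamete frequencies are supplied as a dictionary keyed by sorted allele tuples, and reading a genotype class as a dosage pattern (an integer partition of the ploidy) is an interpretation made here. All function names are hypothetical.

```python
from collections import defaultdict
from math import comb


def genotype_probabilities(gamete_freqs):
    """Genotype frequencies from the random union of two gametes."""
    genotypes = defaultdict(float)
    for g1, f1 in gamete_freqs.items():
        for g2, f2 in gamete_freqs.items():
            genotypes[tuple(sorted(g1 + g2))] += f1 * f2
    return dict(genotypes)


def n_genotypes(k, ploidy):
    """Number of distinct genotypes (multisets) for k alleles."""
    return comb(k + ploidy - 1, ploidy)


def n_genotype_classes(ploidy, largest=None):
    """Number of dosage patterns, counted as integer partitions of ploidy."""
    largest = ploidy if largest is None else largest
    if ploidy == 0:
        return 1
    return sum(n_genotype_classes(ploidy - part, part)
               for part in range(1, min(ploidy, largest) + 1))


# Tetraploid gametes for p = 0.6, q = 0.4 without double reduction.
gametes = {("a", "a"): 0.36, ("a", "b"): 0.48, ("b", "b"): 0.16}
print(genotype_probabilities(gametes))
print(n_genotypes(6, 6), n_genotype_classes(6), n_genotype_classes(4))
```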

For phenotypes, the expected frequencies of full homozygotes and full heterozygotes (for example, monoallele and hexallele for a hexaploid, respectively) are equal to the genotype frequencies at RME. To evaluate genotype probabilities for partial heterozygotes, we must account for the lack of information on allele copy number. To address this problem, we take the sum of the probabilities of obtaining each of the possible genotypes. For example, for a tetraploid phenotype with three unique alleles detected, abc, the set of possible genotypes is Gijm∈{Pijm}={aabc, abbc, abcc}. The probability of obtaining each of the alternative genotypes is proportional to its frequency at RME, which depends on the equilibrium gametic frequencies in the ith population (gij) and ploidy (Yx), given the population from which the individual was sampled,

$$\Pr(P_{ijm} \mid i, g_{ij}, Y_x) = \sum_{G_{ijm} \in \{P_{ijm}\}} \Pr(G_{ijm} \mid i, g_{ij}, Y_x) \tag{7}$$
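
The summation over compatible genotypes can be illustrated as follows. To stay self-contained, this sketch assumes no double reduction, so that genotype frequencies at RME are multinomial; with double reduction the per-genotype term would instead come from the gamete-based probabilities. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial


def multinomial_genotype_probability(genotype, allele_freqs):
    """Genotype frequency assuming no double reduction (an assumption here)."""
    prob = float(factorial(len(genotype)))
    for allele, copies in Counter(genotype).items():
        prob = prob / factorial(copies) * allele_freqs[allele] ** copies
    return prob


def phenotype_probability(phenotype, ploidy, allele_freqs):
    """Sum genotype probabilities over all genotypes showing `phenotype`."""
    alleles = sorted(set(phenotype))
    total = 0.0
    for combo in combinations_with_replacement(alleles, ploidy):
        if set(combo) == set(alleles):  # every detected allele is present
            total += multinomial_genotype_probability(combo, allele_freqs)
    return total


freqs = {"a": 0.5, "b": 0.3, "c": 0.2}
print(phenotype_probability("abc", 4, freqs))  # aabc + abbc + abcc
```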

Assignment probabilities

Model I. Population assignment (single candidate, mother unknown)

Here we assume individual m is from two unknown parents that belong to a single candidate population. This probability comes directly from the probability of observing the genotype in each of the candidate populations. Assuming the alleles at the J loci are independent (no linkage), we calculate the probability of observing the multilocus genotype Gim, or phenotype Pim, in a given candidate population i as the product of the probabilities at each locus:

$$\Pr(G_{im} \mid i) = \prod_{j=1}^{J} \Pr(G_{ijm} \mid i, g_{ij}, Y_x) \tag{8}$$

When an allele is absent from population i, this results in a zero probability of gametes and genotypes that carry that allele. This reduces the probability of the multilocus genotype/phenotype to zero, although the particular allele may be rare in the population or missing among the reference individuals (Cornuet et al., 1999). To account for this problem, we follow Rannala and Mountain (1997) and let the frequency of the absent allele be proportional to the inverse of the number of gene copies at the locus, adjusted by the number of observed alleles as, pkji*=(1/Kj)/NijY, where Kj is the total number of alleles detected across all populations for the jth locus, Ni is the total number of individuals sampled in the ith candidate population and Y is the ploidy level (for example, Y=6 for hexaploid). Given this allele frequency, we re-calculate allele frequencies proportionally so that the allele frequencies in each population sum to one and then re-calculate the probability of all possible gametes and phenotypes/genotypes.
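
The correction for absent alleles can be sketched as below. The pseudo-frequency (1/Kj)/(N·Y) follows the formula in the text; taking N as the number of individuals sampled in the candidate population is the reading assumed here, and the function name is hypothetical.

```python
def fill_missing_alleles(allele_freqs, all_alleles, n_individuals, ploidy):
    """Give absent alleles a small frequency, then rescale to sum to one.

    allele_freqs: observed frequencies in one candidate population.
    all_alleles: every allele detected at this locus across all populations.
    """
    k_total = len(all_alleles)
    pseudo = (1.0 / k_total) / (n_individuals * ploidy)
    filled = {}
    for allele in all_alleles:
        observed = allele_freqs.get(allele, 0.0)
        filled[allele] = observed if observed > 0.0 else pseudo
    total = sum(filled.values())
    # Proportional rescaling so the frequencies sum to one again.
    return {allele: f / total for allele, f in filled.items()}


corrected = fill_missing_alleles({"a": 0.5, "b": 0.5}, ["a", "b", "c"],
                                 n_individuals=10, ploidy=6)
print(corrected)
```

After this step the gamete and genotype probabilities would be recomputed from the corrected frequencies, so that no multilocus probability collapses to zero.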

Model II. Population assignment (admixed individuals)

We now consider the situation in which we assume one parent is a resident of population i and the other parent belongs to a different candidate population i′. We denote this first-generation admixed genotype as G[i,i']jm, where one gamete is from the resident population i (that is, g[i]j{..}) and the other gamete is from population i' (that is, g[i']j{..}). For tetraploids, the probability of observing individual m with a mixed (F1) genotype G[i,i']jm at locus j depends on the gametic frequencies in each of the two populations (gij, gi'j) and can be written as Pr(G[i,i']jm|gij,gi'j,Y4). We replace the term for each specific genotypic class as follows:

Similarly, the probability of a mixed genotype for a hexaploid individual is Pr(G[i,i']jm|gij,gi'j,Y6); the formulas for each genotype class are described in Appendix A2.

Phenotype data are treated as outlined for the case of non-admixed individuals (Equation 7). Similarly, the probability of observing the multilocus genotype G[i,i']m, or phenotype P[i,i']m, assuming it is an F1 between population i and i', follows that of Equation 8.

Model III. Population assignment (paternal origin given mother known)

Now we consider a situation where individual offspring are sampled and the identity of the female (mother) and her population of origin and genotype/phenotype are known, but the location and identity of the male (father) is unknown. The location of the unknown father may be in any of the candidate populations, including that of the known female. The intent is to determine for a given offspring (o), the most likely source of the male gamete (xm), given that the female parent (f) is known. This can be expressed in a similar framework used for paternity analysis (for example, Meagher, 1986). We evaluate the probability of obtaining the offspring genotype (Goj) given the following relationship: f is a parent of o and the male parent, mi, is located in the ith population. We make the assumption that the female and male parents are not F1 or recent immigrants to the population in which they were sampled. The probability depends on the gamete frequencies in the ith population (gij), ploidy (Yx), and the DRR (αj or βj) at the jth locus and can be written as:

$$\Pr(G_{oj} \mid G_{fj}, g_{ij}, Y_x) = \sum_{x_f} P(x_f \mid G_{fj})\, P(x_m \mid g_{ij})$$

where xf and xm are the female and male gametes, respectively, P(xf|Gfj) is the gamete segregation probability from a given female genotype and P(xm|gij) is the probability of the male gamete given the expected gamete frequencies in a candidate population (Equations 3 and 4). For polyploids, there can be many alternative pairs of gametes that two parents could have contributed to the offspring; hence, we sum over all the possible gametes segregating from the known female parent. Although gametic segregation ratios have been described previously for autohexaploids for fixed RCeS or RCdS, most loci probably exhibit intermediate DRR between the two extremes. Therefore, for hexaploids we derived generalized segregation probabilities for any value of β (Appendix A3).
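
For tetraploids with known allele dosage, the sum over maternal gametes can be sketched as follows. The segregation rule used here (with probability α a doubled copy of one random chromosome, otherwise two different chromosomes chosen at random) is a common parameterization and is an assumption of this sketch; the hexaploid case (Appendix A3) is not covered. Function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def female_gamete_probs(mother, alpha):
    """Segregation probabilities of diploid gametes from a tetraploid mother."""
    probs = Counter()
    chromosomes = list(mother)
    for c in chromosomes:
        probs[(c, c)] += alpha / 4.0                   # double reduction
    pairs = list(combinations(chromosomes, 2))
    for pair in pairs:
        probs[tuple(sorted(pair))] += (1.0 - alpha) / len(pairs)
    return dict(probs)


def offspring_probability(offspring, mother, pop_gametes, alpha):
    """Pr(offspring | mother, paternal gamete drawn from a candidate population).

    pop_gametes: gamete frequencies in the candidate population, keyed by
    sorted allele tuples such as ("a", "b").
    """
    total = 0.0
    for xf, p_xf in female_gamete_probs(mother, alpha).items():
        if Counter(xf) - Counter(offspring):
            continue  # this maternal gamete cannot be part of the offspring
        xm = tuple(sorted((Counter(offspring) - Counter(xf)).elements()))
        total += p_xf * pop_gametes.get(xm, 0.0)
    return total


pop = {("a", "a"): 0.36, ("a", "b"): 0.48, ("b", "b"): 0.16}
print(offspring_probability("aaab", "aabb", pop, alpha=0.0))
```

Evaluating this function once per candidate population, and multiplying across loci, gives the quantities that are compared when assigning the paternal parent.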

Likelihood of population assignment

For each individual, we evaluate the likelihood of population assignment to each of the candidate populations and, for first-generation admixed individuals, to each pair of candidate populations. Following Cornuet et al. (1999), for Model I we take the logarithms of the genotype probabilities in each candidate population (i=1, 2,…, I). For Model II, we take the logarithms for each hth population pair (h=1, 2,…, H), where the number of admixed population genotype/phenotype possibilities is H=I(I−1). In the case of two candidate populations i and i′, for example, the log-likelihoods of observing individual m, assuming either that it is solely from population i or that it is an admixed genotype originating from the two populations [i, i′], would be:

$$\ln L(i) = \sum_{j=1}^{J} \ln \Pr(G_{ijm} \mid i, g_{ij}, Y_x), \qquad \ln L([i,i']) = \sum_{j=1}^{J} \ln \Pr(G_{[i,i']jm} \mid g_{ij}, g_{i'j}, Y_x)$$

Each individual is assigned to the population or admixed population pair in which the likelihood of observing the individual’s genotype/phenotype is the highest. Using this method, a candidate population or mixed population pair is always assigned from among the set of reference populations. In order to discriminate between the most likely candidates, we use the statistic ln Δ, defined as the difference between the log-likelihood of the most likely candidate (ln1) and that of the second most likely candidate (ln2):

$$\ln \Delta = \ln_1 - \ln_2$$
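
The assignment rule and the ln Δ statistic can be sketched as follows. The input layout (a dictionary from each candidate, either a population or a population pair, to its per-locus probabilities) and the function name are assumptions made for illustration. Probabilities are assumed to be strictly positive, for example after the correction for absent alleles.

```python
import math


def assign_individual(per_locus_probs):
    """Return (best candidate, ln Delta) for one individual.

    per_locus_probs: dict mapping candidate -> list of per-locus
    probabilities of the observed genotype or phenotype.
    """
    log_liks = {cand: sum(math.log(p) for p in probs)
                for cand, probs in per_locus_probs.items()}
    ranked = sorted(log_liks, key=log_liks.get, reverse=True)
    best, second = ranked[0], ranked[1]
    return best, log_liks[best] - log_liks[second]


candidates = {
    "pop1": [0.10, 0.20],
    "pop2": [0.05, 0.20],
    ("pop1", "pop2"): [0.08, 0.20],  # first-generation admixed pair
}
best, ln_delta = assign_individual(candidates)
print(best, ln_delta)
```

A small ln Δ means the two best candidates are nearly indistinguishable, which is why a critical value is needed before an assignment is accepted.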

Confidence intervals

We used simulations to assess the accuracy of assignment procedures and identify the critical values of ln Δ. The aim here is to provide a measure of confidence that an individual belongs to the assigned candidate population or jointly to two populations in the case of admixed genotypes/phenotypes. We generated new sets of multilocus genotypes/phenotypes for each population by drawing gametes according to their expected frequencies in the reference samples. Similarly, for population assignment when the mother is unknown, we generated new sets of mixed multilocus genotypes/phenotypes for each of the hth population pairs (h=1, 2,…, H). Next we compare ln Δ between groups of simulated individuals that were assigned correctly and incorrectly. Critical values for population assignment were approximated from the distribution of ln Δ values of the simulated data (typically n=10 000; see Supplementary Information S1 and Supplementary Figure S1 for more details).

Genotyping errors

To examine the effects of genotype/phenotype errors on critical values, we modified the simulated individuals according to two sources of error, e1 and e2, where e1 is the probability of allelic dropout (removing an allele from a phenotype or genotype) and e2 is the probability of an allele being mis-scored. For the latter, e2, an allele is replaced with an alternative proportional to the allele frequencies in the sampled population. These simulations did not explicitly model null alleles; however, increasing rates of allelic dropout will generate similar effects on critical values.
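
One way to apply the two error sources to a simulated phenotype is sketched below. Several details are assumptions of this sketch and are not specified in the text: errors are applied independently to each allele, dropout is tested before mis-scoring, and a phenotype may end up empty. The function name is hypothetical.

```python
import random


def apply_scoring_errors(phenotype, allele_freqs, e1, e2, rng):
    """Return the phenotype observed after allelic dropout and mis-scoring.

    e1: probability that an allele drops out.
    e2: probability that an allele is replaced by a different allele,
        drawn in proportion to the population allele frequencies.
    """
    observed = set()
    for allele in phenotype:
        if rng.random() < e1:
            continue  # allelic dropout
        if rng.random() < e2:
            others = [a for a in allele_freqs if a != allele]
            if others:
                weights = [allele_freqs[a] for a in others]
                allele = rng.choices(others, weights=weights, k=1)[0]
        observed.add(allele)
    return sorted(observed)


rng = random.Random(1)
freqs = {"a": 0.5, "b": 0.3, "c": 0.2}
print(apply_scoring_errors(["a", "b"], freqs, e1=0.005, e2=0.005, rng=rng))
```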

Power analysis with simulated data sets

In order to compare the performance of these different polyploid assignment methods, we simulated populations using an individual-based model with polysomic inheritance and double reduction. Briefly, this simulation considered a finite island model with migration between 10 populations of constant size, each containing 1000 hermaphrodite individuals (with no selfing) with 24 microsatellite loci and a mutation rate of μ=2 × 10^−4. By running separate simulations with different migration rates, we obtained replicate data sets with five different levels of average population differentiation (FST: 0.03, 0.06, 0.09, 0.13 and 0.20; see Supplementary Information S2 for more details).

We ran each combination of FST (0.03, 0.06, 0.09, 0.13 and 0.20) with three different numbers of loci (6, 12 and 24), two marker types (genotype, phenotype) and two model approaches (mother unknown (Models I and II) and mother known (Model III)) for two ploidy levels (tetraploid and hexaploid). Once the model was at mutation–migration equilibrium, we randomly sampled a set of reference samples (n=60 from each population) and generated a set of offspring (n=10 000). Here, among the final set of offspring, the migration rate was increased to m=0.5, so that ~5000 were generated from random mating within populations and the remainder from interpopulation mating. Although this represents an atypically high migration rate, it facilitated comparisons of accuracy at equal sample sizes between metapopulations with different FST. Owing to computational constraints for forward-time simulation of polyploid populations, we obtained 10 replicates for each combination of the above parameters (total n=1200 simulations). The low s.e. for accuracy among replicates, particularly for the higher FST values of 0.13 and 0.20 and for >12 loci (s.e. in accuracy <1%), suggests that this number was sufficient to demonstrate differences among the various assignment methods. Given that the incorrect choice of α and β had little impact on accuracy (Supplementary Information S2), we drew a random DRR at each locus between the theoretical minimum and maximum values for RCdS for α (0–0.14) and β (0–0.2727). With AutoPoly, individuals were assigned to their most likely paternal population of origin at 80% confidence calculated from simulating n=10 000 individuals with error rates (e1=0.005 and e2=0.005).

Migration rate point estimates

We also examined the performance of AutoPoly to provide estimates of interpopulation migration rates. Although population assignment is not designed to explicitly estimate migration rates, point estimates can be obtained by dividing the number of detected immigrants by the total sample size (Manel et al., 2003). Here we simulated autohexaploid populations with different migration rates for populations with different degrees of population differentiation (FST) and number of loci that resemble data typically available for studies of natural populations. Following the same procedure described in the power simulations, once the simulation reached drift-mutation equilibrium, we generated a final set of offspring (n=10 000 from each population). For the progeny, the migration rate, m, is the probability that individual offspring were generated through interpopulation mating, while 1−m were generated from random mating within populations. We ran 10 replicates for each of the following combination of parameters: migration rate m (0.02, 0.1, 0.2), FST (0.03, 0.06, 0.09), and number of loci (6, 12). Here we examined phenotype markers, mother known (Model III) and a fixed intermediate DRR (β=0.136) (that is, n=180 simulations), which was assumed known in the assignment test. Individuals were assigned with the same conditions in the previous power simulations.
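
The point estimate described above is a simple proportion. In the sketch below, an immigrant is taken to be an offspring whose assigned paternal population differs from the population of its known mother; how offspring that fail the confidence threshold are treated is not addressed here, and the function name is hypothetical.

```python
def migration_rate_estimate(maternal_pops, assigned_paternal_pops):
    """Fraction of offspring whose assigned father is from another population."""
    if len(maternal_pops) != len(assigned_paternal_pops):
        raise ValueError("one assignment is required per offspring")
    if not maternal_pops:
        raise ValueError("at least one offspring is required")
    immigrants = sum(1 for mother, father
                     in zip(maternal_pops, assigned_paternal_pops)
                     if father != mother)
    return immigrants / len(maternal_pops)


print(migration_rate_estimate(["A", "A", "B", "B"], ["A", "B", "B", "B"]))
```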

AutoPoly and Structure

In order to compare the performance of AutoPoly with Structure, we use a subset of the same simulated data detailed above for the power simulations. Here we focus on tetraploid populations with lower levels of population differentiation (FST: 0.03, 0.06, 0.09). Initial simulations identified little difference in accuracy between the programs with higher levels of FST (both methods >98% accuracy).

With AutoPoly, individuals were assigned as in the above simulations. We used Structure version 2.3.4 (Falush et al., 2007) to assign individuals to their most likely candidate population or admixed population pair. Here we used the admixture model, updated allele frequencies only for the reference individuals, set the number of genetic clusters to K=10 and used sampling location as prior information (LOCPRIOR). Therefore, unlike studies that aim to search for the most likely number of clusters, here we assume K is known and equal to the number of demes in the simulated metapopulation. All data sets were run for a burn-in period of 20 000 iterations followed by 200 000 iterations of the Markov Chain Monte Carlo (see Supplementary Information S2 for more details). The POPINFO parameter is not available for ploidy >2. Therefore, we assigned each individual using the membership coefficients (that is, ancestry proportions) Qk, which represent the posterior probability of membership to each of the K=10 clusters (here Q1, Q2,…, Q10). We assigned an individual to belong solely to the kth population if Qk>AQ, where AQ is the threshold Qk value for assignment. We tested a range of AQ thresholds (0.9, 0.8, 0.7, 0.6). If all coefficients were in the bounds of (1–AQ)<Q<AQ, an individual was instead assigned as admixed, with the most likely population pair involved being the two populations with the first and second highest Q values.
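
The threshold rule applied to the membership coefficients can be sketched as below. This is a simplified reading: an individual is assigned to a single cluster when its largest coefficient exceeds AQ and is otherwise treated as admixed between the two clusters with the highest coefficients. The function name and the zero-based cluster indices are choices made here for illustration.

```python
def assign_from_q(q, threshold):
    """Assign from membership coefficients q (one value per cluster).

    Returns a 1-tuple (single population) or a 2-tuple (admixed pair)
    of cluster indices.
    """
    ranked = sorted(range(len(q)), key=lambda k: q[k], reverse=True)
    if q[ranked[0]] > threshold:
        return (ranked[0],)
    return (ranked[0], ranked[1])


print(assign_from_q([0.92, 0.05, 0.03], threshold=0.9))
print(assign_from_q([0.55, 0.40, 0.05], threshold=0.9))
```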

Autohexaploid empirical example

As an empirical test of this method, we used microsatellite (SSR) data available for the autohexaploid bird-pollinated shrub Eremophila glabra ssp. glabra from central Australia. At the study site, E. glabra is found in a series of discrete populations that are separated by agriculture. In this ~15 × 15 km area, reference landscape grids were delineated and intensively surveyed for E. glabra plants, identifying 15 discrete populations. To obtain reference allele frequencies, 32–62 individuals from each population were phenotyped at six highly polymorphic microsatellite markers (allelic dosage cannot be resolved in E. glabra). For four populations, 7–11 seeds were phenotyped from each of up to 11 known plants. DNA extraction methods and SSR protocols follow those of Elliott (2009). Diversity and divergence measures were calculated using the adult samples from each of the 15 populations (see Supplementary Information S3 for more details).

We performed population assignment for each of the offspring phenotypes using AutoPoly when the mother is known (Model III). Therefore, we are assessing the most likely origin of the paternal (pollen) parent, which may be the same population as the mother or any of the other 14 candidate populations sampled. Given that E. glabra is bird pollinated, it is feasible that pollen could be dispersed among any of the candidate populations within the 15 × 15 km area. To calculate CIs, we simulated five replicates of n=10 000 individuals at each of two total error rates, E=0.01 and 0.03 (where E=e1+e2, with e1=e2=0.005 and e1=e2=0.015, respectively). To examine the effect of using an incorrect DRR, we drew random values between the theoretical minimum and maximum for use in the simulations (‘true DRR’) and drew a second, independent set of random values (‘assumed DRR’) for use in the assignment calculations, then compared the results with those obtained when the true and assumed values were the same. We used random values because real loci are likely to vary between the theoretical minimum and maximum owing simply to variation in marker position on the chromosome and distance from the centromere. As in the power simulations for data sets with low FST (0.03) and few markers (that is, 6 loci), Structure did not give consistent results with the E. glabra data owing to a lack of model convergence.

Results

Power analysis

The accuracy of population assignment for both tetraploids (Figure 1) and hexaploids (Figure 2) increased with more loci and higher population differentiation. The accuracy was generally similar between ploidy levels, with 0.5–3% greater accuracy for hexaploids compared with tetraploids. Accuracy can be improved by increasing the confidence threshold, although for data with low information content this comes at the expense of the number of individuals that can be assigned. For example, with only 6 loci, phenotypes and mother known, a mean of ~84% could be assigned at 80% confidence, compared with a mean of 35% at 95% confidence. However, accuracy improves rapidly (exceeding 90%) under moderate levels of population differentiation (FST=0.09), even with only six loci, regardless of marker type or model approach.
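The trade-off between the confidence threshold and the number of assignable individuals can be illustrated with a short Python sketch. The record format, the function name and the toy values are all assumptions made for illustration. They do not reproduce how AutoPoly derives confidence levels, which relies on the simulated CIs described in the Methods section.

```python
def summarize_assignments(records, threshold):
    """Summarize assignment at a given confidence threshold.

    records   : list of (confidence, assigned_pop, true_pop) tuples, where
                confidence is a hypothetical per-individual score in [0, 1]
    threshold : minimum confidence required to accept an assignment

    Returns (proportion_assigned, accuracy_among_assigned).
    """
    kept = [(assigned, true) for conf, assigned, true in records if conf >= threshold]
    n_total = len(records)
    proportion_assigned = len(kept) / n_total if n_total else 0.0
    if kept:
        accuracy = sum(assigned == true for assigned, true in kept) / len(kept)
    else:
        accuracy = float("nan")  # no individual passed the threshold
    return proportion_assigned, accuracy


# Invented toy records: (confidence, assigned population, true population).
toy = [(0.99, 1, 1), (0.96, 2, 2), (0.90, 1, 2), (0.85, 3, 3), (0.70, 2, 1)]

for threshold in (0.95, 0.80):
    print(threshold, summarize_assignments(toy, threshold))
```

In this toy set, the stricter threshold keeps fewer individuals but all of them are correct, whereas the looser threshold assigns more individuals at lower accuracy.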

Figure 1

Simulations of autotetraploid populations showing the percentage of individuals assigned to the correct population for different levels of population differentiation (FST), number of loci (6, 12 and 24), marker types (genotype and phenotype) and model conditions (mother known and mother unknown). Indicated is the mean accuracy (percentage of individuals correctly assigned) (n=10 replicates) for 10 000 simulated progeny when the mother information of each progeny is included in the analysis and when the mother information is not included. Individuals assigned at 95% confidence levels. The actual proportion of immigrants among the progeny was fixed to 50% for all replicates (see Methods section for more details).

Figure 2

Simulations of autohexaploid populations showing the percentage of individuals assigned to the correct population for different levels of population differentiation (FST), number of loci (6, 12 and 24), marker types (genotype and phenotype) and model conditions (see Figure 1 legend for more details).

The difference in assignment accuracy between genotype and phenotype markers was relatively low for most simulation parameters but was most evident at lower population differentiation. For example, with 6 loci, genotypes had between 3% (FST=0.03) and 0.5% (FST=0.20) greater accuracy for tetraploid data when using Model III (mother known; Figure 1). Greater differences in accuracy between genotypes and phenotypes were observed with increasing error rates and under mother unknown models. For example, with a total error rate=0.03 and assuming that the mother was unknown, genotypes had 14.1% (FST=0.03) to 9.9% (FST=0.20) greater accuracy than phenotypes.

The inclusion of maternal information (Model III, mother known) resulted in a large improvement in the accuracy of population assignment compared with Models I and II (mother unknown). For example, with phenotypes, the mother known model had 19.1% (FST=0.03) to 4.4% (FST=0.20) greater accuracy than mother unknown (Figures 1 and 2). The difference between these model types was most evident at low levels of population differentiation and when only 6 or 12 loci were available. With 24 loci and FST⩾0.13, there was no difference in assignment accuracy.

Migration rates

The migration rate simulations suggest that false positives can substantially inflate the estimated number of immigrant genotypes (Figure 3). Inflation of ~25% above the true value occurred when population differentiation was low (FST=0.03) and few loci were simulated (n=6) (Figure 3a). Genotyping individuals for more loci (n=12) reduced this bias to ~10%, while scenarios with higher population differentiation of FST=0.06 and FST=0.09 reduced this bias further to ~2% (Figure 3b) and <1% (Figure 3c), respectively (assuming 12 loci). Although the degree of bias is considerable for data sets with low power, the bias observed in the estimated migration rate is relatively consistent across different values of the true migration rate.

Figure 3

Simulations of autohexaploid populations showing the actual proportion of simulated immigrants between populations against the proportion estimated by AutoPoly. Shown are three different levels of population differentiation (FST) and two numbers of loci (6, 12) for phenotype data with the mother known model. The arrow indicates the simulated population parameters most similar to the E. glabra data. The dotted line indicates where simulated and estimated values are equal.

AutoPoly and Structure

We found that the two programs gave very similar results when more marker data were available (number of loci=12 or 24) and population differentiation was high (FST=0.09) (Figure 4). In contrast, the performance of Structure (with a threshold Q value of AQ>0.7) was substantially lower than that of AutoPoly for all parameter combinations with six loci and FST=0.06, and with ⩽12 loci at the lowest differentiation of FST=0.03. In some cases, Structure could not be run successfully with data sets of six loci for either genotypes or phenotypes (that is, missing data points; Figures 4a and b).

Figure 4

The percentage of simulated autotetraploid individuals assigned to the correct population using AutoPoly (circles) and Structure (squares) for (a) genotype (allele copy number known) and (b) phenotype markers (allele copy number unknown). Mean accuracy (percentage of individuals correctly assigned) (n=10 replicates) for 10 000 simulated progeny for each combination of parameters, including three levels of population differentiation (FST: 0.03, 0.06, 0.09) and number of loci (6, 12 and 24). For AutoPoly, individuals assigned at 95% confidence levels and the models tested include mother unknown (Models I and II; open circles) and mother known (Model III; filled circle). For Structure, we assigned each individual to their most likely source using thresholds of the membership coefficient Q of AQ>0.9 (open square) or AQ>0.7 (closed square) (see Methods section for details). The actual proportion of immigrants among the progeny was fixed to 20% for all replicates.

Marker type had drastically different effects on the accuracy of the two methods. In the case of AutoPoly, the accuracy of genotypes was only slightly higher than that of phenotypes (see above). In contrast, Structure exhibited much higher accuracy (up to ~40%) with genotypes than with phenotypes for the exact same parameters, except when 24 loci were available or population differentiation was high (FST=0.09).

Population assignment in E. glabra

In the E. glabra populations, we detected an average of 14.5–23 alleles across loci in each population (A, Supplementary Table S1). Mean pair-wise population differentiation was low (Rho=0.082±0.031 s.d.; Supplementary Table S2) and similar to the lowest differentiation examined in the power testing simulations (that is, FST=0.03; Rho=0.11±0.01 s.d., Supplementary Tables S3–S7).

Using simulations of mating events within and among populations of E. glabra, the number of individuals that could be assigned at a given confidence threshold was generally lower than suggested by the power simulations. Simulations of E. glabra data indicate that 71.7% and 63.1% of individuals (n=10 000) could be assigned at 95% confidence using total error rates of 0.01 and 0.03, respectively (Table 2). As found in the power testing simulations (Supplementary Information S2), using the correct DRR versus randomly drawing a new set of DRR for each locus (that is, DRR known versus unknown) had little impact on assignment (71.7% and 71.5% assigned at 95% confidence, respectively; Table 2).

Table 2 The percentage of simulated Eremophila glabra individuals assigned to each of the 15 populations at 95, 90 and 80% confidence

With the actual E. glabra individuals, 57% and 77% of 467 individuals could be assigned at 95% and 90% confidence, respectively (with simulated error=0.01). Using a 90% confidence threshold resulted in 2–8% higher immigration rates, suggesting that the stricter 95% threshold may reduce false positives at the expense of having fewer assigned individuals.

Discussion

Ever since the pioneering theory on autotetraploids by Haldane (1930), investigating the population genetics of natural polyploid populations has remained an ongoing challenge for biologists. We provide a new framework for population assignment for autopolyploids that complements existing methods implemented in Structure (Falush et al., 2007). The performance simulations imply that these new methods fill an important gap, enabling population assignment when population differentiation is low and when few polymorphic markers are available. We discuss the main factors that influence the performance of these methods and their application to empirical data. We then conclude by considering current challenges and future directions for population assignment in polyploids.

Accuracy of population assignment methods

Using a likelihood method that utilizes phenotypes and accounts for polysomic inheritance, we show that population assignment with microsatellite markers can reliably detect the origin of individuals or their gametes (for example, pollen). Based on the power simulations, knowledge of the degree of population differentiation, either FST or Rho (Ronfort et al., 1998), can be used to predict the performance of polyploid population assignment. Despite some inherent differences in allelic diversity found in polyploids, these simulations showed similar levels of performance to methods reported for diploids (Rannala and Mountain, 1997; Cornuet et al., 1999). When using maternal parent information, assignment accuracy is high (~95%) at relatively low FST (0.06) and with a modest number of loci (6). For plant studies, this will assist in the improvement of assignment accuracy of pollen dispersal, although assignment of seed will remain more challenging and may require genotyping more loci. Considering that investigations of contemporary dispersal patterns in plants often involve the collection of open pollinated seed arrays from known maternal parents, these results highlight the benefits of including maternal information for population assignment.

The methods we present for autopolyploids, like those of Structure, are not designed to specifically estimate migration rates. However, rough point estimates can be obtained by dividing the number of detected immigrants by the total sample size (Manel et al., 2003). Although it may be tempting to obtain migration rates among polyploid populations with this method, our simulations predict that significant overestimates of migration rates will be obtained using point estimates under some scenarios (that is, 6 loci and FST=0.03). Nevertheless, genotyping more loci can substantially reduce the amount of bias. For example, at FST=0.06 and migration 2%, the ~8% overestimate at six loci reduces to 2% with 12 loci. Until further theory is developed that explicitly models migration rates for autopolyploids (for example, Wilson and Rannala, 2003), similar simulations will be required to assess the power of individual empirical data sets and the extent of the overestimation bias.

Polysomic inheritance, double reduction and unknown allele dosage

Contrary to expectations of substantially lower phenotype performance, we found only small differences between genotype and phenotype methods. Similarly, we found that the uncertainty in the DRR had little impact on the accuracy of population assignment. It is often assumed that these aspects are major limiting factors for population genetic analysis of polyploids. This has led to the development of methods to infer full genotypes by estimating the allele copy number from isozyme band intensity (Young and Brown, 1999) or electropherogram peak areas (Esselink et al., 2004). Methods have also been developed to estimate allele frequencies from phenotype markers using maximum likelihood (De Silva et al., 2005) or iterative-based procedures (Markwith et al., 2006). These methods exhibit their own error and require prior information that may be unavailable (for example, selfing rate); this, together with the small difference we detected between phenotype and genotypes, suggests that our simple approach using marginal allele frequencies may be sufficient for likelihood-based population assignment.

AutoPoly versus Structure

Comparison of the two methods showed that, at low population differentiation, two to three times as many microsatellite loci may be required to use Structure compared with AutoPoly. Although they exhibited near identical levels of accuracy when population divergence was high, AutoPoly performed substantially better for most parameters at moderate (FST=0.06 with <12 loci) to low levels of divergence (FST=0.03) and performed better with phenotypes. With phenotype markers and few loci, we observed greater variance in Q values and in some cases a lack of model convergence (that is, missing data points; Figures 4a and b). It seems unlikely that the lower accuracy of Structure with phenotypes and low information content is due to AutoPoly explicitly incorporating double reduction, as using the correct DRR had little impact on assignment accuracy. One possibility is that phenotype data with low FST makes it difficult for Structure to accurately estimate allele frequencies in each population while simultaneously assigning individuals to clusters. In contrast, for AutoPoly allele frequencies in each of the candidate populations are given, and the assignment directly comes from the probability of obtaining the phenotype given the individual is from each candidate population.
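The idea of assigning directly from the probability of a phenotype given known candidate allele frequencies can be sketched in Python. This is a deliberately simplified illustration and not the AutoPoly likelihood: it assumes that the alleles carried by an individual are independent draws from the population allele frequencies, and it ignores double reduction, maternal information and marker error. All function names and the toy frequencies are hypothetical. Under that assumption, the probability that the set of alleles observed at a locus is exactly S follows from inclusion-exclusion over subsets of S.

```python
from itertools import combinations
from math import log


def phenotype_probability(phenotype, freqs, ploidy):
    """P(observed allele set is exactly `phenotype`) for `ploidy` random draws.

    Inclusion-exclusion over non-empty subsets T of the phenotype S:
    P(S) = sum_T (-1)^(|S|-|T|) * (sum of frequencies in T)^ploidy.
    Simplification: independent draws, no double reduction, no error model.
    """
    alleles = sorted(set(phenotype))
    if not alleles or len(alleles) > ploidy:
        return 0.0
    total = 0.0
    for size in range(1, len(alleles) + 1):
        sign = (-1) ** (len(alleles) - size)
        for subset in combinations(alleles, size):
            total += sign * sum(freqs.get(a, 0.0) for a in subset) ** ploidy
    return max(total, 0.0)  # guard against tiny negative rounding error


def log_likelihood(individual, pop_freqs, ploidy, floor=1e-12):
    """Sum of per-locus log probabilities; `floor` avoids log(0)."""
    return sum(
        log(max(phenotype_probability(ph, fr, ploidy), floor))
        for ph, fr in zip(individual, pop_freqs)
    )


def most_likely_population(individual, populations, ploidy):
    """Return the candidate population with the highest log-likelihood."""
    scores = {
        name: log_likelihood(individual, freqs, ploidy)
        for name, freqs in populations.items()
    }
    return max(scores, key=scores.get), scores


# Invented allele frequencies for two candidate populations at one locus.
populations = {
    "A": [{"a": 0.7, "b": 0.2, "c": 0.1}],
    "B": [{"a": 0.1, "b": 0.2, "c": 0.7}],
}
individual = [{"a", "b"}]  # phenotype: alleles a and b seen, dosage unknown
best, scores = most_likely_population(individual, populations, ploidy=4)
print(best, scores)
```

Because the allele frequencies are fixed inputs rather than quantities estimated jointly with cluster membership, the calculation reduces to evaluating one likelihood per candidate population.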

Empirical example

With the autohexaploid E. glabra, population assignment with six SSR markers using phenotypes with maternal information could identify the origin of just over half (57%) of the offspring samples at 95% confidence. This was less than the proportion predicted from simulating the actual E. glabra data, which suggested that 63% of the offspring could be assigned at 95% confidence. The simulations suggest that the level of population differentiation and number of loci currently available for the E. glabra data set provide too little power to assign individuals with high accuracy. This effect has been noted previously in diploid organisms (see Rannala and Mountain, 1997) and may be partially overcome by genotyping more loci. The decision on whether to generate more marker data or increase confidence thresholds (at the cost of fewer assignable individuals) will depend on the biological question and the importance of minimizing false positives versus false negatives.

Lower assignment success in the Eremophila data than in the power simulations also suggests that some complexities encountered in natural populations may need to be incorporated into the theory. Contributing factors likely include higher marker error rates (including null alleles), the presence of close relatives and more variable population differentiation among natural populations. We also assume that all possible source populations have been sampled, populations are randomly mating and dispersal occurs randomly with respect to the surrounding populations. However, isolation by distance and recent bursts of dispersal may generate complex genetic compositions in the candidate reference data (that is, due to recent admixture) and the offspring pool. Incorrect assumptions about the demography and extent of relatedness among individuals can result in mis-specification of the simulations and introduce bias into estimates of the CIs. Future efforts will be required to quantify which of these factors contribute most to the overall number of type I and type II errors and what level of migration we can expect to detect for a given set of population parameters and sample size.

Conclusions

Population assignment using genetic markers holds the promise of rapid estimation of contemporary dispersal patterns in natural populations. There are, however, significant challenges when applying these methods to natural polyploid populations. Further development of the theory may benefit from explicitly modelling error rates, double reduction and the probability of unsampled data (alleles and candidate populations). This could be achieved in a full Bayesian framework (for example, Hadfield et al., 2006), although the computational burden of higher ploidy levels and genotype uncertainty would make this a non-trivial task. As for diploids, marker power and sampling strategy will determine what level of accuracy can be achieved and how many individuals can be assigned with a high degree of confidence. Trade-offs exist between the number of individuals sampled for estimating allele frequencies and the number of individuals used to assess mating patterns (Meirmans, 2015). This will also depend on the ecology (for example, dispersal vector) of the organism and the demographic context in which each discrete population resides. When dispersal rates are low among populations, few immigrants will be generated, making it difficult to quantify migration rates without very large sample sizes (Manel et al., 2003). We should therefore always remain cautious when inferring dispersal rates from genetic data and interpret these patterns using knowledge about the ecology and demography of the study organism.

Data archiving

The AutoPoly R package that runs the population assignment methods described here (including documentation and example files) and the empirical data sets are available at the Dryad Digital Repository (http://datadryad.org/), via http://dx.doi.org/10.5061/dryad.bc498. Updated versions of the program are also available from the Comprehensive R Archive Network (https://cran.r-project.org) or https://github.com/dfield007/AutoPoly.

Conflict of interest

The authors declare no conflict of interest.