Using linked markers to estimate the genetic age of a volunteer population: a theoretical and empirical approach

Ostrowski, M-F; Rousselle, Y; Tsitrone, A; Santoni, S; David, J; Reboud, X; Muller, M-H

doi:10.1038/hdy.2009.156

Original Article
Published: 09 December 2009

Heredity volume 105, pages 358–369 (2010)

Abstract

Volunteers deriving from unharvested seeds of a crop can lead to persistent feral populations and participate in genetic exchanges across the agro-ecosystem, both between crop varieties and between crops and their wild relatives. A first step to understand the importance of volunteers is to characterize their capacity to reproduce autonomously for several generations. For that purpose, we constructed and evaluated a maximum-likelihood method to estimate the genetic age of a population deriving from one of the most common field crop types: an F1-hybrid variety. The method estimates the number of reproduction cycles that occurred since the cultivation of that variety. It makes use of genotypic data at a number of linked microsatellite loci pairs, thus exploiting the recombination of parental haplotypes, which is expected to occur as the population is reproducing. Estimates with moderate bias and variance were found for a broad range of parameter values in simulations, and the method proved robust to some deviations from the assumptions of the underlying model. We propose a specific procedure to test the hypothesis of persistence, that is, has a given volunteer population experienced more than one cycle of reproduction since the F1-hybrid state? The method was applied to both an experimental and a natural sunflower volunteer population and proved promising, considering these ideal case studies. Possible further developments toward more complex natural systems are discussed.

Introduction

Crops result from the domestication of wild plant species and are often claimed to be maladapted to natural environments. However, crop-derived plants developing without having been intentionally sown are commonly observed in fallows, field margins and within the field itself. Such plants, referred to as volunteers, generally originate from seed loss at harvest or during transport, the best-known example being oilseed rape (Crawley and Brown, 2004). The population dynamics of volunteers is still poorly documented, but increased attention has recently been given to both their potential for autonomous evolution and their function in genetic exchanges across the agro-system. Volunteers may lead to management problems for farmers. They can also be involved in gene flow across the agro-ecosystem through their contribution to the pollen pool (Devaux et al., 2005). For instance, volunteers acting as a genetic bridge between different members of the crop–weed–wild complex (Reagon and Snow, 2006) could impede any desired genetic isolation between GM and non-GM varieties. Moreover, if volunteers naturalize and constitute self-perpetuating feral populations, they could freely evolve and develop weedy characters (Londo and Schaal, 2007; Bagavathiannan and Van Acker, 2008).

The persistence of feral populations has been assessed through multi-year demographic surveys. This way, Crawley and Brown (2004) showed that oilseed rape populations are not self-replacing and that they rely on the introduction of new seeds (for example, losses from trucks). However, using biochemical markers, Pessel et al. (2001) showed that some feral populations presented original genetic characteristics that readily differ from any varieties cultivated in the studied area over the last 8 years, showing that a variety can influence the composition of the agro-system many years after its cultivation.

To explore the potential for autonomous genetic evolution, one decisive question can be formulated as follows: is a given volunteer population a 1-year transitory crop descent, or does it result from more than one cycle of reproduction? Molecular markers and population genetics tools can be helpful to address this question. Indeed, methods relying on multilocus genotypic data have been developed to investigate the recent history of natural populations (for example, Cornuet et al., 1999; Wilson and Rannala, 2003; Excoffier et al., 2005). Cultivated plants differ from their wild relatives in their genetic structure. Nowadays, a typical field crop consists of a population of genetically homogeneous individuals resulting from the initial cross of two inbred lines (F1 hybrid, for example sunflower, oilseed rape, maize). As a consequence, volunteers deriving from a given field crop will typically develop into an out-of-equilibrium population. Methods of inference then have to be adapted or specifically designed (for example, Devaux et al., 2005; Devaux et al., 2007). Fortunately, one advantage of these cultivated species is that a great deal of molecular information is often available, such as a wealth of mapped molecular markers (Koopman et al., 2007).

This paper describes a model-based method to estimate the genetic age of a volunteer population deriving from an F1-hybrid field crop, that is, the number of reproduction cycles that occurred since the cultivation of the F1-hybrid individuals. This method relies on the genotypic information from markers linked by a known genetic distance and was applied to both an experimental and a natural volunteer population of sunflower. The performance and potential limits of the method, depending on various parameters of the model, are interpreted and discussed.

Materials and methods

F1-hybrid model: assumptions and bilocus expectations

An F1-hybrid variety results from the cross between two inbred lines. Both parental lines are theoretically genetically fixed and homozygous at all loci; the F1-hybrid variety is thus expected to include a single genotype, heterozygous at the loci that are polymorphic among the parental lines. An interesting property is that for a pair of linked polymorphic loci, perfect association, that is, maximum linkage disequilibrium, is expected between alleles inherited from each of the two inbred lines. When the variety reproduces, this linkage disequilibrium decreases from generation to generation and it is possible to give the theoretical expectation of the bilocus genotypic frequencies as a function of both the recombination rate between the loci and the selfing rate of the population.

Considering two loci with two alleles, let the gametic phase (the haploid two-locus genotype in each parental line) be [AB] and [ab]. The genotype of the F1 hybrid is thus AB/ab. There are 10 possible two-locus genotypes {AB/AB, ab/ab, Ab/Ab, aB/aB, AB/Ab, AB/aB, ab/Ab, ab/aB, AB/ab, Ab/aB} in frequencies {x_1,t, x_2,t, ..., x_10,t} at generation t. Let r be the recombination rate between the two loci and s the selfing rate of the population. At generation t+1:

$$x_{t+1} = (1 - s)\,\phi(M_1 x_t) + s\,M_2 x_t \qquad (1)$$

with x_t corresponding to the column vector of genotypic frequencies at time t, M₁ and M₂ to the transition matrices applied to x_t under outcrossing and selfing, respectively, and ϕ to a mapping defined from [0,1]⁴ (that is, from four gametes) to [0,1]¹⁰ (that is, to 10 bilocus genotypes). Details are given in the appendix. Using recursion Equation (1) and any given initial state, the expected bilocus genotypic frequencies can be computed for any time t, assuming no genetic drift, no mutation and no gene flow from other populations. A representation of the expected changes in frequencies is given in Figure 1 for contrasting parameter values. By convention, the initial state (F1 hybrid) corresponds to t=1. In our model, t will be referred to as the genetic age of the volunteer population.
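The recursion can be illustrated with a minimal Python sketch. It does not reproduce the matrices M₁ and M₂ of the appendix; instead it builds each generation directly from gamete frequencies, under the stated assumptions (no drift, no mutation, no gene flow). All function and variable names (gametes, unite, next_generation, frequencies_at) are hypothetical and chosen for illustration only.

```python
from itertools import combinations_with_replacement

# Haplotypes coded as (allele at locus 1, allele at locus 2); 1 = upper case.
HAPS = [(1, 1), (1, 0), (0, 1), (0, 0)]                    # AB, Ab, aB, ab
GENOS = list(combinations_with_replacement(range(4), 2))   # 10 unordered pairs


def gametes(h1, h2, r):
    """Gamete frequencies produced by genotype h1/h2 with recombination rate r."""
    (a1, b1), (a2, b2) = HAPS[h1], HAPS[h2]
    out = [0.0] * 4
    for hap, p in (((a1, b1), (1 - r) / 2), ((a2, b2), (1 - r) / 2),
                   ((a1, b2), r / 2), ((a2, b1), r / 2)):
        out[HAPS.index(hap)] += p
    return out


def unite(g1, g2):
    """Genotype frequencies from the random union of two gamete pools."""
    z = dict.fromkeys(GENOS, 0.0)
    for i in range(4):
        for j in range(4):
            z[(min(i, j), max(i, j))] += g1[i] * g2[j]
    return z


def next_generation(x, r, s):
    """One reproduction cycle: a fraction s of selfing, 1 - s of random outcrossing."""
    pool = [0.0] * 4
    new = dict.fromkeys(GENOS, 0.0)
    for (h1, h2), f in x.items():
        g = gametes(h1, h2, r)
        for k in range(4):
            pool[k] += f * g[k]                 # contribution to the outcrossing pool
        for geno, p in unite(g, g).items():
            new[geno] += s * f * p              # selfed progeny of this genotype
    for geno, p in unite(pool, pool).items():
        new[geno] += (1 - s) * p                # outcrossed progeny
    return new


def frequencies_at(t, r, s):
    """Expected frequencies at genetic age t, with the F1 hybrid AB/ab at t = 1."""
    x = dict.fromkeys(GENOS, 0.0)
    x[(0, 3)] = 1.0
    for _ in range(t - 1):
        x = next_generation(x, r, s)
    return x
```

The nine observable classes are obtained by summing the two double heterozygotes, here keyed (0, 3) for AB/ab and (1, 2) for Ab/aB. At t=2 the result does not depend on s, because selfing and outcrossing of a single heterozygous genotype yield the same progeny.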

Presentation of the method

The method considers a sample drawn from a volunteer population and analyzes the observed genotypic frequencies at K independent pairs of physically linked loci. The principle is to compute the probability of the observed frequencies given the genetic age t using the expected bilocus genotypic frequencies at time t (Equation (1)). We defined t̂, an estimator of the genetic age, as the t value maximizing the probability of data. Box 1 provides the definitions of all parameters.

Let D_j={n_1,j, n_2,j, …, n_9,j} be the vector of the numbers of sampled genotypes in each genotypic class, for the jth pair of loci. We consider here only nine genotypic classes, because the trans-heterozygotes are not distinguishable from the cis-heterozygotes when using usual laboratory techniques. D_j follows a multinomial distribution with parameters n_.j, the total sample size for the jth pair of loci, and the vector {p_1,j,t, p_2,j,t, …, p_9,j,t}, the expected relative frequencies of the nine observable genotypic classes at time t. For i=1 to 8, p_i,j,t=x_i,t and p_9,j,t=x_9,t+x_10,t, where the x_i,t are frequencies predicted from Equation (1), given r_j the recombination rate between the two loci of pair j and s the selfing rate of the population. The likelihood of t is defined as follows:

$$L(t \mid D_j) = \frac{n_{.j}!}{\prod_{i=1}^{9} n_{i,j}!} \prod_{i=1}^{9} p_{i,j,t}^{\,n_{i,j}} \qquad (2)$$

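For K independent pairs of loci, the likelihood of t becomes

$$L(t \mid D_1, \ldots, D_K) = \prod_{j=1}^{K} \left[ \frac{n_{.j}!}{\prod_{i=1}^{9} n_{i,j}!} \prod_{i=1}^{9} p_{i,j,t}^{\,n_{i,j}} \right] \qquad (3)$$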

The maximum-likelihood estimator t̂ is obtained by a numerical exploration over the interval [2, 40] of t∈ℕ^*.
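The numerical exploration can be sketched as a grid search over integer ages. In the sketch below, expected(j, t) stands for any routine returning the expected class frequencies of pair j at age t (for instance an implementation of the recursion of Equation (1)); toy_expected is a deliberately simplified two-class placeholder used only to make the example runnable, not the bilocus model. The multinomial coefficient is dropped because it does not depend on t. All names are hypothetical.

```python
import math


def log_likelihood(counts, probs):
    """Multinomial log-likelihood up to a constant that does not depend on t."""
    total = 0.0
    for n, p in zip(counts, probs):
        if n == 0:
            continue
        if p <= 0.0:
            return float("-inf")   # an observed class with zero expected frequency
        total += n * math.log(p)
    return total


def estimate_age(data, expected, ages=range(2, 41)):
    """Return the integer age in [2, 40] maximizing the joint log-likelihood."""
    def joint(t):
        return sum(log_likelihood(d, expected(j, t)) for j, d in enumerate(data))
    return max(ages, key=joint)


# Placeholder model (assumption): two classes per pair, "intact" vs "recombined".
RATES = [0.10, 0.15, 0.20]


def toy_expected(j, t):
    kept = (1 - RATES[j]) ** (t - 1)
    return [kept, 1 - kept]


# Idealized data: exact expected counts for 40 individuals at age 5.
data = [[40 * p for p in toy_expected(j, 5)] for j in range(3)]
t_hat = estimate_age(data, toy_expected)
```

Summing log-likelihoods over pairs corresponds to the product over the K independent pairs of loci.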

Simulation study

Effects of the recombination rate, selfing rate, drift and sampling strategy on the variance and bias of the estimator were investigated using samples drawn from simulated populations. A simulated population of age t consisted of a list of K vectors of bilocus genotypic frequencies (p_i,j,t in Equation (2)), corresponding to K independent pairs of loci. Hereafter, the term frequencies will be used to refer to bilocus genotypic frequencies. Simulation parameters were a selfing rate s, a drift parameter N_c, a vector of K recombination rates r_j and a list of K vectors of initial frequencies p_i,j,1 (typically, 100% of double heterozygotes). Frequencies for each pair of loci were simulated independently using an iterative procedure mimicking Equation (1) (programmed using Mathematica; Wolfram, 1996): frequencies at time t+1 were generated by (i) computing the frequencies in the zygotes at time t+1 using frequencies at time t, (ii) random sampling of N_c genotypes in the resulting multinomial frequency distribution and (iii) dividing the resulting vector of genotype numbers by N_c. The drift parameter N_c thus corresponds to the number of zygotes drawn to constitute the next generation. In a completely outbreeding population, N_c is equivalent to the effective population size N_e, whereas N_e is lower than N_c in a selfing population (N_c=N_e+N_e s/(2−s)). When simulating a population without drift, the N_c sampling step was omitted: hereafter, this simulation modality will be referred to as the deterministic model. To produce the samples used to estimate t, n individual genotypes were drawn for each pair of loci from the simulated frequencies. We will refer to the product of n times K as the sampling effort E, and to the couple (n, K) as the sampling strategy. Variance and bias of t̂ were estimated using 1000 simulated populations for each set of [r_j, s, N_c, (n, K)] parameters.
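Steps (ii) and (iii) of the drift simulation, together with the relation between N_c and N_e, can be sketched as follows. This is an illustrative Python version, not the original Mathematica program; the function names and the three-class example frequencies are hypothetical.

```python
import random


def drift_step(freqs, n_c, rng):
    """Draw n_c zygotes from the expected frequencies and return realized frequencies."""
    classes = list(freqs)
    draws = rng.choices(classes, weights=[freqs[c] for c in classes], k=n_c)
    counts = dict.fromkeys(classes, 0)
    for c in draws:
        counts[c] += 1
    return {c: counts[c] / n_c for c in classes}


def zygotes_from_effective(n_e, s):
    """N_c = N_e + N_e * s / (2 - s), as stated in the text."""
    return n_e + n_e * s / (2 - s)


rng = random.Random(1)  # fixed seed so that the example is reproducible
expected = {"AB/AB": 0.25, "AB/ab": 0.50, "ab/ab": 0.25}  # toy frequencies
realized = drift_step(expected, 50, rng)
```

Omitting the call to drift_step corresponds to the deterministic model.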

We first studied the influence of r_j using the deterministic model and defined a value r^* minimizing the variance of t̂ (hereafter noted V(t̂)) across both the explored interval of age t and contrasting values of selfing rates s: {0, 0.25, 0.5, 0.75}. The maximum value of s that was explored was 0.75, because F1-hybrid varieties are seldom developed for highly selfing species (but see Virmani, 1994). The effects of the other parameter values on the behavior of t̂ were studied using r^*. For both the drift parameter N_c and the sampling effort E, three values were used: {50, 200, infinite} and {200, 400, 800}, respectively. Two sampling strategies (n, K) were contrasted for E=400: {(40, 10), (80, 5)}. Finally, the explored range of variation of true genetic age t was restricted to [2, 7] mainly because our model assumes that the population is isolated from other sources of pollen, a hypothesis that does not seem reasonable in an agro-system for more than a few generations. However, we present in Supplementary Information a summary of the results obtained for older populations (t⩽14).

The method requires estimates of s and r_j. To investigate the robustness of the estimator to errors made on these parameters, we simulated populations using some true parameter values, whereas likelihoods were computed using values deviating from the true values. We explored the effect of the uncertainty of estimates of the recombination rate r arising from mapping studies. We used (Lorieux, 1994)

as an approximation of the standard error of this parameter, with L standing for the number of recombinant inbred lines used to estimate r (Tang et al., 2002). The vector of recombination rates used for the simulations was obtained by drawing the K r_j values from a normal distribution with parameters (r^*, SE_r*), whereas t̂ was estimated using r^*. For the selfing rate s, we focused on the downward bias, which typically results from using F_IS-based estimates under a false assumption of inbreeding equilibrium (see below as well as Jarne and David, 2008).
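To give an order of magnitude of this uncertainty, the sketch below uses a textbook delta-method approximation for recombinant inbred lines obtained by selfing, based on the relation R = 2r/(1+2r) between the recombinant fraction R among lines and the per-meiosis rate r. This is an assumption made for illustration: it may differ from the exact expression of Lorieux (1994), and the function name is hypothetical.

```python
import math


def se_recombination_ril(r, n_lines):
    """Approximate SE of r estimated from n_lines selfed RILs (assumed formula)."""
    big_r = 2 * r / (1 + 2 * r)                          # recombinant fraction among RILs
    se_big_r = math.sqrt(big_r * (1 - big_r) / n_lines)  # binomial standard error of R
    return se_big_r / (2 * (1 - big_r) ** 2)             # dr/dR for r = R / (2 (1 - R))


se = se_recombination_ril(0.15, 94)  # about 0.037 under these assumptions
```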

Hypothesis testing and confidence interval

When addressing the question of whether or not a volunteer population is self-perpetuating, the relevant point is to determine whether it has experienced more than one cycle of reproduction, the first cycle having taken place in the cultivated field. In other terms, if adult volunteer plants have been sampled, we need to test if t>2, and if the sample consists of seeds produced by the volunteers, we need to test if t>3.

For that purpose, we propose constructing the empirical distribution of t̂ under the appropriate null hypothesis H₀, including the uncertainty on the parameters r_j and s. The H₀ distribution is obtained by estimating t in a large number of simulated populations (say ⩾1000) of age t=2 (or t=3). Each population is simulated using the same sampling strategy (n, K) as the studied population and drawing s and all r_j values at random from normal distributions with parameters (ŝ, SE_ŝ) and (r̂_j, SE_r̂j), respectively. The critical value t_crit of the H₀ distribution is then determined, so that the probability of t̂⩾t_crit under H₀ is ⩽5%. If the estimated age, noted t̂^*, is ⩾t_crit, then the test is considered significant at the 5% level.
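The determination of t_crit from an empirical H₀ distribution can be sketched as follows. The list of H₀ estimates is a made-up example, not simulation output from this study, and the function names are hypothetical.

```python
def critical_value(h0_estimates, alpha=0.05):
    """Smallest integer c such that the share of H0 estimates >= c is <= alpha."""
    n = len(h0_estimates)
    for c in range(min(h0_estimates), max(h0_estimates) + 2):
        tail = sum(1 for e in h0_estimates if e >= c) / n
        if tail <= alpha:
            return c


def is_significant(t_hat, h0_estimates, alpha=0.05):
    """One-sided test: reject H0 when the estimated age reaches the critical value."""
    return t_hat >= critical_value(h0_estimates, alpha)


# Made-up H0 distribution for 't=2': 90% of estimates at 2, 7% at 3, 3% at 4.
h0 = [2] * 900 + [3] * 70 + [4] * 30
t_crit = critical_value(h0)
```

In this toy distribution, 10% of the estimates are at least 3 and 3% are at least 4, so the critical value is 4.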

The effect of drift can be included in the testing procedure. However, because estimates of the effective population size N_e are rarely available, we propose to compare H₀ distributions obtained simulating populations with contrasting values of the drift parameter N_c.

We compared the H₀ distributions obtained with 1000 simulated populations, setting (n, K)=(40, 10) and r_j=r* for all j, and all possible combinations of N_c∈{50, 200, infinite}, s∈{0, 0.5}, SE_r∈{0, 0.036} and SE_s∈{0, 0.1} (for s=0.5 only). The type I error (the false positive rate) and the power of the test (1 minus the false negative rate) were empirically determined for samples drawn from populations of true age t∈[3, 7] simulated using all the above sets of parameter values.

The change in type I error arising from using a downwardly biased F_IS-based estimate of the selfing rate was empirically determined for a specific case. We simulated samples from t=2 and 3 populations using s=0.75. F_IS-based estimates would typically yield a value of ŝ=0.55 in the t=3 samples. We then applied the testing procedure to these samples by simulating the appropriate H₀ distributions using ŝ=0.75 vs ŝ=0.55. Other simulation parameter values were (n, K)=(40, 10), SE_r=0.036, SE_s=0.1.

In addition, the 95% confidence interval (CI) of t̂^* may be constructed by estimating t in a large number of simulated populations of successive ages: t=t̂^*−1, t̂^*−2, … for the lower bound and t=t̂^*+1, t̂^*+2, … for the upper bound. The lower bound of the CI is then the smallest t value for which <97.5% of the simulations yielded estimates <t̂^*, and the upper bound is the largest t value for which <97.5% of the simulations yielded estimates >t̂^*.
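This construction can be sketched in Python as follows, assuming that the estimates obtained from simulated populations of each candidate age are already available. The dictionary sims is a made-up example and the function name is hypothetical.

```python
def confidence_interval(t_hat, sims, level=0.975):
    """CI bounds from simulated estimates; sims maps a true age to a list of estimates."""
    def share(values, predicate):
        return sum(1 for v in values if predicate(v)) / len(values)

    # Ages not rejected from below: fewer than 97.5% of estimates fall under t_hat.
    lower = min(t for t, est in sims.items()
                if share(est, lambda v: v < t_hat) < level)
    # Ages not rejected from above: fewer than 97.5% of estimates exceed t_hat.
    upper = max(t for t, est in sims.items()
                if share(est, lambda v: v > t_hat) < level)
    return lower, upper


# Made-up simulation output for an estimated age of 4.
sims = {
    2: [2] * 100,
    3: [3] * 95 + [4] * 5,
    4: [3] * 20 + [4] * 60 + [5] * 20,
    5: [4] * 20 + [5] * 80,
    6: [4] * 5 + [6] * 95,
    7: [7] * 100,
}
ci = confidence_interval(4, sims)
```

With these toy values, ages 2 and 7 are excluded because all their simulated estimates fall strictly below and above 4, respectively.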

Empirical study system

Almost all sunflower (Helianthus annuus) varieties cultivated nowadays in Western Europe are F1 hybrids. As opposed to the wild self-incompatible H. annuus, these sunflower varieties can self at an unknown rate (Gandhi et al., 2005).

We applied the method to two volunteer populations of sunflower. We first analyzed three generations of an experimental population conducted as follows. In 2001, the F1-hybrid variety Prodisol (DEKALB) was cultivated in a 0.6 ha field of the experimental domain of INRA (Epoisses, France). The harvest of this variety was called generation G2. In 2002, soybean was sown on this field and volunteer sunflowers grew at a density varying from 0.5 to 4 plants per m². These volunteers originated from seeds lost at harvest and were then representative of generation G2. The harvest of 335 of these plants was bulked and constituted generation G3. During these 2 years, no other sunflower field occurred at a distance of <400 m. In 2004, 350 seeds of generation G3 were sown under a pollen-proof tunnel at INRA Mauguio (France). Honeybees were introduced during the flowering period to ensure free intercrossing of the plants. The harvest of these plants was bulked and constituted generation G4.

In 2004, a natural volunteer population (FR001) was sampled in a fallow close to Saint Laurent d’Aigouze (France). Hundreds of volunteers occurred at varying densities on an area of ∼3 ha. Independent maternal families were sampled from 44 plants all over the population.

About 60 to 71 seeds per generation (G2 to G4) and one seed per maternal family for FR001 were sown for analysis. DNA was isolated from about 100 mg of plant leaves using the DNeasy Plant Mini kit (Qiagen, GmbH, Hilden, Germany) with the following modification: 1% of polyvinylpyrrolidone (PVP 40 000) was added to buffer AP1.

Both independent microsatellite loci and pairs of loci located at different genetic distances were selected from Tang et al. (2002). Care was taken to choose loci with no earlier evidence of null allele (Tang and Knapp, 2003; Tang et al., 2003). Twenty-three loci were screened on eight G2 individuals and only polymorphic loci were retained. Seventeen loci have been used on the whole data set (Table 1).

Table 1 Microsatellite loci analyzed in this study

The amplification reaction consisted of 50 ng DNA, 4 pmol of unlabeled reverse primer, 2 pmol of forward primer, fluorescently labeled with NED, HEX or FAM, 1 × reaction buffer, 2 mM MgCl₂, 200 μM dNTP, 0.25 U Taq DNA polymerase in a total volume of 25 μl. The amplification program was 95 °C for 2 min, 36 cycles of 94 °C for 30 s, Tx for 30 s (Tx is initially 63 °C and decreases by 1 °C per cycle over the first six cycles, until it reaches 57 °C) and 72 °C for 45 s, followed by a final extension for 20 min at 72 °C. Electrophoresis was performed on an ABI 3130xl Genetic Analyser. Samples were prepared by adding 3 μl of diluted PCR products to 6.875 μl formamide and 0.125 μl of GenScan 400HD Rox size standard. The GENEMAPPER software (Applied Biosystems, Foster City, CA, USA) was used to analyze the DNA fragments and to score the genotypes.

Data analysis

Mean number of alleles per locus, multilocus heterozygosity and the fixation index F_IS were estimated for each generation over a subset of 11 independent loci, separated by at least 36 cM according to Tang et al. (2002) (Table 1), using GENETIX (Belkhir et al., 2001). The significance and the sampling variance of the estimated F_IS values were assessed using 1000 permutations of alleles among individuals and a jackknife procedure over the loci, respectively.

For both the experimental and natural volunteer populations, estimates of s and approximate standard errors were computed on the assumption of inbreeding equilibrium (Jarne and David, 2008), namely, for the point estimate,

$$\hat{s} = \frac{2\hat{F}_{IS}}{1 + \hat{F}_{IS}} \qquad (5)$$

To determine the magnitude of the directional error that was made when assuming inbreeding equilibrium, the selfing rate between two successive generations of the experimental population was also estimated using these F_IS values and the recurrence formula on the deterministic evolution of F_IS (Crow and Kimura, 1970):

$$F_{IS,t+1} = \frac{s}{2}\left(1 + F_{IS,t}\right) \qquad (6)$$
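The two ways of estimating s from F_IS can be compared with a short Python sketch. It relies on the standard mixed-mating relations (s = 2F/(1+F) at inbreeding equilibrium, and F at generation t+1 equal to (s/2)(1+F) at generation t), and assumes F_IS = 0 in the progeny of the F1 hybrid (t=2). Function names are hypothetical.

```python
def selfing_from_fis_equilibrium(f_is):
    """Selfing rate implied by F_IS if the population were at inbreeding equilibrium."""
    return 2 * f_is / (1 + f_is)


def selfing_from_successive_fis(f_prev, f_next):
    """Selfing rate implied by F_IS in two successive generations."""
    return 2 * f_next / (1 + f_prev)


def fis_trajectory(s, t_max):
    """Expected F_IS at ages 2..t_max, starting from F_IS = 0 at t = 2 (assumption)."""
    f, out = 0.0, {2: 0.0}
    for t in range(3, t_max + 1):
        f = s / 2 * (1 + f)
        out[t] = f
    return out


f3 = fis_trajectory(0.75, 3)[3]                  # expected F_IS at t = 3 for s = 0.75
s_eq = selfing_from_fis_equilibrium(f3)          # equilibrium-based estimate
s_rec = selfing_from_successive_fis(0.0, f3)     # recurrence-based estimate
```

For s=0.75, the equilibrium-based estimate at t=3 is about 0.55, which illustrates the downward bias discussed in the text, whereas the recurrence-based estimate recovers 0.75.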

For the experimental population, the variance effective population size N_e was estimated using the temporal variation of allelic frequency at the 11 independent loci. We used the estimator of Waples (1989) based on F_c, the standardized variance in allelic frequencies between sampled generations. The 95% CI on N_e was obtained from the simulation of the actual distribution of F_c based on the estimated N_e, as suggested by Goldringer and Bataillon (2004). This method was implemented in the program kindly provided by Mathieu Siol and described in Siol et al. (2007). N_e was estimated for two pairs of samples: G2 and G3, G3 and G4. Rare alleles were pooled. For the natural population of volunteers FR001, we used 1/10 of the demographic size as an order of magnitude of N_e (Frankham, 1995), using one individual per 5 m² as a conservative estimate of the mean density.

Application of the age estimation method

We first discarded genotypes carrying rare alleles, which were present in only one or two individuals over the whole data set; they were interpreted as contamination (gene flow from distant sunflower field) or mutation.

The genetic model underlying the estimation method assumes that all loci are biallelic with equal initial allelic frequencies. However, many loci displayed more than two alleles and allelic frequencies were sometimes unbalanced. We interpreted these deviations as the consequence of incomplete fixation of the inbred lines used to produce the variety, as already described by Zhang et al. (2005). Indeed, the fixation of the parental lines is generally assessed by breeders using morphological rather than molecular markers. To uncover the haplotypic structure of the F1 hybrids, we first determined parental haplotypes and their frequencies by estimating correlation coefficients between pairs of alleles at different loci within each linkage group (LINKDIS program implemented in GENETIX, Garnier-Gere and Dillmann, 1992). We then fused some allelic classes to transform parental haplotypes into tractable initial bilocus genotypes in the F1 generation, as illustrated in Figure 2a. This transformation was possible for six pairs of loci; that is, after fusing the appropriate allelic classes, the loci of these pairs were biallelic with balanced allele frequencies. These six pairs involved 11 different loci (Table 1). The number of observations for each bilocus genotypic class (that is, the n_i,j in Equation (3)) was computed and used as the input for the application of the age estimation method.

The same computations were made for the natural volunteer populations. Only four loci pairs were compatible with the genetic model and were thus used for the estimation of the genetic age (Table 1).

For each sample, we determined the 95% CI of t, and tested the null hypothesis ‘t=2’ for the experimental population or ‘t=3’ for the natural population, as described above. The standard error of each r_j value was approximated setting L=94 (Tang et al., 2002) in Equation (4).

Interpretation of the genotypic data and robustness to deviation from the assumed genetic structure of the variety

The genotypic frequencies with several loci presenting three alleles could not fit the hypothesis of a pure F1 hybrid. To further explore the origin of the unexpected polymorphism in the experimental volunteer population, we first looked for significant associations between the less common alleles (frequency between 10 and 20%) even if the corresponding loci were not mapped on the same linkage group. Visual investigation of the data set showed that the multilocus genotypes were compatible with the following hypothesis: the F1 hybrid in the field was actually a mixture of two F1-hybrid genotypes, that is, a prevailing one (Prodisol) together with a less abundant one (hereafter referred to as ‘contaminant’). Their multilocus genotypes could be reconstituted (for example, Figure 2b). The multilocus genotypes in G2 could then be partitioned into three groups of progenies: Prodisol, contaminant and intercrossed between the two varieties. As this interpretation contrasted with the assumptions made on the genotype of the variety, we evaluated its consequences on the outputs of the age estimation method. We determined the actual composition of the initial field using the same recoding strategy of the data as above and simulated populations deriving from such a field. Namely, we considered a variety that was an admixture of two F1 hybrids in frequencies 80/20%. The genotypes of varieties 1 and 2 were AB/ab and AB/aB, respectively (as in Figure 2b) for four pairs of loci, ab/ab and AB/AB for one pair, and AB/ab and AB/ab for the last one. Samples were simulated in accordance with our actual dataset ((n, K)=(60, 6)) and analyzed using the age estimation method (the likelihoods were computed under the assumption of a pure F1 hybrid).

Results

Simulation results

Empirically determined bias and variance

Simulations showed that t̂ was essentially an upwardly biased estimator, although only slightly for most explored sets of parameter values (Figure 3a). Bias tended toward zero when increasing E under the deterministic model, showing that t̂ was an asymptotically unbiased estimator (for example Figure 3a). The magnitude of the bias of t̂ was positively correlated to its variance (not shown). A brief description of the empirical distribution of t̂ across different sets of parameter values is given in Table 2 for E=400. For all explored values of t⩽7, s⩽0.75 and r_j⩽0.15, the observed range of bias was [−0.04, 0.54]. The variance of t̂ (also noted V(t̂)), behaved as a strictly non-linear increasing function of t (Figure 3b).

Table 2 Range, median, first and third quartile [25, 75%] of the empirical distribution of bias and variance of t̂, across different sets of parameters using the deterministic model and E=400

The recombination rate within pairs of loci r_j had a considerable effect on V(t̂). As illustrated in Figure 4, for a given set of other parameters values, there is an r_j value minimizing V(t̂), a value referred to as optimal r. The optimal r behaved as a decreasing function of t, but also as a slightly increasing function of the selfing rate s. We chose r^*=0.15 as an approximation of the value associated to minimum V(t̂) across the explored range of both the age t∈[2,7] and selfing rate s ∈{0, 0.25, 0.5, 0.75} (Figure 4). All following results were obtained using r^*.

The selfing rate s affected the variance of t̂ in an age-dependent manner. For small values of t, higher selfing rates resulted in lower V(t̂). However, this relationship was reversed for larger values of t (Figure 4). The age at which the relationship switched was larger for higher selfing rates and can be related to the approach of inbreeding equilibrium. This equilibrium corresponds to the equilibrium frequency of heterozygous genotype at single loci; it depends on s and is reached earlier for lower values of s. For instance, at t=4 for s=0.5 and at t=5 for s=0.75, we noted that inbreeding equilibrium was virtually reached (that is differences between theoretical frequencies at that age and equilibrium frequencies are less than sampling noise for E=400).

Increasing the sampling effort E or the drift parameter N_c resulted in a reduction of V(t̂) and the associated 95% envelope of the distribution of the estimator (Figures 3b and 5). For a given value of E, V(t̂) proved sensitive to the sampling strategy (n, K), but only for populations simulated with drift (N_c≠infinite): in this case, increasing the number of pairs of loci resulted in a decrease of V(t̂) (Figure 5).

Robustness to deviations from true parameter values

Overestimating the selfing rate s resulted in underestimating t. Conversely, underestimating s led to overestimating t, but only until inbreeding equilibrium was virtually reached; beyond this point, t was underestimated (Figure 6a). Estimating s using F_IS values results in an underestimation when populations are not at inbreeding equilibrium (Jarne and David, 2008). When using these downwardly biased estimates instead of the parametric s, an overestimation of t is then expected. As illustrated in Figure 6b, an excess of positive bias is indeed observed; this excess was the most pronounced for high selfing rates at early stages (t=3 and 4).

Underestimating and overestimating all r_j values resulted in overestimating and underestimating, respectively, t (not shown). The magnitude of this error increased with the age of the population.

When populations were simulated drawing all r_j values from a normal distribution with parameters (r^*, SE_r≠0), V(t̂) increased relative to simulations performed using SE_r=0, and a negative bias was then observed (Figure 7).

Finally, simulating populations deviating from the hypothesized genotypic structure of the F1 hybrid as described in Materials and methods yielded a positive bias from t=2 to 4, but a negative bias from t=5 to 7 (Supplementary Figure S2).

Hypothesis testing and CIs

For the explored parameter values, the ‘t=2’ and ‘t=3’ H₀ distributions were only slightly affected by the value of N_c, and by the uncertainty on r_j and s (not shown, but see Figure 5). Nevertheless, the slight increase in variance of the H₀ distributions changed or came close to changing the critical value t_crit, thus reducing the power of the test.

Altogether, simulations showed that the tests were reasonably powerful even in the presence of considerable drift and that the power was greater under partial selfing (Table 3). The H₀ ‘t=2’ and ‘t=3’ could always be rejected whenever t̂^*⩾4 and t̂^*⩾5, respectively, with an empirically determined type I error of P<0.015 and P<0.04, respectively (Table 3). The H₀ ‘t=2’ distribution exhibited a lower variance than the H₀ ‘t=3’ one. Accordingly, the power of the test under the former H₀ was considerably higher than under the latter (Table 3).

Table 3 Empirically determined power of test and type I error associated to the null hypothesis H₀ ‘t=2’ and ‘t=3’ obtained for successive values of t, different values of the drift parameter N_c and selfing rate s

The lower bound of the CI was frequently equal to, or 1 year less than, that of the 95% envelope of t̂ for simulated populations of t=t̂^*. In contrast, the CI's upper bound was frequently considerably larger than the corresponding bound of the 95% envelope (not shown).

When using the downwardly biased estimates ŝ=0.55 instead of the parametric value s=0.75, the change in type I error associated to the H₀ ‘t=2’ was negligible. Conversely, the type I error associated to H₀ ‘t=3’ increased from 0.002 to 0.107.

Empirical results

The number of alleles per locus varied from 2 to 5 in the experimental population and from 2 to 6 in the natural population. Some of these alleles were observed only once or twice. Diversity statistics estimated over 11 independent loci are presented in Table 4.

Table 4 Genetic diversity statistics and F_IS-based estimates of selfing rate

Estimates of selfing rates are presented in Table 4. Interestingly, the selfing rates estimated in the experimental and the natural populations were both non-null and of the same order of magnitude (both about 0.4). It was not possible to estimate s between F1 and G2: because an F1-hybrid variety is theoretically composed of a single heterozygous genotype, the expected genotypic frequencies resulting from pure selfing or pure outcrossing of the F1 individuals are the same. Accordingly, F_IS was not significantly different from zero in G2 (Table 4).

Estimates of effective population size were 54.0 [10–infinite] between G2 and G3 and 55.4 [12–infinite] between G3 and G4. No upper limit on the CI was obtained, indicating that the sampling variance was too large relative to the genetic drift. We therefore estimated CIs and tested the ‘t=2’ null hypothesis considering N_c=50 and N_c=200.

The genetic age and the corresponding CIs estimated using two contrasting values of the drift parameter (N_c=50, 200) in the three successive generations of the experimental population were t̂= 2 {[2, 2], [2, 2]}, 3 {[2, 7], [3, 5]} and 5 {[3, 19], [3, 11]}, for G2, G3 and G4, respectively. The ‘t=2’ null hypothesis was rejected for both the G3 and G4 samples, but only for N_c=200 in the case of the G3 sample. For the FR001 population, the drift parameter was set to N_c=600 and to infinity. The genetic age and the corresponding CIs were estimated as t̂= 4 {[2, 11], [3, 9]}. The ‘t=3’ null hypothesis could not be rejected in either case.

Discussion

Influence of parameters on the efficiency of the method

Simulations showed that t̂ was slightly positively biased and that its variance increased from generation to generation, with a pattern depending on both the recombination and selfing rates. These results may be explained by the expected dynamics of genotypic frequencies depicted in Figure 1. One important feature of this figure should be highlighted: the size of the steps (that is, the magnitude of the differences in expected frequencies between successive generations) decreases as the population ages. As a consequence, for any given age t, the step is larger between t and t−1 than between t and t+1.
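To make the shrinking steps concrete, here is a minimal worked derivation under deliberately simplified assumptions that are not the general model of the paper: complete outcrossing (s=0), no drift, and an F1 generation (t=1) made up entirely of the coupling double heterozygote AB/ab. The quantity D_t is introduced here for illustration only. It denotes the linkage disequilibrium among the gametes that form generation t, with y the gametic frequencies and r the recombination rate of the locus pair.

```latex
\begin{align}
D_t &= y_{AB}\,y_{ab} - y_{Ab}\,y_{aB}, \\
D_2 &= \frac{(1-r)^2 - r^2}{4} = \frac{1-2r}{4}, \\
D_{t+1} &= (1-r)\,D_t \quad (t \ge 2)
  \;\Longrightarrow\; D_t = \frac{(1-2r)(1-r)^{t-2}}{4}, \\
D_t - D_{t+1} &= r\,D_t .
\end{align}
```

Under these assumptions each step is (1−r) times the previous one. With r=0.1, for instance, D_2=0.2, D_3=0.18 and D_4=0.162, so the step shrinks from 0.02 to 0.018.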

These features can be expressed in terms of bias and variance of the estimator. For a population of age t, random samples will more frequently be assigned to generations deviating from the true age when the steps separating the generations are smaller. The decreasing size of the steps therefore necessarily results in an increase of the variance of t̂ with time. This also explains the increase in type I error between the tests of the ‘t=2’ and ‘t=3’ null hypotheses. Moreover, if we assume that the sampling process generates symmetrically distributed deviations, a deviation of a given magnitude crosses the smaller step toward t+1 more easily than the larger step toward t−1. More samples are thus expected to be assigned to t̂>t than to t̂<t, making t̂ an upwardly biased estimator, as was mainly observed in the simulations.

A similar reasoning can explain how recombination rate and selfing rate affect the estimator. Regarding recombination rate, Figure 4 shows that a careful choice of the genetic distance between the markers of a pair is crucial. As the true age of a sampled population is unknown, it is impossible to choose a universally optimal recombination rate. However, as long as the goal is to address the self-perpetuation of these populations by testing whether t>2 or 3, the method performed well using the optimal recombination rates found over t∈[2, 7], which were approximately 0.1⩽r_j⩽0.15. Rates of 0.05⩽r_j⩽0.10 could be chosen instead if relatively isolated, older populations were expected (t ∈ [8,14], Supplementary Figure S1).

Sampling effects

The results showed that for a fixed sampling effort E, the sampling strategy (n, K) can affect V(t̂): for finite values of N_c, doubling the number of pairs of loci K was associated with a lower variance than doubling the number of genotyped individuals n. This effect may be explained by the action of drift, which generates random deviations of the genotypic frequencies independently at each pair of loci. Increasing K reduces the probability of estimating frequencies that drift has shifted in the same direction, which in turn reduces the discrepancy among estimated ages.
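The drift argument can be illustrated with a simple variance heuristic. The sketch below is an assumption-laden simplification and not the likelihood used in the paper: it treats a frequency estimated at one locus pair as carrying an independent drift variance `drift_var` plus a sampling variance `sampling_var / n`, and averages over K independent pairs. The function name and the numerical values are hypothetical.

```python
def averaged_frequency_variance(n, K, drift_var, sampling_var):
    """Heuristic variance of a frequency averaged over K independent locus pairs.

    Drift adds the same variance at every pair whatever the sample size n,
    whereas the sampling term shrinks as more individuals are genotyped.
    """
    return (drift_var + sampling_var / n) / K

# Hypothetical values chosen only to show the direction of the effect.
drift_var, sampling_var = 0.002, 0.25

more_individuals = averaged_frequency_variance(100, 10, drift_var, sampling_var)  # (2n, K)
more_locus_pairs = averaged_frequency_variance(50, 20, drift_var, sampling_var)   # (n, 2K)
print(more_individuals, more_locus_pairs)

# Without drift (infinite N_c) the two strategies are equivalent under this heuristic.
no_drift_individuals = averaged_frequency_variance(100, 10, 0.0, sampling_var)
no_drift_locus_pairs = averaged_frequency_variance(50, 20, 0.0, sampling_var)
```

Under this heuristic the difference between the two strategies is drift_var / (2K), which is positive whenever drift is present and vanishes when it is absent.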

In practice, natural populations of volunteers are often characterized by rather small effective population sizes and are thus likely to be affected by a substantial amount of drift. More accurate estimates and more powerful tests can then be obtained by favoring a large number of loci pairs for a given sampling effort.

Robustness to deviation from assumed parameter values

Most of the considered deviations affecting s led to underestimating the genetic age, making our testing procedure generally conservative. A large underestimation of the selfing rate will, however, occur when using F_IS as long as the population is far from inbreeding equilibrium (Figure 6), which may lead to a spurious rejection of the null hypothesis (‘t=3’). One solution is to analyze the distribution of the individual level of heterozygosity, which allows the selfing rate in the earlier generation to be estimated while relaxing the assumption of inbreeding equilibrium (Enjalbert and David, 2000). We implemented such an approach on the G3 and G4 samples of our experimental population, and the resulting estimates of selfing rate were very similar to those obtained using the F_IS recursion Equation (7) (not shown).
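The size of this underestimation can be sketched numerically. Equation (7) is not reproduced in this section, so the example assumes the textbook mixed-mating recursion F_{t+1} = s(1 + F_t)/2 and its equilibrium solution s = 2F/(1 + F). The function names are hypothetical, and the starting value F=0 in G2 follows the observation that F_IS did not differ from zero in that generation.

```python
def next_fis(f, s):
    """One generation of the assumed recursion F_{t+1} = s * (1 + F_t) / 2."""
    return s * (1 + f) / 2

def selfing_assuming_equilibrium(f):
    """Selfing rate implied by F if the population were at inbreeding equilibrium."""
    return 2 * f / (1 + f)

def selfing_from_recursion(f_prev, f_next):
    """Selfing rate recovered from two successive F values, with no equilibrium assumption."""
    return 2 * f_next / (1 + f_prev)

true_s = 0.75
f_g2 = 0.0                       # offspring of the F1: no heterozygote deficit expected
f_g3 = next_fis(f_g2, true_s)    # one generation of mixed mating later

s_equilibrium = selfing_assuming_equilibrium(f_g3)   # biased downward far from equilibrium
s_recursion = selfing_from_recursion(f_g2, f_g3)     # recovers the true value
print(f_g3, s_equilibrium, s_recursion)
```

With a true selfing rate of 0.75, the equilibrium formula applied one generation after G2 returns roughly 0.55, whereas the recursion recovers 0.75.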

Directional deviations from the true recombination rates may induce large errors in the estimation (not shown). However, this risk may be substantially reduced as long as several pairs of loci are used. Simulating normally distributed deviations of recombination rate yielded an essentially negative bias and an increased variance. This underlines the need to use information from accurate genetic maps.

Moreover, we showed that the uncertainty on s and r can be taken into account when constructing the null hypothesis and/or the 95% CI of the estimated age, which reduces the risk of spuriously rejecting H₀. Simulating H₀ using a conservative value of N_c is an appropriate way to take the uncertainty on N_e into account, at the expense of the power of the test. A careful observation of the population on the sampling site may provide valuable insights about this parameter (Frankham, 1995).

Application to empirical data

The method yielded a correct estimate of the genetic age of the first two generations (G2 and G3) of our experimental population of volunteers. For the third generation (G4), the true age was included in the 95% CI. For G3 and G4, the ‘t=2’ null hypothesis was appropriately rejected (for G3, only when N_c=200). The genetic age of the FR001 natural population was estimated as t̂^*=4. As we used genotypic data obtained from the offspring rather than from the volunteer plants, the appropriate null hypothesis was ‘t=3’, which could not be rejected even when using a moderately large value of N_c (600) and possibly underestimating s. We therefore cannot exclude that the plants actually observed in this fallow were just first-generation offspring of the plants earlier cultivated in this field. As the variance of the H₀ ‘t=2’ is smaller than that of the H₀ ‘t=3’, genotyping the plants sampled in the field rather than their offspring (seedlings) might have been a more powerful procedure. However, this result is compatible with the emerging picture of the distribution of volunteer populations of sunflower in the French areas studied. Indeed, recent field surveys have shown that such populations are both rather small and rare (Muller et al., 2006), in contrast with other countries such as Argentina (Cantamutto et al., 2008). The restricted space left in French agro-systems, together with the limited spontaneous seed shattering of cultivated sunflower compared with, for example, oilseed rape, probably lowers the potential for self-perpetuation of a volunteer population.

The imperfection of empirical data always raises the question of the applicability of estimation methods, perhaps more acutely when these tools are model based. In this regard, the age estimation method seemed surprisingly robust in several respects. Despite the low effective population size and the moderate number of markers used compared with what is available in crop species such as sunflower (for example Kane and Rieseberg, 2008), the method performed reasonably well on real populations of known age (G2, G3 and G4). In addition, deviations from the theoretical genetic structure proved tractable, provided that the data are appropriately transformed. In this respect, the consistent results obtained using samples from the experimental population are informative in two ways: (i) pooling haplotypes from distinct hybrids of the same genetic age does not distort the information as long as the parental haplotypes can be recognized, which suggests that reliable estimates could be obtained using data corresponding to more complex varietal structures (for example three-way hybrids in maize or more complex mixtures of F1 hybrids sown the same year); and (ii) the elimination of rare alleles does not hinder the estimation procedure or the result.

Nevertheless, migration rates may be sufficiently large (especially through pollen import) to rapidly dilute the information about the original founders of a volunteer population. To explore the robustness of our method, we modeled a situation in which a small number of migrating gametes sharing alleles with the resident volunteer population arrived each generation (not shown). Our simulations showed that for small migration rates (<0.15), removing all genotypes carrying the less frequent alleles (that is, keeping only genotypes carrying the two most frequent alleles at each locus of any given pair) before estimating genetic age yielded consistent estimates of the genetic age of the resident population. The phase could be recognized with fairly good power by choosing the most frequent double homozygote genotype after discarding genotypes carrying the less frequent alleles.
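The filtering and phase-recognition steps described above might look like the following sketch. The data layout (each individual stored as a pair of allele pairs, one per locus), the function names and the toy genotypes are all assumptions made for the example, not the authors' code or data.

```python
from collections import Counter

def two_major_alleles(genotypes, locus):
    """Set of the two most frequent alleles at one locus (0 or 1) of the pair."""
    counts = Counter(allele for g in genotypes for allele in g[locus])
    return {allele for allele, _ in counts.most_common(2)}

def keep_major_genotypes(genotypes):
    """Discard every genotype carrying a less frequent allele at either locus."""
    major = [two_major_alleles(genotypes, locus) for locus in (0, 1)]
    return [g for g in genotypes
            if all(set(g[locus]) <= major[locus] for locus in (0, 1))]

def most_frequent_double_homozygote(genotypes):
    """Haplotype (allele at locus 1, allele at locus 2) of the commonest double homozygote."""
    homozygotes = Counter((g[0][0], g[1][0]) for g in genotypes
                          if g[0][0] == g[0][1] and g[1][0] == g[1][1])
    return homozygotes.most_common(1)[0][0] if homozygotes else None

# Toy sample (invented): ((alleles at locus 1), (alleles at locus 2)) per individual.
sample = [
    ((1, 1), (5, 5)),
    ((1, 1), (5, 5)),
    ((1, 2), (5, 6)),
    ((1, 3), (5, 6)),   # allele 3 is rare and may come from a migrant gamete
    ((2, 2), (6, 6)),
]

filtered = keep_major_genotypes(sample)
parental_haplotype = most_frequent_double_homozygote(filtered)
print(len(filtered), parental_haplotype)
```

Ties between equally frequent alleles or homozygotes are resolved arbitrarily in this sketch, so real data would need an explicit rule for such cases.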

In any case, the method proposed here may be considered a first step toward more sophisticated ones that take migration into account explicitly. Indeed, as genetic exchanges may constitute key events in weedy evolution, it would be of primary interest to incorporate gene flow as a parameter to be estimated, using a more comprehensive model.

We developed and evaluated the properties of a maximum-likelihood method to estimate the genetic age of a volunteer population derived from an F1-hybrid variety, and proposed a test to answer a simple question: is the sampled volunteer population a transitory, 1-year descendant of the crop, or the result of more than one autonomous cycle of reproduction? This method provides an example of how the available information from linked markers can be successfully exploited to analyze the very recent history of crop-derived populations. Although generic methods are continuously being designed to analyze the huge amount of genotypic information now potentially available (for example Falush et al., 2003), we feel that the investigation of crop-relative evolution in the agro-system requires an adaptation of these methods to crop genetic structure, to low intervarietal differentiation and to specific questions (for example Devaux et al., 2007), in addition to the contribution of other approaches such as demographic modeling (Pivard et al., 2008).

References

Bagavathiannan MV, Van Acker RC (2008). Crop ferality: implications for novel trait confinement. Agric Ecosyst Environ 127: 1–6.
Belkhir K, Borsa P, Chikhi L, Raufaste N, Bonhomme F (2001). GENETIX 4.02, Logiciel Sous WindowsTM Pour la Génétique des Populations. Laboratoire Génome, Populations, Interactions, CNRS UMR 5000, Université de Montpellier II: Montpellier, France.
Cantamutto M, Poverene M, Peinemann N (2008). Multi-scale analysis of two annual Helianthus species naturalization in Argentina. Agric Ecosyst Environ 123: 69–74.
Cornuet J-M, Piry S, Luikart G, Estoup A, Solignac M (1999). New methods employing multilocus genotypes to select or exclude populations as origins of individuals. Genetics 153: 1989–2000.
Crawley MJ, Brown SL (2004). Spatially structured populations dynamics in feral oilseed rape. Proc Roy Soc Lond 271: 1909–1916.
Crow JF, Kimura M (1970). An Introduction to Population Genetics Theory. Burgess publishing company: Minneapolis, USA.
Devaux C, Lavigne C, Austerlitz F, Klein EK (2007). Modelling and estimating pollen movement in oilseed rape (Brassica napus) at the landscape scale using genetic markers. Mol Ecol 16: 487–499.
Devaux C, Lavigne C, Falentin-Guyomar’ch H, Vautrin S, Lecomte J, Klein EL (2005). High diversity of oilseed rape pollen clouds over an agro-ecosystem indicates long-distance dispersal. Mol Ecol 14: 2269–2280.
Enjalbert J, David J (2000). Inferring recent outcrossing rates using multilocus individual heterozygosity: application to evolving wheat populations. Genetics 156: 1973–1982.
Excoffier L, Estoup A, Cornuet J-M (2005). Bayesian analysis of an admixture model with mutations and arbitrarily linked markers. Genetics 169: 1727–1738.
Falush D, Stephens M, Pritchard JK (2003). Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567–1587.
Frankham R (1995). Effective population size/adult population size ratios in wildlife: a review. Genet Res 66: 95–107.
Gandhi SD, Heesacker AF, Freeman CA, Argyris J, Bradford K, Knapp SJ (2005). The self-incompatibility locus (S) and quantitative trait loci for self-pollination and seed dormancy in sunflower. Theor Appl Genet 111: 619–629.
Garnier-Gere P, Dillmann C (1992). A computer program for testing pairwise linkage disequilibria in subdivided populations. J Heredity 83: 239.
Goldringer I, Bataillon T (2004). On the distribution of temporal variation in allele frequency: consequences for the estimation of effective population size and the detection of loci undergoing selection. Genetics 168: 563–568.
Jarne P, David P (2008). Quantifying inbreeding in natural populations of hermaphroditic organisms. Heredity 100: 431–439.
Kane NC, Rieseberg LH (2008). Genetics and evolution of weedy Helianthus annuus populations: adaptation of an agricultural weed. Mol Ecol 17: 384–394.
Koopman WJM, Li Y, Coart E, Van de Weg E, Vosman B, Roldán-Ruiz I et al. (2007). Linked vs unlinked markers: multilocus microsatellite haplotype-sharing as a tool to estimate gene flow and introgression. Mol Ecol 16: 243–256.
Londo JP, Schaal BA (2007). Origins and population genetics of weedy rice in the USA. Mol Ecol 16: 4523–4535.
Lorieux M (1994). ‘Aspects statistiques de la cartographie des marqueurs moléculaires’ in Document de travail de la mission biométrie du CIRAD n°1-94, pp 31–35.
Muller M-H, Arlie G, Bervillé A, David J, Delieux F, Fernandez-Martinez JM et al. (2006). Le compartiment spontané du tournesol Helianthus annuus en Europe: prospections et premières caractérisations génétiques. Actes du Colloque BRG 6: 335–353.
Pessel FD, Lecomte J, Emeriau V, Krouti M, Messean A, Gouyon PH (2001). Persistence of oilseed rape (Brassica napus L.) outside of cultivated fields. Theor Appl Genet 102: 841–846.
Pivard S, Adamczyk K, Lecomte J, Lavigne C, Bouvier A, Deville A et al. (2008). Where do the feral oilseed rape populations come from? A large-scale study of their possible origin in a farmland area. J Appl Ecol 45: 476–485.
Reagon M, Snow AA (2006). Cultivated Helianthus annuus (Asteraceae) volunteers as a genetic ‘bridge’ to weedy sunflower populations in North America. Am J Bot 93: 127–133.
Siol M, Bonnin I, Olivieri I, Prosperi JM, Ronfort J (2007). Effective population size associated with self-fertilization: lessons from temporal changes in allele frequencies in the selfing annual Medicago truncatula. J Evol Biol 20: 2349–2360.
Tang S, Kishore VK, Knapp SJ (2003). PCR-multiplexes for a genome-wide framework of simple sequence repeat marker loci in cultivated sunflower. Theor Appl Genet 107: 6–19.
Tang S, Knapp SJ (2003). Microsatellites uncover extraordinary diversity in native American landraces and wild populations of cultivated sunflower. Theor Appl Genet 106: 990–1003.
Tang S, Yu J-K, Slabaugh MB, Shintani DK, Knapp SJ (2002). Simple sequence repeat map of the sunflower genome. Theor Appl Genet 105: 1124–1136.
Virmani SS (1994). Monographs on Theoretical and Applied Genetics 22: Heterosis and Hybrid Rice Breeding. Springer: Verlag.
Waples RS (1989). A generalized approach for estimating effective population size from temporal changes in allele frequency. Genetics 121: 379–391.
Wilson GA, Rannala B (2003). Bayesian inference of recent migration rates using multilocus genotypes. Genetics 163: 1177–1191.
Wolfram S (1996). The Mathematica Book, 3rd edn. Wolfram media Cambridge University Press: Cambridge, UK.
Zhang LS, Le Clerc V, Li S, Zhang D (2005). Establishment of an effective set of simple sequence repeat markers for sunflower variety identification and diversity assessment. Can J Bot 83: 66–72.

Acknowledgements

We thank the Domaine experimental INRA d’Epoisses for production of the first two generations and Muriel Latreille for technical assistance. This work was funded by the Bureau des Ressources Génétiques.

Author information

Authors and Affiliations

UMR DIA-PC 1097, INRA Montpellier, Domaine de Melgueil, Mauguio, France
M-F Ostrowski, Y Rousselle, A Tsitrone, S Santoni, J David & M-H Muller
UMR 1210, ‘Biologie et gestion des Adventices’, INRA, Dijon, France
X Reboud

Corresponding author

Correspondence to M-H Muller.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

Supplementary Information accompanies the paper on Heredity website

Appendix

Let x_t = (x_{1,t}, …, x_{10,t})^T be the column vector containing the 10 genotypic frequencies (order: AB/AB, ab/ab, Ab/Ab, aB/aB, AB/Ab, aB/ab, AB/aB, Ab/ab, Ab/aB, AB/ab) at time t, where the superscript T denotes the transpose.

Let M₁ be the 4 × 10 transition matrix relating vector x_t to the vector (y₁,y₂, y₃, y₄)^T, containing the four gametic frequencies (order: AB, ab, Ab, aB):

Let ϕ be the application providing the expected genotypic frequencies under panmixia from a vector of gametic frequencies. ϕ is defined from [0,1]⁴ to [0,1]¹⁰ such that

The expected genotypic frequencies under panmixia at t+1 are then given by ϕ (M₁·x_t).

Let M₂ be the 10 × 10 transition matrix relating genotypic frequencies at time t to the genotypic frequencies at the next generation, under selfing.

Let s be the selfing rate (assumed equal for all genotypes). The recursion equation for genotypic frequencies under mixed mating is given by:

It is difficult to get a general expression relating x_t to x₁ (genotypic frequencies at first generation) from Equation (A1). However, under complete selfing, Equation (A1) leads to

And under complete outcrossing, Equation (A1) leads to
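The displayed matrices and Equation (A1) are not reproduced in this text, so the following sketch rebuilds the recursion from the verbal definitions alone and should be read as a reconstruction under assumptions, not as the authors' formulas. It assumes that r is the recombination rate between the two loci of a pair, that the columns of M₂ are obtained by applying ϕ to the gametes of each selfed genotype, and that the mixed-mating recursion has the form x_{t+1} = s·M₂·x_t + (1−s)·ϕ(M₁·x_t). All function names are hypothetical.

```python
GAMETES = ("AB", "ab", "Ab", "aB")
GENOTYPES = (("AB", "AB"), ("ab", "ab"), ("Ab", "Ab"), ("aB", "aB"), ("AB", "Ab"),
             ("aB", "ab"), ("AB", "aB"), ("Ab", "ab"), ("Ab", "aB"), ("AB", "ab"))

def gamete_output(genotype, r):
    """Gametic frequencies (order of GAMETES) produced by one genotype: one column of M1."""
    h1, h2 = genotype
    out = dict.fromkeys(GAMETES, 0.0)
    # Recombinant haplotypes exchange the alleles carried at the second locus.
    rec1, rec2 = h1[0] + h2[1], h2[0] + h1[1]
    for hap, p in ((h1, (1 - r) / 2), (h2, (1 - r) / 2), (rec1, r / 2), (rec2, r / 2)):
        out[hap] += p
    return [out[g] for g in GAMETES]

def m1_times(x, r):
    """M1 · x: gametic frequencies produced by a population with genotypic frequencies x."""
    y = [0.0] * 4
    for freq, genotype in zip(x, GENOTYPES):
        for i, p in enumerate(gamete_output(genotype, r)):
            y[i] += freq * p
    return y

def phi(y):
    """Expected genotypic frequencies under panmixia (random union of gametes)."""
    freq = dict(zip(GAMETES, y))
    return [freq[h1] * freq[h2] * (1 if h1 == h2 else 2) for h1, h2 in GENOTYPES]

def selfed_offspring(x, r):
    """M2 · x: offspring of a selfed genotype are phi of that genotype's own gametes."""
    out = [0.0] * 10
    for freq, genotype in zip(x, GENOTYPES):
        for i, p in enumerate(phi(gamete_output(genotype, r))):
            out[i] += freq * p
    return out

def next_generation(x, r, s):
    """Assumed mixed-mating recursion: s * M2 · x + (1 - s) * phi(M1 · x)."""
    selfed, outcrossed = selfed_offspring(x, r), phi(m1_times(x, r))
    return [s * a + (1 - s) * b for a, b in zip(selfed, outcrossed)]

# F1 hybrid (t=1): every plant is the coupling double heterozygote AB/ab.
x1 = [0.0] * 9 + [1.0]
r = 0.1
g2_selfing = next_generation(x1, r, s=1.0)
g2_outcrossing = next_generation(x1, r, s=0.0)
g3_selfing = next_generation(g2_selfing, r, s=1.0)
g3_outcrossing = next_generation(g2_outcrossing, r, s=0.0)
```

In this sketch the G2 frequencies are identical under pure selfing and pure outcrossing, in line with the remark in the Results that s cannot be estimated between F1 and G2. The two mating systems diverge from G3 onward.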

About this article

Cite this article

Ostrowski, MF., Rousselle, Y., Tsitrone, A. et al. Using linked markers to estimate the genetic age of a volunteer population: a theoretical and empirical approach. Heredity 105, 358–369 (2010). https://doi.org/10.1038/hdy.2009.156

Received: 22 September 2008
Revised: 09 October 2009
Accepted: 15 October 2009
Published: 09 December 2009
Issue Date: October 2010
DOI: https://doi.org/10.1038/hdy.2009.156

This article is cited by

Estimation of mating system parameters in an evolving gynodioecous population of cultivated sunflower (Helianthus annuus L.)
- M Roumet
- M-F Ostrowski
- M-H Muller
Heredity (2012)
