Introduction

Modern crop diversity depends on the breadth of diversity found within the wild ancestors from which it was derived. The alleles necessary for domestication of the crop are derived from this diversity. From this perspective, it is important to understand the evolutionary history of the species in its wild state with regard to its levels of diversity, linkage disequilibrium (LD) and demographic history. Using multilocus sequence data, it has been shown that wild forms of species, such as maize (Tiffin and Gaut, 2001), sunflower (Liu and Burke, 2006) and poplar (Ingvarsson, 2005), have similar levels of diversity, whereas relative to these species, diversity is reduced in wild barley (Morrell et al., 2003) and wild rice species (Zhu et al., 2007). Diversity within the cultivated form of maize was 60% of that found in its wild ancestor (Tenaillon et al., 2004). Coalescent simulations that modeled the demographic effects relative to selection determined that domestication within wild maize involved a small population of individuals. For rice, diversity was severely reduced to 20% of that found in wild genotypes (Zhu et al., 2007). At the other extreme is einkorn wheat, where wild and domesticated forms had equal diversity levels (Kilian et al., 2007). Multilocus sequence data also supported a single domestication event for each of these three species (Tenaillon et al., 2004; Kilian et al., 2007; Molina et al., 2011).

The patterns of nucleotide diversity within and among populations are influenced by evolutionary forces, such as mutation, population structure and gene flow, at the genome level (Wright and Gaut, 2005; Arunyawat et al., 2007), and recombination and natural selection at the gene level (Schmid et al., 2005). Demographic parameters such as effective population sizes of current and ancestral populations, species divergence times and gene flow rates between populations collectively define biogeographic histories and the process of population divergence and/or speciation (Strasburg and Rieseberg, 2010). Collectively, these events leave traces on the pattern of the genetic diversity (Pyhajarvi et al., 2007) within the species and can be used to define the demographic history of the species. Multilocus DNA sequence data contains a wealth of information used to describe nucleotide variation and demographic parameters in multiple plant species (Liu and Burke, 2006; Arunyawat et al., 2007; Haudry et al., 2007; Pyhajarvi et al., 2007; Zhang and Ge, 2007; Zhu et al., 2007; Ingvarsson, 2008; Stadler et al., 2008; Strasburg and Rieseberg, 2008; Chen et al., 2010). It can also be used to model demographic parameters using coalescent analysis (Pyhajarvi et al., 2007; Stadler et al., 2008; Strasburg and Rieseberg, 2008). Multiple parameters can be estimated by comparing the output from the coalescent analysis with observed nucleotide diversity data. Recently, the inference of demographic parameters from genetic data has become an integral part of conservation genetic studies (Lopes and Boessenkool, 2010).

The focus here is on common bean (Phaseolus vulgaris L.), which is the most consumed food legume in the world and a major source of protein for Latin America and Africa (Graham and Vance, 2003). Common bean is a highly structured species. Wild beans are partitioned into two large gene pools based on phaseolin seed protein variation (Gepts et al., 1986), DNA marker diversity (Becerra Velasquez and Gepts, 1994; Freyre et al., 1996; Tohme et al., 1996), morphology (Gepts and Debouck, 1991), isozymes (Koenig and Gepts, 1989) and mitochondrial DNA RFLPs (Khairallah et al., 1992). The wild Middle American gene pool extends from Mexico through Central America and into Venezuela and Colombia, whereas the Andean gene pool is found in Peru, Chile, Bolivia and Argentina. These two gene pools overlap in Colombia (Gepts and Bliss, 1986). Similar molecular analyses have shown that domesticated common bean is also split into two gene pools, and each gene pool consists of multiple races that in general have a geographical distribution similar to that of the wild genotypes (Singh et al., 1991).

Recently, a multilocus sequence analysis determined that for each gene pool the cultivated form of the species arose by a single domestication event (Mamidi et al., 2011) out of distinct wild gene pools. This sets common bean apart from the other domesticated crop species studied to date, each of which exhibited a single domestication event out of a single wild gene pool (Tenaillon et al., 2004; Kilian et al., 2007; Molina et al., 2011). The duration of the Middle American domestication bottleneck, 8160–6260 years before the present (BP), was longer than the Andean duration, 8500–7012 years BP. These dates are consistent with archeological evidence (Kaplan and Lynch, 1999). Additionally, the domestication bottleneck population size for the Andean gene pool was about 50% of that of the Middle American, a result that explains the consistent observation of reduced sequence variability within Andean landraces (McClean et al., 2004; McClean and Lee, 2007). Following the bottleneck, each gene pool experienced a population expansion and divergence into races.

What has not been modeled previously is the evolutionary history of the wild form of a cultivated plant species using the coalescent approach. Here we collected multilocus sequence data from wild P. vulgaris genotypes, determined the nucleotide variation within this collection, described the evolutionary forces that contributed to the population structure of the progenitors of domesticated common bean, and estimated demographic parameters. Given the many reports that show the wild members of P. vulgaris are split into two gene pools, and that domesticated forms appear to have been derived from those two pools, we estimated these parameters using a single isolation–migration (IM) model. Understanding the genetic architecture of the wild common bean will help design approaches to mine the genetic diversity (Singh, 2001; Acosta-Gallegos et al., 2007) for the many adaptive traits that may be necessary under a climate-change scenario (Muraya et al., 2010; McClean et al., 2011).

Materials and methods

Genetic material, DNA isolation and sequence analysis

A collection of 24 wild P. vulgaris genotypes with a geographic distribution ranging from Northern Mexico to Northwestern Argentina was analyzed (Table 1). The genotypes were selected to represent the haplotype variability at the CHI intron-3 and DFR intron loci (McClean et al., 2004; McClean and Lee, 2007). A total of 13 nuclear loci were selected for sequencing, with at least one locus located on each chromosome (McConnell et al., 2010; Mamidi et al., 2011).

Table 1 Wild common bean genotypes used for this study of diversity and demography

DNA was extracted from young leaves using the procedure described in Brady et al. (1998), and the 13 loci were amplified using standard PCR conditions. The amplified fragments were sequenced from both directions using Beckman CEQ 2000XL DNA Analysis System (Beckman Coulter Inc, Brea, CA, USA). The DNA sequence chromatograms were analyzed using the Staden Package (Staden, 1996; http://staden.sourceforge.net/). Gene annotation and structure were identified by blastx against the Viridiplantae database at NCBI. All sequences are deposited in the GenBank database.

Population differentiation and nucleotide diversity

To differentiate individuals into subpopulations, population structure was investigated using the STRUCTURE 2.2 software (Pritchard et al., 2000). Combined data for all loci were used for this analysis. Given that the common bean is a highly self-fertilizing species, the haploid phase setting was used for the data analysis. The use of an admixture model was included based on the results of Rossi et al. (2009). We set k (the number of subpopulations) from 1 to 6 and performed 10 runs for each k value. For each run, a burn in of 100 000 iterations was followed by an additional 500 000 iterations. We considered the STRUCTURE posterior probability for each k value and the Evanno et al. (2005) Δk output to select the subpopulation number. Individuals with membership coefficients of qi ≥ 0.7 in STRUCTURE 2.2 runs were assigned to a specific group. The subpopulations were further defined as gene pools based on the geographic distribution of individuals within that subpopulation. To further differentiate the subpopulations, neighbor-joining (NJ) trees for the combined loci were built in ClustalX (Larkin et al., 2007) and bootstrapped over 1000 replicates.
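As an illustration of the Δk criterion, the following minimal sketch computes Δk from the log-likelihoods of replicate runs. It assumes the commonly used form in which Δk is the absolute second-order difference of the mean ln P(D) across runs divided by the standard deviation of ln P(D) at that k. The function name and the toy likelihood values are hypothetical and are not taken from this study.

```python
from statistics import mean, stdev

def evanno_delta_k(lnp_by_k):
    """Return {k: delta_k} for interior k values.

    lnp_by_k maps each k to a list of ln P(D) values from replicate runs.
    k values are assumed to be consecutive integers.
    """
    ks = sorted(lnp_by_k)
    means = {k: mean(lnp_by_k[k]) for k in ks}
    delta = {}
    for k in ks[1:-1]:
        # Absolute second-order rate of change of the mean log-likelihood.
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        delta[k] = second_diff / stdev(lnp_by_k[k])
    return delta

# Toy values (hypothetical): two replicate runs per k.
toy_runs = {
    1: [-1000.0, -1002.0],
    2: [-800.0, -802.0],
    3: [-790.0, -792.0],
    4: [-785.0, -787.0],
}
delta_k = evanno_delta_k(toy_runs)
best_k = max(delta_k, key=delta_k.get)
```

With these toy values the sharp gain in likelihood between k=1 and k=2 gives the largest Δk at k=2. Note that Δk cannot be evaluated at the smallest or largest k tested.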

Once the subpopulations were defined, Fst tests of population differentiation (Hudson et al., 1992) were performed using DnaSP 4.90 (Rozas and Rozas, 1999). Population differentiation was further assessed using the Hudson’s Snn (Hudson, 2000) statistic, and the significance was determined with 10 000 permutations. The number of shared (Ss) and fixed (Sf) silent sites between populations, and the number of unique silent sites (SAnd and SMA) in each population were analyzed. Unless noted otherwise, population parameters were estimated using DnaSP 4.90.
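For orientation, the Fst estimator of Hudson et al. (1992) can be sketched as Fst = 1 − Hw/Hb, where Hw is the mean number of pairwise differences within subpopulations and Hb is the mean number between subpopulations. The sketch below assumes equal weighting of the two subpopulations when averaging Hw, which may differ from the weighting applied by DnaSP. The function names and toy sequences are hypothetical.

```python
from itertools import combinations, product

def n_diff(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_within(seqs):
    """Mean pairwise differences among sequences of one subpopulation."""
    pairs = list(combinations(seqs, 2))
    return sum(n_diff(a, b) for a, b in pairs) / len(pairs)

def mean_between(pop1, pop2):
    """Mean pairwise differences between sequences of two subpopulations."""
    pairs = list(product(pop1, pop2))
    return sum(n_diff(a, b) for a, b in pairs) / len(pairs)

def hudson_fst(pop1, pop2):
    """Fst = 1 - Hw/Hb, with Hw averaged equally over the two subpopulations."""
    hw = (mean_within(pop1) + mean_within(pop2)) / 2
    hb = mean_between(pop1, pop2)
    return 1 - hw / hb

# Toy alignment (hypothetical): two sequences per subpopulation.
pop_a = ["AAAA", "AAAT"]
pop_b = ["TTAA", "TTAT"]
fst = hudson_fst(pop_a, pop_b)
```

In this toy case Hw = 1 and Hb = 2.5, so Fst = 0.6.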

Nucleotide diversity for silent sites (synonymous and noncoding) was estimated for the complete population and the subpopulations separately. The estimated parameters include: Watterson’s estimator (θw=4Neμ), the average number of pairwise differences per site between sequences in a sample (π), number of segregating sites (S), the number of haplotypes (h) and haplotype diversity (Hd).
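The summary statistics listed above can be illustrated with a short sketch that computes S, θw, π, h and Hd from a set of aligned sequences. It assumes an alignment without gaps or missing data and treats every column as a silent site, which simplifies what DnaSP does. The function name and the toy alignment are hypothetical.

```python
from collections import Counter
from itertools import combinations

def diversity_stats(seqs):
    """Return S, Watterson's theta per site, pi per site, h and Hd."""
    n = len(seqs)
    length = len(seqs[0])
    # Segregating sites: alignment columns with more than one state.
    s = sum(len(set(col)) > 1 for col in zip(*seqs))
    # Watterson's estimator per site: S / a1 / L.
    a1 = sum(1 / i for i in range(1, n))
    theta_w = s / a1 / length
    # pi: mean pairwise differences per site.
    pairs = list(combinations(seqs, 2))
    total_diff = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    pi = total_diff / len(pairs) / length
    # Haplotype count and haplotype (gene) diversity.
    hap_counts = Counter(seqs)
    h = len(hap_counts)
    hd = n / (n - 1) * (1 - sum((c / n) ** 2 for c in hap_counts.values()))
    return {"S": s, "theta_w": theta_w, "pi": pi, "h": h, "Hd": hd}

# Toy alignment (hypothetical): four sequences, four sites.
stats = diversity_stats(["AAAA", "AAAT", "AATT", "AATT"])
```

For the toy alignment this gives S = 2, h = 3, π = 7/24 per site and Hd = 5/6.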

To test for departure from a neutral equilibrium model of evolution, Tajima’s D (DT; Tajima (1989)) was estimated. A multilocus Hudson–Kreitman–Aguadé (HKA) test (Hudson et al., 1987) across genetically unlinked or loosely linked loci was performed using the HKA program (http://genfaculty.rutgers.edu/hey/software) to discriminate between selection and population demography. P. coccineus genotypes PI 325589 and PI 325599 were used as outgroup members. The minimum number of recombination events (Rm) at silent sites was determined using the four-gamete test of Hudson and Kaplan (1985). The number of significant pairwise LD comparisons was evaluated using Fisher’s exact test and the Bonferroni procedure. The Zns statistic, which is the average of r2 of all pairwise comparisons, was calculated. The expected decay of LD with physical distance was graphed using a non-parametric regression equation suggested in Remington et al. (2001) in SAS 9.2 (Cary, NC, USA).
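As a reference for the neutrality test, the sketch below implements Tajima's D from the sample size n, the number of segregating sites S and the mean number of pairwise differences k, following the constants defined in Tajima (1989). It is a minimal illustration rather than the DnaSP implementation, and it does not compute significance. The function names are hypothetical.

```python
import math
from itertools import combinations

def tajimas_d_from_summary(n, s, k):
    """Tajima's D from sample size n, segregating sites s, mean pairwise differences k."""
    if s == 0:
        return float("nan")  # D is undefined without segregating sites
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (k - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

def tajimas_d(seqs):
    """Tajima's D from aligned sequences (no gaps or missing data assumed)."""
    n = len(seqs)
    s = sum(len(set(col)) > 1 for col in zip(*seqs))
    pairs = list(combinations(seqs, 2))
    k = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    return tajimas_d_from_summary(n, s, k)

# Summary values n=10, S=16, k=3.888889 give D of about -1.45.
d_example = tajimas_d_from_summary(10, 16, 3.888889)
d_toy = tajimas_d(["AAAA", "AAAT", "AATT", "AATT"])
```

A negative D, as in the first example, reflects an excess of low-frequency variants relative to the neutral equilibrium expectation.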

Demographic parameter estimation

Demographic parameters were estimated in two steps. In the first step, the MIMAR program (Becquet and Przeworski, 2007) was used to estimate divergence parameters for an IM model with the priors described in Table 2. MIMAR uses four statistics known to be sensitive to the parameters of the IM model (Wakeley and Hey, 1997; Leman et al., 2005). MIMAR outputs the posterior distribution of the parameters for a given data set obtained using a Markov Chain Monte Carlo (MCMC) simulation that allows for intra locus recombination. The MCMC was run for 20 million steps after an initial 5 million burn in with sampling every 1000 steps. We tested different ranges of priors, burn ins and variances of kernel distributions. The MIMAR autocorrelation function was used to discover a combination of values that gave good mixing of parameter values (Chen et al., 2010). A variable locus specific recombination rate was used. To test if the estimated IM parameters (NA, NMA, NAnd, T3, M12 and M21) described a model that was consistent with the observed data, we conducted a goodness of fit test using MIMARgof. Simulated and observed values of SAnd, SMA, Sf, Ss, π, Fst and DT at silent sites were compared by the test.

Table 2 Priors used for simulations and posteriors estimated after simulations

In the second step, the bottleneck population size was estimated separately for each gene pool using Hudson ms (Hudson, 2002) in an Approximate Bayesian Computation pipeline. A simple demographic model was used in which the ancestral population (NA) was split to a bottleneck population size (NBMA and NBAnd) at time T2M and T2A in the Middle American and Andean gene pools, respectively. The bottleneck ended at times T1M and T1A, respectively, with duration time equal to T2M − T1M and T2A − T1A, respectively, for each of the two gene pools. This was followed by an exponential growth to the present size. The priors for present-day population size and bottleneck start times were derived from the 99% confidence interval (CI) of MIMAR results from the first step. Ancestral population size and the effective population size (Ne) are assumed to be 220 000 individuals. A total of 10 million simulations per locus were run, and these simulations were piped to msstats software (available at http://molpopgen.org) to obtain the summary statistics for each simulation. All the priors including the recombination rate are listed in Table 2. The summary statistics of the simulations were compared with the statistics observed for the sequenced wild genotypes. The Euclidean distance was calculated between simulated and observed summary statistics using S, H, π, DT and Zns. We accepted 10 000 simulations with a Euclidean distance less than 0.01. The summary statistics for the accepted simulations were subjected to a Principal Component Analysis using the PRINCOMP procedure in SAS 9.2. The parameters were then estimated from the accepted 10 000 simulations using the general linear model (GLM), described in Leuenberger and Wegmann (2010), using the GLM procedure in SAS 9.2.
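The rejection step of such an Approximate Bayesian Computation pipeline can be sketched as follows. This is a toy illustration only: the simulator is a hypothetical stand-in for ms and msstats, a single parameter (the bottleneck size as a fraction of the ancestral size) is drawn from a uniform prior, and two made-up summary statistics are used. The Principal Component Analysis and the GLM adjustment of Leuenberger and Wegmann (2010) are not reproduced.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two vectors of summary statistics."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def toy_simulator(fraction, rng):
    """Hypothetical stand-in for a coalescent simulation plus summary statistics."""
    return (fraction + rng.gauss(0, 0.01), fraction ** 2 + rng.gauss(0, 0.01))

def abc_rejection(observed, n_sims, tolerance, rng):
    """Keep prior draws whose simulated statistics fall within the tolerance."""
    accepted = []
    for _ in range(n_sims):
        fraction = rng.uniform(0.0, 1.0)          # draw from the prior
        simulated = toy_simulator(fraction, rng)  # simulate summary statistics
        if euclidean(simulated, observed) < tolerance:
            accepted.append(fraction)
    return accepted

rng = random.Random(1)            # fixed seed for reproducibility
observed = (0.30, 0.09)           # toy observed statistics (hypothetical)
accepted = abc_rejection(observed, n_sims=20000, tolerance=0.05, rng=rng)
posterior_mean = sum(accepted) / len(accepted)
```

The accepted draws approximate the posterior distribution of the parameter; tightening the tolerance improves the approximation at the cost of fewer accepted simulations.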

Combining the results for the goodness of fit test of our IM parameters estimated using MIMAR and bottleneck parameters estimated using ms, we performed 10 million simulations in Hudson ms. We compared the results of the simulations with the observed means of θ, π, DT, H, Hd and Zns data across the 13 loci within each gene pool. If the observed data fits within the 95% CI of simulated data, it can be said that the model and parameters are appropriate.
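The acceptance criterion described here can be illustrated with a short sketch that checks whether an observed statistic falls inside the central 95% interval of simulated values. The interval is taken from simple order statistics, which is one of several possible conventions. The function names and toy values are hypothetical.

```python
import math

def central_interval(values, level=0.95):
    """Central interval of simulated values from order statistics."""
    ordered = sorted(values)
    last = len(ordered) - 1
    lower = ordered[math.floor((1 - level) / 2 * last)]
    upper = ordered[math.ceil((1 + level) / 2 * last)]
    return lower, upper

def fits_model(observed, simulated, level=0.95):
    """True if the observed statistic lies within the central interval."""
    lower, upper = central_interval(simulated, level)
    return lower <= observed <= upper

# Toy simulated values (hypothetical): the integers 0 to 100.
simulated_values = list(range(101))
inside = fits_model(50, simulated_values)
outside = fits_model(100, simulated_values)
```

In practice the check would be repeated for each summary statistic and each gene pool.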

Results

Population differentiation

Sequence data for 13 nuclear loci were obtained from 24 wild common bean genotypes (Table 3). The length of the aligned sequence varied from 341 bp to 689 bp, and a total of 6.4 kb (4.13 kb of coding region) of sequence data was analyzed. Subpopulations were defined using the combined sequence data from all loci with the Bayesian approach implemented in STRUCTURE. Both the highest posterior probability approach (recommended in the STRUCTURE manual) and the Δk method of Evanno et al. (2005) (Figure 1) indicated the population was best represented by three subpopulations. In all, 21 genotypes were assigned to one of the two major groups based on a subpopulation membership coefficient of qi>0.7. Samples from Argentina, Bolivia, Venezuela and Peru grouped together in an Andean subpopulation, whereas a Middle American subpopulation was represented by samples from Mexico, El Salvador, Honduras and Guatemala. Three genotypes (DGD 1962, PI 535430, W6 12107), assigned to a third group with a major subpopulation membership coefficient of less than 0.7, were considered to be recently admixed genotypes. Because of our interest in evaluating events in a more distant past, and as this subpopulation did not appear representative of the ancestral wild populations, these genotypes were excluded from further analysis. An NJ analysis of the remaining 21 genotypes, using the combined data from all loci, defined a tree with two clusters (Figure 2). The cluster separation was supported with a bootstrap value of 100% (1000 replicates). The NJ cluster membership was the same as that generated with the STRUCTURE analysis. The results of the two analyses were also consistent with the geographically based two-gene-pool model of wild P. vulgaris and strongly suggested that this population of 21 genotypes was appropriate for our demographic analyses.

Table 3 Diversity and neutrality parameters at the silent sites of each locus studied
Figure 1

Structure results based on Δk approach. The number of subpopulations is listed on x-axis and Δk is listed on y-axis.

Figure 2

NJ tree of the concatenated 13 loci. Only bootstrap>80% are indicated in the NJ tree.

The two subpopulations were strongly differentiated based on an average Fst value of 0.29 and a significant Hudson Snn statistic at ten loci (0.001<P<0.05). Collectively, these results suggested the genotypes were representative of the two presumed ancestral gene pools of wild common bean. A total of 34 shared polymorphisms were observed between the two subpopulations. Fixed polymorphisms were not found within these subpopulations (Table 4). The number of unique polymorphisms in the Middle American subpopulation (43) is higher than for the Andean subpopulation (29).
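The four classes of polymorphic sites reported here can be tallied as in the sketch below. It assumes biallelic sites in a gap-free alignment and uses a simplified rule: a site is shared if it is polymorphic in both subpopulations, unique if polymorphic in only one, and fixed if each subpopulation is monomorphic for a different state. The function name and toy sequences are hypothetical.

```python
def classify_sites(pop1, pop2):
    """Count shared, fixed and unique polymorphic sites between two subpopulations."""
    counts = {"shared": 0, "fixed": 0, "unique_1": 0, "unique_2": 0}
    for col1, col2 in zip(zip(*pop1), zip(*pop2)):
        states1, states2 = set(col1), set(col2)
        poly1, poly2 = len(states1) > 1, len(states2) > 1
        if poly1 and poly2:
            counts["shared"] += 1      # segregating in both subpopulations
        elif poly1:
            counts["unique_1"] += 1    # segregating only in subpopulation 1
        elif poly2:
            counts["unique_2"] += 1    # segregating only in subpopulation 2
        elif states1 != states2:
            counts["fixed"] += 1       # fixed difference between subpopulations
    return counts

# Toy alignment (hypothetical): two sequences per subpopulation, five sites.
site_counts = classify_sites(["AAAAA", "ATAAC"], ["TATAC", "TTAAC"])
```

For the toy alignment each class is observed once, and the fourth site is monomorphic and is not counted.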

Table 4 Population differentiation data for the two gene pools

Nucleotide diversity of the entire population

A total of 106 segregating silent sites were observed among all genotypes for all loci (Table 3). The number of haplotypes varied between 3 and 16 for the silent sites with an average of 6.36. The average haplotype diversity across all the samples was 0.647. Within various loci, the Watterson’s estimate (θsil) for the silent sites varied between 0.0044 and 0.0394 with an average of 0.0109. Nucleotide polymorphism at silent sites (πsil) averaged 0.0097 and ranged from 0.0036 to 0.0354. Except for loci g1159 and g634, the ratio of πnonsyn/πsyn was less than one with an average of 0.372. Tajima’s D was significant only for locus g776 (P<0.05) and slightly negative for the majority of the loci. For the complete population, the multilocus HKA test of neutrality was not significant (P=0.9765). Among the loci, the minimum number of recombination events (Rm) varied between 0 and 4 with an average of less than one event per locus. The average r2 value (Zns) was 0.175. For the entire population, LD decayed to r2=0.1 within 500 bp (Figure 3a). Using Fisher’s exact test, the percentage of significant LD pairwise comparisons was 7% when the Bonferroni correction was applied. Based on the significant correlation (P<0.0001) between silent diversity and interspecific divergence (θsil and Ksil; Zhu et al., 2007), the populations were considered to be under neutral equilibrium. A similar positive correlation was observed between segregating sites and the number of haplotypes (P<0.0001).

Figure 3

Decay of linkage disequilibrium with distance. Solid lines indicate the non-parametric regression and the dots indicate the raw r2 values plotted against distance. In (a), the line indicates the whole population, and in (b) the black represents Middle America and gray represents Andean.

Nucleotide diversity of individual gene pools

The Middle American gene pool contained 77 silent segregating sites, whereas the Andean gene pool had 63 (Table 3). The average number of Middle American haplotypes (4.29) is slightly higher than that of the Andean loci (3.43). The Middle American haplotype diversity (Hd=0.652) was greater than the Andean (Hd=0.447). The level of nucleotide diversity within the two gene pools was nearly identical among loci with an average θsil=0.0090 (πsil=0.0089) for the Middle American gene pool and θsil=0.0082 (πsil=0.0068) for the Andean gene pool. The ratio of πnonsyn/πsyn was less than 1, except for locus g634 within Middle American gene pool.

DT did not depart from neutral equilibrium expectations at any locus within either gene pool except locus g634 (P<0.05) in the Andean gene pool (Table 3). DT values were negative in the majority of the loci (7 and 11 loci in the two gene pools). The multilocus HKA test was not significant within either the Middle American gene pool (P=0.9732) or the Andean gene pool (P=0.9884). The strong positive correlations between θsil and Ksil (P<0.0001), and between segregating sites and the number of haplotypes (P<0.0001), further support the conclusion that these populations do not deviate from neutral equilibrium.

The average r2 value for the Andean gene pool (Zns=0.464) was greater than that for the Middle American gene pool (Zns=0.272). The minimum number of recombination events (Rm) for most loci was 0. In the Middle American gene pool, LD decay was within 500 bp (r2=0.1), whereas in the Andean gene pool, the decay was within 250 bp (r2≈0.3) (Figure 3b). Finally, using Fisher’s exact test, only 0.2% of pairwise comparisons were significant when the Bonferroni correction was applied.

Demography estimation

Given that population differentiation was best explained by a geographic distribution and no fixed sites were observed between the two gene pools, a parapatric model with a recent divergence was assumed to best define the evolution of wild common bean (Phaseolus vulgaris). Such a model includes an ancestral population as the initial source of the variation from which the two wild gene pools were derived. Given the geographic proximity of the two gene pools, gene flow between the two was also expected. Based on these assumptions, several population parameters were estimated using MIMAR for this model. First, the divergence time between the two gene pools was 111 000 years ago (95% CI: 55 931–196 101; Table 2; Figure 4; Supplementary Figure 1). θ for the ancestral population was calculated to be 0.0102 (95% CI: 0.0059–0.0165), which is equivalent to an effective population size of 220 000 individuals. This value is based on a mutation rate of 6.1 × 10⁻⁹ synonymous substitutions per generation for vascular plants (Lynch and Conery, 2000); it was calculated using θ=4Neμ and then scaled to Ne=N(2−s)/2 (Nordborg and Donnelly, 1997), where s is the selfing rate, which is estimated to be 93% (Ibarra-Perez et al., 1997). Similarly, θ for the Middle American gene pool is 0.0071 (95% CI: 0.0025–0.0194), whereas for the Andean gene pool, it is 0.0033 (95% CI: 0.0014–0.0150). Asymmetric migration rates (4Nem) of 0.2658 from the Andean to the Middle American population and 0.0968 from the Middle American to the Andean population were estimated (Figure 4, Table 2). Using these population parameter estimates, the results from 10 000 simulations were analyzed using MIMARgof to test the accuracy of the estimated IM model parameters. The standard IM model gave a reasonably good fit for SAnd, SMA, Ss and Sf, π and Fst, for both populations, but the fit for Andean DT was poor (Supplementary Figure 2).
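The conversion from θ to effective population size can be checked with a short calculation. The sketch assumes θ = 4Nμ followed by the selfing correction Ne = N(2 − s)/2, using the values quoted in the text (θ = 0.0102, μ = 6.1 × 10⁻⁹, s = 0.93). The function name is hypothetical.

```python
def effective_size(theta, mu, selfing_rate):
    """Effective population size from theta = 4*N*mu, scaled for selfing."""
    n = theta / (4 * mu)                 # size implied by theta and mu
    return n * (2 - selfing_rate) / 2    # selfing correction Ne = N(2 - s)/2

MU = 6.1e-9        # mutation rate quoted in the text
SELFING = 0.93     # selfing rate quoted in the text

ne_ancestral = effective_size(0.0102, MU, SELFING)

# Present-day gene pool sizes relative to the ancestral population (ratio of thetas).
ratio_middle_american = 0.0071 / 0.0102
ratio_andean = 0.0033 / 0.0102
```

The ancestral estimate evaluates to roughly 224 000, which rounds to the 220 000 individuals reported. The θ ratios are about 0.70 for the Middle American and 0.32 for the Andean gene pool.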

Figure 4

Parameter estimates for the wild gene pools of Phaseolus vulgaris.

Based on the negative DT, which suggests population expansion, we included a population bottleneck followed by an exponential growth to the present sizes of the two wild subpopulations. Using the Approximate Bayesian Computation pipeline, we separately estimated that the Andean and Middle American bottlenecks started 103 000 years ago and ended 62 000 years ago. The start time of the bottleneck is similar to the divergence time estimated above. The Andean bottleneck population size was equivalent to 26% of the ancestral population (95% CI: 14.15–38.42), whereas the Middle American bottleneck population size was equivalent to 46% of the ancestral population (95% CI: 35.17–67.54; Figure 4). Then, using the divergence time and migration estimates from the first step and the bottleneck times and population sizes from the second step, the simulations gave a reasonable fit for the summary statistics tested, that is, the observed statistics lie within the 95% CI of the simulations.

Discussion

Diversity and demography of wild common bean

Common bean is a diverse crop best known for its variation in seed size, seed shape, seed coat color and patterns (McClean et al., 2002), which collectively define the various market classes of beans we consume. Both wild types and landraces of the species form the core germplasm base available to the breeders. Wild genotypes of P. vulgaris were previously the subject of multiple marker analyses (Gepts et al., 1986; Koenig and Gepts, 1989; Gepts and Debouck, 1991; Khairallah et al., 1992; Becerra Velasquez and Gepts, 1994; Freyre et al., 1996; Tohme et al., 1996; Kwak and Gepts, 2009; Rossi et al., 2009) and only recently have multilocus gene sequence data been used to evaluate its diversity (Bitocchi et al., 2012). Here we collected sequence data for 13 loci distributed across the common bean genome from a diverse group of genotypes representing the two major gene pools, and from those data assessed nucleotide diversity, population differentiation and, for the first time, demographic parameters for the wild form of the species. These results have implications regarding the origins of wild populations that were the source of variation used during the domestication events (Mamidi et al., 2011).

A combination of STRUCTURE and NJ tree analysis defined two subpopulations that correspond to the two wild-type gene pools. The geographic composition of the subpopulations was consistent with that described in previous studies (Gepts et al., 1986; Koenig and Gepts, 1989; Gepts and Debouck, 1991; Khairallah et al., 1992; Becerra Velasquez and Gepts, 1994; Freyre et al., 1996; Tohme et al., 1996; Rossi et al., 2009), and Fst estimates indicate a moderate to high level of population differentiation. These Fst estimates are higher than those found for cross-pollinated species (Ingvarsson, 2005; Arunyawat et al., 2007) and other self-pollinated species (Wright et al., 2003), and are most likely due to the self-pollinating nature of the species, by which the amount of diversity contributed by the migrants is less than that in a cross-pollinated species (Ness et al., 2010). Significant Hudson’s Snn values for the majority of loci further confirmed the presence of significant differentiation between the two gene pools.

Low levels of nucleotide variation among the wild genotypes (πsil=0.0097) and within each gene pool (πsil-MA=0.0089; πsil-And=0.0068) also reflect the self-pollinating nature of P. vulgaris. This is consistent with the concepts of Charlesworth (2003), who suggested that low levels of genetic variation within self-pollinating species are due to a low effective population size and a low effective rate of recombination. The level of nucleotide variation observed for these wild populations is comparable to that for the wild form of some other self-pollinated crop species (Wright and Gaut, 2005; Liu and Burke, 2006; Arunyawat et al., 2007; Zhang and Ge, 2007; Zhu et al., 2007) and greater than that found in wild rice (Zhu et al., 2007). One reason for the low levels of nucleotide variation may be introgression from domesticated beans as reported recently (Papa et al., 2005). Finally, the fact that the nucleotide diversity at silent sites for the entire population is greater than that of individual subpopulations supports the existence of population structure (Moeller et al., 2007).

The nucleotide diversity of the Middle American gene pool was slightly higher than that of Andean gene pool yet not as great as observed previously for individual loci (McClean et al., 2004; McClean and Lee, 2007). Differences in nucleotide diversity among populations can be attributed to differences in population size, the number of immigrants and genotypic contributions to the migrant pool (Wakeley, 2001). Consistent with these suggestions, we observed a greater number of migrants entering the Middle American gene pool. In addition to greater nucleotide diversity, the Middle American gene pool had a higher level of haplotype diversity, a population feature attributed to a larger effective population size. The results of the IM model simulations were consistent with this expectation of a larger Middle American effective population size.

Other than a single locus (g634 in the Middle American gene pool), the DT estimates were generally negative in the two gene pools. This suggests an excess of singletons in the gene pools that can be attributed to the interplay between population structure and demography (Arunyawat et al., 2007), population expansion (Zhu et al., 2007; Städler et al., 2009) and/or population subdivision (Ingvarsson, 2005; Moeller et al., 2007). Based on the neutrality tests, and the correlations between θsil and Ksil, and between S and H, it appears that selection is not acting on most of these loci. The locus D1468, which maps near a domestication locus, has the same level of diversity as other loci, whereas g776, an alcohol dehydrogenase locus that maps near D1468, has a significant DT value (P<0.05), indicating it is evolving under different evolutionary or demographic forces, possibly purifying selection.

We estimate that the present-day population sizes of the Middle American and Andean gene pools are about 70% and 32% of the ancestral population, respectively. We also estimated the divergence time of the two gene pools at 111 000 years ago, a date more recent than the estimate of 500 000 years ago based on alpha-amylase inhibitor and internal transcribed spacer sequence data (reviewed in Kwak and Gepts (2009)). These estimates could be biased if the ancestral gene pool was structured or if migrants were entering that population (Becquet and Przeworski, 2009; Strasburg and Rieseberg, 2010). Given that P. vulgaris and P. coccineus, common bean’s closest relative, diverged about 1.3–2.5 million years ago (Matt Lavin, personal communication), it is unlikely that in-migration from such a diverged species affected nucleotide diversity of the ancestral wild common bean population. In addition, recent analyses determined the wild Andean and Middle American gene pools were derived from a mostly unstructured ancestral population located in Mexico (Bitocchi et al., 2012). Although these observations add credence to our demographic estimates, it should be noted that the sensitivity of demography parameter estimates to the assumption of population structure is not clear (Hey, 2006; Strasburg and Rieseberg, 2008).

As observed for wild relatives in other crop species (Muraya et al., 2010), our simulations determined that the ancestral wild population of common bean underwent bottlenecks associated with the development of the wild Andean and Middle American gene pools. Although a bottleneck in the Andean wild gene pool has been detected previously (Rossi et al., 2009; Bitocchi et al., 2012), our analysis for the first time also detected a bottleneck associated with the development of the Middle American wild gene pool. Although the direct consequences of the bottleneck are to decrease genetic diversity, increase rates of inbreeding and fix mildly deleterious alleles (reviewed in Muraya et al. (2010)), these factors did not have the same intensity for the wild Andean and Middle American gene pools. For example, although the duration of the bottleneck was similar for the two gene pools, the intensity of the bottleneck was less for the Middle American gene pool. This was directly reflected by the difference in diversity between the two wild gene pools. This is consistent with recent observations relative to the origin of the two wild gene pools and their subsequent adaptations to the ecosystems in Mexico, Central America and South America. Based on the close relationship between wild Andean and Middle American genotypes from central Mexico, Bitocchi et al. (2012) proposed the ancestral gene pool of common bean was located in central Mexico in a region bordered by the Sierra Madre Occidental, the Sierra Madre Oriental and the Transverse Volcanic Axis. They also observed wild Middle American subpopulations in North/Central Mexico, along the Mexican Pacific Coast and extending into Central America, and along the South/Central Mexican region bordering the Caribbean Sea. It would be expected from these and our observations that the Middle American gene pool would be more diverse to allow for adaptation to such a broad geographic and climatic area.
Finally, an expansion in population size following the bottleneck was a critical parameter in explaining the current diversity levels. This was reflected by the negative DT values for both the wild gene pools.

Several observations suggest that migration is occurring between the gene pools. The absence of fixed sites and the presence of many shared sites suggest that the subpopulations diverged recently and that migration has occurred. Ongoing migration was also supported by the observation that the sum of the haplotypes within each gene pool was greater than the number of haplotypes found when the genotypes were evaluated as a single population. Asymmetric levels of gene flow are common between recently formed species (Hey, 2006; Zhou et al., 2008), and gene flow has been observed for many other wild ancestors of plant species (Stadler et al., 2008; Ross-Ibarra et al., 2009). We did observe gene flow, although modest in size, with a higher migration rate from the Andean to the Middle American gene pool. The gene flow (4Nem) estimated here is consistent with estimates for other species (Morjan and Rieseberg, 2004), which range from 0.02 to 90.4, with a mean of 1.8 and a median of 1.1. Further investigations into this variation in migration rates using genome-wide scans may provide clues to the genetic or genomic basis of differences and similarities of the two gene pools brought on by reproductive isolation and occasional migration events (Machado et al., 2007).
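The haplotype-count argument above is simple arithmetic: the sum of the within-pool haplotype counts exceeds the pooled count exactly by the number of haplotypes shared between the pools. The sketch below shows only this counting step, not the inference of migration itself; the function name and the toy haplotypes are hypothetical.

```python
def haplotype_counts(pool_a, pool_b):
    """Compare per-pool haplotype counts with the count in the combined sample."""
    set_a, set_b = set(pool_a), set(pool_b)
    return {
        "within_sum": len(set_a) + len(set_b),  # sum of distinct haplotypes per pool
        "pooled": len(set_a | set_b),           # distinct haplotypes in the combined sample
        "shared": len(set_a & set_b),           # haplotypes present in both pools
    }


# Hypothetical toy haplotypes (not data from this study).
counts = haplotype_counts(["AAT", "AAT", "ACT"], ["ACT", "GCT"])
print(counts)
```

Whenever `within_sum` is larger than `pooled`, the difference equals `shared`, so the observation in the text is equivalent to saying that some haplotypes occur in both gene pools.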

The differential effects of the bottleneck upon the gene pools were also reflected by the result that the average LD coefficient (Zns) estimate was higher for the Andean gene pool, a result consistent with earlier results (Rossi et al., 2009). The differential level of LD may reflect the differences in haplotype diversity between the two gene pools. In contrast to the LD results, recombination (as measured by Rm) was rare for most of the loci studied here. In the absence of recombination, other factors such as selection and effective population size may be modulating LD (Rafalski and Morgante, 2004). Although we cannot estimate LD or recombination rates at the chromosome level by evaluating a single locus per chromosome, we were able to determine that sufficient levels of diversity exist in common bean to search for diagnostic polymorphisms within or near candidate genes (Zhu et al., 2007) given a high-density marker system.
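For readers unfamiliar with the statistic, Zns is commonly defined as the average squared allelic correlation (r²) over all pairs of segregating sites at a locus. The following is a minimal sketch under that definition, assuming phased haplotypes and considering only biallelic sites; the function names and toy haplotypes are hypothetical, and this is not necessarily how the values reported here were computed.

```python
from itertools import combinations


def r_squared(site_a, site_b):
    """Squared allelic correlation between two biallelic sites."""
    n = len(site_a)
    # Code each site as True/False relative to the allele in the first haplotype.
    a = [x == site_a[0] for x in site_a]
    b = [x == site_b[0] for x in site_b]
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x and y for x, y in zip(a, b)) / n
    d = p_ab - p_a * p_b  # LD coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def zns(haplotypes):
    """Average r^2 over all pairs of biallelic segregating sites."""
    # Assumption: sites with more than two alleles are skipped.
    sites = [col for col in zip(*haplotypes) if len(set(col)) == 2]
    if len(sites) < 2:
        raise ValueError("need at least two biallelic segregating sites")
    pairs = list(combinations(sites, 2))
    return sum(r_squared(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy haplotypes (not data from this study).
print(zns(["AA", "AA", "TT", "TT"]))  # two sites in complete LD
print(zns(["AA", "AT", "TA", "TT"]))  # all four haplotypes equally frequent
```

A locus with only two haplotypes gives the maximum value, whereas a locus carrying every allelic combination at equal frequency gives zero, which illustrates why lower haplotype diversity tends to go with higher Zns.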

Implications for association mapping (AM)

AM has the advantage of exploiting all the recombination events that have occurred in the evolutionary history of a sample and results in a higher mapping resolution compared with family mapping (Myles et al., 2009). The success of AM is related to the extent of LD within a population sample. With the increased use of AM to discover quantitative trait loci (QTL), it is necessary to understand the different factors that affect LD. The resolution with which a QTL can be mapped is a function of LD decay over distance. Differences in the extent of LD have a very important effect on the marker density required for association mapping. In association mapping, the price of higher LD is lower resolution. Among the different factors that affect LD, demographic parameters have a profound influence. As the majority of neutral genetic variation present in domesticated crops arose in their wild populations by mutations and genetic drift to present-day allele frequencies (Hamblin et al., 2011), it becomes necessary to study the demographic parameters of wild types that shaped the present-day landraces and cultivars.

Effective population size (Ne) is an important parameter in population genetics because it is inversely proportional to the rate of genetic drift (reviewed in Hamblin et al. (2011)). A larger effective population size would lead to more recombination events, and recombinant chromosomes are lost more slowly. Thus, larger populations are expected to have lower levels of LD (Hamblin et al., 2011). Self-pollinated plants are highly homozygous and rarely generate new haplotypes via recombination. When a very large population exists over many generations, rare outcrossing events can maintain linkage equilibrium. Bottlenecks, as observed for wild common bean here, reduce variation and change the frequency spectrum by removing low-frequency alleles (Hamblin et al., 2011). Bottlenecks also increase the extent of LD by eliminating recombinant lineages resulting in extensive haplotype structure, which is more pronounced in self-pollinating crops. Differential rates of LD can occur across the genome because selection reduces variation locally relative to that found throughout the genome (Myles et al., 2009). The strong population structure in many crops presents both challenges and opportunities for association mapping. It is a source of allelic diversity, which can be used to discover QTL regions through use of admixed populations. On the other hand, population structure generates spurious associations between phenotypes and unlinked markers. Although mixed models include factors that account for population structure, these corrections reduce our ability to detect true positives (Brachi et al., 2010).
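The qualitative relationships described above (larger Ne lowers LD, selfing raises it) can be sketched numerically. The example below is an illustration only and is not part of the analysis reported here. It assumes the classical drift-recombination approximation E[r²] ≈ 1/(1 + 4Nec), often attributed to Sved, and the common approximation that partial selfing at rate s scales the effective recombination rate by (1 − F), with F = s/(2 − s). All function names and parameter values are hypothetical.

```python
def inbreeding_coefficient(selfing_rate):
    """Equilibrium inbreeding coefficient F = s / (2 - s) under partial selfing."""
    return selfing_rate / (2.0 - selfing_rate)


def expected_r2(ne, c, selfing_rate=0.0):
    """Approximate expected r^2 between two loci at drift-recombination equilibrium.

    Assumptions: E[r^2] ~ 1 / (1 + 4 * Ne * c), with the recombination
    fraction c reduced to c * (1 - F) under partial selfing.
    """
    f = inbreeding_coefficient(selfing_rate)
    c_effective = c * (1.0 - f)
    return 1.0 / (1.0 + 4.0 * ne * c_effective)


# Hypothetical parameter values chosen only for illustration.
for ne in (1_000, 10_000):
    for s in (0.0, 0.95):
        print(ne, s, round(expected_r2(ne, c=0.001, selfing_rate=s), 3))
```

Under these assumptions, increasing Ne lowers the expected r² at a given recombination fraction, whereas a high selfing rate raises it, in line with the expectation that bottlenecked, self-pollinating populations retain extensive LD.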

In P. vulgaris, population structure arose in wild types 110 000 years ago, followed by a bottleneck within each wild gene pool before domestication. Later, a domestication bottleneck reduced diversity and increased the population structure (Mamidi et al., 2011). This was followed by differentiation of the domesticated populations into races and intense selection by breeders. All of these events would lead to higher LD in present-day cultivar populations. As a result of the selfing mating system, an insufficient amount of time has passed to break up the LD generated by the bottleneck (Myles et al., 2009; Hamblin et al., 2011). Even though high LD is beneficial for detecting QTL with high probability, fine mapping a causal variant is difficult, as multiple markers around the QTL have the same strength of detection (Astle and Balding, 2009). This suggests that association mapping in P. vulgaris should be performed independently for each gene pool. As the bottleneck and selection effects are different for each gene pool, AM in each gene pool will be necessary to detect the causal variants in P. vulgaris.

Data archiving

Sequence data have been submitted to the NCBI Genome Survey Sequences Database: accession numbers JY123505–JY123801.