Introduction

‘Admixture mapping’ as suggested by Chakraborty and Weiss (1988) and Briscoe et al. (1994) utilizes linkage disequilibrium (LD) induced by the mixing of genes from two divergent gene pools. In an outcrossing species and in the absence of confounding population structure, LD will decay with increasing genetic map or chromosomal distance (Lynch and Walsh, 1998), because the chance that stretches of DNA are broken up by recombination becomes greater the further two loci are apart. Admixture will effectively widen the region of a genome that is affected by LD, because recombination will take several/many generations to break up the chromosome blocks derived from each parental population (Briscoe et al., 1994; Chapman and Thompson, 2002). Hence, admixture potentially facilitates molecular marker-based ‘genome-scans’ to narrow in on genomic regions conferring trait differences between two divergent source gene pools (Chakraborty and Weiss, 1988; Briscoe et al., 1994; McKeigue et al., 2000; Pfaff et al., 2001). This prediction has been verified recently by successful admixture genome-scans for two complex traits in humans – hypertension and susceptibility for multiple sclerosis (Reich et al., 2005; Zhu et al., 2005).

The requirements for genome-scans through admixture in humans have been carefully evaluated by geneticists for years (e.g., McKeigue et al., 2000; Pfaff et al., 2001; Hoggart et al., 2004). In addition to setting the stage for association mapping in human medicine, these studies have encouraged the development of similar approaches in natural admixed populations or ‘hybrid zones’ of wild animals or plants (Rieseberg et al., 1999; Rieseberg and Buerkle, 2002). In ‘non-human’ organisms, admixture mapping holds enormous potential for studies addressing the genetic changes that occur during divergence of populations, ecotypes, or species. This may allow geneticists to address some of the biggest issues in evolutionary biology, for example the number and effect sizes of genes involved in adaptation and speciation (Fisher, 1930; Wright, 1931), the genetic architecture of barriers that keep previously diverged genomes from merging upon secondary contact (Barton and Hewitt, 1985; Barton and Gale, 1993), or the likelihood of spread of advantageous alleles (Morjan and Rieseberg, 2004). These questions are often addressed by quantitative trait locus (QTL) mapping of trait differences in ‘mapping populations’ derived from experimental crosses (Orr, 2001). However, crosses are often difficult to obtain in wild taxa, for example in species that are long-lived or otherwise of limited tractability – these are the taxa for which admixture mapping of complex traits would be most useful. An important aspect is that in evolutionary biology interest often will be directed toward identifying genomic segments subject to natural selection, rather than segments associated with a particular trait alone (Rieseberg et al., 1999; Wu, 2001). For simplicity, variation in such segments is referred to as ‘adaptive or detrimental variation’ from here onwards.

Populus has been suggested as a ‘model forest tree’ for studying tree form, function, and evolution (Taylor, 2002), interactions between ecological carrier species and their communities (Whitham, 1989), and how ecology interacts with plant development (Cronk, 2005). The favorable genetic attributes of Populus such as small genome size (550 Mb; 2C=1.1 pg), diploidy throughout the genus (2n=38), porous species barriers (Rajora and Dancik, 1992; Martinsen et al., 2001; Floate, 2004), and a near-complete genome sequence with thousands of expressed sequence tags and markers (http://www.ornl.gov/sci/ipgc) make this genus an ideal candidate for evaluating the ‘admixture mapping’ approach in a ‘non-human’ organism. Here, we focus on two species that hybridize frequently in Europe, Populus alba (white poplar) and Populus tremula (European aspen).

P. alba and P. tremula share a wide sympatric or parapatric distribution across large parts of Central and Southern Europe and hybrid zones often form between them (Rajora and Dancik, 1992; Fossati et al., 2004; Lexer et al., 2005). A recent sequencing survey of nuclear genes in one of the two species (P. tremula; Ingvarsson, 2005) indicates that LD may often not extend beyond a single gene, suggesting that the detection of genetic associations in wild intra-specific populations may be difficult. P. alba and P. tremula exhibit several features that render them potentially suitable for admixture mapping of detrimental or adaptive trait variation. First, they are ecologically divergent: P. alba is restricted mainly to lowland flood-plain forests whereas P. tremula occurs in mixed upland communities (Adler et al., 1994), and traits potentially involved in these divergent ecological preferences have been identified (Karrenberg et al., 2002). Second, species barriers in Populus are likely to be genic rather than chromosomal (Cervera et al., 2001; Lexer et al., 2005), and thus variation at genic factors involved in species isolation should segregate in hybrids. Third, the two species differ in numerous diagnostic morphological characters (Adler et al., 1994), hybrids display large phenotypic variances in multiple traits, and the proportion of recombinant (backcrossed) genotypes in hybrid zones is high (Rajora and Dancik, 1992; Lexer et al., 2005).

Here, we ask the following questions regarding the potential of hybrid zones between P. alba and P. tremula for admixture genome-scanning: (1) How large are DNA microsatellite allele frequency differentials between the two species? (2) How large is background LD among unlinked loci in different hybrid genotypic classes and each parental species and what causes it? (3) How large is the variation in introgression rates among highly informative marker loci? We use our data to assess the feasibility of admixture mapping-related studies in European Populus, and we discuss a modified approach of interpreting genotypic clines that may be useful for studying the evolution of ‘mosaic’ hybrid zones.

Materials and methods

Sampling of hybrid zone and parental populations

A large ‘mosaic’ hybrid zone between P. alba and P. tremula (hybrids commonly known as P. x canescens) was sampled in the Danube valley near Vienna, Austria. The sampling area covered a linear distance of approximately 110 km of the river valley between Krems and Hainburg, Austria, and included lowland flood-plain ‘gallery’ forest located within the Danube Floodplain National Park (http://www.donauauen.at/) and adjacent areas. P. x canescens hybrid morphotypes were sampled in such a way as to maximize geographic coverage within the hybrid zone (sampling a transect was not feasible because of the patchy distribution of remnant forests and suitable habitats within forests). Leaf material was collected for DNA extraction.

Two neighboring ‘populations’ were sampled for each of the two parental species. These samples were described in Lexer et al. (2005) and are referred to as ‘subpopulations’ here, since molecular analyses indicate that levels of gene exchange are high (Nem>3.0 in both species; Lexer et al., 2005) and that they form one panmictic unit for each species. Subpopulations of P. alba were sampled in the Danube valley in Austria (within the zone of sympatry; sampling mid-point: 48.26°N, 16.27°E) and Romania (outside the zone of sympatry; sampling mid-point: 43.77°N, 23.96°E). P. tremula was sampled from the Austrian Danube (within the zone of sympatry; sampling mid-point: 48.28°N, 15.89°E) and from the Eastern Alps in Austria (outside the zone of sympatry; sampling mid-point: 46.62°N, 13.85°E). The sample sizes were: 378 chromosomes for P. x canescens hybrids, 88 chromosomes for P. alba and 78 chromosomes for P. tremula. In effect, the sampling of the hybrid zone was increased by 100% since the initial characterization of the hybrid zone (Lexer et al., 2005).

Within-genome sampling

The 19 microsatellite markers used in this study were developed by Tuskan et al. (2004) and Van der Schoot et al. (2000) as indicated on the web-site of the International Populus Genome Consortium: http://www.ornl.gov/sci/ipgc/ssr_resource.htm. These 19 markers, listed in Table 1, can be considered genetically independent, since they are located either on different chromosomes or widely spaced on the same linkage group of the P. trichocarpa x P. deltoides genetic map (Yin et al., 2004); levels of synteny in Populus are high as indicated by comparative genetic mapping (Cervera et al., 2001). Six of the markers used in this study have not yet been placed on the Populus genetic map, but exact tests for LD in P. tremula (the species with the larger effective population size Ne) indicated no genetic association among any of these loci. Our choice of using unlinked markers for this study reflects the situation encountered by many students of non-model organisms, where a limited number of unlinked loci becomes available first and a larger number of linked loci is employed at a later stage.

Table 1 Microsatellite markers employed in this study, including repeat type, linkage group and map position on the P. trichocarpa × P. deltoides genetic map, number of alleles/frequency (f) of most informative allele in P. alba and P. tremula, allele frequency differential δ, and number of rare alleles (f<8%) in P. alba, P. tremula and P. x canescens hybrids, respectively

DNA extractions and microsatellite genotyping

DNA extractions were carried out as described previously (Lexer et al., 2005), except that the initial step of tissue disruption of silicagel-dried leaves was automated using a Retsch MM301 mixer mill. Microsatellites were PCR-amplified following Lexer et al. (2005), making use of a three-primer PCR to achieve economic fluorescent labeling of PCR products via a labeled ‘universal’ M13 primer. PCR products were resolved on an AB 3100 automated sequencer (Applied Biosystems) using the fluorescent dyes 6-FAM and JOE as well as molecular size differences for multiplexing. Molecular sizes in base pairs were determined using the GENESCAN-500 ROX (Applied Biosystems) size standard, and microsatellite genotypes were scored by two people independently (CL and JAJ) using GENESCAN and GENOTYPER software (Applied Biosystems, Foster City, CA, USA).

Genetic data analysis

Microsatellite allele frequency differentials were determined as

$$\delta = \frac{1}{2}\sum_{i}\left|f_{i1} - f_{i2}\right|$$

where $f_{i1}$ and $f_{i2}$ represent the frequencies of the $i$th allele in the two parental populations, respectively, following Zhu et al. (2005). Rare alleles in the hybrid zone and each parental species were identified based on a threshold population frequency of 8%. To test whether samples of each parental species represented one or more panmictic population(s), a Bayesian analysis was carried out with STRUCTURE version 2 (Pritchard et al., 2000), using the admixture model allowing for correlated allele frequencies, a burn-in phase of 50 000 iterations followed by a run length of 100 000 iterations. These parameter settings were identified as appropriate based on the diagnostic tools available in STRUCTURE.
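The allele frequency differential described above can be illustrated with a short sketch. It assumes the commonly used form of the statistic, half the sum of absolute allele frequency differences between the two parental populations; the function name and the example frequencies are hypothetical and are not taken from Table 1.

```python
def allele_frequency_differential(freqs_pop1, freqs_pop2):
    """Return delta = 0.5 * sum_i |f_i1 - f_i2| for one locus.

    Each argument maps an allele label to its frequency in one parental
    population; alleles absent from a population count as frequency 0.
    """
    alleles = set(freqs_pop1) | set(freqs_pop2)
    return 0.5 * sum(
        abs(freqs_pop1.get(a, 0.0) - freqs_pop2.get(a, 0.0)) for a in alleles
    )


# Hypothetical frequencies for illustration only (not data from this study).
example_alba = {"180": 0.70, "184": 0.30}
example_tremula = {"184": 0.10, "188": 0.90}
example_delta = allele_frequency_differential(example_alba, example_tremula)
```

Under this definition, δ is 0 when the two populations share identical allele frequencies and 1 when they share no alleles at all.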

Maximum-likelihood estimates (MLEs) as well as upper and lower bounds of the molecular hybrid index h for each plant from the hybrid zone were calculated from codominant microsatellite data using the computer program HINDEX (Buerkle, 2005). The genomic composition of plants from the hybrid zone was presented previously (Lexer et al., 2005), and the Bayesian genetic structure analysis presented there allowed the identification of pure parental reference populations for studying admixture. Estimation of parental allele frequencies was thus straightforward, so simple hybrid index MLEs (h) rather than Bayesian admixture proportions (Q) were chosen for the present study. Although Q should in theory yield equivalent results, more experience exists with h in interspecific hybrid zones (Rieseberg et al., 1998, 1999; Buerkle, 2005). The purpose of re-calculating h here was to create criteria for sorting hybrids into different genotypic classes so that two-locus disequilibria could be computed for each class. The lower (P. tremula) bound of h was used for this purpose. Four genotypic classes with approximately equal sample size of N=45±2 individuals were defined from the hybrid zone data. Ranked from low hybrid index values (early generation hybrids) to high hybrid index values (advanced generation backcrosses to P. alba), the four genotypic classes were defined as: hybrid index (HI) group 1 (0.256–0.771), HI group 2 (0.771–0.854), HI group 3 (0.854–0.908), and HI group 4 (0.908–0.933). Seven apparent F1 genotypes were omitted from the dataset because the F1 generation is typically not suitable for admixture mapping.
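The idea behind a maximum-likelihood hybrid index can be sketched as follows. This is not the HINDEX implementation; it is a minimal grid search under the assumption that each allele copy is drawn from P. alba with probability h and from P. tremula with probability 1−h. The function names, the frequency floor for unseen alleles, and the support interval of two log-likelihood units are all illustrative assumptions.

```python
import math


def log_likelihood(h, genotype, p_alba, p_tremula, floor=1e-6):
    """Log-likelihood of one multilocus genotype given hybrid index h.

    genotype: list of (allele1, allele2) tuples, one per locus.
    p_alba, p_tremula: lists of dicts (allele -> parental frequency).
    floor: small assumed probability used when an allele has probability 0.
    """
    total = 0.0
    for (a1, a2), fa, ft in zip(genotype, p_alba, p_tremula):
        for allele in (a1, a2):
            p = h * fa.get(allele, 0.0) + (1.0 - h) * ft.get(allele, 0.0)
            total += math.log(max(p, floor))
    return total


def hybrid_index_mle(genotype, p_alba, p_tremula, steps=1000, drop=2.0):
    """Grid-search MLE of h and the range of h within `drop` log-likelihood units."""
    grid = [i / steps for i in range(steps + 1)]
    ll = [log_likelihood(h, genotype, p_alba, p_tremula) for h in grid]
    best = max(ll)
    h_hat = grid[ll.index(best)]
    supported = [h for h, v in zip(grid, ll) if v >= best - drop]
    return h_hat, min(supported), max(supported)


# Hypothetical diagnostic loci for illustration only.
demo_alba = [{"A": 1.0}, {"C": 1.0}]
demo_tremula = [{"B": 1.0}, {"D": 1.0}]
demo_h, demo_lower, demo_upper = hybrid_index_mle(
    [("A", "B"), ("C", "D")], demo_alba, demo_tremula
)
```

With only two loci the support interval is wide, which is one reason for using many informative markers and for binning individuals into broad genotypic classes rather than relying on point estimates alone.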

Two-locus disequilibria were calculated in the form of the standardized LD D′. Haplotype frequencies for this purpose were estimated from microsatellite genotype data using the Expectation-Maximization (EM) algorithm as implemented in ARLEQUIN (Excoffier and Slatkin, 1995), and D′ was calculated from haplotype data using the HAPLOXT program distributed within the GOLD software package (Abecasis and Cookson, 2000). This method was used to estimate D′ for all pairs of marker loci in each of the four hybrid genotypic classes (HI groups 1–4) and in the parental populations of P. alba and P. tremula. The calculations were carried out both with and without rare alleles (f<8%) in the datasets, resulting in a total of 2052 two-locus comparisons (171 locus pairs in each of six groups, for each of the two datasets). The significance of pairwise disequilibria was tested using the EM algorithm and associated permutation procedure in ARLEQUIN, and significance thresholds were corrected for multiple tests using sequential Bonferroni (Rice, 1989). Descriptive statistics such as means, standard errors, and medians for D′ were calculated and compared in SPSS (SPSS Inc.).
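For readers unfamiliar with standardized disequilibrium at multi-allelic loci, the calculation can be sketched as below. The sketch assumes a widely used weighting in which the absolute D′ of every allele pair is weighted by the product of the two allele frequencies; whether HAPLOXT applies exactly this weighting is an assumption here, and the function name and example frequencies are hypothetical.

```python
def multiallelic_d_prime(hap_freqs):
    """Weighted multi-allelic D' from two-locus haplotype frequencies.

    hap_freqs maps (allele_at_locus_A, allele_at_locus_B) -> frequency,
    with frequencies summing to 1.
    """
    # Marginal allele frequencies at each locus.
    p, q = {}, {}
    for (a, b), f in hap_freqs.items():
        p[a] = p.get(a, 0.0) + f
        q[b] = q.get(b, 0.0) + f

    d_prime = 0.0
    for a, pa in p.items():
        for b, qb in q.items():
            d = hap_freqs.get((a, b), 0.0) - pa * qb
            # Largest |D| attainable given the allele frequencies.
            if d < 0:
                d_max = min(pa * qb, (1 - pa) * (1 - qb))
            else:
                d_max = min(pa * (1 - qb), (1 - pa) * qb)
            if d_max > 0:
                d_prime += pa * qb * abs(d) / d_max
    return d_prime


# Hypothetical haplotype frequencies for illustration only.
complete_association = {("A", "B"): 0.5, ("a", "b"): 0.5}
equilibrium = {("A", "B"): 0.25, ("A", "b"): 0.25, ("a", "B"): 0.25, ("a", "b"): 0.25}
n_locus_pairs = 19 * 18 // 2  # pairs among 19 loci
```

Because rare alleles give small values of the maximum attainable D, they can inflate D′, which is consistent with reporting results both with and without rare alleles.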

In order to investigate the likely causes for background LD in the two parental populations, variance components of LD were calculated following Ohta (1982) using the computer program LINKDOS (Garnier-Gere and Dillmann, 1992). This method uses Wright's island model to partition the variance in LD into within- and between-subpopulation components, in analogy to the partitioning of inbreeding coefficients in subdivided populations. The most intuitive variance components are DIS2, the variance of LD within subpopulations, DST2, the variance component due to genetic drift between subpopulations, and DIT2, the variance of LD in the total population. Differences between variance components contain information regarding the likely cause(s) for LD: for example, if DIS2<DST2 then LD may be due to drift and limited migration, and if DIS2>DST2 then epistatic natural selection is a more likely cause. The hypothesis that DIS2 and DST2 differed from one another was tested with non-parametric Wilcoxon signed-rank tests using SPSS.

For the analysis of introgression frequencies, 16 marker loci with the greatest information content were chosen (δ>0.25, see Table 1). In order to maximize the accuracy with which clines could be estimated, alleles at each locus were combined into two allelic classes with frequency differences between species (δ) equal to those observed when alleles were utilized separately (see Supplementary material). Allelic classes were created by an exhaustive search of all possible combinations of alleles in two classes (code written in R version 2.1.1; R Development Core Team, 2005). This simple pooling of alleles into classes reduces multiallelic data to a biallelic classification, without a loss of information or distortion of the relationship between parental species. Below we refer to genotypes, which in this case are genotypes of allelic classes rather than of individual microsatellite alleles.
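The exhaustive search over two-class poolings described above can be sketched in a few lines. The original code was written in R and is not reproduced here; the Python version below is an illustrative assumption about how such a search might look, with hypothetical names and frequencies.

```python
from itertools import combinations


def best_two_class_pooling(freqs_pop1, freqs_pop2):
    """Exhaustively search two-class poolings of the alleles at one locus.

    Returns (class_one, delta): the set of alleles pooled into the first
    class and the frequency difference of that class between populations.
    """
    alleles = sorted(set(freqs_pop1) | set(freqs_pop2))
    best_class, best_delta = frozenset(), 0.0
    # Try every non-trivial subset; a subset and its complement
    # describe the same split and give the same absolute difference.
    for size in range(1, len(alleles)):
        for subset in combinations(alleles, size):
            diff = sum(
                freqs_pop1.get(a, 0.0) - freqs_pop2.get(a, 0.0) for a in subset
            )
            if abs(diff) > best_delta:
                best_class, best_delta = frozenset(subset), abs(diff)
    return best_class, best_delta


# Hypothetical frequencies for illustration only.
pool_alba = {"180": 0.70, "184": 0.30}
pool_tremula = {"184": 0.10, "188": 0.90}
pooled_class, pooled_delta = best_two_class_pooling(pool_alba, pool_tremula)
```

The best split places all alleles that are more frequent in one species into the same class, so the pooled difference equals half the sum of absolute per-allele differences. For loci with many alleles the number of subsets grows exponentially, so an exhaustive search is only practical for moderate allele numbers.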

Introgression of P. tremula alleles into the P. alba background was described using estimated clines of microsatellite genotype frequencies. The more common approach to the analysis of introgression is to calculate clines across a spatial transect (Barton and Gale, 1993). Our approach also differs from human admixture mapping, where locus-specific excess ancestry is used as a predictor variable in a logistic regression with disease status as the response variable (Reich et al., 2005; Zhu et al., 2005). In contrast to these methods, the approach utilized here involves estimating genotype frequencies as a function of hybrid index. For each locus, the allelic class with the highest frequency in P. tremula was taken as the focal class and separate clines were estimated for homozygous and heterozygous genotypes that included the allelic class. Multinomial logistic regression was used to estimate genotype frequencies as a function of hybrid index, with genotypes that did not include the focal allelic class serving as the reference category. Regressions were performed in R (R Development Core Team, 2005), using the R packages nnet (package version 7.2–16) and genetics (G Warnes and F Leisch, package version 1.1.3).

Given that the hybrid index is an estimate of the genome-wide average frequency of P. alba alleles within an individual, it can be utilized to generate expected genotype frequencies under the assumption of neutral introgression. For any given h, the expected frequency of an allele is given by $E(a) = a_{\mathrm{trem}} + (a_{\mathrm{alba}} - a_{\mathrm{trem}})h$, where $a_{\mathrm{alba}}$ and $a_{\mathrm{trem}}$ are the allele frequencies in the parental populations of P. alba and P. tremula (with a diagnostic, diallelic locus, this reduces to $E(a) = h$). The expected frequencies for the genotypes that include a particular allele are given by $E(a)^2$ for the homozygote and $2E(a)(1 - E(a))$ for the heterozygote, and the frequency for all other genotypes is expected to be $(1 - E(a))^2$. To test for departures from neutrality, the slope and intercept of the regressions for observed data were compared to those obtained in 1000 replicate simulations of genotypic data for each locus. In simulations, genotypes were sampled randomly according to frequencies expected under neutrality, using a population of the same size and with the same distribution of hybrid index as the empirical data. Preliminary analyses indicated that in all simulation replicates, parameter estimates for the regression converged without difficulty.
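The neutral expectation and the sampling step of the simulations can be sketched as follows. The sketch assumes that the expected allele frequency is a linear mixture of the two parental frequencies weighted by h, and that genotypes follow Hardy-Weinberg proportions given that frequency; function names and input values are hypothetical, and the regression fitting applied to each replicate is not shown.

```python
import random


def expected_allele_frequency(h, a_alba, a_trem):
    """E(a) = a_trem + (a_alba - a_trem) * h, with h the P. alba proportion."""
    return a_trem + (a_alba - a_trem) * h


def expected_genotype_frequencies(h, a_alba, a_trem):
    """Return (homozygote, heterozygote, other) frequencies for the focal allele."""
    e = expected_allele_frequency(h, a_alba, a_trem)
    return e * e, 2.0 * e * (1.0 - e), (1.0 - e) ** 2


def simulate_genotypes(hybrid_indices, a_alba, a_trem, rng):
    """Draw one genotype class per individual under neutral introgression."""
    classes = ("homozygote", "heterozygote", "other")
    draws = []
    for h in hybrid_indices:
        weights = expected_genotype_frequencies(h, a_alba, a_trem)
        draws.append(rng.choices(classes, weights=weights, k=1)[0])
    return draws


# Hypothetical inputs for illustration only: a focal allelic class that is
# common in P. tremula (0.95) and rare in P. alba (0.05).
rng = random.Random(1)
replicate = simulate_genotypes([0.3, 0.6, 0.9], a_alba=0.05, a_trem=0.95, rng=rng)
```

Repeating the draw many times with the empirical hybrid index values, and fitting the same regression to each replicate, would yield a null distribution of slopes and intercepts against which the observed estimates can be compared.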

Results

Allele frequency differentials and rare alleles

The 19 microsatellites exhibited large allele frequency differentials between the two parental species, P. alba and P. tremula (up to 0.987, mean of δ=0.619±0.067 s.e., median=0.647). There was no obvious relationship between microsatellite repeat type and δ; for example, frequency differentials varied from 0.064 to 0.987 for tetra-repeat marker loci and from 0.114 to 0.919 for di-repeat markers. Estimates for δ for each marker, along with information about the number of alleles observed in each species and the parental frequencies of the most informative allele for each locus, are given in Table 1.

The number of ‘rare alleles’ (alleles with population frequencies <8%) was larger in the hybrid zone compared to samples of the two parental species for most loci (15 out of 19 microsatellites; Table 1). Probabilistic analysis of this result was not intended since sampling strategies differed between the hybrid zone and the two parental species, the two parental gene pools being represented by a smaller number of gene copies sampled over a much larger geographic range. Nevertheless, the high proportion of low frequency alleles present in the hybrid zone indicates that rare alleles must be taken into account when interpreting patterns of LD in admixture mapping studies.

Patterns of background LD in the parental genepools

Bayesian genetic structure analysis indicated that the neighboring subpopulations sampled for each of the two parental species, P. alba and P. tremula, behaved like one single panmictic unit in each case. The natural logarithm (ln) of the probability of pairs of subpopulations of P. alba representing one panmictic unit was −1412, whereas it was −1890 for a model including two populations for this species. Likewise, the statistics in P. tremula were −1545 for the ‘one-population’ genetic structure model and −1614 for the ‘two-population’ model, again indicating a single panmictic genepool. These results confirm previous estimates of high levels of gene exchange (Nem) among neighboring subpopulations of each species. Based on the data, subpopulations were combined for analyses of LD in each parental genepool.
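One common way to read such log probabilities is to convert them into approximate posterior probabilities for each candidate number of populations. The sketch below does this for the values reported above, assuming a uniform prior over the candidate models; this conversion is an illustrative assumption and is not described as part of the original analysis.

```python
import math


def posterior_model_probabilities(ln_probs):
    """Convert ln Pr(data | K) values into posterior probabilities of each K,
    assuming a uniform prior over the candidate K values."""
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(ln_probs.values())
    weights = {k: math.exp(v - m) for k, v in ln_probs.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# ln probabilities for one- and two-population models as reported in the text.
post_alba = posterior_model_probabilities({1: -1412.0, 2: -1890.0})
post_tremula = posterior_model_probabilities({1: -1545.0, 2: -1614.0})
```

For both species the one-population model receives essentially all of the posterior weight, in line with the interpretation of a single panmictic unit per species.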

Standardized estimates of LD (D′) for the two parental genepools were larger for P. alba than for P. tremula (mean=0.300±0.016 s.e., median=0.300 for P. alba; mean=0.251±0.011 s.e., median=0.231 for P. tremula; Figure 1a). A similar difference between species was observed when rare alleles were excluded from the analysis, but the median of D′ was smaller for both species in this case (Figure 1b). Analysis of the variance components of LD in the parental genepools revealed striking differences between the two species. In P. alba, DST2, the variance of LD due to genetic drift between subpopulations, was significantly larger than DIS2, the variance of LD within subpopulations (Z of Wilcoxon signed-rank test=−8.043, P<0.005; Figure 2a). This difference was not observed in P. tremula (Figure 2b). These inter-specific differences provide information about the likely causes of background LD in these two divergent genepools.

Figure 1

Distribution of standardized two-locus disequilibria (D′) for 171 comparisons among 19 multi-allelic microsatellite loci in Central European populations of P. alba and P. tremula. (a) Rare alleles (f<8%) included. (b) Rare alleles excluded. For overall allele numbers and numbers of rare alleles at each locus in each species see Table 1.

Figure 2

Distribution of DIS2, the within-subpopulation variance component of LD, and DST2, the between-subpopulation variance component due to genetic drift, in Central European populations of Populus alba and P. tremula. Medians are indicated as thick black lines within boxplots. Z-values and significance levels from non-parametric tests for differences between the two variance components are indicated in each graph. (a) P. alba. (b) P. tremula.

Two-locus disequilibria in different hybrid generations

Linkage disequilibria in four different genotypic classes of hybrids as defined by their molecular hybrid index (HI groups 1–4) were generally larger in early generation hybrids (HI group 1) than in highly introgressed advanced generation backcrosses (HI groups 3 and 4; see Supplementary material). The proportion of significant tests for LD followed the same overall pattern: 15.8% of exact tests between pairs of loci were significant in HI group 1 (early generation hybrids), whereas 5.3, 4.2 and 3.8% were significant in HI groups 2, 3 and 4 (later generation hybrids), respectively. For comparison, only 1.4% of exact tests for LD yielded a significant result in the parental population of P. alba and no significant test at all was observed in P. tremula (not shown). Note, however, that slight differences in allelic diversities between study groups may lead to differences in the power to detect LD. As expected, the median of D′ in different genotypic classes of hybrids was generally smaller when rare alleles were excluded from the analysis (see Supplementary material). Our data reflect the decay of LD in successive hybrid generations as predicted from theory (Briscoe et al., 1994; Martinsen et al., 2001; Chapman and Thompson, 2002).

Variation in introgression frequencies among DNA microsatellite loci

Sixteen loci that exhibited the greatest level of differentiation between parental taxa (δ>0.25; Table 1) were selected for introgression analysis. The markers displayed a variety of patterns of parental genotype frequencies and patterns of introgression (Figure 3). Several loci exhibited departures from neutral expectations and can be placed into three categories: a small excess of P. tremula genotypes in the hybrid and P. alba genetic backgrounds, a large deficit of P. tremula genotypes, and shifts in the centrality of the cline (primarily toward P. tremula; Figure 4; see Supplementary material). Homozygous genotypes at four loci (ORPM 202, ORPM 28, ORPM 137, ORPM 60) and heterozygotes at two loci (ORPM 344, ORPM 127) exhibited introgression frequencies that were slightly higher than neutral expectations (slope of smaller magnitude). Homozygotes at ORPM 220 were significantly underrepresented in hybrids, with a slope estimate that was much steeper than any in replicate simulations. The functions for homozygotes at six loci and heterozygotes at four loci had intercepts that were significantly smaller than those in 95% of simulations. For all but one of these loci, a smaller intercept along with the negative slope means that, relative to expectations, the curves were shifted toward lower hybrid index and P. tremula genotype frequencies (Figures 3 and 4). The exception is ORPM 28, for which the frequency of heterozygotes increases with hybrid index (positive slope), so that the smaller intercept corresponds to a shift in the cline toward P. alba.

Figure 3

The introgression of the predominant P. tremula allelic class (black solid lines – homozygotes, black dashed lines – heterozygotes) from individuals with low hybrid index (P. tremula) to those with high hybrid index (P. alba) varies among the sixteen most informative loci. Probabilities of observing homozygotes and heterozygotes are derived from multinomial logistic regressions. Gray regions indicate the 95% confidence envelope for logistic regressions fit to simulated data under the assumption of neutral introgression (dark gray – homozygotes, light gray – heterozygotes). Departure of the observed introgression rates from neutrality is evident graphically when fitted lines deviate from the confidence envelopes.

Figure 4

Estimates of the slope or intercept (black circles with 95% confidence intervals as black bars) from multinomial logistic regressions for several loci (underlined) fall outside of the 95% confidence intervals (CI) derived from 1000 replicate simulations of neutral introgression at each locus (gray boxes, median indicated by white line). Black vertical lines indicate 95% CIs for parameter estimates from observed data.

Discussion

Natural hybrid zones are likely to represent extreme cases for admixture genome-scanning compared to admixed human populations: allele frequency differences between divergent populations or species will be larger in most cases (Rieseberg et al., 1999; Rieseberg and Buerkle, 2002), genetic architectures of parental gene pools may differ due to differences in breeding systems, and the excess of rare alleles often found in hybrid zones (Schilthuizen et al., 1999) may contribute to background LD. Nevertheless, the most critical factors for admixture mapping are likely to hold across a wide range of organisms. These include: (1) the magnitude of allele frequency differences between the hybridizing populations, (2) levels of background LD (associations among unlinked markers) in the parental gene pools, (3) patterns of admixture and their effects on LD in the admixed population (Pfaff et al., 2001), and (4) the range of variation in marker ancestry (variation in introgression frequencies or cline shapes) across the genome. Here, we have tested these factors one-by-one in a natural hybrid zone between two ecologically divergent members of the ‘model tree’ genus Populus, P. alba and P. tremula. We focused on genomic segments subject to natural selection rather than segments linked to a particular trait. We use our data to assess the feasibility of admixture genome scanning in Populus and to discuss the potential and caveats of admixture mapping-related studies in hybrid zones of wild species.

Background LD in P. alba, P. tremula, and a natural hybrid zone

Assessing the strength of background LD is critical to genetic association studies because the strength of LD in the target gene pool will dictate both the power and error of association tests (Lynch and Walsh, 1998). Knowing the strength of background LD is especially important in ‘admixture mapping’ studies. LD induced by admixture will decay with each generation following the initial hybridization event (Chakraborty and Weiss, 1988; Briscoe et al., 1994; Pfaff et al., 2001) as a function of the recombination rate (θ) (Briscoe et al., 1994), which is proportional to genetic map distance in an idealized situation. However, recombination rates may vary across the target genome, and numerous other factors may affect LD in addition to θ, most notably population structure (Lynch and Walsh, 1998; Flint-Garcia et al., 2003). Hence, it is important to estimate the strength of background LD in parental and admixed populations before carrying out a full-blown genome-scan experiment. Whereas it would be desirable to include markers with different degrees of linkage (different physical distances along the chromosomes) in studies of LD, the necessary high-density genotypic data for linked markers are usually not available until an admixture genome scan has been completed. Thus, the present study utilized unlinked markers to obtain first data on background LD in two Populus species and their hybrids.
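The decay of admixture-induced LD described above can be illustrated with the textbook recursion in which disequilibrium shrinks by a factor of (1−θ) each generation. The sketch assumes a single admixture event followed by random mating, with no further gene flow, drift, or selection; the function name and starting value are hypothetical.

```python
def admixture_ld_after(d0, theta, generations):
    """LD remaining after a number of generations of random mating,
    D_t = D_0 * (1 - theta) ** t, where theta is the recombination rate."""
    return d0 * (1.0 - theta) ** generations


# Hypothetical starting disequilibrium for illustration only.
d_initial = 0.2
unlinked_after_3 = admixture_ld_after(d_initial, theta=0.5, generations=3)
linked_after_3 = admixture_ld_after(d_initial, theta=0.01, generations=3)
```

Under these assumptions, LD between unlinked loci (θ=0.5) halves every generation, whereas LD between tightly linked loci persists far longer. This contrast is what makes admixture LD informative for mapping, and it is also why residual LD among unlinked markers points to other causes such as population structure or ongoing gene flow.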

As expected based on the outbreeding mating system of our study taxa (wind-dispersed pollen and seeds), LD among unlinked markers was generally small to moderate in both species (Figure 1). Levels of LD were somewhat stronger in P. alba than in P. tremula (Figure 1a), in concordance with smaller effective population sizes in the former species (P. alba: Ne c. 500–550; P. tremula: Ne c. 550–700; Lexer et al., 2005). The variance component of LD due to genetic drift between subpopulations (DST2) was significantly larger than the within-subpopulation component (DIS2) in P. alba, but not in P. tremula (Figure 2). This indicates that, although no genetic structure was detectable among subpopulations with conventional and Bayesian-based methods, the patchy and fragmented distribution of the flood-plain pioneer P. alba may have allowed the build-up of small to moderate levels of LD due to drift, whereas the more continuous distribution of the mixed-forest-species P. tremula appears to have resulted in linkage equilibrium among independent markers. Our results indicate that levels of background LD in the two parental gene pools are low to moderate, and that most of the observed LD is due to drift rather than epistatic selection (Ohta, 1982).

An even more relevant question is whether LD in hybrids is conducive to admixture mapping experiments. Our results from a hybrid zone between P. alba and P. tremula in Central Europe indicate that LD among independent marker loci is pronounced in early generation hybrids (median=0.303 with rare alleles included in the dataset; 15.8% of tests significant at an experiment-wide level), and that LD decays markedly in later hybrid generations (see Supplementary material). In the most advanced backcross generations (hybrid index group 4), the median of LD is approximately 0.25, which is within the range of the two parental species. Of course, this decrease in LD with progressive hybrid generations is expected based on simulation studies of LD as a function of generations since admixture (Briscoe et al., 1994), and studies that model genomic block size in hybrid populations (Martinsen et al., 2001; Chapman and Thompson, 2002). Also, pollen records for warmth-loving tree species like P. alba (Huntley and Birks, 1983) suggest that the studied hybrid zone should already be 100–200 tree generations old, and thus LD should approach that of the pure parental gene pools for a substantial proportion of genotypes sampled in nature.

A regression-based method for detecting departures from neutral introgression at multi-allelic codominant markers

Admixture mapping as applied to humans has focused on genomic regions with an excess of ancestry in the direction of either parental population, for example, marker-location specific excess ancestry in cases and controls (Pfaff et al., 2001; Zhu et al., 2005). Similarly, the few existing admixture mapping-related studies in non-human taxa have focused on regions of the genome that may introgress more or less frequently than expected under neutrality (Rieseberg et al., 1999; Rieseberg and Buerkle, 2002). In either case, methods are needed to estimate ancestry at marker loci against the remainder of the genome and across a range of different genetic backgrounds, and these methods need to be applicable to a wide range of markers, population structures, and sampling schemes.

Excellent analytical tools for the genetic analysis of admixed populations have been created by evolutionary biologists studying ‘clines’ in hybrid zones (Barton and Hewitt, 1985; Barton and Gale, 1993). However, cline parameters are often difficult to measure where habitats are discontinuous and hybrid zones are patchy, as is the case for ‘mosaic hybrid zones’ (Harrison, 1990). Here, we have explored an alternative, multinomial regression-based approach. Our method uses models of the genotypes of individuals along a cline, as opposed to clinal models of pooled genotypes (i.e., genotype frequencies in populations) at a geographic location (Barton and Gale, 1993). Of course, variability in cline shape may also arise due to genealogical/spatial variation among neutral loci (Ibrahim et al., 1996; Klopfstein et al., 2006). In our approach, genealogical variation among loci is encompassed by computer simulations under the null hypothesis of neutrality. We estimated genotype frequencies at microsatellite loci as a function of the overall genomic composition of individual plants. We utilized logistic regression as in human admixture mapping (McKeigue et al., 2000; Hoggart et al., 2004; Reich et al., 2005; Zhu et al., 2005). In contrast to human studies, however, we did not use locus-specific excess ancestry as a predictor variable for a phenotypic trait such as disease status. Rather, we examined genotypes at individual marker loci as a function of genome-wide admixture, that is, the hybrid index. Computer simulations for each locus provided confidence intervals for expected introgression across the full range of genetic backgrounds (estimated via the hybrid index) and were compared to observed introgression for each locus. Our approach is thus equivalent to a genomic scan for chromosomal segments that move across the species barrier more or less frequently than expected under neutrality, which is a topic of major interest in evolutionary biology (Wu, 2001).
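The regression step described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for our analyses: genotypes are collapsed into three classes (0, 1 or 2 allele copies from one parental species), the hybrid index is scaled from 0 to 1, the model is fitted by plain gradient ascent, and the data are synthetic. The neutral simulation envelopes are not reproduced here. All function and variable names are hypothetical.

```python
import math
import random


def genotype_probs(coefs, h):
    """Multinomial-logit genotype probabilities at hybrid index h.

    Class 0 is the reference class (score fixed at 0); coefs holds one
    [intercept, slope] pair for each remaining genotype class.
    """
    scores = [0.0] + [a + b * h for a, b in coefs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_multinomial(hybrid_index, genotypes, n_classes=3, lr=0.5, n_iter=2000):
    """Fit the model by gradient ascent on the mean log-likelihood.

    Runs a fixed number of steps; there is no convergence check.
    """
    coefs = [[0.0, 0.0] for _ in range(n_classes - 1)]
    n = len(hybrid_index)
    for _ in range(n_iter):
        grads = [[0.0, 0.0] for _ in coefs]
        for h, g in zip(hybrid_index, genotypes):
            p = genotype_probs(coefs, h)
            for k in range(1, n_classes):
                err = (1.0 if g == k else 0.0) - p[k]
                grads[k - 1][0] += err        # intercept gradient
                grads[k - 1][1] += err * h    # slope gradient
        for c, gr in zip(coefs, grads):
            c[0] += lr * gr[0] / n
            c[1] += lr * gr[1] / n
    return coefs


# Synthetic example: genotype classes drawn with probabilities
# (1-h)^2, 2h(1-h), h^2, an assumption made only for this illustration.
rng = random.Random(1)
hybrid_index = [i / 119 for i in range(120)]
genotypes = []
for h in hybrid_index:
    weights = [(1 - h) ** 2, 2 * h * (1 - h), h ** 2]
    genotypes.append(rng.choices([0, 1, 2], weights=weights)[0])

coefs = fit_multinomial(hybrid_index, genotypes)
for h in (0.1, 0.5, 0.9):
    print(h, [round(p, 3) for p in genotype_probs(coefs, h)])
```

In an actual analysis, the fitted genotype probabilities for each locus would be compared with an envelope obtained by fitting the same model to data simulated under neutral introgression. A standard statistical package would normally replace the hand-written optimizer.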

Application of this approach to our data revealed large variation in introgression frequencies among individual marker loci (Figure 3), and the introgression patterns of several loci deviated from neutral expectations (see departures of point estimates from simulation envelopes in Figure 4). This included one locus for which homozygotes for P. tremula alleles are much rarer than expected in hybrids (ORPM 220, Figure 4). In several cases (e.g., ORPM 220 and 127), departures from neutral expectations were evident for homozygotes, but not heterozygotes, or vice versa (Figure 4). The differences may have a biological explanation, or may result from sample size and distributional differences between homozygotes and heterozygotes, and differences in the power to estimate functions with narrow confidence limits.

With respect to biological interpretations, it would be premature to interpret departures from neutrality in terms of linked candidate genes before a genome-wide marker scan has been completed. However, three conclusions may be drawn from our current data: Firstly, different regions of the P. tremula genome do introgress into P. alba at different rates, and these differences can be detected with the methods devised here. Secondly, alleles from P. tremula may sometimes introgress into P. alba at a rate slightly higher than that expected under neutrality, as exemplified by homozygotes at loci ORPM 202 and 60 (Figure 4). Much of speciation genetics has been focused on genetic factors that are negatively selected in hybrids (Dobzhansky, 1937; Butlin, 1989), but recent conceptual and empirical work suggests that chromosomal blocks within recombinant hybrids may sometimes experience positive selection (Barton, 2001; Rieseberg et al., 2003; Seehausen, 2004). Our results suggest that it will be feasible to detect positive selection on individual genomic blocks in hybrid zones of P. alba and P. tremula, if present, through the use of multi-allelic codominant markers. Thirdly, our approach can detect significant under-representation of P. tremula genotypes in hybrids and in the P. alba genetic background (ORPM 220; Figure 4). In addition, by examining genotype rather than allele frequencies, the regression method can reveal significant contrasts between the introgression of homozygotes and heterozygotes (e.g. ORPM 220), which may be due to dominance relationships among alleles at linked loci. One interpretation of the difference between homozygotes and heterozygotes at ORPM 220 is that P. tremula alleles linked to the marker are recessive to the P. alba alleles and experience negative selection in hybrids.

Outlook for ‘admixture mapping’ of adaptive or detrimental variation in Populus

Hybrid zones between P. alba and P. tremula meet the most critical requirements for admixture-based genetic analyses: microsatellite allele frequency differentials (δ) at most loci are substantial (Table 1; δ>0.3 is needed for admixture mapping), background LD in the parental gene pools is low to moderate (Figure 1) and should respond to sampling schemes that minimize drift and account for rare alleles, and LD in hybrids decays rapidly with increasing hybrid index, i.e., increasing number of backcross generations. The latter observation is of special interest.
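For readers unfamiliar with the allele frequency differential, the following sketch shows one common way to compute δ at a multi-allelic locus, namely half the sum of absolute allele frequency differences between the two parental gene pools. We assume this definition here for illustration. The allele labels and frequencies are hypothetical and are not taken from Table 1.

```python
def allele_frequency_differential(freqs_a, freqs_b):
    """Return delta = 0.5 * sum(|p_i - q_i|) over all alleles at a locus.

    freqs_a, freqs_b -- dicts mapping allele label to frequency in each
    parental population; alleles absent from a dict have frequency 0.
    """
    alleles = set(freqs_a) | set(freqs_b)
    return 0.5 * sum(
        abs(freqs_a.get(x, 0.0) - freqs_b.get(x, 0.0)) for x in alleles
    )


# Hypothetical microsatellite allele sizes and frequencies.
alba = {"180": 0.7, "184": 0.3}
tremula = {"184": 0.2, "190": 0.8}
delta = allele_frequency_differential(alba, tremula)
print(round(delta, 3), delta > 0.3)
```

With this definition δ ranges from 0 (identical allele frequencies) to 1 (no shared alleles), so a locus can be checked directly against the δ>0.3 threshold.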

We have shown previously (Lexer et al., 2005) that both early and late generation hybrids are present in this Central European hybrid zone. Hence, preferential sampling of recombinant early generation hybrids from the population would provide us with a sample of genotypes suitable for low-resolution mapping. This could, for example, include 80–100 markers to conduct a first admixture genome-scan at 15–20 cM intervals, based on studies of genomic block size in hybrid populations (Briscoe et al., 1994; Martinsen et al., 2001; Pfaff et al., 2001), age estimates for the hybrid zone based on fossil records for temperate trees (Huntley and Birks, 1983), and genome-length estimates for Populus (Bradshaw and Stettler, 1994; Frewen et al., 2000; Cervera et al., 2001).
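The arithmetic linking marker number to marker spacing can be sketched as below. The map length used is a placeholder chosen only to illustrate the calculation. It is not a genome-length estimate for Populus, and the function name is hypothetical.

```python
import math


def markers_needed(map_length_cm, spacing_cm):
    """Number of evenly spaced markers needed to cover a genetic map.

    map_length_cm -- total genetic map length in centimorgans
    spacing_cm    -- target interval between adjacent markers
    Ignores chromosome ends and uneven marker distribution.
    """
    return math.ceil(map_length_cm / spacing_cm)


# Placeholder map length, for illustration only.
example_map_length_cm = 1600
for spacing in (15, 20):
    print(spacing, markers_needed(example_map_length_cm, spacing))
```

In practice the required number would be somewhat higher, because markers are never perfectly evenly spaced and each chromosome needs coverage at both ends.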

In contrast, preferential sampling of advanced generation hybrids should provide us with material for high-resolution analyses. We note that ancestry from the donor genome (P. tremula) in the studied population is roughly 20% if advanced generations are included (Lexer and co-workers, unpublished data), which is comparable to admixture proportions utilized in human genetics (Zhu et al., 2005). Given the size of the Populus genome (2n=38; 550 Mb; 2C=1.1 pg), however, high-resolution mapping experiments are likely to require hundreds if not thousands of markers. This limitation probably cannot be overcome with microsatellite loci (roughly 4000 markers are available for P. trichocarpa, but only a fraction of these cross-amplify in P. alba and P. tremula). It is therefore likely that high-density SNP (Single Nucleotide Polymorphism) data will be required for high-resolution ‘admixture mapping’ in Populus.

Admixture LD among highly informative codominant markers in Populus hybrid zones should permit the precise estimation of marker ancestry in hybrids in a genomic context. This should make it possible to estimate the selective value (fitness effects) of individual chromosome blocks in admixed populations. Hybrid populations in other European river valleys may serve as independent ‘replicates’. Admixture should also allow the detection of associations among markers and quantitative phenotypic traits or ecological habitat factors. This would be of great interest not only for evolutionary biology but also for applied breeding programs, and first attempts to model the dependence of quantitative traits upon individual admixture (McKeigue et al., 2000; Hoggart et al., 2004) are encouraging.