Introduction

Endosperm, a result of double fertilization in flowering plants, is a triploid tissue whose genetic constitution is consequently more complex than that of common diploid tissue. Endosperm traits, such as protein and amino-acid content in wheat, amylose content and gel consistency in rice, sugar content in sweetcorn and starch and gum content in barley, are of great economic importance because they are directly related to grain quality. Mapping endosperm trait loci (ETL) can provide an efficient way to genetically improve grain quality (Hospital and Charcosset, 1997; Moreau et al., 1998; Peleman and Voort, 2003; Servin et al., 2004). However, quantitative trait loci (QTL) mapping methods are usually designed for traits that are under diploid control (Lander and Botstein, 1989; Haley and Knott, 1992; Martinez and Curnow, 1992; Jansen, 1993; Zeng, 1994; Kao et al., 1999; Xu, 2003, 2007; Zhang and Xu, 2005a, 2005b; Zhang, 2006). The development of a new method for mapping ETL is thus warranted.

The key to understanding the genetic architecture of endosperm traits is found in the study of the properties of individual genes and their interactions. However, classical statistical methodologies (Gale, 1976; Mo, 1987; Bogyo et al., 1988; Foolad and Jones, 1992; Pooni et al., 1992; Zhu and Weir, 1994) generally focus on partitioning the phenotypic variance of an endosperm trait into genetic and nongenetic (environmental) components, and limit the analysis of the genetic variation to the collective properties of genes. With the advent of molecular markers, QTL mapping became popular. Early QTL mapping used diploid methods to analyze endosperm traits (Tan et al., 1999; Wang and Larkins, 2001; Wang et al., 2001). This simple treatment failed to take into account the triploid nature of endosperm traits.

To overcome this problem, several approaches have been proposed. Wu et al. (2002a, 2002b) pointed out that diploid QTL mapping models require modification to encompass the trisomic inheritance of endosperm traits and the generation difference between a maternal plant and its corresponding endosperm. Such a model requires simultaneous use of two successive generations (a two-stage hierarchical design). Theoretically, compared with a single segregating generation (one-stage) design, this extracts more genetic information from both the maternal plant and its offspring embryo genomes and improves the resolution of ETL mapping. Xu et al. (2003) expressed the mean value of endosperm traits of F2:3 seeds as a dependent variable and the expectations of genotypic indicators for additive and dominant effects of a putative ETL as independent variables for iteratively reweighted least-squares mapping. Recently, Hu and Xu (2005) postulated that genetic expression of an endosperm trait may be controlled simultaneously by triploid endosperm and diploid maternal genotypes, and proposed a statistical method for ETL mapping that included maternal genetic effects. However, these methods have two problems. First, they handle only models with a single ETL. Only the effects of the putative ETL at the current position are included in the model; all other ETL effects are ignored. Thus, such a model gives biased estimates of the effects and positions of ETL when multiple and epistatic ETL (eETL) control the trait. Wu et al. (2002b) proposed a two-ETL genetic model to detect eETL, but theirs is not a true multiple eETL genetic model. Subsequently, Kao (2004) developed a method of triploid multiple interval mapping (MIM) that combined the triploid nature of endosperm with the diploid MIM of Kao et al. (1999).

Second, the existing methods do not produce unbiased estimates of the two dominant effects of ETL. If the genotype of a selfed plant is QQ (or qq), all the endosperms of the seeds on the plant will be QQQ (or qqq); if the genotype of the plant is Qq, the endosperms of its seeds will be an equal mixture of QQQ, QQq, Qqq and qqq (a proportion of 0.25 each). This means that the first and second dominant effects cannot be distinguished individually, only collectively, so the result is equivalent to that obtained from a diploid genetic model (Wen and Wu, 2006). Wen and Wu (2006) put forward a random hybridization design to estimate the two dominant effects of ETL without bias, but their method does not consider epistasis.
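The confounding described above can be checked by enumerating endosperm genotypes for a given cross. The following is a minimal sketch, assuming only that the endosperm receives two identical maternal allele copies (from the polar nuclei) and one paternal copy; the function and table names are illustrative and do not come from the original study.

```python
from fractions import Fraction
from itertools import product

# Maternal contribution: two identical allele copies (polar nuclei).
MATERNAL = {
    "QQ": {"QQ": Fraction(1)},
    "Qq": {"QQ": Fraction(1, 2), "qq": Fraction(1, 2)},
    "qq": {"qq": Fraction(1)},
}
# Paternal contribution: a single allele copy (sperm nucleus).
PATERNAL = {
    "QQ": {"Q": Fraction(1)},
    "Qq": {"Q": Fraction(1, 2), "q": Fraction(1, 2)},
    "qq": {"q": Fraction(1)},
}


def endosperm_distribution(mother, father):
    """Return {endosperm genotype: probability} for one mother x father cross."""
    dist = {}
    for (mat, p_mat), (pat, p_pat) in product(
        MATERNAL[mother].items(), PATERNAL[father].items()
    ):
        n_big_q = (mat + pat).count("Q")          # dosage of allele Q (0..3)
        label = "Q" * n_big_q + "q" * (3 - n_big_q)
        dist[label] = dist.get(label, 0) + p_mat * p_pat
    return dist


# Selfing a Qq plant: the four genotypes are always mixed in equal parts.
print(endosperm_distribution("Qq", "Qq"))
# Crossing different homozygotes: a pure QQq or pure Qqq line is obtained.
print(endosperm_distribution("QQ", "qq"))
print(endosperm_distribution("qq", "QQ"))
```

Under selfing, QQq and Qqq only ever appear together in fixed proportions, whereas crosses between different parental genotypes yield lines that differ in their QQq and Qqq content, which is what allows the two dominant effects to be separated.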

Epistasis, the interaction between QTL, plays an important role in the dissection of genetic architecture for complex traits (Phillips, 1998; Carlborg and Haley, 2004). To date, several approaches have been developed, including the MIM method (Kao and Zeng, 1997; Kao et al., 1999), the least-squares multiple regression model (Broman and Speed, 1999), the Bayesian shrinkage estimation method (Xu, 2003; Wang et al., 2005; Zhang and Xu, 2005b), stochastic search variable selection methodology derived from George and McCulloch (1993) (Oh et al., 2003; Yi et al., 2003a, 2003b), the unified Bayesian method (Yi, 2004), the penalized maximum likelihood (PML) method (Zhang and Xu, 2005a) and the empirical Bayes method (Xu, 2007; Xu and Jia, 2007). Most of these are feasible methods for identifying epistatic QTL. Although PML is an all-marker analysis method, it has some advantages. It is simple to use, its result is concise, its running time is much shorter than that of the Bayesian analysis method (Zhang and Xu, 2005a) and it has been proved to be very effective (Broman and Speed, 1999; Xu, 2003; Zhang and Xu, 2005a). Because of these advantages, we used the PML method in our study.

We attempted to detect triploid eETL using a random hybridization design and to estimate, without bias, all effects of eETL, using the PML method.

Method

Experimental design

To form a randomly hybridized population, the parental F2 population was divided into two groups (maternal and paternal) of equal size. The order of the F2 plants in each parental group was randomly permuted, and pairs of plants with corresponding order numbers in the two parental groups were crossed. This procedure was repeated until sufficient hybrid lines were obtained. For each hybrid line, the phenotypic value of the endosperm trait and molecular marker information were required. To obtain the phenotypic value of the trait, we measured the mixture of seeds on the maternal plant for each hybrid line to calculate the mean of the line. Molecular marker information was derived from diploid tissues rather than from the triploid endosperm, since the three genotypes MMM, MMm and Mmm cannot be distinguished from one another for dominant markers, nor can genotypes MMm and Mmm be distinguished for co-dominant markers (Wu et al., 2002b). We therefore predicted ETL behavior using marker information from the parental F2 plants. The endosperm trait means of the hybrid lines and the known marker genotypes of the parental F2 plants were used to map eETL.
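The pairing procedure can be sketched as follows. This is an illustration only: the text does not say how the F2 plants are assigned to the maternal and paternal groups, so the split into first and second halves of the list is an assumption, and the function name `random_hybrid_pairs` is hypothetical.

```python
import random


def random_hybrid_pairs(plant_ids, n_lines, seed=None):
    """Return n_lines (maternal, paternal) crosses from a list of F2 plant IDs.

    Assumption: the first half of plant_ids forms the maternal group and the
    second half the paternal group; the source only requires equal group sizes.
    """
    rng = random.Random(seed)
    ids = list(plant_ids)
    half = len(ids) // 2
    maternal, paternal = ids[:half], ids[half:2 * half]

    pairs = []
    while len(pairs) < n_lines:
        # Permute each group independently, then cross plants that share
        # the same order number in the two permuted groups.
        rng.shuffle(maternal)
        rng.shuffle(paternal)
        pairs.extend(zip(maternal, paternal))
    return pairs[:n_lines]


crosses = random_hybrid_pairs(range(200), n_lines=300, seed=1)
print(crosses[:3])
```

Because the permutation is repeated, the same pair of plants can in principle be drawn more than once; whether such duplicates should be discarded is not addressed in the text.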

Genetic model for random hybrid line mean of an endosperm trait

Let n be the number of random hybrid (RH) lines and m be the number of markers. We assume that there are no maternal effects affecting endosperm trait expression and that, in the RH population, there is one ETL residing on each marker in the entire genome with two different alleles (Q and q). All pair-wise eETL are considered. The mean of hybrid line j, yj, for the trait is described by the following genetic model

where μ is the population mean; ak is the additive effect for locus k, which measures the average effect of substituting Q for q; dk1 (dk2) is the first (second) dominant effect for locus k, which measures the departure of the substitution effect in QQ (qq) background; i.. is the epistatic effect between two loci (Kao, 2004); ɛj is the residual error with an assumed N (0, σ2) distribution; and x, z1 and z2 are dummy variables taking values depending on the genotype combination of the two parental F2 plants randomly hybridized (Table 1).

Table 1 Values of dummy variables for x, z1 and z2 in random hybridization design of F2 plants

We now use l to index the lth genetic effect (the additive, the first and second dominant and epistatic effects) for l=1, …, q. We can rewrite model (1) as

where b0=μ, q=1.5m (3m−1),

and xl′={x1l′, …, xnl′}T is an n × 1 incidence vector corresponding to the effect bl (∀l=1, …, q).
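For reference, model (2) and the count of effects q can be written out from the definitions above. This is a sketch rather than a transcription of the original display: it assumes three main effects per marker (a, d1 and d2) and nine pair-wise epistatic effects for each pair of markers, which agrees with the 15 effects tested for a pair of loci in the Statistical test section.

```latex
% Model (2) in the notation of the text
y_j = b_0 + \sum_{l=1}^{q} x'_{jl}\, b_l + \varepsilon_j,
\qquad j = 1, \dots, n .

% Number of genetic effects for m markers
q = \underbrace{3m}_{a_k,\; d_{k1},\; d_{k2}}
  + \underbrace{9\binom{m}{2}}_{\text{pair-wise epistasis}}
  = 3m + \tfrac{9}{2}\, m (m - 1)
  = 1.5\, m\, (3m - 1).
```

With m = 21 markers this gives q = 1953, and with m = 96 markers it gives q = 41 328.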

Parameter estimation

The PML method (Zhang and Xu, 2005a) was used to estimate the parameters in model (2). The method is briefly described here; for technical detail the reader is referred to the original study (Zhang and Xu, 2005a).

In the PML method, the objective function to be maximized for parameter estimation is the penalized likelihood function, that is, the product of the likelihood function L(θ∣Y, M) and the penalty function P(θ, ξ). The former is

where Y=(y1, y2, …, yn)T, M is marker information, and ϕ (yj; μj, σ2) is a normal probability density function with mean μj and variance σ2; the latter is

where θ=(b0, b1, …, bq, σ²), ξ=(μ1, …, μq, σ1², …, σq²) is the vector of hyperparameters, and η>0 is the prior sample size for assessing μk. Therefore, the penalized likelihood function is

The PML estimates for both model parameters and hyperparameters are

The procedures for parameter estimation are the same as those used by Zhang and Xu (2005a).
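The quantities defined in this section can be summarized as follows. This is a sketch assembled from the verbal definitions above; the symbol ψ for the penalized likelihood is introduced here for illustration only, and the penalty function P(θ, ξ) is left in general form because its explicit expression is given in Zhang and Xu (2005a). Note that μj below denotes the expected mean of line j, which is distinct from the hyperparameters μk.

```latex
% Expected mean of hybrid line j under model (2)
\mu_j = b_0 + \sum_{l=1}^{q} x'_{jl}\, b_l

% Likelihood function
L(\theta \mid Y, M) = \prod_{j=1}^{n} \phi\!\left(y_j;\ \mu_j,\ \sigma^2\right)

% Penalized likelihood: likelihood times penalty
\psi(\theta, \xi) = L(\theta \mid Y, M)\; P(\theta, \xi)

% PML estimates of parameters and hyperparameters
(\hat{\theta}, \hat{\xi}) = \arg\max_{\theta,\ \xi}\ \psi(\theta, \xi)
```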

Statistical test

As noted by Zhang and Xu (2005a), the usual likelihood ratio test (LRT) cannot be performed with the PML method because of overparameterization. We therefore adopted their two-stage selection process to screen the markers (Zhang and Xu, 2005a). In the first stage, all effects with ∣b̂/σ̂∣ > 10⁻⁶ are retained. In the second stage, the epistatic genetic model is modified so that only effects that passed the first round of selection are included in the model. Owing to the smaller dimensionality of the modified model, we can use the maximum likelihood method to reanalyze the data and perform the LRT. The procedure for the LRT is as follows.
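The first-stage screen amounts to a simple filter on the standardized PML estimates. The sketch below assumes that a single scale estimate `sigma_hat` is used for every effect, which the text does not state explicitly; the function name `screen_effects` is hypothetical.

```python
def screen_effects(b_hat, sigma_hat, threshold=1e-6):
    """Return 1-based indices l of effects with |b_l / sigma_hat| > threshold.

    b_hat     : PML estimates b_1, ..., b_q (the intercept b_0 is excluded)
    sigma_hat : scale estimate used for standardization (assumed common
                to all effects in this sketch)
    """
    return [
        l for l, b in enumerate(b_hat, start=1)
        if abs(b / sigma_hat) > threshold
    ]


# Effects shrunk to (nearly) zero are dropped; the rest go on to stage two.
print(screen_effects([0.0, 2.1, 1e-9, -0.5], sigma_hat=1.0))
```

Only the indices returned by the screen would then be carried into the reduced model that is refitted by maximum likelihood.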

The overall null hypothesis is no effect of ETL at the locus of interest, denoted by H0: a=d1=d2=0 or H0: Lu=0, where L={1 0 0; 0 1 0; 0 0 1} and u={a d1 d2}T. If we determine the maximum likelihood estimates of the parameters under the restriction of Lu=0 and calculate the log-likelihood value of the solutions with this restriction, we have L(θ̂∣Lu=0). At the same time, we can also evaluate the log-likelihood value of the solutions without restriction and obtain L(θ̂). Therefore, the LRT statistic is LR=−2 [L(θ̂∣Lu=0)−L(θ̂)].

Various other statistical tests can be carried out by redefining the L matrix. To test the hypothesis of H1: a=0, for example, we define L1={1 0 0}. The LRT statistic is LR1=−2 [L(θ̂∣L1u=0)−L(θ̂)].

For eETL, we may define L=diag ({1 1 1 1 1 1 1 1 1 1 1 1 1 1 1})15 × 15 and u=b. In the same way, the significance of epistatic effects can be tested. The significance threshold of log of the odds (LOD) score is set at 3.0 where LOD=LR/4.605.
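The conversion between the LRT statistic and the LOD score can be sketched as below. The constant 4.605 is 2 ln 10; the function names are illustrative, and the log-likelihood values in the usage lines are made-up numbers.

```python
import math


def lrt_statistic(loglik_restricted, loglik_full):
    """LR = -2 [ L(theta_hat | Lu = 0) - L(theta_hat) ] on the log scale."""
    return -2.0 * (loglik_restricted - loglik_full)


def lod_score(lr):
    """LOD = LR / (2 ln 10), i.e. LR / 4.605."""
    return lr / (2.0 * math.log(10.0))


# Hypothetical log-likelihoods for a restricted and an unrestricted fit.
lr = lrt_statistic(loglik_restricted=-512.4, loglik_full=-503.1)
print(lr, lod_score(lr), lod_score(lr) >= 3.0)
```

A LOD threshold of 3.0 therefore corresponds to an LRT statistic of about 13.8.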

Simulation studies

Genetic design

We simulated RH populations with a sample size of 300 in most cases. Twenty-one equally spaced markers were simulated on three chromosome segments with a total length of 360 cM. We simulated three main-effect ETL and one pair-wise interaction effect, all of which overlapped with markers. Each of the three ETL was located at the center (60 cM) of a chromosome. Their genetic parameters were: a1=2.0 (marginal variance 5.00), d11=5.2 (marginal variance 5.07) and d12=−5.2 (marginal variance 5.07) for the first ETL; a2=3.0 (marginal variance 11.25), d21=3.0 (marginal variance 1.69) and d22=0.0 (marginal variance 0.00) for the second ETL; a3=1.0 (marginal variance 1.25) and d31=d32=0.0 (marginal variance 0.00) for the third ETL. The eETL was the additive-by-additive interaction between the second and third ETL, and its effect was set to 1.50 (marginal variance 3.52). The marginal genetic variances explained by the three main-effect ETL were 23.72, 15.19 and 1.25, respectively (Appendix). The total genetic variance for the endosperm trait (σg²) was 43.67. The environmental variance was calculated as σe²=(1−h²)σg²/h², with the heritability h² set at 0.50 in most cases. A mixture of ten seeds from each maternal plant was simulated for each hybrid line to obtain the line mean of the endosperm trait. To investigate the performance of the proposed method, different cases were considered. Each case was replicated 200 times. For each simulated ETL, we counted the samples in which the LOD statistic exceeded 3.0. A detected ETL within 20 cM of the simulated ETL was considered a true ETL. The ratio of the number of such samples to the total number of replicates (200) represented the empirical power for this ETL. The false-positive rate was calculated as the ratio of the number of false-positive effects to the total number of zero effects considered in the multiple-ETL genetic model.
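The two bookkeeping formulas in this design, the environmental variance for a given heritability and the empirical power with a 20 cM detection window, can be sketched as follows. The function names and the toy replicate data are illustrative, and the sketch assumes that the detected positions passed in have already met the LOD threshold.

```python
def environmental_variance(genetic_variance, heritability):
    """sigma_e^2 = (1 - h^2) * sigma_g^2 / h^2."""
    return (1.0 - heritability) * genetic_variance / heritability


def empirical_power(detected_per_replicate, true_position, window=20.0):
    """Fraction of replicates with a significant ETL within `window` cM.

    detected_per_replicate : one list of detected positions (cM, same
                             chromosome as the simulated ETL) per replicate
    """
    hits = sum(
        any(abs(pos - true_position) <= window for pos in positions)
        for positions in detected_per_replicate
    )
    return hits / len(detected_per_replicate)


for h2 in (0.20, 0.40, 0.50, 0.60, 0.80):
    print(h2, round(environmental_variance(43.67, h2), 2))

# Three toy replicates: a hit at 60 cM, a miss at 100 cM, nothing detected.
print(empirical_power([[60.0], [100.0], []], true_position=60.0))
```

With σg² = 43.67, a heritability of 0.50 gives σe² = 43.67, and a heritability of 0.20 gives σe² = 174.68.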

Effect of ETL heritability on results of ETL mapping

In the first simulation experiment, we studied the effect of ETL heritability on the results of ETL mapping. The parameters simulated in this experiment, with the exception of ETL heritability, were as described in the section on genetic design. By changing the size of the residual variance, the total heritability for an endosperm trait was set at four levels: 0.20, 0.40, 0.60 and 0.80. The true and estimated values for the effects and the positions of ETL, along with the empirical powers in the detection of ETL, are listed in Table 2. As expected, the precision of the estimates of the effects and positions of ETL and the empirical power increase as the heritability increases. Note that the estimates for most of the effects and positions of ETL are unbiased; all coefficients of variation (CV) are below 30%; and the CV falls below ∼10% when the marginal variance of a genetic effect accounts for >5% of the total phenotypic variance. We also noted that, in the case of 0.20 heritability, the powers in the detection of d21, a3 and the additive-by-additive epistatic effect are relatively low, owing to the low proportions of phenotypic variance explained by these effects (0.78, 0.57 and 0.69%, respectively). In addition, the false-positive rate is low.

Table 2 Effect of ETL heritability on results of ETL mapping in random hybridization design of F2 plants (200 replicates)

Effect of sample size on ETL mapping

In the second experiment, we evaluated the effect of sample size on the results of ETL mapping. By changing the number of RH lines, sample size was set at five levels: 100, 200, 400, 600 and 1000. The results from the simulated experiments are listed in Table 3. They show the general behavior of QTL mapping: as sample size increases, the result improves (as judged by the decrease in the standard deviation and the increase in empirical power). When sample size is above 400, accurate estimates and high power can be achieved, even for the small genetic effects d21, a3 and the additive-by-additive epistatic effect (marginal heritabilities of 1.95, 1.44 and 1.73%, respectively).

Table 3 Effect of sample size on results of ETL mapping in random hybridization design of F2 plants (200 replicates)

Effect of the number of seeds per plant on ETL mapping

This simulation experiment evaluated the effect of the number of seeds per maternal plant on the results of ETL mapping. We set the number of seeds per plant at five levels: 1, 3, 5, 10 and 20. The results are given in Table 4. We found that, when the number of seeds per plant was more than 10, all parameters were accurately and precisely estimated. Indeed, the power was high even when there were only three seeds. The method is therefore robust to the number of seeds sampled per plant.

Table 4 Effect of the number of seeds per maternal plant on results of ETL mapping in random hybridization design of F2 plants (200 replicates)

Effect of sampling strategy on ETL mapping

The effect of sampling strategy on the results of ETL mapping was investigated. We evaluated five schemes of sampling strategy: 600 × 5 (5 seeds were sampled from each of 600 F2 maternal plants), 300 × 10, 200 × 15, 150 × 20 and 100 × 30. The results of 200 replicated simulations are summarized in Table 5. We observed the expected trend of an increase in power as the number of hybrid lines increased; the number of hybrid lines was more important than the number of seeds per maternal plant. The reason for this may be that a larger number of hybrid lines can provide more marker information.

Table 5 Effect of sampling strategy on results of ETL mapping in random hybridization design of F2 plants (200 replicates)

A simulated example of a large genome

Finally, we simulated a large genome 1260 cM long to explore the performance of the proposed method under conditions closer to real data analysis. The genome consisted of 12 chromosomes, each covered by eight evenly spaced markers at 15 cM intervals. The simulated parameters are listed in Table 6 for main effects and in Table 7 for epistatic effects. By changing the size of the residual variance, the total heritability for an endosperm trait was set at 0.60. The total number of ETL effects included in the model was 1.5 × 96 × (3 × 96−1)=41 328. We increased the sample size to 600, so the number of effects was about 69 times as large as the sample size. Such a model is too large to be fitted directly, so a two-stage method was used. In the first stage, the full model, which included all of the main and pair-wise epistatic effects, was divided into many reduced models, each with all of the main effects and a portion of the epistatic effects. It was feasible to estimate the parameters of each reduced model using the PML method. In this way, the effects that differed from zero could be identified. In the second stage, we modified our epistatic genetic model so that only effects that passed the first round of selection were included, and used the PML method to reanalyze the data. The results are listed in Tables 6 and 7. They show that all ETL are detected with the exception of an eETL with a dominant-by-dominant effect, and that the effects and positions of the detected ETL are close to their corresponding true values. For the undetected eETL, the genetic variance explained by its effect is relatively low. In addition, three false-positive eETL with additive-by-additive epistatic effects were identified. However, their effects are small, and their LOD values for the LRT are about 5 (data not shown), much lower than those for true ETL. Thus, the new method works well.
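The first stage of this procedure can be illustrated by enumerating the effects and splitting the epistatic ones into batches. This is a sketch under stated assumptions: the text does not say how the epistatic effects were allocated to the reduced models, so the consecutive batching and the `batch_size` value below are hypothetical, as are the function names.

```python
from itertools import combinations, product

EFFECT_TYPES = ("a", "d1", "d2")


def effect_labels(m):
    """Enumerate main and pair-wise epistatic effects for m markers."""
    main = [(k, t) for k in range(m) for t in EFFECT_TYPES]
    epistatic = [
        ((k, s), (l, t))
        for k, l in combinations(range(m), 2)
        for s, t in product(EFFECT_TYPES, repeat=2)
    ]
    return main, epistatic


def reduced_models(m, batch_size):
    """Yield effect lists: all main effects plus one batch of epistatic effects."""
    main, epistatic = effect_labels(m)
    for start in range(0, len(epistatic), batch_size):
        yield main + epistatic[start:start + batch_size]


main, epistatic = effect_labels(96)
print(len(main), len(epistatic), len(main) + len(epistatic))

# Hypothetical batch size chosen so each reduced model has 600 effects.
models = list(reduced_models(96, batch_size=312))
print(len(models), max(len(model) for model in models))
```

For 96 markers there are 288 main effects and 41 040 epistatic effects, which matches the total of 41 328 given above. Each reduced model would be fitted by PML, and the union of the retained effects would form the second-stage model.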

Table 6 Simulated and estimated ETL positions and effects from a single data set of a large genome
Table 7 Simulated and estimated positions and effects of interacting ETL from a single dataset of a large genome

Discussion

Genetic improvement of grain production and quality is a major aim in plant breeding. Endosperm is a main part of the grain seed and many endosperm traits are directly related to grain quality, so endosperm traits are of great importance. To uncover their genetic architecture, several methods of mapping ETL have been proposed (Wu et al., 2002a, 2002b; Xu et al., 2003; Kao, 2004; Hu and Xu, 2005; Wen and Wu, 2006). These triploid-based methods are all superior to diploid methods for ETL mapping. The method described here, however, offers advantages over the existing triploid-based methods. Like the method of Kao (2004), it allows for a model that includes all main and pair-wise epistatic effects, in contrast to other methods in which only a single-ETL genetic model is considered (Wu et al., 2002a, 2002b; Xu et al., 2003; Hu and Xu, 2005; Wen and Wu, 2006). In our new model, the presence of linked ETL or eETL does not bias the estimates. However, our method differs from that of Kao (2004), in which genetic model determination relies on the adoption of a critical statistic whose true distribution is very difficult to determine. The usual technique is the permutation test (Churchill and Doerge, 1994; Kao, 2004), which is very time consuming. In our new method, model selection is unnecessary, and the best model can always be captured (Zhang and Xu, 2005a). Like the method of Wen and Wu (2006), our method can provide unbiased estimates for the first and second dominant effects and, in addition, for the corresponding epistatic effects. However, their method handles only a model with a single ETL. In addition, our method is economical and easy to implement. Although Wu et al. (2002b) and Kao (2004) proposed a more advanced two-stage design (with marker information collected from the maternal plant and the seed embryo), it is difficult to put into practice because of technical difficulty, imprecise single-seed phenotype measurement and the high cost of marker assays. In our method, bulked endosperm trait measurement is used for phenotype data, and F2 plant tissue for marker data.

Another major concern is how the PML method deals with a multiple ETL model that potentially can assume one ETL residing on each marker position. A number of questions arise in this regard. First, what are those markers’ false-positive rates? The results in Tables 2, 3, 4 and 5 indicate that if a marker is not associated with a trait, its genetic effect on the locus shrinks to nearly zero. The same result is seen in the simulated experiment with a large genome, and in Zhang and Xu (2005a). Therefore, the false-positive rate is low.

Second, how do we analyze real data? The procedure necessitates pretreatment to deal with dominant and missing markers and with marker density. Marker imputation techniques may be used when marker data are incomplete (Xu, 2007). They involve the calculation of the conditional probability of marker genotypes using a multipoint method (Jiang and Zeng, 1997), and the sampling of a complete imputed data set for the marker genotypes. Usually, 10–20 imputed data sets are generated (Sen and Churchill, 2001; Xu, 2007). The reported result is the mean of the estimates across the imputed data sets. When marker density is too high, choosing one marker from each cluster of markers avoids a high degree of multicollinearity (Zhang and Xu, 2005a). When markers are too sparse, a virtual marker (treated as missing data) may be inserted.
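The pooling step over imputed data sets is a plain average of the per-data-set estimates, as in the following sketch. The function name and the numbers are illustrative only; the imputation itself (multipoint conditional probabilities and sampling) is not shown.

```python
def pool_imputed_estimates(estimates_per_dataset):
    """Average effect estimates across imputed data sets.

    estimates_per_dataset : list of equal-length lists, one per imputed
                            data set, each holding the estimated effects
    """
    n_sets = len(estimates_per_dataset)
    return [sum(column) / n_sets for column in zip(*estimates_per_dataset)]


# Three hypothetical imputed data sets, two effects each.
print(pool_imputed_estimates([[2.0, -0.4], [2.2, -0.6], [1.8, -0.5]]))
```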

Third, is the number of markers that can be applied using the PML method limited? It is preferable to gather more samples or reduce the number of effects considered in the model (Zhang and Xu, 2005a; Hoti and Sillanpää, 2006). If the number of markers is large, however, the number of effects in the model is enormous—more than 40 000 in the simulated experiment with a large genome. In this case, a two-stage method, taking about 22 h, is recommended. The results in Tables 6 and 7 show that this works well, and a further study is under way.

Fourth, how can we fine-map ETL? Although our method, a type of marker analysis, is inadequate for fine-mapping, its strategy has been proved to be very effective (Broman and Speed, 1999; Xu, 2003; Zhang and Xu, 2005a), and we can use the result derived from this method as a starting point for other methods based on a multiple-ETL model, such as the method of Kao (2004). Combining the two methods can provide stable model determination and high resolution. Moreover, an extension to ETL with epistatic effects that makes use of the PML framework is under way and may be used to fine-map ETL.

It should be noted that in our study an additive-by-additive effect was simulated for most cases. This is because the effect has a relatively high proportion of genetic variance (Appendix) and is easily detected. Larger sample sizes are recommended to explore other kinds of epistatic effects.