Introduction

Linear mixed model (LMM) methodology is a powerful tool for analyzing models that contain both fixed and random effects. The model was first proposed to estimate genetic parameters from unbalanced data (Henderson, 1950). The technique has since been used to map genes controlling the variation of quantitative traits (Xu and Yi, 2000; Boer et al, 2007). However, the LMM methodology cannot be directly applied to traits with discrete distributions. Wedderburn (1974) proposed a linear predictor and a link function to handle discrete traits. The linear predictor is simply a linear model combining information from the independent variables. The link function describes the relationship between the linear predictor and the expectation of the response variable. This approach eventually led to a special area of statistics called the generalized linear model (GLM) (McCullagh and Nelder, 1989).

Xu and Hu (2010) recently developed a GLM approach to interval mapping (IM) for traits with discrete distributions. The purpose of that study was to investigate the efficiencies of two methods for handling missing genotypes: (1) the heterogeneous residual variance method and (2) the mixture model method. In the first method, we replaced the missing genotypes of quantitative trait loci (QTL) by the conditional expectations of the genotype indicator variables and then took into account the heterogeneous residual variances of different genotypes caused by heterogeneous information content. In the second method, we fully utilized the conditional distributions of the missing genotypes. Theoretically, the mixture model approach should be optimal. In practice, the heterogeneous residual variance method is more efficient because it is robust to departures from the assumed normal distribution of the residuals. In contrast, the mixture model is very sensitive to departures from the assumed distribution and to the choice of initial parameter values. These missing-genotype-handling methods have not been applied to multiple QTL mapping (MQM).

When the number of QTL included in a model reaches a certain level, for example, when the number of QTL exceeds the sample size, the model is oversaturated. In this case, some kind of penalty is required to shrink the superfluous QTL effects to zero. The penalty is imposed by treating each QTL effect, say that of QTL k, as a random effect with an N(0, σk²) distribution. When the linear predictor contains both fixed and random effects, the model is called the generalized LMM (GLMM) (Breslow and Clayton, 1993; McCulloch and Neuhaus, 2005). Special algorithms have been developed to estimate variance components and predict the random effects, for example, the pseudo likelihood algorithm (Wolfinger and O'Connell, 1993). However, existing GLMM methods have not fully considered the missing genotype problem.

In this study, we extended the GLM for IM of QTL (Xu and Hu, 2010) to a GLMM for MQM. The difference between IM and MQM is that IM uses a model that contains only one QTL effect at a time (so a whole-genome analysis requires many separate single-effect analyses), whereas MQM estimates all QTL simultaneously in a single model. Although Xu and Hu (2010) developed two methods for GLM analysis, we only examined the heterogeneous residual variance method. The mixture model did not offer any advantage over the heterogeneous residual variance method (Xu and Hu, 2010), and thus is not examined in this study. In addition, we evaluated a simple method called the expectation method, in which the missing genotypes of QTL are simply replaced by the conditional expectations of the genotype indicator variables. The method that Xu and Hu (2010) called the heterogeneous residual variance method is referred to here as the overdispersion method. We believe that overdispersion is a more appropriate term in the context of GLMM.

Methods

Generalized linear mixed model

We use a binomial trait as an example to demonstrate the new methodology, although the method can be applied to other discrete traits. Let yj be the number of events and tj be the number of trials for individual j from a population of n individuals. Let E(yj/tj)=μj be the expectation of the binomial trait. Define ηj=Φ−1(μj) as a linear predictor with the probit link function. The linear predictor is a function of marker genotypes, as described below,

$$\eta_j=\beta+\sum_{k=1}^{m}Z_{jk}\gamma_k,$$

where β is the intercept, γk is the effect for marker k, Zjk is an independent variable determined by the genotype of marker k of individual j and m is the total number of markers included in the model. In a later section, markers are replaced by pseudo markers. Each marker is then considered as a putative QTL. Therefore, we may call m the number of putative QTL. Details about Zjk will be described later.
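As a small illustration of the probit link, the sketch below computes the linear predictor for one individual and maps it to the binomial expectation μj=Φ(ηj). The function name `probit_mean` and the numerical values are hypothetical and chosen only for illustration; they do not come from the analysis.

```python
import math

def probit_mean(beta, gamma, z_row):
    """Return mu_j = Phi(eta_j), where eta_j = beta + sum_k Z_jk * gamma_k."""
    eta = beta + sum(z * g for z, g in zip(z_row, gamma))
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

# Two loci whose effects cancel: eta = 0, so the expectation is 0.5.
mu_zero = probit_mean(0.0, [0.5, -0.5], [1, 1])
# Intercept 0.2 plus one A1A1 genotype with effect 0.5: eta = 0.7.
mu_pos = probit_mean(0.2, [0.5, -0.3], [1, 0])
```

Because Φ is monotone, a positive γk raises the expected proportion for A1A1 carriers (Zjk=+1) and lowers it for A2A2 carriers (Zjk=−1).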

When m is large, say m>n, the model is oversaturated and the parameter solutions are not unique. To overcome this problem, a penalty should be placed on the QTL effects. Ridge regression (Hoerl and Kennard, 1970) is often used as a penalized regression analysis. It corresponds to the L2 penalty (Zou, 2006; Friedman et al, 2010), in which γk is treated as a random effect and further described by an N(0, σk²) distribution. Once γk is treated as a random effect, it becomes a random variable and thus does not reduce the degrees of freedom of the residual. In addition, the zero-mean distribution serves as a ‘prior’ belief of no effect from the Bayesian point of view. These are the reasons why a mixed model can handle a very large number of regression coefficients once the coefficients are treated as random effects. The intercept β is treated as a fixed effect (no distribution is assigned) because we do not want to penalize a model based on the size of the intercept. The linear predictor includes both the fixed effect (β) and the random effects (γ), and the model is thus called a mixed model. The least absolute shrinkage and selection operator (Lasso) method developed by Tibshirani (1996) is another penalized regression analysis, corresponding to the L1 penalty. We will not pursue the Lasso approach because it is beyond the scope of the GLMM.

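Let us denote all QTL effects by an m × 1 vector $\gamma=[\gamma_1,\ldots,\gamma_m]^{T}$ and denote the multivariate normal density of γ by p(γ∣G)=N(γ∣0,G), where $G=\mathrm{diag}\{\sigma_1^2,\ldots,\sigma_m^2\}$ is a diagonal matrix of the variance components. This special notation for a probability density, p(γ∣G)=N(γ∣0,G), is adopted from Gelman et al. (2004). It represents both the distribution and the density, that is,

$$p(\gamma\mid G)=\prod_{k=1}^{m}\frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\left(-\frac{\gamma_k^2}{2\sigma_k^2}\right).$$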

Conditional on ηj=β+Zjγ, the binomial distribution for yj is

$$p(y_j\mid\eta_j)=\binom{t_j}{y_j}\mu_j^{y_j}(1-\mu_j)^{t_j-y_j},\qquad \mu_j=\Phi(\eta_j).$$

When γ are treated as random effects, they are no longer considered parameters of the GLMM, although they remain important genetic parameters in terms of QTL mapping. The parameters are now θ={β,G}. Conditional on η=β+Zγ, the joint probability of the binomial trait for the entire sample is

$$p(y\mid\eta)=\prod_{j=1}^{n}p(y_j\mid\eta_j).$$

The likelihood function for the parameter vector θ={β,G} is proportional to the following probability

$$p(y\mid\theta)=\int p(y\mid\eta)\,p(\gamma\mid G)\,d\gamma,$$

where the integration is taken with respect to γ. The integral is multivariate and no explicit expression exists. The log likelihood function for parameter θ={β,G} is defined as

$$L(\beta,G)=\ln\int p(y\mid\eta)\,p(\gamma\mid G)\,d\gamma,$$

and thus also does not have an analytical expression. The maximum likelihood estimate of θ={β,G} would be obtained by solving

$$\frac{\partial L(\beta,G)}{\partial\theta}=0$$

if L(β,G) had an explicit expression. A pseudo likelihood algorithm was developed to solve for the parameters (Wolfinger and O'Connell, 1993). Laplace approximation has also been used to replace the integral (Vonesh, 1996). In this study, we adopted a simple method that does not involve numerical integration. This method is called MAP estimation, as described below.

MAP estimation

MAP stands for maximum a posteriori (DeGroot, 2004), a term from Bayesian analysis. Our GLMM is a frequentist approach if we treat {β,G} as parameters. However, if we consider {β,γ} as parameters and treat G as a prior variance matrix for γ, the problem becomes a Bayesian problem and parameter estimation can be achieved under the Bayesian framework. In a typical Bayesian problem, the parameters in the prior should be provided by the investigator before the data analysis. It is hard to provide a prior value for G, and thus we must estimate G from the data. Once G is estimated from the data, the problem is more like a mixed model problem. Therefore, the difference between the Bayesian model and the GLMM becomes blurred. We may consider the MAP algorithm as a simplified approach to estimating parameters under the GLMM framework (see McGilchrist 1994). We will first provide the MAP estimation and then show the difference between the MAP estimates and the ML estimates.

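Unlike the ML estimation, in which the target function for maximization is L(β,G), the MAP estimation maximizes the log posterior function defined as

$$L(\xi)=L_1(\beta,\gamma)+L_2(\gamma,G),$$

where

$$L_1(\beta,\gamma)=\sum_{j=1}^{n}\ln p(y_j\mid\eta_j)$$

and

$$L_2(\gamma,G)=\ln p(\gamma\mid G)=-\frac{1}{2}\sum_{k=1}^{m}\left[\ln(2\pi\sigma_k^2)+\frac{\gamma_k^2}{\sigma_k^2}\right].$$

The MAP estimate of ξ={β,γ,G} is obtained by setting $\partial L(\xi)/\partial\xi=0$ and solving for ξ. The iteration process is summarized in the following sequence.

Step (1): Set t=0 and initialize all parameters, $\xi^{(0)}$.

Step (2): Update β using

$$\beta^{(t+1)}=\beta^{(t)}-\left[\frac{\partial^2 L(\xi^{(t)})}{\partial\beta^2}\right]^{-1}\frac{\partial L(\xi^{(t)})}{\partial\beta}.$$

Step (3): Update γk for k=1,…,m using

$$\gamma_k^{(t+1)}=\gamma_k^{(t)}-\left[\frac{\partial^2 L(\xi^{(t)})}{\partial\gamma_k^2}\right]^{-1}\frac{\partial L(\xi^{(t)})}{\partial\gamma_k}.$$

Step (4): Update σk² for k=1,…,m using

$$\sigma_k^{2(t+1)}=\left(\gamma_k^{(t+1)}\right)^2.$$

Step (5): Repeat Steps (2) to (4) until the sequence converges.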

Note that Steps (2) and (3) each perform a single iteration of the Newton–Raphson algorithm (Ypma, 1995). The MAP approach for GLMM was first proposed by McGilchrist (1994). It is a much simplified algorithm that avoids multiple integration. The original MAP of McGilchrist (1994) did not address the missing value problem, which is dealt with in a later section of this study.
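The iteration in Steps (1) to (5) can be illustrated with a short sketch. This is not the SAS/IML program used in the study; it is a minimal illustration under stated assumptions: the second derivatives in Steps (2) and (3) are replaced by their expectations (Fisher scoring, a common substitute for Newton–Raphson), the variance update is taken to be σk²=γk² with a small floor to avoid division by zero, and genotypes are fully observed. The function name `map_fit`, the floor, the fixed number of sweeps and the toy data are all hypothetical.

```python
import math
import random

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF, clipped away from 0 and 1 for stability."""
    p = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return min(max(p, 1e-10), 1.0 - 1e-10)

def map_fit(y, t, Z, sweeps=50, floor=1e-8):
    n, m = len(y), len(Z[0])
    beta, gamma, sigma2 = 0.0, [0.0] * m, [1.0] * m   # Step (1)
    eta = [beta] * n

    def score_and_weight(j):
        # Derivative of the binomial-probit log likelihood w.r.t. eta_j
        # and its expected information (assumed Fisher-scoring weight).
        mu, d = Phi(eta[j]), phi(eta[j])
        v = mu * (1.0 - mu)
        return d * (y[j] - t[j] * mu) / v, t[j] * d * d / v

    for _ in range(sweeps):
        # Step (2): one scoring step for the fixed intercept.
        sw = [score_and_weight(j) for j in range(n)]
        step = sum(s for s, _ in sw) / sum(w for _, w in sw)
        beta += step
        eta = [e + step for e in eta]
        # Step (3): one penalized scoring step per locus.
        for k in range(m):
            sw = [score_and_weight(j) for j in range(n)]
            score = sum(Z[j][k] * s for j, (s, _) in enumerate(sw)) - gamma[k] / sigma2[k]
            info = sum(Z[j][k] ** 2 * w for j, (_, w) in enumerate(sw)) + 1.0 / sigma2[k]
            step = score / info
            gamma[k] += step
            eta = [eta[j] + Z[j][k] * step for j in range(n)]
        # Step (4): assumed explicit variance update, floored.
        sigma2 = [max(g * g, floor) for g in gamma]
    return beta, gamma, sigma2

# Toy data: 5 unlinked loci, only the first has an effect (hypothetical values).
rng = random.Random(1)
n, m = 300, 5
true_gamma = [0.8, 0.0, 0.0, 0.0, 0.0]
Z = [[rng.choice([-1, 0, 0, 1]) for _ in range(m)] for _ in range(n)]  # 1:2:1 F2 ratio
t = [5] * n
y = []
for j in range(n):
    mu = Phi(-0.2 + sum(z * g for z, g in zip(Z[j], true_gamma)))
    y.append(sum(rng.random() < mu for _ in range(t[j])))
beta_hat, gamma_hat, sigma2_hat = map_fit(y, t, Z)
```

In this sketch, loci with weak support are pulled toward zero because a small γk yields a small σk² and hence a strong penalty in the next sweep, which is the shrinkage behaviour the penalty is meant to provide.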

Let us now compare the MAP with the EM algorithm. The target function to be maximized with the EM algorithm is

where the expectation is taken with respect to γ. The MLE of θ={β,G} is obtained by solving

The corresponding EM steps are modifications of the MAP steps as shown below. The updating step for β in the EM is

which is a maximization (M) step. The updating step for γk is

This is the expectation (E) step. Another maximization (M) step is to update σk2 for k=1,…,m using

where E(γk) is the expectation of γk and

is the variance of γk. The EM algorithm requires calculation of the expectations of the first- and second-order partial derivatives of the target function, which is by no means a simple task. This is the reason why McGilchrist (1994) proposed the MAP for GLMM. Note that the updating step for σk² is explicit: it is obtained by setting the derivative of the log posterior with respect to σk² to zero for the MAP, and by setting the derivative of the expected log posterior to zero for the EM algorithm. Therefore, the MAP estimation does not exactly lead to the EM estimation in the frequentist framework. However, the results are very close, and this is why McGilchrist (1994) developed the MAP estimation for variance component analysis in the GLMM framework.
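The difference between the two variance updates can be written out from the N(0, σk²) prior alone. The following is a sketch derived under that prior, not a reproduction of the original displayed equations; the symbol $L_2$ denotes the part of the log posterior contributed by locus k.

```latex
\begin{aligned}
L_2(\gamma_k,\sigma_k^2) &= -\tfrac{1}{2}\ln\sigma_k^2-\frac{\gamma_k^2}{2\sigma_k^2}+\text{const},\\[4pt]
\text{MAP:}\quad \frac{\partial L_2}{\partial\sigma_k^2}=0
  &\;\Longrightarrow\; \sigma_k^2=\gamma_k^2,\\[4pt]
\text{EM:}\quad \frac{\partial\,\mathrm{E}(L_2)}{\partial\sigma_k^2}=0
  &\;\Longrightarrow\; \sigma_k^2=\mathrm{E}(\gamma_k^2)
   =\left[\mathrm{E}(\gamma_k)\right]^2+\operatorname{var}(\gamma_k).
\end{aligned}
```

The two updates differ only by the term var(γk), so they agree closely whenever the conditional variance of γk is small relative to the square of its expectation.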

LOD (log of odds) score test

The estimated QTL effect (after the MAP iteration converges) is denoted by $\hat{\gamma}_k$. We can now perform statistical tests. The test statistic for H0:γk=0 may be the t-test,

$$t_k=\frac{\hat{\gamma}_k}{\mathrm{s.e.}(\hat{\gamma}_k)}.$$

It is called the t-test because it is expressed as the ratio of the estimated effect to its s.e. However, under the null model, this test statistic may not follow the t-distribution because of the penalty placed on the estimation. This test statistic is negative if the estimated QTL effect is negative. The Wald test (Wald, 1943) is simply the square of the t-test

$$W_k=t_k^2=\frac{\hat{\gamma}_k^2}{\operatorname{var}(\hat{\gamma}_k)},$$

which is similar to the likelihood ratio test statistic. The best presentation of the test statistic is the LOD score defined as

$$\mathrm{LOD}_k=\frac{W_k}{2\ln(10)}\approx\frac{W_k}{4.61}.$$

A nice property of the LOD score test is that an empirical critical value of

$$\mathrm{LOD}=3+\log_{10}(m)$$

may be used to declare statistical significance at the 0.05 type I error rate (Kidd and Ott, 1984; Risch, 1991). The m in log10(m) is the number of putative QTL included in the model. The special case of m=1 corresponds to the LOD 3 criterion.
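The test statistics are simple to compute once an estimate and its s.e. are available. The sketch below assumes the usual conversion from a Wald (or likelihood ratio) statistic to a LOD score, LOD=W/(2 ln 10), and the empirical threshold 3+log10(m); the function names and the example estimate are hypothetical.

```python
import math

def lod_score(gamma_hat, se):
    """LOD score from an estimated effect and its standard error."""
    t_stat = gamma_hat / se          # t-test statistic
    wald = t_stat ** 2               # Wald statistic
    return wald / (2.0 * math.log(10.0))

def lod_threshold(m):
    """Empirical critical value for a model with m putative QTL."""
    return 3.0 + math.log10(m)

example_lod = lod_score(0.5, 0.1)        # t = 5, W = 25
single_locus = lod_threshold(1)          # the LOD 3 criterion
wheat_threshold = lod_threshold(197)     # model size used for the wheat data
```

For a model with 197 putative loci, the threshold is about 5.29, so the hypothetical locus above (LOD of about 5.43) would be declared significant.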

Missing genotypes

In QTL mapping, the genotype indicator variable (Zjk) is missing if the QTL position does not overlap with a fully informative marker. However, partial information is available due to linkage disequilibrium. We examined two methods for handling missing genotypes.

Expectation method

Linkage disequilibrium allows us to infer the conditional distribution of Zjk given information from linked markers. Let A1A1, A1A2 and A2A2 be the three genotypes of a QTL for an individual in an F2 population. The Z variable is determined by the genotype of locus k,

$$Z_{jk}=\begin{cases}+1 & \text{for } A_1A_1\\ \phantom{+}0 & \text{for } A_1A_2\\ -1 & \text{for } A_2A_2\end{cases}$$

In the context of GLMM, γk=ak, where ak is the additive effect of locus k. When Zjk is missing, its expectation and variance are inferred from the genotypes of flanking markers (Jiang and Zeng, 1997). Let pj(+1), pj(0) and pj(−1) be the conditional probabilities of the three genotypes inferred from neighboring markers using the multipoint method (Jiang and Zeng, 1997). The expectation and variance of Zjk are (Xu and Hu, 2010)

$$U_{jk}=\mathrm{E}(Z_{jk})=p_j(+1)-p_j(-1)$$

and

$$\Sigma_{jk}=\operatorname{var}(Z_{jk})=p_j(+1)+p_j(-1)-U_{jk}^2.$$

With the expectation method, we simply replace Zjk by Ujk. Therefore, the linear predictor is defined as

$$\eta_j=\beta+\sum_{k=1}^{m}U_{jk}\gamma_k.$$

Everything else remains the same as the situation with complete genotypic information.
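The first two moments of a missing genotype indicator follow directly from the three conditional genotype probabilities. The sketch below computes them for the coding {+1, 0, −1}; the function name `genotype_moments` and the example probabilities are hypothetical.

```python
def genotype_moments(p_plus, p_zero, p_minus):
    """Return (U, Sigma): the mean and variance of Z coded +1, 0, -1."""
    assert abs(p_plus + p_zero + p_minus - 1.0) < 1e-9
    u = p_plus - p_minus                 # E(Z)
    sigma = p_plus + p_minus - u * u     # E(Z^2) - E(Z)^2
    return u, sigma

# No marker information: F2 prior probabilities 1/4, 1/2, 1/4.
u_prior, s_prior = genotype_moments(0.25, 0.5, 0.25)
# Locus coincides with a fully informative marker scored as A1A1.
u_known, s_known = genotype_moments(1.0, 0.0, 0.0)
```

The variance is zero when the genotype is observed and largest when the flanking markers carry no information, which is why ignoring it matters most for pseudo markers far from any true marker.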

Overdispersion method

The expectation method only takes advantage of the first moment of the distribution of Zjk. The second moment information has been ignored, which will generate a situation called overdispersion. For locus k, the overdispersion is defined as

Incorporating this overdispersion, we redefine the linear predictor as

where

is an offset of the linear predictor contributed by other loci. We now have a locus-specific linear predictor with which to define the various log functions for maximization.

Results

Simulation study

Binomial data

We simulated a single large chromosome of 2400 cM evenly covered by 241 codominant markers (10 cM per marker interval). The simulated population was an F2 family derived from the cross of two inbred lines with a sample size of n=500. The genotype indicator variable for individual j at locus k was defined as Zjk={+1, 0, −1} for the three genotypes (A1A1, A1A2, A2A2). Dominance effects were neither simulated nor included in the model for this simulation experiment. A total of 20 QTL were simulated, with their true sizes and locations depicted in Figure 1 (the top panel). Most QTL were placed in the left part of the genome. Some QTL were far apart from each other, whereas others were clustered in narrow regions. About half of the simulated QTL overlapped with true markers (known genotypes) and the remaining QTL were located between markers (missing genotypes). We first generated a linear predictor ηj for each individual using the genotypes of the 20 simulated QTL and the true effects of these QTL. The linear predictor was then converted into the probability of a binomial variable using μj=Φ(ηj). We then simulated a zero-truncated Poisson variable with mean 4 as the number of trials for individual j, denoted by tj (the number of trials must be greater than zero). Finally, we simulated the number of events yj from the corresponding binomial distribution defined by μj and tj, that is, yj∼Binomial (μj,tj). The simulation experiment was replicated 100 times.
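The phenotype-generating step of the simulation can be sketched as follows. The sketch starts from given linear predictors and does not simulate linked genotypes; it assumes that "mean 4" refers to the rate of the Poisson distribution before truncation (the truncated mean is then slightly above 4) and draws the zero-truncated variable by rejecting zeros. All function names and the placeholder linear predictors are hypothetical.

```python
import math
import random

def probit(eta):
    """Inverse probit link: mu = Phi(eta)."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def poisson(rng, lam):
    """Knuth's multiplication sampler; adequate for a small rate."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def zero_truncated_poisson(rng, lam):
    """Redraw until the count is positive (assumed truncation scheme)."""
    while True:
        draw = poisson(rng, lam)
        if draw > 0:
            return draw

def simulate_binomial_trait(eta, rng, lam=4.0):
    """Return (trials, events) for a list of linear predictors."""
    trials, events = [], []
    for eta_j in eta:
        mu_j = probit(eta_j)
        t_j = zero_truncated_poisson(rng, lam)
        y_j = sum(rng.random() < mu_j for _ in range(t_j))  # Binomial(t_j, mu_j)
        trials.append(t_j)
        events.append(y_j)
    return trials, events

rng = random.Random(2024)
eta = [rng.gauss(0.0, 1.0) for _ in range(500)]   # placeholder linear predictors
trials, events = simulate_binomial_trait(eta, rng)
```

Setting every tj to one in `simulate_binomial_trait` would give the binary version of the same design.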

Figure 1

True QTL effects (top panel) and their estimated values for the simulated binomial trait (BINOMIAL) using the expectation method (panel in the middle) and the overdispersion method (bottom panel). The estimated QTL effects are drawn from a single simulated sample. The positions of 20 simulated QTL are indicated by the inward ticks on the horizontal axis.

In the simulated data analysis, we added a pseudo marker every 2.5 cM along the genome, which is equivalent to adding three pseudo markers per marker interval (10 cM is the length of each interval). Genotypic probabilities of the pseudo markers were inferred from information of flanking markers (Jiang and Zeng, 1997). These probabilities were used to calculate Ujk and Σjk. The total number of putative loci analyzed was m=241+720=961, comprising 241 true markers and 720 pseudo markers. This m is almost twice the sample size (n=500). We wrote a SAS/IML program to analyze the data. The IML code is available from the corresponding author on request.
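The locus counts follow from the map layout, as the short check below shows. It simply lays a 2.5 cM grid over the 2400 cM chromosome and separates grid points that coincide with markers from those that do not; the variable names are hypothetical.

```python
chromosome_length = 2400.0   # cM
marker_spacing = 10.0        # cM between true markers
grid_step = 2.5              # cM between putative loci

# All putative locus positions on the 2.5 cM grid, including both ends.
grid = [grid_step * i for i in range(int(chromosome_length / grid_step) + 1)]
true_markers = [pos for pos in grid if pos % marker_spacing == 0.0]
pseudo_markers = [pos for pos in grid if pos % marker_spacing != 0.0]

n_true, n_pseudo, m_total = len(true_markers), len(pseudo_markers), len(grid)
```

There are 240 marker intervals with three pseudo markers each, which gives the 720 pseudo markers and the total of 961 putative loci.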

The estimated QTL effects from one random sample are presented in Figure 1 for the two methods (expectation and overdispersion) along with the true simulated effects. The two methods produced very similar results, which mimic the true QTL effects closely in terms of locations and the sizes of the effects. The general observations from this figure are (1) a large QTL effect may be split into two or more small effects in the neighborhood of the true QTL and (2) the estimated effects are generally smaller than the true effects due to penalty. Without the penalty, however, we cannot estimate all the 961 putative QTL simultaneously. In any real data analysis with a single sample, the pattern shown in Figure 1 is what an investigator expects to see.

Figure 2 shows the plot of the average estimated QTL effects (across 100 replicated samples) against the genome location. This time, the positions and the patterns of the QTL are extremely close to the true QTL shown in Figure 1 (the top panel). However, the average estimates of the QTL effects are severely biased downwards (towards zero). The differences between the two methods are barely noticeable in the plots. The simulation experiments allow us to evaluate the bias and estimation error of each QTL and eventually the mean-squared error (MSE) for all the QTL. Let $\bar{\gamma}_k$ be the average estimate of γk over the 100 replicates and sk² be the variance of the estimated γk across the 100 samples. With γk denoting the true effect, the MSE for γk is defined as

$$\mathrm{MSE}_k=(\bar{\gamma}_k-\gamma_k)^2+s_k^2.$$

Figure 2

Estimated QTL effects for the simulated binomial trait (BINOMIAL) using the expectation method (panel in the middle) and the overdispersion method (bottom panel). The estimated QTL effects are the averages of 100 replicated samples.

The sum of the MSEs over all QTL is

$$\mathrm{MSE}=\sum_{k=1}^{m}\mathrm{MSE}_k.$$

The MSEs for the two methods (expectation and overdispersion) are shown in Table 1. The overdispersion method has a slightly larger bias but a smaller error than the expectation method. Overall, the overdispersion method has a smaller MSE than the expectation method. The bias defined here from the replicated simulation experiments may be overstated for the following reason. With the high density of putative QTL in the model, a true QTL is often detected by a nearby marker close to the true QTL. The exact locations vary from one sample to another, but all lie in the neighborhood of the true QTL. When an average is taken across the samples, the effect at the true location is diluted by those samples in which the estimated QTL is a few cM away from the true QTL. For example, if a true QTL is estimated at the true location (A) in one sample and at a position 2.5 cM away from the true location (B) in a second sample, the average effect of the two samples at the true location is halved. This problem does not arise in real data analysis, because a QTL detected in experiment A (denoted by QTLA) will be treated as the same QTL as one detected in experiment B (denoted by QTLB) as long as QTLA and QTLB are not too far away from each other. Therefore, the smaller estimation error of the overdispersion method is perhaps more important than its larger bias.
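The decomposition of the MSE into squared bias and variance, and the dilution effect described above, can be illustrated with a two-replicate toy example. The sketch assumes the sample variance uses the divisor r−1, which the text does not specify; the function name and the numbers are hypothetical.

```python
def mse_of_estimates(estimates, true_effect):
    """Return (bias, variance, MSE) of replicated estimates of one effect."""
    r = len(estimates)
    mean = sum(estimates) / r
    variance = sum((e - mean) ** 2 for e in estimates) / (r - 1)  # assumed divisor
    bias = mean - true_effect
    return bias, variance, bias ** 2 + variance

# True effect 1.0 at location A. Sample 1 places the full effect at A;
# sample 2 places it 2.5 cM away at B, leaving zero at A.
estimates_at_a = [1.0, 0.0]
bias_a, var_a, mse_a = mse_of_estimates(estimates_at_a, 1.0)
```

The average effect at A is halved even though each sample recovered the full effect at a position close to the truth, which is the sense in which the replicated bias may be overstated.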

Table 1 Comparison of the MSE for the two methods in the 100 replicated simulation experiments

Binary data

The experimental design is exactly the same as that of the binomial experiment. The only difference in the simulation is that the trial was a fixed number of one for every individual in the binary data simulation experiment. The estimated QTL effects from one random sample are presented in Figure 3 for the two methods (expectation and overdispersion) along with the true simulated effects. Again, the two methods produced very similar results. However, they differ from the true QTL effects more than what we observed for the binomial trait analysis. Some QTL with small effects have been missed here, for example, the last simulated QTL in the genome. This indicates lower efficiency of QTL mapping for binary trait analysis than for binomial trait analysis.

Figure 3

True QTL effects (top panel) and their estimated values for the simulated binary trait (BINARY) using the expectation method (panel in the middle) and the overdispersion method (bottom panel). The estimated QTL effects are drawn from a single simulated sample.

Figure 4 shows the plot of the average estimated QTL effects (across 100 replicated samples) against the genome location. Again, the positions and the patterns of the QTL are close to the true QTL shown in Figure 3 (the top panel). The MSEs for the two methods (expectation and overdispersion) are shown in Table 1. The overdispersion method has a much larger bias but a much smaller error than the expectation method. Overall, the overdispersion method has a smaller MSE than the expectation method. The advantages of the overdispersion method are well supported by the simulation experiments.

Figure 4

Estimated QTL effects for the simulated binary trait (BINARY) using the expectation method (panel in the middle) and the overdispersion method (bottom panel). The estimated QTL effects are the averages of 100 replicated samples.

Mapping wheat fertility QTL

The experiment was conducted by Dou et al (2009). The mapping population contained 243 F2 individuals derived from the cross of two inbred lines. The trait of interest is the female fertility measured as a binomial trait. The event is the number of seeded spikelets per plant (average 19.13 seeded spikelets) and the trial is the total number of spikelets per plant (average 25.15 spikelets). A total of 28 markers were genotyped in this experiment. These markers covered five chromosomes of the wheat genome with an average marker interval of 15.5 cM. The five chromosomes are only part of the wheat genome.

Binomial trait

As the marker map is sparse, we inserted one pseudo marker in every 2 cM, generating a total of 197 loci (28 true markers and 169 pseudo markers). The pseudo markers have missing genotypes and the probability distributions of these pseudo markers were inferred from linked markers using the multipoint methods (Jiang and Zeng, 1997). The sample size was n=243 and the size of the model was m=197. Both the expectation and overdispersion methods were used for the binomial data analysis.

For the real data analysis, we need to calculate the LOD score for each putative locus. The estimated QTL effects for the two methods are depicted in Figure 5 (the top panel) and the LOD score profiles in Figure 5 (the bottom panel). The two methods show both similarities and differences. Using LOD=3+log10(197)=5.2944 as the threshold (Kidd and Ott, 1984), the expectation method detected 17 QTL, whereas the overdispersion method detected 15 QTL. Eight of these QTL were detected by both methods. The estimated effects, along with the s.e. and the LOD scores obtained from the overdispersion method, are listed in Table 2. Most detected QTL were located on chromosomes II, IV and V. The QTL with the largest effect and LOD score occurred on the second chromosome at position 28.71 cM (cumulative position of 104.60 cM). This QTL was split into a few smaller ones in the neighborhood of the major peak by the expectation method. Unlike the simulation study, where the true effects of QTL were known, the true QTL for the wheat data were not known. Therefore, we were not able to compare the biases and the MSEs of the estimated QTL effects. We chose an alternative approach for evaluating the two methods, namely leave-one-out cross validation (Picard and Cook, 1984). The cross validation approach only evaluates the predictabilities of the models. For the purpose of molecular breeding and marker-assisted selection, higher predictability is preferable. For the purpose of gene cloning, the biases of QTL effect and location estimates are of major concern. We used the Pearson correlation coefficient between the observed (y) and predicted ($\hat{y}$) trait values as a measure of predictability. The Pearson correlation coefficients for the expectation and overdispersion methods were 0.5166 and 0.5290, respectively. The overdispersion method showed a slight advantage over the expectation method. We also examined the prediction errors defined by

Figure 5

Binomial trait analysis of the wheat experiment using the expectation method (blue) and the overdispersion method (red). The top panel shows the estimated QTL effects and the bottom panel shows the LOD scores. Chromosomes are separated by the dotted vertical lines. Positions of true markers are indicated by the inward ticks on the horizontal axis.

Table 2 QTL detected for the binomial trait of wheat fertility using the overdispersion method

for the two methods. The PEs were 0.100563 and 0.098658 for the expectation and overdispersion methods, respectively. The PE comparison is consistent with the Pearson correlation comparison. We concluded that incorporating overdispersion does show the expected benefit (an increase in predictability) in QTL mapping over the simple expectation method.
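The leave-one-out procedure and the correlation criterion can be sketched generically. In the sketch below, the model is passed in as a pair of functions so that any fitting routine could be used; a simple least-squares line stands in for the GLMM purely for illustration, and the toy data are hypothetical. The prediction error criterion is not included because its formula is not reproduced here.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def leave_one_out(xs, ys, fit, predict):
    """Predict each observation from a model fitted without it."""
    predictions = []
    for i in range(len(ys)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(predict(model, xs[i]))
    return predictions

def fit_line(xs, ys):
    """Stand-in model: ordinary least-squares line (not the GLMM)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def predict_line(model, x):
    intercept, slope = model
    return intercept + slope * x

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
y_pred = leave_one_out(xs, ys, fit_line, predict_line)
r_cv = pearson(ys, y_pred)
```

Because each prediction comes from a model that never saw the predicted individual, the correlation measures predictability rather than goodness of fit.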

Binary trait

Among the 243 plants, 39 did not have seeds at all. The frequency distribution of the number of seeded spikelets is shown in Figure 6. It appears that the zero category was inflated. The binomial data analysis did not single out QTL responsible for seed presence and absence. We therefore defined a binary trait as seed presence/absence and used the two methods (expectation and overdispersion) to analyze the binary trait. The estimated QTL effect profiles are shown in the top panel of Figure 7 and the LOD score profiles are depicted in the bottom panel of the same figure. The two methods appeared to generate much the same result. Using the LOD 5.29 criterion, we only detected a single QTL at position 28.71 cM of chromosome II (cumulative position 104.60 cM). This is the same QTL as the largest one detected for the binomial trait by the overdispersion method. Our conclusion was that, except for this particular QTL, the multiple QTL reported earlier for the binomial trait were all responsible for the variation of the number of seeded spikelets, not for the seed presence/absence trait. The leave-one-out cross validation analysis did not show much difference between the two methods. The Pearson correlation coefficients between the observed and predicted trait values were 0.4715 and 0.4729 for the expectation and overdispersion methods, respectively. The corresponding PEs were 0.104914 and 0.104721. Both criteria favor the overdispersion method, although only marginally.

Figure 6

Frequency distribution of the number of seeded spikelets of the F2 wheat population. Among the 243 plants, 39 of them had no seeds (zero category).

Figure 7

Binary trait (seed presence/absence) analysis using the expectation method (blue) and the overdispersion method (red). The top panel shows the estimated QTL effects and the bottom panel shows the LOD scores. Chromosomes are separated by the dotted vertical lines. Positions of true markers are indicated by the inward ticks on the horizontal axis.

Discussion

The overdispersion method for handling missing genotypes was proposed by Xu and Hu (2010) for IM under the GLM framework. We examined this method and an additional one (the expectation method) under the GLMM framework for mapping multiple QTL. The GLMM and GLM are different and thus the extension is not a trivial task. The overdispersion method consistently showed advantages over the expectation method in both the simulated and real data analyses and in both the binomial and binary trait analyses. Based on the plots of the estimated QTL effects, the advantages appeared to be marginal. Why, then, should we bother to develop such a method? First, the overdispersion method does not require much more computation than the expectation method. The computational times of the two methods are nearly the same when numerical differentiation packages are used to evaluate the first and second partial derivatives. Therefore, we should take any opportunity to extract maximum information from the data; even a slight advantage is worth the effort. Secondly, simulation experiments are always limited. It is hard to simulate all possible scenarios so that the advantages of the overdispersion method are fully exposed. In some situations, the advantage may be obvious and we may simply have failed to identify those situations. Thirdly, the two methods already demonstrated some interesting differences in the wheat data analysis of the binomial trait that are worth discussing. The largest QTL detected by the overdispersion method was split into several smaller QTL by the expectation method. The cross validation analysis showed that the overdispersion method gave a better prediction, implying that the single large QTL most likely represents the truth. The binary data analysis of the wheat experiment showed that the same locus also had a large effect on the binary trait. This time both methods showed a single large QTL. This observation further supports the single large QTL hypothesis. Without the overdispersion method, we would not have such confidence in this single large QTL.

The advantage of the overdispersion method will diminish as the marker density increases. In the situation where the entire genome is sequenced, the two methods would converge to the same result because genotypes of all markers will be observed. However, full genome sequences for most species are not expected to be available soon. In addition, missing genotypes may still exist because of human and technical errors in experiments. Therefore, the missing genotype handling methods will remain useful for the foreseeable future.

The GLMM is sufficiently general that it can handle traits with any distribution as long as the likelihood function is programmable. The normal distribution for the QTL effects may be substituted by other distributions. Explicit expressions of the derivatives are not required to implement the Newton–Raphson updates. Recently, Yi and Banerjee (2009) developed a hierarchical GLM for mapping discrete trait QTL. They used the pseudo likelihood approach to approximate the observed log likelihood function. The authors used an EM algorithm to estimate the QTL effects, but they treated {β,γ} as parameters and G as missing values. In addition, Yi and Banerjee (2009) only considered marker analysis with missing marker genotypes replaced by the conditional expectation, which is equivalent to the expectation method of this study. However, they only considered missing marker genotypes in the sense that the majority of the individuals are genotyped. The missing genotypes in their study were solely caused by technical or human errors. They did not insert pseudo markers every few centimorgans to saturate the genome.

Responding to a reviewer's suggestion, we analyzed both the binomial and binary traits of the wheat experiment using the LMM, ignoring the discrete nature of the traits. The correct model should be the GLMM, but we used the LMM as an ad hoc model to analyze the discrete traits. The results are depicted in Supplementary Figure S1 for the binomial trait and Supplementary Figure S2 for the binary trait. Supplementary Figure S1 shows the estimated QTL effects and LOD scores of the LMM analysis for the binomial trait. Only one large QTL was detected using this ad hoc model. Comparing Supplementary Figure S1 with Figure 5 of the main text, we can see that many small- to medium-sized QTL detected by the GLMM were missed. Results of leave-one-out cross validation are shown in Supplementary Table S1. The Pearson correlation coefficients between the observed and predicted trait values dropped from 0.517 (expectation) and 0.529 (overdispersion) to 0.495 (expectation) and 0.497 (overdispersion). This means that the medium-sized QTL detected by the GLMM do contribute to the binomial trait variation, and ignoring the discrete nature of the trait decreased the predictability of the model. The binary trait comparison favors the GLMM over the LMM even more strongly (see Supplementary Figure S2 and Table S2 of the Supplementary material).

GLM or GLMM represents an important area of statistics. It was particularly designed to deal with discrete traits or other traits deviating from a normal distribution. In statistics, people rarely question the suitability of GLMM given that LMM is already available for normally distributed traits. In case–control studies for human diseases, logistic regression (a GLM) is often used to detect disease QTL (Hunter et al, 2007) because case (designated by 1) and control (designated by 0) constitute the two binary states of the disease outcome. People rarely analyze the 0-1 binary trait using simple regression analysis that ignores the discrete nature of the trait. The situation is different for QTL mapping in plants and animals. Every time a new method is developed for discrete traits, the investigator must face challenges from peers about how much improvement is achieved over an analysis that ignores the discrete nature of the trait. These challenges have occurred repeatedly and may largely be credited to (or blamed on) the works of Visscher et al (1996) and Rebai (1997), who showed marginal improvement of GLM over LM for binary trait QTL mapping when the binary trait is treated as if it were continuous. Rao and Xu's (1998) conclusion about the ad hoc treatment of categorical trait analysis was slightly different. They found that if a categorical trait is analyzed using simple linear models, the power and accuracy of QTL parameter estimation can be reduced substantially. Although, from a practical point of view, the loss of power and accuracy may be marginal when discrete traits are treated as continuous ones, the GLM or GLMM is built on a rigorous statistical foundation and thus its suitability should not be questioned. Especially in the era of high-power computing, one should not use a suboptimal algorithm when an optimal algorithm is known to be available. On the other hand, if an investigator presents the result of a binary trait analysis using a simple method that treats the binary phenotype as a continuous trait, the investigator will often face criticism from peers for not using the correct model, given the availability of GLM or GLMM. For the benefit of these investigators, the new GLMM approach provides a useful tool for correctly analyzing their data and avoiding rejection of their fine work.

Data archiving

Simulated data, the test dataset from Dou et al. (2009) and SAS scripts for analyzing these datasets have been deposited at Dryad: doi:10.5061/dryad.mn159hq6.