Generalized linear mixed models for mapping multiple quantitative trait loci

Che, X; Xu, S

doi:10.1038/hdy.2012.10


Original Article
Published: 14 March 2012


Heredity volume 109, pages 41–49 (2012)


Abstract

Many biological traits are discretely distributed in phenotype but continuously distributed in genetics because they are controlled by multiple genes and environmental variants. Due to the quantitative nature of the genetic background, these multiple genes are called quantitative trait loci (QTL). When the QTL effects are treated as random, they can be estimated in a single generalized linear mixed model (GLMM), even if the number of QTL may be larger than the sample size. The GLMM in its original form cannot be applied to QTL mapping for discrete traits if there are missing genotypes. We examined two alternative missing genotype-handling methods: the expectation method and the overdispersion method. Simulation studies show that the two methods are efficient for multiple QTL mapping (MQM) under the GLMM framework. The overdispersion method showed slight advantages over the expectation method in terms of smaller mean-squared errors of the estimated QTL effects. The two methods of GLMM were applied to MQM for the female fertility trait of wheat. Multiple QTL were detected to control the variation of the number of seeded spikelets.


Introduction

Linear mixed model (LMM) methodology is a powerful tool for analyzing models that contain both fixed and random effects. The model was first proposed to estimate genetic parameters from unbalanced data (Henderson, 1950) and has since been used to map genes controlling the variation of quantitative traits (Xu and Yi, 2000; Boer et al, 2007). The LMM methodology cannot be directly applied to traits with discrete distributions. Wedderburn (1974) proposed a linear predictor and a link function to handle discrete traits. The linear predictor is simply a linear model combining information from the independent variables. The link function describes the relationship between the linear predictor and the expectation of the response variable. This approach eventually led to a special area of statistics called the generalized linear model (GLM) (McCullagh and Nelder, 1989).

Xu and Hu (2010) recently developed a GLM approach to interval mapping (IM) for traits with discrete distributions. The purpose of that study was to investigate the efficiencies of two different methods for handling missing genotypes: (1) the heterogeneous residual variance method and (2) the mixture model method. In the first method, we replaced the missing genotypes of quantitative trait loci (QTL) by the conditional expectations of the genotype indicator variables and then took into account the heterogeneous residual variances of different genotypes due to heterogeneous information contents. In the second method, we fully utilized the conditional distributions of the missing genotypes. Theoretically, the mixture model approach should be optimal. In practice, the heterogeneous residual variance method is more efficient because it is robust to departures from the assumed normal distribution of the residuals. In contrast, the mixture model is very sensitive to departures from the assumed distribution and to the choice of the initial values of the parameters. These missing-genotype-handling methods have not been applied to multiple QTL mapping (MQM).

When the number of QTL included in a model reaches a certain level, for example, when the number of QTL is larger than the sample size, the model is oversaturated. In this case, some kind of penalty is required to shrink the superfluous QTL effects towards zero. The penalty is accomplished by treating each QTL effect, say that of QTL k, as a random effect with a N(0, σ_k²) distribution. When the linear predictor contains both fixed and random effects, the model is called the generalized LMM (GLMM) (Breslow and Clayton, 1993; McCulloch and Neuhaus, 2005). Special algorithms have been developed to estimate variance components and predict the random effects, for example, the pseudo likelihood algorithm (Wolfinger and O'Connell, 1993). However, existing GLMMs have not fully considered the missing genotype problem.

In this study, we extended the GLM for IM of QTL (Xu and Hu, 2010) to GLMM for MQM. The difference between IM and MQM is that IM uses a model that contains only one QTL effect at a time (the entire genome analysis requires multiple analyses of many single-effect models), whereas MQM estimates all QTL simultaneously in a single model. Although Xu and Hu (2010) developed two methods for GLM analysis, we only examined the heterogeneous residual variance method. The mixture model did not offer any advantages over the heterogeneous residual variance method (Xu and Hu, 2010), and thus is not examined in this study. In addition, we evaluated a simple method called the expectation method, in which the missing genotypes of QTL are simply replaced by the conditional expectations of the genotype indicator variables. What Xu and Hu (2010) called the heterogeneous residual variance method is renamed here the overdispersion method. We believe that overdispersion is a more appropriate term in the context of GLMM.

Methods

Generalized linear mixed model

We use a binomial trait as an example to demonstrate the new methodology, although the method can be applied to other discrete traits. Let y_j be the number of events and t_j be the number of trials for individual j from a population of n individuals. Let E(y_j/t_j)=μ_j be the expectation of the binomial trait. Define η_j=Φ⁻¹(μ_j) as a linear predictor with the probit link function. The linear predictor is a function of marker genotypes, as described below,

$$\eta_j = \beta + \sum_{k=1}^{m} Z_{jk}\gamma_k$$

where β is the intercept, γ_k is the effect for marker k, Z_jk is an independent variable determined by the genotype of marker k of individual j and m is the total number of markers included in the model. In a later section, markers are replaced by pseudo markers. Each marker is then considered as a putative QTL. Therefore, we may call m the number of putative QTL. Details about Z_jk will be described later.

When m is large, say m>n, the model is oversaturated and solutions of the parameters will not be unique. To overcome this problem, a penalty should be placed on the QTL effects. Ridge regression (Hoerl and Kennard, 1970) is often used as a penalized regression analysis. It corresponds to the L₂ penalty (Zou, 2006; Friedman et al, 2010), in which γ_k is treated as a random effect and further described by a N(0, σ_k²) distribution. Once γ_k is treated as a random effect, it becomes a random variable and thus does not reduce the degrees of freedom of the residual. In addition, the zero mean distribution serves as a ‘prior’ belief of no effect from the Bayesian point of view. These are the very reasons why a mixed model can handle a very large number of regression coefficients once the coefficients are treated as random effects. The intercept β is treated as a fixed effect (no distribution is assigned) because we do not want to penalize a model based on the size of the intercept. The linear predictor includes both the fixed effect (β) and the random effects (γ), and thus the model is called a mixed model. The least absolute shrinkage and selection operator (Lasso) method developed by Tibshirani (1996) is another penalized regression analysis, corresponding to the L₁ penalty. We will not pursue the Lasso approach because it is beyond the scope of the GLMM.

Let us denote all QTL effects by an m × 1 vector γ=[γ_1,…,γ_m]ᵀ and denote the multivariate normal density of γ by p(γ∣G)=N(γ∣0,G), where G=diag{σ_1²,…,σ_m²} is a diagonal matrix of the variance components. This special notation for probability density, p(γ∣G)=N(γ∣0,G), is adopted from Gelman et al. (2004). It represents both the distribution and the density, that is,

$$p(\gamma \mid G) = (2\pi)^{-m/2}\,|G|^{-1/2}\exp\left(-\tfrac{1}{2}\gamma^{T}G^{-1}\gamma\right)$$

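Conditional on η_j=β+Z_jγ, where Z_j is the jth row of Z, the binomial distribution for y_j is

$$p(y_j \mid \eta_j) = \binom{t_j}{y_j}\,\mu_j^{y_j}(1-\mu_j)^{t_j-y_j}, \qquad \mu_j = \Phi(\eta_j)$$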

When the elements of γ are treated as random effects, they are no longer considered as parameters in the GLMM, although they remain important genetic parameters in terms of QTL mapping. The parameters are now θ={β,G}. Conditional on η=β+Zγ, we have the joint probability for the binomial trait of the entire sample

$$p(y \mid \eta) = \prod_{j=1}^{n} p(y_j \mid \eta_j)$$
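As an illustration of the conditional binomial probability under the probit link, the following minimal Python sketch evaluates the linear predictor and the joint log probability of a sample. The function names and the toy data are hypothetical and chosen only for this example; the sketch assumes complete genotypes and does not guard against μ_j being numerically 0 or 1.

```python
from math import comb, log
from statistics import NormalDist


def linear_predictor(beta, z_row, gamma):
    """eta_j = beta + sum_k Z_jk * gamma_k for one individual."""
    return beta + sum(z * g for z, g in zip(z_row, gamma))


def binomial_probit_loglik(y, t, eta):
    """Joint log probability ln p(y | eta) of the sample under the probit link."""
    phi = NormalDist().cdf  # standard normal CDF, so mu_j = Phi(eta_j)
    total = 0.0
    for y_j, t_j, eta_j in zip(y, t, eta):
        mu_j = phi(eta_j)
        total += (log(comb(t_j, y_j))
                  + y_j * log(mu_j)
                  + (t_j - y_j) * log(1.0 - mu_j))
    return total


# Toy data (made up for illustration): 3 individuals, 2 markers coded +1/0/-1.
Z = [[1, 0], [0, -1], [-1, 1]]
gamma = [0.5, -0.2]
beta = 0.1
eta = [linear_predictor(beta, row, gamma) for row in Z]
loglik = binomial_probit_loglik(y=[3, 2, 1], t=[4, 4, 4], eta=eta)
```

The likelihood of θ={β,G} additionally requires integrating this quantity over γ, which the sketch does not attempt.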

The likelihood function for the parameter vector θ={β,G} is proportional to the following probability

$$p(y \mid \theta) = \int p(y \mid \eta)\,p(\gamma \mid G)\,d\gamma$$

where the integration is taken with respect to γ. The integral is multivariate and no explicit expression exists. The log likelihood function for parameter θ={β,G} is defined as

$$L(\beta, G) = \ln \int p(y \mid \eta)\,p(\gamma \mid G)\,d\gamma$$

and thus also does not have an analytical expression. The maximum likelihood estimate of θ={β,G} would be obtained by solving

$$\frac{\partial L(\beta, G)}{\partial \theta} = 0$$

if L(β,G) were explicitly expressed. A pseudo likelihood algorithm was developed to solve for the parameters (Wolfinger and O'Connell, 1993). Laplace approximation has also been used to replace the integral (Vonesh, 1996). In this study, we adopted a simple method that does not involve numerical integration. This method is called the MAP estimation, as described below.

MAP estimation

The word MAP stands for maximum a posteriori (DeGroot, 2004), which is a terminology related to Bayesian analysis. Our GLMM is a frequentist approach if we treat {β,G} as parameters. However, if we consider {β,γ} as parameters and treat G as a prior variance matrix for γ, the problem becomes a Bayesian problem and parameter estimation can be achieved under the Bayesian framework. In a typical Bayesian problem, the parameters in the prior should be provided by the investigator before the data analysis. It is hard to provide a prior value for G, and thus we must estimate G from the data. Once G is estimated from the data, the problem is more like a mixed model problem. Therefore, the difference between the Bayesian model and the GLMM becomes blurred. We may consider the MAP algorithm as a simplified approach to estimating parameters under the GLMM framework (see McGilchrist 1994). We will first provide the MAP estimation and then show the difference between the MAP estimates and the ML estimates.

Unlike the ML estimation in which the target function for maximization is L(β,G), in the MAP estimation, we maximize the log posterior function defined as

where

and

The MAP estimation for ξ={β,γ,G} is obtained by setting the partial derivatives of the log posterior function with respect to ξ to zero and solving for ξ. The iteration process is summarized in the following sequence.

Step (1): Set t=0 and initialize all parameters .

Step (2): Update β using

Step (3): Update γ_k for k=1,…,m using

Step (4): Update σ_k² for k=1,…,m using

Step (5): Repeat Steps (2) to (4) until the sequence converges.

Note that Steps (2) and (3) are each a single iteration of the Newton–Raphson algorithm (Ypma, 1995). The MAP approach for GLMM was first proposed by McGilchrist (1994). It is a much simplified algorithm that avoids multiple integration. The original MAP of McGilchrist (1994) did not address the missing value problem, which is dealt with in the next section of this study.

Let us now compare the MAP with the EM algorithm. The target function to be maximized with the EM algorithm is

where the expectation is taken with respect to γ. The MLE of θ={β,G} is obtained by solving

The corresponding EM steps are modifications of the MAP steps as shown below. The updating step for β in the EM is

which is a maximization (M) step. The updating step for γ_k is

This is the expectation (E) step. Another maximization (M) step is to update σ_k² for k=1,…,m using

where is the expectation of γ_k and

is the variance of γ_k. The EM algorithm requires calculating the expectations of the first- and second-order partial derivatives of the target function, which is by no means a simple task. This is the very reason why McGilchrist (1994) proposed the MAP for GLMM. Note that the updating step for σ_k² is explicit in both algorithms: it is obtained by setting the partial derivative of the target function with respect to σ_k² to zero, where the target function is the log posterior for the MAP and its expectation with respect to γ for the EM algorithm. Therefore, the MAP estimation does not exactly lead to the EM estimation in the frequentist framework. However, the results are very close and this is why McGilchrist (1994) developed the MAP estimation for variance component analysis in the GLMM framework.

LOD (log of odds) score test

The estimated QTL effect (after the MAP iteration converges) is denoted by γ̂_k. We can now perform statistical tests. The test statistic for H₀:γ_k=0 may be the t-test,

$$t_k = \frac{\hat{\gamma}_k}{\mathrm{s.e.}(\hat{\gamma}_k)}$$

It is called the t-test because it is expressed as the ratio of the estimated effect to the s.e. However, under the null model, this test statistic may not follow the t-distribution because of the penalty placed on the estimation. This test statistic is negative if the estimated QTL effect is negative. The Wald test (Wald, 1943) is simply the square of the t-test

$$W_k = t_k^2 = \frac{\hat{\gamma}_k^2}{\mathrm{var}(\hat{\gamma}_k)}$$

which is similar to the likelihood ratio test statistic. The best presentation of the test statistic is the LOD score defined as

$$\mathrm{LOD}_k = \frac{W_k}{2\ln(10)}$$

A nice property of the LOD score test is that an empirical critical value of

$$\mathrm{LOD} = 3 + \log_{10}(m)$$

may be used to declare statistical significance at the 0.05 type I error rate (Kidd and Ott, 1984; Risch, 1991). The m in log₁₀(m) is the number of putative QTL included in the model. The special case of m=1 corresponds to the LOD 3 criterion.
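The test statistics of this section can be sketched in a few lines of Python. The function names are hypothetical, and the conversion from the Wald statistic to the LOD score assumes the usual division by 2 ln(10), as for a likelihood ratio statistic; how the s.e. of an estimated effect is obtained is left outside the sketch.

```python
from math import log, log10


def wald_to_lod(wald):
    """Convert a Wald (or likelihood ratio) statistic to the LOD scale."""
    return wald / (2.0 * log(10.0))


def lod_score(effect, se):
    """LOD score for H0: gamma_k = 0 from an estimated effect and its s.e."""
    t = effect / se          # t-test statistic, carries the sign of the effect
    return wald_to_lod(t * t)  # Wald statistic is the square of t


def lod_threshold(m):
    """Empirical critical value 3 + log10(m) for m putative QTL."""
    return 3.0 + log10(m)


# m = 1 gives the classical LOD 3 criterion; m = 197 is the wheat model size.
single_locus_threshold = lod_threshold(1)
wheat_threshold = lod_threshold(197)
```

For example, a putative QTL in the wheat analysis would be declared significant when `lod_score(effect, se)` exceeds `wheat_threshold`.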

Missing genotypes

In QTL mapping, the genotype indicator variable (Z_jk) is missing if the QTL position does not overlap with a fully informative marker. However, partial information is available due to linkage disequilibrium. We examined two methods for handling missing genotypes.

Expectation method

The linkage disequilibrium allows us to infer the conditional distribution of Z_jk given information from linked markers. Let A₁A₁, A₁A₂ and A₂A₂ be the three genotypes of a QTL for an individual in an F₂ population. The Z variable is determined by the genotype of locus k,

$$Z_{jk} = \begin{cases} +1 & \text{for } A_1A_1 \\ 0 & \text{for } A_1A_2 \\ -1 & \text{for } A_2A_2 \end{cases}$$

In the context of GLMM, γ_k=a_k, where a_k is called the additive effect of locus k. When Z_jk is missing, its expectation and variance are inferred from the genotypes of flanking markers (Jiang and Zeng, 1997). Let p_j(+1), p_j(0) and p_j(−1) be the conditional probabilities of the three genotypes inferred from neighboring markers using the multipoint method (Jiang and Zeng, 1997). The expectation and variance of Z_jk are (Xu and Hu, 2010)

$$U_{jk} = E(Z_{jk}) = p_j(+1) - p_j(-1)$$

and

$$\Sigma_{jk} = \mathrm{var}(Z_{jk}) = p_j(+1) + p_j(-1) - U_{jk}^2$$

With the expectation method, we simply replace Z_jk by U_jk. Therefore, the linear predictor is defined as

$$\eta_j = \beta + \sum_{k=1}^{m} U_{jk}\gamma_k$$

Everything else remains the same as in the situation with complete genotypic information.
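The two moments of the genotype indicator follow directly from the coding Z_jk={+1, 0, −1}. The Python sketch below computes them from the three conditional genotype probabilities; the function name is hypothetical, and the probabilities themselves are assumed to have been obtained already from the flanking markers.

```python
def genotype_moments(p_plus, p_zero, p_minus):
    """Return (U, Sigma): the mean and variance of Z coded as +1, 0, -1.

    p_plus, p_zero and p_minus are the conditional probabilities of the
    genotypes A1A1, A1A2 and A2A2, respectively.
    """
    total = p_plus + p_zero + p_minus
    if abs(total - 1.0) > 1e-8:
        raise ValueError("genotype probabilities must sum to 1")
    u = p_plus - p_minus                # E(Z)
    sigma = p_plus + p_minus - u * u    # E(Z^2) - E(Z)^2
    return u, sigma


# Uninformative F2 case (1:2:1 segregation) versus a fully observed genotype.
u_prior, sigma_prior = genotype_moments(0.25, 0.5, 0.25)
u_known, sigma_known = genotype_moments(1.0, 0.0, 0.0)
```

A fully informative marker gives a variance of zero, so the expectation method then coincides with the complete-genotype analysis.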

Overdispersion method

The expectation method only takes advantage of the first moment of the distribution of Z_jk. The second moment information has been ignored, which will generate a situation called overdispersion. For locus k, the overdispersion is defined as

Incorporating this overdispersion, we redefine the linear predictor as

where

is an offset of the linear predictor contributed by other loci. We now have a locus-specific to define various log functions for maximization.

Results

Simulation study

Binomial data

We simulated a single large chromosome 2400 cM long, evenly covered by 241 codominant markers (10 cM per marker interval). The simulated population was an F₂ family derived from the cross of two inbred lines with a sample size of n=500. The genotype indicator variable for individual j at locus k was defined as Z_jk={+1, 0, −1} for the three genotypes (A₁A₁, A₁A₂, A₂A₂). Dominance effects were neither simulated nor included in the model for this simulation experiment. A total of 20 QTL were simulated, with the true sizes and locations of the QTL depicted in Figure 1 (the top panel). Most QTL were placed in the left part of the genome. Some QTL were far apart from each other, whereas others were clustered in narrow regions. About half of the simulated QTL overlapped with true markers (known genotypes) and the remaining QTL were located between markers (having missing genotypes). We first generated a linear predictor η_j for each individual using the genotypes of the 20 simulated QTL and the true effects of these QTL. The linear predictor was then converted into the probability of a binomial variable using μ_j=Φ(η_j). We then simulated a zero-truncated Poisson variable with mean 4 as the number of trials for individual j, denoted by t_j (the number of trials must be greater than zero). We then simulated the number of events y_j from the corresponding binomial distribution defined by μ_j and t_j, that is, y_j∼Binomial (μ_j,t_j). The simulation experiment was replicated 100 times.
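The last three steps of this simulation (probit transformation, trial numbers and event counts) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' SAS/IML program: the function names are hypothetical, the value 4 is taken as the Poisson rate before truncation, and the linear predictors are stand-in normal draws instead of values built from 20 simulated QTL.

```python
import math
import random
from statistics import NormalDist


def zero_truncated_poisson(rate, rng):
    """Draw from Poisson(rate) conditioned on a positive value (rejection)."""
    limit = math.exp(-rate)
    while True:
        # Knuth's multiplication method for one Poisson draw.
        k, prod = 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        if k > 0:  # reject zeros so that the number of trials is at least 1
            return k


def simulate_individual(eta, rng, rate=4.0):
    """Return (y_j, t_j) for one individual with linear predictor eta."""
    mu = NormalDist().cdf(eta)                    # mu_j = Phi(eta_j)
    t = zero_truncated_poisson(rate, rng)         # number of trials
    y = sum(1 for _ in range(t) if rng.random() < mu)  # binomial events
    return y, t


rng = random.Random(2012)
etas = [rng.gauss(0.0, 1.0) for _ in range(500)]  # stand-in predictors, n = 500
data = [simulate_individual(eta, rng) for eta in etas]
```

Fixing `t` at one for every individual instead of drawing it would give the binary version of the same design.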

In the simulated data analysis, we added a pseudo marker at every 2.5 cM of the genome, which is equivalent to adding three pseudo markers per marker interval (10 cM is the length of each interval). Genotypic probabilities of the pseudo markers were inferred from information of flanking markers (Jiang and Zeng, 1997). These probabilities were used to calculate U_jk and Σ_jk. The total number of putative loci analyzed was m=961, with 241 true markers and 720 pseudo markers. This m is almost twice the size of the sample (n=500). We wrote a SAS/IML program to analyze the data. The IML code is available from the corresponding author on request.
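The count of putative loci can be checked with a short Python sketch that lays out the 2.5 cM grid on the 2400 cM chromosome and flags the positions that coincide with true markers. The function name is hypothetical.

```python
def putative_positions(chrom_length_cm, step_cm):
    """Positions (in cM) of all putative loci on an evenly spaced grid."""
    n_steps = round(chrom_length_cm / step_cm)
    return [i * step_cm for i in range(n_steps + 1)]


positions = putative_positions(2400.0, 2.5)
# True markers sit at multiples of 10 cM; the other grid points are pseudo markers.
is_true_marker = [pos % 10.0 == 0.0 for pos in positions]
n_true = sum(is_true_marker)
n_pseudo = len(positions) - n_true
```

With 240 marker intervals and three pseudo markers per interval, the grid holds 241 true markers and 720 pseudo markers.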

The estimated QTL effects from one random sample are presented in Figure 1 for the two methods (expectation and overdispersion) along with the true simulated effects. The two methods produced very similar results, which mimic the true QTL effects closely in terms of locations and the sizes of the effects. The general observations from this figure are (1) a large QTL effect may be split into two or more small effects in the neighborhood of the true QTL and (2) the estimated effects are generally smaller than the true effects due to penalty. Without the penalty, however, we cannot estimate all the 961 putative QTL simultaneously. In any real data analysis with a single sample, the pattern shown in Figure 1 is what an investigator expects to see.

Figure 2 shows the plot of the average estimated QTL effects (across 100 replicated samples) against the genome location. This time, the positions and the patterns of the QTL are extremely close to the true QTL shown in Figure 1 (the top panel). However, the average estimates of the QTL effects are severely biased downwards (towards zero). The differences between the two methods were barely noticeable in the visual plots. The simulation experiments allow us to evaluate the bias and estimation error of each QTL and eventually the mean-squared error (MSE) for all the QTL. Let γ̄_k be the average estimate of γ_k over the 100 replicates and s_k² be the variance of the estimated γ_k across the 100 samples. The MSE for γ_k is defined as

$$\mathrm{MSE}_k = (\bar{\gamma}_k - \gamma_k)^2 + s_k^2$$

The sum of the MSEs for all QTL is

$$\mathrm{MSE} = \sum_{k} \mathrm{MSE}_k$$

The MSEs for the two methods (expectation and overdispersion) are shown in Table 1. The overdispersion method has a slightly larger bias but a smaller error than the expectation method. Overall, the overdispersion method has a smaller MSE than the expectation method. The bias defined here from the replicated simulation experiments may be overstated for the following reason. With the high density of putative QTL in the model, a true QTL is often detected by a nearby marker close to the true QTL. The exact locations vary from one sample to another, but all lie in the neighborhood of the true QTL. When an average is taken across the samples, the effect at the true location is diluted by those samples in which the estimated QTL is a few cM away from the true QTL. For example, if a true QTL is estimated at the true location (A) in one sample and at a position 2.5 cM away from the true location (B) in a second sample, the average effect of the two samples at the true location is halved. This problem does not arise in real data analysis, because a QTL detected in experiment A (denoted by QTL_A) will be treated as the same QTL detected in experiment B (denoted by QTL_B) as long as QTL_A and QTL_B are not too far away from each other. Therefore, the smaller estimation error of the overdispersion method is perhaps more important than its larger bias.
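The decomposition of the MSE into squared bias and variance across replicates can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that s_k² is the sample variance with the n−1 divisor, which the text does not specify.

```python
from statistics import mean, variance


def mse_for_locus(estimates, true_effect):
    """Squared bias plus variance of the replicate estimates of one effect."""
    bias = mean(estimates) - true_effect
    return bias * bias + variance(estimates)  # variance uses the n - 1 divisor


def total_mse(estimates_by_locus, true_effects):
    """Sum of the per-locus MSEs over all loci."""
    return sum(mse_for_locus(est, true)
               for est, true in zip(estimates_by_locus, true_effects))


# Made-up example: three replicate estimates of a locus with true effect 1.2.
example = mse_for_locus([0.8, 1.0, 1.2], 1.2)
```

In this made-up example the estimates average 1.0, so the squared bias and the variance contribute equally to the MSE.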

Table 1 Comparison of the MSE for the two methods in the 100 replicated simulation experiments


Binary data

The experimental design is exactly the same as that of the binomial experiment. The only difference in the simulation is that the trial was a fixed number of one for every individual in the binary data simulation experiment. The estimated QTL effects from one random sample are presented in Figure 3 for the two methods (expectation and overdispersion) along with the true simulated effects. Again, the two methods produced very similar results. However, they differ from the true QTL effects more than what we observed for the binomial trait analysis. Some QTL with small effects have been missed here, for example, the last simulated QTL in the genome. This indicates lower efficiency of QTL mapping for binary trait analysis than for binomial trait analysis.

Figure 4 shows the plot of the average estimated QTL effects (across 100 replicated samples) against the genome location. Again, the positions and the patterns of the QTL are close to the true QTL shown in Figure 3 (the top panel). The MSEs for the two methods (expectation and overdispersion) are shown in Table 1. The overdispersion method has a much larger bias but a much smaller error than the expectation method. Overall, the overdispersion method has a smaller MSE than the expectation method. The advantages of the overdispersion method are well supported by the simulation experiments.

Mapping wheat fertility QTL

The experiment was conducted by Dou et al (2009). The mapping population contained 243 F₂ individuals derived from the cross of two inbred lines. The trait of interest is the female fertility measured as a binomial trait. The event is the number of seeded spikelets per plant (average 19.13 seeded spikelets) and the trial is the total number of spikelets per plant (average 25.15 spikelets). A total of 28 markers were genotyped in this experiment. These markers covered five chromosomes of the wheat genome with an average marker interval of 15.5 cM. The five chromosomes are only part of the wheat genome.

Binomial trait

As the marker map is sparse, we inserted one pseudo marker in every 2 cM, generating a total of 197 loci (28 true markers and 169 pseudo markers). The pseudo markers have missing genotypes and the probability distributions of these pseudo markers were inferred from linked markers using the multipoint methods (Jiang and Zeng, 1997). The sample size was n=243 and the size of the model was m=197. Both the expectation and overdispersion methods were used for the binomial data analysis.

For the real data analysis, we need to calculate the LOD score for each putative locus. The estimated QTL effects for the two methods are depicted in Figure 5 (the top panel). The LOD score profiles for the two methods are depicted in Figure 5 (the bottom panel). The two methods show both similarities and differences. Using LOD=3+log₁₀(197)≈5.29 as the threshold (Kidd and Ott, 1984), the expectation method detected 17 QTL, whereas the overdispersion method detected 15 QTL. Eight of these QTL were detected by both methods. The estimated effects along with the s.e. and the LOD scores obtained from the overdispersion method are listed in Table 2. Most detected QTL were located on chromosomes II, IV and V. The QTL with the largest effect and LOD score occurred on the second chromosome at position 28.71 cM (cumulative position of 104.60 cM). This QTL was split into a few smaller ones in the neighborhood of the major peak by the expectation method. Unlike the simulation study, where the true effects of QTL were known, the true QTL for the wheat data were not known. Therefore, we were not able to compare the biases and the MSEs of the estimated QTL effects. We chose an alternative approach for evaluating the two methods, namely leave-one-out cross validation (Picard and Cook, 1984). The cross validation approach only evaluates the predictabilities of the models. For the purpose of molecular breeding and marker-assisted selection, higher predictability is preferable. For the purpose of gene cloning, the biases of QTL effect and location estimates are of major concern. We used the Pearson correlation coefficient (r) between the observed (y) and predicted (ŷ) trait values as a measurement of the predictability. The Pearson correlation coefficients for the expectation and overdispersion methods were 0.5166 and 0.5290, respectively. The overdispersion method showed a slight advantage over the expectation method. We also examined the prediction errors (PEs) for the two methods. The PEs were 0.100563 and 0.098658, respectively, for the expectation and the overdispersion methods. The PE comparison is consistent with the Pearson correlation comparison. We concluded that incorporation of overdispersion does show the expected benefit (increase in predictability) in QTL mapping over the simple expectation method.

Table 2 QTL detected for the binomial trait of wheat fertility using the overdispersion method

Binary trait

Among the 243 plants, 39 did not have seeds at all. The frequency distribution of the number of seeded spikelets is shown in Figure 6. It appears that the zero category was inflated. The binomial data analysis did not differentiate QTL responsible for seed presence and absence. We therefore defined a binary trait as seed presence/absence and used the two methods (expectation and overdispersion) to analyze the binary trait. The estimated QTL effect profiles are shown in the top panel of Figure 7 and the LOD score profiles are depicted in the bottom panel of the same figure. The two methods appeared to generate much the same result. Using the LOD 5.29 criterion, we only detected a single QTL at position 28.71 cM of chromosome II (cumulative position 104.60 cM). This is the same QTL as the largest one detected for the binomial trait by the overdispersion method. Our conclusion was that, except for this particular QTL, the multiple QTL detected for the binomial trait reported earlier were all responsible for the variation of the number of seeded spikelets, not the seed presence/absence trait. The leave-one-out cross validation analysis did not show much difference between the two methods. The Pearson correlation coefficients between the observed and predicted trait values were 0.4715 and 0.4729, respectively, for the expectation and overdispersion methods. The corresponding PEs were 0.104914 and 0.104721, respectively. Both criteria favor the overdispersion method, although only marginally.
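The leave-one-out cross validation used to compare the two methods can be sketched generically in Python. In the sketch below each observation is predicted from a model fitted to all the other observations, and predictability is summarized by the Pearson correlation between observed and predicted values. All names are hypothetical, and a simple least-squares line on made-up data stands in for the GLMM, which is not reimplemented here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def leave_one_out(x, y, fit, predict):
    """Predict each y[i] from a model fitted without observation i."""
    preds = []
    for i in range(len(y)):
        model = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(predict(model, x[i]))
    return preds


def fit_line(x, y):
    """Stand-in model: least-squares line, returned as (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


def predict_line(model, x_new):
    slope, intercept = model
    return intercept + slope * x_new


# Made-up data for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.1, 4.9, 7.2, 9.0]
y_hat = leave_one_out(x, y, fit_line, predict_line)
r = pearson(y, y_hat)
```

Replacing `fit_line` and `predict_line` with a routine that fits the expectation or the overdispersion model would reproduce the comparison design, since the cross validation loop does not depend on the model being fitted.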

Discussion

The overdispersion method for handling missing genotypes was proposed by Xu and Hu (2010) for IM under the GLM framework. We examined this method and an additional one (the expectation method) under the GLMM framework for mapping multiple QTL. The GLMM and GLM are different and thus the extension is not a trivial task. The overdispersion method consistently showed advantages over the expectation method in both the simulated and real data analyses and in both the binomial and binary trait analyses. Based on the visual plots of the estimated QTL effects, the advantages appeared to be marginal. Then why should we bother to develop such a method, given the observed marginal advantage? First, the overdispersion method does not require much more computational load than the expectation method. The computational times of the two methods are nearly the same when numerical differentiation packages are used to evaluate the first and second partial derivatives. Therefore, we should take any opportunity to extract maximum information from the data; even a slight advantage is worth the effort. Secondly, simulation experiments are always limited. It is hard to simulate all possible scenarios so that the advantages of the overdispersion method are fully exposed. In some situations, the advantage may be obvious and we may simply have failed to identify those situations. Thirdly, the two methods already demonstrated some interesting differences in the wheat data analysis of the binomial trait that are worth discussing. The largest QTL detected by the overdispersion method was split into several smaller QTL by the expectation method. The cross validation analysis showed that the overdispersion method gave a better prediction, implying that the single large QTL most likely represents the truth. The binary data analysis of the wheat experiment showed that the same locus also had a large effect on the binary trait. This time both methods showed a single large QTL. This observation further supports the single large QTL hypothesis. Without the overdispersion method, we would not have such confidence in this single large QTL.

The advantage of the overdispersion method will diminish as the marker density increases. In the situation where the entire genome is sequenced, the two methods would converge to the same result because genotypes of all markers will be observed. However, full genome sequences for most species are not expected to happen soon. In addition, missing genotypes may still exist due to human and technical errors in experiments. Therefore, the missing genotype handling methods remain useful in the foreseeable future.

The GLMM is sufficiently general that it can handle traits with any distribution as long as the likelihood function is programmable. The normal distribution for the QTL effects may be substituted by other distributions. Explicit expressions of the derivatives are not required to implement the Newton–Raphson updates. Recently, Yi and Banerjee (2009) developed a hierarchical GLM for mapping discrete trait QTL. They used the pseudo likelihood approach to approximate the observed log likelihood function. The authors used an EM algorithm to estimate the QTL effects but they treated {β,γ} as parameters and G as missing values. In addition, Yi and Banerjee (2009) only considered marker analysis with missing marker genotypes replaced by the conditional expectation, which is equivalent to the expectation method of this study. However, they only considered missing marker genotypes in the sense that the majority of the individuals are genotyped. The missing genotypes in their study were solely caused by technical or human errors. They did not insert pseudo markers every few centimorgans to saturate the genome.

Responding to a reviewer's suggestion, we analyzed both the binomial and binary traits of the wheat experiment using the LMM, ignoring the discrete nature of the traits. The correct model should be the GLMM, but we used the LMM as an ad hoc model to analyze the discrete traits. The results are depicted in Supplementary Figure S1 for the binomial trait and Supplementary Figure S2 for the binary trait. Supplementary Figure S1 shows the estimated QTL effects and LOD scores of the LMM analysis for the binomial trait. Only one large QTL was detected using this ad hoc model. Comparing Supplementary Figure S1 with Figure 5 of the main text, we can see that many small- to medium-sized QTL detected by the GLMM were missed. Results of leave-one-out cross validation are shown in Supplementary Table S1. The Pearson correlation coefficients between the observed and predicted trait values dropped from 0.517 (expectation) and 0.529 (overdispersion) to 0.495 (expectation) and 0.497 (overdispersion). This means that the medium-sized QTL detected by the GLMM do contribute to the binomial trait variation, and ignoring the discrete nature of the trait decreased the predictability of the model. The binary trait comparison between GLMM and LMM favors the GLMM even more (see Supplementary Figure S2 and Table S2 of the Supplementary material).

GLM and GLMM represent an important area of statistics. They were designed specifically to deal with discrete traits or other traits that deviate from a normal distribution. In statistics, people rarely question the suitability of the GLMM on the grounds that the LMM is already available for normally distributed traits. In case–control studies for human diseases, logistic regression (a special case of the GLM) is often used to detect disease QTL (Hunter et al, 2007) because case (designated by 1) and control (designated by 0) constitute the two binary states of the disease outcome. People rarely analyze the 0-1 binary trait using simple regression analysis that ignores the discrete nature of the trait. The situation is different for QTL mapping in plants and animals. Every time a new method is developed for discrete traits, the investigator must face challenges from peers about how much improvement is achieved over an analysis that ignores the discrete nature of the trait. These challenges have occurred repeatedly and may largely be credited to (or blamed on) the works of Visscher et al (1996) and Rebai (1997), who showed only marginal improvement of GLM over LM for binary trait QTL mapping when the binary trait is treated as if it were continuous. The conclusion of Rao and Xu (1998) about the ad hoc treatment of categorical traits was slightly different. They found that if a categorical trait is analyzed using simple linear models that ignore its categorical nature, the power and accuracy of QTL parameter estimation can be reduced substantially. Although it is true from a practical point of view that the loss of power and accuracy may be marginal when discrete traits are treated as continuous ones, the GLM or GLMM is built on a rigorous statistical foundation and thus its suitability should not be in question. Especially in the era of high-power computing, one should not use a suboptimal algorithm when the optimal algorithm is known to be available. On the other hand, if an investigator presents the results of a binary trait analysis using a simple method that treats the binary phenotype as a continuous trait, the investigator will often face criticism from peers for not using the correct model, given the availability of GLM or GLMM. For the benefit of these investigators, the new GLMM approach provides a useful tool for correctly analyzing their data and avoiding rejection of their fine work.

Data archiving

Simulated data, the test dataset from Dou et al. (2009) and SAS scripts for analyzing these datasets have been deposited at Dryad: doi:10.5061/dryad.mn159hq6.

References

Boer MP, Wright D, Feng L, Podlich DW, Luo L, Cooper M et al. (2007). A mixed-model quantitative trait loci (QTL) analysis for multiple-environment trial data using environmental covariables for QTL-by-environment interactions, with an example in maize. Genetics 177: 1801–1813.
Breslow NE, Clayton DG (1993). Approximate inference in generalized linear mixed models. J Am Stat Assoc 88: 9–25.
DeGroot MH (2004). Optimal Statistical Decisions. John Wiley & Sons: Hoboken, New Jersey.
Dou B, Hou B, Xu H, Lou X, Chi X, Yang J et al. (2009). Efficient mapping of a female sterile gene in wheat (Triticum aestivum L). Genet Res Camb 91: 337–343.
Friedman J, Hastie T, Tibshirani R (2010). Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33: 1–22.
Gelman A, Carlin JB, Stern HS, Rubin DB (2004). Bayesian Data Analysis. Chapman & Hall: New York.
Henderson CR (1950). Estimation of genetic parameters (Abstract). Ann Math Statist 21: 309–310.
Hoerl AE, Kennard RW (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12: 55–67.
Hunter DJ, Kraft P, Jacobs KB, Cox DG, Yeager M, Hankinson SE et al. (2007). A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet 39: 870–874.
Jiang C, Zeng Z-B (1997). Mapping quantitative trait loci with dominance and missing markers in various crosses from two inbred lines. Genetica 101: 47–58.
Kidd KK, Ott J (1984). Power and sample size in linkage studies. Human Gene Mapping 7(1984): Seventh International Workshop on Human Gene Mapping. Cytogenet Cell Genet 37: 510–511.
McCullagh P, Nelder JA (1989). Generalized Linear Models. Chapman & Hall: New York.
McCulloch CE, Neuhaus JM (2005). Generalized Linear Mixed Model. Encyclopedia of Biostatistics. John Wiley & Sons, Ltd: San Francisco.
McGilchrist CA (1994). Estimation in generalized mixed models. J R Stat Soc B 56: 61–69.
Picard R, Cook D (1984). Cross-validation of regression models. J Am Stat Assoc 79: 575–583.
Rao SQ, Xu S (1998). Mapping quantitative trait loci for ordered categorical traits in four-way crosses. Heredity 81: 214–224.
Rebai A (1997). Comparison of methods for regression interval mapping in QTL analysis with non-normal traits. Genet Res Camb 69: 69–74.
Risch N (1991). A note on multiple testing procedures in linkage analysis. Am J Hum Genet 48: 1058–1064.
Tibshirani R (1996). Regression shrinkage and selection via the lasso. J R Stat Soc B 58: 267–288.
Visscher PM, Haley CS, Knott SA (1996). Mapping QTLs for binary traits in backcross and F2 populations. Genet Res Camb 68: 55–63.
Vonesh EF (1996). A note on the use of Laplace's approximation for nonlinear mixed-effects models. Biometrika 83: 447–452.
Wald A (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans Amer Math Soc 54: 426–482.
Wedderburn RWM (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61: 439–447.
Wolfinger R, O'Connell M (1993). Generalized linear mixed models: A pseudo-likelihood approach. J Statist Comput Simul 48: 233–243.
Xu S, Hu Z (2010). Generalized linear model for interval mapping of quantitative trait loci. Theor Appl Genet 121: 47–63.
Xu S, Yi N (2000). Mixed model analysis of quantitative trait loci. Proc Nat Acad Sci USA 97: 14542–14547.
Yi N, Banerjee S (2009). Hierarchical generalized linear models for multiple quantitative trait locus mapping. Genetics 181: 1101–1113.
Ypma TJ (1995). Historical development of the Newton-Raphson method. SIAM Review 37: 531–551.
Zou H (2006). The adaptive Lasso and its oracle properties. J Am Stat Assoc 101: 1418–1429.

Author information

Authors and Affiliations

Department of Statistics, University of California, Riverside, CA, USA
X Che
Department of Botany and Plant Sciences, University of California, Riverside, CA, USA
S Xu

Corresponding author

Correspondence to S Xu.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

Supplementary Information accompanies the paper on Heredity website

About this article

Cite this article

Che, X., Xu, S. Generalized linear mixed models for mapping multiple quantitative trait loci. Heredity 109, 41–49 (2012). https://doi.org/10.1038/hdy.2012.10

Received: 11 November 2011
Revised: 05 January 2012
Accepted: 09 January 2012
Published: 14 March 2012
Issue Date: July 2012
DOI: https://doi.org/10.1038/hdy.2012.10
