Variation explained in mixed-model association mapping

Sun, G; Zhu, C; Kramer, M H; Yang, S-S; Song, W; Piepho, H-P; Yu, J

doi:10.1038/hdy.2010.11

Original Article
Published: 10 February 2010

G Sun¹, C Zhu¹, M H Kramer², S-S Yang³, W Song³, H-P Piepho⁴ & J Yu¹

Heredity volume 105, pages 333–340 (2010)

Abstract

Genomic mapping of complex traits across species demands integrating genetics and statistics. In particular, because it is easily interpreted, the R² statistic is commonly used in quantitative trait locus (QTL) mapping studies to measure the proportion of phenotypic variation explained by molecular markers. Mixed models with random polygenic effects have been used in complex trait dissection in different species. However, unlike fixed linear regression models, linear mixed models have no well-established R² statistic for assessing goodness-of-fit and prediction power. Our objectives were to assess the performance of several R²-like statistics for a linear mixed model in association mapping and to identify any such statistic that measures model-data agreement and provides an intuitive indication of QTL effect. Our results showed that the likelihood-ratio-based R² (R_LR²) satisfies several critical requirements proposed for the R²-like statistic. As R_LR² reduces to the regular R² for fixed models without random effects other than residual, it provides a general measure for the effect of QTL in mixed-model association mapping. Moreover, we found that R_LR² can help explain the overlap between overall population structure modeled as fixed effects and relative kinship modeled through random effects. As both approaches are derived from molecular marker information and are not mutually exclusive, comparing R_LR² values from different models provides a logical bridge between statistical analysis and underlying genetics of complex traits.

Main

Researchers in many disciplines use linear regression models widely. The R² statistic, the coefficient of determination, is one of the most frequently used measures of prediction power and goodness-of-fit for simple linear regression models (Draper and Smith, 1981; Everitt, 2002). In the literature on genetics, researchers often report R² values of newly identified genetic loci in addition to effect sizes and P-values (Lettre et al., 2008; Weedon et al., 2008). For nonstandard linear regression models, however, several competing R²-like statistics have been proposed to measure prediction power and goodness-of-fit (Buse, 1973; Magee, 1990; Xu, 2003; Kramer, 2005) but have not been used in genetics. Indeed, it is desirable to have a measure for general linear mixed models analogous in some ways to the R² of the linear regression model, which has a ‘variation explained’ interpretation.

Association mapping searches for associations between genetic markers and complex traits (for example, disease susceptibility) in populations (Hirschhorn and Daly, 2005). It complements linkage analysis in mapping the genetic basis of complex traits. Mixed models have long been used in genetic research (Henderson, 1984; Lynch and Walsh, 1998), and the mixed-model association mapping methods were developed to account for complex population structure (Meuwissen et al., 2002; Yu et al., 2006; Malosetti et al., 2007). Although statistics like deviance and the Bayesian Information Criterion (BIC) (Schwarz, 1978) can be used to select models (Broman and Speed, 2002; Littell et al., 2006), many researchers desire an R²-like statistic for mixed models because it can indicate the prediction power of various models containing different fixed and random effects and their associated variance–covariance structure. After identifying statistically significant genetic loci (Kennedy et al., 1992), many geneticists would ask how much of the phenotypic variation is explained by each quantitative trait locus (QTL), for the purpose of interpretation or comparison. In other words, what is the relative degree of improvement in the model fit to the data that results from including this significant genetic effect? Moreover, R²-like statistics complement statistical testing by providing practitioners with a more intuitive measurement than the P-value from other statistical tests (for example, likelihood-ratio (LR) test or F-test). Compared with statistics like deviance and BIC, R²-like statistics offer an alternative, easier to grasp measurement for geneticists.

Several approaches can quantify the genetic relationship of a complex population in the context of association mapping using molecular marker information (Weir et al., 2006). The first approach was developed to examine population structure by estimating the probability of subgroup membership (Pritchard et al., 2000; Falush et al., 2003). Recent research showed that principal component analysis (PCA) can also capture population differentiation (Price et al., 2006). A second approach focuses on the pairwise genetic relationship by estimating relative kinship (Loiselle et al., 1995; Ritland, 1996; Yu et al., 2006). As these two approaches are not orthogonal and the same marker data can be used to reflect population structure, principal components, and relative kinship, dependency among these different estimates is expected. Simultaneously fitting these estimates in the model, however, does not necessarily preclude the objective of controlling multiple levels of genetic relatedness within the association panel. In practice, the effects of controlling complex population structure with different estimates (population structure, principal components, and relative kinship) can vary by populations, traits, or both (Yu et al., 2006, 2009; Zhao et al., 2007; Zhu and Yu, 2009). A legitimate question, then, is whether a statistic like R² can be used to compare the different levels of control for genetic relationships.

Much of the literature on using R² for nonstandard linear models comes from statistics and econometrics, whereas such literature in the field of genetics is limited. Accordingly, our objectives were to assess the performance of several R²-like statistics for a linear mixed model in association mapping and to identify a general R²-like statistic that measures model-data agreement and provides an intuitive indication of the QTL effect. Although theoretical derivation or developing new statistics are beyond the scope of this study, we introduce four R²-like statistics for nonstandard linear models, describe mixed-model association mapping, and test the performance of these four R²-like statistics in the context of association mapping with computer simulations. We then apply these statistics to two empirical data sets.

Materials and methods

R² for fixed linear models

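For the linear model with only fixed effects

$$y = X\beta + e$$

where y is an n × 1 vector, X is an n × k matrix, β is a k × 1 vector of unknown regression coefficients, and e is an n × 1 vector consisting of i.i.d. normal variables with mean 0 and variance σ². Then the usual R² statistic is defined as

$$R^{2} = 1 - \frac{SSE}{SSTO}$$

where $SSE = \sum_{i}(y_{i} - \hat{y}_{i})^{2}$ is the error sum of squares with ŷ=Xβ̂, and $SSTO = \sum_{i}(y_{i} - \bar{y})^{2}$ is the corrected total sum of squares with ȳ the mean of y. As 0⩽SSE⩽SSTO, it follows that 0⩽R²⩽1.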
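
As a minimal sketch of this definition, the following Python snippet computes R² = 1 − SSE/SSTO for a least-squares line with one predictor. The function name `r_squared_simple` and the toy dosage and phenotype values are illustrative assumptions, not part of the original analysis.

```python
def r_squared_simple(x, y):
    """Classical R^2 = 1 - SSE/SSTO for the least-squares line y = b0 + b1*x."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                  # least-squares slope
    b0 = y_bar - b1 * x_bar         # least-squares intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / ssto


# Hypothetical toy data: allele dosage (0, 1, 2) and a phenotype.
dosage = [0, 0, 1, 1, 2, 2]
phenotype = [9.8, 10.1, 10.4, 10.2, 10.9, 10.6]
r2 = r_squared_simple(dosage, phenotype)
```

Because the fitted line includes an intercept, SSE cannot exceed SSTO, so the returned value stays between 0 and 1.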

R² statistics for linear mixed models

The linear model with both fixed effects and random effects is

$$y = X\beta + Zu + e$$

where y is an n × 1 observation vector, X is an n × k design matrix linked to the fixed effect, β is a k × 1 vector of unknown regression coefficients of fixed effects, Z is an n × p design matrix linked to the random effects, u is a p × 1 vector of random variables from a multivariate normal distribution (MVN) with zero means and variance–covariance matrix G (that is u∼MVN (0, G)), and e is an n × 1 vector of random errors with zero means and variance–covariance matrix Iσ² (that is e∼MVN (0, Iσ²)). Thus, y is MVN (Xβ, V) and V=ZGZ′+Iσ². Several statistics have been proposed for mixed models (Table 1), and we describe them briefly in the following sections.

Table 1 Summary of different R² statistics for the linear mixed model

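Two research groups (Cox and Snell, 1989; Magee, 1990) have independently proposed the likelihood-ratio-based R² (R_LR²), an R²-like statistic based on the LR:

$$R_{LR}^{2} = 1 - \exp\left(-\frac{2}{n}\left(\log L_{M} - \log L_{0}\right)\right)$$

where logL_M is the maximum log-likelihood of the model of interest, logL₀ is the maximum log-likelihood of the intercept-only model, n is the number of observations, and

$$\log L = -\frac{1}{2}\log|V| - \frac{1}{2}(y - X\beta)' V^{-1} (y - X\beta) - \frac{n}{2}\log(2\pi)$$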

Please note that the calculation is based on maximum likelihood (ML), not restricted ML (REML). The same formula of R_LR² was also suggested for the binary response models earlier by Maddala (1983). The LR statistic can be written as LR=2log(L_M/L₀). The relationship between R_LR² and LR is R_LR²=1−exp(−LR/n). The R_LR² statistic is appropriate when the concept of residual variance cannot be easily defined and ML is the criterion of fitting the model of interest. It can be shown that when the model only has fixed effects, R_LR² is reduced to the traditional R² statistic. For discrete models like logistic regression, a scaling procedure should be applied to ensure the resulting R_LR² is bounded between 0 and 1 (Nagelkerke, 1991).
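
The relationship R_LR²=1−exp(−LR/n) and its reduction to the classical R² can be checked numerically. The sketch below assumes a Gaussian fixed-effects model fitted by ML, for which the maximized log-likelihood depends only on the residual sum of squares; the function names and the sums of squares are hypothetical.

```python
import math


def r2_lr(loglik_model, loglik_null, n):
    """Likelihood-ratio-based R^2: 1 - exp(-(2/n) * (logL_M - logL_0))."""
    return 1.0 - math.exp(-2.0 / n * (loglik_model - loglik_null))


def gaussian_ml_loglik(sse, n):
    """Maximized log-likelihood of a fixed-effects Gaussian model.

    Uses the ML (not REML) variance estimate sse / n.
    """
    return -0.5 * n * (math.log(2.0 * math.pi * sse / n) + 1.0)


# Hypothetical sums of squares for n = 50 observations.
n, sse, ssto = 50, 30.0, 40.0
loglik_m = gaussian_ml_loglik(sse, n)     # model of interest
loglik_0 = gaussian_ml_loglik(ssto, n)    # intercept-only model
r2_from_lik = r2_lr(loglik_m, loglik_0, n)
r2_classical = 1.0 - sse / ssto
```

Under these assumptions logL_M − logL₀ = −(n/2)log(SSE/SSTO), so the exponential term equals SSE/SSTO and the two values agree up to rounding.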

The generalized least square R² statistic, R_W², is defined as (Buse, 1973):

$$R_{W}^{2} = 1 - \frac{(y - X\hat{\beta})' V^{-1} (y - X\hat{\beta})}{(y - \bar{y}_{w}\xi)' V^{-1} (y - \bar{y}_{w}\xi)}$$

where Xβ̂ is the best predictor of y, and ȳ_w is the weighted mean: ȳ_w=(ξ′V⁻¹ξ)⁻¹ξ′V⁻¹y with ξ′=(1,…,1). This original definition is denoted as R_W1². It can be shown that, with β̂=(X′V⁻¹X)⁻¹X′V⁻¹y, there is a direct summation relationship:

$$(y - \bar{y}_{w}\xi)' V^{-1} (y - \bar{y}_{w}\xi) = (X\hat{\beta} - \bar{y}_{w}\xi)' V^{-1} (X\hat{\beta} - \bar{y}_{w}\xi) + (y - X\hat{\beta})' V^{-1} (y - X\hat{\beta})$$

Replacing Xβ̂ with ŷ=Xβ̂+Zû in R_W² yields the R_W2² statistic (Kramer, 2005). There is no direct summation relationship for components in R_W2². In addition, it is difficult to interpret the numerator term where V⁻¹, rather than (Iσ²)⁻¹, is used because the random term appears in both (y−ŷ)=(y−(Xβ̂+Zû)) and V=ZGZ′+Iσ². Here Xβ̂ is marginal because the prediction only involves fixed effects, but ŷ=Xβ̂+Zû is conditional because the prediction is conditional on random effects (Vonesh et al., 1996; Vonesh and Chinchilli, 1997; Littell et al., 2006). Note that when the model has only fixed effects, both forms of R_W² are reduced to the traditional R² statistic.
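
A small numerical sketch may help make the weighted form concrete. It assumes Buse's definition, R_W² = 1 − (y−f)′V⁻¹(y−f) / (y−ȳ_wξ)′V⁻¹(y−ȳ_wξ), where f is the fitted vector (Xβ̂ for R_W1², or Xβ̂+Zû supplied by the user for R_W2²), and it treats V as known. All function names and data are hypothetical, and the plain-Python linear solver is only suitable for tiny examples.

```python
def dot(a, b):
    return sum(x * z for x, z in zip(a, b))


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]      # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * z for x, z in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def gls_fitted(y, X, V):
    """Marginal prediction X*beta_hat, beta_hat = (X'V^-1 X)^-1 X'V^-1 y."""
    k = len(X[0])
    cols = [[row[j] for row in X] for j in range(k)]
    vinv_cols = [solve(V, c) for c in cols]               # V^-1 x_j
    xtvx = [[dot(cols[i], vinv_cols[j]) for j in range(k)] for i in range(k)]
    xtvy = [dot(vinv_cols[j], y) for j in range(k)]       # uses symmetry of V
    beta = solve(xtvx, xtvy)
    return [dot(row, beta) for row in X]


def r2_w(y, fitted, V):
    """1 - (y - fitted)'V^-1(y - fitted) / (y - ybar_w)'V^-1(y - ybar_w)."""
    ones = [1.0] * len(y)
    vinv_one = solve(V, ones)
    ybar_w = dot(vinv_one, y) / dot(vinv_one, ones)       # weighted mean
    res = [a - b for a, b in zip(y, fitted)]
    tot = [a - ybar_w for a in y]
    return 1.0 - dot(res, solve(V, res)) / dot(tot, solve(V, tot))


# Hypothetical data: two pairs of correlated observations.
y = [1.0, 2.0, 2.5, 4.0]
V = [[2.0, 0.5, 0.0, 0.0], [0.5, 2.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 0.5], [0.0, 0.0, 0.5, 2.0]]
X_intercept = [[1.0] for _ in y]
r2_w1_intercept_only = r2_w(y, gls_fitted(y, X_intercept, V), V)
```

With an intercept as the only fixed effect, the generalized least squares fit coincides with the weighted mean, so the numerator and denominator are equal and R_W1² is 0. This is the situation of a model whose only other term is random.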

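The r_c statistic is a goodness-of-fit measure originally derived for the generalized nonlinear mixed-effect model, following the unweighted concordance correlation coefficient (ρ_c) (Vonesh et al., 1996):

$$r_{c} = 1 - \frac{\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n}(y_{i} - \bar{y})^{2} + \sum_{i=1}^{n}(\hat{y}_{i} - \tilde{y})^{2} + n(\bar{y} - \tilde{y})^{2}}$$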

where n is the number of observations, ȳ is the mean of y, ŷ=Xβ̂+Zû, and ỹ is the mean of ŷ. With ŷ=Xβ̂+Zû, both fixed and random effects are used to measure goodness-of-fit and prediction power and r_c is conditional (Vonesh et al., 1996; Vonesh and Chinchilli, 1997). The r_c statistic can be interpreted as a measure of the degree of agreement between the observed values and the predicted values as ρ_c measures agreement between two random variables. The possible values of r_c lie in the range −1⩽r_c⩽1.
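
The concordance-based measure is simple to compute once predicted values are available. The sketch below assumes the form r_c = 1 − Σ(y_i−ŷ_i)² / [Σ(y_i−ȳ)² + Σ(ŷ_i−ỹ)² + n(ȳ−ỹ)²]; the function name and the example vectors are hypothetical.

```python
def concordance_rc(y, y_hat):
    """Concordance-type agreement between observed and predicted values."""
    n = len(y)
    y_bar = sum(y) / n                  # mean of observations
    yhat_bar = sum(y_hat) / n           # mean of predictions
    num = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    den = (sum((a - y_bar) ** 2 for a in y)
           + sum((b - yhat_bar) ** 2 for b in y_hat)
           + n * (y_bar - yhat_bar) ** 2)
    return 1.0 - num / den


# Hypothetical observations and least-squares predictions from y = 0.95 + 0.95*x.
y_obs = [1.0, 2.0, 2.5, 4.0]
y_fit = [0.95 + 0.95 * x for x in range(4)]
rc = concordance_rc(y_obs, y_fit)
```

For an ordinary least-squares fit with an intercept, the mean of the predictions equals the mean of the observations and the expression works out to 2R²/(1+R²), which is one way to see why r_c does not coincide with R² for a fixed-effects model unless the fit is perfect.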

The P_rand statistic measures the proportional reduction in the penalized quasi-likelihood function assuming MVN random effects (Zheng, 2000):

$$P_{rand} = 1 - \frac{PQL_{M}}{PQL_{0}}$$

where PQL_M denotes a penalized quasi-likelihood function for the model of interest, PQL₀ denotes a penalized quasi-likelihood function for the null model where the model contains only the intercept, û is the estimated best linear unbiased predictor of u, ŷ=Xβ̂+Zû is the estimated best linear unbiased predictor of y, Ĝ is the ML estimate of G (the variance covariance matrix of u), and σ̂ is the ML estimate of σ. The range of the statistic P_rand is 0–1 under these model assumptions. The larger the P_rand the better the prediction and the smaller the random effect. The penalty for random effects in P_rand is analogous to Akaike’s Information Criterion and Schwarz’s BIC. Note that when the model has only fixed effects, P_rand is reduced to the traditional R² statistic.

Models in association mapping

When both population structure (Q) and kinship (K) are included, the mixed model for the Q+K method is

$$y = \mu + Qv + Zu + e$$

where y is a vector of phenotype observation, μ is a vector of intercepts; v is a k × 1 vector of population effects; u is a p × 1 vector of random polygene background effects; e is a vector of random experimental errors; Q is an n × k matrix defining the subgroup membership, generated from population structure analysis of marker data, and Z is an n × p incidence matrix relating y to u. For Var(u)=G=2KV_g, K is a p × p matrix of kinship coefficients, and V_g (a scalar) is the unknown genetic variance, E(e)=0 and Var(e)=Iσ².
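
To illustrate how the kinship matrix enters the variance of y, the sketch below builds V = ZGZ′ + Iσ² with G = 2KV_g under the simplifying assumption that Z is the identity matrix (one phenotype record per line, so n = p). The function name, the kinship values, and the variance components are hypothetical.

```python
def build_v(K, v_g, sigma2):
    """V = 2*K*V_g + I*sigma^2, assuming Z is the identity matrix."""
    n = len(K)
    return [[2.0 * K[i][j] * v_g + (sigma2 if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]


# Hypothetical kinship coefficients for three lines.
K = [[0.50, 0.25, 0.00],
     [0.25, 0.50, 0.10],
     [0.00, 0.10, 0.50]]
V = build_v(K, v_g=0.8, sigma2=1.0)
```

Off-diagonal entries of V are nonzero only for pairs with nonzero kinship, which is how the random term Zu captures covariance among related individuals that the fixed Qv term does not model.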

Likewise, we can define the Q model without the Zu term; the K model without the Qv term; the P model with P (that is eigenvectors) from PCA replacing Q but no Zu term; and the P+K model with P replacing Q (Table 2). These models represent different combinations of methods that account for complex genetic relationships in the association mapping population (Yu et al., 2006; Weber et al., 2007; Zhao et al., 2007).

Table 2 Models used in the data analysis

Computer simulation

To assess the performance of these R² statistics in the context of mixed-model association mapping, we generated genetic populations with both gross level population structure and familial relationships within subpopulations. This allowed us to investigate mixed models with both fixed effects for population structure and random effects for relative kinship. Detailed simulation procedures have been described earlier (Zhu and Yu, 2009). Briefly, the β distribution (Balding and Nichols, 1995; Nicholson et al., 2002; Marchini et al., 2004) was used to model the correlated allele frequencies. Once allele frequencies of each locus for each subpopulation were sampled under the β model, conditionally on Hardy–Weinberg and linkage equilibrium, we mimicked different populations consisting of subpopulations. Specifically, we carried out simulations that mimicked two types of population used in association studies (Yu et al., 2006; Zhu and Yu, 2009): samples with both population structure and familial relatedness (type IV) and samples with severe population structure and familial relationship (type V). As with earlier extensive simulations (Zhu and Yu, 2009), the population size was 216, and three subpopulations were simulated for type IV and V samples. For each sample type, a total of 500 independent data sets were generated for analysis with three different models, and the various R² statistics were obtained. Samples in which the Hessian matrix or the covariance matrix of the random effects (seven for type IV and four for type V) were not positive semidefinite were removed.

To generate genotypes and phenotypes, a linkage map of 2000 cM composed of 10 chromosome segments, each 200 cM in length, was considered. An additive genetic model with no dominance or epistasis was used. Of the 2000 single nucleotide polymorphism (SNP) locations, 25 were chosen at random to be quantitative trait nucleotide (QTN) locations. In all simulations, we set each QTN genotypic value with genotype QQ as 0.5, genotype qq as 0, and the overall mean at 10. The overall genotypic value of an individual was obtained as the sum of genotypic values across all QTN plus the overall mean. An individual phenotype was generated as the genotypic value plus a random variable sampled from a standard normal distribution. Heritability for each QTN varied around 2%, depending on the allele frequency at each specific QTN.
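
The phenotype-generation step can be sketched as follows. The values for QQ (0.5), qq (0), the overall mean (10), the 25 QTNs, and the standard normal residual come from the description above; the heterozygote value of 0.25 (implied by additivity), the allele frequency of 0.5, the random seed, and the function names are assumptions made for illustration only.

```python
import random


def genotypic_value(q_allele_counts, effect_qq=0.5, mean=10.0):
    """Sum additive QTN values; each copy of the Q allele adds effect_qq / 2."""
    return mean + sum(effect_qq * g / 2.0 for g in q_allele_counts)


def simulate_phenotype(q_allele_counts, rng):
    """Genotypic value plus a standard normal residual."""
    return genotypic_value(q_allele_counts) + rng.gauss(0.0, 1.0)


rng = random.Random(1)          # hypothetical seed for reproducibility
n_qtn = 25
allele_freq = 0.5               # hypothetical frequency of the Q allele
# Hardy-Weinberg sampling: two independent allele draws per QTN.
counts = [sum(rng.random() < allele_freq for _ in range(2)) for _ in range(n_qtn)]
g_value = genotypic_value(counts)
phenotype = simulate_phenotype(counts, rng)
```

Under these assumptions each QTN contributes a genetic variance of 2p(1−p)(0.25)², which is 0.03125 at p = 0.5. With 25 such QTNs and a residual variance of 1, a single QTN accounts for roughly 0.03125/1.78, or about 1.8%, of the phenotypic variance, in line with the heritability of around 2% per QTN stated above.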

To verify the general agreement between the R_LR² statistic and the detection of true QTNs, we plotted the values of R_LR² for all SNPs with the P+K model from a random run of type IV samples.

Empirical data analysis

Data from two association mapping populations were used for empirical data analysis. Genotypes and three phenotypes (that is flowering time, ear height, and ear diameter) were chosen from 277 maize strains across 553 SNP as described earlier (Liu et al., 2003; Flint-Garcia et al., 2005; Yu et al., 2006). The Q matrix was computed by STRUCTURE (Pritchard et al., 2000; Falush et al., 2003) and the K matrix by SPAGeDi (Hardy and Vekemans, 2002). The P matrix was computed from EIGENSTRAT (Price et al., 2006), and three PCAs were used to be consistent with the Q matrix for degree of freedom in the model-fitting process. Arabidopsis genotypes and phenotypes were obtained from a published data set with 5419 SNPs and two flowering time measurements (SDV and JIC8W) (Zhao et al., 2007). These two traits passed our trait screening process and yielded meaningful variance component estimates for mixed-model analysis. The Q matrix contains eight subgroups, and the P matrix contains the first eight PCAs (Zhao et al., 2007). For R_LR², we modified the Venn diagrams to depict the overlapping but complementary nature of Q and K in capturing genetic relationships. The modification was to make the size of the circle proportional to the R_LR² value for easier interpretation of the diagram.

Results

All R² statistics (Table 1) yielded values between zero and one when different models were used to analyze data from two association mapping sample types, except the R_W1² statistic (Table 3). Notably, the zero values for R_W1² under the K model were not unexpected because its definition excludes random effects in calculating the predicted value. However, including the random term in prediction (R_W2²) yields values comparable to those of other R² statistics.

Table 3 Performance of R² statistics from different models under two association sample types

When only fixed effects were involved (that is, the P model), four R² statistics (that is, R_LR², R_W1², R_W2², and P_rand) yielded identical values (Table 3). This was expected because theoretical derivation showed that the three underlying definitions (R_LR², R_W², and P_rand) reduce to the original R² form for the fixed linear model. Meanwhile, the r_c statistic yielded different values for the fixed-effect model P because its formula does not reduce to R² for the fixed linear model.

Comparing an R² statistic among P, K, and P+K models showed differences between having a variable missing and having it added. Notably, R_LR² for the model with added variables (P+K model) was consistently higher than for the model with fewer variables (P or K model) without exception, but this was not the case for other R² statistics (Table 3). Moreover, the standard deviation of R_LR² was either equal to or smaller than that of other R² statistics. Also, the range of R² statistics was 0–1 except when the Hessian matrix or the covariance matrix of the random effects was not positive semidefinite, with the resulting negative value for P_rand removed in calculating the mean and standard deviation.

After determining the suitable candidate R² statistic for model comparison in mixed-model association mapping, we further demonstrated changes in R_LR² as the QTNs and other SNPs across the genome entered the mixed model individually (Figure 1). To do this, we used a type IV association mapping sample. As expected, R_LR² values with the SNP/QTN term were equal to or greater than the baseline R_LR² value from the model without the SNP/QTN term. As the variation due to individual QTNs varied depending on allele frequency, not all QTNs yielded a high R_LR² when their effects were included in the model. On the other hand, some SNPs can show a high R_LR² even when they were not the causal loci, revealing the challenges faced in association mapping.

For the maize data, only R_LR² consistently yielded a higher value for models with more variables (Q+K or P+K) than models with fewer variables (Q, K, or P) across three traits (Table 4). Next, for models with only fixed effects (that is Q or P), r_c values were different from the other four statistics, which agrees with the theoretical expectation and the simulation results. Furthermore, for Arabidopsis data, R_LR², r_c, and P_rand yielded a higher value for models with more variables, but this was not the case for R_W1² or R_W2² (Table 5).

Table 4 Analysis results of different R² statistics obtained by analyzing the maize traits with different models

Table 5 Analysis results of different R² statistics obtained by analyzing the Arabidopsis traits with different models

In the modified Venn diagram, R_LR² shows the overlap between the two methods in accounting for genetic relationships: population structure (Q) captures general grouping patterns and relative kinship (K) is a polygene background control (Figure 2). The relative importance of Q and K in model fitting varied for different quantitative traits, which was expected given the theory (Tables 4 and 5). The complementary nature of P and K can also be seen in the modified Venn diagram. Obviously, the relative contribution of Q, P, and K to the mixed-model analysis varied across different data sets or different traits. For example, both Q and P made a small contribution in the analysis of maize ear diameter, but including K only improved the model fit by a negligible amount, as shown by a small increase in R_LR².

Discussion

Various R²-like statistics for mixed models revealed the mixed perspectives on how the goodness-of-fit of the mixed models should be measured. For instance, the R_LR² statistic, based on the LR test (Magee, 1990), considers the change of likelihood between models with different fixed and random effects simultaneously. However, the R_W² statistic, based on the Wald statistic (Buse, 1973), measures the agreement between observations and the generalized least square predictors without considering random effects. The modified form, R_W2², which considers both random and fixed effects, would be a better choice than R_W1² for analyzing genetic relationships but needs further study. Next, the r_c statistic, based on the concordance correlation (Vonesh et al., 1996), indicates agreement between observations and the unweighted predicted values with both fixed and random effects, whereas the P_rand statistic, based on the penalized quasi-likelihood function (Zheng, 2000), measures the proportional reduction in penalized quasi-likelihood function. When only fixed effects are included in the model, three R² statistics, but not r_c, reduce to the simple form for fixed linear models. By definition, all R² statistics other than R_W1² would be suitable for genomic mapping with different fixed and random terms controlling genetic relationships. The zero value of R_W1² for the K model prevents its use in mixed-model association analysis. In comparing R_LR² and R_W2² for mixed-model analysis of a randomized complete block design and a design with spatially autocorrelated residuals (Kramer, 2005), the R² values of these two statistics increased when random effects were added to the model or when the correlated error structure was considered.

As the direct summation of the model sum of squares and the residual sum of squares to equal the corrected total sum of squares does not necessarily exist in generalized linear mixed models, the term ‘Pseudo-R²’ was suggested to differentiate the above proposed statistics from the classical R² (Schabenberger and Pierce, 2002). We, however, adopted the general definition of the R² statistic (Buse, 1973; Magee, 1990; Nagelkerke, 1991), rather than the specific definition for a fixed linear model, in the text. Here, we stress that the ‘proportion of variation explained’ in linear mixed models should not be interpreted to mean that there is always an exact summation. In this study, we focused on comparing four different R² statistics for their potential in mixed-model association mapping. All these statistics contain similar components, involving differences between the observed values and the predicted values (either directly in R_W², r_c and P_rand or indirectly in R_LR²). In particular, the R_LR² statistic has several appealing properties (Nagelkerke, 1991). First, it reduces to the classical R² for fixed models and is asymptotically independent of the sample size. Second, it is dimensionless and permits an interpretation based on proportion of variation explained. Furthermore, using R_LR² to compare models with the same random components (that is, K with Q+K or P+K) can be interpreted as comparing the fit of various nested models. On the other hand, comparing models with different fixed and random components provides a measure of model-data agreement under the ML framework, which satisfies a criterion proposed earlier: R² values for different models fitting the same data should be directly comparable (Kvalseth, 1985).

Ultimately, because it is easily computed and has a monotonic nondecreasing property, R_LR² is our choice to measure the goodness-of-fit of the model to the data. Expanding the mixed model to include other genetic and nongenetic factors should not complicate the calculation and interpretation of R_LR² because it is directly computed from the maximum log-likelihood of the full model and the reduced model. In simulation studies, an R² measure computed as the squared correlation between simulated and model predicted genetic values may be used (Piepho and Möhring, 2007). Other R² statistics based on the ratio of variance component for residuals between two models have also been proposed (Xu, 2003). A recent study, however, found that these latter statistics performed poorly because the R² values varied so little that identifying the most parsimonious model was difficult (Oreliena and Edwards, 2008). Extending R_LR² to the REML approach needs further study because comparing models with different fixed or random terms is only valid under the ML framework (Littell et al., 2006). The relationship between model fit and model selection, particularly in genomic mapping, is beyond the scope of this study (Broman and Speed, 2002; Sillanpaa and Corander, 2002; Yi et al., 2005). We have no intention of using R_LR² to conduct model selection because the monotonic nondecreasing property of R_LR² does not indicate a better model as additional fixed or random effects are added. Instead, we stress that the R_LR² statistic provides an additional measurement for results interpretation.

For mixed models with random components (K, P+K, or Q+K), variance component estimation was conducted independently before the solutions for mixed models were used to compute different R² statistics. On the basis of the definition of R_LR², the convergence process of ML of a model containing additional effects other than intercept and residual can also be viewed as a process to maximize R_LR² but not the other R² statistics. Clearly, R_LR² can quantify the goodness-of-fit of different models regardless of the statistical properties of the models (Cameron and Windmeijer, 1996). In an earlier study, we showed that the likelihood-based model-fitting approach can quantify the robustness of genetic relationships derived from molecular marker data (Yu et al., 2009). Essentially, kinship construction with subsets of the whole marker panel and subsequent model testing with multiple phenotypic traits can be viewed as a process to test the model-data fit of different variance–covariance matrices. With an adequate number of molecular markers, an accurate genetic relationship among individuals (that is variance covariance matrices) can be obtained, and the change in the value of R_LR² becomes minimal.

Comparing the values of R_LR² for Q, K, and Q+K, as shown with modified Venn diagrams, can help us understand the genetics behind two overlapping methods in accounting for genetic relationships. With complex genetic relationships among individuals in many association mapping panels (Meuwissen et al., 2002; Yu et al., 2006; Zhao et al., 2007; Zhu and Yu, 2009), various competing but mostly complementary methods to capture these relationships were developed. Thus, the contribution to the model-data agreement from either Q and P (population structure and PCA) or K (kinship) can be determined from the R_LR² when each is fitted alone. Next, the overall contribution and overlap can be shown by comparing the R_LR² values of Q+K (or P+K) with the values from models with individual components (that is, Q, P, or K). Finally, although it is not a statistic with a significance test, R_LR² does provide an indication of a variable's importance in model fitting, for example, SNP, Q, P, or K (Kvalseth, 1985). With an established base model (Yu et al., 2006), the changes in R_LR² values that result from adding individual molecular markers provide information on the relative importance of different markers in further explaining the total variation.
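
One way to read such comparisons is by inclusion and exclusion. The sketch below treats the shared part of Q and K as R_LR²(Q) + R_LR²(K) − R_LR²(Q+K) and the contribution of a marker as the increase over the base model. This decomposition is an assumption made here for illustration, not a formula stated in the text, and the R_LR² values are made-up numbers rather than results from the study.

```python
def r2_lr_overlap(r2_q, r2_k, r2_qk):
    """Shared part of R_LR^2 between Q and K (inclusion-exclusion assumption)."""
    return r2_q + r2_k - r2_qk


def r2_lr_gain(r2_base, r2_with_marker):
    """Increase in R_LR^2 when one marker is added to the base model."""
    return r2_with_marker - r2_base


# Made-up R_LR^2 values for one trait (not taken from the study).
r2_q, r2_k, r2_qk = 0.20, 0.30, 0.35
shared = r2_lr_overlap(r2_q, r2_k, r2_qk)     # captured by both Q and K
unique_q = r2_qk - r2_k                       # captured by Q only
unique_k = r2_qk - r2_q                       # captured by K only
marker_gain = r2_lr_gain(r2_qk, 0.37)         # hypothetical SNP added to Q+K
```

Because R_LR² does not decrease when terms are added, the unique parts and the marker gain are nonnegative, whereas the shared part is only a descriptive quantity and carries no significance test.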

In summary, we demonstrated through simulated association mapping samples and empirical data analyses that the LR-based R² statistic has several desirable properties useful in mixed-model association mapping. Applying genomic technologies in complex trait dissection has generated vast amounts of data, the analysis of which requires a joint effort in genetics and statistics. There are many challenges in this multidisciplinary research (Hirschhorn and Daly, 2005; Weir et al., 2006; McCarthy et al., 2008; Zhu et al., 2008), but such research also provides great opportunities for further collaboration among researchers from different disciplines with different specialties.

References

Balding DJ, Nichols RA (1995). A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96: 3–12.
Broman KW, Speed TR (2002). A model selection approach for the identification of quantitative trait loci in experimental crosses. J R Stat Soc B 64: 641–656.
Buse A (1973). Goodness of fit in generalized least-squares estimation. Am Stat 27: 106–108.
Cameron AC, Windmeijer FAG (1996). R-squared measures for count data regression models with applications to health-care utilization. J Bus Econ Stat 14: 209–220.
Cox DR, Snell EJ (1989). Analysis of Binary Data, 2nd edn. Chapman and Hall: London.
Draper NR, Smith H (1981). Applied Regression Analysis, 2nd edn. John Wiley & Sons: New York, NY.
Everitt BS (2002). Cambridge Dictionary of Statistics, 2nd edn. Cambridge University Press: Cambridge, UK.
Falush D, Stephens M, Pritchard JK (2003). Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567–1587.
Flint-Garcia SA, Thuillet AC, Yu J, Pressoir G, Romero SM, Mitchell SE et al. (2005). Maize association population: a high-resolution platform for quantitative trait locus dissection. Plant J 44: 1054–1064.
Hardy OJ, Vekemans X (2002). SPAGeDi: a versatile computer program to analyze spatial genetic structure at the individual or population levels. Mol Eco Notes 2: 618–620.
Henderson CR (1984). Application of Linear Models in Animal Breeding. University of Guelph: Ontario.
Hirschhorn JN, Daly MJ (2005). Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6: 95–108.
Kennedy BW, Quinton M, van Arendonk JA (1992). Estimation of effects of single genes on quantitative traits. J Anim Sci 70: 2000–2012.
Kramer M (2005). R² statistics for mixed models. 2005 Proceedings of the Conference on Applied Statistics in Agriculture, Manhattan, KS, pp 148–160.
Kvalseth TO (1985). Cautionary note about R². Am Stat 39: 279–285.
Lettre G, Jackson AU, Gieger C, Schumacher FR, Berndt SI, Sanna S et al. (2008). Identification of ten loci associated with height highlights new biological pathways in human growth. Nat Genet 40: 584–591.
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O (2006). SAS for Mixed Models, 2nd edn. SAS Press: Cary, NC, USA.
Liu K, Goodman M, Muse S, Smith JS, Buckler E, Doebley J (2003). Genetic structure and diversity among maize inbred lines as inferred from DNA microsatellites. Genetics 165: 2117–2128.
Loiselle BA, Sork VL, Nason J, Graham C (1995). Spatial genetic structure of a tropical understory shrub, Psychotria officinalis (Rubiaceae). Am J Bot 82: 1420–1425.
Lynch M, Walsh JB (1998). Genetics and Analysis of Quantitative Traits. Sinauer Associates, Inc.: Sunderland, MA.
Maddala GS (1983). Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press: Cambridge, UK.
Magee L (1990). R² measures based on Wald and likelihood ratio joint significance tests. Am Stat 44: 250–253.
Malosetti M, van der Linden CG, Vosman B, van Eeuwijk FA (2007). A mixed-model approach to association mapping using pedigree information with an illustration of resistance to Phytophthora infestans in potato. Genetics 175: 879–889.
Marchini J, Cardon LR, Phillips MS, Donnelly P (2004). The effects of human population structure on large genetic association studies. Nat Genet 36: 512–517.
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP et al. (2008). Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9: 356–369.
Meuwissen TH, Karlsen A, Lien S, Olsaker I, Goddard ME (2002). Fine mapping of a quantitative trait locus for twinning rate using combined linkage and linkage disequilibrium mapping. Genetics 161: 373–379.
Nagelkerke NJD (1991). A note on a general definition of the coefficient of determination. Biometrika 78: 691–692.
Nicholson G, Smith AV, Jónsson F, Gústafsson Ó, Stefánsson K, Donnelly P (2002). Assessing population differentiation and isolation from single-nucleotide polymorphism data. J R Stat Soc B 64: 695–715.
Orelien JG, Edwards LJ (2008). Fixed-effect variable selection in linear mixed models using R² statistics. Comput Stat Data Anal 52: 1896–1907.
Piepho HP, Möhring J (2007). Computing heritability and selection response from unbalanced plant breeding trials. Genetics 177: 1881–1888.
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904–909.
Pritchard JK, Stephens M, Donnelly P (2000). Inference of population structure using multilocus genotype data. Genetics 155: 945–959.
Ritland K (1996). Estimators for pairwise relatedness and individual inbreeding coefficients. Genet Res 67: 175–186.
Schabenberger O, Pierce FJ (2002). Contemporary Statistical Models for the Plant and Soil Sciences. CRC Press: Boca Raton, FL.
Schwarz G (1978). Estimating the dimension of a model. Ann Stat 6: 461–464.
Sillanpaa MJ, Corander J (2002). Model choice in gene mapping: what and why. Trends Genet 18: 301–307.
Vonesh EF, Chinchilli VM (1997). Linear and Nonlinear Models for the Analysis of Repeated Measures. Marcel Dekker: New York.
Vonesh EF, Chinchilli VM, Pu K (1996). Goodness-of-fit in generalized nonlinear mixed-effects models. Biometrics 52: 572–587.
Weber A, Clark RM, Vaughn L, Sanchez-Gonzalez Jde J, Yu J, Yandell BS et al. (2007). Major regulatory genes in maize contribute to standing variation in teosinte (Zea mays ssp. parviglumis). Genetics 177: 2349–2359.
Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M et al. (2008). Genome-wide association analysis identifies 20 loci that influence adult height. Nat Genet 40: 575–583.
Weir BS, Anderson AD, Hepler AB (2006). Genetic relatedness analysis: modern data and new challenges. Nat Rev Genet 7: 771–780.
Xu R (2003). Measuring explained variation in linear mixed effects models. Stat Med 22: 3527–3541.
Yi N, Yandell BS, Churchill GA, Allison DB, Eisen EJ, Pomp D (2005). Bayesian model selection for genome-wide epistatic quantitative trait loci analysis. Genetics 170: 1333–1344.
Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF et al. (2006). A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet 38: 203–208.
Yu J, Zhang Z, Zhu C, Tabanao DA, Pressoir G, Tuinstra MR et al. (2009). Simulation appraisal of the adequacy of number of background markers for relationship estimation in association mapping. Plant Genome 2: 63–77.
Zhao K, Aranzana MJ, Kim S, Lister C, Shindo C, Tang C et al. (2007). An Arabidopsis example of association mapping in structured samples. PLoS Genet 3: e4.
Zheng B (2000). Summarizing the goodness of fit of generalized linear models for longitudinal data. Stat Med 19: 1265–1275.
Zhu C, Gore MA, Buckler ES, Yu J (2008). Status and prospects of association mapping in plants. Plant Genome 1: 5–20.
Zhu C, Yu J (2009). Nonmetric multidimensional scaling corrects for population structure in whole genome association studies. Genetics 182: 875–888.

Acknowledgements

This project is supported by the National Research Initiative (NRI) Plant Genome Program of the USDA Cooperative State Research, Education and Extension Service (CSREES) (2006-03578), the National Science Foundation (DBI-0820610), and the Targeted Excellence Program of Kansas State University. Hans-Peter Piepho is supported by the German Federal Ministry of Education and Research (BMBF) within the AgroClustEr ‘Synbreed—Synergistic plant and animal breeding’.

Author information

G Sun and C Zhu: These authors contributed equally to this work.

Authors and Affiliations

Department of Agronomy, Kansas State University, Manhattan, KS, USA
G Sun, C Zhu & J Yu
USDA-ARS, Beltsville, MD, USA
M H Kramer
Department of Statistics, Kansas State University, Manhattan, KS, USA
S-S Yang & W Song
Institute of Crop Production and Grassland Research, Bioinformatics Unit, University of Hohenheim, Stuttgart, Germany
H-P Piepho

Corresponding author

Correspondence to J Yu.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

About this article

Cite this article

Sun, G., Zhu, C., Kramer, M. et al. Variation explained in mixed-model association mapping. Heredity 105, 333–340 (2010). https://doi.org/10.1038/hdy.2010.11

Received: 13 October 2009
Revised: 22 December 2009
Accepted: 15 January 2010
Published: 10 February 2010
Issue Date: October 2010
DOI: https://doi.org/10.1038/hdy.2010.11
