Improving power and accuracy of genome-wide association studies via a multi-locus mixed linear model methodology

Wang, Shi-Bo; Feng, Jian-Ying; Ren, Wen-Long; Huang, Bo; Zhou, Ling; Wen, Yang-Jun; Zhang, Jin; Dunwell, Jim M.; Xu, Shizhong; Zhang, Yuan-Ming

doi:10.1038/srep19444

Download PDF

Article
Open access
Published: 20 January 2016

Improving power and accuracy of genome-wide association studies via a multi-locus mixed linear model methodology

Shi-Bo Wang^1,2,
Jian-Ying Feng¹,
Wen-Long Ren¹,
Bo Huang¹,
Ling Zhou¹,
Yang-Jun Wen¹,
Jin Zhang¹,
Jim M. Dunwell³,
Shizhong Xu⁴ &
…
Yuan-Ming Zhang^1,2

Scientific Reports volume 6, Article number: 19444 (2016) Cite this article

13k Accesses
285 Citations
13 Altmetric
Metrics details

Subjects

Abstract

Genome-wide association studies (GWAS) have been widely used in genetic dissection of complex traits. However, common methods are all based on a fixed-SNP-effect mixed linear model (MLM) and single marker analysis, such as efficient mixed model analysis (EMMA). These methods require Bonferroni correction for multiple tests, which often is too conservative when the number of markers is extremely large. To address this concern, we proposed a random-SNP-effect MLM (RMLM) and a multi-locus RMLM (MRMLM) for GWAS. The RMLM simply treats the SNP-effect as random, but it allows a modified Bonferroni correction to be used to calculate the threshold p value for significance tests. The MRMLM is a multi-locus model including markers selected from the RMLM method with a less stringent selection criterion. Due to the multi-locus nature, no multiple test correction is needed. Simulation studies show that the MRMLM is more powerful in QTN detection and more accurate in QTN effect estimation than the RMLM, which in turn is more powerful and accurate than the EMMA. To demonstrate the new methods, we analyzed six flowering time related traits in Arabidopsis thaliana and detected more genes than previous reported using the EMMA. Therefore, the MRMLM provides an alternative for multi-locus GWAS.

Robustification of GWAS to explore effective SNPs addressing the challenges of hidden population stratification and polygenic effects

Article Open access 22 June 2021

Zobaer Akond, Md. Asif Ahsan, … Md. Nurul Haque Mollah

Assessment of two statistical approaches for variance genome-wide association studies in plants

Article 10 May 2022

Matthew D. Murphy, Samuel B. Fernandes, … Alexander E. Lipka

Leveraging information between multiple population groups and traits improves fine-mapping resolution

Article Open access 10 November 2023

Feng Zhou, Opeyemi Soremekun, … Jennifer L. Asimit

Introduction

Genome-wide association studies (GWAS) have been widely used in genetic dissection of complex traits, especially with the development of advanced genomic sequencing technologies. The mixed linear model (MLM) method^1,2 by fitting a population structure (Q) and a polygene (K), the so called Q + K model, is the most popular method used for GWAS. After the MLM of Yu et al.² was published, many advanced MLM-based methods have been proposed^3,4,5,6,7, primarily to improve computational efficiency. A common feature of the MLM based GWAS is the one-dimentional genome scan by testing one marker at a time. The major advantage of such a genome scanning approach is the ability to handle a large number of markers, e.g., more than a million markers. However, such a model does not facilitate good estimates of marker effects because the model is never correct if a trait is indeed controlled by multiple loci, which is the case for most complex traits. Another problem with the method is the issue of multiple test correction for the threshold value of significance test. The typical Bonferroni correction is often too conservative, so that many important loci may not pass the stringent criterion of significance test.

Most complex traits are controlled by several major loci plus numerious undetectable loci with small effects (collectively called polygenes). The one-dimentional scanning GWAS will never recover the true model due to the intrinsic limitation of the model. Multi-locus models are better alternative methods for GWAS; these include Bayesian LASSO⁸, penalized Logistic regression^9,10, Elastic-Net¹¹ and empirical Bayes¹² methods. An obvious advantage of these methods is that no Bonferroni correction is required because of the multi-locus nature. Although these methods are shrinkage approaches and are supposed to be able to handle the number of markers several times larger than the sample sizes, they will fail when the number of markers is many times larger than the sample size, either due to constraint in computational time or limit in memory allocation. These models will also face the multicollinearity issue when the marker density is extremely high. Recently, Segura et al.¹³ has proposed a multi-locus MLM approach. However, further refinement is needed.

If the number of markers is small or moderately large and can be handled by one of the multi-locus approaches, we should consider a multi-locus method for GWAS, otherwise, a combination of the single locus genome scanning and multiple locus analysis may be considered. In the first stage, markers are scanned and selected with a low criterion of significance test. In the second stage, a multiple locus method is implemented for markers that have passed the initial screening. Statistical tests and marker effect estimation are then based on the multiple locus model. The MLM method of GWAS in the initial scanning stage treats marker effects as fixed. Goddard et al.¹⁴ proposed to treat marker effects as random, following a normal distribution with zero mean and an unknown variance. They described several advantages of the random model approach over the fixted model treatment. One advantage is that the random model approach will shrink the estimated (better called predicted) marker effects towards zero, leading to a maximum correlation between observed and predicted phenotypic values. However, Goddard et al.¹⁴ did not provide an efficient computational algorithm to estimate (or predict) marker effects.

In this study, we developed an efficient algorithm to estimate variances of the markers and predict effects of these markers. This method is called the random-SNP-effect mixed linear model (RMLM). The result of RMLM can either be treated as the final result of GWAS or used to select markers for the second stage analysis. In the second stage of GWAS, the selected markers are simultaneously evaluated in a single model using an EM empirical Bayes approach¹⁵. Estimation of marker effects and significance tests of these markers are performed in the second stage. This method is called the multi-locus random-SNP-effect mixed linear model (MRMLM). We demonstrate that this two-stage combined method of GWAS has significantly increased the statistical power and decreased Type 1 error compared with other methods, including the efficient mixed model analysis (EMMA).

Results

Statistical power for quantitative trait nucleotide (QTN) detection

To confirm the effectiveness of the MRMLM and RMLM methods, a series of Monte Carlo simulation experiments were carried out. Each sample was analyzed by the two new methods (MRMLM and RMLM) along the EMMA method. The significance threshold p value for the MRMLM method was 0.0002 (see Methods for calculation of this threshold). The corresponding threshold p value for the RMLM method was (modified Bonferroni correction for multiple tests), where is the effective number of markers (see Methods for calculation of m_e). The threshold p value for the EMMA method was (Bonferroni correction for multiple tests), where m is the total number of markers. For each QTN, power was defined as the proportion of samples where the QTN was detected (p value smaller than the designated threshold). In the first simulation experiment where no polygenic variance was simulated, the MRMLM method has the highest power for all six QTNs simulated, followed by the RMLM method and the EMMA method (Fig. 1a and Table S1). On one occasion (QTN number 5), the RMLM method is slightly more powerful than the MRMLM method. In the second simulation experiment when an additive polygenic variance ( and ) was added to the polygenic background, the same trend in power was observed where MRMLM is more powerful than RMLM and EMMA is the least powerful (Fig. 1b and Table S2). On one occasion (QTN number 4), the three methods have very similar power, with RMLM being slightly more powerful than the other two methods. In the third simulation experiement where three pairs of epistatic effects (collectively contributing 0.15 of the phenotypic variance) was added to the genetic background, again, MRMLM is the most powerful followed by RMLM and EMMA (Fig. 1c and Table S3) with an exception for QTN number 5 where RMLM is slightly more powerful than MRMLM. The sample sizes of the above three simulation experiments were all . We also changed the sample size from 199 to 149 and 99 under the fourth simulation experiment with the MRMLM method. The statistical powers are demonstrated in Fig. 1(d). As expected, the statistical power has declined as we reduced the sample size (Table S4). Similar trend of power changes were also observed for different numbers of markers (Table S5).

Accuracies of estimated QTN effects

We used the mean squared error (MSE) to measure the accuracy of an estimated QTN effect for a particular method. We evaluated the accuracies for all the six simulated QTNs from all three methods. The MSE’s are demonstrated in Fig. 2, where panels (a), (b) and (c) represent the results from the three simulation experiments, respectively. The MRMLM method is consistently more accurate than the RMLM method, which in turn is more accurate than the EMMA method (see Tables S1–S3). Figure 2(d) shows the results of different sample sizes by the MRMLM method from the fourth simulation experiment, showing that, as expected, a large sample size is associated with a small MSE (Table S4).

Type 1 error and ROC curve

The empirical Type 1 error rates of the three methods from the three simulation experiments are illustrated in Fig. 3. Overall, the three methods have similar Type 1 errors except the first simulation experiment where EMMA has an very large Type 1 error compared with the two new methods. In the second and third simulation experiments, EMMA has the least Type 1 errors followed by the MRMLM and RMLM methods. Fig. 3(d) shows the empirical Type 1 errors of the MRMLM method from the fourth simulation experiment with three different sample sizes (199, 149 and 99), where Type 1 error has been increased with a decreased sample size.

A useful way to compare different methods for their efficiencies in the detection of significant effects is the receiver operating characteristic (ROC) curve comparison. An ROC is a plot of the statistical power against the controlled Type 1 error. The higher the curve, the better the method. When sixty-one probability levels for significance, between 1E-8 to 1E-2, were inserted, the corresponding powers were calculated in the first simulation experiment. Figure 4 shows the comparison of the ROC curves from the three methods for each of the six QTNs simulated from the first simulation experiment. Clearly, the MRMLM method stands out way above the other two methods while the RMLM is better than the EMMA method when the Type 1 error is relaxed.

Computational efficiency

When performing GWAS on the simulated data, we first scanned the genome by the single-locus RMLM method to find the association between each SNP and the trait of interest. This process took 12.78 hours (Intel Core i5 CPU 4570, 3.20 GHz, Memory 8.00G, 1000 datasets) in the first simulation experiment. The MRMLM took an additional 0.51 hours to conduct the multi-locus analysis. Although the MRMLM method requires more computing time, the high power and small MSE relative to the RMLM method are good justifications for the improved method. The EMMA method took 68.77 hours for completing the analysis for the first simulation experiment.

Real data analyses in Arabidopsis

We analyzed six flowering time related traits of the Arabidopsis thaliana population published by Atwell et al.¹⁶ using all the three methods (MRMLM, RMLM and the EMMA). The numbers of SNPs significantly associated with the six traits are 29, 15, 27, 13, 22 and 14, respectively, for traits LD (days to flowering under long days), LDV (days to flowering under long days with vernalization), SD (days to flowering under short days), 0 W (days to flowering under long days with no vernalization), 2 W (days to flowering under long days for two week vernalization) and 4 W (days to flowering under long days for four week vernalization), from the MRMLM method. The corresponding numbers of associated SNPs are 8, 5, 3, 6, 6 and 7 from the RMLM method. The EMMA method only detected 1, 3, 1, 0, 1 and 2 SNPs for the above six traits (see Table S6 for details of the associated SNPs). These significantly associated SNPs for each trait were used to conduct a multiple linear regression analysis and the corresponding Bayesian information criteria (BIC) were calculated. The MRMLM method shows the lowest BIC values for all traits (Table 1), indicating that SNPs detected by the MRMLM method fit the data better than the other methods.

Table 1 Goodness of fit (log likelihood and BIC) for SNPs detected by three methods (MRMLM, RMLM and EMMA), where a lower value indicates a better fit.

Full size table

We found that 6, 4, 6, 2, 3 and 5 genes previously reported to be associated with the six traits are in the proximity of the SNPs detected by the MRMLM method. The corresponding numbers of genes in the vicinity of the SNPs detected by the RMLM method are 3, 3, 2, 1, 1 and 2, respectively, for the six traits. Only 2, 2, 1, 0, 0 and 1 genes are in the neighborhood of the SNPs detected by the EMMA method (see Table 2 and Table S7 for details of the genes). Clearly, the MRMLM method detected more known genes than the other two methods, indicating that this multi-locus model (MRMLM) has a higher power for QTN detection than the single-locus model (RMLM) and the EMMA method.

Table 2 Genes detected for six flowering time traits in Arabidopsis thaliana using three methods (MRMLM, RMLM and EMMA).

Full size table

Discussion

To reduce computing time required for GWAS, Zhang et al.⁶ proposed a P3D algorithm that fixes the polygenic-to-residual variance ratio in the genome-wide scanning step. Kang et al.³ used a matrix transformation prior to the genome-wide scanning stage and treated the scanned SNP effect as fixed. If we view the SNP effect as random, one additional variance of the QTN effect needs to be estimated and the complexity and computing time in parameter estimation has been increased, as shown with the MLM-based approaches of Zhang et al.¹ and Yu et al.². In the present study, a new matrix transformation is constructed, the P3D algorithm is adopted and the residual variance is estimated after the variance of the QTN effect is estimated. Therefore, only one parameter, the ratio of the QTN effect variance to the residual variance, is estimated in the genome-wide scanning stage. In doing so, the MRMLM method requires only 20% of the computing time needed by the EMMA method. More importantly, the new method performs better than EMMA in terms of high statistical power, low Type 1 error and low MSE of an estimated QTN effect.

The current GWAS method is a single-locus analysis approach under polygenic background and population structure controls. The number of tests involved is the number of markers, requiring a Bonferroni correction for multiple tests. To control the experimental error at a genome-wide level of 0.05, the significance level for each test should be adjusted as , which is 5E-8 if one million markers are to be scanned. In the multi-locus model, however, there is no need for such a multiple test correction due to the multi-locus and shrinkage natures of the new method. This conclusion is also supported by the results of Monte Carlo simulation studies. We compared the result of EMMA in this study with the result reported in Atwell et al.¹⁶; fewer known genes are listed in Table 2, because some genes identified in previous studies are not significantly associated with the traits after the Bonferroni correction. If the significance level was changed to a less stringent criterion, more known genes would have been found (Table S7). We investigated the effect of the critical value on the selection of putative QTNs. Similar results were observed for the three critical values selected (0.001, 0.01 and 0.05), although the 0.01 value resulted in the marginally best performance in terms of statistical power of QTN detection and accuracy of QTN effect estimation (Table S8).

There are several multi-locus GWAS approaches already published in the literature^5,13,17. When the number of markers is not large, all marker effects and their interactions can be included in a single model, such as the empirical Bayes method¹². If the number of markers is large, this single model approach is not feasible. One question is how to reduce the number of parameters in a full genetic model. Zhou et al.⁵ developed a Bayesian sparse linear mixed model and Moser et al.¹⁷ proposed a Bayesian mixture model. Under these models, two to four common components in the mixture distribution were considered and only a few variance components were estimated. Although about 500 effects in the genetic model are finally considered after several rounds of Gibbs sampling, the computing time becomes a major concern for these Bayesian approaches. Therefore, the ideal method is to delete spurious QTN effects prior to implementing the multi-locus model. The first step of MRMLM is RMLM, which deletes the majority of the markers in advance so that only a small set of markers are left to the second stage for evaluation. The MRMLM method differs from the multi-locus method of Segura et al.¹³ in several aspects. First, the SNP effects are viewed as random in the MRMLM method while they are treated as fixed effects in Segura et al.¹³. Secondly, we adopted a simple matrix transformation technique to improve the computational efficiency while Segura et al.¹³ implemented an algorithm involving three complicated treatments. Finally, the MRMLM method uses one set of selected SNPs, which have p values less than 0.01 in the initial scanning while Segura et al.¹³ requires MCMC samplings.

Atwell et al.¹⁶ listed 500 most significantly associated SNPs, although some of them were not significant at the criterion. In the neighborhood of these SNPs, some genes were found to be related to the traits of interest (Table 2 and S7). In this study, 21 genes for six flowering time traits are found to be in the vicinity of the detected SNPs, consistent with results previously reported, as shown in the database ( http://www.arabidopsis.org/), the work of Atwell et al.¹⁶ and related references^{18,19,20,21,22,23,24} (Table S7). Therefore, the Arabidopsis thaliana GWAS results of his study appear to be reliable.

In the study of GWAS methodology, real genotypes in natural population are frequently used to conduct Monte Carlo simulation studies^1,2,6. In this study, therefore, the real SNP dataset in Atwell et al.¹⁶ was adopted in the simulation studies. To further confirm the new methods, 200 samples with simulated genotypes derived from the minPtest R package²⁵ were analyzed. As a result, similar results were found (Table S9).

Conclusion

The RMLM simply treats the SNP effect as random and includes new matrix transformation, fixing the polygenic-to-residual variance ratio and estimating residual variance after the variance of QTN effect is obtained. Meanwhile, it allows a modified Bonferroni correction to be used to calculate the threshold p value for significance tests. The MRMLM is a multi-locus model including markers selected from the RMLM and all the effects in the model are estimated by an EM empirical Bayes method. Results from real data analyses and simulation studies show that the MRMLM has the highest power for QTN detection, the best fit for genetic model, the minimal bias in the estimation of the QTN effect and the strongest robustness, as compared with the RMLM and the EMMA.

Methods

Random effect linear mixed model (RMLM)

Let y be a vector of phenotypic values for all individuals. The mixed linear model is

where X is an incident matrix for fixed (non genetic) effects, is a vector of the fixed effects, is a vector of genotype indicators for the kth SNP that are coded as 1, 0 and −1, for one of the two homozygotes, the heterozygote and the other homozygote, respectively, is the effect of marker k with an assumed normal distribution of mean zero and variance , is a vector of polygenic effects with a multivariate normal distribution of mean zero and variance described by a covariance structure K, ε is a vector of residual errors with a distribution and is the residual variance. The expectation of y is and the variance is

where and are variance ratios and .Various methods of inferring a kinship matrix have been proposed. In this study we used a marker-inferred kinship matrix²⁶ defined as

In the single-locus RMLM, the polygenic variance ratio λ is only estimated once under a pure polygenic model (the null model) prior to the marker scanning stage. The estimated variance ratio, , is then treated as a constant when markers are scanned. This approach has been called GWAS with population parameters previously defined (P3D)⁶. The original P3D was implemented when was treated as a fixed effect. In this study, is treated as a random effect, which presents a great challenge in computation. However, we adopted a new algorithm to ease the computation, as described in the following paragraph.

Let us perform eigen decomposition for K so that , where is a diagonal matrix for the eigenvalues and U is an n × n matrix for the eigenvectors. Let , and be transformed variables so that

The expectation of is . The variance-covariance matrix of is

where is a known diagonal matrix. Let be the general covariance structure. After absorbing and σ², we have the following profiled restricted log likelihood function,

where q is the rank of matrix X and

This likelihood function contains only one unknown parameter λ_k. The Newton algorithm for λ_k is

Once the iteration process converges, the solution is a single-locus RMLM estimate of λ_k, denoted by . Note that the likelihood function involves and , which are very expensive to compute. However, the special structure of allows us to implement the Woodbury matrix identities²⁷ for calculating and . As a result, the random model approach does not present a substantial increase in computational time.

Given , the estimates of and σ² are

The best linear unbiased prediction (BLUP) of γ_k is also the conditional expectation of γ_k given and has the following expression,

The conditional variance is

Under the single-locus RMLM approach, we first estimate and then fix it at to estimate λ_k and scan all markers by testing λ_k = 0 for each SNP. The null hypothesis test for or in the single-locus RMLM is implemented using a Wald test,

The p-value of this Wald test is calculated using

where is a Chi-square variable with one degree of freedom.

Because the estimated marker effects under the random model are shrunken towards zero, we are able to use a modified Bonferroni correction to find the threshold p value for genome-wide significance tests²⁸. This modified Bonferroni correction uses an effective number of markers to adjust for multiple tests so that the threshold p value is , where

is the effective number of markers.

Multi-locus random effect mixed linear model (MRMLM)

The single marker RMLM method described above can also be considered as an intial screening step for a new multi-locus random effect mixed linear model (MRMLM) that is described here. We use a less stringent criterion for the initial stage screening from RMLM for all markers that have p values smaller than 0.01. In addition, consecutive markers passing the 0.01 threshold around an already selected marker (±20 kb for real data analysis and ± 1 kb for simulated data analysis) are eliminated to reduce collinearity among selected markers. Only these selected markers are included in the muti-locus model for further evaluation, including estimation of marker effects and significane tests. Due to the shrinkage nature, the majority of markers will be eliminated in the intitial screening. Therefore, the number of markers left in the second stage analysis is often a small subset of all markers, say a few hundred or a few thousand at most. Among the remaining markers, all those that passed the modified Bonferroni correction are used to conduct a likelihood ratio test (LRT) and the others are treated as random. If the LOD score for one marker in the LRT is more than 1.50, this marker is treated as fixed, or it is viewed as random. This small number of surviving markers are then included in a single multi-locus model. We propose to use the EM empirical Bayes (EMEB) method¹⁵ because this method also provides a significance test for each marker (likelihood ratio test), while the LASSO method does not have a default method to perform such a test. The EMEB method is also a random model approach because each random marker effect is assigned an empirical distribution with a variance. Because the model is multi-locus in nature, there is no requirement for Bonferroni correction. Therefore, the original 0.05 threshold may be used for significance test. Considering that all markers are selected in the first stage, we decided to place a slightly more stringent criterion of 0.0002, which is converted from LOD score 3.0 of the test statistics using , where is converted from LOD 3 to its corresponding likelihood ratio test, which, under the null hypothesis, follows a Chi-square distribution with one degree of freedom.

Efficient mixed model analysis (EMMA)

This is an existing method for GWAS³ and used as a gold standard for comparison. This method is the fixed model version of the original MLM, in which was treated as a fixed effect with no distribution assigned. The method was implemented in the R software package EMMA ( http://mouse.cs.ucla.edu/emma/). The threshold of p value was set as after Bonferroni correction for multiple tests, where m is the number of markers.

Simulation experiments

In the first four simulation experiments, all the SNP genotypes were derived from the 216130 SNPs in Atwell et al.¹⁶. All the SNPs between 11226256 and 12038776 bp on Chr. 1, between 5045828 and 6412875 bp on Chr. 2, between 1916588 and 3196442 bp on Chr. 3, between 2232796 and 3143893 bp on Chr. 4 and between 19999868 and 21039406 bp on Chr. 5 were used to conduct simulation studies. The sample size was the number of individuals in Atwell et al.¹⁶, namely 199. In the first simulation experiment, six QTNs were simulated and placed on the SNPs with allelic frequencies of 0.30; their heritabilities were set as 0.10, 0.05, 0.05, 0.15, 0.05 and 0.05, respectively; and their positions and effects are listed on Table S1. The average was set at 10.0; and residual variance was set at 10.0. Empirical statistical power for each QTN was calculated as the proportion of samples in which the p value is smaller than the designated threshold p value. A QTN detected within 1 kb of the simulated QTN was considered a true QTN. Empirical Type 1 error for each method was defined as the proportion of significant markers (excluding the markers overlapping with the six QTNs) over all markers with zero effects. In addition to power and Type 1 error, we also evaluated the mean square error (MSE) for each of the six simulated QTNs. For the ith QTN for , the MSE is defined as

where is the estimated effect of QTN i from the jth sample and in the true effect of QTN i. A method with a small MSE is generally more preferable than a method with a large MSE.

To investigate the effect of the polygenic (small effect genes) background on the MRMLM and RMLM methods, the polygenic effect was simulated by multivariate normal distribution , where is the polygenic variance and is the kinship coefficient matrix between a pair of individuals. Here , so . The QTN size (h²), average, residual variance and other values were the same as those in the first simulation experiment.

To investigate the effect of epistatic background on the MRMLM and RMLM methods, three epistatic QTNs each with and were simulated. The first one was placed between 3063784 bp on Chr. 4 and 5227063 bp on Chr. 2; the second one was placed between 5986135 bp on Chr. 2 and 2031781 bp on Chr. 3; and the third one was placed between 2668059 bp on Chr. 3 and 11824678 bp on Chr. 1. The QTN sizes (h²), average, residual variance and other values were also the same as those in the first simulation experiment.

The Arabidopsis thaliana data

We also analyzed the well-known Arabidopsis thaliana data published by Atwell et al.¹⁶. The data contain accessions with genotyped SNPs. Six flowering time related quantitative traits were analyzed using all the three methods (MRMLM, RMLM and EMMA). The six traits are: LD, LDV, SD, 0 W, 2 W and 4 W. These data were downloaded from the following website: http://www.arabidopsis.usc.edu/. We developed our own software to implement the data analysis (see Software S1).

Additional Information

How to cite this article: Wang, S.-B. et al. Improving power and accuracy of genome-wide association studies via a multi-locus mixed linear model methodology. Sci. Rep. 6, 19444; doi: 10.1038/srep19444 (2016).

References

Zhang, Y.M. et al. Mapping quantitative trait loci using naturally occurring genetic variance among commercial inbred lines of maize (Zea mays L.). Genetics 169, 2267– 2275 (2005).
Article CAS Google Scholar
Yu, J. et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38, 203–208 (2006).
Article CAS Google Scholar
Kang, H.M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709–1723 (2008).
Article Google Scholar
Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).
Article CAS Google Scholar
Zhou, X., Carbonetto, P. & Stephens, M. Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genet. 9, e1003264 (2013).
Article CAS Google Scholar
Zhang, Z. et al. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 42, 355–360 (2010).
Article CAS Google Scholar
Li, M. et al. Enrichment of statistical power for genome-wide association studies. BMC Biology 12, 73 (2014).
Article CAS ADS Google Scholar
Yi, N. & Xu, S. Bayesian LASSO for quantitative trait loci mapping. Genetics 179, 1045–1055 (2008).
Article CAS Google Scholar
Hoggart, C.J., Whittaker, J.C., De Iorio, M. & Balding, D.J. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet. 4(7), e1000130 (2008).
Article Google Scholar
Ayers, K.L. & Cordell, H.J. SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genetic Epidemiology 34, 879–891 (2010).
Article Google Scholar
Cho, S. et al. Joint identification of multiple genetic variants via Elastic-Net variable selection in a genome-wide association analysis. Annals of Human Genetics 74, 416–428 (2010).
Article Google Scholar
Lü, H.-Y., Liu, X.-F., Wei, S.-P. & Zhang, Y.-M. Epistatic association mapping in homozygous crop cultivars. PLoS ONE 6, e17773 (2011).
Article ADS Google Scholar
Segura, V. et al. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat. Genet. 44, 825–830 (2012).
Article CAS Google Scholar
Goddard, M.E., Wray, N.R., Verbyla, K. & Visscher, P.M. Estimating effects and making predictions from genome-wide marker data. Stat. Sci. 24, 517–529 (2009).
Article MathSciNet Google Scholar
Xu, S. An expectation-maximization algorithm for the Lasso estimation of quantitative trait locus effects. Heredity 105, 483–494 (2010).
Article CAS Google Scholar
Atwell, S. et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465, 627–631 (2010).
Article CAS ADS Google Scholar
Moser, G. et al. Simultaneous discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model. PLoS Genet. 11, e1004969 (2015).
Article Google Scholar
Hartmann, U. et al. Molecular cloning of SVP: A negative regulator of the floral transition in Arabidopsis. Plant J. 21(4), 351–360 (2000).
Article CAS Google Scholar
Han, P., García-Ponce, B., Fonseca-Salazar, G., Alvarez-Buylla, E.R. & Yu, H. AGAMOUS-LIKE 17, a novel flowering promoter, acts in a FT-independent photoperiod pathway. Plant J. 55(2), 253–265 (2008).
Article CAS Google Scholar
Tominaga, R. et al. Arabidopsis CAPRICE-LIKE MYB 3 (CPL3) controls endoreduplication and flowering development in addition to trichome and root hair formation. Development 135, 1335–1345 (2008).
Article CAS Google Scholar
Reeves, P. & Coupland, G. Analysis of flowering time control in Arabidopsis by comparison of double and triple mutants. Plant Physiol. 126, 1085–1091 (2001).
Article CAS Google Scholar
Hepworth, S., Valverde, F., Ravenscroft, D., Mouradov, A. & Coupland, G. Antagonistic regulation of flowering-time gene SOC1 by CONSTANS and FLC via separate promoter motifs. EMBO J 21, 4327–4337 (2002).
Article CAS Google Scholar
Portolés, S. & Más, P. Altered oscillator function affects clock resonance and is responsible for the reduced day-length sensitivity of CKB4 overexpressing plants. Plant J. 51(6), 966–977 (2007).
Article Google Scholar
Gu, X., Wang, Y. & He, Y. Photoperiodic regulation of flowering time through periodic histone deacetylation of the florigen gene FT. PLoS Biology 11(9), e1001649 (2013).
Article CAS Google Scholar
Hieke, S., Binder, H., Nieters, A. & Schumacher, M. minPtest: a resampling based gene region-level testing procedure for genetic case-control studies. Computational Statistics 29(1-2), 51–63 (2014).
Article MathSciNet Google Scholar
Xu, S. Mapping quantitative trait loci by controlling polygenic background effects. Genetics 195, 1209–1222 (2013).
Article CAS Google Scholar
Golub, G.H. & van Loan, C.F. Matrix computations (3rd, Ed) . Baltimore and London: The Johns Hopkins University Press, 1996.
Wang, Q., Wei, J., Pan, Y. & Xu S. An efficient empirical Bayes method for genomewide association studies. J. Anim. Breed. Genet., published online: 19 NOV 2015 10.1111/jbg.12191 (2015).

Download references

Acknowledgements

This work was supported by funding from the National Natural Science Foundation of China (31571268, 31301004) and Huazhong Agricultural University Scientific and Technological Self-innovation Foundation (Program No. 2014RC020) to Y.-M.Z. The project was also supported by the United States National Science Foundation Grant 005400 to S.X.

Author information

Authors and Affiliations

State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, 210095, China
Shi-Bo Wang, Jian-Ying Feng, Wen-Long Ren, Bo Huang, Ling Zhou, Yang-Jun Wen, Jin Zhang & Yuan-Ming Zhang
Statistical Genomics Lab, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, 430070, China
Shi-Bo Wang & Yuan-Ming Zhang
School of Agriculture, Policy and Development, University of Reading, Reading, RG6 6AR, United Kingdom
Jim M. Dunwell
Department of Botany and Plant Science, University of California, Riverside, 20705, USA
Shizhong Xu

Authors

Shi-Bo Wang
View author publications
You can also search for this author in PubMed Google Scholar
Jian-Ying Feng
View author publications
You can also search for this author in PubMed Google Scholar
Wen-Long Ren
View author publications
You can also search for this author in PubMed Google Scholar
Bo Huang
View author publications
You can also search for this author in PubMed Google Scholar
Ling Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Yang-Jun Wen
View author publications
You can also search for this author in PubMed Google Scholar
Jin Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Jim M. Dunwell
View author publications
You can also search for this author in PubMed Google Scholar
Shizhong Xu
View author publications
You can also search for this author in PubMed Google Scholar
Yuan-Ming Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Y.-M.Z. and S.X. conceived and supervised the study and wrote the manuscript. S.-B.W., J.-Y.F., W.-L.R., L.Z., Y.-J.W. and J.Z. performed the experiments and analyzed the data. W.-L.R. and B.H. wrote the Windows software. J.M.D. improved the language within the manuscript. All authors reviewed the manuscript.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Electronic supplementary material

Supplementary Information

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

Reprints and permissions

About this article

Cite this article

Wang, SB., Feng, JY., Ren, WL. et al. Improving power and accuracy of genome-wide association studies via a multi-locus mixed linear model methodology. Sci Rep 6, 19444 (2016). https://doi.org/10.1038/srep19444

Download citation

Received: 30 May 2015
Accepted: 14 December 2015
Published: 20 January 2016
DOI: https://doi.org/10.1038/srep19444

This article is cited by

Unveiling the predominance of Saccharum spontaneum alleles for resistance to orange rust in sugarcane using genome-wide association
- Jordan Dijoux
- Simon Rio
- Jean-Yves Hoarau
Theoretical and Applied Genetics (2024)
Multi-locus genome-wide association study and genomic prediction for flowering time in chrysanthemum
- Jiangshuo Su
- Zhaowen Lu
- Fadi Chen
Planta (2024)
Multi-locus genome-wide association studies reveal the dynamic genetic architecture of flowering time in chrysanthemum
- Jiangshuo Su
- Junwei Zeng
- Fadi Chen
Plant Cell Reports (2024)
Association mapping in multiple yam species (Dioscorea spp.) of quantitative trait loci for yield-related traits
- I.I. Adejumobi
- Paterne A. Agre
- D.O. Onautshu
BMC Plant Biology (2023)
E-GWAS: an ensemble-like GWAS strategy that provides effective control over false positive rates without decreasing true positives
- Guang-Liang Zhou
- Fang-Jun Xu
- Meng-Jin Zhu
Genetics Selection Evolution (2023)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.