Introduction

Most complex traits in human, plant and animal genetics are quantitative traits, which are controlled by multiple quantitative trait loci (QTLs). These loci are usually identified by QTL mapping or genome-wide association study (GWAS). Thanks to the rapid development of sequencing and genotyping technologies, a large number of single-nucleotide polymorphisms (SNPs) can now be obtained easily. If all the SNPs are included in one genetic model, however, the number of SNPs is much larger than the sample size, and commonly used methods are infeasible for such an oversaturated model.

Many approaches have been proposed to estimate the parameters in the oversaturated model, including ridge regression (Hoerl and Kennard, 1970), stochastic search variable selection (George and McCulloch, 1993; Yi et al., 2003), Bayesian shrinkage estimation (Meuwissen et al., 2001; Wang et al., 2005), penalized maximum likelihood (Zhang and Xu, 2005; Hoggart et al., 2008; Zhang et al., 2012), empirical Bayes (Xu, 2010) and Bayesian-LASSO (Bayesian-least absolute shrinkage and selection operator; Park and Casella, 2008; Yi and Xu, 2008). However, these methods were mainly proposed for linkage analysis in biparental segregating populations rather than for GWAS in natural populations.

GWAS has been used to dissect the genetic foundation of quantitative traits (Zhang et al., 2005, 2010; Yu et al., 2006; Kang et al., 2008; Zhou and Stephens, 2012; Wang et al., 2016). Widely used approaches, such as efficient mixed model association (EMMA; Kang et al., 2008; Zhou and Stephens, 2012), were proposed for single-marker analysis with control of population structure and polygenic background. However, such single-marker methods have relatively low power in detecting small-effect QTLs. To overcome this problem, multilocus methods have been suggested (Fridley et al., 2010; Lü et al., 2011), for example, a Bayesian-inspired penalized maximum likelihood approach (Zhang and Xu, 2005; Hoggart et al., 2008) and PUMA (Penalized Unified Multiple-locus Association; Hoffman et al., 2013). These methods can be used if the number of variables in the multilocus model is not too large. Recent strategies for high-dimensional modeling have focused on reducing the dimension of a large matrix and then selecting the most potentially associated SNPs by using shrinkage methods such as the LASSO and the SCAD (smoothly clipped absolute deviation) penalty (Fan and Lv, 2008; Wu et al., 2009). Although other multilocus approaches have also been proposed by Segura et al. (2012), Moser et al. (2015), Liu et al. (2016), Wang et al. (2016) and Wen et al. (2017), further refinement and study are still needed.

In this study, we integrated the least angle regression (LARS) algorithm with empirical Bayes to perform multilocus GWAS for quantitative traits, as the LARS algorithm makes the LASSO (Tibshirani, 1996) efficient and acceptable (Efron et al., 2004). To control polygenic background, we adopted the model transformation of Wen et al. (2017), which whitens the combined covariance matrix of the polygenic effects (determined by the kinship matrix K) and the residual noise. The LARS algorithm was implemented on the transformed model to select the SNPs that are most potentially associated with the trait, empirical Bayes was used to estimate the effects of all the selected SNPs, and all the nonzero effects were further examined by a likelihood ratio test to confirm true quantitative trait nucleotides (QTNs). We refer to this method as pLARmEB (polygene-background-control-based least angle regression plus empirical Bayes). pLARmEB was validated by analyzing data sets from a series of Monte Carlo simulation experiments and seven Arabidopsis flowering time traits. We also discuss the possibility of applying pLARmEB to linkage analysis.

Materials and methods

Genetic model

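Let $y_i$ ($i=1,\dots,n$) be the phenotypic value of the $i$th individual in a sample of size $n$ from a natural population. The genetic model is expressed by

$$y = 1\mu + W\alpha + Z\gamma + u + \varepsilon \tag{1}$$

where $y=(y_1,\dots,y_n)^T$; $1$ is an $n\times 1$ vector of ones and $\mu$ is the total average; $\alpha$ is the vector of population structure effects, treated as fixed; $\gamma\sim MVN_m(0,\Sigma_\gamma)$ is the vector of QTN effects, treated as random, and $m$ is the number of putative QTNs; $W$ and $Z$ are the corresponding design matrices for $\alpha$ and $\gamma$; $u\sim MVN_n(0,\sigma_g^2K)$ is an $n\times 1$ random vector of polygenic effects, $\sigma_g^2$ is the polygenic variance and $K$ is a known $n\times n$ relatedness matrix; and $\varepsilon$ is the residual error with an assumed $MVN_n(0,\sigma^2I_n)$ distribution, where $\sigma^2$ is the residual error variance and $I_n$ is an $n\times n$ identity matrix.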

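As $\gamma$ is treated as random, the variance of $y$ in model (1) is

$$\operatorname{var}(y) = Z\Sigma_\gamma Z^T + \sigma_g^2K + \sigma^2I_n = \sigma^2H \tag{2}$$

where $\Sigma_\gamma=\operatorname{diag}\{\sigma_1^2,\dots,\sigma_m^2\}$, $\lambda_k=\sigma_k^2/\sigma^2$ ($k=1,\dots,m$), $\lambda_g=\sigma_g^2/\sigma^2$ is the ratio of the polygenic variance to the residual variance, and $H=Z\operatorname{diag}\{\lambda_1,\dots,\lambda_m\}Z^T+\lambda_gK+I_n$.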

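Using EMMA, we can obtain the estimate of $\lambda_g$, denoted by $\hat{\lambda}_g$. Let $B=\hat{\lambda}_gK+I_n$; the eigen (or spectral) decomposition of the positive semidefinite matrix $B$ is

$$B = Q_B\Lambda_BQ_B^T = (Q_1\ \ Q_2)\begin{pmatrix}\Lambda_r & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix}Q_1^T\\ Q_2^T\end{pmatrix} = Q_1\Lambda_rQ_1^T \tag{3}$$

where $Q_B$ is orthogonal, $\Lambda_r$ is a diagonal matrix with positive eigenvalues, $r=\operatorname{Rank}(B)$, $Q_1$ and $Q_2$ are the $n\times r$ and $n\times(n-r)$ block matrices of $Q_B$, respectively, and $0$ is the corresponding block zero matrix (Wen et al., 2017).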

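Let $C=Q_1\Lambda_r^{-1/2}Q_1^T$; model (1) is then changed to

$$y_c = 1_c\mu + W_c\alpha + Z_c\gamma + \varepsilon_c \tag{4}$$

where $y_c=Cy$, $1_c=C1$, $W_c=CW$, $Z_c=CZ$ and $\varepsilon_c=Cu+C\varepsilon\sim MVN_n(0,\sigma^2I_n)$ (Wen et al., 2017).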
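The whitening step can be checked numerically. The following is a minimal sketch, not the pLARmEB implementation: it assumes that C is built from the eigendecomposition of B as the inverse square root restricted to the positive eigenvalues, and it uses a toy kinship matrix and a made-up value for the variance ratio. The names `whitening_matrix`, `K` and `lam_g` are hypothetical.

```python
import numpy as np

def whitening_matrix(K, lam_g, tol=1e-10):
    """Return (C, B) with B = lam_g*K + I and C = Q1 diag(eig^-1/2) Q1^T.

    Hypothetical helper for illustration only.
    """
    n = K.shape[0]
    B = lam_g * K + np.eye(n)
    eigval, eigvec = np.linalg.eigh(B)      # B is symmetric
    keep = eigval > tol                     # keep positive eigenvalues only
    Q1 = eigvec[:, keep]
    C = Q1 @ np.diag(eigval[keep] ** -0.5) @ Q1.T
    return C, B

rng = np.random.default_rng(0)
G = rng.choice([-1.0, 1.0], size=(6, 50))   # toy genotypes: 6 lines, 50 SNPs
K = G @ G.T / G.shape[1]                    # toy kinship matrix (assumption)
C, B = whitening_matrix(K, lam_g=0.8)

# Covariance of C(u + e) in units of sigma^2; it should be the identity.
whitened_cov = C @ B @ C.T
```

Because the covariance of u + ε is proportional to B, the product `C @ B @ C.T` being the identity is what justifies treating the transformed residual as independent with common variance.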

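In model (4), let $\beta=(\alpha^T,\gamma^T)^T$ and $Y=y_c-1_c\mu$, which has a zero mean. Standardizing each column of the matrix $(W_c\ \ Z_c)$ produces a new matrix $X$ with $\sum_{i=1}^{n}x_{ij}=0$ and $\sum_{i=1}^{n}x_{ij}^2=1$ ($j=1,\dots,m$). Therefore, model (4) can be rewritten as

$$Y = X\beta + \varepsilon_c \tag{5}$$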
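The column standardization described here (zero sum and unit sum of squares) can be sketched as follows. The function name and the genotype coding are illustrative assumptions, not part of the original software.

```python
import math

def standardize_column(x):
    """Center a column and scale it so that its sum of squares is 1."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    norm = math.sqrt(sum(v * v for v in centered))
    if norm == 0.0:
        raise ValueError("a constant column cannot be standardized")
    return [v / norm for v in centered]

snp = [1, -1, 1, 1, -1, -1, 1, -1, 1, 1]   # hypothetical genotype codes
x = standardize_column(snp)
col_sum = sum(x)
col_ss = sum(v * v for v in x)
```

With every column on this scale, the inner product of a column with the response is directly comparable across SNPs, which is what the correlation-based selection in LARS relies on.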

Parameter estimation

LARS for the full model

LARS is a flexible method for variable selection that has been described previously (Efron et al., 2004). We used the LARS algorithm to select the n−1 variables that are most likely associated with the quantitative trait of interest.

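First, let $\hat{\beta}=0$, so

$$\hat{\mu} = X\hat{\beta} = 0.$$

Second, suppose that $\hat{\mu}_F$ is the current LARS estimate and that

$$\hat{c} = X^T(Y-\hat{\mu}_F)$$

is the vector of current correlations. The active set $F$ is the set of indices corresponding to covariates with the greatest absolute current correlations,

$$\hat{C} = \max_j\{|\hat{c}_j|\}\quad\text{and}\quad F=\{j: |\hat{c}_j|=\hat{C}\}.$$

Let $s_j=\operatorname{sign}\{\hat{c}_j\}$ for $j\in F$. We can calculate $X_F=(\cdots s_jx_j\cdots)_{j\in F}$, $u_F=X_F\omega_F$, $G_F=X_F^TX_F$ and $A_F=(1_F^TG_F^{-1}1_F)^{-1/2}$, where $\omega_F=A_FG_F^{-1}1_F$ and $1_F$ is a vector of ones of length $|F|$.

Third, update $\hat{\mu}_F$ in the LARS algorithm:

$$\hat{\mu}_{F+} = \hat{\mu}_F + \hat{\gamma}u_F$$

where the step length is $\hat{\gamma}=\min^{+}_{j\in F^c}\left\{\dfrac{\hat{C}-\hat{c}_j}{A_F-a_j},\ \dfrac{\hat{C}+\hat{c}_j}{A_F+a_j}\right\}$, $\min^{+}$ indicates that the minimum is taken over only the positive components within each choice of $j$ in the formula for $\hat{\gamma}$, and $a\equiv X^Tu_F$.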

Steps 2 and 3 are repeated until a convergence criterion is satisfied. The above algorithm was run with the lars package (http://cran.r-project.org/web/packages/lars/) in R.
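The three steps can be sketched in a few lines of NumPy. This is a simplified illustration of plain LARS as described by Efron et al. (2004), without the lasso modification, and it is not the lars package used in this study. The function name `lars_sketch`, the tolerance and the toy data are assumptions.

```python
import numpy as np

def lars_sketch(X, y, n_steps, tol=1e-10):
    """Plain LARS; returns the fitted mean and the active set of the last step."""
    n, m = X.shape
    mu = np.zeros(n)                          # step 1: start from mu = 0
    active = []
    for _ in range(n_steps):
        c = X.T @ (y - mu)                    # step 2: current correlations
        C = np.max(np.abs(c))
        if C < tol:
            break
        active = [j for j in range(m) if abs(c[j]) >= C - tol]
        s = np.sign(c[active])
        XF = X[:, active] * s                 # sign-adjusted active columns
        GF = XF.T @ XF
        g = np.linalg.solve(GF, np.ones(len(active)))   # G_F^{-1} 1_F
        AF = 1.0 / np.sqrt(g.sum())
        uF = XF @ (AF * g)                    # equiangular direction
        a = X.T @ uF
        steps = []
        for j in range(m):
            if j in active:
                continue
            for num, den in ((C - c[j], AF - a[j]), (C + c[j], AF + a[j])):
                if den > tol and num / den > tol:
                    steps.append(num / den)   # positive candidates only
        gamma = min(steps) if steps else C / AF
        mu = mu + gamma * uF                  # step 3: update the estimate
    return mu, active

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
X -= X.mean(axis=0)
X /= np.sqrt((X ** 2).sum(axis=0))            # zero mean, unit sum of squares
y = 3.0 * X[:, 3] - 2.0 * X[:, 7] + 0.1 * rng.standard_normal(200)
y -= y.mean()
mu_hat, selected = lars_sketch(X, y, n_steps=2)
top3 = np.sort(np.abs(X.T @ (y - mu_hat)))[-3:]
```

In this toy example the two columns with simulated effects enter the active set first, and after the second step a third covariate ties with them in absolute correlation, which is the defining property of the step length.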

Usually, if all the marker effects are included in one genetic model, the parameters cannot be estimated when m≫n, where n is the sample size and m is the number of variables. As most markers are not likely to be associated with the trait of interest, once the markers with zero effects are deleted from the full model, the marker effects of the reduced model are estimable. In each LARS variable selection, the n−1 SNPs that are most potentially associated with the trait are selected to construct the reduced model.

Empirical Bayes estimation in the reduced model

In the reduced model,

$$y = X\beta + Z\gamma + \varepsilon \tag{7}$$

where $y$ is the same as that in model (1); $\beta$ is a vector of fixed effects, $\gamma$ is a vector of random effects of the selected markers, and $X$ and $Z$ are the design matrices for $\beta$ and $\gamma$, respectively. All the parameters in model (7) were estimated by the empirical Bayes method proposed by Xu (2010).

The fixed effect β and residual variance σ2 were estimated by

where . The random effect γk of each marker and its prediction error var(γk) were predicted by best linear unbiased prediction:

where , ω=τ=0, and mk is the number of genotypes at locus k. The method requires inverse of matrix V. If the sample size is large, that is, n>p, binomial inverse theorem (Henderson and Searle, 1980) can be used:

where

Based on our experience, empirical Bayes is feasible when the number of variables is less than 40 times the sample size. However, this condition is not frequently met in GWAS. If the LARS algorithm is first used to select the variables that are most potentially associated with the trait under polygenic background control, the effects of the selected markers can then be estimated by empirical Bayes.
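The binomial inverse theorem mentioned above replaces the inversion of an n × n matrix by the solution of a system whose size is the number of marker effects. The sketch below assumes the common variance structure V = Z diag(σ_k²) Z^T + σ²I with one effect per marker and uses the textbook best linear unbiased prediction formula; the exact expressions of Xu (2010) may differ in detail. All names and values are hypothetical.

```python
import numpy as np

def v_inverse_woodbury(Z, var_k, sigma2):
    """Invert V = Z diag(var_k) Z^T + sigma2*I through a p x p system."""
    n, p = Z.shape
    inner = sigma2 * np.diag(1.0 / var_k) + Z.T @ Z      # p x p matrix
    return (np.eye(n) - Z @ np.linalg.solve(inner, Z.T)) / sigma2

rng = np.random.default_rng(2)
n, p = 30, 4
Z = rng.choice([-1.0, 1.0], size=(n, p))    # toy marker design matrix
var_k = np.array([0.5, 0.1, 0.2, 0.05])     # made-up marker variances
sigma2 = 1.5                                # made-up residual variance

V = Z @ np.diag(var_k) @ Z.T + sigma2 * np.eye(n)
V_inv = v_inverse_woodbury(Z, var_k, sigma2)

# Textbook BLUP of the marker effects for a toy residual vector y - X*beta.
resid = rng.standard_normal(n)
gamma_blup = np.diag(var_k) @ Z.T @ V_inv @ resid
```

Only the p × p matrix `inner` is factorized, so the cost depends on the number of selected markers rather than on the sample size.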

Likelihood ratio (LR) test

Based on the estimate of marker effect γk in the reduced model, markers with are considered not to be associated with the trait; however, the association of the chosen markers with the trait and the effects θ={γ(1),⋯,γ(q)} needs to be tested, where q is the number of SNPs in the reduced model. To test the null hypothesis H0:γ(i)=0, that is, no QTL linked to the marker, we conducted an LR test by

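$$LR_i = -2\left[L(\hat{\theta}_{-i}) - L(\hat{\theta})\right]$$

where $\theta_{-i}=\{\gamma_{(1)},\dots,\gamma_{(i-1)},\gamma_{(i+1)},\dots,\gamma_{(q)}\}$, $L(\theta)=\sum_{j=1}^{n}\ln\phi(y_j;X\beta+Z\gamma,\sigma^2)$ is a log-likelihood function summed over the $n$ individuals, $\phi(y_j;X\beta+Z\gamma,\sigma^2)$ is a normal density function with mean $X\beta+Z\gamma$ and variance $\sigma^2$, and LOD=LR/4.605. The critical value for significance was set at LOD=2.0 (Bu et al., 2015).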
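The conversion between the LR statistic and the LOD score follows from LOD = LR/(2 ln 10), which gives the divisor 4.605. The sketch below uses made-up log-likelihood values, and the chi-square reference distribution with one degree of freedom is an assumption added only to show what LOD = 2.0 corresponds to on the P-value scale.

```python
import math

TWO_LN10 = 2.0 * math.log(10.0)    # 4.605..., the divisor in LOD = LR/4.605

def lr_statistic(loglik_reduced, loglik_full):
    """LR = -2 [L(theta without effect i) - L(theta)]."""
    return -2.0 * (loglik_reduced - loglik_full)

def lod_score(lr):
    return lr / TWO_LN10

def chi2_1df_pvalue(lr):
    """Upper tail of a chi-square with 1 df (assumed reference distribution)."""
    return math.erfc(math.sqrt(lr / 2.0))

lr = lr_statistic(loglik_reduced=-512.4, loglik_full=-506.1)   # made-up values
lod = lod_score(lr)
is_significant = lod >= 2.0
p_at_threshold = chi2_1df_pvalue(2.0 * TWO_LN10)   # P value at LOD = 2.0
```

Under that assumption, LOD = 2.0 corresponds to a P value of roughly 0.002, far less stringent than a Bonferroni threshold over hundreds of thousands of SNPs.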

AIC and BIC for testing goodness of fit of models

The goodness of fit of a statistical model can be measured by

$$\mathrm{AIC} = 2k - 2\ln L \quad\text{and}\quad \mathrm{BIC} = k\ln n - 2\ln L$$

where $L$ is the likelihood function value, $k$ is the number of independent variables and $n$ is the sample size. A smaller Akaike information criterion (AIC) or Bayesian information criterion (BIC) value indicates a better fit.
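As a small worked example of the two criteria, the sketch below compares two hypothetical regression fits for the same trait. The log-likelihoods and the numbers of variables are invented for illustration; only the sample size of 199 matches the Arabidopsis data.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * k - 2.0 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2.0 * log_likelihood

n_lines = 199
fit_a = {"loglik": -610.0, "k": 12}   # hypothetical: 12 significant SNPs
fit_b = {"loglik": -640.0, "k": 5}    # hypothetical: 5 significant SNPs

aic_a = aic(fit_a["loglik"], fit_a["k"])
aic_b = aic(fit_b["loglik"], fit_b["k"])
bic_a = bic(fit_a["loglik"], fit_a["k"], n_lines)
bic_b = bic(fit_b["loglik"], fit_b["k"], n_lines)
```

Here the larger model is preferred by both criteria because its gain in log-likelihood outweighs the penalty for the seven additional variables.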

pLARmEB has been implemented in R and its software can be downloaded from https://cran.r-project.org/web/packages/mrMLM/index.html.

Data sets for analyses

One Arabidopsis data set and four Monte Carlo simulated data sets were used to validate pLARmEB. Each data set contained phenotypic observations for quantitative traits and genotypic values for molecular markers.

The Arabidopsis data set

The data set downloaded from http://www.arabidopsis.org/ includes 199 diverse inbred lines, each with 216 130 SNPs and 107 traits (Atwell et al., 2010). Among these traits, seven are related to flowering time: days to flowering under long days, days to flowering under long days with vernalization, days to flowering under short days, days to flowering under short days with vernalization, days to flowering at 10 °C, days to flowering at 16 °C and days to flowering at 22 °C. We analyzed these traits using the pLARmEB, EMMA, multilocus random-SNP-effect mixed linear model (mrMLM) and fast multilocus random-SNP-effect EMMA (FASTmrEMMA) methods. The population structure matrix Q and the kinship coefficient matrix K between all pairs of lines were used to control population structure and polygenic background. SNPs with a minor allele frequency <10% were deleted. When all the markers on one chromosome were included in one genetic model, the markers on the other chromosomes were used to calculate the K matrix for polygenic background control (Rincent et al., 2014; Yang et al., 2014; Wei and Xu, 2016). Here, the 50 SNPs most potentially associated with the trait were selected to construct the reduced model. This number may vary across data sets.
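The minor allele frequency filter can be sketched as below. The 0/1 genotype coding for inbred lines, the SNP names and the function names are assumptions made for illustration and do not describe the actual data files.

```python
def minor_allele_frequency(genotypes):
    """MAF for inbred lines with genotypes coded 0 or 1 (assumed coding)."""
    freq = sum(genotypes) / len(genotypes)
    return min(freq, 1.0 - freq)

def filter_by_maf(snps, threshold=0.10):
    """Keep the SNPs whose minor allele frequency is at least the threshold."""
    return {name: g for name, g in snps.items()
            if minor_allele_frequency(g) >= threshold}

snps = {
    "snp_a": [0] * 19 + [1],          # MAF 0.05, removed
    "snp_b": [0, 1, 0, 1, 1, 0, 0, 1, 0, 0] * 2,   # MAF 0.40, kept
    "snp_c": [1] * 16 + [0] * 4,      # MAF 0.20, kept
}
kept = filter_by_maf(snps)
```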

Data sets from Monte Carlo simulation in natural population

Three Monte Carlo simulation experiments were conducted to validate pLARmEB. The three data sets are the same as those in Wang et al. (2016). In the first experiment, all the SNP genotypes were derived from the 216 130 SNPs reported by Atwell et al. (2010), and 2000 SNPs were randomly sampled from each chromosome (Chr.). The positions of these SNPs in the genome were between 11 226 256 and 12 038 776 bp on Chr. 1, between 5 045 828 and 6 412 875 bp on Chr. 2, between 1 916 588 and 3 196 442 bp on Chr. 3, between 2 232 796 and 3 143 893 bp on Chr. 4 and between 19 999 868 and 21 039 406 bp on Chr. 5 (Wang et al., 2016). The sample size was 199, the number of lines in Atwell et al. (2010). Six QTNs were simulated and placed on SNPs with a minor allele frequency of 0.30. The heritabilities of the QTNs were set as 0.10, 0.05, 0.05, 0.15, 0.05 and 0.05, respectively; their positions and effects are listed in Supplementary Table S1. The total average was set at 10.0 and the residual variance was set at 10.0. For each simulated QTN, we counted the number of samples in which the LOD (logarithm (base 10) of odds) exceeded 2.0 (Bu et al., 2015). A detected QTN within 2 kb of the simulated QTN was considered a true QTN. The ratio of the number of such samples to the total number of replicates (1000) represented the empirical power for this QTN. The false positive rate (FPR) was calculated as the ratio of the number of false positive effects to the total number of zero effects considered in the full model. To measure the bias of the effect estimates, the mean squared error (MSE)

$$\mathrm{MSE}_k = \frac{1}{1000}\sum_{i=1}^{1000}\left(\hat{\gamma}_{k(i)}-\gamma_k\right)^2$$

was calculated, where $\hat{\gamma}_{k(i)}$ is the estimate of effect $\gamma_k$ in the $i$th sample.
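The three evaluation measures defined above (empirical power, FPR and MSE) can be computed as in the following sketch. The replicate values are invented, and the convention of recording `None` when no significant SNP lies within 2 kb of the simulated QTN is an assumption of this example.

```python
def empirical_power(lod_by_replicate, threshold=2.0):
    """Fraction of replicates in which the simulated QTN was detected."""
    hits = sum(1 for lod in lod_by_replicate
               if lod is not None and lod > threshold)
    return hits / len(lod_by_replicate)

def false_positive_rate(n_false_positives, n_zero_effects):
    """False positive effects divided by all zero effects in the full model."""
    return n_false_positives / n_zero_effects

def mean_squared_error(estimates, true_effect):
    """Average squared deviation of the estimates from the simulated effect."""
    return sum((g - true_effect) ** 2 for g in estimates) / len(estimates)

lods = [3.1, None, 2.4, 1.7, 5.0]     # made up; None = nothing within 2 kb
power = empirical_power(lods)
fpr = false_positive_rate(3, 10_000)  # made-up counts
mse = mean_squared_error([0.9, 1.1, 1.2], true_effect=1.0)
```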

We investigated the effect of polygenic background on pLARmEB in the second experiment by adding polygenic effects from a multivariate normal distribution , where is polygenic variance and K is a pairwise kinship coefficient matrix among individuals. Here , so . The QTN size (h2), total average, residual variance and other parameter values were the same as those in the first experiment, and all the parameters are listed in Supplementary Table S2.

In the third experiment, we investigated the effect of epistatic background on pLARmEB. Three epistatic QTNs were added. The related parameters for the simulated three epistatic QTNs have been described in Wang et al. (2016). The QTN sizes (h2), total average, residual variance and other parameter values were also the same as those in the first experiment (Supplementary Table S3).

Monte Carlo simulation experiments in backcross

To test whether pLARmEB can be used in biparental populations, we conducted another simulation experiment. In this experiment, a backcross population of 200 individuals was simulated, each with 10 001 evenly spaced markers on the entire genome of 100 000 cM length. Eight main-effect QTLs were simulated and placed at marker positions. The sizes and locations of these QTLs are listed in Supplementary Table S4. The population mean (b0) and residual error variance (σ2) were set at 10 and 10, respectively. The number of replicates was set at 200.

Results

Monte Carlo simulation studies

Statistical power for QTN detection

To validate pLARmEB, three simulation experiments were conducted. In the first experiment, each simulated sample was analyzed by pLARmEB, least angle regression plus empirical Bayes (LARmEB), EMMA, FASTmrEMMA and mrMLM; among the 1000 samples, the first 100 were further analyzed using the Bayesian hierarchical generalized linear model (BhGLM). As shown in Supplementary Table S1 and Figure 1a, the average power for the six methods was 77.1, 68.9, 46.0, 70.7, 68.6 and 54.5%, respectively. pLARmEB, in which polygenic background is controlled, had the highest average power among the six methods (Figure 1a). To further confirm the effectiveness of pLARmEB, a polygenic effect simulated from a multivariate normal distribution (r2=9.2%) was added to each phenotype in the second experiment, and three epistatic QTNs (r2=15%) were added in the third experiment. The average powers based on pLARmEB, LARmEB, EMMA, FASTmrEMMA, mrMLM and BhGLM were 78.3, 69.6, 42.5, 75.0, 67.6 and 60.7%, respectively, in the second experiment (Supplementary Table S2); and 74.4, 57.5, 39.1, 59.2, 58.9 and 56.3%, respectively, in the third experiment (Supplementary Table S3). In all three experiments, the highest average power was obtained by pLARmEB, which includes polygenic background control.

Figure 1

Average powers in the detection of QTNs (a) and average of mean squared errors in the estimation of QTN effects (b) across six simulated QTNs using pLARmEB, LARmEB, EMMA, FASTmrEMMA, mrMLM and BhGLM.

Accuracies of estimated QTN effects

The MSE measures the accuracy of the estimated QTN effects, and a low MSE indicates high accuracy of parameter estimation. As shown in Figure 1b and Supplementary Tables S1–S3, the average MSEs based on pLARmEB, LARmEB, EMMA, FASTmrEMMA, mrMLM and BhGLM were 0.0895, 0.1005, 0.5432, 0.2885, 0.0940 and 0.2577, respectively, in the first experiment (Figure 1b and Supplementary Table S1); 0.0917, 0.0997, 0.5680, 0.3227, 0.0852 and 1.3139, respectively, in the second experiment (Supplementary Table S2); and 0.0973, 0.1240, 0.5973, 0.3450, 0.1024 and 0.3934, respectively, in the third experiment (Supplementary Table S3). pLARmEB had the lowest average MSE in the first and third experiments and was second only to mrMLM in the second experiment, indicating high accuracy in estimating QTN effects.

FPR and ROC curve

A high FPR is a major concern in GWAS. To address this issue, a very stringent significance level is frequently adopted in genome-wide single-marker scans. In our multilocus method, a less stringent significance level (LOD=2.0) is recommended, and we wanted to know whether this criterion produces a high FPR. All the FPR results in the three simulation experiments are listed in Supplementary Tables S1–S3. The FPRs based on pLARmEB, LARmEB, EMMA, FASTmrEMMA, mrMLM and BhGLM were 0.0009, 0.0127, 0.0325, 0.0084, 0.0168 and 0.0115%, respectively, in the first experiment (Supplementary Table S1); 0.0025, 0.0010, 0.0166, 0.0081, 0.0210 and 0.0093%, respectively, in the second experiment (Supplementary Table S2); and 0.0089, 0.0031, 0.0253, 0.0148, 0.0265 and 0.0120%, respectively, in the third experiment (Supplementary Table S3). These results indicate that pLARmEB had a low FPR.

To compare the efficiencies of the various approaches in detecting significant QTNs, receiver operating characteristic (ROC) curves were plotted. An ROC curve is a plot of average power against FPR. We calculated the corresponding average powers for 41 thresholds between $10^{-6}$ and $10^{-2}$ in the first simulation experiment and compared the ROC curves among the six methods. At significance levels between 0.01 and 0.001, pLARmEB had the highest power to detect QTNs among the six methods (Figure 2).
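An ROC curve of this kind can be assembled as in the sketch below. The text does not state how the 41 thresholds were spaced; equal spacing on the log10 scale (steps of 0.1 between 10^-6 and 10^-2) is an assumption of this example, and the P values are invented.

```python
def log_spaced_thresholds(low_exp=-6.0, high_exp=-2.0, count=41):
    """Thresholds equally spaced on the log10 scale (assumed spacing)."""
    step = (high_exp - low_exp) / (count - 1)
    return [10.0 ** (low_exp + i * step) for i in range(count)]

def roc_points(thresholds, p_true, p_null):
    """Return one (FPR, power) pair per significance threshold."""
    points = []
    for t in thresholds:
        power = sum(p <= t for p in p_true) / len(p_true)
        fpr = sum(p <= t for p in p_null) / len(p_null)
        points.append((fpr, power))
    return points

thresholds = log_spaced_thresholds()
p_true = [1e-7, 5e-5, 2e-3, 0.2]        # made-up P values at simulated QTNs
p_null = [0.5, 0.03, 4e-3, 0.7, 0.9]    # made-up P values at null markers
curve = roc_points(thresholds, p_true, p_null)
```

Plotting the second element of each pair against the first, with the horizontal axis on a log10 scale, gives a curve comparable in layout to Figure 2.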

Figure 2

Statistical powers of six simulated QTNs in the first simulation experiment plotted against false positive rate (in a log10 scale) for pLARmEB, LARmEB, EMMA, FASTmrEMMA, mrMLM and BhGLM.

Computational efficiency

We scanned and identified SNPs that were associated with the trait on each chromosome using LARS. We then included all the potentially associated SNPs across the genome into one genetic model and estimated their effects by empirical Bayes (Xu, 2010). For the first simulation experiment, the above procedures took 4.20, 6.82, 68.77, 8.32, 13.29 and >100 h (Intel Core i5-4570 CPU 3.20 GHz, Memory 7.88G, Nanjing, China) for pLARmEB, LARmEB, EMMA, FASTmrEMMA, mrMLM and BhGLM, respectively. pLARmEB took the least computing time among the six approaches. A similar trend was found in real data analyses (Supplementary Table S5).

Analysis of the Arabidopsis data set

To test the performance of pLARmEB, the data set of Atwell et al. (2010), containing 7 Arabidopsis flowering time traits along with 216 130 SNPs, was reanalyzed by pLARmEB, EMMA, FASTmrEMMA and mrMLM. All the significantly associated SNPs were used to fit a regression for each trait, and model fit was measured by the AIC and BIC values. The AIC values for all seven traits based on pLARmEB were much lower than those based on EMMA, FASTmrEMMA and mrMLM (Table 1). In addition, FASTmrEMMA and mrMLM fitted better than EMMA, and similar results were observed for the BIC values. These findings suggest that pLARmEB gives a better model fit than EMMA, FASTmrEMMA and mrMLM.

Table 1 AIC and BIC values for the regression of significantly associated SNPs on each Arabidopsis flowering time trait using pLARmEB, EMMA, FASTmrEMMA and mrMLM

We mined candidate genes for these traits within 20 kb of each significantly associated SNP. Among the genes identified in previous studies, pLARmEB, FASTmrEMMA and mrMLM identified more previously reported genes than EMMA (Supplementary Table S6). For example, pLARmEB, FASTmrEMMA and mrMLM each identified more than three genes for long days with vernalization, whereas EMMA detected only one gene (AT5G45890). A similar trend was also observed for other traits (Supplementary Table S6). Among these previously reported genes, 48 were identified only by pLARmEB (Table 2). Interestingly, genes AT2G19690 and AT2G19760 identified by pLARmEB were associated simultaneously with long days with vernalization and short days with vernalization (SDV), and three genes (AT2G07020, AT2G07040 and AT2G07050) adjacent to the SNP at 2 910 430 bp on chromosome 2 were found to be associated with short days.
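The 20-kb window search for candidate genes can be sketched as follows. The gene names and coordinates are hypothetical placeholders and are not taken from the Arabidopsis annotation; only the SNP position matches the one mentioned above.

```python
def genes_near_snp(snp_pos, genes, window=20_000):
    """Genes on the same chromosome overlapping snp_pos +/- window (bp)."""
    low, high = snp_pos - window, snp_pos + window
    return [name for name, (start, end) in genes.items()
            if start <= high and end >= low]

genes = {   # hypothetical gene names and coordinates on one chromosome
    "gene_1": (2_895_000, 2_897_500),
    "gene_2": (2_912_000, 2_915_000),
    "gene_3": (2_940_000, 2_943_000),
}
hits = genes_near_snp(2_910_430, genes)
```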

Table 2 The previously reported genes for seven flowering time traits in Arabidopsis that were detected only by pLARmEB

Discussion

Analysis of one random sample from the first Monte Carlo simulation experiment using LARS, empirical Bayes and pLARmEB showed that LARS identified many QTNs with small effects in addition to all the simulated QTNs, and thus its FPR was high (Figure 3a). Empirical Bayes also identified the simulated QTNs together with small-effect QTNs, although with a lower FPR (Figure 3b), whereas pLARmEB detected almost all the simulated QTNs, and the estimated effects of nonsimulated QTNs were close to zero (Figure 3c). More importantly, 48 previously reported genes in Arabidopsis were identified only by pLARmEB. Therefore, pLARmEB is a good alternative method for multilocus GWAS.

Figure 3

Comparison of least angle regression (a), empirical Bayes (b) and pLARmEB (c) in the estimation of QTN effects in one random sample of the first simulation experiment.

Although pLARmEB was proposed for GWAS, it is also appropriate for mapping populations such as backcross, doubled haploid and recombinant inbred lines. To illustrate its effectiveness in such populations, pseudo-markers were created genome-wide at every d cM, and the fourth Monte Carlo simulation experiment, with 200 simulated data sets, was conducted and analyzed using pLARmEB and empirical Bayes. pLARmEB showed higher power for QTL detection and less bias in the QTL-effect estimates than empirical Bayes (Supplementary Table S4). pLARmEB is also suitable for a population consisting of chromosome segment substitution lines; in that case, however, only marker positions can be scanned, because the conditional probabilities of pseudo-marker positions cannot be calculated. If a mapping population has more than two genotypes, for example, AA, Aa and aa in F2, the current method requires some modifications.

Among the previously identified genes in Arabidopsis (Supplementary Table S6), only a few were detected in common by several approaches, which differs from what is seen in linkage analysis. The main reason is that a GWAS mapping population has a complicated population structure. Although pLARmEB, FASTmrEMMA and mrMLM had similar powers of QTN detection in the simulation experiments, they detected different previously reported genes in the real data analysis. For example, 48 previously reported genes were identified only by pLARmEB (Table 2). For this reason, we recommend pLARmEB as an alternative method for GWAS and also recommend the joint use of several methods in the GWAS analysis of one trait.

The AIC or BIC values of FASTmrEMMA in Wen et al. (2017) and of mrMLM in Wang et al. (2016) differ from the corresponding values in this study. In this study, we considered population structure in GWAS. With the inclusion of population structure in the genetic model, some different SNPs were found to be significantly associated with the trait. These two differences result in different AIC or BIC values for the same trait in different studies.

Multilocus GWAS has become the state-of-the-art GWAS procedure. Iwata et al. (2007, 2009) developed multilocus Bayesian GWAS approaches for quantitative and ordinal traits, although running time is a major concern. Segura et al. (2012) proposed a multilocus linear mixed model method, which is a simple stepwise mixed-model regression with forward inclusion and backward elimination. Wang et al. (2016) suggested mrMLM and Wen et al. (2017) proposed FASTmrEMMA. To make the assumptions more suitable to a given data set, Zhou et al. (2013) and Moser et al. (2015) proposed a hybrid of the mixed linear model and the sparse regression model, named the Bayesian sparse linear mixed model. In this study, the integration of LARS with empirical Bayes under polygenic background control provides a simple and efficient approach to multilocus GWAS. In the Arabidopsis real data analysis, the number of SNPs was >1000 times larger than the sample size; nevertheless, we were able to scan each chromosome by LARS, include all the associated SNPs across the genome in one multilocus model and estimate their effects by empirical Bayes. In this respect, pLARmEB is better than EMMA.

To obtain a low FPR in GWAS, a relatively stringent significance criterion, such as Bonferroni correction, is widely adopted. Even with a less stringent significance criterion (such as LOD=2.0), pLARmEB has a lower FPR and higher power than EMMA. We also ran GEMMA (Zhou and Stephens, 2012), and its power was the same as that of EMMA (results not shown). Overall, pLARmEB works better than the other methods considered.

Data archiving

All simulated data sets are available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.sk652. The real data set can be retrieved from: http://www.arabidopsis.org/.