Abstract
Modern GWAS use enormous sample sizes and ultra-high-density SNP genotypes. These conditions reduce the mapping resolution of marginal association tests, the method most often used in GWAS. Multi-locus Bayesian Variable Selection (BVS) offers a one-stop solution for powerful and precise mapping of risk variants and polygenic risk score (PRS) prediction. We show (with an extensive simulation) that multi-locus BVS methods can achieve high power with a low false discovery rate and a much better mapping resolution than marginal association tests. We demonstrate the performance of BVS for mapping and PRS prediction using data on blood biomarkers from the UK Biobank (~300,000 samples and ~5.5 million SNPs). The article is accompanied by open-source R software that implements the methods used in the study and scales to biobank-sized data.
Introduction
Genome-Wide Association Studies (GWAS) have reported large numbers of variants associated with many important traits and diseases; however, for complex traits many small-effect risk loci remain unmapped. In the last decade, several public (e.g., UK Biobank [1], Million Veteran Program [2], TOPMed, All of Us) and private (e.g., 23andMe®) initiatives have generated unprecedentedly large biomedical data sets comprising genotype data linked to extensive phenotype/disease data. These advances in data availability have not been fully matched by adequate changes in the analysis methods used.
Single-marker regression (SMR) remains the method most frequently used for mapping in GWAS. SMR tests for the marginal association between a phenotype (or a disease indicator) and individual SNPs and does not account for linkage disequilibrium (LD) between variants. Therefore, it can lead to significant associations of phenotypes with SNPs that are physically distant from causal variants; we refer to this phenomenon as poor mapping resolution. Importantly, the mapping resolution of SMR deteriorates with sample size because a large sample size increases the power to detect weak marginal associations between SNPs and phenotypes (Supplementary Data, Section 1). Therefore, for fine mapping, most genetic studies adopt some form of local variable selection approach to refine (SMR) GWAS peaks to a smaller number of locally independent signals [3, 4]. However, these methods may reduce power due to cancellation of marginal effects [5] (this could happen if variants have effects with signs opposite to the sign of the covariance of the reference alleles at the two loci) and make accurate error control challenging.
Bayesian variable selection (BVS) models [6, 7] offer a one-stop solution for fine mapping and Polygenic Risk Score (PRS) prediction, with the clear advantage that Bayesian models can provide accurate error control. However, the adoption of these methods in GWAS has remained limited, in part because achieving high power with these methods requires a large sample size and because the computational burden of implementing BVS methods with ultra-high-density SNP panels and biobank-size data is substantial.
We implemented an efficient algorithm to generate samples from the posterior distribution of BVS models for problems involving hundreds of thousands of samples; the software is part of the BGLR R package [8]. In this study, we use this software to study the power-FDR performance of BVS for mapping very small-effect risk loci. We compared the performance of a BVS method with a prior from the Spike-Slab (SS) family known as BayesC [9] with that of marginal-association testing (SMR), two other BVS methods, SuSiE [10] and FINEMAP [11], and two non-Bayesian variable selection procedures (LASSO and a forward (FWD) regression). Furthermore, we used BayesC and SMR to map risk variants for six blood biomarkers related to metabolic syndrome. The empirical analysis shows that BayesC identifies most of the regions identified by SMR (and many more) with a much finer mapping resolution than SMR.
Materials and methods
We used data from the UK Biobank [1] comprising genotypes and phenotypes of distantly related (pairwise genomic relationships smaller than 0.05) individuals of European background (n = 315,874). From the imputed SNP genotypes, after filtering (for a minor-allele frequency >0.001 and a calling rate >0.95) and LD-pruning (R-squared <0.9), we retained 5,593,953 SNPs (see Supplementary Methods for more details).
For the evaluation of power and FDR, we simulated complex traits with 500 (randomly chosen) causal variants and a trait heritability of 0.5 (i.e., on average a causal locus explained 1/10th of 1% of the phenotypic variance). We conducted 10 whole-genome simulations, each involving 500 causal loci and 5,593,453 SNPs without effects. We also considered a second simulation scenario with the same heritability and a smaller number of causal variants (50), and thus larger SNP-effect sizes.
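As a quick check of the effect sizes implied by the two scenarios, the average share of phenotypic variance per causal locus is the heritability divided by the number of causal loci. The symbols \(h^2\) and \(q\) (number of causal loci) are introduced here only for this illustration; individual simulated effects may deviate from the average.

\(\frac{h^2}{q} = \frac{0.5}{500} = 0.001\) (0.1% of the phenotypic variance per locus, first scenario)

\(\frac{h^2}{q} = \frac{0.5}{50} = 0.01\) (1% of the phenotypic variance per locus, second scenario)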
We evaluated six regression methods: marginal association testing (via SMR) and five variable selection methods (LASSO, FWD, and three Bayesian variable selection procedures). The SMR was a simple linear regression fitted via ordinary least squares using the phenotype as the response and one SNP as the predictor.
The variable selection methods were multiple regression models of the form
\(y = X\beta + \varepsilon \qquad [1]\)
where \(y = \left( {y_1,y_2, \ldots ,y_n} \right)^\prime\) is a vector of phenotypes, \(X = \left\{ {x_{ij}} \right\}\) is a matrix of genotypes, \(\beta = \left( {\beta _1,\beta _2, \ldots ,\beta _p} \right)^\prime\) is a vector of SNP effects and \(\varepsilon = \left( {\varepsilon _1,\varepsilon _2, \ldots ,\varepsilon _n} \right)^\prime\) is a vector of error terms.
Local regressions
To apply variable selection methods on a whole-genome scale, we leveraged the fact that LD decays within relatively short distances; therefore, following Funkhouser et al. [12], we applied the variable selection method to overlapping segments containing 7000 contiguous SNPs (~4 Mbp for the imputed genotypes). This window of SNPs was displaced by 5000 SNPs, thus producing local regressions with a core of 3000 SNPs and flanking regions, each of ~2000 SNPs. From each regression we retrieved results from the core only (see Supplementary Methods for more details).
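The core-plus-flanks idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `local_windows`, the toy sizes, and the rule that the window advances by the core length (so that every SNP falls in exactly one core) are assumptions made here; the settings actually used are those given in the text and the Supplementary Methods.

```python
def local_windows(n_snps, window, flank):
    """Return (start, end, core_start, core_end) tuples for local regressions.

    Indices are 0-based and end-exclusive. Each regression is fitted to
    SNPs [start, end) but results are kept only for the core
    [core_start, core_end). Assumption of this sketch: cores tile the
    chromosome, i.e. the window advances by the core length.
    """
    core = window - 2 * flank
    if core <= 0:
        raise ValueError("window must be larger than twice the flank")
    segments = []
    core_start = 0
    while core_start < n_snps:
        core_end = min(core_start + core, n_snps)
        start = max(core_start - flank, 0)      # left flank, clipped at 0
        end = min(core_end + flank, n_snps)     # right flank, clipped at the end
        segments.append((start, end, core_start, core_end))
        core_start = core_end
    return segments


# Toy example: 20 SNPs, windows of 7 SNPs with flanks of 2 (core of 3).
segments = local_windows(n_snps=20, window=7, flank=2)
for seg in segments:
    print(seg)
```

Clipping the flanks at the chromosome ends keeps the first and last cores usable, at the cost of shorter flanks there.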
The LASSO [13] regressions were fitted using the glmnet [14] R package. The software produces a sequence of solutions \(\{ {\hat \beta _{\lambda _1},\hat \beta _{\lambda _2}, \ldots } \}\) over a grid of values of the regularization parameter (λ). We formed a grid with 1000 values that was evenly spaced on the log scale. The same grid of values of λ was used across each of the segments to which LASSO regression was applied (see Local regressions above). For each λ in the sequence we obtained a discovery set and a rejection set consisting of the SNPs with nonzero and zero effect in \(\hat \beta _\lambda\), respectively. We ranked SNPs based on the value of λ at which the SNP becomes active in the model; these ranks were used to evaluate power and FDR over the regularization path.
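The ranking rule can be illustrated with a small Python sketch. It does not call glmnet; it assumes a coefficient path is already available (here a hypothetical hand-written one), and the helper names `log_spaced_grid` and `entry_lambda` are introduced only for this example.

```python
import math


def log_spaced_grid(lam_max, lam_min, n_values):
    """Decreasing grid of lambda values, evenly spaced on the log scale."""
    step = (math.log(lam_min) - math.log(lam_max)) / (n_values - 1)
    return [math.exp(math.log(lam_max) + i * step) for i in range(n_values)]


def entry_lambda(path, grid):
    """Largest lambda at which each SNP has a nonzero coefficient.

    path[k][j] is the coefficient of SNP j at grid[k] (grid is decreasing).
    SNPs that never enter the model get None.
    """
    entry = [None] * len(path[0])
    for lam, coefs in zip(grid, path):
        for j, b in enumerate(coefs):
            if b != 0.0 and entry[j] is None:
                entry[j] = lam
    return entry


grid = log_spaced_grid(lam_max=1.0, lam_min=0.001, n_values=4)

# Hypothetical coefficient path for 3 SNPs; rows follow the grid order.
path = [
    [0.0, 0.0, 0.0],
    [0.0, 0.4, 0.0],
    [0.1, 0.5, 0.0],
    [0.2, 0.6, 0.0],
]
entry = entry_lambda(path, grid)

# Rank SNPs that entered the model, from largest to smallest entry lambda.
ranking = sorted((j for j in range(len(entry)) if entry[j] is not None),
                 key=lambda j: -entry[j])
print(ranking)
```

Using one shared grid across segments, as described above, is what makes entry values comparable between local regressions.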
The Forward regression also produces a sequence of solutions \(\{ {\hat \beta _{FWD_1},\hat \beta _{FWD_2}, \ldots } \}\) starting from the null model (no SNPs), then adding to the model one SNP at a time, at each step adding the SNP that produces the largest reduction in the residual sum of squares (RSS). The FWD regressions were applied to overlapping segments (see Local regressions above) and SNPs were ranked based on the reduction in the RSS produced when the SNP entered the model. These ranks were then used to evaluate power and FDR along the forward path.
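A forward path of this kind can be sketched in plain Python as follows. This is an illustrative implementation based on sequential orthogonalization, not the FWD() function of BGData; the function names and the toy data are assumptions made for the example, and an intercept is assumed to be always in the model.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]


def forward_regression(columns, y, n_steps):
    """Forward selection by largest reduction in the residual sum of squares.

    columns: list of predictor columns (one list of values per SNP).
    Returns [(snp_index, rss_reduction), ...] in order of entry.
    """
    resid_y = center(y)                                  # intercept-only model
    resid_x = {j: center(x) for j, x in enumerate(columns)}
    path = []
    for _ in range(n_steps):
        best_j, best_gain = None, 0.0
        for j, x in resid_x.items():
            xx = dot(x, x)
            if xx < 1e-12:                               # collinear with selected SNPs
                continue
            gain = dot(x, resid_y) ** 2 / xx             # RSS drop if SNP j is added
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:
            break
        xb = resid_x.pop(best_j)
        xbxb = dot(xb, xb)
        # Residualize the phenotype and the remaining SNPs on the added SNP.
        by = dot(xb, resid_y) / xbxb
        resid_y = [r - by * a for r, a in zip(resid_y, xb)]
        resid_x = {
            j: [c - (dot(xb, x) / xbxb) * a for c, a in zip(x, xb)]
            for j, x in resid_x.items()
        }
        path.append((best_j, best_gain))
    return path


# Toy data: y = 3 * x0 + 1 * x1; x2 is unrelated to y.
x0 = [0, 1, 2, 0, 1, 2]
x1 = [0, 0, 1, 1, 2, 2]
x2 = [1, 0, 1, 0, 1, 0]
y = [3 * a + b for a, b in zip(x0, x1)]
fwd_path = forward_regression([x0, x1, x2], y, n_steps=2)
print(fwd_path)
```

The second element of each tuple is the quantity used for ranking in the text: the reduction in the RSS at the step where the SNP entered.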
For the Bayesian Variable Selection regression, we first used a model from the Spike-Slab family known as BayesC [9]. Briefly, the model assumes that the error terms in [1] are iid normal, \(\varepsilon _i\sim ^{iid}N\left( {0,\sigma _\varepsilon ^2} \right)\); therefore, the conditional distribution of the data given the model parameters \(\theta = \left\{ {\beta ,\sigma _\varepsilon ^2} \right\}\) was:
\(p\left( {y \mid X,\theta } \right) = MVN\left( {y \mid X\beta ,I\sigma _\varepsilon ^2} \right) \qquad [2]\)
where \(MVN\left( {y \mid X\beta ,I\sigma _\varepsilon ^2} \right)\) represents a multivariate normal density with mean Xβ and (co)variance matrix \(I\sigma _\varepsilon ^2\).
In Bayesian models, priors that assign nonzero probabilities to null effects also specify probabilities over possible models; this plays a very important role in error control [15]. Therefore, we consider a prior for SNP effects that has a point of mass at zero and a Gaussian slab
\(p\left( {\beta _j \mid \pi ,\sigma _\beta ^2} \right) = \pi N\left( {\beta _j \mid 0,\sigma _\beta ^2} \right) + \left( {1 - \pi } \right)1\left( {\beta _j = 0} \right) \qquad [3]\)
where π \(\left( {0 \le \pi \le 1} \right)\) represents the proportion of loci with non-null effects and \(\sigma _\beta ^2\) is the variance of effects (other common choices for the slab are the scaled-t and double-exponential). The prior used in BayesC [9] is equivalent to the one earlier proposed by George & McCulloch [16] with the Gaussian spike replaced with a point of mass at zero.
The hyperparameters (π, \(\sigma _\beta ^2\) and \(\sigma _\varepsilon ^2\)) are unknown; thus, for the variance parameters we use scaled-inverse chi-square priors and for \(\pi\) we use a Beta prior, \(\pi \sim B\left( {\alpha _1,\alpha _2} \right)\) with \(\alpha _1 = 1.1\) and \(\alpha _2 = 99\), implying \(E\left[ \pi \right] \approx 1.1/100\).
We compared the power-FDR performance of BayesC with that of SuSiE [17] and FINEMAP [11]. FINEMAP was developed to refine peaks detected in GWAS; therefore, we applied FINEMAP to segments detected through marginal association testing. The segments consisted of SNPs with single-marker-regression p-value smaller than 5e-8 that were at a distance from each other smaller than 1 Mbp. SuSiE was applied on a whole-genome scale using the same local regression approach used to implement BayesC.
Bayesian FDR
We used the samples from the posterior distribution to estimate SNP-specific probabilities of association: \(\pi _j = p\left( {\beta _j \,\ne\, 0 \mid data} \right)\). The “local” FDR (LFDR [18]) for the jth SNP with \(\pi _j\) is simply \(LFDR_j = 1 - \pi _j\). A decision rule that rejects \(H_{0j}\) if \(\pi _j > \tau\) (\(\tau \in [0,1]\)) has an expected proportion of false discoveries equal to the average LFDR of the SNPs in the discovery set:
\(BFDR_\tau = \frac{1}{p_\tau }\sum\limits_{j:\pi _j > \tau } {\left( {1 - \pi _j} \right)} \qquad [4]\)
where \(p_\tau\) is the number of SNPs in the discovery set. Expression [4] was evaluated for each SNP using the BFDR() function of the BGLR R package [19].
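The calculation of the Bayesian FDR from posterior inclusion probabilities can be sketched as follows. This Python function is only an illustration of the average-LFDR rule described above, not the BGLR implementation; the function name `bfdr` and the toy probabilities are assumptions, and ties between probabilities are simply broken by sort order.

```python
def bfdr(pips):
    """Bayesian FDR of the discovery set that ends at each SNP.

    SNPs are sorted by decreasing posterior inclusion probability (PIP).
    The value attached to a SNP is the mean local FDR (1 - PIP) of that
    SNP and of all SNPs ranked above it.
    """
    order = sorted(range(len(pips)), key=lambda j: -pips[j])
    out = [0.0] * len(pips)
    running = 0.0
    for rank, j in enumerate(order, start=1):
        running += 1.0 - pips[j]
        out[j] = running / rank
    return out


# Hypothetical inclusion probabilities for four SNPs.
pips = [0.99, 0.20, 0.95, 0.60]
values = bfdr(pips)

# SNPs that can be declared discoveries at a BFDR threshold of 0.05.
discoveries = [j for j, v in enumerate(values) if v <= 0.05]
print(discoveries)
```

Note that the SNP with inclusion probability 0.95 is retained at the 0.05 threshold because the criterion averages the local FDR over the whole discovery set.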
Software
SNP filtering was done using PLINK [20]; genomic relationships were computed using the getG() function of the BGData R package [21]. Single-marker regressions were performed using the GWAS() function of the BGData R package. BayesC and SuSiE were implemented using the BGLR [19] (function BLRXy()) and susieR [17] R packages, respectively. FINEMAP was fitted using the FINEMAP command line tool [11]. The Forward regressions were implemented using the FWD() function available in the BGData R package, and LASSO regressions were fitted using the glmnet [14] R package. Plots were generated using ggplot2 [22].
Power and FDR determination
To estimate power-FDR curves, for each simulation scenario and method we ranked SNPs based on the evidence for association produced by each method: (i) the p-values for the SMR (from smallest to largest), (ii) single-SNP posterior probabilities of inclusion for the BVS methods (from largest to smallest; this was used for all the Bayesian models), (iii) the value of λ at which the SNP entered the model for the LASSO regressions (from largest to smallest), and (iv) the reduction in the RSS produced when the SNP entered the model in the FWD regression (from largest to smallest). We then produced discovery and rejection sets for each method by selecting the top-k SNPs of each of the ranks (k = 1, 2, …). For each discovery set we estimated the proportion of the 500 causal loci recovered in the discovery set and the proportion of SNPs in the discovery set that were not causal loci (i.e., the false discovery proportion).
To evaluate the ability of each method to fine-map causal variants we estimated the power-FDR performance at different mapping resolutions. Specifically, for an x-kbp mapping resolution (x = 10 kbp, 100 kbp, …, 1 Mbp), a discovery was considered true (false) if the distance to the closest causal variant was smaller (larger) than x kbp.
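The evaluation at a given mapping resolution can be sketched in Python as below. The function name, the single-chromosome coordinates and, in particular, the rule that a causal variant counts as recovered when some discovery lies within the resolution are assumptions of this sketch; the exact definitions used in the study are in the Supplementary Methods.

```python
def power_and_fdp(ranked_positions, causal_positions, k, resolution_bp):
    """Power and false discovery proportion of the top-k discoveries.

    ranked_positions: base-pair positions of SNPs, best-ranked first.
    A discovery is true if a causal variant lies closer than resolution_bp;
    a causal variant is recovered if a discovery lies closer than
    resolution_bp (assumption of this sketch).
    """
    top = ranked_positions[:k]

    def near(pos, others):
        return any(abs(pos - o) < resolution_bp for o in others)

    false_disc = sum(1 for d in top if not near(d, causal_positions))
    recovered = sum(1 for c in causal_positions if near(c, top))
    fdp = false_disc / len(top) if top else 0.0
    power = recovered / len(causal_positions)
    return power, fdp


# Toy example with two causal variants and three ranked discoveries.
causal = [100_000, 900_000]
ranked = [105_000, 400_000, 880_000]
fine = power_and_fdp(ranked, causal, k=3, resolution_bp=10_000)
coarse = power_and_fdp(ranked, causal, k=3, resolution_bp=100_000)
print(fine, coarse)
```

In the toy example the same discovery set looks much better at the coarse resolution than at the fine one, which is the pattern the power-FDR curves are designed to expose.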
Analysis of six blood biomarkers
The simulation study demonstrated that the FWD and the BVS methods BayesC and SuSiE had the best performance. Furthermore, the performance of BayesC and SuSiE was very similar and better than that of FINEMAP; therefore, for analysis of the real data we used BayesC and SMR, which is the method most used in GWAS.
The biomarkers that we analyzed (glucose, serum urate (SU), serum creatinine, low- and high-density lipoprotein cholesterols (LDL and HDL, respectively), and triglycerides) are often monitored in medical check-ups and are related to metabolic syndrome (see Table S1 of the Supplementary Data for sample size and descriptive statistics by trait).
Analyses were performed using the same genotypes used in the simulation (~5.6 million SNPs). All the traits were adjusted for the effects of sex, age, center, and the top 10 SNP-derived eigenvectors. For rejection we used p-value < 5e-8 for the SMR and BFDR ≤ 0.05 or ≤ 0.10 for the BVS method. In regions of high LD there may be multiple SNPs with elevated posterior probability of nonzero effect, with none of them reaching the single-SNP BFDR threshold (see Section I of the Supplementary Data for examples of this). Therefore, after identifying individual SNPs that cleared the BFDR thresholds mentioned above, we also identified short segments that had elevated inclusion probability but did not clear the BFDR threshold. For these segments we estimated the posterior probability of the segment (i.e., the frequency at which at least one SNP from the segment was active in the model) and included that segment in the discovery set if the segment BFDR was smaller than 0.05 or 0.1. Therefore, the discovery sets for the BVS method consisted of individual SNPs and short segments that cleared one of the two BFDR thresholds.
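The segment-level probability can be computed directly from saved posterior samples of the inclusion indicators. The Python sketch below assumes those indicators are available as 0/1 vectors (one per saved iteration); the function name and the toy samples are hypothetical.

```python
def segment_inclusion_probability(samples, segment):
    """Fraction of posterior samples with at least one active SNP in a segment.

    samples: list of 0/1 indicator vectors, one per saved MCMC iteration
             (1 if the SNP effect is nonzero in that iteration).
    segment: iterable of SNP indices forming the segment.
    """
    segment = list(segment)
    hits = sum(1 for d in samples if any(d[j] for j in segment))
    return hits / len(samples)


# Toy samples: SNPs 0 and 1 are in high LD and alternate in the model.
samples = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
single = [sum(d[j] for d in samples) / len(samples) for j in range(3)]
joint = segment_inclusion_probability(samples, [0, 1])
print(single, joint)
```

Here neither SNP 0 nor SNP 1 has a high inclusion probability on its own, yet the segment containing both is active in most samples, which is the situation described above for regions of high LD.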
Polygenic risk scores
To evaluate the prediction accuracy of polygenic risk scores (PRS) we set aside data from 10,000 individuals for testing. As a baseline PRS we used one based on GWAS-significant SNPs (p-value < 5e-8) with SNP effects estimated from SMR. These estimates do not account for LD; therefore, we considered a second PRS in which SNPs were selected based on SMR p-values and then SNP effects were estimated using BayesC. For these PRSs, we used p-value thresholds for SNP selection ranging from 1e-12 to 1e-4. Finally, we considered a whole-genome PRS derived using the estimates of effects from the local Bayesian regressions implemented using model BayesC (the same approach used for mapping). These local Bayesian regressions covered all the available SNPs (~5.6 million); however, to simplify the computation of the PRS we only used the SNPs with posterior inclusion probability greater than 1/1000.
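The whole-genome PRS and its evaluation can be sketched as follows. This is a minimal Python illustration under stated assumptions: genotypes are coded as allele dosages, `effects` stands for estimated SNP effects (e.g., posterior means), and the function names and toy numbers are hypothetical rather than taken from the study.

```python
def polygenic_score(genotypes, effects, pips, min_pip=1e-3):
    """PRS per individual: sum of allele dosage times estimated effect,
    restricted to SNPs with posterior inclusion probability above min_pip."""
    keep = [j for j, p in enumerate(pips) if p > min_pip]
    return [sum(row[j] * effects[j] for j in keep) for row in genotypes]


def correlation(u, v):
    """Pearson correlation between two equally long lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


# Toy testing set: 4 individuals, 3 SNPs; the third SNP is filtered out.
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]
effects = [0.5, -0.2, 0.01]
pips = [0.9, 0.3, 0.0005]
phenotypes = [-0.1, 0.4, 1.2, -0.5]

scores = polygenic_score(genotypes, effects, pips)
r = correlation(scores, phenotypes)
print(scores, r)
```

The prediction correlation reported in the Results corresponds to `r` computed in the testing set, with effects estimated in the remaining (training) data.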
Results
The power-FDR curves estimated from the simulation scenario with heritability 0.5 and 500 causal loci are displayed in Fig. 1 (and File S1 of the Supplementary Data). For a sample size of 10,000 and a mapping resolution of 100 kbp (top-left panel of Fig. 1) all the methods had relatively low power; this was expected because individual SNPs with non-null effect explained only 1/1000 of the phenotypic variance. Increasing sample size improved the power-FDR performance of all the methods; however, the variable selection methods improved their performance much more than the SMR. Among the variable selection procedures, the BVS methods (including BayesC, SuSiE, and FINEMAP) and the FWD regression were the best performing ones. Importantly, with a large sample size these methods had a very sharp phase transition in the power-FDR curve, showing that they can achieve high power with very low FDR even for very small-effect variants. This was evident even with a mapping resolution of 10 kbp (see top-right plot in Fig. 1). On the other hand, the SMR only achieved a comparable power-FDR performance with a mapping resolution of 1 Mbp (see lower-right plot), demonstrating that with a large sample size mapping based on SMR p-values produces a large proportion of discoveries that are more than 100 kbp away from the causal variants. Among the Bayesian methods, SuSiE and BayesC performed very similarly and FINEMAP had a slightly lower power for an FDR of 0.1 (see Fig. 1, top two panels for sample size 50,000 and 100,000). This small reduction in power may result from some of the small-effect causal variants not reaching GWAS-significant values and thus not making it to the second step.
The results from the simulation scenario with larger effect sizes (heritability 0.5, 50 causal variants, Fig. S4) were similar to the ones obtained in the simulation scenario with 500 causal variants in that FWD, SuSiE, and BayesC achieved the best power-FDR performance and had very sharp power-FDR transitions. However, as expected, for any given sample size and FDR in this scenario these three methods achieved higher power than in the scenario with smaller effect sizes (500 causal variants). On the other hand, the power-FDR performance of SMR was worse in the scenario with larger effects (50 causal variants) than in the scenario with smaller effects. This happens because large-effect loci can generate significant marginal associations even under very weak LD (i.e., at a long physical distance) between the marker and the causal variant.
Bayesian FDR control
We used the results from the most challenging simulation scenario (heritability 0.5, 500 causal variants) to evaluate the empirical FDR of standard decision rules including SMR p-value ≤ 5e-8 and BFDR ≤ 0.10 or 0.05. The results are summarized in Fig. 2 and Figs. S5 and S6. For a 1 Mbp mapping resolution the standard rule used in GWAS (SMR p-value ≤ 5e-8) leads to an FDR of ~0.08, comparable to a decision rule using BFDR ≤ 0.1, and a bit higher than using BFDR ≤ 0.05 (lower panel of Figs. 2 and S5, S6). However, for finer mapping resolutions (e.g., 125 kbp) a decision rule that rejects if SMR p-value ≤ 5e-8 can produce a rate of false discoveries greater than 50%. Importantly, for the SMR, the exponential growth of the FDR with increasingly finer mapping resolution was more marked with large sample size, illustrating once again how the mapping resolution of SMR deteriorates with sample size. On the other hand, while the BVS model also had an increasing FDR with finer mapping resolution, the slope of the curves was very small compared with that of the SMR, suggesting that the prior provides reasonably effective (albeit not perfect) error control. We conclude from these results that, for data from unrelated white Europeans, using a BFDR < 0.05 as a decision rule leads to an FDR ≤ 0.1 for a mapping resolution of ~125 kbp.
High-resolution mapping of risk loci associated with six metabolic syndrome-associated blood biomarkers
Table 1 and Fig. 3 display the results of the SMR and of BayesC. The number of variants with SMR-significant marginal association ranged from 469 (glucose) to 5991 (serum urate). We grouped the SMR-significant variants into non-overlapping chromosome segments, each including all the SMR-significant variants that were at a distance smaller than 1 Mbp. The number of segments harboring SMR-significant variants ranged from 43 (glucose) to 225 (HDL cholesterol); these regions are displayed in yellow-red scale in Fig. 3.
BayesC identified a much smaller number of variants than the SMR; however, the number of independent segments identified by BayesC was typically higher than the number identified by SMR, except for glucose. Most often BayesC selected one or a few variants within each of the segments (Fig. 3). The segments identified by BayesC were often very short, with a median length of about 30–36 kbp. On the other hand, the SMR segments had a median length of 142.5 kbp.
Polygenic prediction
Figure 4 and Table S2 show the prediction correlations obtained in testing sets. A PRS based on GWAS-significant SNPs (SMR p-value < 5e-8) and with SNP effects estimated from SMRs achieved prediction correlations ranging from 0.09 (+/− 0.01, glucose) to 0.302 (+/− 0.01, HDL cholesterol); the results from these PRSs are represented in blue in Fig. 4 (see also Table S2). The estimates of effects from SMR do not account for LD; re-estimating the SNP effects of GWAS-significant SNPs using BayesC led to significant increases in prediction correlations. The gains in prediction correlation achieved by re-estimating the effects of GWAS-significant SNPs using BayesC ranged from 17% (glucose) to 47% (triglycerides). The prediction correlations of the PRS that used the estimates of effects from the whole-genome Bayesian regressions (horizontal dashed black lines in Fig. 4, see also Table S2) were very similar to the ones obtained by a PRS based on GWAS-significant SNPs with effect estimates derived using BayesC. Furthermore, for all traits but creatinine, the prediction accuracy achieved by the whole-genome Bayesian regression was within the margin of error of the maximum prediction accuracy that one could obtain in this data set by selecting SNPs using p-values from SMR and then estimating the effects of the SNPs using BayesC (i.e., the maximum of the salmon curve in Fig. 4).
Discussion
Modern genetic studies use a very large sample size and ultra-high-density genotypes (potentially millions of SNPs). In principle, the large sample size and the high marker density should improve our ability to map risk variants. However, these conditions deteriorate the mapping resolution of SMR, the methodology most frequently used in GWAS. We illustrated this problem with extensive simulations and with the analysis of six blood biomarkers. With a sample size of ~300,000 and high marker density, SMR can lead to significant associations for variants that are up to 300–1000 kbp away from the causal variant, depending on the effect size and the extent of LD in the region (Figs. S1, S2). This results in poor power-FDR performance (Fig. 1, Fig. S4); thus, when marginal association testing is applied to biobank-size data and ultra-high-density genotypes, high power can only be achieved at the price of a very high FDR.
To address the poor mapping resolution of SMR several methods have been proposed. One approach is to ‘weight’ the evidence of association of SNPs within a region to estimate an approximate posterior probability of association [3, 23]. However, this approach assumes that only one SNP (in the region) has an effect and does not fully account for multi-locus LD in the region. Another common approach is to use two-step procedures in which first a marginal-association test is used to identify chromosome segments harboring GWAS-significant variants and then, in a second step, the GWAS summary statistics obtained in the first step are used, in conjunction with an LD reference panel, to identify independent signals. However, in the first step the procedure may miss important signals due to “unfaithfulness” or cancellation of marginal effects [5]. Additionally, a reference panel used to approximate LD may not accurately reflect the LD patterns of the data set used to derive the GWAS summary statistics in the first place. The slightly worse performance of FINEMAP likely reflects a loss of power due to the use of a two-step procedure. Furthermore, we note that our results likely give an optimistic view of the performance of two-step procedures because, here, the LD matrix was computed using the same data set that was used to obtain the SMR summary statistics. If, as is often done, the LD matrix is computed from a reference panel (with possibly different LD patterns than the data set used to derive the summary statistics) the loss of power may be higher.
To address limitations of two-step procedures, here we considered four variable selection methods (FWD, LASSO, and two Bayesian variable selection procedures: BayesC and SuSiE) that account for multi-locus LD. These methods are not new; however, the adoption of these methods in human GWAS has been limited in part because achieving high power with variable selection methods often requires a very large sample size. The advent of Big Data in genomic research has opened new opportunities for the use of these methods in GWAS.
Among the four variable selection methods considered, the FWD regression and the BVS methods (both SuSiE and BayesC) were the ones that achieved the best power-FDR performance. With a large sample size (n ≥ 100,000) these methods can achieve high power with low FDR and very fine mapping resolution, even for very-small-effect variants.
BayesC, a Bayesian method with a Spike-Slab prior, and the FWD regression achieved a very good (and remarkably similar) power-FDR performance. This is not surprising considering the links that exist between these two methods and subset selection. The FWD regression is an approach developed to approximate subset selection by constraining the search to a path that adds one predictor at a time [24]. Furthermore, the objective function of subset selection, \(\hat \beta = argmin\left\{ {RSS\left( {y,X,\beta } \right) + \lambda \sum\nolimits_j 1\left( {\beta _j \,\ne\, 0} \right)} \right\},\) can be seen as the logarithm of the kernel of the posterior distribution of a Bayesian model with a Gaussian likelihood and a prior on SNP effects with a point of mass and a flat slab, which is similar to the prior used in BayesC.
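The link with subset selection can be made explicit with a heuristic calculation. Assume, for illustration only, a known error variance \(\sigma _\varepsilon ^2\) and a prior that sets \(\beta _j = 0\) with probability \(1 - \pi\) and otherwise assigns \(\beta _j\) a flat slab of constant height \(c\); the symbols \(c\), \(k\) and the informal treatment of the point mass alongside a density are assumptions of this sketch and not part of the original notation. With \(k = \sum\nolimits_j 1\left( {\beta _j \ne 0} \right)\) active SNPs out of \(p\), the logarithm of the posterior kernel is

\(\log p\left( {\beta \mid y} \right) = - \frac{{RSS\left( {y,X,\beta } \right)}}{{2\sigma _\varepsilon ^2}} + k\log \left( {\pi c} \right) + \left( {p - k} \right)\log \left( {1 - \pi } \right) + const\)

Multiplying by \(- 2\sigma _\varepsilon ^2\) and collecting the terms that depend on \(k\) gives

\(- 2\sigma _\varepsilon ^2\log p\left( {\beta \mid y} \right) = RSS\left( {y,X,\beta } \right) + \lambda \sum\nolimits_j 1\left( {\beta _j \ne 0} \right) + const,\quad \lambda = 2\sigma _\varepsilon ^2\log \frac{{1 - \pi }}{{\pi c}}\)

Under these assumptions, maximizing the posterior kernel corresponds to minimizing the subset selection objective, and a smaller prior proportion of non-null effects (smaller \(\pi\)) translates into a heavier penalty on model size.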
Collecting samples from the posterior distribution of high-dimensional Bayesian models is computationally demanding. However, advances in hardware and in algorithms have made the application of BVS to biobank-size data feasible. As a reference we provide in Supplementary Fig. S7 the estimated computing time required for BLRXy() to generate 10,000 posterior samples as a function of the number of SNPs in the model (from 1000 to 10,000 SNPs) and sample size (we evaluated up to n = 300,000). The information in the appendix also provides the computing times required for up to 100 iterations of SuSiE and SuSiE-sufficient statistics. It took on average 17 min for BLRXy() to generate 10,000 posterior samples for a model involving 10,000 SNPs and a sample size of 300,000. The computing times of BLRXy() were similar to those of SuSiE-sufficient statistics and considerably lower than those of SuSiE when sample size was large. These results show that it is feasible to apply Bayesian regressions using the local overlapping-segments approach used in Funkhouser et al. [12] and adopted here.
In this study we focused on a specific BVS model that uses a prior with point of mass at zero and a Gaussian slab. Our simulation results suggest that the power-FDR performance of different BVS methods (e.g., BayesC, SuSiE) is very similar (see Fig. 1) provided that the prior induces some form of variable selection. There are many other variable selection priors that we anticipate will perform similarly, including priors from the spike-slab family that use non-Gaussian slabs (e.g., scaled-t [25], or double-exponential [26, 27]).
One concern that is often raised about Bayesian models is the need to specify prior hyperparameters and the influence that these may have on inferences. In the case of BayesC there are two hyperparameters: the prior proportion of nonzero effects and the variance of the slab. To avoid specifying these hyperparameters a priori, we treated them as unknown and assigned priors to each of them. For the variance, we chose a scaled-inverse chi-square with small DF, which results in limited influence of the prior on inferences when sample size is large. For the proportion of nonzero effects, we used a Beta prior with a prior mean of 1/100 (i.e., assuming a priori that 1% of the SNPs have nonzero effect). One could use a uniform prior (which is a special case of the Beta); however, adequate FDR control and stringent variable selection can be better achieved by using priors that are informative; this can be particularly important for studies involving a much smaller sample size than the one presented here.
In regions of high LD, collinearity may lead to many SNPs with elevated inclusion probability without any of them reaching stringent FDR thresholds (e.g., BFDR < 0.1), thus reducing power. In our analysis of blood biomarkers, we illustrated how this problem can be addressed using methods that identify sets of variants that are jointly associated with a phenotype.
Finally, we evaluated various strategies to build PRSs; our results suggest that the prediction accuracy that can be achieved using a whole-genome BVS procedure implemented using local regressions is similar to the highest prediction accuracy that can be achieved by fitting a BVS model to SNPs filtered based on marginal association tests. Therefore, we conclude that BVS applied using local Bayesian regressions can be used for both fine mapping and accurate PRS prediction.
Data availability
The data that support the findings of this study are available from the UK Biobank but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of the UK Biobank.
References
Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLOS Med [Internet]. 2015;12:e1001779. https://doi.org/10.1371/journal.pmed.1001779.
Gaziano JM, Concato J, Brophy M, Fiore L, Pyarajan S, Breeling J, et al. Million Veteran Program: A megabiobank to study genetic influences on health and disease. J Clin Epidemiol [Internet]. 2016 Feb 1 [cited 2018 Mar 31];70:214–23. Available from: http://linkinghub.elsevier.com/retrieve/pii/S0895435615004448.
Mahajan A, Taliun D, Thurner M, Robertson NR, Torres JM, Rayner NW, et al. Finemapping type 2 diabetes loci to singlevariant resolution using highdensity imputation and isletspecific epigenome maps. Nat Genet [Internet]. 2018;50:1505–13. https://doi.org/10.1038/s4158801802416.
Yang J, Ferreira T, Morris AP, Medland SE, Madden PAF, Heath AC, et al. Conditional and joint multipleSNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet. 2012;44:369–75.
Wasserman L, Roeder K. Highdimensional variable selection. Ann Stat [Internet]. 2009;37:2178–201. http://projecteuclid.org/euclid.aos/1247663752.
George EI, McCulloch RE. Variable selection via Gibbs sampling. J Am Stat Assoc [Internet]. 1993;88:881–9. https://doi.org/10.1080/01621459.1993.10476353.
Ishwaran H, Rao JS. Spike and slab variable selection: Frequentist and bayesian strategies. Vol. 33, Annals of Statistics. Institute of Mathematical Statistics; 2005. p. 730–73.
Pérez P, de los Campos G. Genomewide regression and prediction with the BGLR statistical package. Genet [Internet]. 2014;198:483–95. http://www.ncbi.nlm.nih.gov/pubmed/25009151.
Habier D, Fernando R, Kizilkaya K, Garrik DJ. Extension of the {B}ayesian Alphabet for Genomic Selection. BMC Bioinformatics. 2011;12.
Wang G, Sarkar A, Carbonetto P, Stephens M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J R Stat Soc Ser B Statistical Methodol [Internet]. 2020;82:1273–300. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12388.
Benner C, Spencer CCA, Havulinna AS, Salomaa V, Ripatti S, Pirinen M. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics [Internet]. 2016;32:1493–501. https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btw018.
Funkhouser SA, Vazquez AI, Steibel JP, Ernst CW, de los Campos G. Deciphering sex-specific genetic architectures using local Bayesian regressions. bioRxiv [Internet]. 2019 May 31 [cited 2019 Jun 15];653386. Available from: https://www.biorxiv.org/content/10.1101/653386v1.
Tibshirani R. Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B. 1996;58:267–88.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw [Internet]. 2010;33:1–22. https://www.jstatsoft.org/index.php/jss/article/view/v033i01/v33i01.pdf.
Scott JG, Berger JO. Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Ann Stat [Internet]. 2010;38:2587–619. http://projecteuclid.org/euclid.aos/1278861454.
George EI, McCulloch RE. Variable selection via Gibbs sampling. J Am Stat Assoc. 1993;88:881–9.
Wang G, Sarkar A, Carbonetto P, Stephens M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J R Stat Soc Ser B Statistical Methodol [Internet]. 2020;82:1273–300. https://onlinelibrary.wiley.com/doi/10.1111/rssb.12388.
Efron B, Hastie T. Computer Age Statistical Inference. Cambridge University Press; 2016.
Pérez P, de los Campos G. Genome-wide regression and prediction with the BGLR statistical package. Genetics. 2014;198:483–95.
Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience [Internet]. 2015;4:7. https://academic.oup.com/gigascience/article-lookup/doi/10.1186/s13742-015-0047-8.
Grueneberg A, de los Campos G. BGData - A Suite of R Packages for Genomic Analysis with Big Data. G3 (Bethesda) [Internet]. 2019 May 7 [cited 2019 Jul 10];9:1377–83. Available from: http://www.ncbi.nlm.nih.gov/pubmed/30894453.
Wickham H. ggplot2: Elegant Graphics for Data Analysis [Internet]. Springer-Verlag New York; 2016. Available from: https://ggplot2.tidyverse.org.
Maller JB, McVean G, Byrnes J, Vukcevic D, Palin K, Su Z, et al. Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat Genet [Internet]. 2012;44:1294–301. https://doi.org/10.1038/ng.2435.
Draper NR, Smith H. Applied regression analysis. Wiley; 2014. 1–716 p.
Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157:1819–29.
Park T, Casella G. The Bayesian LASSO. J Am Stat Assoc. 2008;103:681–6.
de los Campos G, Naya H, Gianola D, Crossa J, Legarra A, Manfredi E, et al. Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics. 2009;182:375–85.
Acknowledgements
The authors thank Professors Michael Boehnke and James O. Berger for comments on earlier versions of this manuscript.
Funding
This research has been conducted using the UK Biobank Resource under Application Number 15326. The development of the BGLR and BGData R-packages was supported by NIH-NIGMS grant GM101219. GDLC also received financial support from NIH-NHGRI grant HG011674 and from Michigan State University.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
This study was entirely based on secondary analysis of de-identified data. The study was conducted under IRB permit LEGACY15–745: 15–745: Analysis and Prediction of Complex Traits and Disease Phenotypes Using Genomic Markers (CGA# 143415, 143206, 143549).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
de los Campos, G., Grueneberg, A., Funkhouser, S. et al. Fine mapping and accurate prediction of complex traits using Bayesian Variable Selection models applied to biobank-size data. Eur J Hum Genet 31, 313–320 (2023). https://doi.org/10.1038/s41431-022-01135-5
DOI: https://doi.org/10.1038/s41431-022-01135-5