Abstract
Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model for quantitative and binary phenotypes that is substantially faster than alternatives in multi-trait analyses while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. We introduce a fast, approximate Firth logistic regression test for unbalanced case–control phenotypes. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach using the UK Biobank dataset with up to 407,746 individuals.
Main
Since the first large genome-wide association studies^{1} were carried out in 2007, there has been a steady increase in sample size, now reaching hundreds of thousands of individuals, which is enabled by a parallel stream of methods with ever-increasing computational efficiency. Initial methods applied simple linear or logistic regression using programs such as SNPTEST^{1} and PLINK^{2}, but these have largely been replaced by the use of linear mixed models (LMMs) and the closely related whole-genome regression models. These approaches have been shown to account for population structure and relatedness, and offer advantages in power by conditioning on associated markers from across the whole genome^{3,4,5,6,7}.
The initial methods were focused on quantitative traits^{3} for studies with a few thousand samples and assumed a Gaussian distribution on SNP effect sizes. These approaches were extended to datasets including tens of thousands of individuals by computational strategies that avoided repeated matrix inversions when testing each SNP^{7,8}. Building on work from the plant and animal breeding literature^{9,10}, even more efficient whole-genome regression approaches were developed that allowed for more flexible (non-Gaussian) prior distributions of SNP effect sizes^{11,12}. BOLT-LMM and LEMMA are implementations of this approach^{13,14,15}. The fastGWA LMM approach reduces the computational time by using a sparse representation of the genetic correlations present in the sample^{16}. For simple linear regression of quantitative traits, the BGENIE method (https://jmarchini.org/BGENIE/) introduced the idea of the simultaneous analysis of multiple quantitative traits, which required only a single pass through the genetic data and provided substantial speed-ups over PLINK^{17}.
BOLT-LMM and fastGWA have also been applied to binary (case–control) traits when the case–control ratio is reasonably balanced and relatively common variants are tested for association. However, these approaches break down when applied to unbalanced case–control studies tested with rarer variants, such as those found in exome sequencing studies. The SAIGE method implements a logistic mixed-model approach and a saddlepoint approximation (SPA) to the null distribution of the test statistic, which is effective at controlling Type 1 errors^{18}.
The BOLT-LMM, fastGWA and SAIGE methods all proceed in two main steps that are applied one trait at a time. In Step 1, a model is fit to a set of SNPs from across the whole genome, such as all of the SNPs on a genotyping array. The resulting model fit is then used to create either a prediction of individual trait values based on the genetic data (in BOLT-LMM and SAIGE) or an estimate of the trait variance–covariance matrix (in fastGWA). In Step 2, a larger set of imputed or sequenced variants on the same set of samples is tested for association, conditional on the predictions or variance–covariance matrix from Step 1. This is usually carried out using the so-called leave-one-chromosome-out (LOCO) scheme, where each imputed SNP on a chromosome is tested conditional on the Step 1 predictions ignoring that chromosome. This approach avoids proximal contamination, which can reduce the power of association tests^{8,19}.
In this paper we propose a new machine-learning method within this two-step paradigm, called REGENIE (https://rgcgithub.github.io/regenie/), that is substantially faster than existing approaches. Extended Data Fig. 1 provides an overview of the REGENIE method. In Step 1, array SNPs are partitioned into consecutive blocks of B SNPs and a small set of J ridge regression predictions are generated from each block (this is referred to as Level 0). Within each block, the ridge regression predictors each use a slightly different shrinkage parameter. The idea of using a range of shrinkage values is to capture the unknown number and size of truly associated genetic markers in each window. This approach is equivalent to placing a Gaussian prior on the effect sizes of the SNPs in the block and finding the maximum a posteriori estimate of the effect sizes and the resulting prediction. One can think of these predictions as local polygenic scores that account for local linkage disequilibrium (LD) within blocks. Combining the predictions from across the genome results in a large reduction in the size of the genetic dataset. In this paper we use B = 1,000 and J = 5, and this reduces a set of M = 500,000 SNPs to JM/B = 2,500 predictors. The method then uses a second ridge regression (referred to as Level 1) to combine these predictors into a single predictor, which is then decomposed into 23 chromosome predictions for a LOCO approach. Linear or logistic regression is used at Level 1, depending on the phenotype. The resulting LOCO predictions are then used as a covariate in Step 2 when each imputed SNP is tested. This approach completely decouples Step 1 and Step 2 so that the Step 1 predictions can be reused when running Step 2 on distinct sets of markers (for example, imputed and exome markers) or even when distinct statistical tests are needed at Step 2.
All of the predictions at Level 0 and Level 1 are obtained within a cross-validation (CV) scheme (either K-fold CV or leave-one-out CV (LOOCV)) to prevent overfitting.
This approach exhibits a number of desirable properties. First, many of the calculations in Steps 1 and 2 can be carried out for multiple traits in parallel. This leads to substantial gains in speed as the files containing the variants in Steps 1 and 2 are read only once, rather than repeatedly for each trait. In practice, we find that for Step 1, REGENIE can be over 150× faster than BOLT-LMM and 300× faster than SAIGE when analyzing 50 UK Biobank quantitative and binary traits with up to 407,746 samples (Tables 1 and 2). In Step 2, each variant is tested for association and the overall computational burden will depend on the number of variants tested. For example, analyzing imputed variants across the genome will result in a higher computational burden for Step 2 relative to Step 1, but this will be less so when Step 2 involves testing coding variants from exome sequencing. The computational differences between methods in Step 2 are less extreme and mostly depend on the type of trait, test statistic and implementations of file format reading and parallelization schemes. However, REGENIE analyzes multiple traits in parallel and this can result in substantial computational savings, especially for quantitative traits. On an imputed dataset with 30 million tested variants and 50 traits, we find that over both Steps 1 and 2, REGENIE is 19.5× and 4.4× faster than BOLT-LMM and SAIGE, respectively. In the Supplementary Note and Supplementary Table 1 we provide an analysis of the computational complexity of REGENIE.
Second, in Step 1 of REGENIE, only B SNPs need to be stored in memory at once. The resulting low memory footprint can reduce costs on cloud-based platforms. Third, the method is applicable to both quantitative and binary traits and we have implemented a new, fast Firth logistic regression test as well as a SPA test for binary traits. Finally, our algorithm is ideally suited to implementation on distributed computing frameworks, such as Apache Spark, where both the dataset and the computation can be parallelized across a large number of machines. The main implementation of REGENIE is a standalone C++ program (https://rgcgithub.github.io/regenie/) but these methods have also been implemented for quantitative traits in the Apache Spark-based Glow project (http://projectglow.io; see Supplementary Note). All of the main experiments and results in the paper were obtained using the C++ program.
Results
Quantitative traits
Figure 1 shows the results of applying REGENIE, BOLT-LMM and fastGWA to three quantitative phenotypes measured on white British participants of the UK Biobank (low-density lipoprotein cholesterol, n = 389,189; body mass index, n = 407,609; and bilirubin, n = 388,303), where Step 2 testing was performed on 9.8 million imputed SNPs (see Supplementary Note). The Manhattan plots for all three phenotypes show good agreement between the methods (see also Extended Data Fig. 2) with both REGENIE and BOLT-LMM showing power gains relative to fastGWA at known peaks of association.
To demonstrate the advantages of analyzing multiple traits in parallel using REGENIE, we compared it to BOLT-LMM and fastGWA on a set of 50 quantitative traits from the UK Biobank, each with a distinct missing data pattern (Supplementary Table 2). Whereas REGENIE can analyze all traits at once within a single run of the software, the BOLT-LMM and fastGWA software must be run once for each of the 50 traits. Across all 50 traits, we found that the P values for REGENIE and BOLT-LMM were in very close agreement on the majority of traits tested, with some evidence that REGENIE is slightly less powerful for a few traits (Supplementary Fig. 1), whereas the fastGWA P values were noticeably deflated compared with REGENIE and BOLT-LMM. The compute time and memory usage of the three methods are given in Table 1. The table shows that in this 50-trait scenario, REGENIE is 151× faster than BOLT-LMM in elapsed time for Step 1 and 11.5× faster for Step 2, and this translates into an overall speed-up in terms of elapsed time of approximately 20× when projecting to 30 million tested variants obtained using imputation information score > 0.8. Similar to BOLT-LMM, Step 2 of REGENIE has been optimized for input genotype data in BGEN v1.2 format, which substantially reduced the runtime. In addition, REGENIE has a maximum memory usage of 12.9 GB, whereas BOLT-LMM required 50 GB. This low footprint is mostly due to REGENIE reading only a small portion of the genotype data at a time. To keep memory usage low when analyzing the 50 traits, within-block predictions are stored on disk and read separately for each trait working across blocks. The added input/output operations incur a small cost on the overall runtime but substantially decrease the amount of memory needed by REGENIE (Supplementary Table 3).
When running analyses on cloud-based services such as Amazon Web Services, these time and memory reductions both contribute to large reductions in cost as cheaper Amazon Web Services instance types can be used and for less time. In the same 50-trait scenario, we find that REGENIE is about 3× faster than fastGWA but fastGWA is very memory efficient and uses a maximum of only 2 GB.
Binary traits
In addition to analyzing quantitative traits, REGENIE was also designed for the analysis of binary traits, including those with unbalanced case–control ratios. REGENIE includes implementations of both Firth and SPA corrections to handle this scenario (see Methods). Figure 2 (see also Extended Data Fig. 3) shows the results of applying REGENIE, BOLT-LMM and SAIGE to four binary phenotypes measured on white British participants of the UK Biobank (coronary artery disease, N = 352,063; glaucoma, N = 406,927; colorectal cancer, N = 407,746; and thyroid cancer, N = 407,746) where Step 2 testing was performed on 11.6 million imputed SNPs (Supplementary Note). All four approaches demonstrated very good agreement for the most balanced trait (coronary artery disease; case–control ratio = 1:11), but as the fraction of cases decreased, BOLT-LMM tended to give inflated test statistics. However, REGENIE with either the Firth or the SPA correction and SAIGE are all robust to this inflation and show similar agreement for the associations detected.
The SPA approach calculates a standard score test statistic and approximates the null distribution, whereas the Firth correction uses a penalized likelihood approach to estimate the SNP effect-size parameters in an asymptotic likelihood-ratio test. Although both provide good control of Type 1 error for rare binary traits, we found that the SPA approach implemented in SAIGE can result in very inflated effect-size estimates (Supplementary Fig. 2). However, the Firth correction used in REGENIE provides reasonable effect-size estimates and standard errors when the minor allele count is low (Supplementary Table 4). The fast Firth correction that we developed agrees well with the exact Firth correction (Supplementary Figs. 3 and 4) but is approximately 60 times faster (Supplementary Table 5).
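For reference, the general form of the Firth penalized log-likelihood is reproduced here as a reading aid. This is the standard textbook definition, not the specific fast approximation implemented in REGENIE; the symbols \(\ell\), \(\ell^{\ast}\) and \(I(\beta)\) are notation introduced only for this illustration.

```latex
\[
\ell^{\ast}(\beta) \;=\; \ell(\beta) \;+\; \tfrac{1}{2}\,\log\left| I(\beta) \right|
\]
```

Here \(\ell(\beta)\) is the ordinary logistic log-likelihood and \(I(\beta)\) is the Fisher information matrix. The penalty term keeps the maximizer finite even when cases and controls are separated by genotype, a situation that tends to arise at low minor allele counts.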
To assess the computational resources needed to analyze a larger number of traits, we again ran REGENIE using Firth/SPA correction and SAIGE on a set of 50 binary traits from the UK Biobank with a range of different case–control ratios and distinct missing data patterns (Supplementary Table 6). The compute time and memory usage details are given in Table 2.
For Step 1, we found that REGENIE (using the LOOCV scheme) was about 350 times faster (CPU time of 777 versus 275,070 h) and required only 40% of the memory used by SAIGE (19.5 versus 49 GB). In Step 2, REGENIE–Firth and REGENIE–SPA were 2× and 3× faster than SAIGE in CPU time, respectively, but 21× and 34× faster than SAIGE in elapsed time, respectively, which suggests that REGENIE makes better use of parallelization in this step. Overall, in this 50-trait setting, REGENIE–Firth was 4.4× faster than SAIGE in terms of CPU time and 23× faster in elapsed time when projected to 30 million tested variants obtained using INFO > 0.8. REGENIE reduces the CO_{2} footprint by more than 85% compared with SAIGE (Supplementary Table 7). Supplementary Figs. 5–10 compare the accuracy of REGENIE and SAIGE across all 50 traits and show good agreement.
A large portion of the compute time in SAIGE is used to implement the LOCO, but it has been suggested that for binary traits, the effect of proximal contamination is not as substantial for less prevalent traits^{18}. We ran SAIGE without LOCO on the same 50 binary traits and observed that the impact of using LOCO was indeed more apparent with low case–control imbalance, where it can be highly beneficial (Extended Data Fig. 4 and Supplementary Figs. 11–13). These results would caution against the perfunctory use of SAIGE without LOCO for the analysis of all traits in a study such as the UK Biobank.
CV scheme
We implemented both a K-fold CV and a LOOCV scheme in Step 1 for both quantitative and binary traits (see Methods). Both approaches provided almost identical accuracy (Supplementary Figs. 14 and 15). For the dataset of 50 quantitative traits, LOOCV required 192 h of CPU time and 23.6 GB of memory, whereas the K-fold CV required 111 h of CPU time and 12.9 GB of memory (Table 1). For the dataset of 50 binary traits, the LOOCV approach required approximately 50% of the CPU time used by the K-fold CV approach (CPU time of 777 versus 1,590 h) and 65% more memory, but the elapsed time of the two methods was similar (108 versus 117 h). The LOOCV approach requires fewer of the relatively expensive logistic regression calls than the K-fold CV, but the extra calls needed by the K-fold CV are easily parallelized across multiple cores.
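One reason leave-one-out CV can be affordable for linear ridge regression is a well-known identity that yields all N held-out residuals from a single fit. The sketch below illustrates that identity only; the text above does not state how LOOCV is computed in REGENIE, so the function names and the use of this shortcut are assumptions made for illustration.

```python
import numpy as np

def ridge_loo_residuals(W, y, lam):
    """Leave-one-out residuals for ridge regression from one fit.

    Uses the identity e_loo[i] = e[i] / (1 - h[i]), where h[i] is the i-th
    diagonal entry of the hat matrix W (W'W + lam*I)^-1 W'.
    """
    A = W.T @ W + lam * np.eye(W.shape[1])
    S = np.linalg.solve(A, W.T)               # p x N, equals A^-1 W'
    leverage = np.einsum("ij,ji->i", W, S)    # diagonal of the hat matrix
    fitted = W @ (S @ y)
    return (y - fitted) / (1.0 - leverage)

def ridge_loo_residuals_brute_force(W, y, lam):
    """Same quantity obtained by refitting N times (for checking only)."""
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        A = W[keep].T @ W[keep] + lam * np.eye(W.shape[1])
        eta = np.linalg.solve(A, W[keep].T @ y[keep])
        out[i] = y[i] - W[i] @ eta
    return out
```

The first function costs one linear solve regardless of N, whereas the brute-force version refits the model N times. No such closed form exists for logistic regression, which is consistent with the binary-trait timings being driven by the number of logistic fits.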
Missing phenotype data
When analyzing multiple traits together with different missing data patterns, we use mean imputation of missing phenotype values in Step 1 but keep only samples with non-missing phenotypes in Step 2. This approach gave almost identical results to an exact approach that uses only samples with non-missing phenotypes in both Step 1 and Step 2 (Supplementary Figs. 16–18).
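A minimal sketch of this two-stage handling of missing phenotypes is given below. The function name `mean_impute` and the convention that missing values are encoded as `None` or NaN are assumptions made for illustration, not details taken from the REGENIE software.

```python
import math

def mean_impute(phenotype):
    """Return (imputed values, observed mask) for one trait.

    Missing entries (None or NaN) are replaced by the mean of the observed
    entries, as would be done for Step 1. The mask records which samples
    were actually observed, so that Step 2 can be restricted to them.
    """
    def is_missing(v):
        return v is None or math.isnan(v)

    observed = [v for v in phenotype if not is_missing(v)]
    if not observed:
        raise ValueError("phenotype has no observed values")
    mean = sum(observed) / len(observed)
    imputed = [mean if is_missing(v) else v for v in phenotype]
    mask = [not is_missing(v) for v in phenotype]
    return imputed, mask
```

For example, `mean_impute([1.0, None, 3.0])` returns `([1.0, 2.0, 3.0], [True, False, True])`: all three samples contribute to the Step 1 fit, while only the first and third would be tested in Step 2.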
Simulation studies
Through simulations, we investigated the Type 1 error and power of the tests in REGENIE, BOLT-LMM, fastGWA, SAIGE and simple linear/logistic regression with the top principal component as a covariate (PCA). We also assessed the LOCO scheme used in REGENIE and the accuracy of the effect-size estimates from REGENIE–Firth and SAIGE. We used real genetic array data from the UK Biobank to simulate realistic genetic LD patterns and population structure. We sampled 100,000 individuals from either the white British or the full European ancestry set of the UK Biobank and selected one of the following: (1) only unrelated individuals, (2) individuals at random or (3) half of the samples from the related individuals and the remaining half from the unrelated individuals. To consider scenarios of more extreme relatedness, we sampled only first-degree relatives (N = 22,990), first- and second-degree relatives (N = 30,775) or first- to third-degree relatives (N = 70,684) in the set of white British participants. More details can be found in the ‘Data simulation’ section of Methods.
We used genomic inflation (λ_{GC}) and empirical Type 1 error rate (defined as the proportion of null tests with a P value less than a nominal level α) to assess the calibration of the tests across 100 simulation replicates. For the quantitative traits, the PCA method was well calibrated when the sample consisted only of unrelated individuals but became inflated with increasing levels of relatedness (Extended Data Fig. 5 and Supplementary Tables 8,9). However, REGENIE, BOLT-LMM with a mixture-of-Gaussians model (BOLT-LMM-MoG), the BOLT-LMM infinitesimal model (BOLT-LMM-Inf) and fastGWA retained good Type 1 error control in all of the settings considered. REGENIE had slightly deflated Type 1 error rates when half of the samples were related. This was also observed in more extreme relatedness scenarios, where the use of the Step 1 predictions in REGENIE led to good calibration of the test unlike the PCA method, which was inflated (Supplementary Table 10).
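The two calibration metrics can be sketched as follows. The empirical Type 1 error follows the definition given above; for λ_{GC} the sketch assumes the common convention of dividing the median observed 1-d.f. χ² statistic by the median of the χ²_1 distribution (about 0.455), which may differ in detail from the computation used for the reported values. Function names are hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 degree of freedom (about 0.4549).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2

def p_to_chi2(p):
    """Convert a P value in (0, 1] to the matching 1-d.f. chi-square statistic."""
    return _STD_NORMAL.inv_cdf(p / 2.0) ** 2

def genomic_inflation(p_values):
    """Median observed chi-square statistic divided by its null median."""
    return median(p_to_chi2(p) for p in p_values) / CHI2_1DF_MEDIAN

def empirical_type1_error(null_p_values, alpha):
    """Proportion of null tests with a P value below the nominal level alpha."""
    return sum(p < alpha for p in null_p_values) / len(null_p_values)
```

A well-calibrated test gives λ_{GC} close to 1 and an empirical Type 1 error close to α; values above these indicate inflation and values below indicate a conservative test.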
For binary traits with low case–control imbalance, REGENIE–Firth, REGENIE–SPA, BOLT-LMM-MoG, BOLT-LMM-Inf and fastGWA had good control of the Type 1 error with the more common variants tested (Extended Data Fig. 6, Supplementary Fig. 19d and Supplementary Tables 11,12), and SAIGE had slightly deflated Type 1 error rates. BOLT-LMM-MoG, BOLT-LMM-Inf and fastGWA were inflated for more unbalanced traits and this was worse for rarer variants (Supplementary Fig. 19). However, REGENIE–Firth, REGENIE–SPA and SAIGE were robust against this inflation; for extremely unbalanced traits, REGENIE–SPA was conservative with rarer variants. In more extreme relatedness scenarios, the PCA method became inflated with higher heritability levels but REGENIE–Firth and REGENIE–SPA retained good control of the Type 1 error, although they became conservative for more heritable traits (Supplementary Table 13).
To quantify power, we used the mean χ^{2} test statistic at causal SNPs. For quantitative traits, REGENIE and BOLT-LMM-Inf had similar power performance, which was higher than for fastGWA across all settings (Supplementary Fig. 20 and Supplementary Table 14). BOLT-LMM-MoG had the highest power performance with fewer causal SNPs, and the power difference with REGENIE decreased as the number of causal SNPs increased. BOLT-LMM-MoG uses a more flexible mixture-of-Gaussians prior, which may model traits with highly non-infinitesimal genetic architectures better. For binary traits, we compared REGENIE–Firth, REGENIE–SPA and SAIGE, which were all well calibrated. With low case–control imbalance, REGENIE–Firth and REGENIE–SPA had slightly higher power than SAIGE. For more unbalanced traits, REGENIE–Firth and SAIGE had similar power performance and REGENIE–SPA had slightly lower performance at rarer variants (Supplementary Fig. 21 and Supplementary Table 15).
We further investigated the accuracy of the effect sizes from REGENIE–Firth and SAIGE. They were similar for moderately unbalanced traits but as the case–control imbalance increased, the estimates from SAIGE became highly inflated (Extended Data Fig. 7). Finally, when comparing the approximate LOCO scheme in REGENIE to an exact LOCO scheme, we found that they gave similar results, both for the genetic predictions in Step 1 (squared sample correlation coefficient (R^{2}) = 0.99932 for quantitative traits and R^{2} = 0.98710 for binary traits) and the P values in Step 2 (Supplementary Fig. 22).
We assessed the single-trait performance of REGENIE, BOLT-LMM and SAIGE and found that REGENIE takes approximately 3× less CPU time than BOLT-LMM for quantitative traits and >8× less CPU time than SAIGE for binary traits (Supplementary Table 16). With five traits, the computational efficiency of REGENIE improved: it took approximately 10× less CPU time than BOLT-LMM and >22× less CPU time than SAIGE. More generally, we observed that Step 1 of REGENIE scales sublinearly with the number of traits and the scaling gets closer to linear when the number of traits becomes large (Extended Data Fig. 8).
Interchromosomal LD in the UK Biobank
While developing REGENIE, we and others^{20} identified an anomaly in the UK Biobank array genotypes that leads to reduced performance of some of the LMMs being tested. We observed a sizeable number of SNP pairs that exhibited interchromosome LD, which breaks the assumptions of the LOCO scheme and can result in loss of power when any one of the SNPs in a pair is associated with a trait (see Supplementary Note).
Discussion
In this study we present a machine-learning method that implements simultaneous whole-genome regression of multiple quantitative or binary traits. The method uses a strategy that splits computation into blocks of consecutive SNPs and does not require loading of a genome-wide set of SNPs into memory. This approach also facilitates the analysis of multiple traits in parallel. Overall, this results in substantial computational savings in terms of both CPU time and memory usage compared with existing methods such as BOLT-LMM, fastGWA and SAIGE. As the number of large-scale cohorts with deep phenotyping grows, this approach will probably become even more relevant. The parallel nature of the approach is ideally suited to distributed environments such as Apache Spark. We have developed a first version of REGENIE for quantitative traits within the Glow project as well as the full version of the method for quantitative and binary traits in a standalone C++ program with source code that is openly available.
Analysis of large cohorts for which phenotypes are derived from electronic health records often results in many binary traits with substantial case–control imbalance. REGENIE is applicable to binary traits and we have proposed an approximate Firth regression approach, which we show is almost identical to an exact Firth regression implementation, and much faster. This approach has the added benefit that it avoids the parameter estimate inflation that occurs when SAIGE is used to analyze ultra-rare variants.
Like many existing mixed-model-based approaches, REGENIE is well able to handle relatedness in the sample, although it can become conservative in more extreme cases. Hence, we recommend that it not be used for smaller cohorts with high levels of relatedness, such as founder populations, where exact mixed-model methods can be used. As previous methods have proposed to address this issue^{13,18,21}, we plan to explore extending REGENIE to compute and incorporate a calibration factor in its association testing step.
The approach used in REGENIE is inspired by, but not the same as, the machine-learning approach of stacked regressions^{22}. REGENIE uses ridge regression to combine a set of correlated predictors, whereas Breiman’s stacking approach used non-negative least squares to combine a set of highly correlated predictions in an ensemble learning approach. We have not yet investigated whether non-negative least squares might have advantages here. Furthermore, our simulations with traits that have a sparser genetic architecture also highlight the potential improvement of the REGENIE method by using more flexible priors on the effect size of predictors, as is done in BOLT-LMM with a mixture-of-Gaussians prior.
There are many other potential avenues for development of this approach. It will be easy to expand the functionality to include tests such as SNP × covariate interactions^{16}, variance tests^{23} and a whole range of gene-based tests^{24,25,26}. Multivariate probit regression for binary traits^{27}, multivariate linear regression for quantitative traits^{28} and multi-trait burden tests^{29} will all be straightforward to implement.
We also plan to investigate whether REGENIE can be extended to handle time-to-event data^{30} and multinomial regression in a mixed-model framework^{31,32}. We suspect it may also be possible to leverage the REGENIE output to estimate SNP heritability, polygenic scores and multi-trait missing data imputation using mixed models on a scale that is not possible using the existing approaches^{33}.
One novel application would be to use REGENIE to analyze cohorts that have undergone both RNA-sequencing and either whole-genome SNP genotyping or sequencing. In this setting, the expression levels of up to 20,000 genes would represent the multiple traits of interest, and running a whole-genome regression analysis would allow for joint inference of cis and trans expression quantitative trait loci in a single analysis. This would be equivalent to an LMM analysis of an RNA-sequencing study, which has been performed in previous studies^{34,35}.
Cohorts will continue to grow in terms of sample size, the number of phenotypes and the number of variants available for testing, either via imputation from whole-genome-sequenced reference panels or via direct whole-genome sequencing of the study samples. It seems clear to us that Step 1 of the whole-genome regression paradigm is now highly computationally tractable using the REGENIE approach. However, further advances will be needed to reduce the compute time in Step 2, as whole-genome sequencing produces ever-increasing numbers of rare variants. Efficient utilization of the sparsity of such variants will help to improve memory efficiency and substantially reduce the cost of computation.
Methods
Whole-genome linear regression
In a sample of N individuals, y denotes the N-element phenotype vector, G represents the N × M genotype matrix, where G_{ij} ∈ {0, 1, 2} is the allele count for individual i at the jth marker and X represents the N × C matrix of covariates (including an intercept), which is assumed to be full rank. We consider a whole-genome regression model
\[ \mathbf{y} = X\alpha + G_{\mathrm{S}}\beta + \mathbf{\upepsilon} \qquad (1) \]
where α are the fixed covariate effects, G_{S} is a standardized version of G in which the genotypes have been transformed to have a mean of zero and variance of one, β ~ MVN(0, \(\sigma_g^2\)I_{M}) and \(\mathbf{\upepsilon}\) ~ MVN(0, \(\sigma_e^2\)I_{N}), where MVN denotes the multivariate normal distribution. This is the standard infinitesimal model, which can also be rewritten as
\[ \mathbf{y} = X\alpha + \mathbf{g} + \mathbf{\upepsilon} \qquad (2) \]
with g ~ MVN(0, \(\sigma_a^2\)K), where \(K = G_{\mathrm{S}}G_{\mathrm{S}}^{T}/M\) is usually referred to as a genetic-relatedness matrix or empirical kinship matrix and \(\sigma_a^2 = M\sigma_g^2\) is the additive polygenic variance.
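The equivalence of the two formulations can be checked by computing the first two moments of the aggregate genetic effect. The short derivation below is a reading aid that uses only the definitions given above, with \(\mathbf{g} = G_{\mathrm{S}}\beta\) taken as the definition of the polygenic term.

```latex
\[
\mathrm{E}[\mathbf{g}] = G_{\mathrm{S}}\,\mathrm{E}[\beta] = \mathbf{0},
\qquad
\mathrm{Cov}(\mathbf{g})
  = G_{\mathrm{S}}\,\mathrm{Cov}(\beta)\,G_{\mathrm{S}}^{T}
  = \sigma_g^2\,G_{\mathrm{S}}G_{\mathrm{S}}^{T}
  = M\sigma_g^2\,\frac{G_{\mathrm{S}}G_{\mathrm{S}}^{T}}{M}
  = \sigma_a^2 K .
\]
```

Because \(\mathbf{g}\) is a linear transformation of a multivariate normal vector, it is itself multivariate normal with these moments.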
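Covariate effects are removed from both the trait and the genotypes in equation (1) by first computing an orthonormal basis for the covariates, projecting the genotypes and the trait onto that basis and then subtracting out the resulting vectors to obtain the residuals. This is equivalent to using a projection matrix \(P_{X} = I_N - X(X^{T}X)^{-1}X^{T}\) with
\[ P_X\mathbf{y} = P_X G_{\mathrm{S}}\beta + P_X\mathbf{\upepsilon} \qquad (3) \]
\[ \tilde{\mathbf{y}} = \tilde{G}\beta + \tilde{\mathbf{\upepsilon}} \qquad (4) \]
where \(\tilde{\mathbf{y}} = P_X\mathbf{y}\), \(\tilde{G} = P_X G_{\mathrm{S}}\) and \(\tilde{\mathbf{\upepsilon}} = P_X\mathbf{\upepsilon}\).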
Both the genotype and phenotype residuals are then scaled to have a variance of one.
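The residualization step can be sketched with a QR decomposition, which provides the orthonormal basis for the covariates. This is an illustrative sketch rather than the REGENIE code: the function names are hypothetical, and the use of the population variance (ddof = 0) when rescaling is an assumption.

```python
import numpy as np

def residualize(X, A):
    """Remove the column space of X (N x C, full rank) from the columns of A."""
    Q, _ = np.linalg.qr(X)       # Q: N x C orthonormal basis for the covariates
    return A - Q @ (Q.T @ A)     # subtract the projection onto that basis

def scale_unit_variance(A):
    """Rescale each column to variance one (population variance, an assumption)."""
    return A / A.std(axis=0)
```

The result of `residualize(X, A)` equals \(P_X A\) with \(P_X = I_N - X(X^{T}X)^{-1}X^{T}\), but it never forms the N × N projection matrix, so the cost stays linear in N for a fixed number of covariates.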
Stacked block ridge regression
Fitting equation (1) is computationally intensive given that G typically has many hundreds of thousands of columns. Instead, for Step 1, we transform the model to
\[ \tilde{\mathbf{y}} = W\eta + \tilde{\mathbf{\upepsilon}} \qquad (5) \]
where W is a matrix derived from G with substantially fewer columns. Specifically, we divide G into blocks of B consecutive and non-overlapping SNPs, and from each block we derive a small set of predictors using ridge regression across a range of J shrinkage parameters (see Supplementary Note). The idea behind using a range of shrinkage values is to capture the unknown number and size of truly associated genetic markers within each window. This approach is equivalent to placing a Gaussian prior on the effect sizes of the SNPs in the block and finding the maximum a posteriori estimate of the effect sizes and the resulting prediction. Another approach would be to integrate out the effect sizes over the Gaussian prior to obtain the best linear unbiased prediction^{36} but we have not investigated that approach in this paper.
The ridge predictors are rescaled to have unit variance and are stored in place of the genetic markers in matrix W, providing a large reduction in data size. If M = 500,000, B = 1,000 and J = 5 are used, then the reduced dataset will have JM / B = 2,500 predictors. We refer to this part of the method as the Level 0 ridge regression.
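The Level 0 construction can be sketched as follows. This is a simplified illustration under stated assumptions: the function name and the shrinkage grid passed by the caller are hypothetical, the inputs are assumed to be already residualized and scaled, and the CV scheme within which the actual predictions are obtained is omitted for brevity.

```python
import numpy as np

def level0_predictors(G, y, block_size, shrinkage):
    """Return an N x (n_blocks * J) matrix W of unit-variance ridge predictions.

    G: N x M genotype residuals, y: length-N phenotype residuals,
    block_size: B, shrinkage: sequence of J ridge penalties.
    Assumes no prediction is exactly constant (otherwise its std is zero).
    """
    n_snps = G.shape[1]
    columns = []
    for start in range(0, n_snps, block_size):
        block = G[:, start:start + block_size]   # only this block is needed
        gram = block.T @ block                   # shared across all penalties
        rhs = block.T @ y
        for lam in shrinkage:
            beta = np.linalg.solve(gram + lam * np.eye(block.shape[1]), rhs)
            pred = block @ beta
            columns.append(pred / pred.std())    # rescale to unit variance
    return np.column_stack(columns)
```

With M = 500,000, B = 1,000 and J = 5 this loop would visit 500 blocks and emit 5 columns per block, giving the 2,500 predictors quoted above. Note that the Gram matrix and right-hand side are computed once per block and reused for every shrinkage value.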
To keep the memory usage low when analyzing multiple traits, the within-block predictions are stored on disk and read separately for each trait when fitting models at Level 1 (see below). The added input/output operations incur a small cost on the overall runtime but substantially decrease the amount of memory needed.
The ridge regression takes account of the LD within each block but not between blocks. One option that we have considered, but not implemented yet, is to condition the ridge regression on the estimates from the previous block, which may better account for LD across block boundaries.
The predictors in W will all be positively correlated with the phenotype. Thus, it is important to account for that correlation when building a whole-genome regression model. The predictors will also be correlated with each other, especially within each block, but also between blocks that are close together due to LD. We use a second level of ridge regression on W for a range of shrinkage parameters and choose a single best value using the K-fold CV scheme^{22}. This assesses the predictive performance of the model using held-out sets of data and aims to control any overfitting induced by using the first level of ridge regression to derive the predictors (see Supplementary Note). We refer to this part of the method as the Level 1 ridge regression.
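Choosing the Level 1 shrinkage parameter by K-fold CV can be sketched as below. The selection criterion (lowest held-out mean squared error), the number of folds and the function names are assumptions made for illustration; the exact procedure is described in the Supplementary Note.

```python
import numpy as np

def ridge_fit(W, y, lam):
    """Ridge estimate of the Level 1 effects for one shrinkage value."""
    return np.linalg.solve(W.T @ W + lam * np.eye(W.shape[1]), W.T @ y)

def choose_shrinkage_kfold(W, y, shrinkage, n_folds=5, seed=0):
    """Return (best shrinkage value, held-out MSE for each candidate)."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    mse = []
    for lam in shrinkage:
        sq_err = 0.0
        for held_out in folds:
            train = np.setdiff1d(np.arange(n), held_out)
            eta = ridge_fit(W[train], y[train], lam)   # fit without the fold
            sq_err += np.sum((y[held_out] - W[held_out] @ eta) ** 2)
        mse.append(sq_err / n)
    best = int(np.argmin(mse))
    return shrinkage[best], mse
```

Every sample is predicted exactly once by a model that never saw it, so the reported error reflects out-of-sample performance rather than fit to the training data.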
The result of this model fit is a single N × 1 predicted phenotype \(\hat{\mathbf{y}}^{\ast}\), and this can be partitioned into 23 LOCO predictions (denoted \(\hat{\mathbf{y}}_{\mathrm{LOCO}}^{\ast}\)), which are used when testing SNPs for association in Step 2 to avoid proximal contamination (see Supplementary Note).
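One way such a partition could be computed is sketched below: because the Level 1 prediction is a sum of per-predictor contributions, the contribution of a chromosome can be subtracted from the full prediction. This is an assumption about the mechanics, made for illustration; the function name and the `predictor_chrom` labelling are hypothetical, and the actual scheme is described in the Supplementary Note.

```python
import numpy as np

def loco_predictions(W, eta, predictor_chrom):
    """Map each chromosome c to the prediction built without predictors on c.

    W: N x P matrix of Level 0 predictors, eta: length-P Level 1 effects,
    predictor_chrom: length-P labels giving the chromosome of each predictor.
    """
    predictor_chrom = np.asarray(predictor_chrom)
    full = W @ eta
    loco = {}
    for c in np.unique(predictor_chrom).tolist():
        on_c = predictor_chrom == c
        loco[c] = full - W[:, on_c] @ eta[on_c]   # drop chromosome c's share
    return loco
```

Reusing a single set of effects for every chromosome avoids refitting the model once per chromosome, at the price of being an approximation to an exact leave-one-chromosome-out fit.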
Association testing
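When testing for association of the phenotype with a variant g in Step 2, we consider a simple linear model
\[ \hat{\mathbf{y}}_{\mathrm{resid},\mathrm{LOCO}}^{\ast} = \tilde{\mathbf{g}}\beta + \tilde{\mathbf{\upepsilon}} \qquad (6) \]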
where \(\hat{\mathbf{y}}_{\mathrm{resid},\mathrm{LOCO}}^{\ast} = \tilde{\mathbf{y}} - \hat{\mathbf{y}}_{\mathrm{LOCO}}^{\ast}\) refers to the phenotype residuals where the polygenic effects estimated from the null model with LOCO have been removed, \(\tilde{\mathbf{g}} = P_X\mathbf{g}\) are residuals obtained from removing the covariate effects from the tested variant and \(\tilde{\mathbf{\upepsilon}} = P_X\mathbf{\upepsilon}\) with \(\mathbf{\upepsilon}\) ~ MVN(0, \(\sigma_e^2\)I_{N}).
A score test statistic for H_{0}: β = 0 is
\[T_{{\mathrm{linear}}}^2 = \frac{\left({\tilde{{\mathbf{g}}}}^{\mathrm{T}}{\hat{{\mathbf{y}}}}_{{\mathrm{resid}},{\mathrm{LOCO}}}^ \ast\right)^2}{\hat \upsigma _e^2\,\Vert {\tilde{{\mathbf{g}}}}\Vert _2^2} \qquad (7)\]
where we use \(\hat \upsigma _e^2 = \Vert {\hat{{\mathbf{y}}}}_{{\mathrm{resid}},{\mathrm{LOCO}}}^ \ast \Vert _2^2/(N - C)\). In equation (7), when estimating the variance of the numerator, we assume that the polygenic effects are given, which leads to the denominator involving only O(N) computation. While other methods make use of a calibration factor in the denominator to account for the variance of the polygenic effects^{13,18,21}, we found in applications that the results obtained using this simple form match up closely to those using a calibration factor. Finally, we use a normal approximation, \(T_{{\mathrm{linear}}}^2\sim \chi _1^2\), to estimate the P value. As with Step 1 above, the REGENIE software reads the genetic data file in blocks of B SNPs and these are processed together, taking advantage of parallel linear algebra routines in the Eigen library.
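The following minimal sketch computes a score statistic of this kind for one variant, assuming the standard form T^2 = (g̃^T y_resid)^2 / (σ̂_e^2 ‖g̃‖^2) with σ̂_e^2 = ‖y_resid‖^2/(N − C). The function name and the toy vectors are illustrative assumptions; the inputs are taken to be already residualized as described above.

```python
import math

def score_test_linear(y_resid, g_tilde, n_covariates):
    """Score statistic for H0: beta = 0 and its chi-square (1 d.f.) P value.

    y_resid: phenotype residuals with the LOCO prediction removed.
    g_tilde: genotype vector with covariate effects projected out.
    """
    n = len(y_resid)
    numerator = sum(g * y for g, y in zip(g_tilde, y_resid)) ** 2
    sigma2 = sum(y * y for y in y_resid) / (n - n_covariates)
    g_norm2 = sum(g * g for g in g_tilde)
    t2 = numerator / (sigma2 * g_norm2)
    # For 1 d.f., P(chi2 > t2) = P(|Z| > sqrt(t2)) = erfc(sqrt(t2 / 2)).
    p_value = math.erfc(math.sqrt(t2 / 2.0))
    return t2, p_value

t2, p = score_test_linear([1.0, -1.0, 2.0, -2.0], [1.0, -1.0, 1.0, -1.0], n_covariates=1)
print(t2, p)
```

Every quantity is a single pass over the N samples, which is consistent with the O(N) cost per variant noted in the text.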
Multiple traits
Both Step 1 and Step 2 above are easily extended so that multiple phenotypes can be processed in parallel. The genetic data files in both steps can be read once, in blocks of B SNPs, which means the method uses a small amount of memory. In addition, the linear algebra operations for the covariate residualization, ridge regression and association testing can be shared across traits. This is similar to the approach implemented in the BGENIE software for single SNP linear regression analysis^{17}. The fine details of the multiple phenotype approach are given in the Supplementary Note.
Binary traits
For binary traits, we use exactly the same Level 0 ridge regression approach, which effectively treats the trait as if it were quantitative. However, at Level 1, instead of a linear regression in equation (5) we use logistic regression
\[{\mathrm{logit}}(p_i) = X_i^{\mathrm{T}}{\mathbf{\upalpha}} + W_i^{\mathrm{T}}{\mathbf{\upeta}} \qquad (8)\]
where p_{i} = E(y_{i}) = P(y_{i} = 1) with y_{i} indicating the case status of the ith individual, X_{i} is the covariate vector for the ith individual, α are the fixed covariate effects, W_{i} are the within-block (BR) predictions for the ith individual and η = (η_{1},…, η_{BR})^{T}, with η_{i} ~ N(0, 1/ω). This model corresponds to logistic regression with ridge penalty applied to the effects of within-block predictions in W. We approximate the model in equation (8) by first fitting a null model for each trait that only has covariate effects
\[{\mathrm{logit}}(p_i) = X_i^{\mathrm{T}}{\mathbf{\upalpha}} \qquad (9)\]
and then using the resulting estimated effects as an offset in the model in equation (8),
\[{\mathrm{logit}}(p_i) = X_i^{\mathrm{T}}\hat{\mathbf{\upalpha}} + W_i^{\mathrm{T}}{\mathbf{\upeta}} \qquad (10)\]
where \(\hat{\mathbf{\upalpha}}\) represents the effects estimated in equation (9). As the covariate effects are not expected to change substantially (unless correlation between covariates and block predictions are very large), this approximation is expected to work well in most analyses.
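The offset approximation can be illustrated with a small ridge-penalized logistic regression in which the covariate part of the linear predictor is held fixed and only the effects η of the block predictions are estimated. This is a minimal sketch: it uses plain gradient ascent rather than whatever optimizer REGENIE uses, and the function names, the toy data and the constant offset are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def penalized_gradient(W, y, offset, omega, eta):
    """Gradient of sum_i loglik_i(eta) - (omega / 2) * ||eta||^2 with respect to eta."""
    grad = [-omega * e for e in eta]
    for row, yi, oi in zip(W, y, offset):
        resid = yi - sigmoid(oi + sum(w * e for w, e in zip(row, eta)))
        for j, w in enumerate(row):
            grad[j] += w * resid
    return grad

def fit_logistic_ridge_offset(W, y, offset, omega, n_iter=5000):
    """Estimate eta with the offset (for example X_i' alpha_hat) held fixed."""
    p = len(W[0])
    # 1/L step size: L bounds the curvature, since p(1 - p) <= 1/4.
    L = sum(sum(w * w for w in row) for row in W) / 4.0 + omega
    eta = [0.0] * p
    for _ in range(n_iter):
        grad = penalized_gradient(W, y, offset, omega, eta)
        eta = [e + g / L for e, g in zip(eta, grad)]
    return eta

# Toy block predictions, case status and a constant offset standing in
# for the fitted covariate-only model.
W = [[1.0, 0.2], [0.8, -0.5], [-0.3, 0.9], [-1.2, 0.1],
     [0.5, 1.1], [-0.7, -0.8], [0.1, -1.0], [-0.9, 0.4]]
y = [1, 1, 0, 0, 1, 0, 1, 0]
offset = [-0.4] * 8
eta_hat = fit_logistic_ridge_offset(W, y, offset, omega=1.0)
print(eta_hat)
```

Because the covariate effects enter only through the fixed offset, the optimization involves the BR ridge-penalized coefficients alone, which is the computational saving the approximation is aiming for.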
As with quantitative traits, we used K-fold CV to choose the Level 1 ridge regression parameter. However, for extremely unbalanced traits, it may happen that one of the folds contains no cases. To avoid this situation, we also implemented an efficient version of LOOCV. Although at first sight it may seem that LOOCV is more computationally intensive than K-fold CV given that the model has to be fitted N times (rather than K times) on data with N − 1 samples, the leave-one-out estimates can actually be obtained (approximately for binary traits) from rank 1 updates to the results from fitting the model once to the full data (see Supplementary Note). In practice we have found that LOOCV gives similar association results to K-fold CV (Supplementary Figs. 14 and 15) and can be computationally faster in some cases (see Tables 1 and 2). A LOCO scheme is applied to the polygenic effect estimates and the resulting predictions \({\hat{{\mathbf{w}}}}_{{\mathrm{LOCO}}} = W_{{\mathrm{LOCO}}}\hat{\mathbf{\upeta}} _{{\mathrm{LOCO}}}\) are then stored.
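To make the leave-one-out shortcut concrete, the sketch below uses the simplest case, a linear ridge regression with a single predictor and no intercept, where the held-out prediction for sample i follows from the full-data fit through the leverage h_i. The restriction to one predictor, the function names and the toy data are assumptions for illustration; with several predictors h_i would be the ith diagonal element of W(W^T W + λI)^{-1}W^T, and for logistic models the identity holds only approximately.

```python
def ridge_1d(w, y, lam):
    """Ridge coefficient for a single predictor without intercept."""
    return sum(wi * yi for wi, yi in zip(w, y)) / (sum(wi * wi for wi in w) + lam)

def loocv_shortcut(w, y, lam):
    """Leave-one-out predictions obtained from one fit to the full data."""
    denom = sum(wi * wi for wi in w) + lam
    eta = ridge_1d(w, y, lam)
    preds = []
    for wi, yi in zip(w, y):
        h = wi * wi / denom                    # leverage of sample i
        loo_resid = (yi - wi * eta) / (1.0 - h)
        preds.append(yi - loo_resid)
    return preds

def loocv_refit(w, y, lam):
    """Reference implementation: refit N times, leaving one sample out each time."""
    preds = []
    for i in range(len(y)):
        eta_i = ridge_1d(w[:i] + w[i + 1:], y[:i] + y[i + 1:], lam)
        preds.append(w[i] * eta_i)
    return preds

w = [0.5, -1.2, 0.3, 2.0, -0.7, 1.1]
y = [1.0, -0.8, 0.1, 2.2, -1.5, 0.4]
fast = loocv_shortcut(w, y, lam=2.0)
slow = loocv_refit(w, y, lam=2.0)
print(max(abs(a - b) for a, b in zip(fast, slow)))
```

The two routines agree up to rounding error, while the shortcut needs one model fit instead of N.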
In Step 2, we use a logistic regression model score test to test for association between each marker and binary trait. Covariate effect sizes are estimated along with genetic marker effect sizes but we include the LOCO predictions from Step 1 as a fixed offset (see Supplementary Note).
When rare variants are tested for association with a highly unbalanced trait (that is, a trait that has low sample prevalence), the use of asymptotic test statistic distributions does not work well and results in elevated Type 1 error rates. REGENIE implements several methods to handle this situation. First, it includes the SPA test^{37}, which is also included in SAIGE^{18}. This approach better approximates the null distribution of the test statistic but we have found that it can sometimes fail to produce good estimates of SNP effect sizes and standard errors, which are highly desirable for meta-analysis applications (Supplementary Table 4 and Supplementary Fig. 2).
Second, we use Firth logistic regression, which uses a penalized likelihood to remove much of the bias from the maximum-likelihood estimates in the logistic regression model. This approach results in well-calibrated Type 1 errors and usable SNP effect sizes and standard errors. Given that the use of Firth regression can be relatively computationally intensive, we have developed an approximate Firth regression approach that is much faster (Supplementary Table 5), which involves estimating the covariate effects in a null Firth regression model and then including covariate effects along with the LOCO genetic predictor as offset terms in a Firth logistic regression test (see Supplementary Note). In practice, we have found this approximation to give very similar results to when the exact Firth test is used (Supplementary Fig. 3). This approach has been used to analyze COVID-19 outcomes across four studies and four ancestries^{38}, and proved vital to provide accurate effect-size estimates for the meta-analysis.
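A minimal sketch of a Firth test with offset terms is given below. It assumes that the covariate effects and the LOCO predictor have already been combined into one offset per sample, so that only the variant effect β is estimated and the Firth penalty reduces to half the log of the scalar information I(β). The function names, the capped Newton-type update and the toy data are assumptions for illustration and are not taken from the REGENIE code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def firth_terms(beta, g, y, offset):
    """Penalized log-likelihood, modified score and information at beta."""
    loglik = score = info = dinfo = 0.0
    for gi, yi, oi in zip(g, y, offset):
        p = sigmoid(oi + gi * beta)
        loglik += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
        score += gi * (yi - p)
        info += gi * gi * p * (1.0 - p)
        dinfo += gi ** 3 * p * (1.0 - p) * (1.0 - 2.0 * p)  # d(info)/d(beta)
    pen_loglik = loglik + 0.5 * math.log(info)
    mod_score = score + 0.5 * dinfo / info
    return pen_loglik, mod_score, info

def firth_offset_test(g, y, offset, max_iter=500, tol=1e-10, max_step=1.0):
    """Return (beta_hat, standard error, penalized LRT statistic, P value)."""
    beta = 0.0
    for _ in range(max_iter):
        _, mod_score, info = firth_terms(beta, g, y, offset)
        step = max(-max_step, min(max_step, mod_score / info))  # capped update
        beta += step
        if abs(step) < tol:
            break
    pl_hat, _, info = firth_terms(beta, g, y, offset)
    pl_null, _, _ = firth_terms(0.0, g, y, offset)
    lrt = max(0.0, 2.0 * (pl_hat - pl_null))
    return beta, 1.0 / math.sqrt(info), lrt, math.erfc(math.sqrt(lrt / 2.0))

# Rare variant carried only by the two cases: the unpenalized estimate diverges,
# whereas the Firth estimate stays finite.
g = [1, 1, 0, 0, 0, 0, 0, 0]
y = [1, 1, 0, 0, 0, 0, 0, 0]
offset = [math.log(0.25 / 0.75)] * 8
beta_hat, se, lrt, p_value = firth_offset_test(g, y, offset)
print(beta_hat, se, lrt, p_value)
```

In this toy example all carriers are cases, which is the kind of configuration in which a finite effect-size estimate and standard error are needed for meta-analysis.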
Handling missing data
As a key goal of our approach is to analyze multiple traits all at once, one issue that remains to be addressed is the presence of ‘missingness’ in the data, which could differ among the traits. We consider different approaches based on the nature of the trait as well as whether the null model is being fitted or whether association testing is being performed.
For quantitative traits, missing data are handled when fitting the null model by replacing the missing values with the sample average for each trait. In the association testing step, individuals with missing phenotype observations are removed from the analysis for each trait, which is done by ensuring that, when taking sums over individuals, those with a missing phenotype have a zero contribution to the sum. This is similar to the approach implemented in the BGENIE software (https://jmarchini.org/BGENIE/) for single-SNP linear regression analysis^{17}. We assume that covariates are fairly well-balanced in the sample and project them out of the phenotypes using all of the samples (that is, ignoring the missingness within each trait). In the case where phenotypes have the same or very similar patterns of missingness, or if only a single phenotype is being analyzed, it may be more logical to discard the missing observations rather than impute them with the sample averages per trait. Hence, we implement an alternative approach where, in both the null-fitting and the association testing steps, all samples with missingness at any of the P phenotypes are dropped. An approach that we have not yet implemented, but that may produce better results for quantitative traits, would involve using a multivariate normal model to jointly model correlation between the set of traits and impute missing data, either before or conditional on the output of Step 1.
For binary traits, we use the mean-imputed phenotypes to fit the Level 0 linear ridge regression models within blocks but discard missing observations when fitting Level 1 logistic ridge regressions. As the logistic ridge regressions are fitted separately for each trait, this makes it straightforward to account for the missingness patterns separately for each trait. Similarly, in the testing step, we discard missing observations when fitting logistic regression for each trait as well as when using Firth or SPA corrections.
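The two ways of handling missing phenotypes for quantitative traits can be sketched as follows: mean imputation for the null-model fit, and sums in which missing entries contribute zero for the association step. Representing a missing value as `None`, the function names and the toy data are assumptions made for this example.

```python
def mean_impute(y):
    """Replace missing values (None) by the mean of the observed values."""
    observed = [v for v in y if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in y]

def masked_cross_products(g, Y):
    """Return sum_i g_i * y_ip for each trait p, skipping missing phenotypes.

    g: genotype values for N samples; Y: N rows with one entry per trait."""
    n_traits = len(Y[0])
    sums = [0.0] * n_traits
    for gi, row in zip(g, Y):
        for p, v in enumerate(row):
            if v is not None:          # a missing phenotype contributes zero
                sums[p] += gi * v
    return sums

g = [0.5, -1.0, 1.5]
Y = [[1.0, None],
     [2.0, 0.5],
     [None, -1.0]]
print(mean_impute([1.0, None, 3.0]))
print(masked_cross_products(g, Y))
```

A single pass over the genotype vector serves all traits at once, each with its own missingness pattern.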
UK Biobank dataset
The UK Biobank^{17} (http://www.ukbiobank.ac.uk) is a large prospective study of about 500,000 individuals who are 40–69 years old and for whom extensive phenotype information is being recorded. Genotyping was performed using the Affymetrix UK BiLEVE Axiom array on an initial set of 50,000 participants and the Affymetrix UK Biobank Axiom array was used for the remaining participants. Up to 11,914,699 variants imputed by the Haplotype Reference Consortium panel that either have a minor allele frequency above 0.5% or a minor allele count above five and are annotated as functional in 462,428 samples of European ancestry were used in the data analyses. We selected up to 407,746 individuals of white British ancestry for whom genotype and imputed data were available and applied quality-control filters on the genotype data using PLINK2 (ref. ^{39}; version v2.00aLM, https://www.cog-genomics.org/plink2) that included: a minor allele frequency of ≥1%, a Hardy–Weinberg equilibrium test not exceeding P = 1 × 10^{−15}, a genotyping rate above 99%, not present in low-complexity regions, not involved in interchromosomal LD and LD pruning using an R^{2} threshold of 0.9 with a window size of 1,000 markers and a step size of 100 markers. This resulted in up to 471,762 genotyped SNPs that were kept in the analyses.
Data simulation
We performed simulations to assess the performance of the tests in REGENIE under various population-structure configurations for both quantitative and binary traits. To mimic realistic scenarios, we used genotype array data from the UK Biobank European samples (679,209 array SNPs with a minor allele count > 5). We considered scenarios with 100,000 samples obtained from the set of white British participants or from the full European set so as to incorporate various amounts of population structure. In addition, we varied the proportion of related individuals selected from 0 to 50% of the sample, where we defined a pair of individuals as related if their estimated kinship coefficient, provided by UK Biobank using KING^{40}, was above 0.044. This is to assess how REGENIE would perform in samples with higher amounts of relatedness. We also considered randomly selected samples from the white British or European set, irrespective of the relatedness information, where about 30% of samples in these sets are related up to the third degree. Finally, to consider scenarios of more extreme relatedness, we considered scenarios with samples consisting of only first-degree white British relatives (N = 22,990), first- and second-degree white British relatives (N = 30,775), and first- to third-degree white British relatives (N = 70,684).
We generated quantitative traits as
\[y_i = \sum_{j = 1}^{M} G_{ij}\beta_j + \gamma A_i + \epsilon_i\]
where the M causal SNPs were randomly selected only from odd chromosomes with a minor allele count above 100 and not involved in interchromosomal LD, and G_{ij} represents the standardized genotype for individual i at the jth causal SNP, A_{i} represents the score of the individual for the top principal component from a genotype relatedness matrix using SNPs on odd chromosomes and \(\epsilon_i\) represents the environmental effects. The effect sizes β_{j} for the causal SNPs were sampled from a normal distribution with a mean of zero, where the variance was determined based on the desired proportion of trait variance explained by the causal SNPs \(h_g^2\). The effect from population structure γ was set so that the proportion of the trait variance explained by the top principal component was 5%. The environmental effects were sampled from a normal distribution with a mean of zero and the variance was set to correspond to a trait variance of one.
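A minimal sketch of this simulation design is shown below, assuming the additive model y_i = Σ_j G_ij β_j + γ A_i + ε_i with β_j ~ N(0, h_g^2/M), γ = sqrt(0.05) and environmental variance 1 − h_g^2 − 0.05, so that the three components sum to a trait variance of about one. The function name and the use of standard normal draws as stand-ins for standardized genotypes and principal component scores are assumptions; the paper used UK Biobank genotypes.

```python
import math
import random

def simulate_quantitative_trait(G, A, h2_g, pc_var=0.05, seed=1):
    """G: N x M standardized causal genotypes; A: N standardized top-PC scores."""
    rng = random.Random(seed)
    m = len(G[0])
    beta = [rng.gauss(0.0, math.sqrt(h2_g / m)) for _ in range(m)]
    gamma = math.sqrt(pc_var)                  # PC explains pc_var of the variance
    env_sd = math.sqrt(1.0 - h2_g - pc_var)    # remainder is environmental
    return [sum(gij * bj for gij, bj in zip(row, beta)) + gamma * a
            + rng.gauss(0.0, env_sd)
            for row, a in zip(G, A)]

# Stand-in inputs: independent standard normal "genotypes" and PC scores.
rng = random.Random(0)
n, m = 2000, 50
G = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
A = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = simulate_quantitative_trait(G, A, h2_g=0.2)
mean_y = sum(y) / n
var_y = sum((v - mean_y) ** 2 for v in y) / (n - 1)
print(var_y)
```

Because the effect sizes are drawn at random, the realized genetic variance fluctuates around h_g^2, so the sample variance of the trait is close to, but not exactly, one.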
For binary traits, we used the model described above to obtain a quantitative phenotype and then applied a threshold based on a target sample prevalence value K to dichotomize the phenotype and obtain a binary trait. We also considered simulations to assess the effectsize estimates using a logistic model where
and Y_{i} | p_{i} ~ Bernoulli(p_{i}), independently, where β_{0} was chosen to achieve the desired prevalence level, and the effect sizes of the causal SNPs were sampled from a normal distribution with a mean of zero and the variance parameter was chosen so that they explain 20% of the variance on the logistic scale.
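The thresholding step used to turn a simulated quantitative phenotype into a binary trait with target sample prevalence K can be sketched as follows. The function name, the rounding of K × N to a whole number of cases and the toy values are assumptions for this example.

```python
def dichotomize(liability, prevalence):
    """Label the top `prevalence` fraction of samples as cases (1), the rest as controls (0)."""
    n = len(liability)
    n_cases = round(prevalence * n)
    order = sorted(range(n), key=lambda i: liability[i], reverse=True)
    cases = set(order[:n_cases])
    return [1 if i in cases else 0 for i in range(n)]

liability = [0.3, -1.2, 2.5, 0.0, 1.1, -0.4, 0.9, -2.0, 0.2, 1.7]
status = dichotomize(liability, prevalence=0.2)
print(status)
```

With K = 0.2 and ten samples, the two largest values (2.5 and 1.7) become cases, giving a case–control ratio of 1:4.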
We simulated up to 100 phenotypic replicates for each simulation setting and selected 10,000, 25,000 or 50,000 SNPs to be causal. For binary traits, we varied the sample prevalence K between 0.1, 0.01 or 0.001, corresponding to a case–control ratio of 1:9, 1:99 or 1:999, respectively, and fixed the number of causal SNPs to 10,000. SNPs on even chromosomes (M_{null} = 324,838 variants) were used to assess the Type 1 error performance and the power was estimated using the set of causal SNPs for each trait. REGENIE was compared with BOLT-LMM with a mixture-of-Gaussians model (BOLT-LMM-MoG), BOLT-LMM with infinitesimal model (BOLT-LMM-inf), fastGWA, SAIGE (only for binary traits and run with LOCO scheme) and PCA (using only the top principal component as a covariate in Step 2 of REGENIE without the LOCO predictions from Step 1). The top principal component was included as a covariate for all methods. For REGENIE–Firth, REGENIE–SPA and SAIGE, the P-value fallback threshold for Firth/SPA correction was set to 0.05.
Statistical analyses
We used REGENIE to perform genome-wide association analyses on up to approximately 11 million imputed variants for 50 quantitative traits and 54 binary traits of up to 407,746 white British participants in the UK Biobank. Quantitative phenotypes were converted to z-scores using rank-inverse-based normal transformation. In the statistical models used, the covariates included age, age^{2}, sex, age × sex and the top 10 principal components provided by the UK Biobank to appropriately correct for population stratification. To assess the performance of REGENIE in genome-wide association studies, we compared the results from REGENIE with those of existing approaches for large-scale analysis, which included BOLT-LMM (version 2.3; https://data.broadinstitute.org/alkesgroup/BOLT-LMM/) and fastGWA (GCTA version 1.93.0beta; https://cnsgenomics.com/software/gcta/#Overview) for quantitative phenotypes and SAIGE (version 0.36.5.1; https://github.com/weizhouUMICH/SAIGE) with the LOCO option for binary traits. For all methods, Step 1 was run on a set of array SNPs stored in bed/bim/fam format and Step 2 was run on imputed data stored in BGEN format. All association analyses used a \(\chi _{\mathrm{df} = 1}^2\) statistic to test a variant for association with a trait (that is, H_{0}: β_{SNP} = 0). All programs were called within R^{41}, where we used the function system.time to track the CPU and wall-clock timings.
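A rank-based normal transformation of a quantitative phenotype can be sketched as below. The text does not state which rank offset was used or how ties were handled, so the offset of 0.5 and the assumption of no tied values are choices made only for this illustration, as is the function name.

```python
from statistics import NormalDist

def rank_inverse_normal(values, offset=0.5):
    """Map values to normal scores through their ranks (assumes no ties)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    standard_normal = NormalDist()
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        z[i] = standard_normal.inv_cdf((rank - offset) / n)
    return z

z = rank_inverse_normal([3.2, 1.0, 10.0])
print(z)
```

Only the ordering of the values matters, so the outlying value 10.0 is mapped to the same score it would receive if it were 3.3.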
Reporting Summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The individual-level genotype and phenotype data are available through formal application to the UK Biobank (http://www.ukbiobank.ac.uk). Results from the genome-wide association study analyses in this paper have been deposited in the GWAS Catalog under the accession numbers GCST90013862–GCST90014022.
Code availability
The C++ source code for REGENIE is available from https://rgcgithub.github.io/regenie/ under an MIT License. Analysis code for the main results in the paper can be found at https://github.com/rgcgithub/regenie/tree/master/scripts.
References
1. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).
2. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
3. Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42, 348–354 (2010).
4. Price, A. L., Zaitlen, N. A., Reich, D. & Patterson, N. New approaches to population stratification in genome-wide association studies. Nat. Rev. Genet. 11, 459–463 (2010).
5. Yu, J. et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38, 203–208 (2006).
6. Zhang, Z. et al. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 42, 355–360 (2010).
7. Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).
8. Listgarten, J. et al. Improved linear mixed models for genome-wide association studies. Nat. Methods 9, 525–526 (2012).
9. Meuwissen, T. H., Hayes, B. J. & Goddard, M. E. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829 (2001).
10. Campos, G. d. L., Hickey, J. M., Pong-Wong, R., Daetwyler, H. D. & Calus, M. P. L. Whole genome regression and prediction methods applied to plant and animal breeding. Genetics 193, 327–345 (2012).
11. Logsdon, B. A., Hoffman, G. E. & Mezey, J. G. A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis. BMC Bioinform. 11, 58 (2010).
12. Carbonetto, P. & Stephens, M. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Anal. 7, 73–108 (2012).
13. Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
14. Loh, P.-R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018).
15. Kerin, M. & Marchini, J. Inferring gene-by-environment interactions with a Bayesian whole-genome regression model. Am. J. Hum. Genet. 107, 698–713 (2020).
16. Jiang, L. et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 51, 1749–1755 (2019).
17. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
18. Zhou, W. et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 50, 1335–1341 (2018).
19. Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 46, 100–106 (2014).
20. Kunert-Graf, J., Sakhanenko, N. & Galas, D. Allele frequency mismatches and apparent mismappings in UK Biobank SNP data. Preprint at bioRxiv https://doi.org/10.1101/2020.08.03.235150 (2020).
21. Svishcheva, G. R., Axenovich, T. I., Belonogova, N. M., van Duijn, C. M. & Aulchenko, Y. S. Rapid variance components-based method for whole-genome association analysis. Nat. Genet. 44, 1166–1170 (2012).
22. Breiman, L. Stacked regressions. Mach. Learn. 24, 49–64 (1996).
23. Young, A. I., Wauthier, F. L. & Donnelly, P. Identifying loci affecting trait variability and detecting interactions in genome-wide association studies. Nat. Genet. 50, 1608–1614 (2018).
24. Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93 (2011).
25. Lee, S., Wu, M. C. & Lin, X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics 13, 762–775 (2012).
26. Zhou, W. et al. Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts. Nat. Genet. 52, 634–639 (2020).
27. Chib, S. & Greenberg, E. Analysis of multivariate probit models. Biometrika 85, 347–361 (1998).
28. Korte, A. et al. A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nat. Genet. 44, 1066–1071 (2012).
29. Dutta, D., Scott, L., Boehnke, M. & Lee, S. Multi-SKAT: general framework to test for rare-variant association with multiple phenotypes. Genet. Epidemiol. 43, 4–23 (2018).
30. Rizvi, A. A. et al. gwasurvivr: an R package for genome wide survival analysis. Bioinformatics 35, 1968–1970 (2018).
31. Morris, A. P. et al. A powerful approach to subphenotype analysis in population-based genetic association studies. Genet. Epidemiol. 34, 335–343 (2010).
32. Jostins, L. & McVean, G. Trinculo: Bayesian and frequentist multinomial logistic regression for genome-wide association studies of multi-category phenotypes. Bioinformatics 32, 1898–1900 (2016).
33. Dahl, A. et al. A multiple-phenotype imputation method for genetic studies. Nat. Genet. 48, 466–472 (2016).
34. Kang, H. M., Ye, C. & Eskin, E. Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics 180, 1909–1925 (2008).
35. Shang, L. et al. Genetic architecture of gene expression in European and African Americans: an eQTL mapping study in GENOA. Am. J. Hum. Genet. 106, 496–512 (2020).
36. Robinson, G. K. That BLUP is a good thing: the estimation of random effects. Stat. Sci. 6, 15–32 (1991).
37. Dey, R., Schmidt, E. M., Abecasis, G. R. & Lee, S. A fast and accurate algorithm to test for binary phenotypes and its application to PheWAS. Am. J. Hum. Genet. 101, 37–49 (2017).
38. Horowitz, J. E. et al. Common genetic variants identify therapeutic targets for COVID-19 and individuals at high risk of severe disease. Preprint at medRxiv https://doi.org/10.1101/2020.12.14.20248176 (2020).
39. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
40. Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
41. R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2013).
Acknowledgements
We thank F. A. Nothaft, H. Davidge, K. Kianfar, K. Feng and Y. Huang for their ongoing advice on developing the REGENIE code in the Databricks environment.
Author information
Contributions
J. Marchini conceived and supervised the study. J. Marchini, J. Mbatchou, L.B. and E.M. developed the method for quantitative traits. J. Mbatchou and J. Marchini developed the method for binary traits. J. Mbatchou and J. Marchini coded the C++ implementation of the method. J. Mbatchou carried out all of the testing and real data analysis of the C++ method. L.B. and E.M. developed the Apache Spark implementation of the method. C.B. provided advice and code for the LD calculations. J.B., A.M. and J.A.K. tested and provided comments on the C++ version. J. Marchini and J. Mbatchou wrote the manuscript. A.M., J.A.K., A.Z., C.O., M.B., B.B., L.H., J.R., M.F., A.B. and G.A. provided helpful comments at various stages of the project.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Genetics thanks Xia Shen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Overview of the REGENIE method.
REGENIE consists of two steps: (1) In Step 1, the dimension of the genetic data is reduced using ridge regression applied to blocks of SNPs, and then the resulting predictors are combined using a second round of linear or logistic ridge regression to produce an overall prediction for each trait, split into 23 LOCO predictors. (2) In Step 2, these LOCO predictors are used when testing each phenotype against a set of either imputed, exome or CNV markers.
Extended Data Fig. 2 Scatterplots comparing three LMM methods for three quantitative traits using UK Biobank white British samples.
Results from REGENIE, fastGWA and BOLT-LMM are compared for (a) LDL (N = 389,189), (b) BMI (N = 407,609) and (c) Bilirubin (N = 388,303). 9.8 million imputed SNPs with minor allele frequency above 1% are tested for association with each trait. For each trait, the P value for each variant was obtained using a \(\chi _{{\mathrm{df}} = 1}^2\) test statistic.
Extended Data Fig. 3 Scatterplots comparing results from different mixed model methods for 4 binary traits using UK Biobank white British samples.
Results from REGENIE using Firth and SPA correction, BOLT-LMM and SAIGE are compared for (a) coronary artery disease (case–control ratio = 1:11, N = 352,063), (b) glaucoma (case–control ratio = 1:52, N = 406,927), (c) colorectal cancer (case–control ratio = 1:97, N = 407,746), and (d) thyroid cancer (case–control ratio = 1:660, N = 407,746). Tests were performed on 11.6 million imputed SNPs, and the plotting symbols represent variant categories based on a minor allele frequency (MAF) threshold of 1%. For each trait, the P value for each variant was obtained using a \(\chi _{{\mathrm{df}} = 1}^2\) test statistic.
Extended Data Fig. 4 Manhattan plots comparing association results for coronary artery disease using 337,484 unrelated white British participants from UK Biobank.
For REGENIE, BOLT-LMM and SAIGE-noLOCO, 329,641 genotyped SNPs from chromosomes 1–22 are included as model SNPs in Step 1, and for SAIGE-LOCO all SNPs from chromosome 9 are excluded, which results in 314,309 SNPs. In Step 2, 482,884 imputed SNPs on chromosome 9 are tested for association. The red dashed horizontal line represents the genome-wide significance level of 5 × 10^{−8}.
Extended Data Fig. 5 Type 1 error performance on simulated quantitative traits with UK Biobank white British samples.
(a) Distribution of λ_{GC} computed at null SNPs. (b) Distribution of empirical type 1 error rates at nominal level 0.05 computed at null SNPs. Each boxplot represents the distribution of the estimated quantity across 100 simulation replicates. Quantitative traits were simulated fixing \(h_g^2\) (the proportion of trait variance explained by causal SNPs) to 0.2 and the number of causal SNPs was varied from 10,000 to 50,000. The proportion of related individuals in the sample of size 100,000 was varied from 0% to 50% (including randomly selecting individuals from the white British set which includes about 30% related individuals). Each box indicates the interquartile range (IQR) and the line inside each box is the median value, the whiskers indicate data up to 1.5 times the IQR, and outliers are indicated by individual dots.
Extended Data Fig. 6 Type 1 error performance on simulated binary traits with UK Biobank white British samples.
Each boxplot represents the distribution of empirical type 1 error rates at nominal level 5 × 10^{−4} across 100 simulation replicates. Each type 1 error rate was evaluated at 324,838 null SNPs using a minor allele frequency filter of 1%. Binary traits were simulated fixing \(h_g^2\) (the proportion of variance on liability scale explained by 10,000 causal SNPs) to 0.2 and the case–control ratio was varied from 1:999 to 1:9. With a total sample size of 100,000, the proportion of related individuals was also varied from 0% to 50% (including randomly selecting individuals from the white British set which includes about 30% related individuals). Each box indicates the interquartile range (IQR) and the line inside each box is the median value, the whiskers indicate data up to 1.5 times the IQR, and outliers are indicated by individual dots.
Extended Data Fig. 7 Effect size estimates for REGENIE-FIRTH and SAIGE on simulated binary traits with 100,000 UK Biobank white British samples.
REGENIE-FIRTH and SAIGE were run on 10 simulated binary trait replicates with case–control ratio varied between 1:9 and 1:999. A logistic model was used to simulate the traits, randomly selecting 10,000 SNPs on odd chromosomes with minor allele count (MAC) above 100 to be causal. The effect size estimates \(\hat \beta\) are compared to the true effect sizes β for (a) causal SNPs or (b) the null SNPs on even chromosomes. Summary statistics were obtained for variants with minor allele count greater than 5, P values in REGENIE-FIRTH and SAIGE below 0.05 (fallback P-value threshold for Firth/SPA correction, respectively), and not involved in interchromosomal LD.
Extended Data Fig. 8 Computation time of REGENIE as the number of quantitative traits analyzed increases.
100,000 samples were used in a single run of REGENIE with 1, 10, 100 or 500 simulated quantitative trait (QT) or binary trait (BT) replicates. All axes are on log_{10} scale. The slope of the dotted lines represents the power law scaling with the number of traits P.
Supplementary information
Supplementary Information
Supplementary Note, Figs. 1–32 and Tables 1–18
Supplementary Table
Supplementary Table 19
Cite this article
Mbatchou, J., Barnard, L., Backman, J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet 53, 1097–1103 (2021). https://doi.org/10.1038/s41588-021-00870-7