Abstract
With decades of electronic health records linked to genetic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression and lifespan. We propose an efficient and accurate frailty model approach for genome-wide survival association analysis of censored time-to-event (TTE) phenotypes by accounting for both population structure and relatedness. Our method utilizes state-of-the-art optimization strategies to reduce the computational cost. The saddlepoint approximation is used to allow for analysis of heavily censored phenotypes (>90%) and low-frequency variants (down to minor allele count 20). We demonstrate the performance of our method through extensive simulation studies and analysis of five TTE phenotypes, including lifespan, with heavy censoring rates (90.9% to 99.8%) on ~400,000 UK Biobank participants with white British ancestry and ~180,000 individuals in FinnGen. We further analyzed 871 TTE phenotypes in the UK Biobank and presented the genome-wide scale phenome-wide association results with the PheWeb browser.
Introduction
Survival models, especially the Cox proportional hazard model^{1}, have been widely used to analyze time-to-event (TTE) outcomes, both in biomedical research^{2,3,4} and in genome-wide association studies (GWAS)^{5,6,7,8,9,10,11}. It has been shown that the proportional hazard model can increase the power to detect genetic variants associated with the age-of-onset of TTE phenotypes in cohort studies compared to modeling the disease status using a logistic regression model, especially for common events^{12,13,14}. Studying the genetic underpinning of age-of-onset, e.g., early or late age-of-onset, is of substantial interest for understanding the disease etiology and planning interventions. With the availability of detailed time-stamped diagnosis data from Electronic Health Records (EHR), large biobanks, such as UK Biobank (UKBB)^{15} (>400,000 individuals) and FinnGen (https://www.finngen.fi/en) (>200,000 individuals), provide unprecedented opportunities to analyze TTE phenotypes to unravel the complex genetic architectures of disease onset, progression, and lifespan. Genome-wide scans of TTE phenotypes in large biobanks can potentially identify novel genetic variants associated with the onset of human diseases by leveraging both the disease status and the age-of-onset information.
In GWAS analysis, population structure and sample relatedness are often key confounders and factors that need to be controlled for. Biobank cohorts often have substantial population structure and relatedness. For example, in the UK Biobank, 91,392 out of 408,582 subjects with White British ancestry have at least one relative (up to 3rd degree) in the data. Several methods based on linear^{16,17,18} and logistic^{19,20} mixed effects models have been developed to account for relatedness in GWASs for quantitative and binary phenotypes. To account for related subjects in the proportional hazard model, frailty models, which are mixed effects survival models, have been proposed^{21,22}, where event times are assumed to be independent conditional on unobserved random effects called “frailties”. The frailties are modeled based on the dependence and clustering structure of the observations.
Previous research has extensively studied shared frailty models with Gamma-distributed frailties^{21,23,24,25,26,27,28}. However, the shared frailty model assumes that the subjects in a cluster share a common frailty and thus is limited in its scope to model more complicated dependency structures that arise in cohort-based association studies. Bivariate extensions to the shared frailty model, such as the correlated Gamma^{29,30} or the correlated compound Poisson^{31} frailty model, allow the frailties to be correlated between two subjects. However, these models are also too restrictive because they model the correlations using one parameter; effectively, they are more appropriate for twin studies and cannot model arbitrarily complex relationship structures.
To model complicated dependency structures, such as known familial structures and cryptic relatedness, the multivariate frailty model with Gaussian frailty was proposed^{32,33}, and was later implemented in the R package COXME^{34}, which, however, lacks scalability for GWASs. Recently the COXME method was further improved in COXMEG^{35}, which utilizes several computational optimization strategies to make it applicable in genetic association studies, but COXMEG still cannot handle biobank-scale genome-wide datasets. Based on our performance benchmarking, for 20,000 subjects, COXMEG requires 3356 CPU-hours (1412 CPU-hours for the COXMEG-Sparse option) to perform a GWAS of 46 million variants; thus, COXMEG would take over 4.6 days (1.9 days for COXMEG-Sparse) to complete the GWAS, even with perfect parallelization on 30 CPUs.
In large-scale GWASs, the score test is particularly useful among different asymptotic tests, because it requires fitting the model only once under the null hypothesis of no association^{20}. Score tests have also been implemented in COXMEG^{36}. However, score tests can lead to severe type I error inflation for phenotypes with heavy censoring, where the number of subjects who have experienced an event (for example, diagnosed with the phenotype of interest) is small compared to the number of subjects who have not experienced the event (also called censored subjects) during the study follow-up period. This is common in biobank-based phenotypes. In the UK Biobank phenome that was built based on Phecodes^{37} (see the “Methods” section), 871 TTE phenotypes have at least 500 events (cases), out of which 811 phenotypes have a censoring rate of more than 95%. The inaccuracies of the score test in unbalanced case-control phenotypes have been previously shown for logistic regression and logistic mixed effects models^{19,38,39,40}, and a saddlepoint approximation^{41} (SPA)-based adjustment has been proposed and successfully implemented^{19} to accurately calibrate the p-values in such scenarios. Recently, the SPACox^{11} method also used SPA to calibrate p-values for time-to-event phenotypes in unrelated samples. However, the SPACox method does not account for sample relatedness. Through simulations, we show similar inaccuracies are also present in score tests in frailty models for analyzing heavily censored phenotypes.
Here, we propose a novel method for genome-wide survival analysis of TTE phenotypes, which accounts for both population structure and sample relatedness, controls type I error rates even for phenotypes with extremely heavy censoring, and is scalable for genome-wide scale phenome-wide association studies (PheWASs) on biobank-scale data. Our method, Genetic Analysis of Time-to-Event phenotypes (GATE), transforms the likelihood of a multivariate Gaussian frailty model into a modified Poisson generalized linear mixed model (GLMM^{20,42}) likelihood, employs several state-of-the-art optimization techniques to fit the modified GLMM under the null hypothesis, and then performs score tests calculated using the null model for each genetic variant. To obtain well-calibrated p-values for heavily censored phenotypes, GATE uses the SPA to estimate the null distribution of the score statistic instead of the traditionally used normal approximation. Moreover, our method substantially reduces the memory requirement by storing the raw genotypes in binary format and calculating the elements of the GRM on the fly instead of storing or inverting a large-dimensional GRM. Through extensive simulations and analysis of TTE phenotypes from the UK Biobank data of 408,582 subjects with White British ancestry as well as the FinnGen study freeze 5, which contains 218,792 subjects, we showed that GATE is scalable to biobank-scale GWASs of TTE phenotypes with type I error rates well controlled even for less frequent variants and heavily censored phenotypes. Benchmarking has shown that GATE can analyze 46 million variants in a GWAS with 408,582 subjects in ~14.5 h using 30 CPUs with peak memory usage under 11 GB.
Results
Overview of methods
GATE consists of two main steps: (1) fitting the null frailty model to estimate the variance component and other model parameters and (2) performing a score statistic-based test for association between each genetic variant and the phenotype. Step 1 involves iteratively fitting the null frailty model by first rewriting the likelihood of the observed censored time-to-event data under the frailty model as a modified Poisson log-linear mixed effects model likelihood, and then applying modified optimization strategies as described in GMMAT^{20} and SAIGE^{19} to fit the null modified Poisson log-linear mixed models (see the “Methods” section). These strategies include using the computationally efficient average information restricted maximum likelihood (AI-REML^{20,43}) algorithm for estimating the variance component and using the preconditioned conjugate gradient (PCG^{44}) method to solve linear systems to avoid inverting the N × N genetic relatedness matrix (GRM), where N is the number of subjects. GATE computes the elements of the GRM on the fly when needed using binary vectors of raw genotypes, and thus it does not require supplying, storing, or inverting a precomputed GRM, which can be extremely time- and memory-consuming for large sample sizes (N). For example, in the UK Biobank data with M = 93,511 markers and N = 408,582 subjects with White British ancestry, the memory requirement drops from 622 GB for storing a precomputed GRM in floating point numbers, to only 8.9 GB for storing the raw genotypes in the binary format.
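The memory figures and the matrix-free solve described above can be illustrated with a minimal Python sketch. It is not the GATE implementation: it assumes a 4-byte float per GRM entry and 2 bits per genotype call (which reproduces the quoted 622 and 8.9 figures when sizes are expressed in GiB), a standardized genotype matrix `z`, a covariance of the form diag(`w_inv`) + τ·GRM, and a simple diagonal (Jacobi) preconditioner. The function names `storage_gib`, `grm_times`, and `pcg_solve` are hypothetical.

```python
import numpy as np


def storage_gib(n_subjects, n_markers):
    """Return (dense float32 GRM size, 2-bit packed genotype size) in GiB."""
    grm_bytes = n_subjects ** 2 * 4          # assumed 4 bytes per GRM entry
    geno_bytes = n_subjects * n_markers / 4  # assumed 2 bits per genotype call
    return grm_bytes / 2 ** 30, geno_bytes / 2 ** 30


def grm_times(z, v):
    """Compute GRM @ v as Z (Z^T v) / M without forming the N x N matrix."""
    return z @ (z.T @ v) / z.shape[1]


def pcg_solve(z, w_inv, tau, b, tol=1e-8, max_iter=500):
    """Solve (diag(w_inv) + tau * GRM) x = b by preconditioned conjugate gradient."""
    def sigma_times(v):
        return w_inv * v + tau * grm_times(z, v)

    # Jacobi preconditioner: inverse diagonal of Sigma, where GRM_ii = mean_j z_ij^2.
    precond = 1.0 / (w_inv + tau * np.mean(z ** 2, axis=1))
    x = np.zeros(b.shape, dtype=float)
    r = b - sigma_times(x)
    s = precond * r
    p = s.copy()
    rs = r @ s
    for _ in range(max_iter):
        ap = sigma_times(p)
        alpha = rs / (p @ ap)
        x = x + alpha * p
        r = r - alpha * ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        s = precond * r
        rs_new = r @ s
        p = s + (rs_new / rs) * p
        rs = rs_new
    return x
```

Each iteration costs two passes over the N × M genotype matrix, so neither the N × N GRM nor its inverse is ever materialized; this is the design choice that makes memory scale with N·M rather than N².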
Step 2 involves scanning the entire genome and testing each variant for association using score statistics. Since the overall cost of computing the variance of the score statistic for all variants is extremely high because it involves operations on the large-dimensional GRM, in step 2, GATE uses a variance ratio approximation derived under the modified TTE Poisson log-linear models by extending that used in existing LMM- and GLMM-based methods such as GRAMMAR-Gamma^{17}, BOLT-LMM^{16}, fastGWA^{18}, and SAIGE^{19}. The ratio of the variance of the score statistic with and without random effects (and an attenuation factor due to estimating the baseline hazards) is computed using a subset of genetic markers. Previously, it was shown that this variance ratio remains approximately constant for variants with MAC ≥ 20 for LMM and GLMMs. Through analytical derivations and simulation examples, we show this observation holds for frailty models as well (Supplementary Note section 3 and Supplementary Fig. 15). Therefore, when performing the genome-wide scan, the variance of the score statistic is computed without using the GRM and then calibrated using the variance ratio.
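The variance ratio idea can be sketched as follows, in the generic GLMM form used by methods such as SAIGE. This is an illustration under stated assumptions, not GATE's derivation: it takes a precomputed `sigma_inv` (which a real implementation would replace by iterative solves), working weights `w`, and a covariate matrix `x`, and it omits the attenuation factor for the estimated baseline hazards mentioned above. The function names are hypothetical.

```python
import numpy as np


def score_variances(g, x, sigma_inv, w):
    """Return (variance with random effects, variance without) for one marker."""
    # With the GRM: g^T P g, where P = S - S X (X^T S X)^{-1} X^T S and S = Sigma^{-1}.
    sx = sigma_inv @ x
    xsg = sx.T @ g
    var_full = g @ (sigma_inv @ g) - xsg @ np.linalg.solve(x.T @ sx, xsg)
    # Without the GRM: the same quadratic form with Sigma^{-1} replaced by diag(w).
    wx = w[:, None] * x
    xwg = wx.T @ g
    var_null = g @ (w * g) - xwg @ np.linalg.solve(x.T @ wx, xwg)
    return var_full, var_null


def estimate_variance_ratio(genos, x, sigma_inv, w):
    """Average the per-marker ratio over a subset of markers (columns of genos)."""
    ratios = []
    for g in genos.T:
        var_full, var_null = score_variances(g, x, sigma_inv, w)
        ratios.append(var_full / var_null)
    return float(np.mean(ratios))
```

In a genome-wide scan, only `var_null` would then be computed per variant and multiplied by the estimated ratio, which avoids any operation involving the GRM in step 2.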
Next, GATE uses the saddlepoint approximation^{41} (SPA) to approximate the null distribution of score statistics for association tests under the modified Poisson log-linear mixed models. SPA-based tests have been successfully used for logistic regression^{39} and logistic mixed models^{19} and provide more accurate p-values than traditional score tests under normal approximation for low-frequency variants when the case-control ratio is unbalanced. In GATE, we have implemented an efficient SPA-based test for frailty models by extending the fastSPA method in Dey et al.^{39}. Through simulations and real data analysis, we show that SPA tests provide accurate and calibrated p-values, even for low-frequency variants when the censoring rate is as high as 99%.
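To show why the SPA matters for rare events, here is a small Python sketch of a saddlepoint tail approximation. It uses the Bernoulli cumulant generating function from the logistic setting as a stand-in; GATE's own CGF is derived under its modified Poisson model and is not reproduced here. The score is S = Σ g_i (y_i − μ_i), the saddlepoint is found by bisection, and the tail uses the Barndorff-Nielsen form. All function names are hypothetical.

```python
import math

import numpy as np


def cgf_derivs(t, g, mu):
    """K(t), K'(t), K''(t) of S = sum_i g_i (y_i - mu_i), y_i ~ Bernoulli(mu_i)."""
    e = mu * np.exp(g * t)
    denom = 1.0 - mu + e
    k0 = float(np.sum(np.log(denom)) - t * np.sum(g * mu))
    k1 = float(np.sum(g * e / denom) - np.sum(g * mu))
    k2 = float(np.sum(g ** 2 * e * (1.0 - mu) / denom ** 2))
    return k0, k1, k2


def normal_upper_tail(z):
    """Upper tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def spa_upper_tail(s, g, mu, n_bisect=100):
    """Approximate P(S >= s) for s > 0 via the saddlepoint approximation."""
    lo, hi = 0.0, 1.0
    while cgf_derivs(hi, g, mu)[1] < s:  # expand until K'(hi) >= s
        hi *= 2.0
        if hi > 64.0:
            raise ValueError("s lies outside the range this sketch can handle")
    for _ in range(n_bisect):  # K' is increasing, so bisection finds the root
        mid = 0.5 * (lo + hi)
        if cgf_derivs(mid, g, mu)[1] < s:
            lo = mid
        else:
            hi = mid
    t_hat = 0.5 * (lo + hi)
    k0, _, k2 = cgf_derivs(t_hat, g, mu)
    w = math.sqrt(2.0 * (t_hat * s - k0))
    v = t_hat * math.sqrt(k2)
    return normal_upper_tail(w + math.log(v / w) / w)
```

With 1000 subjects, an event probability of 1%, and a score 15 above its mean, the normal approximation gives a tail probability more than an order of magnitude smaller than the saddlepoint value, which is the kind of miscalibration that inflates type I error under heavy censoring.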
Both GATE and COXMEG^{35} conduct genetic association tests for TTE phenotypes using the frailty model. Besides the use of SPA-based tests, GATE uses the variance ratio approach to approximate the variances of the score statistics, while COXMEG calculates the variances using the GRM. Using simulation studies, we have shown that GATE provides association p-values consistent with COXMEG (\({R}^{2}\) of \(-{\log }_{10}\) p-values > 0.99, slope = 1.008, intercept = −0.0004) for common variants (MAF > 5%) with a censoring rate of 50% and moderate sample sizes (Supplementary Fig. 1A). Further, GATE has well-controlled type I error rates even for less frequent variants and phenotypes with heavy censoring rates, where COXMEG results in inflated type I error rates (Supplementary Fig. 1B). Finally, as shown below, GATE is computationally much more scalable than COXMEG for large biobank data.
Computation and memory costs
To assess the computational performance of GATE and the tests implemented in the COXMEG package, namely COXMEG-Score and COXMEG-Sparse, we randomly sampled subsets of different sample sizes from 408,582 UK Biobank subjects with White British ancestry. We then benchmarked association tests for overall lifespan (16,375 events, 389,721 censored) adjusting for the top four ancestry principal components, birth year, and sex using GATE, COXMEG-Score, and COXMEG-Sparse on 200,000 variants randomly selected from 46 million genetic variants with imputation info ≥ 0.3 and MAC ≥ 20. In Step 1, 93,511 high-quality genotyped markers were used for the GRM. The projected overall computation time (Fig. 1 and Supplementary Table 1) for GATE to analyze 46 million variants on N = 408,582 subjects was 318 CPU-hours, and the actual computation time on a machine with 30 cores was 14.5 h. Step 2, which accounts for the majority of the computation time (95.4% for N = 408,582), requires substantially less memory (peak memory usage 0.85 GB) than Step 1 (peak memory usage 10.6 GB).
However, to perform GWAS on only 20,000 subjects, the projected computation time and memory usage for COXMEG-Score were 3356 CPU-hours (4.6 days with 30 CPUs) and 32.75 GB, respectively, and for COXMEG-Sparse, they were 1412 CPU-hours (1.96 days with 30 CPUs) and 5.95 GB. As GATE only uses 34 CPU-hours and 0.74 GB, it achieves 98% and 88% reductions in computation time and memory, respectively, compared to COXMEG. Note that the computation time and memory requirements increase nearly linearly with the sample size for GATE, whereas they increase quadratically for COXMEG-Score and COXMEG-Sparse.
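The wall-clock and reduction figures above follow from simple arithmetic, reproduced here as a short Python check using only the numbers quoted in the text (the helper names are hypothetical). The check suggests that the 98% and 88% reductions correspond to the COXMEG-Sparse baseline; against COXMEG-Score the same arithmetic gives roughly 99% and 98%.

```python
def days_on_cpus(cpu_hours, n_cpus=30):
    """Wall-clock days assuming perfect parallelization over n_cpus."""
    return cpu_hours / n_cpus / 24.0


def pct_reduction(new, old):
    """Percentage reduction of `new` relative to `old`."""
    return 100.0 * (1.0 - new / old)


# Projected costs for N = 20,000 subjects, as quoted in the text.
coxmeg_score = {"cpu_hours": 3356, "mem_gb": 32.75}
coxmeg_sparse = {"cpu_hours": 1412, "mem_gb": 5.95}
gate = {"cpu_hours": 34, "mem_gb": 0.74}

for name, ref in (("COXMEG-Score", coxmeg_score), ("COXMEG-Sparse", coxmeg_sparse)):
    print(
        name,
        round(days_on_cpus(ref["cpu_hours"]), 2), "days;",
        "time reduction:", round(pct_reduction(gate["cpu_hours"], ref["cpu_hours"])), "%;",
        "memory reduction:", round(pct_reduction(gate["mem_gb"], ref["mem_gb"])), "%",
    )
```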
Phenome-wide GWAS of time-to-event phenotypes in the UK Biobank data
We have applied GATE to perform phenome-wide GWAS for 871 UKBB TTE phenotypes with at least 500 events, adjusting for the top four PCs, birth year, and sex (except for 93 sex-specific phenotypes). The TTE phenotypes were created based on the International Classification of Diseases (ICD) codes, versions 9 and 10, mapped to the PheWAS code (PheCode^{37}) definitions (see the “Methods” section) as well as their associated diagnosis dates in the UK Biobank electronic medical records. For each phenotype, we analyzed approximately 46 million genetic markers imputed from the Haplotype Reference Consortium^{45} panel and UK10K^{46} with imputation INFO score ≥ 0.3 and MAC ≥ 20. Among the 408,582 UK Biobank subjects with White British ancestry, 91,392 had at least one relative up to the third degree^{15}. To account for the relatedness among the subjects, we used 93,511 high-quality genotyped markers with MAF ≥ 0.01 to construct the GRM in Step 1. The same set of markers was used by the UK Biobank research group^{15} for estimating kinship among the samples because they are only weakly informative of the ancestry and therefore provide more accurate kinship estimates. We also performed a sensitivity analysis using a larger set of markers (245,745) for the four exemplary phenotypes discussed below (see Supplementary Note Section 7). We further applied SPA-based adjustment of the score test because the censoring rates (Supplementary Fig. 2) were extremely high for most of the TTE phenotypes in the UKBB (for example, 811 out of 871 have a censoring rate of more than 95%). The summary statistics for all 871 PheCodes analyzed using GATE are available to download from a public repository and can be browsed in PheWeb^{47} (see the section “Data availability”).
Here we discuss the association results using four phenotypes with different censoring rates as exemplars: ischemic heart disease (IHD: PheCode 411, N events = 36,962, N censored = 370,814, censoring rate = 90.9%), female breast cancer (FBC: PheCode 174.1, N events = 15,396, N censored = 192,764, censoring rate = 92.6%), glaucoma (PheCode 365, N events = 6046, N censored = 392,925, censoring rate = 98.5%), and Alzheimer’s Disease (AD: PheCode 290.11, N events = 822, N censored = 342,059, censoring rate = 99.8%). The Manhattan and QQ plots for the GWAS of these phenotypes using GATE with and without SPA are presented in Figs. 2 and 3, respectively. The results demonstrate that omitting the SPA adjustment greatly inflates the type I errors, especially for the low-frequency variants, whereas the SPA-adjusted method shows well-controlled type I error rates. In total, 114 loci have been identified for the four TTE phenotypes: 55 for IHD, 37 for FBC, 19 for glaucoma, and 3 for AD. We also applied GATE to these four phenotypes in the FinnGen study (see the “Methods” section); 81 out of the 114 loci were also tested in the FinnGen study, of which 78 had the same effect direction in both UKBB and FinnGen. Of the 81 loci, 69 were successfully replicated in FinnGen with p-value < 0.05. The complete list of all significant loci and the association results in the UKBB, FinnGen as well as the meta-analysis of the two data sets are reported in Supplementary Data 1. Overall, 99 out of the 114 significant loci have been previously reported to be associated with disease risk in case-control studies to the best of our knowledge. Several loci that are well known to be associated with the risk of the diseases have been identified in our study, such as the loci LPA and CELSR2 for IHD^{48,49}, FGFR2^{50} and CASC16^{51} for breast cancer, MYOC^{52} and TMCO1^{53} for glaucoma, and the APOE e4 variant for AD^{54}.
The age-varying predicted risk of disease onset based on the GATE method, and the age-varying disease-free probability by genotypes based on the Kaplan–Meier curve^{55} for the exemplary top hits are plotted in Fig. 4 and Supplementary Fig. 3, respectively.
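For readers unfamiliar with the Kaplan–Meier estimator used for the disease-free probability curves, the following Python sketch implements the standard product-limit formula. It is a generic illustration, not the code used to produce the figures; `kaplan_meier` is a hypothetical name, `times` holds the age at event or censoring, and `events` is 1 for an observed event and 0 for censoring.

```python
import numpy as np


def kaplan_meier(times, events):
    """Return (distinct event times, survival probability just after each)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)              # subjects still under observation
        n_events = np.sum((times == t) & events)  # events occurring exactly at t
        s *= 1.0 - n_events / at_risk
        surv.append(s)
    return event_times, np.array(surv)
```

A curve by genotype can be obtained by calling the function separately on the subjects in each genotype group.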
We further applied logistic mixed models using SAIGE to analyze these four UKBB phenotypes using their binary disease status at the latest follow-up time, accounting for the same covariates as in the GATE application. GATE identified 18 loci (11 for IHD, 2 for FBC, 4 for glaucoma, and 1 for AD) that were not significant using SAIGE logistic mixed models (see Supplementary Data 2). Out of these 18 loci, 12 were previously reported as associated with the corresponding phenotypes in other case-control studies. For example, GATE identified an association between AD and an intronic rare variant rs533100590 (MAF = 0.005%, p-value = \(2.78\times {10}^{-8}\)) in gene ATP9B (ATPase, class II, type 9B) while SAIGE did not (p-value = \(1.09\times {10}^{-6}\)). This locus has been previously shown to be associated with AD^{56}. GATE identified the known locus SWAP70 (intronic rs378825, MAF = 42.7%, p-value = \(4.92\times {10}^{-8}\)) for IHD that was missed by the SAIGE logistic mixed model (p-value = \(1.38\times {10}^{-7}\)).
GWAS of lifespan in the FinnGen study and the UK Biobank
We have also applied GATE to the overall lifespan in the FinnGen study (N events = 15,152, N censored = 203,244), in which the age of death ranges from 7 years old to 106 years old as shown in Supplementary Fig. 4. We identified the previously reported APOE locus for lifespan^{57} in FinnGen, in which the most significant variant is the APOE e4 missense variant rs429358 (MAF = 18.3%, p-value = \(1.01\times {10}^{-14}\)), which is well known to be associated with lifespan, cardiovascular diseases, stroke, and Alzheimer’s disease^{58,59,60}. However, when the SAIGE logistic mixed model was applied to the FinnGen binary trait of dead/alive status, it did not identify any significant locus (rs429358 has p-value \(1.89\times {10}^{-6}\)).
The locus rs429358 has also been replicated in UKBB (N events = 16,375 and N censored = 389,721, see Supplementary Fig. 5A) with p-value \(1.92\times {10}^{-5}\) and meta-analysis p-value \(4.04\times {10}^{-17}\) (Supplementary Table 2 and Supplementary Fig. 5B). The top hit in UKBB for lifespan (rs157592, MAF = 18.7%, p-value = \(1.87\times {10}^{-8}\)) had LD \({r}^{2}=0.7\) with rs429358, as presented in Supplementary Table 2. This variant rs157592 is in the intergenic region and has no in silico function according to the FAVOR functional annotation online portal^{61} (see the section “Code availability”).
Simulation studies
We investigated the type I error rates and power of GATE in the presence of sample relatedness using 10,000 simulated samples. Due to computational burden, we used GATE-noSPA instead of COXMEG-Score for type I error evaluation, as Supplementary Fig. 1C shows the two approaches provide consistent association p-values (\({R}^{2}\) of \(-{\log }_{10}\) p-values > 0.99).
The type I error rates of GATE were evaluated based on association tests of 9.4 × 10^{8} simulated genetic markers on 10,000 samples, which contain 500 families and 5000 independent samples. Each family has 10 members, simulated based on the pedigree shown in Supplementary Fig. 6. The variance component parameter τ is set to be 0.1 and 0.25 (see the “Methods” section). The empirical type I error rates at the significance level α = 1 × 10^{−6} and 5 × 10^{−8} are shown in Supplementary Table 3 and Supplementary Fig. 7A. Our simulation results suggest that GATE has well-controlled type I error rates even for low-frequency variants (down to MAC = 20) when the phenotype is heavily censored (90%). However, without SPA, the score tests in GATE suffer from inflated type I error rates as the censoring becomes more extreme and the frequency of variants decreases. We also evaluated type I error rates of GATE in a setting with cryptic sample relatedness by randomly selecting 10,000 UKBB participants with white British ancestry. Phenotypes were simulated using the real genotypes in the UKBB to mimic the sample relatedness of a real-world dataset, and association tests were conducted on the imputed genetic markers in the UKBB (see the “Methods” section). Similarly, we observed that the type I error rates were well controlled in GATE in the presence of cryptic sample relatedness with different censoring rates (Supplementary Table 4, Supplementary Figs. 7B and 8).
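A null phenotype with this kind of family structure and a target censoring rate could be simulated along the following lines. This Python sketch is illustrative only and makes several assumptions that are not taken from the paper: every pair of relatives within a family is given frailty correlation 0.5 (the actual pedigree in Supplementary Fig. 6 implies a richer kinship matrix), the baseline hazard is constant, and censoring is administrative at a single cutoff chosen to hit the target rate. The function name `simulate_tte` is hypothetical.

```python
import numpy as np


def simulate_tte(n_families, family_size, n_independent, tau, censoring_rate, rng):
    """Simulate censored event times under a Gaussian frailty model (null model)."""
    n_related = n_families * family_size
    n = n_related + n_independent
    family = np.repeat(np.arange(n_families), family_size)
    shared = rng.normal(size=n_families)[family]
    own = rng.normal(size=n_related)
    # Assumed kinship: correlation 0.5 between any two members of a family,
    # so each frailty has variance tau and within-family covariance 0.5 * tau.
    b_related = np.sqrt(tau) * (np.sqrt(0.5) * shared + np.sqrt(0.5) * own)
    b_indep = np.sqrt(tau) * rng.normal(size=n_independent)
    b = np.concatenate([b_related, b_indep])
    # Constant unit baseline hazard: T ~ Exponential with rate exp(b).
    event_time = rng.exponential(size=n) / np.exp(b)
    # Administrative censoring at the quantile that yields the target rate.
    cutoff = np.quantile(event_time, 1.0 - censoring_rate)
    observed = event_time <= cutoff
    time = np.minimum(event_time, cutoff)
    return time, observed, b
```

An empirical type I error rate would then be the fraction of null variants whose p-value falls below α when tested against such phenotypes.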
Next, we evaluated the empirical power of GATE at α = 5 × 10^{−8} and compared it to the power of COXMEG-Score. Supplementary Fig. 9 shows the power curve by hazard ratios for variants with MAF 0.05 and 0.2 when τ = 0.25 and the censoring rate = 50%. Both methods have nearly identical power in all simulation settings. We do not compare their powers in the presence of heavy censoring, in view of the inflated type I error rate of COXMEG-Score.
Overall, simulation studies show that GATE can control type I error rates even when the censoring rate is high and has similar power for common variants as COXMEG-Score. In contrast, like GATE-noSPA, COXMEG suffers from type I error inflation, and the inflation is especially severe with low MAF and heavy censoring (Supplementary Figs. 1B, C, 7, and 8).
In addition, we compared the empirical power of GATE and the association tests based on a logistic mixed model as implemented in SAIGE (Supplementary Fig. 10), for simulated TTE phenotypes with 50%, 75%, and 95% censoring rates (see the “Methods” section). SAIGE treated all events at the latest follow-up time as cases and all censored individuals as controls, and tested for associations between genetic markers and the disease risk coded as the case-control status while accounting for the age at the latest follow-up time as a covariate in the linear term. As expected, GATE overall showed a higher empirical power to identify the genetic markers that are associated with the phenotype than SAIGE. The difference in the empirical powers decreased as the censoring rate increased. However, even for the datasets with a 95% censoring rate, GATE empirically had ~5–6% power improvement over SAIGE at the hazard ratio range 2–3 for MAF 0.05, and at the hazard ratio range 1.5–1.8 for MAF 0.2.
Discussion
In this paper, we have proposed a novel method to perform scalable genome-wide survival association analysis of censored TTE phenotypes in biobank-scale data using an efficient implementation of the frailty model. Our method can adjust for population structure and sample relatedness and provide accurate p-values even in extreme cases of very low-frequency variants and heavily censored phenotypes (incidence rate < 0.1%). Applying this approach to the UK Biobank and the FinnGen study, we demonstrated that our method is scalable to the analysis of large biobank-scale datasets with >400,000 subjects.
The major methodological improvement in our study is to derive the frailty model as a modified Poisson log-linear mixed model, which allows us to incorporate some of the existing approaches for GLMM-based models into the frailty model for time-to-event (TTE) phenotypes. A frailty model is fundamentally different from a GLMM in the kind of data it is used to analyze, the problems it is applied to, and the way it is fitted. In our paper, we derived and transformed the frailty model likelihood into a modified Poisson GLMM likelihood (Supplementary Note Section 1), and derived the appropriate model fitting procedure for such modified GLMMs. We note that the modified Poisson GLMM is still not a GLMM, and thus the derivation steps and model-fitting techniques are non-trivial.
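The connection between a proportional hazards likelihood and a Poisson likelihood rests on a standard identity, sketched below. The notation is assumed for illustration and is not taken from the paper: \(\delta_i\) is the event indicator, \(t_i\) the observed time, \(\lambda_0\) and \(\Lambda_0\) the baseline and cumulative baseline hazards, and \(\eta_i\) the linear predictor including the frailty \(b_i\). The exact derivation used by GATE, including the treatment of the baseline hazard, is given in its Supplementary Note Section 1.

```latex
\[
\ell(\eta \mid b)
  = \sum_{i=1}^{N} \Big[ \delta_i \{ \log \lambda_0(t_i) + \eta_i \} - \Lambda_0(t_i)\, e^{\eta_i} \Big]
  = \sum_{i=1}^{N} \big[ \delta_i \log \mu_i - \mu_i \big]
    + \sum_{i=1}^{N} \delta_i \log \frac{\lambda_0(t_i)}{\Lambda_0(t_i)},
\qquad \mu_i = \Lambda_0(t_i)\, e^{\eta_i}.
\]
```

The first sum on the right is a Poisson log-likelihood kernel with response \(\delta_i\), mean \(\mu_i\), and offset \(\log \Lambda_0(t_i)\); the second sum does not involve \(\eta_i\). Because \(\Lambda_0\) is unknown and must be estimated alongside the other parameters, the resulting model is only a modified Poisson GLMM, consistent with the remark above that it is not a GLMM in the usual sense.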
Biobanks with genetic data linked to EHR records/survey questionnaires provide unprecedented opportunities for genetic association studies on TTE phenotypes to identify genetic risk factors that affect the onset and progression of diseases. However, biobanks pose challenges to such analysis because of the high computational and memory cost required to handle large data sets with extensive population structure and relatedness. Moreover, existing methods such as COXMEG artificially inflate associations when heavily censored phenotypes (e.g., censoring rate > 75%) and low-frequency variants (MAF < 1%) are involved. The proposed method GATE performs a frailty model-based association analysis to account for both population structure and relatedness using score tests with SPA adjustment, which provides accurate p-values under heavy censoring. In addition, it implements several optimization techniques that were previously used in the context of linear and logistic mixed models in BOLT-LMM and SAIGE to make it computationally feasible to analyze large biobank cohorts. We have applied GATE to 871 TTE phenotypes in the UK Biobank data with White British ancestry, which were constructed based on PheCodes mapped to ICD codes and have at least 500 events. The genome-wide summary statistics are available for the public to download. We have also created a PheWeb^{47} for users to explore and visualize the PheWAS results.
Lifespan is a typical TTE phenotype and the genetic effects on lifespan can be appropriately modeled by the frailty model. We applied GATE to lifespan in the FinnGen study, whose participants have a wide age-of-death range from 7 to 106 years, and have successfully identified a well-known locus rs429358 (p-value = \(1.01\times {10}^{-14}\)). However, this locus was missed by a logistic mixed model for the dead/alive status (p-value = \(1.89\times {10}^{-6}\)). This example suggests that applying frailty models can be useful for uncovering genetic risk factors for TTE analysis, as further evidenced through simulation studies (see the “Methods” section). GATE can facilitate these studies.
We also compared GWAS results using logistic mixed models of the binary disease status as implemented in SAIGE in the four example phenotypes presented in the paper and found that across the four phenotypes, SAIGE failed to identify 18 loci (Supplementary Data 2) that were identified by GATE, among which 11 were for ischemic heart disease. This shows that a frailty model-based analysis of TTE phenotypes can lead to the identification of loci that might be missed by only analyzing the disease status using a logistic mixed effects model. The scatter plots comparing the association p-values from GATE and SAIGE (Supplementary Fig. 11) show that for ischemic heart disease and glaucoma, the p-values based on GATE overall tend to be smaller than those from SAIGE, and for female breast cancer and Alzheimer’s disease, the p-values are similar between these two methods. The TTE outcome is different from the binary case-control outcome, and logistic models can result in loss of power for such outcomes, especially for common events. Although the TTE phenotypes in biobanks such as the UK Biobank and FinnGen are currently subject to heavy censoring, as the biobank participants are followed over time, more events will be observed. As events become more common over time in biobank follow-up, the power gain of GWAS analysis of TTE phenotypes using frailty models via GATE over logistic mixed models via SAIGE will increase. Logistic models with age (at disease onset or at the latest follow-up time) as a covariate assume a homogeneous effect (in logit scale) of age on the risk of the disease, which may not be valid, especially when the definition of the age covariate can be different between the cases/failure events (age-of-onset) and the controls/censored (age at the latest follow-up time).
Survival models, on the other hand, are developed specifically to accommodate age-of-onset and age-of-censoring differently, and they model the effect of age on the disease risk non-parametrically in the baseline hazard without the homogeneity assumption.
TTE phenotypes are particularly suited not only for studying disease onsets but also for exploring other progression phenotypes such as times of surgery, recurrence, times of onset of secondary phenotypes after an initial diagnosis, etc. Previously, the lack of scalable GWAS methods for TTE outcomes hindered such investigations on massive scales. By facilitating large-scale GWAS of TTE phenotypes, GATE opens the door to such investigations in the future at genome-wide and phenome-wide scales. Further, modeling TTE phenotypes also has the added advantage of designing appropriate intervention responses. Since frailty models explicitly model the age-of-onset of the disease, one can design interventions based on the genetic predispositions of the subjects, and also based on whether the disease has early or late onset. Logistic models are not particularly suitable for this purpose as they model the effect of age as a homogeneous effect, which is a much stronger assumption compared to the non-parametric modeling of age-of-onset in survival models.
One consideration while analyzing TTE phenotypes is the appropriate choice of unit of time. To assess the impact of time units on the GWAS results, we performed a sensitivity analysis using the event and censoring times rounded to the nearest 1-month, 3-month, 6-month, and 12-month time units for the four exemplary UK Biobank phenotypes presented in this paper, and compared the p-values across different time units (Supplementary Fig. 12). The p-values were very similar across the four time units for all phenotypes, with finer time units resulting in slightly more significant p-values.
For the selection of the number of markers to construct the GRM, there is a trade-off between computation cost and the accuracy of adjusting for sample relatedness. Increasing the number of markers (M) included in the GRM linearly increases the computation time and memory requirement of step 1, whereas using too few markers may not be sufficient to capture the detailed familial and cryptic relatedness among the samples properly^{62}. For the UK Biobank data analysis, we used M = 93,511 LD-pruned high-quality genotyped markers, which were used by the UK Biobank research group for estimating kinship among the samples^{15}. We performed a sensitivity analysis (see Supplementary Note Section 7) by increasing the number of markers to M = 245,975 pruned markers with MAF ≥ 0.01. The results (Supplementary Figs. 13 and 14) showed that the p-values were generally concordant, and the p-values using M = 245,975 markers were slightly larger than the p-values using M = 93,511 markers.
GATE has several limitations. First, similar to other mixed-model methods for genetic association tests, the computation time required for the algorithms to converge in step 1 can vary among different phenotypes and study samples because of differences in heritability and the extent of sample relatedness. Second, to be computationally efficient, GATE uses a score-statistic-based test without fitting the model under the alternative hypothesis. Therefore, it does not provide accurate estimates of hazard ratios for genetic variants. Following a similar approach as in several other mixed-model-based methods^{16,17,19,63}, GATE provides hazard ratio estimates for genome-wide variants using the null model parameter estimates (see Supplementary Note Section 5). Alternatively, the GATE software also allows users to include variants one at a time in the model for step 1 in order to get more accurate hazard ratio estimates. Third, GATE performs single-variant association analysis, which can suffer from low power to detect associations for rare variants. Significant single-variant GWAS findings for rare variants need to be interpreted with caution, and replication of these findings using independent samples is important. To boost the power of rare variant association tests in whole genome/exome sequencing (WGS/WES) studies, set-based rare variant tests have been commonly used. It is of future research interest to extend GATE to mask-based or region-based rare variant set association tests in WGS/WES studies by extending burden, SKAT, and other tests^{61,64} to frailty models for censored time-to-event data.
Fourth, the current version of GATE does not account for left-truncated data, which may not be valid for early-onset phenotypes in biobanks with relatively older participants. For example, the median age of UK Biobank participants is 59 and the earliest dates of health data available are around the late 1990s. Assuming no left-censoring can reduce association power for early-onset diseases. Future work will extend GATE to accommodate left-truncated phenotypes. Fifth, since the follow-up information is based on EHR systems that record ages of diagnosis instead of true ages of onset, the analysis presented in our paper is based on ages of diagnosis. As long as the ages of diagnosis are close to the ages of onset, analyzing them can be reasonable. However, as mentioned before, for left-truncated phenotypes this may not always be the case, and specific care needs to be taken when analyzing such phenotypes. Finally, the frailty model presented in the paper assumes independent censoring, which is a common assumption in the survival analysis literature. However, for certain phenotypes like IHD, the event of death can be a “competing risk”^{65,66,67,68}, which may cause dependent censoring. Competing risk models generally involve other strong assumptions, so we did not consider them in GATE, which is intended to be applied under more general settings. In the future, we plan to include competing risk models in GATE as well for specific phenotypes that may have dependent censoring.
GWAS is an important first step of genetic discovery, as evidenced by the extensive GWAS literature. The functions of many GWAS discoveries are unknown and there is a substantial need to identify causal functional variants of these GWAS disease-associated loci. Numerous large-scale efforts have been ongoing to study the functions of the variants identified by GWAS to accelerate discovery from genetic maps to biological mechanisms to physiology and medicine, and drug target discovery and prioritization. Examples include the recently launched NHGRI Impact of Genomic Variation on Function (IGVF) Consortium, Open Targets, and the International Common Disease Alliance (ICDA).
In summary, we have proposed a scalable and accurate method, GATE, to perform genome-wide PheWAS of TTE phenotypes on large biobank cohorts accounting for population structure, sample relatedness, and heavy censoring. We demonstrated that it is possible to efficiently analyze the current largest biobank (UK Biobank) of >400,000 subjects using GATE. Our method facilitates biobank-based PheWAS of TTE phenotypes, which ultimately contributes towards identifying genetic components that affect the onset and progression of complex diseases.
Methods
Frailty model for time-to-event phenotypes
Consider a study of N subjects, where for the ith subject, we observe the data pair \((\delta_i,\, t_i)\), where \(\delta_i\) is a censoring indicator, with \(\delta_i=1\) if the ith subject experiences an event during the study period, and \(\delta_i=0\) otherwise, i.e., censored. Let \(t_i\) denote the observed event or censoring time. For the ith subject, let the p × 1 vector \(X_i\) denote the covariates, and \(G_i=0,\,1,\,2\) denote the minor allele count for the genetic variant of interest. Then, in a frailty model^{25,32,69}, the conditional hazard function of subject i at time t given the covariates, genotype, and random effect/frailty \(b_i\) is modeled as \(\lambda_i(t \mid X_i,\, G_i,\, b_i)=\lambda_0(t)\exp(X_i^{\top}\beta+G_i\gamma+b_i)\), where β and γ are the regression coefficients of the covariates \(X_i\) and the genotype \(G_i\), respectively, \(\lambda_0(t)\) is the baseline hazard function at time t, and the frailty \(b=(b_1,\ldots,\,b_N)\) follows a multivariate normal distribution \(N(0,\,\tau V)\), with V being the genetic relationship matrix (GRM). Unlike standard generalized linear mixed models, the covariate vector \(X_i\) in a frailty model does not include the intercept term; instead, the baseline hazard \(\lambda_0(t)\) plays the role of the intercept. We test the null hypothesis of no genetic association \({\rm H}_0:\gamma=0\) vs \({\rm H}_1:\gamma\ne 0\).
Estimating the variance component and other null model parameters (step 1)
First, the likelihood for the observed event status–time pairs \((\delta_i,\,t_i)\) under the frailty model is derived and expressed as a modified Poisson mixed-effects model likelihood, with the mean function weighted by the cumulative baseline hazard (CBH) function \(\varLambda_0(t)=\int_0^t\lambda_0(u)\,{\rm d}u\). The CBH function is estimated by Breslow’s estimator \(\hat{\varLambda}_0(t)\) as a step function. Breslow^{70} showed that the maximum likelihood approach for the proportional hazard model (for unrelated subjects) that leads to the estimator \(\hat{\varLambda}_0(t)\) is equivalent to maximizing the partial likelihood proposed by Cox^{1}. In Supplementary Note Section 6, we have shown that the same maximum likelihood approach holds for frailty models (related subjects) as well, given the random effects. Then, using the penalized quasi-likelihood (PQL^{42}) method and the AI-REML^{43} algorithm, the model parameters under \({\rm H}_0\) are estimated iteratively. To avoid storing large N × N GRMs, GATE only calculates the elements of the GRM when they are needed, using raw binary format genotypes. For the scalable computation of quantities of the form \(\mathbf{A}^{-1}x\) that arise in the model fitting steps, where A is a large matrix and x is a vector, GATE uses the preconditioned conjugate gradient (PCG) algorithm^{44}, which has been previously used in BOLT-LMM^{16} and SAIGE^{19} to accurately compute quantities like \(y=\mathbf{A}^{-1}x\) by solving the linear system of equations Ay = x, instead of explicitly inverting the large matrix A.
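To illustrate how a linear system Ay = x can be solved without ever forming or inverting A, the following is a minimal Python sketch of a conjugate gradient solver with a diagonal (Jacobi) preconditioner. It is not the GATE implementation: the function name `pcg_solve`, the choice of preconditioner, and the toy matrix (a diagonal term plus τ times a GRM built from random standardized "genotypes" `Z`) are all assumptions made for illustration.

```python
import numpy as np

def pcg_solve(matvec, x, diag, tol=1e-8, max_iter=500):
    """Solve A y = x by preconditioned conjugate gradient.

    matvec(v) returns A @ v without storing A; diag is the diagonal of A,
    used as a Jacobi preconditioner. A must be symmetric positive definite.
    """
    y = np.zeros_like(x, dtype=float)
    r = x - matvec(y)            # residual
    z = r / diag                 # preconditioned residual
    p = z.copy()                 # search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        y += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(x):
            break
        z = r / diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return y

# Toy system (hypothetical): A = D + tau * Z Z^T / M, never formed explicitly.
rng = np.random.default_rng(0)
N, M, tau = 200, 500, 0.25
Z = rng.standard_normal((N, M))              # stand-in for standardized genotypes
w_inv = 1.0 / rng.uniform(0.5, 2.0, size=N)  # stand-in for the diagonal term

def matvec(v):
    # Computes (D + tau * Z Z^T / M) v using two matrix-vector products.
    return w_inv * v + tau * (Z @ (Z.T @ v)) / M

diag_A = w_inv + tau * (Z ** 2).sum(axis=1) / M
x = rng.standard_normal(N)
y = pcg_solve(matvec, x, diag_A)
```

Because only matrix-vector products with the genotype matrix are needed, memory use in this sketch scales with N × M rather than N × N, which is the same motivation given above for computing GRM elements only when needed.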
Once the null model parameters, random effects, and cumulative baseline hazard functions \((\hat{\beta},\,\hat{b}_i,\,\hat{\varLambda}_0(t_i))\) have been estimated, GATE estimates the variance ratio from a small number of markers. Denote the fitted means by \(\hat{\mu}_i=\hat{\varLambda}_0(t_i)\exp(\mathbf{X}_i^{\top}\hat{\beta}+\hat{b}_i)\), and the weight matrix by \(\hat{\mathbf{W}}={\rm diag}(\hat{\mu}_1,\ldots,\,\hat{\mu}_N)\). Then the score statistic under \({\rm H}_0:\gamma=0\) is \(T=\mathbf{G}^{\top}(\boldsymbol{\delta}-\hat{\boldsymbol{\mu}})=\widetilde{\mathbf{G}}^{\top}(\boldsymbol{\delta}-\hat{\boldsymbol{\mu}})\), where \(\mathbf{G}=(G_1,\ldots,\,G_N)\), \(\boldsymbol{\delta}=(\delta_1,\ldots,\,\delta_N)\), and \(\hat{\boldsymbol{\mu}}=(\hat{\mu}_1,\ldots,\,\hat{\mu}_N)\). The covariate-and-intercept-adjusted genotypes are denoted by \(\widetilde{\mathbf{G}}=\mathbf{G}-\widetilde{\mathbf{X}}(\widetilde{\mathbf{X}}^{\top}\hat{\mathbf{W}}\widetilde{\mathbf{X}})^{-1}\widetilde{\mathbf{X}}^{\top}\mathbf{G}\), where \(\widetilde{\mathbf{X}}=[\mathbf{1}\;\mathbf{X}]\) is the augmented covariate matrix.
Then, the variance of the score statistic under \({\rm H}_0\) is given by \(\mathbf{V}_{\mathbf{T}}=\mathbf{G}^{\top}\hat{\mathbf{Q}}\mathbf{G}=\widetilde{\mathbf{G}}^{\top}\hat{\mathbf{Q}}\widetilde{\mathbf{G}}\), where \(\hat{\mathbf{Q}}=\hat{\mathbf{S}}^{-1}-\hat{\mathbf{S}}^{-1}\mathbf{X}(\mathbf{X}^{\top}\hat{\mathbf{S}}^{-1}\mathbf{X})^{-1}\mathbf{X}^{\top}\hat{\mathbf{S}}^{-1}\) and \(\hat{\mathbf{S}}=(\hat{\mathbf{W}}-\hat{\mathbf{U}})^{-1}+\hat{\tau}\mathbf{V}\). The expression of \(\hat{\mathbf{U}}\) is described in detail in Supplementary Note Section 1.3. Unlike in GLMMs, the term \(\hat{\mathbf{U}}\) appears in the variance of the score statistic due to the attenuation of information (additional variability) from estimating the \(\varLambda_0(t_i)\)s. The variance ratio is then calculated as \(\hat{r}=\frac{\widetilde{\mathbf{G}}^{\top}\hat{\mathbf{Q}}\widetilde{\mathbf{G}}}{\widetilde{\mathbf{G}}^{\top}\hat{\mathbf{W}}\widetilde{\mathbf{G}}}\). GATE calculates the variance ratio based on 30 randomly selected genotyped markers with MAC ≥ 20 and computes the coefficient of variation (CV). If the CV of the variance ratios is smaller than 0.001, then the mean of the variance ratios is selected as \(\hat{r}\); otherwise, more markers are selected in increments of 10 markers, and the CV is recalculated until it becomes smaller than 0.001.
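The marker-selection loop described above can be sketched as follows. This is a hedged illustration, not the GATE code: `estimate_variance_ratio` and the callback `ratio_for_marker` (which stands in for computing the per-marker ratio of the two quadratic forms) are hypothetical names, the CV is taken here as the sample standard deviation divided by the mean, which may differ from the exact definition used in the software, and the stop when candidates run out is an added safeguard.

```python
import numpy as np

def estimate_variance_ratio(ratio_for_marker, candidate_markers,
                            n_start=30, n_step=10, cv_threshold=0.001):
    """Average per-marker variance ratios, adding markers until the CV is small.

    ratio_for_marker(m) is a hypothetical callback returning the variance
    ratio for marker m. Returns (mean ratio, CV, number of markers used).
    """
    ratios = [ratio_for_marker(m) for m in candidate_markers[:n_start]]
    while True:
        arr = np.asarray(ratios, dtype=float)
        cv = arr.std(ddof=1) / arr.mean()   # assumed CV definition: sd / mean
        exhausted = len(ratios) >= len(candidate_markers)
        if cv < cv_threshold or exhausted:
            return float(arr.mean()), float(cv), len(ratios)
        # Add the next batch of markers and recompute the CV.
        nxt = candidate_markers[len(ratios):len(ratios) + n_step]
        ratios.extend(ratio_for_marker(m) for m in nxt)

# Toy usage: per-marker ratios that are nearly constant around 0.8.
rng = np.random.default_rng(1)
toy_ratios = 0.8 + 1e-4 * rng.standard_normal(100)
r_hat, cv, n_used = estimate_variance_ratio(lambda m: toy_ratios[m],
                                            list(range(100)))
```

In this toy case the per-marker ratios are tightly concentrated, so the first 30 markers already satisfy the CV criterion and no further markers are drawn.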
Score test using SPA
Using the estimated variance ratio \(\hat{r}\), the variance-adjusted test statistic can be calculated as \(T_{\rm adj}=\widetilde{\mathbf{G}}^{\top}(\boldsymbol{\delta}-\hat{\boldsymbol{\mu}})/\sqrt{\hat{r}\,\widetilde{\mathbf{G}}^{\top}\hat{\mathbf{W}}\widetilde{\mathbf{G}}}\), which under the null hypothesis has mean zero and unit variance. The traditional score test then assumes asymptotic normality of the score statistic T (and thus \(T_{\rm adj}\) as well) under \({\rm H}_0\) to calculate the p-value. However, it has been observed in the context of logistic mixed models that the asymptotic normality assumption of the score test statistic leads to severe type I error inflation for low-frequency and rare variants when the case-control ratio is unbalanced^{19}. We make the same observation in frailty models when the censoring rate is high. In order to provide well-calibrated p-values in such situations, we used the saddlepoint approximation (SPA) to approximate the null distribution of the score statistic, which has been shown to have better approximation error bounds compared to the normal approximation^{39,41,71,72}, especially in the extremely small tail probability region of \(\alpha=5\times 10^{-8}\). In contrast to the normal approximation, which utilizes only the first two moments, SPA utilizes the entire moment generating function (MGF). More precisely, it uses the cumulant generating function (CGF), i.e., the logarithm of the MGF, which for the frailty model, based on the modified Poisson mixed model likelihood, can be derived as \(K(\xi)=\sum_{i=1}^{N}\hat{\mu}_i(e^{\widetilde{G}_i c\xi}-\widetilde{G}_i c\xi-1)\), where \(c=(\hat{r}\,\widetilde{\mathbf{G}}^{\top}\hat{\mathbf{W}}\widetilde{\mathbf{G}})^{-1/2}\). Then, the distribution of \(T_{\rm adj}\) can be calculated based on the SPA by \(\Pr(T_{\rm adj}<s)\approx\varPhi\left\{w+\frac{1}{w}\log\left(\frac{v}{w}\right)\right\}\), and the p-value is given by \(p=\Pr(T_{\rm adj}<-|s|)+\Pr(T_{\rm adj}>|s|)\), where \(T_{\rm adj}=s\) is the observed adjusted score statistic, \(w={\rm sign}(\hat{\xi})\sqrt{2(\hat{\xi}s-K(\hat{\xi}))}\), \(v=\hat{\xi}\sqrt{K''(\hat{\xi})}\), \(\hat{\xi}\) is the solution to the equation \(K'(\hat{\xi})=s\), and \(K'(\xi)\) and \(K''(\xi)\) are the first and second derivatives of the CGF \(K(\xi)\), respectively.
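As a worked illustration of the saddlepoint tail formula, the Python sketch below evaluates the CGF given above and its derivatives, solves \(K'(\hat{\xi})=s\) by bisection, and returns a two-sided p-value. It is a minimal sketch under stated assumptions rather than the GATE implementation: the function names, the bisection solver, the toy data (2000 subjects with fitted means 0.05, 40 carriers of a rare allele, plain mean-centering of genotypes, and \(\hat{r}=1\)), and the switch to the normal approximation within two standard deviations are all choices made here for illustration.

```python
import math
import numpy as np

def norm_sf(z):
    """Upper tail of the standard normal, Pr(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def spa_tail(s, mu, a, upper, n_bisect=200):
    """SPA to Pr(T > s) if upper else Pr(T < s), T = sum_i a_i (delta_i - mu_i).

    Uses K(xi) = sum_i mu_i (exp(a_i xi) - a_i xi - 1); requires s != 0.
    """
    K = lambda xi: float(np.sum(mu * (np.exp(a * xi) - a * xi - 1.0)))
    K1 = lambda xi: float(np.sum(mu * a * (np.exp(a * xi) - 1.0)))
    K2 = lambda xi: float(np.sum(mu * a ** 2 * np.exp(a * xi)))
    # Bracket the saddlepoint: K1 is increasing with K1(0) = 0.
    lo, hi = (0.0, 1.0) if s > 0 else (-1.0, 0.0)
    for _ in range(60):
        if s > 0 and K1(hi) < s:
            lo, hi = hi, 2.0 * hi
        elif s < 0 and K1(lo) > s:
            lo, hi = 2.0 * lo, lo
        else:
            break
    else:
        raise ValueError("no saddlepoint found for this value of s")
    for _ in range(n_bisect):          # bisection on K1(xi) = s
        mid = 0.5 * (lo + hi)
        if K1(mid) < s:
            lo = mid
        else:
            hi = mid
    xi = 0.5 * (lo + hi)
    w = math.copysign(math.sqrt(2.0 * (xi * s - K(xi))), xi)
    v = xi * math.sqrt(K2(xi))
    z = w + math.log(v / w) / w
    return norm_sf(z) if upper else norm_sf(-z)

def spa_two_sided_p(s, mu, a, normal_cutoff=2.0):
    """Two-sided p-value; normal approximation is used near the mean."""
    sd = math.sqrt(float(np.sum(mu * a ** 2)))
    if abs(s) < normal_cutoff * sd:
        return 2.0 * norm_sf(abs(s) / sd)
    return spa_tail(-abs(s), mu, a, upper=False) + spa_tail(abs(s), mu, a, upper=True)

# Toy data (hypothetical): heavy censoring and a rare variant, with r_hat = 1.
N = 2000
mu = np.full(N, 0.05)                 # fitted means mu_hat_i
G = np.zeros(N)
G[:40] = 1.0                          # 40 minor allele carriers
G_tilde = G - G.mean()                # simple centering, for illustration only
c = 1.0 / math.sqrt(float(np.sum(mu * G_tilde ** 2)))
a = G_tilde * c                       # a_i = G_tilde_i * c
p_spa = spa_two_sided_p(4.0, mu, a)
p_normal = 2.0 * norm_sf(4.0)
```

In this toy setting the statistic is strongly right-skewed, so the SPA p-value at s = 4 is considerably larger than the normal-approximation p-value, which is the kind of discrepancy that leads to inflated type I error when normality is assumed.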
Since the normal approximation works well around the mean, we use the normal approximation when \({T}_{{{\rm {adj}}}}\) is less than two standard deviations away from the mean for faster computation. In addition, a faster version of the SPA similar to Dey et al.^{39} is also implemented which reduces the computation time even further, from O(N) to \(O({N}_{{\rm {c}}})\), where \({N}_{{\rm {c}}}\) is the number of minor allele carriers.
Proportional hazard assumption
The proportional hazard (PH) assumption in frailty models is an extremely popular modeling assumption and has been widely used in biomedical research^{2,3,4}, as well as in GWAS^{5,6,7,8,9,10,11}. In practice, diagnostics for the PH assumption^{73,74,75} are difficult and time-consuming, so testing the PH assumption is impractical at such a large scale (both in sample size and genome-wide). To the best of our knowledge, no scalable diagnostic tool is available for testing proportional hazards of a continuous covariate in a frailty model. However, since millions of variants are tested in a GWAS, the quantile–quantile (QQ) plot works as a more practical alternative tool for model diagnostics. The QQ plot allows researchers to detect any unexpected conservativeness or anti-conservativeness of the p-values that may arise from violations of model assumptions.
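For readers who want to reproduce such a diagnostic, the coordinates of a QQ plot can be computed from a vector of p-values as sketched below. The function name `qq_points` and the plotting positions \((i-0.5)/n\) are assumptions of this sketch; other plotting positions are also in common use.

```python
import numpy as np

def qq_points(pvalues):
    """Return (expected, observed) -log10 p-values for a QQ plot.

    Expected values use the plotting positions (i - 0.5) / n under the
    uniform null distribution; both arrays are ordered from most to least
    significant.
    """
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = p.size
    expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)
    observed = -np.log10(p)
    return expected, observed

# Toy usage: p-values drawn from the null (uniform) distribution.
rng = np.random.default_rng(2)
expected, observed = qq_points(rng.uniform(size=10_000))
```

Points lying above the diagonal in the upper tail suggest anti-conservative p-values, and points below it suggest conservative ones.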
Data simulation
We carried out a series of simulations to evaluate the performance of GATE, including the type I error rates and power. To evaluate whether GATE can control type I error rates in the presence of sample relatedness, we randomly simulated a set of 1,000,000 base-pair “pseudo” sequences, in which variants are independent of each other. Alleles for each variant were randomly drawn from Binomial(n = 2, p = MAF). Then we performed the gene-dropping^{76} simulation using these sequences as founder haplotypes that were propagated through the pedigree of 10 family members shown in Supplementary Fig. 6. We simulated genotypes of 150,000 genetic variants with MAF ≥ 1% for 5000 independent samples and 500 families based on the pedigree to estimate the GRM on-the-fly in Step 1 of GATE, and genotypes of 1.9 million genetic variants with MAC ≥ 20 for association tests in Step 2. MAFs were randomly sampled from the MAF spectrum in UK Biobank imputation data as shown in Supplementary Fig. 8. For each subject i, the censoring time \(T_{{\rm c}i}\) was randomly selected from an exponential distribution with mean \(1/\lambda_{\rm c}\), and the underlying failure time \(T_{{\rm f}i}\) was generated from a frailty model with an underlying exponential hazard function as \(T_{{\rm f}i}=\frac{-\log(U_i)}{\lambda\exp(\eta_i)}\), where \(U_i\sim\) uniform(0, 1) and \(\eta_i\) is the linear predictor. Under the null hypothesis of no genetic effects, \(\eta_i=X_{1i}^{\top}\alpha+b_i\), where \(X_1\) is a covariate that was randomly drawn from \(N(0,\,1)\), the coefficient α is 0.5, and \(b_i\) is the random effect simulated from \(N(0,\,\tau\psi)\), with the variance component parameter τ set to 0.1 or 0.25. The observed time for subject i is \(t_i=\min(T_{{\rm c}i},\,T_{{\rm f}i})\) and \(\delta_i=I(T_{{\rm f}i}\le T_{{\rm c}i})\). We selected λ, the rate parameter of the exponential hazard function, to correspond to censoring rates \(1-\sum_{i=1}^{N}\delta_i/N=50\%,\,75\%\) and 90%. We repeated the simulation 500 times. For each phenotype set, a null frailty model was fitted in Step 1 with the covariate \(X_1\). In Step 2, we conducted single-variant association tests on 1.9 million simulated genetic markers. In total, about 9.4 × 10^{8} association tests were conducted. We evaluated the empirical type I error rates at the significance levels α = 1 × 10^{−6} and 5 × 10^{−8}, as shown in Supplementary Table 3 and Supplementary Fig. 7A. These results indicate that GATE produces well-calibrated type I error rates in the presence of sample relatedness at these significance levels, while GATE-noSPA (similar to COXMEG) has inflated type I error rates, and the inflation grows as the censoring rate increases (Supplementary Table 3). For example, GATE-noSPA has a type I error rate of 8.9 × 10^{−6} at α = 5 × 10^{−8} when the censoring rate is 75% and 2.8 × 10^{−5} when the censoring rate is 90% with τ = 0.1.
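The phenotype-generating step of this simulation can be sketched in Python as follows, with the failure time obtained by inverse-transform sampling, \(T_{\rm f}=-\log(U)/(\lambda\exp(\eta))\). Several parts are assumptions made only for illustration: the frailties are drawn independently (i.e., an identity matrix is used in place of the relatedness structure), the function names are hypothetical, and λ is calibrated by bisection on the expected censoring rate \(\lambda_{\rm c}/(\lambda e^{\eta_i}+\lambda_{\rm c})\) averaged over subjects, since the text does not specify how λ was selected.

```python
import math
import numpy as np

def simulate_tte(eta, lam, lam_c, rng):
    """Simulate observed times t_i and event indicators delta_i."""
    n = eta.shape[0]
    u = 1.0 - rng.uniform(size=n)                   # uniform on (0, 1]
    t_fail = -np.log(u) / (lam * np.exp(eta))       # exponential hazard lam * exp(eta)
    t_cens = rng.exponential(scale=1.0 / lam_c, size=n)  # mean 1 / lam_c
    t = np.minimum(t_fail, t_cens)
    delta = (t_fail <= t_cens).astype(int)          # 1 = event, 0 = censored
    return t, delta

def expected_censoring_rate(eta, lam, lam_c):
    """Mean of Pr(T_c < T_f) for two independent exponential times."""
    h = lam * np.exp(eta)
    return float(np.mean(lam_c / (h + lam_c)))

def calibrate_lambda(eta, lam_c, target, lo=1e-8, hi=1e8, n_iter=200):
    """Bisection (on the log scale) for the lam giving the target censoring rate."""
    for _ in range(n_iter):
        mid = math.sqrt(lo * hi)
        if expected_censoring_rate(eta, mid, lam_c) > target:
            lo = mid     # too much censoring: increase the event rate
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Toy usage under the null: eta_i = 0.5 * X_1i + b_i with tau = 0.25.
rng = np.random.default_rng(4)
n, alpha, tau, lam_c = 10_000, 0.5, 0.25, 1.0
x1 = rng.standard_normal(n)
b = rng.normal(scale=math.sqrt(tau), size=n)   # independent frailties (assumption)
eta = alpha * x1 + b
lam = calibrate_lambda(eta, lam_c, target=0.90)
t, delta = simulate_tte(eta, lam, lam_c, rng)
censoring_rate = 1.0 - delta.mean()
```

With these settings the empirical censoring rate should be close to the 90% target, up to sampling variability.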
To evaluate whether GATE can control type I error rates in the presence of cryptic sample relatedness, we randomly selected N = 10,000 samples with white British ancestry from UK Biobank and simulated TTE phenotypes based on the observed genotypes of these subjects using the approach described above for pedigree-based data sets, except that under the null hypothesis of no genetic effects, the linear predictor \(\eta_i=X_{1i}^{\top}\alpha+\sum_{j=1}^{L}\hat{G}_{ij}\beta\) was simulated based on real genotypes of L = 30,000 randomly selected LD-pruned (r^{2} < 0.2) markers from the odd chromosomes with MAF ≥ 1%. The real genotypes were used to simulate real sample relatedness in the null model. In particular, \(X_1\) is a covariate that was randomly drawn from N(0, 1), the coefficient α is 1, \(\hat{G}_{ij}\) is the standardized genotype value for the jth marker of the ith subject, and β is the genetic effect size following \(N(0,\,\tau/L)\), where τ = 0.25 is the variance component parameter. The observed time for subject i is \(t_i=\min(T_{{\rm c}i},\,T_{{\rm f}i})\) and \(\delta_i=I(T_{{\rm f}i}\le T_{{\rm c}i})\). We selected λ, the rate parameter of the exponential hazard function, to correspond to censoring rates \(1-\sum_{i=1}^{N}\delta_i/N=50\%,\,75\%\) and 90%. We repeated the simulation 100 times. For each phenotype set, a null frailty model was fitted in Step 1 with covariates including \(X_1\) and the first 4 genetic principal components, which were estimated for all white British participants in the UK Biobank. In Step 2, we conducted single-variant association tests on genetic markers on the even chromosomes. In total, 8.3 × 10^{8} association tests were conducted. We evaluated the empirical type I error rates at the significance levels α = 1 × 10^{−6} and 5 × 10^{−8}, as shown in Supplementary Table 4 and Supplementary Fig. 7B, which suggests that GATE produces well-calibrated type I error rates in the presence of cryptic relatedness at the corresponding significance levels.
To evaluate the empirical power of GATE and compare it to that of COXMEG and SAIGE, phenotypes were generated under the alternative hypothesis for 10,000 samples, comprising 500 families and 5000 independent samples. The family pedigree is shown in Supplementary Fig. 6. We simulated the phenotypes for the ith individual under the alternative hypothesis β ≠ 0 in the linear term \(\eta_i=X_{1i}\alpha_1+X_{2i}\alpha_2+b_i+\sum_{j=1}^{10}\hat{G}_{ij}\beta\) of the underlying exponential hazard function for the underlying failure time \(T_{{\rm f}i}=\frac{-\log(U_i)}{\lambda\exp(\eta_i)}\), where \(U_i\sim\) uniform(0, 1), \(\hat{G}_{ij}\) is the genotype value for the jth marker, β is the genetic log hazard ratio, and \(b_i\) is the random effect simulated from \(N(0,\,\tau\psi)\) with τ = 0.25. Two covariates, \(X_1\) and \(X_2\), were simulated from Bernoulli(0.5) and N(0, 1), respectively, with coefficients \(\alpha_1\) and \(\alpha_2=0.5\). λ was chosen to give a censoring rate of 50%. In total, 100 datasets were simulated with 10 genetic markers with different hazard ratios. Power was evaluated at α = 5 × 10^{−8} for MAF 0.05 and 0.2, as presented in Supplementary Figs. 9 and 10.
UK Biobank TTE phenome
The time-to-event phenotypes for the UK Biobank were constructed as the disease phenotypes defined based on the hierarchical PheCodes^{37} that represent different disease groups. The ICD-9 and ICD-10 codes were mapped to PheCodes using a combination of available maps through the Unified Medical Language System and other sources, string matching, and manual review^{19,37}. For each PheCode, the subjects who had the PheCode were regarded as having failure events, and the subjects who did not have the PheCode were regarded as censored. For each failed subject, the TTE (failure time) was calculated by subtracting the birth year from the earliest time of diagnosis of any of the PheCode-specific ICD codes, rounded to the nearest full month. To obtain the TTE (censoring time) for each censored subject, the birth year was subtracted from the time of the last non-imaging visit to any of the UK Biobank ascertainment centers, or the last time any ICD code was recorded for that subject, or the time of death if death was recorded during the course of the study, whichever is latest, rounded to the nearest full month. For lifespan, the subjects who had their death recorded were assigned the failed status with the ages at death as the corresponding TTE, and the subjects who did not have their death recorded were assigned the censored status with the TTE defined as before.
FinnGen
FinnGen is a public–private partnership project combining genotype data from Finnish biobanks and digital health record data from Finnish health registries (https://www.finngen.fi/en). The Release 5 analysis contains 218,792 samples after quality control, with population outliers excluded via principal component analysis based on genetic data. TTE phenotypes were constructed from population registries and ICD-10 codes, with definitions harmonized over ICD-8 and ICD-9, and included ischemic heart disease (N events = 30,952, N censored = 187,838, censoring rate = 85.8%), female breast cancer (N events = 8401, N censored = 114,878, censoring rate = 93.2%), glaucoma (N events = 8591, N censored = 210,199, censoring rate = 96.1%) and Alzheimer’s disease (N events = 3899, N censored = 207,324, censoring rate = 98.2%). We conducted genome-wide survival analysis using GATE with the first ten genetic PCs, sex, genotyping batch, and birth year as covariates and 240,000 pruned genetic markers for GRM estimation.
Patients and control subjects in FinnGen provided informed consent for biobank research, based on the Finnish Biobank Act. Alternatively, older research cohorts, collected prior to the start of FinnGen (in August 2017), were collected based on study-specific consent and later transferred to the Finnish biobanks after approval by Fimea, the National Supervisory Authority for Welfare and Health. Recruitment protocols followed the biobank protocols approved by Fimea. The Coordinating Ethics Committee of the Hospital District of Helsinki and Uusimaa (HUS) approved the FinnGen study protocol No. HUS/990/2017.
The FinnGen study is approved by the Finnish Institute for Health and Welfare (THL), approval number THL/2031/6.02.00/2017, amendments THL/1101/5.05.00/2017, THL/341/6.02.00/2018, THL/2222/6.02.00/2018, THL/283/6.02.00/2019, THL/1721/5.05.00/2019, the Digital and Population Data Service Agency VRK43431/20173, VRK/6909/20183, VRK/4415/20193, the Social Insurance Institution (KELA) KELA 58/522/2017, KELA 131/522/2018, KELA 70/522/2019, KELA 98/522/2019, and Statistics Finland TK53104117. The Biobank Access Decisions for FinnGen samples and data utilized in FinnGen Data Freeze 5 include: THL Biobank BB2017_55, BB2017_111, BB2018_19, BB_2018_34, BB_2018_67, BB2018_71, BB2019_7, BB2019_8, BB2019_26, Finnish Red Cross Blood Service Biobank 7.12.2017, Helsinki Biobank HUS/359/2017, Auria Biobank AB175154, Biobank Borealis of Northern Finland_2017_1013, Biobank of Eastern Finland 1186/2018, Finnish Clinical Biobank Tampere MH0004, Central Finland Biobank 12017, and Terveystalo Biobank STB 2018001.
Genome build
The genomic coordinates reported in this paper were based on NCBI Build 37/UCSC hg19.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
UK Biobank
Individual-level genotype and phenotype data from the UK Biobank are available from http://www.ukbiobank.ac.uk. A formal application to the UK Biobank is required to download the data.
FinnGen
Individual-level genotype data from Finnish biobanks and digital health record data from Finnish health registries (https://www.finngen.fi/en) can be accessed from the Fingenious portal (https://site.fingenious.fi/en/). Formal approval is required for researchers to access the data.
Availability of the GWAS results
The GWAS results for 871 time-to-event phenotypes in UK Biobank using GATE are currently available for public download at http://gate.genohub.org/. Manhattan plots, Q–Q plots, and regional association plots for each TTE phenotype as well as the PheWAS plots can be browsed at http://phewas.genohub.org/. The Registry of Open Data on AWS is accessed through https://registry.opendata.aws/broadukbsumstats/.
Code availability
GATE is implemented as an open-source R package available at https://github.com/weizhou0/GATE^{77}. The FAVOR^{61} portal is accessed through favor.genohub.org.
References
Cox, D. R. Regression models and life-tables. J. R. Stat. Soc. Ser. B (Methodol.) 34, 187–220 (1972).
Lee, E. & Go, O. Survival analysis in public health research. Annu. Rev. Public Health 18, 105–134 (1997).
Altman, D. G., De Stavola, B. L., Love, S. B. & Stepniewska, K. A. Review of survival analyses published in cancer journals. Br. J. Cancer 72, 511 (1995).
Kasza, J., Wraith, D., Lamb, K. & Wolfe, R. Survival analysis of time-to-event data in respiratory health research studies. Respirology 19, 483–492 (2014).
Dunning, A. M. et al. Breast cancer risk variants at 6q25 display different phenotype associations and regulate ESR1, RMND1 and CCDC170. Nat. Genet. 48, 374–386 (2016).
Phipps, A. I. et al. Common genetic variation and survival after colorectal cancer diagnosis: a genome-wide analysis. Carcinogenesis 37, 87–95 (2016).
Johnson, D. C. et al. Genome-wide association study identifies variation at 6q25.1 associated with survival in multiple myeloma. Nat. Commun. 7, 10290 (2016).
Kulminski, A. M. et al. Pleiotropic associations of allelic variants in a 2q22 region with risks of major human diseases and mortality. PLoS Genet. 12, e1006314 (2016).
Wu, C. et al. Genomewide association study of survival in patients with pancreatic adenocarcinoma. Gut 63, 152 (2014).
Lee, S. & Lim, H. Review of statistical methods for survival analysis using genomic data. Genom. Inf. 17, e41 (2019).
Bi, W., Fritsche, L. G., Mukherjee, B., Kim, S. & Lee, S. A fast and accurate method for genome-wide time-to-event data analysis and its application to UK Biobank. Am. J. Hum. Genet. 107, 222–233 (2020).
Green, M. S. & Symons, M. J. A comparison of the logistic risk function and the proportional hazards model in prospective epidemiologic studies. J. Chronic Dis. 36, 715–723 (1983).
Callas, P., Pastides, H. & Hosmer, D. Empirical comparisons of proportional hazards, Poisson, and logistic regression modeling of occupational cohort data. Am. J. Ind. Med. 33, 33–47 (1998).
Staley, J. R. et al. A comparison of Cox and logistic regression for use in genome-wide association studies of cohort and case-cohort design. Eur. J. Hum. Genet. 25, 854–862 (2017).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Loh, P. R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
Svishcheva, G. R., Axenovich, T. I., Belonogova, N. M., van Duijn, C. M. & Aulchenko, Y. S. Rapid variance components-based method for whole-genome association analysis. Nat. Genet. 44, 1166–1170 (2012).
Jiang, L. et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 51, 1749–1755 (2019).
Zhou, W. et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 50, 1335–1341 (2018).
Chen, H. et al. Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am. J. Hum. Genet. 98, 653–666 (2016).
Vaupel, J., Manton, K. & Stallard, E. The impact of heterogeneity in individual frailty on the dynamics of mortality. Demography 16, 439–454 (1979).
Hougaard, P. Frailty models for survival data. Lifetime Data Anal. 1, 255–273 (1995).
Clayton, D. & Cuzick, J. Multivariate generalizations of the proportional hazards model. J. R. Stat. Soc.: Ser. A (Gen.) 148, 82–108 (1985).
Klein, J. P. Semiparametric estimation of random effects using the Cox model based on the EM algorithm. Biometrics 48, 795–806 (1992).
McGilchrist, C. A. REML estimation for survival models with frailty. Biometrics 49, 221–225 (1993).
Petersen, J. H., Andersen, P. K. & Gill, R. D. Variance components models for survival data. Stat. Neerl. 50, 193–211 (1996).
Korsgaard, I. R. & Andersen, A. H. The additive genetic gamma frailty model. Scand. J. Stat. 25, 225–269 (1998).
Wienke, A. Frailty Models in Survival Analysis (Chapman and Hall/CRC, London, 2011).
Yashin, A. I., Vaupel, J. W. & Iachine, I. A. Correlated individual frailty: an advantageous approach to survival analysis of bivariate data. Math. Popul. Stud. 5, 145–159 (1995).
Yashin, A. I. & Iachine, I. A. Genetic analysis of durations: Correlated frailty model applied to survival of Danish twins. Genet. Epidemiol. 12, 529–538 (1995).
Yashin, A. I. & Iachine, I. A. Dependent hazards in multivariate survival problems. J. Multivar. Anal. 71, 241–261 (1999).
Ripatti, S. & Palmgren, J. Estimation of multivariate frailty models using penalized partial likelihood. Biometrics 56, 1016–1022 (2000).
Therneau, T. M., Grambsch, P. M. & Pankratz, V. S. Penalized survival models and frailty. J. Comput. Graph. Stat. 12, 156–175 (2003).
Therneau, T. M. coxme: mixed effects Cox models. https://cran.rproject.org/package=coxme (2019).
He, L. & Kulminski, A. M. Fast algorithms for conducting large-scale GWAS of age-at-onset traits using Cox mixed-effects models. Genetics 215, 41–58 (2020).
He, L. coxmeg: Cox mixed-effects models for genome-wide association studies. https://sites.duke.edu/barusoftware/rpackages/coxme/ (2020).
Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1110 (2013).
Ma, C., Blackwell, T., Boehnke, M., Scott, L. J. & the GoT2D investigators. Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants. Genet. Epidemiol. 37, 539–550 (2013).
Dey, R., Schmidt, E. M., Abecasis, G. R. & Lee, S. A fast and accurate algorithm to test for binary phenotypes and its application to PheWAS. Am. J. Hum. Genet. 101, 37–49 (2017).
Dey, R. et al. Robust meta-analysis of biobank-based genome-wide association studies with unbalanced binary phenotypes. Genet. Epidemiol. 43, 462–476 (2019).
Daniels, H. E. Saddlepoint approximations in statistics. Ann. Math. Stat. 25, 631–650 (1954).
Breslow, N. E. & Clayton, D. G. Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc. 88, 9–25 (1993).
Gilmour, A. R., Thompson, R. & Cullis, B. R. Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics 51, 1440–1450 (1995).
Tsuruta, S., Misztal, I. & Stranden, I. Use of the preconditioned conjugate gradient algorithm as a generic solver for mixed-model equations in animal breeding applications. J. Anim. Sci. 79, 1166–1172 (2001).
McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283 (2016).
Walter, K. et al. The UK10K project identifies rare variants in health and disease. Nature 526, 82–90 (2015).
Gagliano Taliun, S. A. et al. Exploring and visualizing large-scale genetic associations by using PheWeb. Nat. Genet. 52, 550–552 (2020).
Nelson, C. P. et al. Association analyses based on false discovery rate implicate new loci for coronary artery disease. Nat. Genet. 49, 1385–1391 (2017).
Deloukas, P. et al. Large-scale association analysis identifies new risk loci for coronary artery disease. Nat. Genet. 45, 25–33 (2012).
Meyer, K. B. et al. Fine-scale mapping of the FGFR2 breast cancer risk locus: putative functional variants differentially bind FOXA1 and E2F1. Am. J. Hum. Genet. 93, 1046–1060 (2013).
Udler, M. S. et al. Fine scale mapping of the breast cancer 16q12 locus. Hum. Mol. Genet. 19, 2507–2515 (2010).
Stone, E. M. Identification of a gene that causes primary open angle glaucoma. Science 275, 668–670 (1997).
Burdon, K. P. et al. Genome-wide association study identifies susceptibility loci for open angle glaucoma at TMCO1 and CDKN2B-AS1. Nat. Genet. 43, 574–578 (2011).
Moreno-Grau, S. et al. Genome-wide association analysis of dementia and its clinical endophenotypes reveal novel loci associated with Alzheimer’s disease and three causality networks: The GR@ACE project. Alzheimers Dement. 15, 1333–1347 (2019).
Kaplan, E. L. & Meier, P. Nonparametric Estimation from Incomplete Observations (Springer, New York, 1992).
Barber, R. C. et al. Can genetic analysis of putative blood Alzheimer’s disease biomarkers lead to identification of susceptibility loci? PLoS ONE 10, e0142360 (2015).
Wolters, F. et al. The impact of APOE genotype on survival: Results of 38,537 participants from six population-based cohorts (E2-CHARGE). PLoS ONE 14, e0219668 (2019).
Rovio, S. et al. Leisure-time physical activity at midlife and the risk of dementia and Alzheimer’s disease. Lancet Neurol. 4, 705–711 (2005).
Schuit, A. J., Feskens, E. J., Launer, L. J. & Kromhout, D. Physical activity and cognitive decline, the role of the apolipoprotein e4 allele. Med. Sci. Sports Exerc. 33, 772–777 (2001).
Smith, J. C., Nielson, K. A., Woodard, J. L., Seidenberg, M. & Rao, S. M. Physical activity and brain function in older adults at increased risk for Alzheimer’s disease. Brain Sci. 3, 54–83 (2013).
Li, X. et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat. Genet. 52, 969–983 (2020).
Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 46, 100–106 (2014).
Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42, 348–354 (2010).
Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93 (2011).
Satagopan, J. M. et al. A note on competing risks in survival data analysis. Br. J. Cancer 91, 1229–1235 (2004).
Prentice, R. L. et al. The analysis of failure times in the presence of competing risks. Biometrics 34, 541–554 (1978).
Lau, B., Cole, S. R. & Gange, S. J. Competing risk regression models for epidemiologic data. Am. J. Epidemiol. 170, 244–256 (2009).
Andersen, P. K., Geskus, R. B., de Witte, T. & Putter, H. Competing risks in epidemiology: possibilities and pitfalls. Int. J. Epidemiol. 41, 861–870 (2012).
Therneau, T. M. & Grambsch, P. M. Modeling Survival Data: Extending the Cox Model (Springer, New York, 2000).
Breslow, N. E. Discussion of the paper by D. R. Cox. J. R. Stat. Soc. Ser. B (Methodol.) 34, 216–217 (1972).
Barndorff-Nielsen, O. E. Approximate interval probabilities. J. R. Stat. Soc. Ser. B (Methodol.) 52, 485–496 (1990).
Kuonen, D. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika 86, 929–935 (1999).
Grambsch, P. M. & Therneau, T. M. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 81, 515–526 (1994).
Schoenfeld, D. Partial residuals for the proportional hazards regression model. Biometrika 69, 239–241 (1982).
Therneau, T. M., Grambsch, P. M. & Fleming, T. R. Martingale-based residuals for survival models. Biometrika 77, 147–160 (1990).
Abecasis, G. R., Cherny, S. S., Cookson, W. O. & Cardon, L. R. Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet. 30, 97–101 (2001).
weizhouUMICH, J. L. haohao, weizhou0. weizhou0/GATE: v0.42. Zenodo https://doi.org/10.5281/zenodo.6889154 (2022).
Acknowledgements
The FinnGen project is funded by two grants from Business Finland (HUS 4685/31/2016 and UH 4386/31/2016) and 11 industry partners (AbbVie Inc, AstraZeneca UK Ltd, Biogen MA Inc., Celgene Corporation, Celgene International II Sàrl, Genentech Inc, Merck Sharp & Dohme Corp, Pfizer Inc., GlaxoSmithKline, Sanofi, Maze Therapeutics Inc., Janssen Biotech Inc.). The following biobanks are acknowledged for collecting the FinnGen project samples: Auria Biobank (www.auria.fi/biopankki), THL Biobank (www.thl.fi/biobank), Helsinki Biobank (www.helsinginbiopankki.fi), Biobank Borealis of Northern Finland (https://www.ppshp.fi/Tutkimusjaopetus/Biopankki/Pages/BiobankBorealisbrieflyinEnglish.aspx), Finnish Clinical Biobank Tampere (www.tays.fi/enUS/Research_and_development/Finnish_Clinical_Biobank_Tampere), Biobank of Eastern Finland (www.itasuomenbiopankki.fi/en), Central Finland Biobank (www.ksshp.fi/fiFI/Potilaalle/Biopankki), Finnish Red Cross Blood Service Biobank (www.veripalvelu.fi/verenluovutus/biopankkitoiminta), and Terveystalo Biobank (www.terveystalo.com/fi/Yritystietoa/TerveystaloBiopankki/Biopankki/). All Finnish Biobanks are members of BBMRI.fi infrastructure (www.bbmri.fi). This research has been conducted using the UK Biobank Resource under application number 52008. X.L. was supported by NCI R35CA197449, P01CA134294, U19CA203654, and NHLBI R01HL113338. B.M.N. was supported by NHGRI U01HG00908804S3 and NIMH R37MH10764906. R.D. was supported by NCI R35CA197449. W.Z. was supported by an NIH T32 fellowship (Grant number: 1T32HG01046401). A.P. was supported by the Academy of Finland Centre of Excellence in Complex Disease Genetics (Grant No. 312074). We would also like to acknowledge Cotton Seed, the Hail team, and the AWS Open Data Program (see the section “Data availability”) for their help with data storage for UKBB summary statistics, and Hufeng Zhou and Theodore Arapoglou for their valuable help in setting up the website.
Author information
Contributions
R.D., W.Z., X.L., B.M.N., and M.J.D. designed experiments. R.D. and W.Z. performed experiments. R.D. and W.Z. implemented the software with input from X.L., B.M.N., and M.J.D. R.D. constructed phenotypes for UK Biobank data. R.D. and X.L. analyzed UK Biobank data. A.Q., R.D., and W.Z. created the PheWeb browser for UK Biobank results. W.Z., T.K., A.H., A.E., J.K., M.K., and A.P. analyzed data for the FinnGen study. Helpful advice was provided by S.L. R.D. and W.Z. wrote the manuscript with input from all coauthors.
Ethics declarations
Competing interests
B.M.N. is on the Scientific Advisory Board of Deep Genomics, and is a consultant for CAMP4 Therapeutics, Takeda, and Biogen. X.L. is a consultant to AbbVie Pharmaceuticals and Verily Life Sciences. M.J.D. is a founder of Maze Therapeutics and on the scientific advisory board of BC Platforms. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Alexander Kulminski and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Dey, R., Zhou, W., Kiiskinen, T. et al. Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks. Nat Commun 13, 5437 (2022). https://doi.org/10.1038/s4146702232885x
DOI: https://doi.org/10.1038/s4146702232885x