Abstract
While recent advancements in computation and modelling have improved the analysis of complex traits, our understanding of the genetic basis of the time at symptom onset remains limited. Here, we develop a Bayesian approach (BayesW) that provides probabilistic inference of the genetic architecture of age-at-onset phenotypes in a sampling scheme that facilitates biobank-scale time-to-event analyses. We show in extensive simulation work the benefits BayesW provides in terms of number of discoveries, model performance and genomic prediction. In the UK Biobank, we find many thousands of common genomic regions underlying the age-at-onset of high blood pressure (HBP), cardiac disease (CAD), and type-2 diabetes (T2D), with the genetic basis of onset reflecting the underlying genetic liability to disease. Age-at-menopause and age-at-menarche are also highly polygenic, but with higher variance contributed by low-frequency variants. Genomic prediction into the Estonian Biobank data shows that BayesW gives higher prediction accuracy than other approaches.
Introduction
Genome-wide association studies (GWAS) have greatly expanded our understanding of the genetic architecture of complex traits, but have largely focused on binary phenotypes and quantitative traits^{1}, leaving the age-at-onset of symptoms little studied, despite it being one of the key traits in biobank studies of age-related disease. Understanding the environmental and genetic basis of the time at which symptoms first occur is critical for early screening programs and for gaining insight into disease development and progression, especially as the pathological processes of many age-related diseases may be triggered decades before the first symptoms appear. Evidence suggests that genome-wide analyses conducted with case-control phenotypes tend to have less power than their age-at-onset counterparts^{2,3}. Genetic predictors created from case-control studies have been shown to be predictive of age-at-diagnosis^{4}, implying that early onset is to a certain degree indicative of a higher underlying liability to disease. However, our understanding of the genetic architecture of reproductive timing, and of the age at which symptoms first develop for common complex disorders, remains limited.
Statistical modelling of time-to-event data is a highly active research area and is frequently applied in clinical and pharmacogenetic studies. Analogous to single-marker regression in GWAS analyses, a Cox proportional hazards (PH) model^{5} for each single nucleotide polymorphism (SNP) j ∈ {1, . . . , M} can be formulated as \({h}_{i}(t)={h}_{0}(t)\exp ({{\bf{x}}}_{ij}{{\boldsymbol{\beta }}}_{j}),\) where h_{0}(t) is the baseline hazard at time t, h_{i}(t) is the hazard for individual i, x_{ij} is the standardised jth SNP marker value, β_{j} is the effect size of the jth SNP and M is the total number of SNPs^{6,7,8}. Recently, computation times have been improved using approximations for single-marker Cox PH regression^{9}; however, this approach still yields marginal effect size estimates as the markers are not fitted simultaneously. Residual-based approaches have also been widely used: these first regress the phenotype on covariates such as sex or age at entry in a Cox PH model, and then use the residuals in a second regression on the SNP data, with martingale residuals \({\hat{M}}_{i}={d}_{i}-{\hat{{{\Lambda }}}}_{0}({t}_{i})\exp ({Z}_{i}^{t}{\boldsymbol{\gamma }}),\) where \({\hat{M}}_{i}\) is the residual for individual i, d_{i} is the failure indicator (d_{i} = 1 if the event occurred during the study period, otherwise d_{i} = 0), \({\hat{{{\Lambda }}}}_{0}({t}_{i})\) is the baseline cumulative hazard function at time t_{i}, t_{i} is the follow-up time for individual i, Z_{i} is the vector of variables used in the first regression step and γ is the vector of corresponding parameter estimates^{10,11}. The martingale residual approach retains the linearity between the effect and the phenotype and, depending on the model used in the second step, it can also be very fast. However, the failure time and censoring indicator are combined into one summary statistic, rather than the censoring information entering the likelihood directly.
Therefore, the martingale residual approach does not use the censoring information efficiently, diminishing the power of this model. Rather than testing markers one at a time, their effects can be estimated jointly in a mixed-effects Cox PH model, referred to as a frailty model, specified as \({\lambda }_{i}(t| b)={\lambda }_{0}(t)\exp ({X}_{i}^{t}{\boldsymbol{\beta }}+b)\), where β is the effect of the SNP being tested along with other fixed effects such as age or sex, b ~ N(0, σ^{2}Σ) is the N-dimensional vector of random effects (N is the sample size), Σ: N × N is the genetic relationship matrix, σ^{2} is the variance of the genetic component, λ_{0}(t) is the baseline hazard function and λ_{i}(t∣b) is the hazard for individual i. This idea has long been limited by computational resources, and in the latest implementation (COXMEG)^{12} analyses are constrained to around ~10,000 individuals. For joint marker effect estimation, there is also the Cox-LASSO model^{13}, which has recently been adapted to genetic data in the R package snpnet^{14,15}. Fully parametric alternatives include Sparse Bayesian Weibull regression (SBWR), which may outperform LASSO-based approaches^{16}, but as with other Bayesian methods such as SurvEMVS^{17} or the semi-parametric g-prior approach of Held et al.^{18}, the ultra-high dimensionality of genetic data limits their application. Therefore, approaches that can efficiently handle both the complexity and scale of many millions of sequenced individuals with time-to-event outcomes have not been extensively developed, limiting our understanding and our ability to predict disease progression and the timing of symptom onset.
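As a concrete illustration of the two-stage residual approach described above, the sketch below computes martingale residuals under a null first-stage model (no covariates), using a Nelson-Aalen estimate of the cumulative hazard. The function names are illustrative and this is not the implementation used in the cited studies; ties between event times are handled naively.

```python
def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard evaluated at each subject's own time.
    Assumes distinct event times (ties are handled naively in this sketch)."""
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    at_risk, cumhaz = n, 0.0
    lam = [0.0] * n
    for i in order:
        if events[i] == 1:
            cumhaz += 1.0 / at_risk  # hazard increment 1/(number at risk) at an event
        lam[i] = cumhaz
        at_risk -= 1
    return lam

def martingale_residuals(times, events):
    """M_i = d_i - Lambda_hat(t_i): observed minus expected number of events."""
    lam = nelson_aalen(times, events)
    return [d - l for d, l in zip(events, lam)]
```

In the second stage, these residuals would be regressed on each standardised SNP column. Under a null first stage the residuals sum to zero by construction, and censored subjects always receive negative residuals, which is one way to see how event and censoring information get compressed into a single summary statistic.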
Here, we take an alternative approach to obtain accurate inference in full-scale phenotype-genotype sequence data sets, by proposing a mixture of regressions model with variable selection, using different regularisation parameters for genetically motivated groups (see “Methods” section). Our model fits all of the markers jointly in a Bayesian framework under a Weibull assumption for the phenotypes. We show that this approach: (1) allows for contrasting the genetic architectures of age-at-onset phenotypes under a flexible prior formulation; (2) yields marker effect estimates β_{j} that represent the effect of each marker conditional on the effects of all the other markers, accounting for genetic architecture; (3) provides a determination of the probability that each marker and genomic region is associated with a phenotype, alongside the proportion of phenotypic variation contributed by each; and (4) gives a posterior predictive distribution for each individual. Regardless of the phenotypic distribution, our approach greatly improves genomic prediction of the timing of events for each individual and enables better insight into the genetic architecture underlying time-to-event traits.
Results
BayesW model
An overview of our model is as follows. Suppose that M markers are split between Φ different groups. The groups can, for example, be formed based on marker-specific genomic annotations, MAF grouping, grouping based on LD score, etc. We assume for an individual i that the age-at-onset of a disease Y_{i} has a Weibull distribution, with a reparametrisation of the model to represent the mean and the variance of the logarithm of the phenotype as \({\mathbb{E}}(\log {Y}_{i})=\mu +\mathop{\sum }\nolimits_{\varphi = 1}^{{{\Phi }}}{{\bf{x}}}_{i}^{\varphi }{{\boldsymbol{\beta }}}^{\varphi }+{{\bf{z}}}_{i}{\boldsymbol{\delta }}\) and \({\rm{Var}}(\log {Y}_{i})={\pi }^{2}/(6{\alpha }^{2})\) (Eq. (1)),
where μ is the intercept, \({{\bf{x}}}_{i}^{\varphi }\) are the standardised marker values for all SNPs in group φ, β^{φ} are the marker effect estimates for the corresponding group, z_{i} are additional covariate values (such as sex or genetic principal components), δ are the additional covariate effect estimates and α is the Weibull shape parameter (see “Methods” section). For each group, we assume that β^{φ} is distributed according to a mixture of L_{φ} Gaussian components. Each marker estimate β_{j}, j ∈ {1, . . . , M}, is related to a corresponding indicator variable γ_{j} ∈ {0, . . . , L_{φ}}, where L_{φ} is the number of mixture distributions. β_{j} is zero if and only if γ_{j} = 0. We assume that a non-zero β_{j}, where marker j belongs to group φ and has been assigned to mixture component k ≥ 1 (γ_{j} = k), comes from a normal distribution with zero mean and variance \({C}_{k}^{\varphi }{\sigma }_{G\varphi }^{2}\), that is \({{\boldsymbol{\beta }}}_{j} \sim N(0,{C}_{k}^{\varphi }{\sigma }_{G\varphi }^{2})\), where \({\sigma }_{G\varphi }^{2}\) represents the phenotypic variance attributable to markers of group φ and \({C}_{k}^{\varphi }\) is a group- and mixture-specific constant, supplied by the user, that sets the magnitude of variance explained by that specific mixture. For example, specifying \({C}_{1}^{\varphi }=0.0001\), \({C}_{2}^{\varphi }=0.001\) and \({C}_{3}^{\varphi }=0.01\) gives mixtures that, respectively, explain 0.01%, 0.1% and 1% of the genetic variance. The prior probabilities of belonging to each of the mixture distributions are stored in the (L_{φ} + 1)-dimensional vector π^{φ}. Thus the mixture proportions, the variance explained by the SNP markers, and the mixture constants are all unique to, and independent across, SNP marker groups.
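The group-specific spike-and-slab mixture prior described above can be sketched as a simple generative draw; π^{φ}, C^{φ} and σ²_{Gφ} are the user-supplied quantities from the text, and the function name is illustrative rather than taken from the BayesW software.

```python
import random

def draw_marker_effect(pi, C, sigma2_G, rng):
    """Draw one marker effect beta_j from the spike-and-slab mixture prior:
    with probability pi[0] the effect is exactly zero (gamma_j = 0);
    with probability pi[k], k >= 1, it is N(0, C[k-1] * sigma2_G)."""
    u, acc = rng.random(), 0.0
    for k, p in enumerate(pi):
        acc += p
        if u < acc:
            break
    if k == 0:
        return 0.0  # marker excluded from the model
    return rng.gauss(0.0, (C[k - 1] * sigma2_G) ** 0.5)
```

Drawing many effects with, say, π = (0.9, 0.05, 0.03, 0.02) and C = (0.0001, 0.001, 0.01) reproduces the sparsity pattern the prior encodes: roughly 90% of markers are exactly zero, and the non-zero effects fall into variance tiers scaled by the group's genetic variance.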
An algorithm for wholegenome data at biobank scale
We develop a computational framework that overcomes previous limitations for the application of age-at-onset models to large-scale biobank studies. In the model likelihood, we account for right censoring, a situation where only the last known event-free time is recorded, with the event potentially taking place some time in the future (see “Methods” section). Although we did not apply it in our final analysis, we also formulate the model to accommodate left truncation, a situation where individuals are not missing from the data at random, creating differences in the genetic composition of individuals across age groups (see “Methods” section). We implement a parallel sampling scheme for Eq. (1) that allows the data to be split across compute nodes (in a series of MPI tasks), whilst still maintaining the accuracy of the estimation of β_{j}. With T parallel workers (MPI tasks), Bulk Synchronous Parallel (BSP) Gibbs sampling can sample T marker effects in the time sequential Gibbs sampling takes to sample a single one, but BSP requires an extra synchronisation step after each task has processed u markers (see “Methods” section): once each worker has processed u markers, we synchronise the workers by transmitting the residual vector across workers. Given our assumption that the phenotype follows a Weibull distribution, we use a numerical method, adaptive Gauss-Hermite quadrature, for calculating the mixture membership probabilities for variable selection, and Adaptive Rejection Sampling (ARS) for estimating the marker effects. We implement these approaches to take full advantage of the sparsity of genomic data, converting computationally intensive calculations of exponents and dot products into a series of summations. We provide publicly available software (see “Code Availability”) that has the capacity to extend easily to a wider range of models (not just Weibull) than that described here.
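The sparsity optimisation mentioned above can be illustrated as follows: for a standardised genotype column x_j = (g_j − 2f_j)/s_j with raw genotypes g ∈ {0, 1, 2}, a dense dot product with the residual vector collapses to three sums over the non-zero genotype classes. This is a sketch of the general idea, not the exact BayesW kernel, and the variable names are illustrative.

```python
def sparse_dot(ones_idx, twos_idx, eps, freq, sd):
    """Dot product of a standardised SNP column with the residual vector eps,
    using only the index lists of heterozygous (g=1) and homozygous-alt (g=2)
    genotypes. With x_i = (g_i - 2*freq)/sd, the dense dot product equals
    (S1 + 2*S2 - 2*freq*S_tot) / sd, i.e. three summations and no multiplies
    over the (mostly zero) genotype vector."""
    s_tot = sum(eps)                          # sum over all residuals
    s1 = sum(eps[i] for i in ones_idx)        # residual sum over g = 1
    s2 = sum(eps[i] for i in twos_idx)        # residual sum over g = 2
    return (s1 + 2.0 * s2 - 2.0 * freq * s_tot) / sd
```

Because s_tot can be maintained incrementally as the residual vector is updated, each marker's dot product costs time proportional to its number of non-zero genotypes rather than to the sample size.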
Our software enables the estimation of SNP inclusion probabilities, mixture allocations and effect sizes for 2,975,268 SNPs in 371,878 individuals, split between T = 96 workers on 12 compute nodes with a synchronisation rate of u = 10, at an average of 49.7 s per iteration without the groups model and 50.0 s per iteration with the groups model; using 151,472 individuals with T = 64 workers, 8 compute nodes and u = 10, we get an average of 24.8 s per iteration without groups and 26.4 s per iteration with groups. Here, we have chosen to run the chains for 10,000 iterations, leading to execution times of 69 h (N = 151,472) to 139 h (N = 371,878). The run times ultimately depend upon compute cluster utilisation and the genetic architecture of the phenotype, as calls to the ARS procedure scale linearly with the number of markers. The calculations were done using the Helvetios cluster of EPFL (see Code availability).
Simulation study
We show in a simulation study that our model estimates SNP marker effect sizes more accurately, with a greater number of discoveries, and thus obtains better model performance with improved genomic prediction accuracy as compared to other available methods (Fig. 1, Supplementary Figs. 1 and 2). The other methods used for comparison in simulations are Cox-LASSO^{13}, the Bayesian regression mixture model BayesR^{19} applied to martingale residuals, and marginal single-marker regression (OLS) applied to martingale residuals. First, we show that the previous statement holds even in the case of model misspecification (Fig. 1), where the phenotypic distribution does not correspond to a Weibull distribution, but rather conforms to a series of different generalised gamma distributions (of which the Weibull is the special case θ = 1 in the parametrisation of Eq. (1) in the Supplementary Note of Supplementary Information), with differing θ values (see “Methods” section and Supplementary Note in Supplementary Information). In a simulation study of N = 5000 individuals and 50,000 uncorrelated markers with p = 500 randomly selected markers as causal variants, our BayesW model obtains higher out-of-sample prediction accuracy than Cox-LASSO or a martingale residual approach used in several recent studies (Fig. 1a).
Second, we show that this statement also holds in a larger simulation study using a real genomic dataset of N = 20,000 randomly selected UK Biobank individuals and 194,922 correlated genetic markers on chromosome 22 under different censoring levels (Supplementary Fig. 1). Interestingly, in Fig. 1a we observe that the generalised gamma distributions with θ > 1 lead to more accurate genetic predictions compared to the Weibull model (θ = 1). Such phenotypic distributions are easier to discriminate: for distributions with θ > 1, the same difference in genetic values leads to greater phenotypic distribution differences, in terms of Kullback-Leibler divergences, than for θ = 1. Our approach achieves better precision-recall than these approaches (Fig. 1b) across all values of θ and all censoring levels within the data (Supplementary Fig. 2). We choose to use precision-recall curves due to the great imbalance between the number of causal and non-causal markers^{20}; precision (\(\frac{{\mathrm{TP}}}{{\mathrm{TP}}+{\mathrm{FP}}}=1-{\mathrm{FDR}}\)) describes how accurately the markers were identified, while recall (\(\frac{{\mathrm{TP}}}{{\mathrm{TP}}+{\mathrm{FN}}}\)) describes the proportion of causal markers that were discovered. We show that across the range of θ values that generate model misspecification, SNP marker effects remain mostly correctly estimated (Fig. 1c, Supplementary Fig. 9); however, due to the shrinkage effect of the prior distribution on the marker effect size estimates, we observe a very slight underestimation of the effect sizes in this simulation scenario when the model is correctly specified. On the other hand, if the phenotype comes from a log-normal distribution (θ = 0), then due to the inflated genetic variance hyperparameter (Fig. 1d) we see a reduced impact of the priors and less shrinkage, leading to more accurate effect size estimates.
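Precision and recall at each cut-off of a ranked marker list (e.g. markers ranked by posterior inclusion probability) can be computed as below. This sketch is for intuition only and is not the evaluation code behind Fig. 1b; the function name is illustrative.

```python
def precision_recall_curve(scores, causal):
    """Precision/recall at every cut-off of markers ranked by score (descending).
    `scores` holds one score per marker (e.g. posterior inclusion probability);
    `causal` is the set of indices of truly causal markers.
    Returns a list of (precision, recall) pairs, one per cut-off."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    tp, total, points = 0, len(causal), []
    for rank, j in enumerate(order, start=1):
        if j in causal:
            tp += 1
        points.append((tp / rank, tp / total))  # precision = TP/(TP+FP), recall = TP/(TP+FN)
    return points
```

With heavily imbalanced classes (hundreds of causal markers among millions of non-causal ones), these curves are far more informative than ROC curves, since precision directly tracks 1 − FDR at each threshold.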
In general, we recognise that Bayesian modelling may induce slight shrinkage in the effect size estimates due to the priors. Nevertheless, we consider this effect negligible (Supplementary Fig. 9), especially in the context of the improved genetic prediction and more flexible framework that Bayesian modelling enables.
Third, in the Supplementary Note, we derive a definition of SNP heritability, the proportion of among-individual variation in age-at-onset that is attributable to SNP effects, both for the variance of the logarithmed phenotype and on the original scale (see Supplementary Note). We show that the log-scale SNP heritability definition is valid under a Weibull assumption and across the range of θ when restricting the markers entering the model (Fig. 1d, single mixture 0.01), but may be inflated under low θ values (Fig. 1d, mixtures 0.001, 0.01) because of the increase in small-effect false positives that enter the model (Fig. 1e). In addition, we demonstrate that the model is robust to the specification of mixture components (Fig. 1f): the false discovery rate remains bounded even if we add smaller mixtures. To explore false discovery rate and polygenicity in a more realistic scenario, we further simulate different numbers of causal loci on an LD-pruned set of UK Biobank chromosome 1 (M = 230,227) (see “Methods” section). We show that (a) our model accurately captures the effect size distribution (Supplementary Fig. 11a), (b) our model accurately captures the underlying polygenicity (Supplementary Fig. 11b), and (c) our model controls the false discovery rate (Supplementary Fig. 12). Throughout this work, we use the posterior probability of window variance (PPWV)^{21} (see “Methods” section) as the metric to summarise the significance of genetic regions. PPWV is the probability that a genetic region explains at least some fixed proportion of the genetic variance. We show below that false positives are controlled for under all generative models when conducting LD-clump-based variable selection using a PPWV threshold of ≥0.9 (Supplementary Fig. 12), and hence it is justified to use this threshold for calling region-based discoveries and to compare with other methods that are expected to control the false discovery rate.
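PPWV can be estimated directly from posterior samples as the fraction of draws in which a region's share of the genetic variance exceeds the chosen threshold. In this sketch the region's contribution is approximated by the sum of squared standardised effects, ignoring LD between markers; the function name is illustrative and this is not the BayesW implementation.

```python
def ppwv(beta_samples, region_idx, threshold):
    """Posterior probability that a genomic region explains at least `threshold`
    (a proportion, e.g. 1/10000) of the genetic variance.
    beta_samples: list of posterior draws, each a list of effect sizes for
    standardised markers. Region variance is approximated by the sum of
    squared effects (LD between markers is ignored in this sketch)."""
    hits = 0
    for beta in beta_samples:
        total = sum(b * b for b in beta)                       # total genetic variance
        region = sum(beta[j] * beta[j] for j in region_idx)    # region's share
        if total > 0 and region / total >= threshold:
            hits += 1
    return hits / len(beta_samples)
```

A region is then called a discovery when, say, ppwv(...) ≥ 0.9 at the chosen variance threshold, which is the style of decision rule used for the LD-clump and gene-mapped regions in this section.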
Finally, we show that our BSP algorithm is stable under a wide range of synchronisation rates, parallelism, and quadrature point selection (Supplementary Fig. 3).
The genetic architecture of ageatonset
We then applied our model to unrelated UK Biobank individuals of European ancestry with a pruned set of M = 2,975,268 SNPs for five traits: two reproductive phenotypes, age-at-menopause (N = 151,472) and age-at-menarche (N = 200,493), and three common complex diseases (selected as they are among the leading causes of mortality): time-to-diagnosis of type-2 diabetes (T2D) (N = 372,280), coronary artery disease (CAD) (N = 360,715) and high blood pressure (HBP) (N = 371,878) (see descriptive statistics in Supplementary Table 1). Using our BSP Gibbs sampling scheme, we ran a baseline model without any grouping of markers, and then we reran the model grouping markers into 20 MAF-LD bins (quintiles of MAF and then quartiles within each MAF quintile split by LD score). Groups were defined using MAF and LD based on recent theory^{22} and recent simulation study results^{23,24,25,26}, which suggest that accurate estimation of genetic variance might require accounting for the MAF-LD structure. To understand the effect size distribution and genetic architecture, four mixture components were specified such that they would represent 0.001%, 0.01%, 0.1% or 1% of the total genetic variance for the no-groups model (0.00001, 0.0001, 0.001, 0.01). For the group model, the four group-specific mixtures for each of the 20 groups were chosen to be 10 times larger (0.0001, 0.001, 0.01, 0.1), such that they would represent 0.01%, 0.1%, 1% or 10% of the group-specific genetic variances. Additional variables such as sex, UK Biobank assessment centre, genotype chip, and the leading 20 PCs of the SNP data (see “Methods” section) were used as fixed effects in the analysis. We conducted a series of convergence diagnostic analyses of the posterior distributions to ensure we obtained estimates from a converged set of chains (Supplementary Figs. 4, 5, 6 and 7).
Under the assumption that the traits are Weibull distributed, this gives log-scale SNP (pseudo-)heritability estimates (see Supplementary Note of Supplementary Information) of 0.26 (95% CI 0.25, 0.27) for age-at-menopause, 0.41 (95% CI 0.40, 0.42) for age-at-menarche, 0.36 (95% CI 0.35, 0.37) for age-at-diagnosis of HBP, 0.48 (95% CI 0.44, 0.52) for age-at-diagnosis of CAD, and 0.52 (95% CI 0.50, 0.55) for age-at-diagnosis of T2D. The models with and without groups reach similar conclusions in terms of partitioning markers between mixtures (Fig. 2a), indicating that the inference we draw on the genetic architecture is here independent of the group-specific prior specification. However, our BayesW grouped mixture of regressions model allows for contrasting the variance contributed by different MAF and LD groups across traits. For all traits, we find that the majority of the variance contributed by SNP markers is attributable to SNPs that each proportionally contribute an average of 10^{−5} of the genetic variance (Fig. 2a). We find evidence that age-at-menarche is highly polygenic, with 88.1% (95% CI 86.8%, 89.4%) of the genetic variance attributable to the SNPs contributed by markers in the 10^{−5} mixture group, similar to CAD with 74.2% (95% CI 63.6%, 81.5%, Fig. 2b). Age-at-menopause and age-at-T2D diagnosis stand out with 32.3% (95% CI 28.9%, 35.7%) and 18.9% (95% CI 14.6%, 22.9%) of the genotypic variance attributable to the SNPs contributed by markers in the 10^{−3} mixture, respectively (Fig. 2b), indicating a substantial amount of genetic variance resulting from moderate to large effect sizes. In contrast, for the other traits, the moderate to large effect sizes (mixture 10^{−3}) explain a far smaller part of the total genetic variance, with age-at-menarche having almost no genetic variance from that mixture (0.1%, 95% CI 0.0%, 0.6%) and only a small amount coming from that mixture for age-at-HBP diagnosis (5.6%, 95% CI 3.1%, 8.4%) and age-at-CAD diagnosis (9.4%, 95% CI 6.5%, 12.9%).
We find marked differences in the underlying genetic architecture of these different age-at-onset phenotypes (Fig. 2c, d). For age-at-menarche, many rare low-LD SNPs and many common SNPs contribute similar proportions to the phenotypic variance attributable to the SNP markers, implying larger absolute effect sizes for rare low-LD variants per minor allele substitution, with age-at-menopause showing a similar but less pronounced pattern, with a noticeable proportion of the genetic variance stemming from small effect sizes of rare variants (Fig. 2d, MAF quintiles 1–3). In contrast, we find evidence that the phenotypic variance attributable to the SNP markers for age-at-diagnosis of CAD, HBP, and T2D is predominantly contributed by common variants of small effect (Fig. 2d). This implies that female reproductive traits may have been under far stronger selection in our evolutionary past than age-at-diagnosis of modern-day common complex disease^{27}. In summary, we find that most of the phenotypic variance attributable to SNPs is contributed by very many small-effect common variants, but that there are key differences among time-to-event phenotypes, with reproductive traits showing different patterns of genetic architecture to time-to-diagnosis phenotypes.
We then partitioned the SNP markers into regions of LD clumps (see “Methods” section) and determined the genetic variance each of those regions explains. Then, we calculated the probability (PPWV) that each such region contributes at least 1/1000, 1/10,000 or 1/100,000 of the total genotypic variance, providing a probabilistic approach to assess the contribution of different genomic regions to time-to-event phenotypes. The smallest threshold was chosen to be 1/100,000 of the total genotypic variance, corresponding to the smallest mixture component with which the models were estimated, which also represents the magnitude of the smallest effect size we intend the model to capture. We find 291 LD-clumped regions for age-at-menarche with ≥0.95 PPWV at the 1/100,000 threshold, 176 regions for age-at-menopause, 441 regions for age-at-diagnosis of HBP, 67 regions for age-at-diagnosis of CAD, and 108 regions for age-at-diagnosis of T2D from our BayesW grouped mixture of regressions model (Fig. 3a). Our grouped model provides slightly better model performance, as reflected by higher posterior inclusion probabilities at smaller effect sizes (Fig. 3a, Supplementary Fig. 8), with the baseline BayesW mixture of regressions model detecting 13.7% fewer LD-clumped regions for age-at-menarche, 4.5% fewer for age-at-menopause, 34.0% fewer for age-at-diagnosis of HBP, 35.8% fewer for age-at-diagnosis of CAD, and 33.3% fewer for age-at-diagnosis of T2D when using the 1/100,000 PPWV threshold. Similarly, we evaluated region-based significance by calculating PPWV for regions created by mapping markers to the closest gene (Fig. 3b). We find 101 gene-associated regions for age-at-menopause, 119 for age-at-menarche, 41 for time-to-T2D, 159 for time-to-HBP and 20 for time-to-CAD with ≥95% PPWV of explaining at least 1/10,000 of the genetic variance.
In addition, we find evidence for differences in the effect size distribution across traits, largely reflecting differences in power that result from sample size differences and different censoring levels across traits (Fig. 3c, d) (see “Methods” section). Overall, these results suggest many hundreds of genomic regions spread throughout the genome contribute to the timing of common complex traits.
We also compared the LD-clumped regions discovered by BayesW with those discovered by another association method, fastGWA^{28}. Even though fastGWA is a frequentist and BayesW a Bayesian method, so the comparison between the two approaches is not comprehensive, we still use it because both methods should control the false discovery rate (Supplementary Fig. 12) and fastGWA is one of the most recently released methods. Although for age-at-menarche and age-at-menopause we find 191 and 97 regions, respectively, that are concordantly significant according to the two methods (Table 1), we find less concordance for the other traits. For time-to-angina and heart attack, fastGWA does not find any significant regions, and for time-to-HBP BayesW finds substantially more LD-clumped regions (BayesW: 663, fastGWA: 14). The striking difference between the numbers of identified regions can be largely attributed to the larger effective sample size of BayesW, as BayesW can also use the data from censored individuals whereas fastGWA can only use the uncensored individuals. For time-to-diabetes, BayesW identifies more than three times as many regions, but only a small minority of the discovered regions are concordant. Although for time-to-menarche fastGWA identifies more LD-clumped regions, around half of the regions identified by BayesW are still not picked up by fastGWA. We further examined the properties of the discoveries that are not concordant between the two methods. The regions discovered only by BayesW have lower p-values compared to the overall distribution of p-values (Supplementary Fig. 13a), indicating that many of those regions may lack power with fastGWA whereas BayesW manages to identify them; similarly, the regions discovered by fastGWA but not by BayesW tend to have higher PPWV compared to the overall PPWV values (Supplementary Fig. 13b), indicating that some potential signal could be lost when using such a PPWV threshold.
In terms of prediction accuracy, BayesW shows substantially better prediction accuracy in the Estonian Biobank compared to fastGWA when predicting age-at-menarche or age-at-menopause (Fig. 4a, b), indicating that the regions identified by BayesW and their effect size estimates may reflect the genetic architecture more accurately. Therefore, compared with previous association methods, BayesW identifies previously reported regions along with previously unidentified ones; for time-to-diagnosis traits, it can discover more regions by using the censored individuals; and its results yield greatly improved prediction accuracy compared to fastGWA.
Out of sample prediction in an Estonian population
We used the estimates obtained from the group-specific model to predict time-to-event in N = 32,594 individuals of the Estonian Biobank data. We compared our model performance to the Cox-LASSO approach implemented in the R package snpnet^{14,15}, trained on the same UK Biobank data (see “Methods” section), using two metrics. As some of the Estonian Biobank time-to-event phenotypes are censored, we chose to calculate the R^{2} values between the predicted values and the martingale residuals from a Cox PH model in which the true phenotypes are regressed on sex. In addition, we calculated Harrell’s C-statistic^{29} from a Cox PH model in which the true phenotypic values are regressed on the predicted values and sex. BayesW outperforms Cox-LASSO for all phenotypes (Fig. 4a, b), giving an R^{2} of 0.032 compared to Cox-LASSO’s 0.017 for age-at-menopause of 18,134 women and 0.05 compared to Cox-LASSO’s 0.040 for age-at-menarche of 18,134 women. We also see an increase in Harrell’s C-statistic, with BayesW giving 0.623 (s.e. = 0.00443) compared to Cox-LASSO’s 0.593 (0.00455) for age-at-menopause, and for age-at-menarche a C-statistic of 0.598 (0.00290) with BayesW compared to Cox-LASSO’s 0.580 (0.00294). For the age-at-diagnosis traits, we obtain R^{2} values of 0.0047, 0.0236, and 0.0441 for BayesW and 0.0030, 0.0135 and 0.0271 with Cox-LASSO for CAD, T2D, and HBP, respectively. This shows that our BayesW model gives higher prediction accuracy than the Cox-LASSO method, in line with our simulation study results.
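Harrell's C-statistic used above counts, among usable pairs (those where the earlier time is an observed event), the proportion in which the predictions are concordant with the outcomes. The naive O(N²) sketch below conveys the definition; production implementations are more refined (e.g. in their handling of tied event times).

```python
def harrell_c(time, event, pred):
    """Harrell's C for right-censored data. `pred` is a predicted event time
    (larger = predicted later onset). A pair (i, j) is usable if time[i] < time[j]
    and subject i had an observed event; it is concordant if pred[i] < pred[j].
    Tied predictions count as half-concordant."""
    conc = ties = usable = 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if time[i] < time[j] and event[i] == 1:
                usable += 1
                if pred[i] < pred[j]:
                    conc += 1
                elif pred[i] == pred[j]:
                    ties += 1
    return (conc + 0.5 * ties) / usable
```

A C of 0.5 corresponds to uninformative predictions and 1.0 to perfect ranking of onset times, which is why the reported values around 0.58–0.62 represent modest but genuine discriminative ability.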
We then compared the BayesW prediction results to those obtained from a case-control analysis. In a companion paper^{22}, we develop a group-specific BayesR approach, which we use here to analyse the indicator variable (0 = no registered disease, 1 = reported disease) for HBP using the same data and a liability model, to facilitate a direct comparison of the methods. For CAD and T2D, we use the results of the companion paper, where there were almost twice as many case observations (for CAD, BayesR had 39,776 vs. BayesW 17,452 cases; for T2D, BayesR had 25,773 vs. BayesW 15,813 cases), as it included those with a confirmed diagnosis but no age information, and 8.4 million SNPs were analysed. For the prediction of age-at-diagnosis, we compared the R^{2} values between the predicted values and the martingale residuals from the Cox PH model, and Harrell’s C-statistic (Fig. 4a, b). For HBP and CAD, we find that BayesW marginally outperforms BayesR (HBP R^{2} BayesW 0.0441, BayesR 0.0437; CAD R^{2} BayesW 0.0047, BayesR 0.0046), and for T2D BayesR marginally outperforms BayesW (R^{2} BayesW 0.0236, BayesR 0.0262). A similar ranking is observed when using Harrell’s C-statistic for comparison (Fig. 4b). We then compared the approaches when predicting 0/1 case-control status, rather than age-at-diagnosis (Fig. 4c, d). We find that for predicting HBP, BayesW marginally outperforms BayesR in terms of R^{2} (BayesW 0.0375, BayesR 0.0365) and area under the precision-recall curve (BayesW 0.339, BayesR 0.336) (used because of the imbalance between cases and controls); for predicting CAD or T2D, despite the increased case sample size, BayesR only marginally outperforms BayesW (T2D R^{2} BayesW 0.0127, BayesR 0.0136 and AUC BayesW 0.0766, BayesR 0.0799; CAD R^{2} BayesW 0.0025, BayesR 0.0027 and AUC BayesW 0.0920, BayesR 0.0941).
Therefore, we obtain very similar prediction accuracies with both methods when predicting case-control phenotypes, even though the BayesW model was estimated using time-to-event phenotypes with fewer cases for T2D and CAD.
A finding of similar prediction accuracy is unsurprising given the striking concordance between the results of the two models when partitioning the genotype into 50-kb regions. We calculated (on a logarithmic scale) the mean proportion of genetic variance attributed to each 50-kb region by the model using the case-control phenotype (BayesR) and by the model using the time-to-event phenotype (BayesW). Both models attribute similar amounts of genetic variance to the same 50-kb regions (Supplementary Fig. 10), with correlations between the logarithmic proportions of genetic variance of 0.941, 0.647 and 0.554 for HBP, T2D and CAD, respectively. Thus, we have shown that using either time-to-event or case-control data for genome-wide association analysis, we find a similar amount of genetic variance attributed to the same regions, and both analyses have similar predictive performance when predicting case-control phenotypes. This suggests that to some extent the case-control and time-to-event phenotypes are interchangeable, demonstrating that both phenotypes describe a similar latent mechanism.
The BayesW model enables posterior predictive distributions to be generated for each individual. To evaluate the predictive performance of BayesW on the reproductive traits, we calculated the 95% credibility prediction intervals for each of the subjects in the Estonian Biobank. We evaluated the reproductive traits only, as these events are certain to occur for every woman who reaches a sufficient age. For age-at-menopause, 92.3% and for age-at-menarche, 94.8% of the true uncensored phenotypes lie within the BayesW 95% credibility prediction intervals. This demonstrates that even though the prediction R^{2} values for these traits are not very high, owing to the low genetic variance underlying the phenotypic variance, our Bayesian model quantifies the model uncertainty and yields well-calibrated subject-specific prediction intervals. An example of the shape of the distribution is shown in Supplementary Fig. 15. A caveat is that the subject-specific prediction intervals can be rather wide, though approximately half the width of the data range. For age-at-menopause, the data range from 34 to 63 (a width of 29 years), and the width of the 95% credibility intervals ranged from 13.4 to 18.6 years, with a median width of 15.9 years. For age-at-menarche, the data values range from 9 to 19, and the posterior predictive interval width ranged from 5.1 to 7.9 years, with a median width of 6.3 years.
Discussion
Here, we have shown that our BayesW mixture of regressions model provides inference as to the genetic architecture of reproductive timing and the age at which symptoms first develop for common complex disorders. We provide evidence for an infinitesimal contribution of many thousands of common genomic regions to variation in the onset of common complex disorders, and for the genetic basis of age-at-onset reflecting the underlying genetic liability to disease. In contrast, while age-at-menopause and age-at-menarche are highly polygenic, average effect sizes and the variance contributed by low-frequency variants are higher, supporting a history of stronger selection pressure in the human population^{27}.
Genome-wide association studies of time-to-event phenotypes are critical for gaining insights into the genetics of disease etiology and progression, but their application has been hampered by the computational and statistical challenges of the models, especially when the predictors are ultra-high-dimensional. Our hybrid-parallel sampling scheme enables a fully Bayesian model to be applied to large-scale genomic data, and we show improved genomic prediction over competing approaches, not only in the R^{2} or C-statistic obtained but also in the inference that can be obtained from a full posterior predictive distribution. Previous evidence shows that cohort studies using proportional hazards (or Cox) regression models generally increase statistical power compared to case-control studies using logistic regression models^{2,3}. Our results support this, and we expect the benefits to become more evident as the number of cases accrues with accurate age-at-diagnosis information.
A typical approach in time-to-event analysis is the Cox PH model^{5}, which uses a nonparametric estimate for the baseline hazard and then estimates the other effect sizes proportional to this hazard. Our BayesW model is also a proportional hazards model, with the constraint that the baseline hazard follows a Weibull distribution, and thus marker effect size estimates have a similar interpretation to those from a Cox PH model. Interestingly, the results from both our simulation study and the real data analysis show that, when quantifying the significance of the markers and estimating the marker effect sizes, it may not be essential to capture the baseline hazard with a nonparametric method. The simulations show that even in the misspecified cases BayesW performs better than the semiparametric Cox model, demonstrating that a parametric assumption can be more descriptive than defaulting to the standard Cox PH model.
There has been a significant amount of work on the heritability of time-to-event traits. For example, it has been suggested to define heritability in the Weibull frailty model on the log-time scale^{30}, or on the log-frailty scale in the Cox PH model^{31}. Transforming the log-scale heritability to the original scale^{32} has then required approximations, and the resulting original-scale heritability has not been easy to interpret and use^{33}. Here, using a similar idea of partitioning the total phenotypic variance into genetic and error variance components, we present an expression for SNP heritability on the log-time scale. We then show that there exists a natural correspondence between log-scale and original-scale heritability, without the need for any approximations, with the two scales giving similar estimates as the Weibull shape parameter increases. Therefore, under Weibull assumptions, we provide a definition of SNP heritability for time-to-event traits for the GWAS era.
There are a number of key considerations and limitations. The assumption of a Weibull distribution for the traits considered here can induce bias in the hyperparameter estimates, although we have shown that this assumption yields accurate results in terms of prediction regardless of the phenotypic distribution. A third parameter could be introduced through the use of a generalised gamma distribution, and this will be the focus of future work, as it should allow for unbiased hyperparameter estimation irrespective of the trait distribution. In this study, we have used hard-coded genotypes to make the method computationally efficient, which can result in reduced covariance between the imputed marker and the trait. However, we do not believe this to be a hindrance to our method or its application in this work, as hard-coded genotypic values will likely be the norm with the upcoming release of whole-genome sequence data, and our aim is to provide a time-to-event model that is capable of scaling to these data requirements. We apply our approach only to markers that are imputed in both the UK Biobank and the Estonian genome centre data; by selecting markers present in both populations, we favour markers that impute well across human populations.
In addition, despite allowing for left-truncation in the likelihood, we focus on presenting a series of baseline results before extending our inference to account for differences in sampling, semi-competing risks across different outcomes, genomic annotation enrichment, and sex differences both in the effect sizes and in the sampling of different time-to-event outcomes, all of which require extensions to the modelling framework and are also the focus of future work. Furthermore, we do not consider time-varying coefficients or time-varying covariates, which may improve inference as multiple measurements over time are collected in biobank studies. Nevertheless, this work represents a first step toward large-scale inference of the genomic basis of variation in the timing of common complex traits.
Methods
Parametrisation of Weibull distribution
We define Y_{i} as the time-to-event for individual i, with Weibull distribution Y_{i} ~ W(a, b_{i}), where a and b_{i} are the shape and scale parameters, respectively. The survival function is \(S({y}_{i})=\exp \left\{-{\left(\frac{{y}_{i}}{{b}_{i}}\right)}^{a}\right\}\).
We are interested in modelling the mean and the variance of the time-to-event. Unfortunately, the mean and the variance of the Weibull distribution are dependent, as both parameters appear in both expressions. Moreover, as the expressions for the mean and variance contain gamma functions, it is rather difficult to make each depend on only one parameter. One possible solution is to use \({\mathrm{log}}\,{Y}_{i}\) and its moments instead. If Y_{i} ~ W(a, b_{i}), then \({\mathrm{log}}\,{Y}_{i}\) has a Gumbel distribution, with mean and variance \(E({\mathrm{log}}\,{Y}_{i})={\mathrm{log}}\,{b}_{i}-\frac{K}{a}\) and \(Var({\mathrm{log}}\,{Y}_{i})=\frac{{\pi }^{2}}{6{a}^{2}}\),
where K is the Euler-Mascheroni constant, K ≈ 0.57721. This parametrisation makes the variance depend on only a single parameter, which we denote α = a. As we are interested in modelling SNP effects β, covariates δ (sex, PCs) and the average scale for the time-to-event μ (intercept), it is possible to introduce them in the following way, \({b}_{i}=\exp \left(\mu +{{\bf{x}}}_{i}^{\prime}{\boldsymbol{\beta }}+{{\bf{z}}}_{i}^{\prime}{\boldsymbol{\delta }}+\frac{K}{\alpha }\right)\), resulting in \(E({\mathrm{log}}\,{Y}_{i})=\mu +{{\bf{x}}}_{i}^{\prime}{\boldsymbol{\beta }}+{{\bf{z}}}_{i}^{\prime}{\boldsymbol{\delta }}\) (6) and \(Var({\mathrm{log}}\,{Y}_{i})=\frac{{\pi }^{2}}{6{\alpha }^{2}}\) (7),
where x_{i} is the vector of scaled SNP marker values and z_{i} is the vector of covariate values for individual i, and π = 3.14159... is a constant. The third and fourth moments of \({\mathrm{log}}\,{Y}_{i}\) are constant regardless of the parametrisation.
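As a quick numerical illustration of this parametrisation (our own sketch, not part of the analysis pipeline; `eta` is a hypothetical stand-in for the linear predictor μ + x'β + z'δ), sampling from a Weibull with scale \(b_i=\exp(\eta + K/\alpha)\) should give \(\log Y_i\) a mean of η and a variance of π²/(6α²):

```python
import numpy as np

# Hedged sketch: verify the Gumbel moments of log(Y) under the BayesW
# parametrisation, with 'eta' standing in for mu + x'beta + z'delta.
rng = np.random.default_rng(1)
K = 0.5772156649             # Euler-Mascheroni constant
alpha = 2.0                  # Weibull shape (alpha = a)
eta = 3.5                    # hypothetical linear predictor
b = np.exp(eta + K / alpha)  # scale, as in the text

y = b * rng.weibull(alpha, size=1_000_000)  # numpy draws a scale-1 Weibull
log_y = np.log(y)

print(log_y.mean())  # close to eta = 3.5
print(log_y.var())   # close to pi^2 / (6 alpha^2)
```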
Modelling time-to-event and age-at-onset
As a baseline model, we propose to test the association of Y_{i} with a series of covariates (SNP markers in this case) X using a mixture of regressions model, with γ_{j} as the mixture indicator: γ_{j} = k if the jth marker is included in the kth mixture component of the model, k ∈ {1, . . . , L}, and γ_{j} = 0 if it is not included in the model. The expected value of the logarithm of the time-to-event is then a linear combination of the markers included in the model plus the effect of the covariates and the intercept (μ), as in Eq. (6), and the error variance is expressed via the shape parameter, as shown in Eq. (7). β_{j} has a non-zero value if and only if γ_{j} ≥ 1. We assume that a non-zero β_{j} from mixture component k > 0 (γ_{j} = k) comes from a normal distribution with zero mean and variance \({C}_{k}{\sigma }_{G}^{2}\), that is, \({{\boldsymbol{\beta }}}_{j} \sim N(0,{C}_{k}{\sigma }_{G}^{2})\).
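The spike-and-slab mixture prior on the effects can be sketched as follows (an illustrative simulation only; the mixture probabilities and constants below are hypothetical, not values used in the paper):

```python
import numpy as np

def draw_effects(M, pi, C, sigma_g2, rng):
    """Sketch of the mixture prior: gamma_j = k with probability pi[k];
    beta_j = 0 when k = 0 (the spike), else beta_j ~ N(0, C[k]*sigma_G^2)."""
    gamma = rng.choice(len(pi), size=M, p=pi)
    sd = np.sqrt(np.take(C, gamma) * sigma_g2)  # C[0] = 0 encodes the spike
    return gamma, rng.normal(0.0, 1.0, size=M) * sd

gamma, beta = draw_effects(
    M=100_000,
    pi=[0.9, 0.05, 0.05],            # hypothetical mixture probabilities
    C=np.array([0.0, 0.001, 0.01]),  # hypothetical mixture constants
    sigma_g2=0.5,
    rng=np.random.default_rng(7),
)
```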
The survival and density functions for Y_{i} are, respectively, \(S({y}_{i})=\exp \{-\exp ({v}_{i})\}\) and \(f({y}_{i})=\frac{\alpha }{{y}_{i}}\exp \{{v}_{i}-\exp ({v}_{i})\}\), where \({v}_{i}=\alpha ({\mathrm{log}}\,{y}_{i}-\mu -{{\bf{x}}}_{i}^{\prime}{\boldsymbol{\beta }}-{{\bf{z}}}_{i}^{\prime}{\boldsymbol{\delta }})-K\).
The likelihood function for the right-censored and left-truncated data of n individuals is then
where d_{i} is the failure indicator, d is the total number of events at the end of the periods, and a_{i} is the time of left truncation. It is possible to use the model without left truncation: in that case, for every i, we assume that a_{i} = 0. Whenever a_{i} = 0, we naturally define \(\exp (\alpha ({\mathrm{log}}\,({a}_{i})-\mu -{{\bf{x}}}_{i}^{\prime}{\boldsymbol{\beta }}-{{\bf{z}}}_{i}^{\prime}{\boldsymbol{\delta }}))=0\), so that left truncation does not contribute to the likelihood in Eq. (10).
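A direct implementation of this log-likelihood could look as follows (a hedged sketch on the log-time scale, not the paper's code; `eta` denotes the linear predictor μ + x'β + z'δ and the function name is our own):

```python
import numpy as np

K = 0.5772156649  # Euler-Mascheroni constant

def weibull_loglik(alpha, eta, log_y, log_a, d):
    """Log-likelihood sketch for right-censored, left-truncated Weibull
    time-to-event data under the parametrisation in the text.
    d is the failure indicator; pass log_a = -inf where a_i = 0, so that
    the truncation term exp(u_i) vanishes, as in the text."""
    v = alpha * (log_y - eta) - K   # event / censoring times
    u = alpha * (log_a - eta) - K   # left-truncation times
    # sum of d_i*log f(y_i) + (1 - d_i)*log S(y_i) - log S(a_i)
    return np.sum(d * (np.log(alpha) - log_y + v) - np.exp(v) + np.exp(u))
```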
Let the prior distribution of α be a gamma distribution with parameters α_{0}, κ_{0}
the prior for β_{j} be normal:
the prior for \({\sigma }_{G}^{2}\) be an inverse gamma distribution with parameters α_{σ} and β_{σ}
the prior for δ_{q} (qth covariate) be normal with variance parameter \({\sigma }_{{\boldsymbol{\delta }}}^{2}\):
the prior for μ be normal with variance parameter \({\sigma }_{\mu }^{2}\):
the prior for γ_{j} be multinomial:
the prior probabilities of belonging to each of the mixture distributions k are stored in the (L + 1)-dimensional vector π, with the prior for π a Dirichlet distribution
where I(⋅) is the indicator function and p_{L} is the (L + 1)-dimensional vector of prior values. For the exact values of the prior specification, see Data Analysis Details.
The conditional posterior distribution for \({\sigma }_{G}^{2}\) is inverse gamma with parameters
where γ^{k} denotes the set of indices j for which γ_{j} = k. The conditional posterior distribution for π is a Dirichlet distribution
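A Gibbs step for \({\sigma }_{G}^{2}\) might be sketched as follows (our own hedged reconstruction using the standard conjugate inverse-gamma update implied by the prior above, not the paper's exact update equations; names are hypothetical):

```python
import numpy as np

def draw_sigma_g2(beta, gamma, C, alpha_sigma, beta_sigma, rng):
    """Hedged sketch of a conjugate inverse-gamma Gibbs draw for sigma_G^2,
    conditional on the currently included effects, assuming
    beta_j ~ N(0, C[k] * sigma_G^2) for gamma_j = k >= 1 (assumed form)."""
    incl = gamma >= 1
    m = incl.sum()                 # number of non-zero effects in the model
    ssq = np.sum(beta[incl] ** 2 / np.take(C, gamma[incl]))
    shape = alpha_sigma + m / 2.0
    rate = beta_sigma + ssq / 2.0
    # an inverse-gamma(shape, rate) draw is 1 / Gamma(shape, scale=1/rate)
    return 1.0 / rng.gamma(shape, 1.0 / rate)
```

With many included markers the draws concentrate near the variance implied by the current effects, as expected for a conjugate update.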
Unfortunately, there is no simple form for the conditional posteriors of α, μ, β_{j} and δ_{q}. However, these conditional posterior distributions are log-concave (see Supplementary Note), and thus sampling for α, μ, β_{j} and δ_{q} can be conducted using adaptive rejection sampling, which requires only the log posteriors. Denoting β_{−j} as all the β parameters excluding β_{j}, and δ_{−q} as all the δ parameters excluding δ_{q}, these are
Selection of the mixture component
We perform variable selection and select mixture components using the idea of spike-and-slab priors^{34}, where the spike part of the prior has a point mass at 0. A SNP is assigned to a mixture component by comparing the ratios of the marginal likelihoods. For mixture selection for the jth SNP, we need to find the following marginal likelihood for every k. Suppose here that C_{0} > 0 is the factor for the 0th mixture (spike)
where D represents the observed data, Q is a positive constant that is not dependent on k and
The probability for γ_{j} is:
where C is a positive constant that is not dependent on k. Denoting \({{\Theta }}=\{D,{{\boldsymbol{\beta }}}_{j},{\boldsymbol{\delta }},{\sigma }_{G}^{2},\alpha ,\mu \}\), the probability to include SNP j in the component k can be calculated as
For every k
Here, the numerator represents the marginal likelihood assuming the jth variable is included in the kth mixture component.
In general, it is not possible to find an analytic expression for the integrals presented in Eq. (23), so a numerical method has to be used to approximate their values. For this, we use adaptive Gauss-Hermite quadrature, as the integral is improper with infinite endpoints.
We start by rewriting the expression (Eq. 24) as
where \({v}_{i}=\alpha ({\mathrm{log}}\,{y}_{i}-\mu -{{\bf{x}}}^{\prime} {{\boldsymbol{\beta }}}_{j}-{{\bf{z}}}_{i}^{\prime}{\boldsymbol{\delta }})-K\) and u_{i} is analogous, with a_{i} in place of y_{i}. We introduce a reparameterisation with variable s
and therefore we get from Eq. (23)
In the last expression in Eq. (27), the term \(Q\sqrt{2{\sigma }_{G}^{2}}\) cancels out from the numerator and the denominator.
If the smallest mixture variance factor C_{0} > 0, then the corresponding spike distribution is absolutely continuous. As we would like to use a Dirac spike instead, we define the corresponding marginal likelihood as the limit of the expression as C_{0} → 0+.
We are only interested in C_{0} in the limiting process, so without loss of generality we define C_{0} through an auxiliary positive integer variable l as \({C}_{0}=\frac{1}{l}\), and using the reparametrisation result from Eq. (30) we get that
As f(s, l) ≤ 1 for every possible combination of arguments (because, in the data, censoring or an event occurs only after entering the study), we can write
which means that the integrand in Eq. (32) is dominated by \(\exp \{-{s}^{2}\}\). Furthermore, we see that the limit of the integrand is
As \(\int \exp \{-{s}^{2}\}ds\,<\,\infty \), it is possible to use Lebesgue's dominated convergence theorem, and therefore
In conclusion, we have derived the expression for the marginal likelihood for the Dirac spike variance component as
Adaptive GaussHermite quadrature
Plain Gauss-Hermite quadrature could be used; however, adequate precision can require a large number of quadrature points, leading to more calculations. Adaptive Gauss-Hermite quadrature makes the procedure more efficient. For any function g_{k}(s) as defined in Eq. (30), we can write
where \(\hat{\mu }\) could be chosen as the mode of g_{k}(s) and \(\hat{\sigma }=\frac{1}{\sqrt{-({\mathrm{log}}\,{g}_{k}(s))^{\prime\prime} {| }_{s = \hat{\mu }}}}\); m is the number of quadrature points, t_{r} are the roots of the mth order Hermite polynomial and w_{r} are the corresponding weights^{35}.
Finding the posterior mode can be computationally cumbersome, and calculating \(\hat{\sigma }\) requires evaluating the logarithm of g_{k} at this mode. As we assume that the effect sizes are distributed symmetrically around zero, we use \(\hat{\mu }=0\), which avoids numerical posterior mode calculations and evaluating the second derivative at different posterior modes.
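The quadrature rule with the mode fixed at \(\hat\mu = 0\) can be sketched as follows (an illustrative implementation, not the production code; `g` is any integrand and the function name is our own):

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def adaptive_gh(g, sigma_hat, m=11, mu_hat=0.0):
    """Adaptive Gauss-Hermite sketch: approximate the integral of g(s) over
    the real line as sqrt(2)*sigma_hat * sum_r w_r * exp(t_r^2)
    * g(mu_hat + sqrt(2)*sigma_hat*t_r), where (t_r, w_r) are the Hermite
    roots and weights; here mu_hat = 0, as in the text."""
    t, w = hermgauss(m)
    s = mu_hat + np.sqrt(2.0) * sigma_hat * t
    return np.sqrt(2.0) * sigma_hat * np.sum(w * np.exp(t ** 2) * g(s))

# e.g. integrating a N(0, 0.6^2) density over the real line gives ~1.0
density = lambda s: np.exp(-s ** 2 / (2 * 0.36)) / np.sqrt(2 * np.pi * 0.36)
print(adaptive_gh(density, sigma_hat=0.6))
```

The rescaling by \(\hat\sigma\) places the quadrature points where the integrand has most of its mass, which is why few points (11 in the main analyses) suffice.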
Posterior inclusion probability
Combining the previous results, we get a numerical solution for calculating the posterior inclusion probability. For every k > 0, the inclusion probabilities are
Similarly, we can find the probability of excluding the marker (γ_{j} = 0)
In both cases, \({\hat{\sigma }}_{k}\) is calculated as
For computational purposes, we evaluate \(\frac{{g}_{k}({\hat{\sigma }}_{k}\sqrt{2}{t}_{r})}{T}\) as
Adaptive rejection sampling
To sample α, μ and β_{j}, δ_{q}, we use Adaptive Rejection Sampling, initially outlined by Gilks and Wild^{36}. The prerequisite of the method is the logconcavity of the sampled density function.
The idea of the method is to build an envelope around the log-density. The lower hull is constructed by evaluating the function at some prespecified abscissae and connecting the evaluation results with linear functions, resulting in a piecewise linear lower hull. The upper hull can be constructed either by using tangents evaluated at the prespecified abscissae (derivative-based ARS) or by extending the linear functions obtained in the construction of the lower hull (derivative-free ARS^{37}). Although the derivative-based method might result in a more accurate upper hull, thus leading to faster sampling, it requires evaluating derivatives, and we therefore employ the derivative-free method.
Proposals are sampled from the appropriately scaled exponential of the upper hull, which is easy to sample from. Each proposal goes through an acceptance test; if the proposal is not accepted, it is included in the set of used abscissae to create a more accurate envelope in the next round. The method therefore requires specifying the log posterior and at least three initial abscissae, with at least one abscissa larger and one smaller than the posterior mode. To set the abscissae for some parameter θ, we could, for example, choose the abscissae \(\{\hat{\theta }-{c}_{\theta },\hat{\theta },\hat{\theta }+{c}_{\theta }\}\), where \(\hat{\theta }\) is ideally the posterior mode for θ, and c_{θ} is some positive number chosen to guarantee that at least one of the proposed abscissae is larger than the posterior mode and one is smaller. If \(\hat{\theta }\) is the posterior mode, then the choice of c_{θ} is arbitrary, and a smaller c_{θ} is preferred, potentially decreasing the sampling time.
In addition, the derivative-free method requires specifying the minimum and maximum values of the distribution's support, which are often not known exactly. In practice, this poses no problems, as we can simply set the required minima and maxima to be extreme enough that the distribution is very unlikely to reach those values. To sample the intercept μ, we set the limits to 2 and 5, which after exponentiation correspond to 7.39 and 148.4, a range that each of our posterior means should fit within; for α we set the limits to 0 and 40; for non-zero betas, we used the previous beta value \(\pm 2\sqrt{{C}_{k}{\sigma }_{G}^{2}}\) as the minimum and maximum limits for sampling, as this adapts to the different mixtures and should still safely retain almost the entire posterior distribution. The Adaptive Rejection Sampling was implemented using C code by Gilks (http://www1.maths.leeds.ac.uk/~wally.gilks/adaptive.rejection/web_page/Welcome.html, accessed 26.08.2020). In the Supplementary Note, we provide a proof of the log-concavity of the functions sampled.
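To make the construction concrete, the following is a minimal, self-contained sketch of derivative-free adaptive rejection sampling on a bounded interval (our own simplified illustration, not the Gilks C implementation used in the paper: the squeeze test is omitted and the envelope is rebuilt on every rejection, so it is correct but far less efficient):

```python
import numpy as np

def _hull(xs, ys, lo, hi):
    """Piecewise-linear upper hull from extended chords (derivative-free).
    Returns pieces (l, r, slope, value at l) and their envelope masses."""
    m = len(xs)
    b = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(m - 1)]
    line = lambda i, x: ys[i] + b[i] * (x - xs[i])  # chord i evaluated at x
    pieces = []

    def add(l, r, i):
        if r > l:
            pieces.append((l, r, b[i], line(i, l)))

    add(lo, xs[0], 0)          # first chord extended left of x_0
    add(xs[0], xs[1], 1)       # chord 1 extended left covers (x_0, x_1)
    for i in range(1, m - 2):  # interior: min of the two neighbouring chords
        z = xs[i] if b[i - 1] == b[i + 1] else xs[i] + (
            ys[i + 1] - ys[i] - b[i + 1] * (xs[i + 1] - xs[i])
        ) / (b[i - 1] - b[i + 1])
        z = min(max(z, xs[i]), xs[i + 1])  # chords cross at z
        add(xs[i], z, i - 1)
        add(z, xs[i + 1], i + 1)
    add(xs[m - 2], xs[m - 1], m - 3)  # chord m-3 extended right
    add(xs[m - 1], hi, m - 2)         # last chord extended right of x_{m-1}

    off = max(ys)  # subtracted before exponentiating, for stability
    ws = [np.exp(a - off) * ((r - l) if abs(s) < 1e-12
          else np.expm1(s * (r - l)) / s) for (l, r, s, a) in pieces]
    return pieces, ws

def ars(h, lo, hi, x_init, n, rng):
    """h: log-concave log-density on [lo, hi]; x_init: >= 3 abscissae
    straddling the mode. Returns n samples."""
    xs, out = sorted(x_init), []
    ys = [h(x) for x in xs]
    while len(out) < n:
        pieces, ws = _hull(xs, ys, lo, hi)
        l, r, s, a = pieces[rng.choice(len(pieces), p=np.array(ws) / sum(ws))]
        u = rng.random()  # inverse-CDF draw within the exponential piece
        x = l + (u * (r - l) if abs(s) < 1e-12
                 else np.log1p(u * np.expm1(s * (r - l))) / s)
        if np.log(rng.random()) <= h(x) - (a + s * (x - l)):
            out.append(x)                 # envelope test passed
        else:                             # refine the envelope and retry
            k = int(np.searchsorted(xs, x))
            xs.insert(k, x)
            ys.insert(k, h(x))
    return np.array(out)
```

Because the log-density is concave, every extended chord lies above the function outside its own interval, so the envelope test never rejects a point it should accept, and each rejection tightens the envelope.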
Sampling algorithm
We summarise the serial sampling algorithm in Algorithm 1, along with the specification of the prior distributions and the initialisation of the model parameters. Algorithm 2 summarises the Bulk Synchronous Gibbs sampling for BayesW that extends Algorithm 1. If the number of workers T = 1 and the synchronisation rate u = 1, then Algorithm 2 reduces to Algorithm 1.
Extension to a grouped mixture of regressions model
Here, we now assume that the SNP marker effects come from Φ disjoint groups, with a reparametrisation of the model parameters to represent the mean of the logarithm of the phenotype as
where there is a single intercept term μ and a single Weibull shape parameter α, but now \({{\bf{x}}}_{i}^{\varphi }\) are the standardised marker values in group φ and β^{φ} are the marker estimates for the corresponding group. Each \({{\boldsymbol{\beta }}}_{j}^{\varphi }\) is distributed according to:
where, for each SNP marker group, the prior probabilities of belonging to each of the mixture distributions k are stored in the (L_{φ} + 1)-dimensional vector π^{φ}, and these mixture proportions \(\left\{{{\boldsymbol{\pi }}}_{0}^{\varphi },{{\boldsymbol{\pi }}}_{1}^{\varphi },\ldots ,{{\boldsymbol{\pi }}}_{L}^{\varphi }\right\}\), \(\mathop{\sum }\nolimits_{k = 0}^{L}{{\boldsymbol{\pi }}}_{k}^{\varphi }=1\), are updated in each iteration. Each mixture component (γ_{j} = k ≥ 1) is a normal distribution with zero mean and variance \({C}_{k}^{\varphi }{\sigma }_{G\varphi }^{2}\), where \({\sigma }_{G\varphi }^{2}\) represents the phenotypic variance attributable to markers of group φ and \({C}_{k}^{\varphi }\) is a group- and mixture-specific factor giving the magnitude of variance explained by this specific mixture. Thus, the mixture proportions, the variance explained by the SNP markers, and the mixture constants are all unique and independent across SNP marker groups. This formulation, with an independent variance parameter \({\sigma }_{G\varphi }^{2}\) per group of markers and independent mixture variance components, enables estimation of the amount of phenotypic variance attributable to the group-specific effects and allows for differences in the distribution of effects among groups. All of the steps shown in the previous paragraphs remain valid, now using the group-specific genetic variances \({\sigma }_{G\varphi }^{2}\), mixture proportions π^{φ} and mixture constants \({C}_{k}^{\varphi }\). Furthermore, because the model is additive, the sum of the group-specific genetic variances gives the total genetic variance \({\sigma }_{G}^{2}=\mathop{\sum }\nolimits_{\varphi = 1}^{{{\Phi }}}{\sigma }_{G\varphi }^{2}\).
Derivations for the sparse calculations
To reduce the number of computations and improve running times, we derive a sparse representation of the genotypes; given that the conditional posterior distributions in our scheme are different, we have to derive different update equations. Suppose ξ_{ij} represents the jth SNP allele count (0, 1 or 2) for the ith individual, and \({\bar{\xi }}_{j},{s}_{j}\) represent the mean and standard deviation of the jth SNP in our sample. In the regular setting, we would use the standardised count values (\({{\bf{x}}}_{ij}=\frac{{\xi }_{ij}-{\bar{\xi }}_{j}}{{s}_{j}}\)) instead, and we can speed up the computations by using the knowledge that x_{ij} can take only three values within a SNP.
There are three equations where we can apply sparsity. Firstly, Eq. (40) for the \({\hat{\sigma }}_{k}\) term (for the jth SNP) in the adaptive Gauss-Hermite quadrature can be expressed as
We see that s_{j} and \({\bar{\xi }}_{j}\), and the expressions containing these terms, can be precomputed for each SNP j.
Secondly, we can use the knowledge about sparsity to simplify expression (Eq. 41). More specifically
Thirdly, in expression (Eq. 20) we can rewrite the transformed residuals as
In all cases, after each update, we need to recalculate the difference \(\exp ({v}_{i})-\exp ({u}_{i})\) for each individual i. We notice that it is sufficient to use three sums (\({\sum }_{{\xi }_{ij} = \xi }(\exp ({v}_{i})-\exp ({u}_{i}))\), ξ ∈ {0, 1, 2}), which we denote as \({V}_{0}^{j}\), \({V}_{1}^{j}\), \({V}_{2}^{j}\), and which are used in both of the final expressions. Thus, we have eliminated the need to calculate exponents and dot products in expressions (Eq. 40) and (Eq. 41), reducing them to a series of sparse summations and making the analysis scale sub-linearly with increasing marker number.
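The core idea can be illustrated with the dot product of a standardised genotype column and an arbitrary vector (a hedged sketch of the principle only; the actual update equations (Eq. 40, 41) additionally reuse the per-SNP index sets and the precomputed means and standard deviations):

```python
import numpy as np

def sparse_dot(xi, v):
    """Dot product of the standardised genotype x = (xi - mean)/sd with v,
    computed from the raw allele counts xi in {0, 1, 2} using only the
    index sets where xi is 1 or 2, plus one full sum of v."""
    mean, sd = xi.mean(), xi.std()
    S = v.sum()              # in practice maintained once per iteration
    S1 = v[xi == 1].sum()    # heterozygote sum
    S2 = v[xi == 2].sum()    # effect-allele homozygote sum
    return (S1 + 2.0 * S2 - mean * S) / sd

# agrees with the dense computation np.dot((xi - xi.mean()) / xi.std(), v)
```

Because most genotypes are 0 for low-frequency variants, only the (typically small) index sets for counts 1 and 2 are traversed per marker, which is the source of the sub-linear scaling.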
Simulation study
We conducted simulations to analyse the performance of our model under model misspecification, where the phenotypic distribution does not conform to a Weibull distribution, and under different censoring levels in the data. We assessed (i) estimation of hyperparameters, (ii) false discovery rate, and (iii) prediction accuracy. We used M = 50,000 uncorrelated markers and N = 5000 individuals, for whom we simulated effects on p = 500 randomly selected markers; heritability (as defined in the Supplementary Note) was set to h^{2} = 0.5. We then generated phenotypes from the generalised gamma distribution (see Supplementary Note), retaining the mean and the variance on the logarithmic scale and thus fixing the heritability, while varying the θ parameter of the generalised gamma distribution between 0 and 2 (five settings of θ = 0, 0.5, 1, 1.5, 2, with θ = 1 corresponding to a Weibull distribution). For these data sets, we also varied the censoring levels: 0%, 20% and 40% (see Supplementary Note). For each of the censoring and phenotypic distribution combinations, 25 replicate phenotypic data sets were created, giving a total of 375 data sets. The prior parameters for \({\sigma }_{G}^{2}\), α and μ were set in the same way as described in Data Analysis Details.
To compare our approach with other available methods, we analysed each data set using different approaches: a Cox Lasso^{13}, a martingale residuals approach with single-marker ordinary least squares (OLS) regression^{11}, and martingale residuals with a Bayesian regression mixture model with a Dirac spike (BayesR)^{19}. For each of the 25 simulation replicates, across the five generalised gamma θ parameters, we calculated the correlation between the simulated genetic values and a genetic predictor, created from the regression coefficients obtained from each approach, in an independent data set (same number of markers, same causal markers and same effect sizes, with N = 1000 individuals), with results shown in Fig. 1a. Secondly, for all four methods, we calculated precision-recall curves for the generalised gamma distributions θ ∈ {0, 1, 2} and censoring levels 0%, 20% and 40% (Fig. 1b, Supplementary Fig. 2). The Bayesian models used 5 chains with 1100 iterations for each chain, with a burn-in of 100 and no thinning. The BayesW model was estimated using 11 quadrature points. The hyperparameter for the Cox Lasso was estimated using 5-fold cross-validation for each simulated data set separately.
In addition, for the BayesW model, we analysed each of the 375 data sets using either only a single mixture distribution set to a constant of 0.01, or two mixture distributions with constants 0.01 and 0.001. We compared these model formulations by assessing the slope between the true and estimated non-zero marker effects (Fig. 1c) and the estimated heritability, across the range of generalised gamma distributions (Fig. 1d). For each of the 5 generalised gamma θ parameter settings, we also calculated the mean false discovery rate (FDR) levels across the 25 replicate simulations given fixed posterior inclusion probabilities (Fig. 1e), for both the single-mixture and two-mixture model formulations (Fig. 1f). Finally, we tested the impact of using a different number of quadrature points by running the model for the Weibull setting data sets. We varied the number of quadrature points from 3 to 25 across 5 simulation replicates, using two mixture distributions (0.001, 0.01), and investigated the root mean square error (RMSE: estimated/true) for marker effect estimates within the 5 top deciles of the simulated marker effect distribution (Supplementary Fig. 3b).
To test the impact of LD among the markers, we used UK Biobank chromosome 22 imputed genotype data (M = 194,922 markers, N = 20,000 randomly selected individuals, p = 2000 randomly selected causal markers, heritability h^{2} = 0.5) and simulated the phenotypes from a Weibull distribution, with 25 simulation replicates. We used these data to compare BayesW to the same other methods described above, by calculating the correlation of the simulated genetic value and a genetic predictor in an independent data set (the same number of markers, same causal markers and same effect sizes, with N = 4000 individuals). We present these results in Supplementary Fig. 1a. In addition, we used the same genetic data set but varied the censoring levels, to examine the stability of the heritability estimate (Supplementary Fig. 1b). The Bayesian analyses used 5 chains with 3000 iterations each, a burn-in of 1000 and thinning of 5. The Cox Lasso model was trained in the same way as in the uncorrelated case.
To validate the properties of polygenicity, variance partitioning between mixtures and false discovery rate, we used UK Biobank chromosome 1 imputed genotype data that was LD-pruned with a threshold of r^{2} = 0.9, as this data set was later used in the final analyses (M = 230,227 markers, N = 25,000 randomly selected individuals). We ran 10 simulations with three different numbers of causal loci: 200, 2500 and 4000. The phenotypes were simulated from a Weibull distribution with a fixed heritability of h^{2} = 0.5. All the models were executed with three variance components (0.0001, 0.001, 0.01) (Supplementary Figs. 11, 12). The effects were created by first grouping the markers via a clumping procedure (window size 10 Mb, LD threshold r^{2} = 0.1) and then assigning the effects to the index SNPs of randomly selected clumps.
Finally, we ran 10 simulations on the same UK Biobank chromosome 1 data as described in the previous section to check the performance of the BSP Gibbs sampling algorithm in a scenario closest to the empirical UK Biobank data analysis. Here, we used p = 2500 randomly selected causal SNPs, with heritability h^{2} = 0.5. The phenotypes were simulated from a Weibull distribution, and models were run with three variance components (0.0001, 0.001, 0.01). Models were run varying the number of tasks (parallelism) among 1, 4, 8 and 16, and the synchronisation rate (number of markers processed by each task until synchronisation) among 1, 5, 10, 20 and 50 (Supplementary Fig. 3a). The scenario of 8 tasks (~30,000 markers per task) and a synchronisation rate of 10 was used in the empirical data analysis.
UK Biobank data
We restricted our discovery analysis of the UK Biobank to a sample of European-ancestry individuals (N = 456,426). To infer ancestry, 488,377 genotyped participants were projected onto the first two genotypic principal components (PCs) calculated from 2504 individuals of the 1000 Genomes project with known ancestries. Using the obtained PC loadings, we then assigned each participant to the closest population in the 1000 Genomes data: European, African, East Asian, South Asian or Admixed. As we wished to contrast the genetic basis of different phenotypes, we then removed closely related individuals, as identified in the UK Biobank data release. While we expect that our model can accommodate relatedness similarly to other mixed linear model approaches, we wished to compare phenotypes at markers that enter the model due to LD with underlying causal variants, and relatedness leads to the addition of markers within the model to capture the phenotypic covariance of closely related individuals.
We used the imputed autosomal genotype data of the UK Biobank provided as part of the data release. For each individual, we used the genotype probabilities to hard-call the genotypes for variants with an imputation quality score above 0.3. The hard-call threshold was 0.1, setting genotypes with probability ≤0.9 as missing. From the good-quality markers (with missingness less than 5% and a Hardy-Weinberg test p-value larger than 10^{-6}, as determined in the set of unrelated Europeans), we selected those with minor allele frequency (MAF) > 0.0025 and an rs identifier, in the set of European-ancestry participants, providing a data set of 9,144,511 SNPs, short indels and large structural variants. From this, we took the overlap with the Estonian Biobank data to give a final set of 8,433,421 markers. From the UK Biobank European data set, samples were excluded if, in the UK Biobank quality control procedures, they (i) were identified as extreme heterozygosity or missing genotype outliers; (ii) had a genetically inferred gender that did not match the self-reported gender; (iii) were identified to have putative sex chromosome aneuploidy; or (iv) were excluded from kinship inference. Information on individuals who had withdrawn their consent for their data to be used was also removed. These filters resulted in a data set with 382,466 individuals. We then excluded markers in high LD by conducting LD pruning with a threshold of r^{2} = 0.9 in a 100 kb window, leaving us with a final set of 2,975,268 markers. This was done to decrease the number of markers that were in high LD and thus contributed very little extra information while requiring more than twice the computational resources. Genotype quality control was conducted using plink version 1.9^{38}.
We then selected the recorded measures for the 382,466 individuals to create the phenotypic data sets for age-at-menopause, age-at-menarche and age-at-diagnosis of HBP, T2D or CAD. For each individual i we created a pair consisting of the last known (log-transformed) time without an event, Y_{i}, and a censoring indicator C_{i} (C_{i} = 1 if the person had the event by the end of the time period, otherwise C_{i} = 0). If the event had not happened for an individual, then the last time without the event was defined as the date of the last assessment-centre visit minus the date of birth (only month and year of birth are known; the exact day was imputed as 15).
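The construction of the (Y_{i}, C_{i}) pairs can be sketched as follows (a minimal illustration with hypothetical names, not the pipeline code; the day of birth is imputed as the 15th, as described above):

```python
import math
from datetime import date

def event_record(birth: date, event_age=None, last_visit: date = None):
    """Return (Y, C): the log of the last known event-free age and the
    censoring indicator (C = 1 if the event occurred, else C = 0)."""
    if event_age is not None:
        return math.log(event_age), 1
    # censored: date of last assessment-centre visit minus date of birth
    years = (last_visit - birth).days / 365.25
    return math.log(years), 0

# birth day imputed as the 15th, since only month and year are recorded
y, c = event_record(date(1950, 6, 15), last_visit=date(2010, 6, 15))
print(round(math.exp(y)), c)  # 60 0 -> censored at age 60
```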
For age-at-menopause, we used UKB field 3581 to obtain the time if available. We excluded from the analysis (1) women who had reported having and later not having had menopause, or vice versa, (2) women who said they had had menopause but for whom there was no record of the time of menopause (UKB field 2724), (3) women who had had a hysterectomy or for whom this information was missing (UKB field 3591), and (4) women whose menopause was before age 33 or after 65. This left us with a total of N = 151,472 women, of which 108,120 had the event and 43,352 had not had an event by the end of the follow-up. For time-to-menarche we used UKB field 2714, excluding all women with no record for time-to-menarche, which left us with a total of N = 200,493 women, all of whom had had the event. For the age of diagnosis of HBP we used UKB field 2966, and we left out individuals who had HBP diagnosed but for whom there was no information about the age of diagnosis (UKB field 6150), which left us with a total of N = 371,878 individuals, of which 95,123 had the event and 276,755 had not had an event by the end of the follow-up. For the age of diagnosis of T2D, we used either UKB field 2976 or field 20009, or the mean of the two if both were available. We excluded individuals who had indicated self-reported “type 1 diabetes” (code 1222) and had a Type 1 diabetes diagnosis (ICD 10 code E10); we also excluded individuals who did not have any recorded time for the diagnosis of T2D but had indicated a secondary diagnosis (UKB fields 41202 and 41204) of “non-insulin-dependent diabetes mellitus” (ICD 10 code E11), or a self-reported non-cancer illness (UKB field 20002) of “type 2 diabetes” (code 1223) or “diabetes” (code 1220). That left us with a total of N = 372,280 individuals, of which 15,813 had the event and 356,467 had not had an event by the end of the follow-up.
For the age of diagnosis of CAD, we used either the minimum of the age at which angina was diagnosed and the age at which a heart attack was diagnosed (UKB fields 3627 and 3894), or the minimum age indicated for either of the two diagnoses (codes 1074, 1075) in UKB field 20009, or the mean of those if both were available. We excluded individuals who did not have any information about the time of diagnosis but had the following primary or secondary diagnoses: ICD 10 codes I20, I21, I22, I23, I24 or I25; self-reported angina (code 1074); or self-reported heart attack/myocardial infarction (code 1075). That left us with a total of N = 360,715 individuals, of which 17,452 had the event and 343,263 had not had an event by the end of the follow-up.
In the analysis we included as covariates sex, UK Biobank recruitment centre, genotype batch, and the first 20 principal components of an LD-clumped set of 1.2 million markers, calculated using flashPCA2^{39} (commit version b8044f1), to account for population stratification in a standard way. We did not include covariates for age or year of birth because these are directly associated with our phenotypes.
Estonian Biobank data
The Estonian Biobank cohort is a volunteer-based sample of the Estonian resident adult population. The current number of participants, close to 52,000, represents a large proportion (5%) of the Estonian adult population, making it ideally suited to population-based studies^{40}. For the Estonian Biobank data, 48,088 individuals were genotyped on Illumina Global Screening Array (GSA) (N = 32,594), OmniExpress (N = 8,102), CoreExome (N = 4,903) and Hap370CNV (N = 2,489) arrays. We selected only those genotyped on the GSA array and imputed the data set to an Estonian reference, created from the whole-genome sequence data of 2,244 participants^{41}. From the 11,130,313 markers with imputation quality score >0.3, we selected SNPs that overlapped with the UK Biobank LD-pruned data set, resulting in a set of 2,975,268 markers. The phenotypic data were constructed similarly to the phenotypes based on the UK Biobank data for the N_{Est} = 32,594 individuals genotyped on the GSA array. For time-to-event traits, if no event had happened then the time was considered censored and the last known age without the event was used, calculated as the last known date without the event minus the date of birth. Because only the year of birth is known, the birth date and month were imputed as July 1 for age calculations.
For age-at-menopause, we excluded women who had reported their menstruation stopping for other reasons, which resulted in 6,434 women who had had menopause and 12,934 women who had not. For age-at-menarche we excluded women who had not reported the age at which menstruation started, which resulted in 18,134 women. For both age-at-menarche and age-at-menopause, if the event had occurred, the self-reported age at that event was used.
Initially, the cases of CAD, HBP or T2D were identified on the basis of the baseline data collected during recruitment, where the disease information was either retrieved from medical records or self-reported by the participant. This information was then linked with health insurance data, which provided additional information on prevalent cases. To construct the time-to-diagnosis phenotypes for CAD, HBP or T2D, for individuals with the corresponding diagnosis we used the age at the first appearance of the respective ICD 10 code, the same codes used for creating the UK Biobank phenotypes. If the self-reported data for an ICD 10 code contained only the year, the date and month were imputed as July 1; if only the day was missing, it was imputed as 15. The respective case-control phenotypes for CAD, HBP or T2D were defined as 0 if the person had not had an event (censored) and 1 if the person had had an event, and these binary indicators were adjusted for age and sex. For the T2D phenotype, we excluded individuals with a diagnosis of T1D. This resulted in 30,015 individuals without and 2,579 with a CAD diagnosis, 24,135 individuals without and 8,459 with an HBP diagnosis, and 30,883 individuals without and 1,457 with a T2D diagnosis.
Data analysis details
The BayesW model was run on the UK Biobank data without groups and with 20 MAF-LD groups, defined as MAF quintiles and then quartiles within each of those MAF quintiles, split by LD score. The cut-off points for creating the MAF quintiles were 0.006, 0.013, 0.039 and 0.172; the cut-off points for creating the LD-score quartiles were 2.11, 3.08, 4.51 for the first; 3.20, 4.71, 6.84 for the second; 4.70, 6.89, 9.94 for the third; 7.65, 11.01, 15.70 for the fourth; and 10.75, 15.10, 21.14 for the fifth MAF quintile. The prior distributions for the hyperparameters were specified to be only weakly informative: normal priors had zero mean and very large variance, Dirichlet priors were vectors of ones, and the rest were chosen such that the prior parameter value would have a very small contribution to the conditional distribution compared to the likelihood. Specifically, for μ and δ the mean was chosen as 0 and the variance as \({\sigma }_{\mu }^{2}={\sigma }_{{\boldsymbol{\delta }}}^{2}=100\); for α we chose α_{0} = 0.01 and κ_{0} = 0.01; for \({\sigma }_{G}^{2}\) in the model without groups, and for \({\sigma }_{G\varphi }^{2}\), ∀ φ, in the model with groups, we set the parameters to α_{σ} = 1, β_{σ} = 0.0001; for π and π^{φ} the prior parameters were set to a vector of ones. The model without groups was executed with mixture components 0.00001, 0.0001, 0.001 and 0.01 (reflecting that the markers allocated to those mixtures explain on the order of 0.001%, 0.01%, 0.1% or 1% of the total genetic variance), and the model with groups was executed with (group-specific) mixture components (0.0001, 0.001, 0.01, 0.1). Guided by our simulation study (Supplementary Fig. 3b), we used 25 quadrature points for running each of the models. For the model without groups we used five chains, and for the model with groups we used three chains. Each chain was run for 10,000 iterations with a thinning of 5, giving 2,000 samples.
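Using the cut-off points quoted above, the assignment of a marker to one of the 20 MAF-LD groups can be sketched as follows (a hypothetical helper for illustration only, not the analysis code):

```python
import numpy as np

# MAF quintile cut-offs and, per quintile, LD-score quartile cut-offs (from the text)
MAF_CUTS = [0.006, 0.013, 0.039, 0.172]
LD_CUTS = [[2.11, 3.08, 4.51], [3.20, 4.71, 6.84], [4.70, 6.89, 9.94],
           [7.65, 11.01, 15.70], [10.75, 15.10, 21.14]]

def maf_ld_group(maf: float, ld_score: float) -> int:
    """Assign a marker to one of 20 groups: MAF quintile first,
    then LD-score quartile within that quintile."""
    q = int(np.searchsorted(MAF_CUTS, maf))         # MAF quintile index 0..4
    r = int(np.searchsorted(LD_CUTS[q], ld_score))  # LD quartile index 0..3
    return 4 * q + r

print(maf_ld_group(0.05, 12.0))  # 14: fourth MAF quintile, third LD quartile
```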
We applied a stringent criterion of removing the first half of each chain as burn-in, giving the convergence statistics of Supplementary Figs. 4–7. That gave 5,000 samples for the model without groups and 3,000 samples for the model with groups for each of the five traits.
The BSP Gibbs sampling scheme is implemented by partitioning the markers into equal-size chunks assigned to workers (MPI tasks), themselves distributed over compute nodes. For the analyses we used 8 tasks per node; due to the differences in sample size, we used different numbers of nodes to accommodate the data in memory: 8 nodes for time-to-menopause, 10 nodes for time-to-menarche, and 12 nodes for time-to-diagnosis of CAD, HBP or T2D. This resulted in splitting the markers between 64 workers for time-to-menopause, 80 workers for time-to-menarche and 96 workers for time-to-diagnosis of CAD, HBP or T2D. In the latter case, the average number of markers assigned to one worker was 30,992. We chose a synchronisation rate of 10, meaning that synchronisation between all of the workers was done after each worker had sampled 10 markers. Both the choice of the maximum number of workers and the synchronisation rate are stringent options given our simulation study results plotted in Supplementary Fig. 3a.
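As an illustration of the partitioning, a near-equal split of the 2,975,268 markers across 96 workers can be sketched as follows (a hypothetical helper, not the Hydra MPI code):

```python
def partition_sizes(n_markers: int, n_workers: int) -> list:
    """Near-equal chunk sizes when markers are split across workers;
    the first n_markers % n_workers workers receive one extra marker."""
    base, extra = divmod(n_markers, n_workers)
    return [base + (1 if i < extra else 0) for i in range(n_workers)]

sizes = partition_sizes(2_975_268, 96)   # the CAD/HBP/T2D setting
print(min(sizes), max(sizes))            # 30992 30993
```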
For testing region-based significance for BayesW, we used the Posterior Probability of the Window Variance (PPWV)^{21}. PPWV requires first setting a threshold for the proportion of the genetic variance explained. Then, based on the posterior distributions, we calculated the probability that each region explained more than the specified threshold proportion of the genetic variance; this quantity is denoted PPWV. The regions were defined via an LD clumping procedure (window size 10 Mb, LD threshold r^{2} = 0.1), resulting in regions with high intra-region correlations but low inter-region correlations. For these LD-clumped regions we used thresholds of 1/100,000, 1/10,000 and 1/1000 of the total genetic variance. The smallest threshold, 1/100,000 of the total genetic variance, matches the magnitude of the smallest mixture component (0.00001) used in the models; the smallest mixture component reflects the smallest effect size the model is intended to capture. The thresholds of 1/10,000 and 1/1000 of the total genetic variance are 10 and 100 times greater than the smallest threshold, chosen to highlight regions with larger effect sizes. To check the significance of gene-associated regions we used the more stringent thresholds of 1/10,000 and 1/1000 of the total genetic variance, as gene-associated regions can contain many more markers. Furthermore, to make gene-associated regions more comparable, we fixed an upper bound of 250 markers that can contribute to a gene-associated region; markers exceeding this bound were randomly discarded.
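A minimal sketch of the PPWV computation, assuming (for this illustration only) that the genetic variance attributable to a set of markers in one posterior draw is approximated by the sum of their squared standardised effects:

```python
import numpy as np

def ppwv(beta_draws: np.ndarray, region: list, threshold: float) -> float:
    """Posterior Probability of the Window Variance: the fraction of
    posterior draws in which the region explains more than `threshold`
    of the total genetic variance. Region variance is approximated by
    the sum of squared standardised effects (an assumption of this sketch).

    beta_draws: draws x markers matrix of sampled effect sizes."""
    region_var = (beta_draws[:, region] ** 2).sum(axis=1)  # per-draw region variance
    total_var = (beta_draws ** 2).sum(axis=1)              # per-draw total genetic variance
    return float(np.mean(region_var / total_var > threshold))

draws = np.array([[1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.0]])
print(ppwv(draws, [0], 1 / 1000))  # 1.0: marker 0 exceeds the threshold in every draw
print(ppwv(draws, [2], 1 / 1000))  # 0.0: marker 2 never contributes
```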
To compare methods in terms of discovered regions and prediction accuracy, we used the summary statistics from the fastGWA method^{28}. Because there were no results for our definitions of time-to-CAD or time-to-T2D, we used the time-to-angina and time-to-heart-attack summary statistics for comparison with CAD, and time-to-diabetes for comparison with T2D. We called an LD-clumped region significant if the region contained at least one SNP with a p-value < 5 × 10^{−8}. For prediction into the Estonian Biobank we used only the markers with p-value < 5 × 10^{−8}. We made predictions only for age-at-menarche and age-at-menopause, since the numbers of significant markers for these traits were higher.
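The region-level significance rule reduces to a one-liner (hypothetical data layout, mapping each LD-clumped region to the p-values of its SNPs):

```python
def significant_regions(region_pvalues: dict, alpha: float = 5e-8) -> list:
    """An LD-clumped region is called significant if it contains at
    least one SNP with p-value below the genome-wide threshold."""
    return [name for name, pvals in region_pvalues.items() if min(pvals) < alpha]

regions = {"clump_1": [1e-9, 0.2, 0.5],   # one genome-wide significant SNP
           "clump_2": [1e-4, 0.03]}       # no SNP passes 5e-8
print(significant_regions(regions))       # ['clump_1']
```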
To compare predictive accuracy with a competing method, we also trained the Cox-LASSO method with the R package snpnet^{14,15} on the UK Biobank data and then used the estimates to make predictions into the Estonian Biobank. To make the two models comparable, we used exactly the same data sizes for estimating the models on the UK Biobank as were used with BayesW. For all of the traits, we used 95% of the sample as training data and the rest as validation data. This was done to minimise the loss in power from not using the entire sample, while 5% of the sample gives a sufficiently large validation set. We ran the Cox-LASSO model using snpnet with 16 threads and 250 GB of allocated memory. This was sufficient to find the optimal hyperparameter for the traits of time-to-menopause (22 iterations) and time-to-CAD (21 iterations). However, for the other traits the snpnet procedure ran out of memory, and we used the results from the last available iteration (iteration 28 for time-to-HBP, iteration 35 for time-to-menarche, iteration 27 for time-to-T2D). For the traits for which it was not possible to detect the optimal hyperparameter, a sensitivity analysis was done by comparing with previous iterations. Prediction accuracy was virtually the same between the last available iteration and the iterations shortly before it, suggesting that the last available iteration was already providing a hyperparameter close to the optimum.
The BayesW prediction into the Estonian Biobank, \(\hat{{\bf{g}}}\), was calculated as \(\hat{{\bf{g}}}={{\bf{X}}}_{Est}\hat{{\boldsymbol{\beta }}}\), where X_{Est} is the N_{Est} × M matrix of standardised Estonian genotypes (each column standardised using the mean and the standard deviation of the Estonian data) and \(\hat{{\boldsymbol{\beta }}}\) is the M × I matrix containing the posterior distributions of the M marker effect sizes across I iterations. For the prediction into Estonia we used the BayesW model with groups, whose 3,000 posterior samples gave posterior predictive distributions of the genetic values. To create the final predictor, we calculated each individual’s mean genetic value across the 3,000 iterations. We also created a predictor from the Cox-LASSO estimates by multiplying the standardised Estonian genotype matrix with the vector of Cox-LASSO effect-size estimates. We evaluated the performance of the two predictors by comparing them to the true phenotype and calculating R^{2} and Harrell’s C-statistic^{29}. Instead of the exact phenotypes, the martingale residuals from a Cox PH model in which the true phenotype was regressed on sex (where applicable) were used to calculate R^{2}; this enables calculating the R^{2} value using the censored individuals as well. Harrell’s C-statistic was calculated from a Cox PH model in which the true phenotype was regressed on the predictor and sex (where applicable).
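The posterior-mean predictor can be sketched as follows (a simplified illustration with hypothetical names; the real computation averages over 3,000 posterior draws):

```python
import numpy as np

def bayesw_predictor(X_raw: np.ndarray, beta_draws: np.ndarray) -> np.ndarray:
    """Posterior-mean genetic value per individual.

    X_raw: N x M genotype matrix, standardised column-wise with the
    target cohort's own means and standard deviations;
    beta_draws: M x I matrix of posterior effect-size samples."""
    X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
    g_draws = X @ beta_draws       # N x I posterior predictive genetic values
    return g_draws.mean(axis=1)    # average over the I posterior draws

X_raw = np.array([[0.0, 2.0],
                  [2.0, 0.0]])
beta_draws = np.array([[1.0, 1.0],     # marker 1: same effect in both draws
                       [2.0, 2.0]])    # marker 2
print(bayesw_predictor(X_raw, beta_draws))  # [ 1. -1.]
```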
The BayesW calculations were performed using the facilities of the Scientific IT and Application Support Center of EPFL and the Helvetios cluster. All post-analysis steps were conducted using R software (version 3.6.1)^{42}.
Algorithm 1
Serial algorithm for BayesW sampling from the posterior distribution \(p(\mu ,\alpha ,{\boldsymbol{\delta }},{\boldsymbol{\gamma }},{\boldsymbol{\beta }},{\boldsymbol{\pi }},{\sigma }_{G}^{2}| D)\). Initialisation and prior specification.
Data: Matrix x of standardised genotypes, matrix with covariate data z, vector of last time without an event y, vector of failure indicators d, prior hyperparameters α_{0}, κ_{0}, α_{σ}, β_{σ}, \({\sigma }_{\mu }^{2}\), \({\sigma }_{{\boldsymbol{\delta }}}^{2}\), p_{L}, iterations I. \({V}_{0}^{j}\), \({V}_{1}^{j}\) and \({V}_{2}^{j}\) denote the partial sums of the exponentiated residuals (defined in Derivations for the sparse calculations), and K is the Euler–Mascheroni constant.
Initial values: Initially, we exclude all variables from the model, thus the initial γ_{j} = 0, ∀ j ∈ {1, . . . , M}. We set β_{j} = 0, ∀ j and δ_{q} = 0, ∀ q. The initial value for α is chosen as a suitable transformation of the variance of the log sample, and the initial value for μ is the mean of the log sample. \({\sigma }_{G}^{2}\) is initialised as the variance of the log sample divided by the total number of markers M.
Parameters for prior distributions: We set the priors to be weakly informative; otherwise, if available, prior information could be used. To obtain weakly informative priors, for the α prior we set α_{0} = 0.01 and κ_{0} = 0.01; for \({\sigma }_{G}^{2}\) we set α_{σ} = 1, β_{σ} = 0.0001; for the μ prior we set \({\sigma }_{\mu }^{2}=100\), and similarly for δ_{q} we set \({\sigma }_{{\boldsymbol{\delta }}}^{2}=100\). The prior parameter for π, p_{L}, is a vector of ones.
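A sketch of the initialisation of μ and α, assuming that the "suitably transformed variance" refers to the Gumbel variance relation Var(log T) = π²/(6α²) that holds for Weibull event times (our reading, stated as an assumption; function name is hypothetical):

```python
import math

def init_mu_alpha(log_times: list) -> tuple:
    """Starting values from the log sample: mu is its mean; alpha is
    obtained from the Gumbel relation Var(log T) = pi^2 / (6 alpha^2)
    for Weibull event times (an assumption of this sketch)."""
    n = len(log_times)
    mu0 = sum(log_times) / n
    var = sum((x - mu0) ** 2 for x in log_times) / n
    alpha0 = math.pi / math.sqrt(6 * var)
    return mu0, alpha0

mu0, alpha0 = init_mu_alpha([1.0, 3.0])  # log-sample mean 2.0, variance 1.0
```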
1 Initialise for every i: ε_{i} = y_{i} − μ
2 for iteration ← 1 to I do
3 Add the previous effect to the residual: ε_{i} ← ε_{i} + μ^{old}
4 Sample μ using ARS;
5 Subtract the new effect from the residual: ε_{i} ← ε_{i} − μ
6 Shuffle \(\left({\mathrm{covariates}}\right)\);
7 foreach covariate q do
8 \({\varepsilon }_{i}\leftarrow {\varepsilon }_{i}+{{\bf{z}}}_{iq}{{\boldsymbol{\delta }}}_{q}^{old}\)
9 Sample δ_{q} using ARS
10 ε_{i} ← ε_{i} − z_{iq}δ_{q}
11 Sample α using ARS;
12 Shuffle \(\left({\mathrm{markers}}\right)\);
13 Calculate exponentiated residuals: \({\epsilon }_{i}\leftarrow \exp (\alpha {\varepsilon }_{i}-K)\);
14 foreach marker j do
15 if \({{\boldsymbol{\beta }}}_{j}^{old}=0\) then
16 Calculate \({V}_{0}^{j}\), \({V}_{1}^{j}\), \({V}_{2}^{j}\);
17 if \({{\boldsymbol{\beta }}}_{j}^{old}\ne 0\) then
18 \({\varepsilon }_{i}\leftarrow {\varepsilon }_{i}+{{\bf{x}}}_{ij}{{\boldsymbol{\beta }}}_{j}^{old};\)
19 \({\epsilon }_{i}\leftarrow \exp (\alpha {\varepsilon }_{i}-K);\)
20 Calculate \({V}_{0}^{j}\), \({V}_{1}^{j}\), \({V}_{2}^{j}\);
21 Sample mixture indicator γ_{j};
22 if γ_{j} > 0 then
23 sample β_{j} from the γ_{j}th conditional distribution using ARS;
24 ε_{i} ← ε_{i} − x_{ij}β_{j};
25 \({\epsilon }_{i}\leftarrow \exp (\alpha {\varepsilon }_{i}-K);\)
26 if γ_{j} = 0 then
27 set β_{j} = 0;
28 Sample π;
29 Sample \({\sigma }_{G}^{2}\).
Algorithm 2
Bulk Synchronous Parallel Gibbs sampling with BayesW. Data, parameter initialisation and prior values are set as in Algorithm 1.
Input: Define T parallel workers (tasks) and synchronisation rate u. Each worker t ∈ {1, . . . , T} has its corresponding vector of marker effects β^{t} (with the number of markers in each \( \sim \frac{M}{T}\), M is the total number of markers), indicator values γ^{t} to update and set of T messages Δε_{t}: N × 1, N is the sample size.
1 Initialise variables;
2 for iteration ← 1 to I do
3 Update μ, δ, α (as in Algorithm 1);
4 foreach subset of size u do
5 Δε_{t} ← 0, t ∈ {1, . . . , T};
6 for t ← 1 to T in parallel do
7 foreach column j from a subset of size u of the columns assigned to worker t do
8 if \({{\boldsymbol{\beta }}}_{j}^{old}=0\) then
9 Calculate \({V}_{0}^{j}\), \({V}_{1}^{j}\), \({V}_{2}^{j}\) based on ϵ;
10 if \({{\boldsymbol{\beta }}}_{j}^{old}\,\ne\, 0\) then
11 \({\widetilde{\varepsilon }}_{i}\leftarrow {\varepsilon }_{i}+{{\bf{x}}}_{ij}{{\boldsymbol{\beta }}}_{j}^{old}\);
12 \({\widetilde{\epsilon }}_{i}\leftarrow \exp (\alpha {\widetilde{\varepsilon }}_{i}-K)\);
13 Calculate \({V}_{0}^{j}\), \({V}_{1}^{j}\), \({V}_{2}^{j}\) based on \(\widetilde{\epsilon }\);
14 Sample mixture indicator γ_{j};
15 if γ_{j} > 0 then
16 sample β_{j} from the γ_{j}th conditional distribution using ARS;
17 if γ_{j} = 0 then
18 set β_{j} = 0;
19 if \(\left({{\boldsymbol{\beta }}}_{j}-{{\boldsymbol{\beta }}}_{j}^{old}\right)\,\ne \,0\) then
20 Δε_{ti} ← Δε_{ti} − x_{ij}\(\left({{\boldsymbol{\beta }}}_{j}-{{\boldsymbol{\beta }}}_{j}^{old}\right)\);
21 Wait until all workers are finished processing their sets of u markers;
22 \({\varepsilon }_{i}\leftarrow {\varepsilon }_{i}+\mathop{\sum }\nolimits_{t = 1}^{T}{{\Delta }}{\varepsilon }_{ti}\);
23 \({\epsilon }_{i}\leftarrow \exp (\alpha {\varepsilon }_{i}-K);\)
24 Update π, \({\sigma }_{G}^{2}\) (as in Algorithm 1);
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
This project uses UK Biobank data under project 35520. UK Biobank genotypic and phenotypic data is available through a formal request at (http://www.ukbiobank.ac.uk). The UK Biobank has ethics approval from the North West Multicentre Research Ethics Committee (MREC). For access to be granted to the Estonian Biobank genotypic and corresponding phenotypic data, a preliminary application must be presented to the oversight committee, who must first approve the project, ethics permission must then be obtained from the Estonian Committee on Bioethics and Human Research, and finally, a full project must be submitted and approved by the Estonian Biobank. This project was granted ethics approval by the Estonian Committee on Bioethics and Human Research (https://genomics.ut.ee/en/biobank.ee/dataaccess). The summary statistics for fastGWA method were accessed through http://fastgwa.info/ukbimp/phenotypes. Summaries of all posterior distributions obtained and full posterior distributions of the SNP marker effects sizes for each trait are deposited on Dryad (https://doi.org/10.5061/dryad.qbzkh18gp)^{43}. Source data are provided with this paper.
Code availability
Our BayesW model is implemented within the software Hydra, with fully open source code available at https://github.com/medical-genomics-group/hydra^{44}. FlashPCA2 commit b8044f1 is available at https://github.com/gabraham/flashpca. fastGWA is part of the GCTA software, version 1.93.2, available at https://cnsgenomics.com/software/. snpnet commit version c6bc103 is available at https://github.com/junyangq/snpnet. plink version 1.9 is available at https://www.cog-genomics.org/plink/. R version 3.6.1 is available at https://www.r-project.org/.
References
Visscher, P. M. et al. 10 years of GWAS discovery: biology, function, and translation. Am. J. Human Genet. 101, 5–22 (2017).
Staley, J. R. et al. A comparison of Cox and logistic regression for use in genome-wide association studies of cohort and case-cohort design. Eur. J. Human Genet. 25, 854–862 (2017).
Syed, H., Jorgensen, A. L. & Morris, A. P. Evaluation of methodology for the analysis of ‘time-to-event’ data in pharmacogenomic genome-wide association studies. Pharmacogenomics 17, 907–915 (2016).
Läll, K., Mägi, R., Morris, A., Metspalu, A. & Fischer, K. Personalized risk prediction for type 2 diabetes: the potential of genetic risk scores. Genet. Med. 19, 322–329 (2017).
Cox, D. R. Regression Models and Life-Tables. J. Royal Statist. Soc. Series B 34, 187–220 (1972).
Syed, H., Jorgensen, A. L. & Morris, A. P. SurvivalGWAS_SV: software for the analysis of genome-wide association studies of imputed genotypes with “time-to-event” outcomes. BMC Bioinformat. 18, 265 (2017).
Syed, H., Jorgensen, A. L. & Morris, A. P. SurvivalGWAS_Power: a user friendly tool for power calculations in pharmacogenetic studies with “time to event” outcomes. BMC Bioinformat. 17, 523–523 (2016).
Rizvi, A. A. et al. gwasurvivr: an R package for genome-wide survival analysis. Bioinformatics 35, 1968–1970 (2018).
Bi, W., Fritsche, L. G., Mukherjee, B., Kim, S. & Lee, S. A fast and accurate method for genome-wide time-to-event data analysis and its application to UK Biobank. Am. J. Human Genet. 107, 222–233 (2020).
Joshi, P. K. et al. Variants near CHRNA3/5 and APOE have age- and sex-related effects on human lifespan. Nat. Commun. 7, 11174 (2016).
Joshi, P. K. et al. Genome-wide meta-analysis associates HLA-DQA1/DRB1 and LPA and lifestyle factors with human longevity. Nat. Commun. 8, 910 (2017).
He, L. & Kulminski, A. M. Fast algorithms for conducting large-scale GWAS of age-at-onset traits using Cox mixed-effects models. Genetics 215, 41–58 (2020).
Tibshirani, R. The lasso method for variable selection in the Cox model. Statist. Med. 16, 385–395 (1997).
Qian, J. et al. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genet. 16, 1–30 (2020).
Li, R. et al. Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank. Biostatistics kxaa038 (2020).
Newcombe, P. J. et al. Weibull regression with Bayesian variable selection to identify prognostic tumour markers of breast cancer survival. Statist. Methods Med. Res. 26, 414–436 (2017).
Duan, W. et al. Bayesian variable selection for parametric survival model with applications to cancer omics data. Human Genom. 12, 49–49 (2018).
Held, L., Gravestock, I. & Sabanés Bové, D. Objective Bayesian model selection for Cox regression. Statist. Med. 35, 5376–5390 (2016).
Trejo Banos, D. et al. Bayesian reassessment of the epigenetic architecture of complex traits. Nat. Commun. 11, 2865 (2020).
Davis, J. & Goadrich, M. The Relationship between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, 233–240 (Association for Computing Machinery, 2006).
Fernando, R., Toosi, A., Wolc, A., Garrick, D. & Dekkers, J. Application of whole-genome prediction methods for genome-wide association studies: a Bayesian approach. J. Agric. Biol. Environ. Statist. 22, 172–193 (2017).
Patxot, M. et al. Probabilistic inference of the genetic architecture of functional enrichment of complex traits. medRxiv. Preprint at https://doi.org/10.1101/2020.09.04.20188433 (2020).
Evans, L. M. et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nat. Genet. 50, 737–745 (2018).
Speed, D. et al. Reevaluation of SNP heritability in complex human traits. Nat. Genet. 49, 986–992 (2017).
Speed, D., Holmes, J. & Balding, D. J. Evaluating and improving heritability models using summary statistics. Nat. Genet. 52, 458–462 (2020).
Hou, K. et al. Accurate estimation of SNP-heritability from biobank-scale data irrespective of genetic architecture. Nat. Genet. 51, 1244–1251 (2019).
Simons, Y. B., Bullaughey, K., Hudson, R. R. & Sella, G. A population genetic interpretation of GWAS findings for human quantitative traits. PLoS Biol. 16, 1–20 (2018).
Jiang, L. et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 51, 1749–1755 (2019).
Harrell Jr., F. E., Lee, K. L. & Mark, D. B. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statist. Med. 15, 361–387 (1996).
Ducrocq, V. & Casella, G. A Bayesian analysis of mixed survival models. Genet. Select. Evol. 28, 505–529 (1996).
Korsgaard, I. R., Madsen, P. & Jensen, J. Bayesian inference in the semiparametric log normal frailty model using Gibbs sampling. Genet. Select. Evol. 30, 241–256 (1998).
Ducrocq, V. Two years of experience with the French genetic evaluation of dairy bulls on production-adjusted longevity of their daughters. Interbull Bullet. 21, 60–68 (1999).
Yazdi, M. H., Visscher, P. M., Ducrocq, V. & Thompson, R. Heritability, reliability of genetic evaluations and response to selection in proportional hazard models. J. Dairy Sci. 85, 1563–1577 (2002).
George, E. I. & McCulloch, R. E. Approaches for Bayesian variable selection. Statist. Sin. 7, 339–373 (1997).
Liu, Q. & Pierce, D. A. A Note on GaussHermite Quadrature. Biometrika 81, 624–629 (1994).
Gilks, W. R. & Wild, P. Adaptive rejection sampling for Gibbs sampling. J. Royal Statist. Soc. Ser. C 41, 337–348 (1992).
Gilks, W. R. Derivative-free adaptive rejection sampling for Gibbs sampling. In Bayesian Statistics 4 (eds Bernardo, J., Berger, J., Dawid, A. P. & Smith, A. F. M.) (Clarendon Press, 1992).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Human Genet. 81, 559–575 (2007).
Abraham, G., Qiu, Y. & Inouye, M. FlashPCA2: principal component analysis of biobank-scale genotype datasets. Bioinformatics 33, 2776–2778 (2017).
Leitsalu, L. et al. Cohort Profile: Estonian Biobank of the Estonian Genome Center, University of Tartu. Int. J. Epidemiol. 44, 1137–1147 (2015).
Tasa, T. et al. Genetic variation in the Estonian population: pharmacogenomics study of adverse drug effects using electronic health records. Eur. J. Human Genet. 27, 442–454 (2019).
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2019).
Ojavee, S. E. & Robinson, M. R. BayesW time-to-event analysis posterior outputs and summary statistics. Dryad. https://doi.org/10.5061/dryad.qbzkh18gp (2021).
Robinson, M. Hydra (version v1.0). Zenodo. https://doi.org/10.5281/zenodo.4555238 (2021).
Acknowledgements
This project was funded by an SNSF Eccellenza Grant to M.R.R. (PCEGP3181181), and by core funding from the Institute of Science and Technology Austria and the University of Lausanne; the work of K.F. was supported by grant PUT1665 of the Estonian Research Council. We would like to thank Mike Goddard for comments that greatly improved the work, the participants of the cohort studies, and the Ecole Polytechnique Fédérale de Lausanne (EPFL) SCITAS for their excellent compute resources, their generosity with their time and the kindness of their support.
Author information
Authors and Affiliations
Contributions
M.R.R. conceived and designed the study. M.R.R. and S.E.O. designed the study with contributions from A.K. and D.T.B. S.E.O., M.R.R., A.K., M.P. and D.T.B. contributed to the analysis. S.E.O., M.R.R., A.K. and D.T.B. derived the equations and the algorithm. S.E.O., E.J.O. and D.T.B. coded the algorithm with contributions from A.K. and M.R.R. S.E.O. and M.R.R. wrote the paper. Z.K., K.L., K.F. and R.M. provided study oversight, contributed data and ran computer code for the analysis. All authors approved the final manuscript prior to submission.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks Duncan Campbell Thomas and the other, anonymous reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ojavee, S.E., Kousathanas, A., Trejo Banos, D. et al. Genomic architecture and prediction of censored time-to-event phenotypes with a Bayesian genome-wide analysis. Nat. Commun. 12, 2337 (2021). https://doi.org/10.1038/s41467-021-22538-w
DOI: https://doi.org/10.1038/s41467-021-22538-w