A comparison of Cox and logistic regression for use in genome-wide association studies of cohort and case-cohort design

Logistic regression is often used instead of Cox regression to analyse genome-wide association studies (GWAS) of single-nucleotide polymorphisms (SNPs) and disease outcomes with cohort and case-cohort designs, as it is less computationally expensive. Although Cox and logistic regression models have been compared previously in cohort studies, this work does not completely cover the GWAS setting nor extend to the case-cohort study design. Here, we evaluated Cox and logistic regression applied to cohort and case-cohort genetic association studies using simulated data and genetic data from the EPIC-CVD study. In the cohort setting, there was a modest improvement in power to detect SNP–disease associations using Cox regression compared with logistic regression, which increased as the disease incidence increased. In contrast, logistic regression had more power than (Prentice weighted) Cox regression in the case-cohort setting. Logistic regression yielded inflated effect estimates (assuming the hazard ratio is the underlying measure of association) for both study designs, especially for SNPs with greater effect on disease. Given logistic regression is substantially more computationally efficient than Cox regression in both settings, we propose a two-step approach to GWAS in cohort and case-cohort studies. First to analyse all SNPs with logistic regression to identify associated variants below a pre-defined P-value threshold, and second to fit Cox regression (appropriately weighted in case-cohort studies) to those identified SNPs to ensure accurate estimation of association with disease.


INTRODUCTION
Cox proportional hazards models are regularly used to analyse timeto-event data in prospective epidemiological cohort and case-cohort studies (Supplementary Text). Case-cohort studies are similar to cohort studies, with the exception that full covariate information is only collected for those individuals who develop the disease over follow-up and a randomly selected subgroup of the initial cohort, referred to as the subcohort (see Supplementary Figure S1). In contrast to traditional epidemiology, logistic regression is often used in genome-wide association studies (GWAS) of cohort and casecohort data to assess the associations of single-nucleotide polymorphisms (SNPs) and disease outcomes, ignoring the time-to-event information in prospective studies (Supplementary Text). 1,2 The reasons for this include the faster computational time of the logistic regression model, the lack of implementation of time-to-event analysis models within most GWAS software and that genetic studies are often combined in multi-study consortia using meta-analysis in which it is convenient to analyse both case-control studies and prospective studies in the same way.
Cox and logistic regression models have been previously compared in cohort studies [3][4][5][6][7][8] but not in case-cohort studies. For cohort studies, it was reported that: (i) the Cox model yields more precise estimates of association; (ii) odds ratios (ORs) and hazard ratios (HRs) diverge as follow-up time, cumulative disease incidence and the strength of the association increase; (iii) in certain situations the Cox model has greater statistical power; and (iv) the Cox model takes a longer time to compute than the logistic regression model. However, most of these studies are at least 20 years old when computational power was limited. Thus, inferences were based on mathematical theory as well as results from small observational studies, and as such do not fully answer the relevant questions for modern GWASs, which are large studies where a small to modest increase in power and precision would probably be sacrificed for an increase in computational efficiency and for practical reasons (such as the availability of suitable software). A more recent study investigated the differences in power between these methods (using simulations) for genetic associations of coronary heart disease (CHD) in a cohort of familial hypercholesterolaemia patients. 9 The Cox model had significantly greater power compared with logistic regression for SNP-CHD associations in this very high-risk population. However, this study is not generalisable to cohorts with lower disease incidences. Furthermore, practitioners have thus far assumed findings for cohort studies would extrapolate to casecohort studies, but this has not been formally investigated.
Hence we investigated the potential gains in power and precision against the greater computation time of using Cox proportional hazards models instead of logistic regression models in prospective cohort and case-cohort genetic studies. We performed a simulation study, and validated our findings empirically using SNP-CHD associations from the EPIC-CVD case-cohort study.

MATERIALS AND METHODS
In the simulation study, we assessed the performance of the Cox and logistic regression models (details of these models are given in the Supplementary Text) in terms of bias, precision and statistical power for cohort and case-cohort studies. The computational time needed to perform a GWAS using both models in cohort and case-cohort studies was also assessed using simulated data. We further compared the Cox and logistic regression models using SNP-CHD associations in EPIC-CVD for loci that have previously been robustly associated with CHD. 1 All analyses were performed in R (version 3.1.0) on a 24 core Linux server (1.8 GHz, 124Gb RAM).

Simulation study
We simulated survival data under both the cohort and case-cohort study designs. HR was used as the underlying measure of association and is assumed to be the reference (or gold) standard, and as such, bias was defined as the estimated log(HR) or log(OR) minus the underlying log(HR) (i.e. log(ĤR) − log(HR) or log(ÔR) − log(HR)). Precision as estimated by the model was assessed using the standard error (SE) of the estimated log(HR) or log(OR). Statistical power was defined (at the per SNP-test level) as the percentage of simulation replications where the SNP-disease association had a P-value less than the type I error rate of α = 0.05.
We considered various combinations of sample size, risk allele frequency (RAF), cumulative disease incidence (proportion of individuals who experienced the disease outcome over the period of follow-up), sampling fraction (for the case-cohort design) and HRs between a single hypothetical SNP and disease, assuming a multiplicative allelic effects model. Sample size was varied over 5000, 10 000 and 25 000 for the cohort study design, and the initial cohort was set to 40 000 for the case-cohort study design. Sampling fractions of 5, 10 and 15% were used in the case-cohort design to generate the subcohort. The additional cases from the initial cohort were then added to create the casecohort data set. Since the differences between the logistic and Cox model are small for rare diseases, [3][4][5] we selected cumulative disease incidences of 5, 10 and 15% to reflect common diseases (e.g. CHD and type II diabetes) often seen over a 20-year period for the age group studied in the simulations. The RAFs examined were 0.05, 0.10, 0.25, 0.50, 0.75, 0.90 and 0.95. The strength of the association was varied over HRs of 1.05, 1.10, 1.15, 1.20, 1.30, 1.50 and 2.00. To examine the type I error rates of the two approaches, we simulated under an HR of 1. In each simulation scenario the simulation under the null yielded a type I error of approximately 0.05 for both the Cox and logistic regression models (Supplementary Tables S1 and S2).
In each simulation, we constructed a population with an age-structure (in years) similar to that of EPIC-CVD using a normal distribution with mean 56 and standard deviation 6, where all the individuals were assumed to be of the same gender and ethnicity. Genotypes were randomly generated according to the specified RAF using the binomial distribution.
We used the Weibull distribution (with scale parameter λ40, shape parameter v40 and hazard function λvt v − 1 ) to generate event times 10 (T) assuming a causal effect of age and SNP genotype on disease and proportional hazards, where T is survival time (in years); U is a random variable following a uniform distribution on [0, 1]; A is age measured in years; G is genotype coded as 0, 1 or 2 according to the number of risk alleles that the individual has; β A is the log (HR) for a 1 year increase in age; and β G is the log(HR) for a one allele increase in the risk allele. Approximate estimates of β A and λ for CHD (fatal and nonfatal) were obtained by fitting a Weibull regression model with age as the only covariate in males from the EPIC-CVD subcohort. β G was set equal to the logarithm of the HR for the hypothetical SNP in each simulation scenario. The shape parameter (v) was scaled according to the RAF and underlying HR for the hypothetical SNP to ensure a number of cases that approximately reflects the cumulative disease incidence in each simulation scenario.
We simulated under three censoring distributions with maximum follow-up of 20 years. The first assumed that all the non-cases had complete follow-up of 20 years ('complete follow-up model'). The second assumed each individual was equally likely to be censored at one of four surveys: 5, 10, 15 and 20 years ('survey follow-up model'). The third randomly censored individuals across the 20 years of follow-up using a uniform distribution with minimum 0 years and maximum 20 years ('random follow-up model'). We would expect the censoring distribution of most studies to lie somewhere in-between these distributions. Within each simulation, an individual was considered to have experienced the disease outcome if their event time was less than their censoring time, and to be disease free otherwise. Their time in the study was set to be the minimum of their event time and censoring time.
In the primary simulations, we set the cumulative disease incidence to be either 5, 10 or 15% regardless of the censoring distribution, as model choice is often solely based on the proportion of cases attained over follow-up. However, keeping the proportion of cases constant across censoring strategies changes the disease incidence rate. Hence, we performed a secondary set of simulations where the survey and random censoring strategies were applied to the event time distributions from the complete follow-up model.
We analysed the simulated data using Cox and logistic regression models with genotype and baseline age included as covariates in the model. The Cox model was fitted using the time-on-study time-scale in both the cohort and case-cohort simulations. In the case-cohort setting, Prentice weights and robust SEs were used for the Cox model to account for the sampling process. 11,12 Logistic regression in case-cohort studies applied directly to the cases and the subcohort non-cases gives asymptotic inference of odds ratios. 11 Each simulation scenario was repeated 5000 times. The P-values were calculated using Wald tests.

Computational time
The computational time needed to perform Cox and logistic regression models was estimated using 10 000 variants (with RAFs between 0.05 and 0.95) assuming only one of which is causal with an HR between 1 and 2. In the cohort setting, we used 10 000 simulated individuals. For the case-cohort setting, we generated the case-cohort in the same way described above using a sampling fraction of 15% and an initial cohort of 40 000 simulated individuals. We used the survey follow-up model (see above) and compared the computational times of the models for cumulative disease incidences of 5, 10 and 15%. To estimate these computational times we extended GenABEL 13 to enable it to perform Prentice-weighted Cox models. This extension is available on request from the authors.

Empirical example: SNP-CHD associations in EPIC-CVD
EPIC-CVD is a case-cohort study of 31 050 participants derived from 29 recruitment centres across 10 European studies nested within the EPIC study. 14,15 The subcohort contains 17 640 individuals, of which 631 had an incident fatal or non-fatal CHD (including angina) event. There were an additional 13 333 CHD events outside the subcohort. The mean age at baseline was 55.3 years and 46.9% of the participants are males. Individuals were genotyped using either the Illumina Cardio-MetaboChip array or the Illumina 660W-Quad BeadChip array. In total, 18 889 individuals had genetic information. We analysed 25 SNPs located within known CHD loci, which were genotyped on both arrays and were available in at least 95% of the genotyped individuals. A multiplicative allelic effects model (on the HR/OR scale) was assumed for the SNPs, and we adjusted for age (in years), sex, EPIC-CVD centre (as a categorical variable) and 10 principal components to adjust for ancestry. Age, sex and EPIC centre were used as covariates to account for differences across EPIC centres (including any differences in the distributions of age and sex). We compared the associations from the Cox and logistic modelling approaches in the full EPIC-CVD case-cohort and in the subcohort of EPIC-CVD (as a surrogate cohort study). The Cox model was fitted using the time-on-study time-scale in both settings. Prentice weights and robust SEs were used for the Cox model in the full case-cohort to account for the sampling process. SNPs were aligned to the plus strand of the human genome reference sequence and are displayed on version 19 (Build 37) of the human genome.

Simulation study
Cohort study design. As expected, the power to detect SNP-disease associations increased as the underlying HR increased for both Cox and logistic regression models (Table 1 and Supplementary Table S3). Power also increased as the cumulative disease incidence increased (Table 1 and Supplementary Table S3). The Cox model tended to have more power than logistic regression, and as the cumulative disease incidence increased this difference in power also increased ( Figure 1). Furthermore, the increase in power for the Cox model was greater in magnitude as the rate of censoring over the 20 years of follow-up increased (compare Figures 1a-c), from very small differences in Table 1 Simulation results for cohort studies with 10,000 individuals for a SNP with RAF = 0.10. power for the complete follow-up model (o1%) to larger differences (up to 7%) for the random follow-up model. The mean bias for the Cox model was approximately zero across all the simulation scenarios (Table 1 and Supplementary Table S3). However, the estimated ORs were on average larger than the underlying HRs, especially for larger effect sizes. There was also a larger degree of divergence between the HRs and the ORs as the length of follow-up time increased (average follow-up time: complete follow-up model = 19.1 years; survey follow-up model = 11.9 years; random follow-up model = 9.4 years), and as the cumulative disease incidence increased (Table 1 and Supplementary Table S3). The HRs were more precise (as estimated by the model) than the ORs across all scenarios, and this relative difference in precision was positively correlated with cumulative disease incidence (Table 1 and Supplementary Table S3). This increase in precision offsets the greater effect sizes of the logistic regression model, and hence leads to the greater differences in power between the Cox and logistic regression models for the larger cumulative disease incidences (Figure 1).
The differences in power between the Cox and logistic regression models were reasonably similar across the RAFs for each of the combinations of cumulative disease incidence and censoring strategy (Supplementary Figures S2-S4). In the secondary simulations, where the survey and random follow-up censoring strategies were applied to the event time distributions from the complete follow-up model, there was less divergence between the HRs and ORs as the rate of censoring increased (Supplementary Table S4). This decreased divergence, when censoring was introduced, led to a small increase in the greater power of the Cox model for the larger cumulative disease incidences (Supplementary Table S4).
Case-cohort study design. The absolute power increased as the underlying effect size increased for both modelling approaches ( Table 2 and Supplementary Table S5). These power curves were steeper as the disease incidence and sampling fraction (which increases the sample size) increased for both models. However, for the casecohort study design, we found a large decrease in power when using the Prentice-weighted Cox model compared with logistic regression. This loss of power worsened as the cumulative disease incidence increased and as the amount of censoring increased over the 20 years ( Figure 2).
The mean bias was approximately zero for the Cox model across all of the scenarios. However, the ORs from the logistic regression model tended to be larger than the underlying HR, and this difference increased as the underlying HR and the length of follow-up increased Genetic associations using Cox and logistic models JR Staley et al (Table 2 and Supplementary Table S5). In contrast to the cohort simulations, the HRs were not consistently more precise than the ORs. The mean SEs were smaller for the logistic regression model than the Cox model in the survey and random follow-up models but were slightly larger for the complete follow-up model ( Table 2 and  Supplementary Table S5). This increased precision of the logistic regression model relative to the Cox model in case-cohort studies (compared with cohort studies) occurs because robust SEs were not required for the logistic regression model but they were for the Cox model to account for the sampling process of this study design. The HRs also became less precise as the average length of follow-up time decreased. For instance, the mean SE of the logarithm of the estimated HR for the sampling fraction of 10%, RAF = 0.10, cumulative disease incidence of 10% and underlying HR of 1.10 were 0.052, 0.056 and 0.060 for the complete, survey and random follow-up models, respectively. These large differences in precision between the censoring strategies were caused by the smaller amount of information from the subcohort contributing to the pseudo-likelihood at each failure as the amount of censoring increased. As expected, the distributions of the ORs were similar across the censoring strategies as there were approximately the same number of cases and non-cases across the censoring strategies for the same cumulative disease incidence. Hence, the greater precision of the logistic regression model compared with the Cox model for the random and survey follow-up models leads to the larger differences in power for these models. The increased power of the logistic regression model in the complete follow-up model was mainly driven by the larger effect estimates of the logistic regression model. The differences in power between the Cox model and the logistic regression models were relatively similar across the RAFs for each of the combinations of cumulative disease incidence and censoring strategy (Supplementary Figures S5-S7). In the secondary simulations, where the event time distributions were kept consistent across the censoring strategies, we observed that there was less divergence between the HRs and ORs as the rate of censoring increased (Supplementary Table S6). There was also a small increase in the difference in power between the two models when there was censoring (Supplementary Table S6). This occurred because, unlike when there was no censoring, the logistic regression model was now more precise than the Cox model (Supplementary Table S6).

Computational time
The computational time required to complete a genetic association study of 10 000 SNPs using the Cox model was approximately 18 times greater than the equivalent analysis using logistic regression for cohort studies (cohort size of 10 000), regardless of the cumulative disease incidence (Supplementary Table S7). Similarly, the computational time needed to analyse 10 000 SNPs with the Cox model was at least 190 times greater than the same analysis using logistic regression for case-cohort studies (sampling fraction of 15%; initial cohort of 40 000). We expect these relative differences in computational time between the models to remain the same for data sets with greater numbers of SNPs.
Empirical example: SNP-CHD associations in EPIC-CVD To assess Cox and logistic regression models in cohort studies, we used the subcohort of EPIC-CVD and 25 SNPs from known CHD loci. Since there were only 437 CHD events in the individuals genotyped in the subcohort, there was limited power to detect SNP-CHD associations in the subcohort with either model. Nevertheless, in the subcohort the effect estimates were directionally concordant for the Cox and logistic regression models for 24 of the 25 SNPs (Table 3), and the SNP that was directionally discordant had effect estimates very close to the null for both models. Of the 24 concordant SNPs, 18 had the same direction of effect as in the literature (Supplementary  Table S8). 1,2 The SNPs that showed different directions of association with the literature were in general those SNPs with smaller effect on CHD and hence required large sample sizes to detect these effects (Supplementary Table S8). The Cox model was more precise than logistic regression for all of the SNPs and for the effect sizes further away from the null the ORs tended to be larger in magnitude (Table 3).
We assessed Cox and logistic regression models in case-cohort studies by using the entire EPIC-CVD case-cohort and the same 25 SNPs as above. Here, the effect estimates were directionally concordant for both models for 22 of the 25 SNPs (Table 3), of which 21 had the same direction of effect as in the literature (Supplementary  Table S8). Like in the subcohort, the effect estimates of the SNPs that were directionally discordant between the models lie close to the null and the SNP that was directionally discordant with the literature had a small effect on CHD in the large consortia (Supplementary Table S8). The logistic regression model was more precise than the Cox model for all of the SNPs often leading to much smaller P-values. For instance, the SNP rs974819 (chr11:g.103660567C4T; an SNP located downstream of PDGFD) had similar effect sizes for the Cox (log(HR) = − 0.082) and logistic (log(OR) = − 0.080) regression models, but the SEs were 0.031 and 0.027, respectively. This led to a smaller P-value for the logistic regression model (P = 0.003 compared with P = 0.007). Again, the effect sizes of the logistic regression model tended to be larger in magnitude than those of the Cox model for the effect estimates of greater magnitude. Two examples of this are the SNPs rs9349379 (chr6:g.12903957A4G; an intronic variant in PHACTR1) and rs2075650 (chr19:g.45395619A4G; an intronic variant in TOMM40 upstream of APOE) where the effect sizes were 0.03 and 0.013 larger in magnitude for the logistic regression model compared with the Cox model. This in combination with smaller SEs for the logistic regression model led to much smaller P-values for this model (Table 3).

DISCUSSION
In this paper, we have investigated the differences between the Cox and logistic regression models in cohort and case-cohort studies using simulations and SNP-CHD associations from EPIC-CVD. In the simulations, we examined a wide range of scenarios, including varying the amount of censoring (over a 20-year follow-up period), the RAF of the SNP, the strength of the association and the number of events (by changing the cumulative disease incidence). The results from the EPIC-CVD case-cohort show similar findings to the simulations. We believe that this is the first study to specifically compare the Cox and logistic regression models in case-cohort studies.
We have shown that there is increased power to detect SNP-disease associations using the Cox model instead of logistic regression for the cohort study design and the extent to which it has additional power depends on cumulative disease incidence and the length of follow-up. However, the increase in power using the Cox model was small when the cumulative disease incidence was low or when the non-cases had complete follow-up. We also observed that the ORs and HRs diverge as follow-up time and cumulative disease incidence increase (especially for large underlying effects), that the HRs were more precise, and logistic regression was more computationally efficient. These findings are in line with and extend those from previous studies with the emphasis in the current work being the application of these modelling approaches in genetic association studies. 4,5,[7][8][9] In contrast to the cohort study design, it appears that there is no additional power to be gained by using Prentice-weighted Cox models for genetic associations in case-cohort studies. Indeed, there is a striking loss in power, which occurs because robust SEs are not necessary for the logistic regression model in addition to its effect estimates being greater in magintude. Furthermore, the computational cost of the Cox model was far greater than that of the logistic regression model. Hence, although we recognise the caveats of using the logistic regression model (inflated effect estimates, especially for associations of greater magnitude) in case-cohort studies, we propose that logistic regression could be used as a filter to detect SNPs below a pre-defined P-value threshold for GWAS (of a large number of SNPs (4500 000)) in case-cohort studies. This threshold should be set suitably high to ensure that all of the SNPs that would have been detected at the overall level of significance with the Cox model are contained within this subset, for example, 1 × 10 − 4 if the overall significance threshold is 5 × 10 − 8 . Prentice-weighted Cox models could then be fitted to this subset of SNPs, avoiding the vast computational time required to complete an entire GWAS using the Prentice-weighted Cox model, while obtaining accurate estimates of effect and inference (including P-values) for the SNPs of particular interest. This two-step procedure could also be applied in the cohort setting to reduce computational time.
We must note here that HRs and ORs are different measures of association and have different interpretations. Cox models incorporate the length of time the individuals are followed up and measure whether the risk factor affects the time at which the disease event occurs. Logistic regression assesses whether the risk factor affects the odds of disease, and hence does not take into account the time of disease occurrence. So early and late failures are given the same weight in the analysis. In addition, individuals who are not observed to have the event during the period of follow-up time are treated as controls. This is different from the time-to-event analysis approaches where these individuals are considered to be censored. That is, we assume that all the individuals will have the event at some point but we just do not observe this event over the follow-up period for some individuals. Since these models have different definitions and estimate different parameters, naturally the results from these models will differ, as observed in our work.
The results of our simulations and EPIC-CVD analyses are directly generalisable to diseases with cumulative incidence between 5 and 15% over a 20-year period of follow-up. However, our results could probably be safely extrapolated to other disease incidences over this period of follow-up (i.e. the Cox and logistic model estimates will become increasingly similar as the disease becomes rarer (cumulative disease incidenceo5%) and will diverge more as the disease becomes more common (cumulative disease incidence415%)). Additional avenues of research could include comparing further time-to-event models (e.g. parametric survival models), examining the effect of violating the proportional hazards assumption, meta-analysing casecontrol studies alongside prospective studies and the inclusion of prevalent cases in the analysis.
In summary, to minimise computational time in analysing genetic associations in cohort and case-cohort studies, while obtaining appropriate estimates of effect for SNPs of greater interest we propose a two-step procedure. Firstly, logistic regression is used to analyse all SNPs as an initial filtering process and secondly, Cox regression is fitted to those SNPs associated below a pre-defined P-value threshold Table 3 Results of 25 SNPs previously reported to be associated with CHD in EPIC-CVD β is the logarithm of the association measure (either odds ratio or hazard ratio); SE, standard errors of the logarithm of the association measure (either odds ratio or hazard ratio). In the full case-cohort, the Cox model was Prentice weighted. Alleles are aligned to the + strand.
to avoid inflated estimates of effect as well as to obtain appropriate Pvalues for these SNPs. However, logistic regression remains a practical alternative to Cox models in cohort and case-cohort studies, especially when the disease incidence is low.