On 11 March 2020, the World Health Organization (WHO) characterized COVID-19—which is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)—as a pandemic, after 118,000 cases and 4,291 deaths were reported in 114 countries2. As of 6 May 2020 (the date of latest data availability for this study), cases had reached over 3.5 million globally, with more than 240,000 deaths attributed to the virus1. On the same day in the UK, there had been 206,715 confirmed cases of COVID-19, and 30,615 COVID-19-related deaths3.

Age and gender are well-established risk factors for severe COVID-19 outcomes: over 90% of the COVID-19-related deaths in the UK have been in people over 60, and 60% in men4. Various pre-existing conditions have also been associated with increased risk. For example, the Chinese Center for Disease Control and Prevention reported in a study of 44,672 individuals (1,023 deaths) that cardiovascular disease, hypertension, diabetes, respiratory disease and cancers were associated with an increased risk of death5; however, correction for relationships with age was not possible. A UK cross-sectional survey of 16,749 patients who were hospitalized with COVID-19 showed that the risk of death was higher for patients with cardiac, pulmonary and kidney disease, as well as cancer, dementia and obesity (HRs of 1.19–1.39 after correction for age and sex)6. Obesity was associated with treatment escalation in a French intensive care cohort7 (n = 124) and a New York hospital presentation cohort8 (n = 3,615). The risks associated with smoking are unclear9,10,11. People from Black and minority ethnic groups are at increased risk of poor outcomes from COVID-19, for reasons that are unclear12,13.

Patient care is typically managed through electronic health records, which are commonly used in research. However, traditional approaches to the analysis of electronic health records rely on intermittent extracts of small samples of historic data. Evaluating a rapidly arising novel cause of death requires a new approach. We therefore set out to deliver a secure analytics platform inside the data centre of major electronic health records vendors, running across the full, linked and pseudonymized electronic health records of a very large population of NHS patients, to determine factors that are associated with COVID-19-related death in England.

Associations with COVID-19-related death

In total, 17,278,392 adults were included (Fig. 1; cohort description in Table 1). Eleven per cent of individuals (1,851,868) had ethnicity recorded as mixed, South Asian, Black or other (hereafter referred to as Black and minority ethnic, BAME). There were missing data for body mass index (3,751,769; 22%), smoking status (720,923; 4%), ethnicity (4,560,113; 26%) and blood pressure (1,715,095; 10%). COVID-19-related death was recorded in linked death registration data for 10,926 of the study population.

Fig. 1: Flow diagram of the cohort.

The diagram shows the numbers of individuals (n) excluded at different stages and the identification of cases for the main end points.

Table 1 Cohort description with number of COVID-19 deaths by patient characteristics

The overall cumulative incidence of COVID-19-related death 90 days after the start of the study was less than 0.01% in those aged 18–39 years, rising to 0.67% and 0.44% in men and women, respectively, aged 80 years or over (Fig. 2).

Fig. 2: Kaplan–Meier plots for COVID-19-related death.

Plots show COVID-19-related death over time by age and sex.

Associations between patient-level factors and risk of COVID-19-related death are shown in Table 2 and Fig. 3. Increasing age was strongly associated with risk, with people aged 80 or over having a more than 20-fold-increased risk compared to 50–59-year-olds (fully adjusted HR 20.60; 95% confidence interval (CI) 18.70–22.68). With age fitted as a flexible spline, an approximately log-linear relationship was observed (Extended Data Fig. 1). Men had a higher risk than women (fully adjusted HR 1.59 (1.53–1.65)). These findings are consistent with patterns observed in smaller studies worldwide and in the UK14.

Table 2 Hazard ratios and 95% confidence intervals for COVID-19-related death
Fig. 3: Estimated hazard ratios for each patient characteristic from a multivariable Cox model.

Hazard ratios are shown on a log scale. Error bars represent the limits of the 95% confidence interval for the hazard ratio. IMD, index of multiple deprivation; obese class I, BMI 30–34.9; obese class II, BMI 35–39.9; obese class III, BMI ≥ 40; OCS, oral corticosteroid; ref, reference group. All hazard ratios are adjusted for all other factors listed other than ethnicity. Ethnicity estimates are from a separate model among those individuals for whom complete ethnicity data were available, and are fully adjusted for other covariates. Total n = 17,278,392 for the non-ethnicity models, and 12,718,279 for the ethnicity model.

People from all BAME groups were at higher risk than those of white ethnicity. When adjusted only for age and sex, hazard ratios ranged from 1.62–1.88 for Black and South Asian individuals and people of mixed ethnicities, compared to white people, decreasing to 1.43–1.48 after adjustment for all included factors (results for more detailed categories are shown in Extended Data Table 1). BAME ethnicity has previously been found to be associated with an increased risk of COVID-19 infection and poor outcomes12,13,15. Our findings show that only a small part of the excess risk is explained by a higher prevalence of medical problems such as cardiovascular disease or diabetes among BAME people, or by higher levels of deprivation.

We found a consistent pattern of increasing risk with greater deprivation, with the most deprived quintile having a hazard ratio of 1.79 compared to the least deprived, consistent with recent national statistics16. Again, very little of this increased risk was explained by pre-existing disease or clinical factors, suggesting that other social factors have an important role.

Increasing risks were seen with increasing obesity (fully adjusted HR 1.92 (1.72–2.13) for a body mass index (BMI; kg m−2) of over 40), and most comorbidities were associated with a higher risk of COVID-19-related death, including diabetes (greater hazard ratio for those with a recently measured glycated haemoglobin (HbA1c) level of at least 58 mmol mol−1), severe asthma (defined as asthma with recent use of an oral corticosteroid), respiratory disease, chronic heart disease, liver disease, stroke, dementia, other neurological diseases, reduced kidney function (greater hazard ratio associated with a lower estimated glomerular filtration rate; eGFR), autoimmune diseases (rheumatoid arthritis, lupus or psoriasis) and other immunosuppressive conditions (Table 2). Those with a recent (that is, in the last five years) history of haematological malignancy had an at least 2.5-fold increased risk, which decreased slightly after five years. For other cancers, hazard ratios were smaller and increased risks were associated mainly with recent diagnoses. A history of dialysis or end-stage renal failure was associated with increased risk when added in a secondary analysis (HR 3.69 (3.09–4.39)). These findings largely concur with other data, including the UK international severe acute respiratory and emerging infection consortium (ISARIC) study of hospitalized UK patients with COVID-19—which indicated an increased risk of death with cardiac, pulmonary and kidney disease, malignancy, obesity and dementia6—and a large Chinese study that, although lacking correction for age, suggested that cardiovascular disease, hypertension, diabetes, respiratory disease and cancers are associated with increased mortality5. Our results showing that severe asthma is associated with a higher risk are notable, as early data suggested that asthma was under-represented in patients with COVID-19 who were hospitalized or had severe outcomes17,18.

Post hoc analyses of smoking and hypertension

Both current and former smoking were associated with a higher risk in models that were adjusted for age and sex only, but in the fully adjusted model current smoking was associated with a lower risk (fully adjusted HR 0.89 (0.82–0.97)), which concurs with the lower than expected prevalence of smoking that was observed in previous studies among patients with COVID-19 in China10, France11 and the United States19. We investigated this in more depth post hoc by adding covariates individually to the age, sex and smoking model, and found that the change in hazard ratio was driven largely by adjustment for chronic respiratory disease (HR 0.98 (0.90–1.06) after adjustment). This and other comorbidities could be consequences of smoking, highlighting that the fully adjusted smoking hazard ratio cannot be interpreted causally owing to the inclusion of factors that are likely to mediate smoking effects. We therefore then fitted a model adjusted for demographic factors only (age, sex, deprivation and ethnicity), which showed a non-significant positive hazard ratio for current smoking (HR 1.07 (0.98–1.18)). This does not support any postulated protective effect of nicotine9,20, but suggests that any increased risk with current smoking is likely to be small and will need to be clarified as the epidemic progresses and more data accumulate.

We similarly investigated the change in the hypertension hazard ratio (from 1.09 (1.05–1.14) adjusted for age and sex, to 0.89 (0.85–0.93) with all covariates included), and found that diabetes and obesity were principally responsible for this reduction (HR 0.97 (0.92–1.01) adjusted for age, sex, diabetes and obesity). Given the strong association between blood pressure and age we then examined the interaction between these variables; this revealed strong evidence of interaction (P < 0.001), with hypertension associated with a higher risk up to the age of 70 years and a lower risk above the age of 70 (adjusted HRs 3.10 (1.69–5.70), 2.73 (1.96–3.81), 2.07 (1.73–2.47), 1.32 (1.17–1.50), 0.94 (0.86–1.02) and 0.73 (0.69–0.78) for ages 18–39, 40–49, 50–59, 60–69, 70–79 and 80 or over, respectively). The reasons for the inverse association between hypertension and mortality in older individuals are unclear and warrant further investigation, including detailed examination of frailty, comorbidity and drug exposures in this age group.

Model checking and sensitivity analyses

The average C-statistic—a measure of the model’s ability to distinguish between patients who experience COVID-19-related deaths and those who do not, ranging from 0 (no ability) to 1 (perfect ability)—was 0.93. Results were similar when missing data were handled using analysis of complete records only, or using multiple imputation (sensitivity analyses; Extended Data Table 2). Non-proportional hazards were detected in the primary model (P < 0.001). A sensitivity analysis with earlier administrative censoring at 6 April 2020—before which mortality should not have been affected by the social distancing policies that were introduced in the UK in late March—showed no evidence of non-proportional hazards (P = 0.83). Hazard ratios were similar but somewhat larger in magnitude for some covariates, whereas the association with increasing deprivation appeared to be smaller (Extended Data Table 2).


This secure analytics platform operating across NHS patient records of over 17 million adults and 6 million children was used to identify, quantify and analyse factors associated with COVID-19-related death in one of the largest cohort studies on this topic conducted by any country so far. Most comorbidities were associated with increased risk, including cardiovascular disease, diabetes, respiratory disease (including severe asthma), obesity, a history of haematological malignancy or recent other cancer, kidney, liver and neurological diseases, and autoimmune conditions. South Asian and Black people had a substantially higher risk of COVID-19-related death than white people, and this was only partly attributable to comorbidities, deprivation or other factors. A strong association between deprivation and risk was also only partly explained by comorbidities or other factors.

Our analyses provide a preliminary picture of how key demographic characteristics and a range of comorbidities—which were a priori selected as being of interest in COVID-19—are jointly associated with poor outcomes. These initial results may be used to inform the development of prognostic models. We caution against interpreting our estimates as causal effects. For example, the fully adjusted smoking hazard ratio does not capture the causal effect of smoking, owing to the inclusion of comorbidities that are likely to mediate any effect of smoking on COVID-19-related death (for example, chronic obstructive pulmonary disease). Our study has highlighted a need for carefully designed analyses that specifically focus on the causal effect of smoking on COVID-19-related death. Similarly, there is a need for analyses exploring the causal relationships that underlie the associations observed between hypertension and COVID-19-related death.

Strengths and weaknesses

The greatest strengths of this study are its size and the speed at which it was conducted. By building a secure analytics platform across routinely collected live clinical data stored in situ, we have produced timely results from the current NHS records of approximately 40% of the English population. The large scale of the study allows more precision—on rarer exposures and on multiple factors—and rapid detection of important signals. Our platform will expand to provide updated analyses over time. Another strength is our use of open methods: we pre-specified our analysis plan and shared our full analytic code and codelists for review and reuse. We ascertained patient demographics, medications and comorbidities from full pseudonymized longitudinal primary care records, which provide substantially more detail than data that are recorded on admission to hospital, and which take into account the total population rather than the selected subset of individuals who present at hospitals. We censored deaths from other causes using data from the UK Office for National Statistics (ONS). Analyses were stratified by area to account for known geographical differences in the incidence of COVID-19.

The study also has some important limitations. In our outcome definition, we included clinically suspected (non-laboratory-confirmed) cases of COVID-19, because testing has not always been carried out, especially in older patients in care homes. However, this may have resulted in some patients being incorrectly identified as having COVID-19. In addition, some COVID-19-related deaths may have been misclassified as non-COVID-19, particularly in the early stages of the pandemic; however, this inaccuracy is likely to have reduced quickly as the number of deaths increased, and a degree of outcome underascertainment—providing it is unrelated to patient characteristics—should not have biased our hazard ratios. Owing to the rarity of the outcome, the associations observed will be driven primarily by the profile of patient characteristics in the included cases. Our findings reflect both an individual’s risk of infection and their risk of dying once infected. We will consider more detailed patient trajectories in future research within the OpenSAFELY platform.

Our large population may not be fully representative. We include only 17% of general practices in London—where many of the earlier cases of COVID-19 occurred—owing to the substantial geographical variation in the choice of electronic health record system. The user interface of electronic health records can affect prescribing of certain medicines20,21,23, so it is possible that coding varies between systems.

Primary care records are detailed and longitudinal, but can be incomplete for data on patient characteristics. Ethnicity was missing for approximately 26% of patients, but was broadly representative24; there were also missing data on obesity and smoking. Sensitivity analyses found that our estimates were robust to our assumptions around missing data.

Non-proportional hazards could be due to very large numbers or unmeasured covariates. However, rapid changes in social behaviours (social distancing, shielding) and changes in the burden of infection may also have affected patient groups differentially. The larger hazard ratios seen for several covariates in a sensitivity analysis with earlier censoring (soon after social distancing and shielding policies were introduced) are consistent with patients who are more at risk being more compliant with these policies. By contrast, the risk associated with deprivation may have increased over time. Further analyses will explore the changes before and after the implementation of national initiatives around COVID-19.

Policy implications and interpretation

The UK has a policy of recommending shielding (staying at home at all times and avoiding any face-to-face contact) for groups who are identified as being extremely vulnerable to COVID-19 on the basis of pre-existing medical conditions25. We were able to evaluate the association between most of these conditions and death from COVID-19, and we confirmed the increased mortality risks, supporting the targeted use of additional protection measures for people in these groups. We have demonstrated that only a small part of the substantially increased risks of COVID-19-related death among BAME groups and among people living in more-deprived areas can be attributed to existing disease. Improved strategies to protect people in these groups are urgently needed26. These might include the specific consideration of BAME groups in shielding guidelines and workplace policies. Studies are needed to investigate the interplay of additional factors that we were unable to examine, including employment, access to personal protective equipment and the related risk of exposure to infection, and household density.

The UK has an unusually large volume of very detailed longitudinal patient data, especially through primary care, and we believe the UK has a responsibility to the global community to make good use of these data. OpenSAFELY demonstrates, on a very large scale, that this can be done securely, transparently and rapidly. We will enhance the OpenSAFELY platform to further inform the global response to the COVID-19 emergency.

Future research

The underlying causes of the higher risk of COVID-19-related death among BAME individuals, and among people from deprived areas, require further investigation. We would suggest collecting data on occupational exposure and living conditions as first steps. The statistical power offered by our approach means that associations with less-common factors can be robustly assessed in more detail and at the earliest possible date as the pandemic progresses. We will therefore update our findings and address smaller risk groups as new cases arise over time. The open source reusable codebase on OpenSAFELY supports the rapid, secure and collaborative development of new analyses; we are currently conducting expedited studies on the effects of various medical treatments and population interventions on the risk of COVID-19 infection, admission to intensive care units and death, alongside other observational analyses. OpenSAFELY is rapidly scalable for the incorporation of more NHS patient records, and new sources of data are progressing.

In conclusion, we have generated early insights into factors associated with COVID-19-related death using the detailed primary care records of 17 million NHS patients, while maintaining privacy, in the context of a global health emergency.


Study design

We conducted a cohort study using national primary care electronic health record data linked to data on COVID-19-related deaths (see ‘Data source’). The cohort study began on 1 February 2020, which was chosen as a date several weeks before the first reported COVID-19-related deaths and the day after the second laboratory-confirmed case27; and ended on 6 May 2020. The cohort study examines risk among the general population rather than in a population infected with SARS-CoV-2. Therefore, all patients were included irrespective of any SARS-CoV-2 test results. No randomization was undertaken. Outcome assessment was undertaken as part of routine health care; therefore, no blinding of any sort was attempted. However, study investigators had no involvement in outcome assessment.

Data source

We used patient data from general practice (GP) records managed by the GP software provider The Phoenix Partnership (TPP), linked to death data from the ONS. ONS data include information on all deaths, including COVID-19-related death (defined as a COVID-19 ICD-10 code mentioned anywhere on the death certificate) and non-COVID-19 death, which was used for censoring.

The data were accessed, linked and analysed using OpenSAFELY, a new data analytics platform that was created to address urgent questions relating to the epidemiology and treatment of COVID-19 in England. OpenSAFELY provides a secure software interface that allows detailed pseudonymized primary care patient records to be analysed in near-real time where they already reside—hosted within the highly secure data centre of the electronic health records vendor—to minimize the reidentification risks when data are transported off-site; other smaller datasets are linked to these data within the same environment using a matching pseudonym derived from the NHS number. More information can be found at

The dataset that was analysed with OpenSAFELY is based on around 24 million currently registered patients (approximately 40% of the English population) from GP surgeries using the TPP SystmOne electronic health record system. SystmOne is a secure centralized electronic health records system that has been used in English clinical practice since 1998; it records data entered (in real time) by GPs and practice staff during routine primary care. The system is accredited under the NHS-approved systems framework for general practice28,29. Data extracted from TPP SystmOne have previously been used in medical research, as part of the ResearchOne dataset30,31. From these electronic health records a pseudonymized dataset was created for OpenSAFELY that consisted of 20 billion rows of structured data, including, for example, the diagnoses, medications, physiological parameters and prior investigations of pseudonymized patients (Extended Data Fig. 2, level 1). All OpenSAFELY data processing took place on TPP’s servers; external data providers securely transferred pseudonymized data (such as COVID-19-related death from ONS) for linkage to OpenSAFELY (Extended Data Fig. 2, level 2); and study definitions developed in Python on GitHub were pulled into the OpenSAFELY infrastructure and used to create a study dataset of one row per patient (Extended Data Fig. 2, level 3). Statistical code was developed using synthetic data and used to analyse the study dataset; this included code to check data ranges, to check consistency of data columns and to produce descriptive statistics for comparison with expected disease prevalences to ensure validity, as well as code to fit our analysis models. Only two authors (K.B. and A.J.W.) accessed OpenSAFELY to run code; no pseudonymized patient-level data were ever removed from TPP infrastructure; and only aggregated, anonymous, manually checked study results were released for publication (Extended Data Fig. 2, level 4). All code for data management and analysis is archived online (see ‘Code availability’).

Study population and observation period

Our study population consisted of all adults (males and females 18 years and above) currently registered as active patients in a TPP GP surgery in England on 1 February 2020. To be included in the study, participants were required to have at least one year of prior follow-up in the GP practice to ensure that baseline patient characteristics could be adequately captured, and to have recorded sex, age and deprivation32 (see ‘Covariates’). Patients were observed from 1 February 2020 and were followed until the first of either their death date (whether COVID-19-related or due to other causes) or the study end date, 6 May 2020. For this analysis, ONS death data were available to 11 May 2020, but we used an earlier censor date to allow for delays in reporting of the last few days of available data.
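The follow-up rule described above can be illustrated with a minimal sketch. It is not the study's own code: the function name `follow_up` and the convention of counting days as the simple difference between dates are assumptions made for illustration; only the two dates come from the text.

```python
from datetime import date
from typing import Optional, Tuple

STUDY_START = date(2020, 2, 1)  # observation start, from the text
STUDY_END = date(2020, 5, 6)    # administrative censoring date, from the text


def follow_up(death_date: Optional[date], covid_related: bool) -> Tuple[int, bool]:
    """Return (days in study, event indicator) for one patient.

    Hypothetical helper: a death on or before STUDY_END ends follow-up;
    it counts as an event only if it is COVID-19-related, otherwise the
    patient is censored at the death date. Everyone else is censored at
    STUDY_END.
    """
    if death_date is not None and death_date < STUDY_START:
        raise ValueError("patient died before the start of follow-up")
    if death_date is not None and death_date <= STUDY_END:
        return (death_date - STUDY_START).days, covid_related
    # Alive at study end, or death recorded after the censoring date.
    return (STUDY_END - STUDY_START).days, False
```

A death recorded between 7 and 11 May 2020 is treated here as censored at 6 May, which mirrors the earlier censor date chosen to allow for reporting delays.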


The outcome was COVID-19-related death; this was ascertained from ONS death certificate data in which the COVID-19-related ICD-10 codes U071 or U072 were present in the record.
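As an illustration of this outcome definition, the sketch below flags a death as COVID-19-related if either code appears anywhere among the codes on the certificate. The function name and the normalization of dotted forms such as `U07.1` are assumptions, not details taken from the study.

```python
from typing import Iterable

# Codes named in the text; stored without the dot, as written there.
COVID_ICD10_CODES = {"U071", "U072"}


def is_covid_related_death(certificate_codes: Iterable[str]) -> bool:
    """True if any ICD-10 code on the death certificate is U071 or U072.

    Hypothetical helper: codes are upper-cased and stripped of dots and
    surrounding whitespace so that "U07.1" and "u071" both match.
    """
    normalised = {code.replace(".", "").strip().upper() for code in certificate_codes}
    return not COVID_ICD10_CODES.isdisjoint(normalised)
```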


Characteristics included: health conditions listed in UK guidance on ‘higher risk’ groups33; other common conditions that may cause immunodeficiency inherently or through medication (cancer and common autoimmune conditions); and emerging risk factors for severe outcomes among COVID-19 cases (such as raised blood pressure).

Age, sex, BMI (kg m−2) and smoking status were included. Where categorized, age groups were: 18–39, 40–49, 50–59, 60–69, 70–79 and 80+ years. BMI was ascertained from weight measurements within the last 10 years, restricted to those taken when the patient was over 16 years old. Obesity was grouped using categories derived from the WHO classification of BMI: no evidence of obesity, BMI < 30; obese class I, BMI 30–34.9; obese class II, BMI 35–39.9; and obese class III, BMI 40+. Smoking status was grouped into current-, former- and never-smokers.
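The age and obesity groupings above translate directly into code. The following is a minimal sketch with hypothetical function names; treating a missing BMI as no evidence of obesity follows the primary-analysis assumption stated under ‘Statistical analysis’.

```python
from typing import Optional


def age_group(age: int) -> str:
    """Map age in years to the categories used in the text."""
    if age < 18:
        raise ValueError("the study population is restricted to adults")
    for upper, label in [(40, "18-39"), (50, "40-49"), (60, "50-59"),
                         (70, "60-69"), (80, "70-79")]:
        if age < upper:
            return label
    return "80+"


def obesity_category(bmi: Optional[float]) -> str:
    """Map BMI (kg/m^2) to the WHO-derived obesity classes in the text."""
    if bmi is None or bmi < 30:
        # Missing BMI is treated as non-obese, as in the primary analysis.
        return "no evidence of obesity"
    if bmi < 35:
        return "obese class I"
    if bmi < 40:
        return "obese class II"
    return "obese class III"
```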

The following comorbidities were also considered: asthma, other chronic respiratory disease, chronic heart disease, diabetes mellitus, chronic liver disease, chronic neurological diseases, common autoimmune diseases (rheumatoid arthritis, systemic lupus erythematosus or psoriasis), solid organ transplant, asplenia, other immunosuppressive conditions, cancer, evidence of reduced kidney function, and raised blood pressure or a diagnosis of hypertension.

Disease groupings followed national guidance on risk of influenza infection34, therefore ‘chronic respiratory disease (other than asthma)’ included chronic obstructive pulmonary disease, fibrosing lung disease, bronchiectasis or cystic fibrosis; and ‘chronic heart disease’ included chronic heart failure, ischaemic heart disease, and severe valve or congenital heart disease likely to require lifelong follow-up. Chronic neurological conditions were separated into diseases with a probable cardiovascular aetiology (stroke, transient ischaemic attack, dementia) and conditions in which respiratory function may be compromised, such as motor neurone disease, myasthenia gravis, multiple sclerosis, Parkinson's disease, cerebral palsy, quadriplegia or hemiplegia and progressive cerebellar disease. Asplenia included splenectomy or a spleen dysfunction, including sickle cell disease. Other immunosuppressive conditions included human immunodeficiency virus (HIV) or a condition inducing permanent immunodeficiency ever diagnosed, or aplastic anaemia or temporary immunodeficiency recorded within the last year. Haematological malignancies were considered separately from other cancers to reflect the immunosuppression associated with haematological malignancies and their treatment. Kidney function was ascertained from the most recent serum creatinine measurement, where available, and was converted into the eGFR using the chronic kidney disease epidemiology collaboration (CKD-EPI) equation35, with reduced kidney function grouped into eGFR 30–59.9 or <30 ml min−1 per 1.73 m2. History of kidney dialysis or end-stage renal failure was separately explored in a secondary analysis. Raised blood pressure was defined as either a previous coded diagnosis of hypertension or the most recent recording indicating systolic blood pressure ≥ 140 mm Hg or diastolic blood pressure ≥ 90 mm Hg.
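To make the kidney function and blood pressure definitions concrete, the sketch below computes eGFR from serum creatinine and applies the thresholds given above. It rests on several assumptions that the text does not confirm: it uses the 2009 CKD-EPI creatinine equation, omits that equation's ethnicity coefficient, and assumes creatinine is recorded in µmol/l (converted to mg/dl by dividing by 88.4). All function names are hypothetical.

```python
from typing import Optional


def egfr_ckd_epi(creatinine_umol_l: float, age: float, female: bool) -> float:
    """eGFR (ml/min per 1.73 m^2) from the 2009 CKD-EPI creatinine equation.

    Assumptions: creatinine is in µmol/l, and the ethnicity multiplier of
    the 2009 equation is left out.
    """
    scr = creatinine_umol_l / 88.4           # µmol/l -> mg/dl
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = scr / kappa
    egfr = 141 * min(ratio, 1.0) ** alpha * max(ratio, 1.0) ** -1.209 * 0.993 ** age
    return egfr * 1.018 if female else egfr


def kidney_function_group(egfr: float) -> str:
    """Group eGFR using the cut-points given in the text."""
    if egfr < 30:
        return "eGFR <30"
    if egfr < 60:
        return "eGFR 30-59.9"
    return "eGFR >=60"


def raised_blood_pressure(hypertension_diagnosis: bool,
                          systolic: Optional[float],
                          diastolic: Optional[float]) -> bool:
    """Coded hypertension, or most recent reading >=140 systolic or >=90 diastolic."""
    if hypertension_diagnosis:
        return True
    high_systolic = systolic is not None and systolic >= 140
    high_diastolic = diastolic is not None and diastolic >= 90
    return high_systolic or high_diastolic
```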

Asthma was grouped by use of oral corticosteroids as an indication of severity. Diabetes was grouped according to the most recent HbA1c measurement within the last 15 months (HbA1c < 58 mmol mol−1; HbA1c ≥ 58 mmol mol−1; or no recent measure available). Cancer was grouped by time since the first diagnosis (within the last year; between 1 and 4.9 years ago; more than 5 years ago).
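The diabetes and cancer groupings can be sketched as follows. Function names and category labels are illustrative only, and the handling of a diagnosis made exactly five years ago (placed in the oldest group here) is an assumption, since the text does not specify it.

```python
from typing import Optional


def diabetes_group(has_diabetes: bool,
                   hba1c_mmol_mol: Optional[float],
                   months_since_measurement: Optional[float]) -> str:
    """Group diabetes by the most recent HbA1c within the last 15 months."""
    if not has_diabetes:
        return "no diabetes"
    no_recent_measure = (hba1c_mmol_mol is None
                         or months_since_measurement is None
                         or months_since_measurement > 15)
    if no_recent_measure:
        return "diabetes, no recent HbA1c measure"
    if hba1c_mmol_mol >= 58:
        return "diabetes, HbA1c >= 58 mmol/mol"
    return "diabetes, HbA1c < 58 mmol/mol"


def cancer_group(years_since_first_diagnosis: Optional[float]) -> str:
    """Group cancer by time since first diagnosis (None means never diagnosed)."""
    if years_since_first_diagnosis is None:
        return "never"
    if years_since_first_diagnosis < 1:
        return "diagnosed within the last year"
    if years_since_first_diagnosis < 5:
        return "diagnosed 1-4.9 years ago"
    return "diagnosed 5 or more years ago"  # boundary handling is an assumption
```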

Other covariates that were considered as potential upstream factors were deprivation and ethnicity. Deprivation was measured by the index of multiple deprivation (IMD, in quintiles, with higher values indicating greater deprivation), derived from the patient’s postcode at lower super output area level for a high degree of precision. Ethnicity was grouped into white, Black, South Asian, mixed, or other. In sensitivity analyses, a more detailed grouping of ethnicity was explored. The Sustainability and Transformation Partnership (STP, an NHS administrative region) of the patient’s general practice was included as an additional adjustment for geographical variation in infection rates across the country.

Information on all covariates was obtained from primary care records by searching TPP SystmOne records for specific coded data. TPP SystmOne allows users to work with the SNOMED-CT clinical terminology, using a GP subset of SNOMED-CT codes. This subset maps onto the native Read version 3 (CTV3) clinical coding system on which SystmOne is built. Medicines are entered or prescribed in a format compliant with the NHS Dictionary of Medicines and Devices (dm+d)36, a local UK extension library of SNOMED. Codelists for particular underlying conditions and medicines were compiled from a variety of sources. These include British National Formulary (BNF) codes from, published codelists for asthma37,38,39, immunosuppression40,41,42, psoriasis43, systemic lupus erythematosus44, rheumatoid arthritis45,46 and cancer47,48, and Read Code 2 lists designed specifically to describe groups who are at increased risk of influenza infection18. Read Code 2 lists were supplemented with SNOMED codes and cross-checked against NHS Quality and Outcomes Framework (QOF) registers, then translated into CTV3 with manual curation. Decisions on every codelist were documented and the final lists were reviewed by at least two authors. Detailed information on compilation and sources for every individual codelist is available at and the lists are available for inspection and reuse by the broader research community.

Statistical analysis

Patient numbers are depicted in a flowchart (Fig. 1). The Kaplan–Meier failure function was estimated by age group and sex. For each patient characteristic, a Cox proportional hazards model was fitted, with days in study as the timescale, stratified by geographical area (STP), and adjusted for sex and age modelled using restricted cubic splines. Violations of the proportional hazards assumption were explored by testing for a zero slope in the scaled Schoenfeld residuals. All patient characteristics, including age (again modelled as a spline), sex, BMI, smoking, IMD quintile, and comorbidities listed above were then included in a single multivariable Cox proportional hazards model, stratified by STP. Hazard ratios from the age-and-sex adjusted and fully adjusted models are reported with 95% confidence intervals. Models were also refitted with age group fitted as a categorical variable to obtain hazard ratios by age group.
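For readers unfamiliar with the Kaplan–Meier failure function mentioned above, the following pure-Python sketch shows the standard product-limit calculation for a single group. It is a teaching example rather than the analysis code; the function name and input format are assumptions, and in practice the estimate would be computed separately for each age group and sex.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def km_failure(times: Sequence[float], events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Kaplan-Meier failure function, 1 - S(t), at each distinct event time.

    times  : follow-up time for each patient (for example, days in study)
    events : True if follow-up ended with the outcome, False if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)          # everyone leaving the risk set at time t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1 - d / at_risk
            curve.append((t, 1 - survival))
        at_risk -= exits[t]         # events and censored both leave after t
    return curve
```

Patients censored at time t are kept in the risk set for events at t, which is the usual convention.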

In the primary analysis, those with missing BMI were assumed to be non-obese and those with missing smoking information were assumed to be non-smokers on the assumption that both obesity and smoking would be likely to be recorded if present. A sensitivity analysis was run among those with complete BMI and smoking data only. Ethnicity was omitted from the main multivariable model owing to data being missing for 26% of individuals; hazard ratios for ethnicity were therefore obtained from a separate model among individuals with complete ethnicity data only. Hazard ratios for other patient characteristics, adjusted for ethnicity, were also obtained from this model and are presented in the sensitivity analyses to allow assessment of whether estimates were distorted by ethnicity in the primary model. We conducted an additional sensitivity analysis using a population-calibrated imputation approach to handle missing ethnicity49,50, with marginal proportions of each ethnicity group within each of nine broad geographical regions of England (East, East Midlands, London, North East, North West, South East, South West, West Midlands, Yorkshire and The Humber) taken from Annual Population Survey (APS) data (pooled 2014–2016)51. Five imputed datasets were created with estimated hazard ratios combined using Rubin’s rules.
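Rubin’s rules combine the estimates from each imputed dataset on the log hazard ratio scale. The sketch below shows the pooling step for a single coefficient; the function name is hypothetical, and a normal approximation (z = 1.96) is used for the interval in place of the t reference distribution that a full implementation would use.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(log_hrs: Sequence[float], std_errs: Sequence[float]) -> Tuple[float, float, float]:
    """Pool log hazard ratios from m imputed datasets using Rubin's rules.

    Returns (pooled HR, lower 95% limit, upper 95% limit).
    Assumption: normal-approximation interval, for brevity.
    """
    m = len(log_hrs)
    pooled = mean(log_hrs)                       # pooled point estimate
    within = mean([se ** 2 for se in std_errs])  # average within-imputation variance
    between = variance(log_hrs)                  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    se = math.sqrt(total)
    z = 1.96
    return math.exp(pooled), math.exp(pooled - z * se), math.exp(pooled + z * se)
```

When the imputed datasets give identical estimates, the between-imputation term vanishes and the pooled interval equals the single-dataset interval; disagreement between imputations widens it.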

The C-statistic was calculated as a measure of model discrimination. Owing to computational time, this was estimated by randomly sampling 5,000 patients with and without the outcome and calculating the C-statistic using the random sample, repeating this 10 times and taking the average C-statistic. Weights were applied to account for the sampling56.
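
The sampling scheme described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the method of ref. 56: the function names, the use of a Harrell-type pairwise concordance, and the choice of weights (the inverse of each group’s sampling fraction, multiplied within each pair) are all assumptions made for this example. The defaults mirror the 5,000 patients per group and 10 repeats mentioned in the text.

```python
import random


def weighted_c(times, events, risks, weights):
    """Harrell-type concordance with pair weights w_i * w_j.

    A pair (i, j) is usable when i had the event and j was followed for longer.
    It is concordant when the model assigns i the higher predicted risk.
    """
    num = den = 0.0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[j] > times[i]:
                w = weights[i] * weights[j]
                den += w
                if risks[i] > risks[j]:
                    num += w
                elif risks[i] == risks[j]:
                    num += 0.5 * w  # tied predictions count as half
    if den == 0:
        raise ValueError("no usable pairs")
    return num / den


def sampled_c(times, events, risks, n_per_group=5000, repeats=10, seed=0):
    """Average the weighted C-statistic over repeated random subsamples."""
    rng = random.Random(seed)
    cases = [i for i, e in enumerate(events) if e]
    controls = [i for i, e in enumerate(events) if not e]
    estimates = []
    for _ in range(repeats):
        idx, w = [], []
        for group in (cases, controls):
            k = min(n_per_group, len(group))
            idx += rng.sample(group, k)
            w += [len(group) / k] * k  # inverse sampling fraction (assumption)
        estimates.append(weighted_c(
            [times[i] for i in idx],
            [events[i] for i in idx],
            [risks[i] for i in idx],
            w,
        ))
    return sum(estimates) / len(estimates)


# Hypothetical toy data: follow-up time, death indicator, predicted risk
toy_t = [1, 2, 3, 4, 5, 6]
toy_e = [1, 1, 0, 0, 1, 0]
toy_r = [0.9, 0.2, 0.3, 0.4, 0.5, 0.1]
full_c = weighted_c(toy_t, toy_e, toy_r, [1.0] * 6)
approx_c = sampled_c(toy_t, toy_e, toy_r, n_per_group=2, repeats=10, seed=1)
print(full_c, approx_c)
```

Because the pairwise comparison scales with the square of the number of patients, restricting each repeat to a fixed-size subsample keeps the cost bounded regardless of cohort size, which is the motivation given in the text.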

All P values presented are two-sided.

Information governance and ethics

NHS England is the data controller; TPP is the data processor; and the key researchers on OpenSAFELY are acting on behalf of NHS England. This implementation of OpenSAFELY is hosted within the TPP environment, which is accredited to the ISO 27001 information security standard and is NHS IG Toolkit compliant52,53; patient data have been pseudonymized for analysis and linkage using industry standard cryptographic hashing techniques; all pseudonymized datasets transmitted for linkage onto OpenSAFELY are encrypted; access to the platform is through a virtual private network (VPN) connection, restricted to a small group of researchers, their specific machine and IP address; the researchers hold contracts with NHS England and only access the platform to initiate database queries and statistical models; all database activity is logged; and only aggregate statistical outputs leave the platform environment following best practice for anonymization of results such as statistical disclosure control for low cell counts54. The OpenSAFELY research platform adheres to the data protection principles of the UK Data Protection Act 2018 and the EU General Data Protection Regulation (GDPR) 2016. In March 2020, the Secretary of State for Health and Social Care used powers under the UK Health Service (Control of Patient Information) Regulations 2002 (COPI) to require organizations to process confidential patient information for the purposes of protecting public health, providing healthcare services to the public and monitoring and managing the COVID-19 outbreak and incidents of exposure55. Together, these provide the legal bases to link patient datasets on the OpenSAFELY platform. GP practices, from which the primary care data are obtained, are required to share relevant health information to support the public health response to the pandemic, and have been informed of the OpenSAFELY analytics platform. 
This study was approved by the Health Research Authority (REC reference 20/LO/0651) and by the London School of Hygiene and Tropical Medicine (LSHTM) ethics board (reference 21863). No further ethical or research governance approval was required by the University of Oxford but copies of the approval documents were reviewed and held on record. Guarantor: B.G. and L.S.

Patient and public involvement

Patients were not formally involved in developing this specific study design. We have developed a publicly available website that allows any patient or member of the public to contact us regarding this study or the broader OpenSAFELY project. This feedback will be used to refine and prioritize our OpenSAFELY activities.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.