Coronavirus disease 2019 (COVID-19) has rapidly affected mortality worldwide1. There is unprecedented urgency to understand who is most at risk of severe outcomes, and this requires new approaches for the timely analysis of large datasets. Working on behalf of NHS England, we created OpenSAFELY, a secure health analytics platform that covers 40% of all patients in England and holds patient data within the existing data centre of a major vendor of primary care electronic health records. Here we used OpenSAFELY to examine factors associated with COVID-19-related death. Primary care records of 17,278,392 adults were pseudonymously linked to 10,926 COVID-19-related deaths. COVID-19-related death was associated with: being male (hazard ratio (HR) 1.59 (95% confidence interval 1.53–1.65)); greater age and deprivation (both with a strong gradient); diabetes; severe asthma; and various other medical conditions. Compared with people of white ethnicity, Black and South Asian people were at higher risk, even after adjustment for other factors (HR 1.48 (1.29–1.69) and 1.45 (1.32–1.58), respectively). We have quantified a range of clinical factors associated with COVID-19-related death in one of the largest cohort studies on this topic so far. More patient records are rapidly being added to OpenSAFELY, and we will update and extend our results regularly.
On 11 March 2020, the World Health Organization (WHO) characterized COVID-19—which is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)—as a pandemic, after 118,000 cases and 4,291 deaths were reported in 114 countries2. As of 6 May 2020 (the date of latest data availability for this study), cases had reached over 3.5 million globally, with more than 240,000 deaths attributed to the virus1. On the same day in the UK, there had been 206,715 confirmed cases of COVID-19, and 30,615 COVID-19-related deaths3.
Age and gender are well-established risk factors for severe COVID-19 outcomes: over 90% of the COVID-19-related deaths in the UK have been in people over 60, and 60% in men4. Various pre-existing conditions have also been associated with increased risk. For example, the Chinese Center for Disease Control and Prevention reported in a study of 44,672 individuals (1,023 deaths) that cardiovascular disease, hypertension, diabetes, respiratory disease and cancers were associated with an increased risk of death5; however, correction for relationships with age was not possible. A UK cross-sectional survey of 16,749 patients who were hospitalized with COVID-19 showed that the risk of death was higher for patients with cardiac, pulmonary and kidney disease, as well as cancer, dementia and obesity (HRs of 1.19–1.39 after correction for age and sex)6. Obesity was associated with treatment escalation in a French intensive care cohort7 (n = 124) and a New York hospital presentation cohort8 (n = 3,615). The risks associated with smoking are unclear9,10,11. People from Black and minority ethnic groups are at increased risk of poor outcomes from COVID-19, for reasons that are unclear12,13.
Patient care is typically managed through electronic health records, which are commonly used in research. However, traditional approaches to the analysis of electronic health records rely on intermittent extracts of small samples of historic data. Evaluating a rapidly arising novel cause of death requires a new approach. We therefore set out to deliver a secure analytics platform inside the data centre of major electronic health records vendors, running across the full, linked and pseudonymized electronic health records of a very large population of NHS patients, to determine factors that are associated with COVID-19-related death in England.
Associations with COVID-19-related death
In total, 17,278,392 adults were included (Fig. 1; cohort description in Table 1). Eleven per cent of individuals (1,851,868) had ethnicity recorded as mixed, South Asian, Black or other (hereafter referred to as Black and minority ethnic, BAME). There were missing data for body mass index (3,751,769; 22%), smoking status (720,923; 4%), ethnicity (4,560,113; 26%) and blood pressure (1,715,095; 10%). COVID-19-related death was recorded in linked death registration data for 10,926 of the study population.
The overall cumulative incidence of COVID-19-related death 90 days after the start of the study was less than 0.01% in those aged 18–39 years, rising to 0.67% and 0.44% in men and women, respectively, aged 80 years or over (Fig. 2).
Associations between patient-level factors and risk of COVID-19-related death are shown in Table 2 and Fig. 3. Increasing age was strongly associated with risk, with people aged 80 or over having a more than 20-fold-increased risk compared to 50–59-year-olds (fully adjusted HR 20.60; 95% confidence interval (CI) 18.70–22.68). With age fitted as a flexible spline, an approximately log-linear relationship was observed (Extended Data Fig. 1). Men had a higher risk than women (fully adjusted HR 1.59 (1.53–1.65)). These findings are consistent with patterns observed in smaller studies worldwide and in the UK14.
People from all BAME groups were at higher risk than those of white ethnicity. When adjusted only for age and sex, hazard ratios ranged from 1.62 to 1.88 for Black and South Asian individuals and people of mixed ethnicities, compared to white people, decreasing to 1.43–1.48 after adjustment for all included factors (results for more detailed categories are shown in Extended Data Table 1). BAME ethnicity has previously been found to be associated with an increased risk of COVID-19 infection and poor outcomes12,13,15. Our findings show that only a small part of the excess risk is explained by a higher prevalence of medical problems such as cardiovascular disease or diabetes among BAME people, or by higher levels of deprivation.
We found a consistent pattern of increasing risk with greater deprivation, with the most deprived quintile having a hazard ratio of 1.79 compared to the least deprived, consistent with recent national statistics16. Again, very little of this increased risk was explained by pre-existing disease or clinical factors, suggesting that other social factors have an important role.
Increasing risks were seen with increasing obesity (fully adjusted HR 1.92 (1.72–2.13) for a body mass index (BMI; kg m−2) of over 40), and most comorbidities were associated with a higher risk of COVID-19-related death, including diabetes (greater hazard ratio for those with a recently measured glycated haemoglobin (HbA1c) level of at least 58 mmol mol−1), severe asthma (defined as asthma with recent use of an oral corticosteroid), respiratory disease, chronic heart disease, liver disease, stroke, dementia, other neurological diseases, reduced kidney function (greater hazard ratio associated with a lower estimated glomerular filtration rate; eGFR), autoimmune diseases (rheumatoid arthritis, lupus or psoriasis) and other immunosuppressive conditions (Table 2). Those with a recent (that is, in the last five years) history of haematological malignancy had an at least 2.5-fold increased risk, which decreased slightly after five years. For other cancers, hazard ratios were smaller and increased risks were associated mainly with recent diagnoses. A history of dialysis or end-stage renal failure was associated with increased risk when added in a secondary analysis (HR 3.69 (3.09–4.39)). These findings largely concur with other data, including the UK international severe acute respiratory and emerging infection consortium (ISARIC) study of hospitalized UK patients with COVID-19—which indicated an increased risk of death with cardiac, pulmonary and kidney disease, malignancy, obesity and dementia6—and a large Chinese study that, although lacking correction for age, suggested that cardiovascular disease, hypertension, diabetes, respiratory disease and cancers are associated with increased mortality5. Our results showing that severe asthma is associated with a higher risk are notable, as early data suggested that asthma was under-represented in patients with COVID-19 who were hospitalized or had severe outcomes17,18.
Post hoc analyses of smoking and hypertension
Both current and former smoking were associated with a higher risk in models that were adjusted for age and sex only, but in the fully adjusted model current smoking was associated with a lower risk (fully adjusted HR 0.89 (0.82–0.97)), which concurs with the lower than expected prevalence of smoking that was observed in previous studies among patients with COVID-19 in China10, France11 and the United States19. We investigated this in more depth post hoc by adding covariates individually to the age, sex and smoking model, and found that the change in hazard ratio was driven largely by adjustment for chronic respiratory disease (HR 0.98 (0.90–1.06) after adjustment). This and other comorbidities could be consequences of smoking, highlighting that the fully adjusted smoking hazard ratio cannot be interpreted causally owing to the inclusion of factors that are likely to mediate smoking effects. We therefore then fitted a model adjusted for demographic factors only (age, sex, deprivation and ethnicity), which showed a non-significant positive hazard ratio for current smoking (HR 1.07 (0.98–1.18)). This does not support any postulated protective effect of nicotine9,20, but suggests that any increased risk with current smoking is likely to be small and will need to be clarified as the epidemic progresses and more data accumulate.
We similarly investigated the change in the hypertension hazard ratio (from 1.09 (1.05–1.14) adjusted for age and sex, to 0.89 (0.85–0.93) with all covariates included), and found that diabetes and obesity were principally responsible for this reduction (HR 0.97 (0.92–1.01) adjusted for age, sex, diabetes and obesity). Given the strong association between blood pressure and age we then examined the interaction between these variables; this revealed strong evidence of interaction (P < 0.001), with hypertension associated with a higher risk up to the age of 70 years and a lower risk above the age of 70 (adjusted HRs 3.10 (1.69–5.70), 2.73 (1.96–3.81), 2.07 (1.73–2.47), 1.32 (1.17–1.50), 0.94 (0.86–1.02) and 0.73 (0.69–0.78) for ages 18–39, 40–49, 50–59, 60–69, 70–79 and 80 or over, respectively). The reasons for the inverse association between hypertension and mortality in older individuals are unclear and warrant further investigation, including detailed examination of frailty, comorbidity and drug exposures in this age group.
Model checking and sensitivity analyses
The average C-statistic—a measure of the model’s ability to distinguish between patients who experience COVID-19-related deaths and those who do not, ranging from 0 (no ability) to 1 (perfect ability)—was 0.93. Results were similar when missing data were handled using analysis of complete records only, or using multiple imputation (sensitivity analyses; Extended Data Table 2). Non-proportional hazards were detected in the primary model (P < 0.001). A sensitivity analysis with earlier administrative censoring at 6 April 2020—before which mortality should not have been affected by the social distancing policies that were introduced in the UK in late March—showed no evidence of non-proportional hazards (P = 0.83). Hazard ratios were similar but somewhat larger in magnitude for some covariates, whereas the association with increasing deprivation appeared to be smaller (Extended Data Table 2).
This secure analytics platform operating across NHS patient records of over 17 million adults and 6 million children was used to identify, quantify and analyse factors associated with COVID-19-related death in one of the largest cohort studies on this topic conducted by any country so far. Most comorbidities were associated with increased risk, including cardiovascular disease, diabetes, respiratory disease (including severe asthma), obesity, a history of haematological malignancy or recent other cancer, kidney, liver and neurological diseases, and autoimmune conditions. South Asian and Black people had a substantially higher risk of COVID-19-related death than white people, and this was only partly attributable to comorbidities, deprivation or other factors. A strong association between deprivation and risk was also only partly explained by comorbidities or other factors.
Our analyses provide a preliminary picture of how key demographic characteristics and a range of comorbidities—which were a priori selected as being of interest in COVID-19—are jointly associated with poor outcomes. These initial results may be used to inform the development of prognostic models. We caution against interpreting our estimates as causal effects. For example, the fully adjusted smoking hazard ratio does not capture the causal effect of smoking, owing to the inclusion of comorbidities that are likely to mediate any effect of smoking on COVID-19-related death (for example, chronic obstructive pulmonary disease). Our study has highlighted a need for carefully designed analyses that specifically focus on the causal effect of smoking on COVID-19-related death. Similarly, there is a need for analyses exploring the causal relationships that underlie the associations observed between hypertension and COVID-19-related death.
Strengths and weaknesses
The greatest strengths of this study are its size and the speed at which it was conducted. By building a secure analytics platform across routinely collected live clinical data stored in situ, we have produced timely results from the current NHS records of approximately 40% of the English population. The large scale of the study allows more precision—on rarer exposures and on multiple factors—and rapid detection of important signals. Our platform will expand to provide updated analyses over time. Another strength is our use of open methods: we pre-specified our analysis plan and shared our full analytic code and codelists for review and reuse. We ascertained patient demographics, medications and comorbidities from full pseudonymized longitudinal primary care records, which provide substantially more detail than data that are recorded on admission to hospital, and which take into account the total population rather than the selected subset of individuals who present at hospitals. We censored deaths from other causes using data from the UK Office for National Statistics (ONS). Analyses were stratified by area to account for known geographical differences in the incidence of COVID-19.
The study also has some important limitations. In our outcome definition, we included clinically suspected (non-laboratory-confirmed) cases of COVID-19, because testing has not always been carried out, especially in older patients in care homes. However, this may have resulted in some patients being incorrectly identified as having COVID-19. In addition, some COVID-19-related deaths may have been misclassified as non-COVID-19, particularly in the early stages of the pandemic; however, this inaccuracy is likely to have reduced quickly as the number of deaths increased, and a degree of outcome underascertainment—providing it is unrelated to patient characteristics—should not have biased our hazard ratios. Owing to the rarity of the outcome, the associations observed will be driven primarily by the profile of patient characteristics in the included cases. Our findings reflect both an individual’s risk of infection and their risk of dying once infected. We will consider more detailed patient trajectories in future research within the OpenSAFELY platform.
Our large population may not be fully representative. We include only 17% of general practices in London—where many of the earlier cases of COVID-19 occurred—owing to the substantial geographical variation in the choice of electronic health record system. The user interface of electronic health records can affect prescribing of certain medicines20,21,23, so it is possible that coding varies between systems.
Primary care records are detailed and longitudinal, but can be incomplete for data on patient characteristics. Ethnicity was missing for approximately 26% of patients, but was broadly representative24; there were also missing data on obesity and smoking. Sensitivity analyses found that our estimates were robust to our assumptions around missing data.
Non-proportional hazards could be due to very large numbers or unmeasured covariates. However, rapid changes in social behaviours (social distancing, shielding) and changes in the burden of infection may also have affected patient groups differentially. The larger hazard ratios seen for several covariates in a sensitivity analysis with earlier censoring (soon after social distancing and shielding policies were introduced) are consistent with patients who are more at risk being more compliant with these policies. By contrast, the risk associated with deprivation may have increased over time. Further analyses will explore the changes before and after the implementation of national initiatives around COVID-19.
Policy implications and interpretation
The UK has a policy of recommending shielding (staying at home at all times and avoiding any face-to-face contact) for groups who are identified as being extremely vulnerable to COVID-19 on the basis of pre-existing medical conditions25. We were able to evaluate the association between most of these conditions and death from COVID-19, and we confirmed the increased mortality risks, supporting the targeted use of additional protection measures for people in these groups. We have demonstrated that only a small part of the substantially increased risks of COVID-19-related death among BAME groups and among people living in more-deprived areas can be attributed to existing disease. Improved strategies to protect people in these groups are urgently needed26. These might include the specific consideration of BAME groups in shielding guidelines and workplace policies. Studies are needed to investigate the interplay of additional factors that we were unable to examine, including employment, access to personal protective equipment and the related risk of exposure to infection, and household density.
The UK has an unusually large volume of very detailed longitudinal patient data, especially through primary care, and we believe the UK has a responsibility to the global community to make good use of these data. OpenSAFELY demonstrates, on a very large scale, that this can be done securely, transparently and rapidly. We will enhance the OpenSAFELY platform to further inform the global response to the COVID-19 emergency.
The underlying causes of the higher risk of COVID-19-related death among BAME individuals, and among people from deprived areas, require further investigation. We would suggest collecting data on occupational exposure and living conditions as first steps. The statistical power offered by our approach means that associations with less-common factors can be robustly assessed in more detail and at the earliest possible date as the pandemic progresses. We will therefore update our findings and address smaller risk groups as new cases arise over time. The open source reusable codebase on OpenSAFELY supports the rapid, secure and collaborative development of new analyses; we are currently conducting expedited studies on the effects of various medical treatments and population interventions on the risk of COVID-19 infection, admission to intensive care units and death, alongside other observational analyses. OpenSAFELY is rapidly scalable for the incorporation of more NHS patient records, and work to add new sources of data is progressing.
In conclusion, we have generated early insights into factors associated with COVID-19-related death using the detailed primary care records of 17 million NHS patients, while maintaining privacy, in the context of a global health emergency.
We conducted a cohort study using national primary care electronic health record data linked to data on COVID-19-related deaths (see ‘Data source’). The cohort study began on 1 February 2020, which was chosen as a date several weeks before the first reported COVID-19-related deaths and the day after the second laboratory-confirmed case27; and ended on 6 May 2020. The cohort study examines risk among the general population rather than in a population infected with SARS-CoV-2. Therefore, all patients were included irrespective of any SARS-CoV-2 test results. No randomization was undertaken. Outcome assessment was undertaken as part of routine health care; therefore, no blinding of any sort was attempted. However, study investigators had no involvement in outcome assessment.
We used patient data from general practice (GP) records managed by the GP software provider The Phoenix Partnership (TPP), linked to death data from the ONS. ONS data include information on all deaths, including COVID-19-related death (defined as a COVID-19 ICD-10 code mentioned anywhere on the death certificate) and non-COVID-19 death, which was used for censoring.
The data were accessed, linked and analysed using OpenSAFELY, a new data analytics platform that was created to address urgent questions relating to the epidemiology and treatment of COVID-19 in England. OpenSAFELY provides a secure software interface that allows detailed pseudonymized primary care patient records to be analysed in near-real time where they already reside—hosted within the highly secure data centre of the electronic health records vendor—to minimize the reidentification risks when data are transported off-site; other smaller datasets are linked to these data within the same environment using a matching pseudonym derived from the NHS number. More information can be found at https://opensafely.org/.
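The linkage step described above, in which datasets are joined on a matching pseudonym derived from the NHS number, can be illustrated with a minimal sketch. The text does not specify the hashing scheme, so the keyed HMAC-SHA256 construction, the function name pseudonymise and the key and identifier values below are all assumptions made for illustration; they are not the method used on the platform.

```python
import hashlib
import hmac


def pseudonymise(nhs_number: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from an NHS number (hypothetical scheme).

    Non-digit characters are dropped so that differently formatted copies
    of the same number map to the same pseudonym.
    """
    digits = "".join(ch for ch in nhs_number if ch.isdigit())
    return hmac.new(secret_key, digits.encode("ascii"), hashlib.sha256).hexdigest()


# Hypothetical key and identifiers, for illustration only.
KEY = b"example-key-not-a-real-secret"

primary_care = {pseudonymise("000 000 0001", KEY): {"age": 71}}
death_records = {pseudonymise("0000000001", KEY): {"covid_related": True}}

# Records are joined on the pseudonym, never on the raw NHS number.
linked = {
    pid: {**record, **death_records.get(pid, {})}
    for pid, record in primary_care.items()
}
```

Because both sources apply the same keyed function, rows can be matched without either side exchanging the underlying identifier.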
The dataset that was analysed with OpenSAFELY is based on around 24 million currently registered patients (approximately 40% of the English population) from GP surgeries using the TPP SystmOne electronic health record system. SystmOne is a secure centralized electronic health records system that has been used in English clinical practice since 1998; it records data entered (in real time) by GPs and practice staff during routine primary care. The system is accredited under the NHS-approved systems framework for general practice28,29. Data extracted from TPP SystmOne have previously been used in medical research, as part of the ResearchOne dataset30,31. From these electronic health records, a pseudonymized dataset was created for OpenSAFELY that consisted of 20 billion rows of structured data, including, for example, the diagnoses, medications, physiological parameters and prior investigations of pseudonymized patients (Extended Data Fig. 2, level 1). All OpenSAFELY data processing took place on TPP’s servers; external data providers securely transferred pseudonymized data (such as COVID-19-related death from ONS) for linkage to OpenSAFELY (Extended Data Fig. 2, level 2); and study definitions developed in Python on GitHub were pulled into the OpenSAFELY infrastructure and used to create a study dataset of one row per patient (Extended Data Fig. 2, level 3). Statistical code was developed using synthetic data and used to analyse the study dataset; this included code to check data ranges, to check consistency of data columns and to produce descriptive statistics for comparison with expected disease prevalences to ensure validity, as well as code to fit our analysis models. Only two authors (K.B. and A.J.W.) accessed OpenSAFELY to run code; no pseudonymized patient-level data were ever removed from TPP infrastructure; and only aggregated, anonymous, manually checked study results were released for publication (Extended Data Fig. 2, level 4). All code for data management and analysis is archived online (see ‘Code availability’).
Study population and observation period
Our study population consisted of all adults (males and females 18 years and above) currently registered as active patients in a TPP GP surgery in England on 1 February 2020. To be included in the study, participants were required to have at least one year of prior follow-up in the GP practice to ensure that baseline patient characteristics could be adequately captured, and to have recorded sex, age and deprivation32 (see ‘Covariates’). Patients were observed from 1 February 2020 and were followed until the first of either their death date (whether COVID-19-related or due to other causes) or the study end date, 6 May 2020. For this analysis, ONS death data were available to 11 May 2020, but we used an earlier censor date to allow for delays in reporting of the last few days of available data.
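As an illustration of the observation period described above, the following sketch computes days in study and an event indicator for a single patient. The function name follow_up is hypothetical, and counting time as the plain difference between exit date and start date is an assumption; the study's own code may handle the boundaries differently.

```python
from datetime import date
from typing import Optional, Tuple

STUDY_START = date(2020, 2, 1)
STUDY_END = date(2020, 5, 6)  # administrative censoring date


def follow_up(death_date: Optional[date], covid_related: bool = False) -> Tuple[int, int]:
    """Return (days_in_study, event) for one patient.

    event is 1 only for a COVID-19-related death within the study window;
    deaths from other causes end follow-up but count as censored.
    """
    if death_date is not None and STUDY_START <= death_date <= STUDY_END:
        exit_date = death_date
        event = 1 if covid_related else 0
    else:
        # Alive at study end, or death recorded after the censoring date.
        exit_date = STUDY_END
        event = 0
    return (exit_date - STUDY_START).days, event
```

For example, a patient with a COVID-19-related death on 1 April 2020 contributes 60 days and an event, whereas a death recorded on 10 May 2020 is censored at 95 days.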
The outcome was COVID-19-related death; this was ascertained from ONS death certificate data in which the COVID-19-related ICD-10 codes U071 or U072 were present in the record.
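The outcome definition can be expressed as a simple check over the codes recorded on a death certificate. This is a sketch only: the function names are hypothetical, and the normalization of dotted codes (for example, U07.1) is an assumption about how codes might be formatted in the source data.

```python
from typing import Iterable

# ICD-10 codes that define the outcome, written without the dot.
COVID_ICD10_CODES = {"U071", "U072"}


def normalise_code(code: str) -> str:
    """Strip whitespace and dots and upper-case, so 'u07.1' becomes 'U071'."""
    return code.replace(".", "").strip().upper()


def is_covid_related_death(certificate_codes: Iterable[str]) -> bool:
    """True if either outcome code appears anywhere on the certificate."""
    return any(normalise_code(c) in COVID_ICD10_CODES for c in certificate_codes)
```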
Characteristics included: health conditions listed in UK guidance on ‘higher risk’ groups33; other common conditions that may cause immunodeficiency inherently or through medication (cancer and common autoimmune conditions); and emerging risk factors for severe outcomes among COVID-19 cases (such as raised blood pressure).
Age, sex, BMI (kg m−2) and smoking status were included. Where categorized, age groups were: 18–39, 40–49, 50–59, 60–69, 70–79 and 80+ years. BMI was ascertained from weight measurements within the last 10 years, restricted to those taken when the patient was over 16 years old. Obesity was grouped using categories derived from the WHO classification of BMI: no evidence of obesity, BMI < 30; obese class I, BMI 30–34.9; obese class II, BMI 35–39.9; and obese class III, BMI 40+. Smoking status was grouped into current-, former- and never-smokers.
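The obesity grouping above maps directly onto a small classification function. The sketch below uses a hypothetical function name and treats a missing BMI as 'no evidence of obesity', which matches the handling of missing BMI in the primary analysis.

```python
from typing import Optional


def obesity_category(bmi: Optional[float]) -> str:
    """Group BMI (kg/m^2) into the categories used in this study."""
    if bmi is None or bmi < 30:
        # Missing BMI is grouped with the non-obese, as in the primary analysis.
        return "no evidence of obesity"
    if bmi < 35:
        return "obese class I"
    if bmi < 40:
        return "obese class II"
    return "obese class III"
```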
The following comorbidities were also considered: asthma, other chronic respiratory disease, chronic heart disease, diabetes mellitus, chronic liver disease, chronic neurological diseases, common autoimmune diseases (rheumatoid arthritis, systemic lupus erythematosus or psoriasis), solid organ transplant, asplenia, other immunosuppressive conditions, cancer, evidence of reduced kidney function, and raised blood pressure or a diagnosis of hypertension.
Disease groupings followed national guidance on risk of influenza infection34; therefore, ‘chronic respiratory disease (other than asthma)’ included chronic obstructive pulmonary disease, fibrosing lung disease, bronchiectasis or cystic fibrosis; and ‘chronic heart disease’ included chronic heart failure, ischaemic heart disease, and severe valve or congenital heart disease likely to require lifelong follow-up. Chronic neurological conditions were separated into diseases with a probable cardiovascular aetiology (stroke, transient ischaemic attack, dementia) and conditions in which respiratory function may be compromised, such as motor neurone disease, myasthenia gravis, multiple sclerosis, Parkinson's disease, cerebral palsy, quadriplegia or hemiplegia and progressive cerebellar disease. Asplenia included splenectomy or a spleen dysfunction, including sickle cell disease. Other immunosuppressive conditions included human immunodeficiency virus (HIV) or a condition inducing permanent immunodeficiency ever diagnosed, or aplastic anaemia or temporary immunodeficiency recorded within the last year. Haematological malignancies were considered separately from other cancers to reflect the immunosuppression associated with haematological malignancies and their treatment. Kidney function was ascertained from the most recent serum creatinine measurement, where available, and was converted into the eGFR using the chronic kidney disease epidemiology collaboration (CKD-EPI) equation35, with reduced kidney function grouped into eGFR 30–59.9 or <30 ml min−1 per 1.73 m2. History of kidney dialysis or end-stage renal failure was separately explored in a secondary analysis. Raised blood pressure was defined as either a previous coded diagnosis of hypertension or the most recent recording indicating systolic blood pressure ≥ 140 mm Hg or diastolic blood pressure ≥ 90 mm Hg.
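The kidney function and blood pressure definitions can be sketched as follows. Several points are assumptions rather than details taken from the text: the 2009 CKD-EPI creatinine equation is used without its race coefficient, creatinine is assumed to be recorded in µmol/l (converted to mg/dl by dividing by 88.4), patients without a measurement are given their own label, and all function names are hypothetical.

```python
from typing import Optional

UMOL_L_PER_MG_DL = 88.4  # conversion factor for serum creatinine


def egfr_ckd_epi(creatinine_umol_l: float, age_years: float, female: bool) -> float:
    """eGFR (ml/min per 1.73 m^2) from the 2009 CKD-EPI creatinine equation.

    Assumption: the race coefficient of the published equation is omitted.
    """
    scr = creatinine_umol_l / UMOL_L_PER_MG_DL
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = scr / kappa
    egfr = 141 * min(ratio, 1.0) ** alpha * max(ratio, 1.0) ** -1.209 * 0.993 ** age_years
    return egfr * 1.018 if female else egfr


def kidney_function_group(egfr: Optional[float]) -> str:
    """Group eGFR into the reduced-function categories used in the study."""
    if egfr is None:
        return "no creatinine measurement"  # assumed label for missing data
    if egfr < 30:
        return "eGFR < 30"
    if egfr < 60:
        return "eGFR 30-59.9"
    return "eGFR >= 60"


def raised_blood_pressure(hypertension_coded: bool,
                          systolic: Optional[float],
                          diastolic: Optional[float]) -> bool:
    """Coded hypertension, or most recent reading >= 140 systolic or >= 90 diastolic."""
    high_systolic = systolic is not None and systolic >= 140
    high_diastolic = diastolic is not None and diastolic >= 90
    return bool(hypertension_coded or high_systolic or high_diastolic)
```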
Asthma was grouped by use of oral corticosteroids as an indication of severity. Diabetes was grouped according to the most recent HbA1c measurement within the last 15 months (HbA1c < 58 mmol mol−1; HbA1c ≥ 58 mmol mol−1; or no recent measure available). Cancer was grouped by time since the first diagnosis (within the last year; between 1 and 4.9 years ago; more than 5 years ago).
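The diabetes and cancer groupings could be derived along the following lines. This is a sketch under stated assumptions: the study start date is used as the reference date, 15 months is approximated as 457 days, years are counted as 365.25 days, and the function names and category labels are hypothetical.

```python
from datetime import date
from typing import Optional

INDEX_DATE = date(2020, 2, 1)  # assumed reference date (study start)
RECENT_HBA1C_DAYS = 457        # roughly 15 months; an approximation


def diabetes_group(has_diabetes: bool,
                   hba1c_mmol_mol: Optional[float],
                   hba1c_date: Optional[date]) -> str:
    """Group diabetes by the most recent HbA1c within about 15 months."""
    if not has_diabetes:
        return "no diabetes"
    if hba1c_mmol_mol is None or hba1c_date is None:
        return "no recent HbA1c measure"
    if (INDEX_DATE - hba1c_date).days > RECENT_HBA1C_DAYS:
        return "no recent HbA1c measure"
    return "HbA1c >= 58" if hba1c_mmol_mol >= 58 else "HbA1c < 58"


def cancer_group(first_diagnosis: Optional[date]) -> str:
    """Group cancer by time since first diagnosis."""
    if first_diagnosis is None:
        return "no cancer"
    years = (INDEX_DATE - first_diagnosis).days / 365.25
    if years < 1:
        return "within the last year"
    if years < 5:
        return "1-4.9 years ago"
    return "5 or more years ago"
```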
Other covariates that were considered as potential upstream factors were deprivation and ethnicity. Deprivation was measured by the index of multiple deprivation (IMD, in quintiles, with higher values indicating greater deprivation), derived from the patient’s postcode at lower super output area level for a high degree of precision. Ethnicity was grouped into white, Black, South Asian, mixed, or other. In sensitivity analyses, a more detailed grouping of ethnicity was explored. The Sustainability and Transformation Partnership (STP, an NHS administrative region) of the patient’s general practice was included as an additional adjustment for geographical variation in infection rates across the country.
Information on all covariates was obtained from primary care records by searching TPP SystmOne records for specific coded data. TPP SystmOne allows users to work with the SNOMED-CT clinical terminology, using a GP subset of SNOMED-CT codes. This subset maps on to the native Read version 3 (CTV3) clinical coding system on which SystmOne is built. Medicines are entered or prescribed in a format compliant with the NHS Dictionary of Medicines and Devices (dm+d)36, a local UK extension library of SNOMED. Codelists for particular underlying conditions and medicines were compiled from a variety of sources. These include British National Formulary (BNF) codes from OpenPrescribing.net, published codelists for asthma37,38,39, immunosuppression40,41,42, psoriasis43, systemic lupus erythematosus44, rheumatoid arthritis45,46 and cancer47,48, and Read Code 2 lists designed specifically to describe groups who are at increased risk of influenza infection18. Read Code 2 lists were supplemented with SNOMED codes and cross-checked against NHS Quality and Outcomes Framework (QOF) registers, then translated into CTV3 with manual curation. Decisions on every codelist were documented and the final lists were reviewed by at least two authors. Detailed information on compilation and sources for every individual codelist is available at https://codelists.opensafely.org/ and the lists are available for inspection and reuse by the broader research community.
Patient numbers are depicted in a flowchart (Fig. 1). The Kaplan–Meier failure function was estimated by age group and sex. For each patient characteristic, a Cox proportional hazards model was fitted, with days in study as the timescale, stratified by geographical area (STP), and adjusted for sex and age modelled using restricted cubic splines. Violations of the proportional hazards assumption were explored by testing for a zero slope in the scaled Schoenfeld residuals. All patient characteristics, including age (again modelled as a spline), sex, BMI, smoking, IMD quintile, and comorbidities listed above were then included in a single multivariable Cox proportional hazards model, stratified by STP. Hazard ratios from the age-and-sex adjusted and fully adjusted models are reported with 95% confidence intervals. Models were also refitted with age group fitted as a categorical variable to obtain hazard ratios by age group.
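To make the first estimation step concrete, the sketch below computes the Kaplan–Meier failure function (one minus the survival estimate) from follow-up times and event indicators. It covers only that step, not the stratified Cox models, and the function name km_failure is hypothetical; in the study this would be applied separately within each age group and sex.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def km_failure(times: Sequence[int], events: Sequence[int]) -> List[Tuple[int, float]]:
    """Kaplan-Meier failure function, 1 - S(t), at each distinct event time.

    times  : days in study for each patient
    events : 1 if follow-up ended with the outcome, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)  # everyone leaving the risk set at time t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, 1 - survival))
        at_risk -= exits[t]
    return curve
```

Patients censored at a given time are removed from the risk set only after events at that time have been counted, which is the usual convention.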
In the primary analysis, those with missing BMI were assumed to be non-obese and those with missing smoking information were assumed to be non-smokers on the assumption that both obesity and smoking would be likely to be recorded if present. A sensitivity analysis was run among those with complete BMI and smoking data only. Ethnicity was omitted from the main multivariable model owing to data being missing for 26% of individuals; hazard ratios for ethnicity were therefore obtained from a separate model among individuals with complete ethnicity data only. Hazard ratios for other patient characteristics, adjusted for ethnicity, were also obtained from this model and are presented in the sensitivity analyses to allow assessment of whether estimates were distorted by ethnicity in the primary model. We conducted an additional sensitivity analysis using a population-calibrated imputation approach to handle missing ethnicity49,50, with marginal proportions of each ethnicity group within each of nine broad geographical regions of England (East, East Midlands, London, North East, North West, South East, South West, West Midlands, Yorkshire and The Humber) taken from Annual Population Survey (APS) data (pooled 2014–2016)51. Five imputed datasets were created with estimated hazard ratios combined using Rubin’s rules.
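Rubin's rules, used to combine estimates across the imputed datasets, can be sketched as follows for a single log hazard ratio. The function names are hypothetical, the input values in the usage note are invented for illustration, and the confidence interval uses a normal approximation rather than the t reference distribution that Rubin's rules formally prescribe.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(log_hrs: Sequence[float], std_errors: Sequence[float]) -> Tuple[float, float]:
    """Pool per-imputation log hazard ratios; return (estimate, standard error)."""
    m = len(log_hrs)
    pooled = mean(log_hrs)
    within = mean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = variance(log_hrs)                    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


def hazard_ratio_ci(pooled: float, se: float, z: float = 1.96) -> Tuple[float, float, float]:
    """Back-transform to a hazard ratio with an approximate 95% interval."""
    return math.exp(pooled), math.exp(pooled - z * se), math.exp(pooled + z * se)
```

With illustrative log hazard ratios of 0.4, 0.5 and 0.6, each with standard error 0.1, the pooled estimate is 0.5 with a standard error of about 0.153, larger than any single-imputation standard error because it also reflects variation between imputations.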
The C-statistic was calculated as a measure of model discrimination. Owing to the computational time required, this was estimated by randomly sampling 5,000 patients with and without the outcome and calculating the C-statistic in the random sample; this was repeated 10 times and the resulting C-statistics were averaged. Weights were applied to account for the sampling56.
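One way such a sampled, weighted C-statistic could be computed is sketched below. This is not the study code. It assumes that equal-sized samples are drawn from patients with and without the outcome, that each sampled patient is weighted by the inverse of their group's sampling fraction, and that each usable pair contributes the product of the two weights to a Harrell-type concordance. All names and the simulated cohort are hypothetical.

```python
import random


def weighted_c_statistic(times, events, scores, weights):
    """Harrell-type concordance; each usable pair counts with weight w_i * w_j.

    A pair (i, j) is usable when i had the event strictly before j's
    follow-up ended. It is concordant when i has the higher risk score;
    tied scores count as half.
    """
    concordant = 0.0
    usable = 0.0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                w = weights[i] * weights[j]
                usable += w
                if scores[i] > scores[j]:
                    concordant += w
                elif scores[i] == scores[j]:
                    concordant += 0.5 * w
    return concordant / usable


def sampled_c_statistic(cohort, n_per_group, repeats, seed=0):
    """Average the weighted C-statistic over repeated case/non-case samples.

    cohort: list of (time, event, score) tuples.
    Assumption: weights are inverse sampling fractions within each group.
    """
    rng = random.Random(seed)
    cases = [p for p in cohort if p[1]]
    non_cases = [p for p in cohort if not p[1]]
    results = []
    for _ in range(repeats):
        s_cases = rng.sample(cases, min(n_per_group, len(cases)))
        s_non = rng.sample(non_cases, min(n_per_group, len(non_cases)))
        w_case = len(cases) / len(s_cases)
        w_non = len(non_cases) / len(s_non)
        weights = [w_case] * len(s_cases) + [w_non] * len(s_non)
        times, events, scores = zip(*(s_cases + s_non))
        results.append(weighted_c_statistic(times, events, scores, weights))
    return sum(results) / len(results)


# Simulated toy cohort; the paper sampled 5,000 per draw and repeated 10 times.
demo_rng = random.Random(1)
toy_cohort = []
for _ in range(300):
    score = demo_rng.gauss(0, 1)
    died = demo_rng.random() < (0.3 if score > 0 else 0.1)
    time = demo_rng.randint(1, 90) if died else 91  # day 91 = end of follow-up
    toy_cohort.append((time, int(died), score))

print(round(sampled_c_statistic(toy_cohort, n_per_group=20, repeats=10), 3))
```

The pairwise loop is quadratic in the sample size, which is why the calculation is run on samples rather than on the full cohort.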
All P values presented are two-sided.
Information governance and ethics
NHS England is the data controller; TPP is the data processor; and the key researchers on OpenSAFELY are acting on behalf of NHS England. This implementation of OpenSAFELY is hosted within the TPP environment, which is accredited to the ISO 27001 information security standard and is NHS IG Toolkit compliant52,53; patient data have been pseudonymized for analysis and linkage using industry standard cryptographic hashing techniques; all pseudonymized datasets transmitted for linkage onto OpenSAFELY are encrypted; access to the platform is through a virtual private network (VPN) connection, restricted to a small group of researchers, their specific machine and IP address; the researchers hold contracts with NHS England and only access the platform to initiate database queries and statistical models; all database activity is logged; and only aggregate statistical outputs leave the platform environment following best practice for anonymization of results such as statistical disclosure control for low cell counts54. The OpenSAFELY research platform adheres to the data protection principles of the UK Data Protection Act 2018 and the EU General Data Protection Regulation (GDPR) 2016. In March 2020, the Secretary of State for Health and Social Care used powers under the UK Health Service (Control of Patient Information) Regulations 2002 (COPI) to require organizations to process confidential patient information for the purposes of protecting public health, providing healthcare services to the public and monitoring and managing the COVID-19 outbreak and incidents of exposure55. Together, these provide the legal bases to link patient datasets on the OpenSAFELY platform. GP practices, from which the primary care data are obtained, are required to share relevant health information to support the public health response to the pandemic, and have been informed of the OpenSAFELY analytics platform. 
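To illustrate how pseudonymized linkage by cryptographic hashing can work in principle, the sketch below hashes identifiers with a keyed SHA-256 digest (HMAC) so that two datasets can be joined without exposing the original identifier. The specific algorithm, the key handling, the normalization step and every identifier shown are assumptions made for illustration; the text above specifies only that industry standard cryptographic hashing techniques were used.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of a normalized identifier.

    Assumption: identifiers are normalized by stripping spaces and
    upper-casing, so that formatting differences do not break linkage.
    """
    normalized = identifier.replace(" ", "").upper()
    return hmac.new(secret_key, normalized.encode("utf-8"),
                    hashlib.sha256).hexdigest()


# Hypothetical key and records; a real key would be generated and stored securely.
SECRET_KEY = b"example-key-not-for-real-use"

primary_care = {"ID 0001": {"age": 71}, "ID 0002": {"age": 54}}
death_records = ["id0001"]  # same person as "ID 0001", formatted differently

# Each source is pseudonymized separately with the same key...
pc_pseudo = {pseudonymize(k, SECRET_KEY): v for k, v in primary_care.items()}
death_pseudo = {pseudonymize(k, SECRET_KEY) for k in death_records}

# ...and the datasets are then linked on the pseudonym alone.
linked = [dict(record, covid_death=(pid in death_pseudo))
          for pid, record in pc_pseudo.items()]
print(linked)
```

Using a keyed hash rather than a plain one means that someone who knows the identifier format cannot rebuild the mapping simply by hashing every possible identifier, provided that the key stays secret.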
This study was approved by the Health Research Authority (REC reference 20/LO/0651) and by the London School of Hygiene and Tropical Medicine (LSHTM) ethics board (reference 21863). No further ethical or research governance approval was required by the University of Oxford, but copies of the approval documents were reviewed and held on record. Guarantors: B.G. and L.S.
Patient and public involvement
Patients were not formally involved in developing this specific study design. We have developed a publicly available website (https://opensafely.org/) that allows any patient or member of the public to contact us regarding this study or the broader OpenSAFELY project. This feedback will be used to refine and prioritize our OpenSAFELY activities.
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.
All data were linked, stored and analysed securely within the OpenSAFELY platform (https://opensafely.org/). Detailed pseudonymized patient data are potentially reidentifiable and therefore not shared. We rapidly delivered the OpenSAFELY data analysis platform without prior funding to deliver timely analyses on urgent research questions in the context of the global COVID-19 health emergency: now that the platform is established we are developing a formal process for external users to request access in collaboration with NHS England. Details of this process will be published shortly on the OpenSAFELY website.
Data management was performed using Python 3.8 and SQL, with analysis carried out using Stata 16.1 and Python. All code is shared openly for review and reuse under an MIT open license. All code for data management and analysis is archived online at https://github.com/opensafely/risk-factors-research. All clinical and medicines codelists are openly available for inspection and reuse at https://codelists.opensafely.org/.
WHO. COVID-19 situation reports. https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports (2020).
WHO. WHO Director-General’s opening remarks at the media briefing on COVID-19 – 11 March 2020. https://www.who.int/dg/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing-on-covid-19---11-march-2020 (2020).
Department of Social Care and Public Health England. Number of coronavirus (COVID-19) cases and risk in the UK. https://web.archive.org/web/20200501084711/https://www.gov.uk/guidance/coronavirus-covid-19-information-for-the-public (2020).
NHS England. COVID-19 daily deaths. https://web.archive.org/web/20200501094237/https://www.england.nhs.uk/statistics/statistical-work-areas/covid-19-daily-deaths/ (2020).
Deng, G., Yin, M., Chen, X. & Zeng, F. Clinical determinants for fatality of 44,672 patients with COVID-19. Crit. Care 24, 179 (2020).
Docherty, A. B. et al. Features of 16,749 hospitalised UK patients with COVID-19 using the ISARIC WHO clinical characterisation protocol. Preprint at medRxiv https://doi.org/10.1101/2020.04.23.20076042 (2020).
Simonnet, A. et al. High prevalence of obesity in severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) requiring invasive mechanical ventilation. Obesity 28, 1195–1199 (2020).
Lighter, J. et al. Obesity in patients younger than 60 years is a risk factor for COVID-19 hospital admission. Clin. Infect. Dis. 2020, ciaa415 (2020).
Simons, D., Shahab, L., Brown, J. & Perski, O. The association of smoking status with SARS-CoV-2 infection, hospitalisation and mortality from COVID-19: a living rapid evidence review. Qeios https://doi.org/10.32388/UJR2AW.2 (2020).
Guan, W.-J. et al. Clinical characteristics of coronavirus disease 2019 in China. N. Engl. J. Med. 382, 1708–1720 (2020).
Miyara, M. et al. Low incidence of daily active tobacco smoking in patients with symptomatic COVID-19. Qeios https://doi.org/10.32388/WPP19W.3 (2020).
Khunti, K., Singh, A. K., Pareek, M. & Hanif, W. Is ethnicity linked to incidence or outcomes of covid-19? Br. Med. J. 369, m1548 (2020).
Institute for Fiscal Studies. Are some ethnic groups more vulnerable to COVID-19 than others? https://web.archive.org/web/20200502130148/https://www.ifs.org.uk/inequality/chapter/are-some-ethnic-groups-more-vulnerable-to-covid-19-than-others/ (2020).
Public Health England. Disparities in the risk and outcomes of COVID-19. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/889195/disparities_review.pdf (2020).
Intensive Care National Audit and Research Centre. COVID-19 report. https://web.archive.org/web/20200425133758/https://www.icnarc.org/Our-Audit/Audits/Cmp/Reports (2020).
Office for National Statistics. Deaths registered weekly in England and Wales, provisional: week ending 17 April 2020. https://web.archive.org/web/20200430191844/https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/bulletins/deathsregisteredweeklyinenglandandwalesprovisional/weekending17april2020 (2020).
Halpin, D. M. G., Faner, R., Sibila, O., Badia, J. R. & Agusti, A. Do chronic respiratory diseases or their treatment affect the risk of SARS-CoV-2 infection? Lancet Respir. Med. 8, 436–438 (2020).
Boddington, N. L. et al. COVID-19 in Great Britain: epidemiological and clinical characteristics of the first few hundred (FF100) cases: a descriptive case series and case control analysis. Preprint at medRxiv https://doi.org/10.1101/2020.05.18.20086157 (2020).
Rentsch, C. T. et al. Covid-19 testing, hospital admission, and intensive care among 2,026,227 United States veterans aged 54–75 years. Preprint at medRxiv https://doi.org/10.1101/2020.04.09.20059964 (2020).
Farsalinos, K., Barbouni, A. & Niaura, R. Smoking, vaping and hospitalization for COVID-19. Qeios https://doi.org/10.32388/Z69O8A.13 (2020).
MacKenna, B. et al. Impact of electronic health record interface design on unsafe prescribing of ciclosporin, tacrolimus and diltiazem: a cohort study in English NHS primary care. J. Med. Internet Res. https://doi.org/10.2196/17003 (2020).
Opondo, D. et al. Quality of co-prescribing NSAID and gastroprotective medications for elders in The Netherlands and its association with the electronic medical record. PLoS One 10, e0129515 (2015).
MacKenna, B. Ghost branded generics: why does the cost of generic atorvastatin vary? EBM DataLab https://web.archive.org/web/20200502135915/https://ebmdatalab.net/ghost-branded-generics-why-does-the-cost-of-generic-atorvastatin-vary%ef%bb%bf/ (2018).
Mathur, R. et al. Completeness and usability of ethnicity data in UK-based primary care and hospital databases. J. Public Health 36, 684–692 (2014).
Public Health England. Guidance on shielding and protecting people who are clinically extremely vulnerable from COVID-19. https://web.archive.org/web/20200501090127/https://www.gov.uk/government/publications/guidance-on-shielding-and-protecting-extremely-vulnerable-persons-from-covid-19/guidance-on-shielding-and-protecting-extremely-vulnerable-persons-from-covid-19 (2020).
Marmot, M., Friel, S., Bell, R., Houweling, T. A. & Taylor, S. Closing the gap in a generation: health equity through action on the social determinants of health. Lancet 372, 1661–1669 (2008).
UK Government. Coronavirus (COVID-19) cases in the UK. https://web.archive.org/web/20200502045059/https://coronavirus.data.gov.uk/ (2020).
NHS Digital. GP systems of choice. https://digital.nhs.uk/services/gp-systems-of-choice (2020).
NHS Digital. Future GP IT systems and services. https://digital.nhs.uk/services/future-gp-it-systems-and-services (2020).
Clegg, A. et al. Development and validation of an electronic frailty index using routine primary care electronic health record data. Age Ageing 45, 353–360 (2016).
Harcourt, S. et al. Estimating primary care attendance rates for fever in infants after meningococcal B vaccination in England using national syndromic surveillance data. Vaccine 36, 565–571 (2018).
Lewis, J. D., Bilker, W. B., Weinstein, R. B. & Strom, B. L. The relationship between time since registration and measured incidence rates in the General Practice Research Database. Pharmacoepidemiol. Drug Saf. 14, 443–451 (2005).
Public Health England. Guidance on social distancing for everyone in the UK. https://web.archive.org/web/20200429043059/https://www.gov.uk/government/publications/covid-19-guidance-on-social-distancing-and-for-vulnerable-people/guidance-on-social-distancing-for-everyone-in-the-uk-and-protecting-older-people-and-vulnerable-adults (2020).
Public Health England. UK immunisation schedule: the green book, chapter 11. https://www.gov.uk/government/publications/immunisation-schedule-the-green-book-chapter-11 (2013).
Levey, A. S. et al. A new equation to estimate glomerular filtration rate. Ann. Intern. Med. 150, 604–612 (2009).
MacKenna, B. What is the dm+d? The NHS Dictionary of Medicines and Devices. EBM DataLab https://web.archive.org/web/20200502143707/https://ebmdatalab.net/what-is-the-dmd-the-nhs-dictionary-of-medicines-and-devices/ (2019).
Nissen, F. et al. Validation of asthma recording in the Clinical Practice Research Datalink (CPRD). BMJ Open 7, e017474 (2017).
Morton, C. & Douglas, I. OpenSAFELY codelists: asthma diagnosis. https://codelists.opensafely.org/codelist/opensafely/asthma-diagnosis/ (2020).
MacKenna, B. & Douglas, I. OpenSAFELY codelists: asthma oral prednisolone medication. https://codelists.opensafely.org/codelist/opensafely/asthma-oral-prednisolone-medication/ (2020).
Grint, D. J. et al. Safety of inadvertent administration of live zoster vaccine to immunosuppressed individuals in a UK-based observational cohort analysis. BMJ Open 10, e034886 (2020).
McDonald, H. & Smeeth, L. OpenSAFELY codelists: permanent immunosuppression. https://codelists.opensafely.org/codelist/opensafely/permanent-immunosuppression/ (2020).
Smeeth, L. & McDonald, H. OpenSAFELY codelists: temporary immunosuppression. https://codelists.opensafely.org/codelist/opensafely/temporary-immunosuppression/ (2020).
Wong, A., Schmidt, S. A. J. & Langan, S. Clinical code list – psoriasis – read codes [Data collection]. https://doi.org/10.17037/DATA.00001255 (London School of Hygiene and Tropical Medicine, 2019).
Forbes, H. et al. Clinical code list – SLE codes [Data collection]. https://doi.org/10.17037/DATA.162 (London School of Hygiene and Tropical Medicine, 2014).
Pujades-Rodriguez, M. et al. Rheumatoid arthritis and incidence of twelve initial presentations of cardiovascular disease: a population record-linkage cohort study in England. PLoS One 11, e0151245 (2016).
Morton, C. & Tomlinson, L. OpenSAFELY codelists: RA/SLE/psoriasis. https://codelists.opensafely.org/codelist/opensafely/ra-sle-psoriasis/ (2020).
Strongman, H. et al. Medium and long-term risks of specific cardiovascular diseases in survivors of 20 adult cancers: a population-based cohort study using multiple linked UK electronic health records databases. Lancet 394, 1041–1054 (2019).
Morton, C. & Walker, A. OpenSAFELY codelists: cancer excluding lung and haematological. https://codelists.opensafely.org/codelist/opensafely/cancer-excluding-lung-and-haematological/ (2020).
Carpenter, J. R. & Kenward, M. G. Multiple Imputation and its Application (John Wiley & Sons, 2012).
Pham, T. M., Carpenter, J. R., Morris, T. P., Wood, A. M. & Petersen, I. Population-calibrated multiple imputation for a binary/categorical covariate in categorical regression models. Stat. Med. 38, 792–808 (2019).
Office for National Statistics. Population characteristics research tables. https://web.archive.org/web/20200513113451/https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/populationcharacteristicsresearchtables (2019).
NHS Digital. BETA – data security standards. https://digital.nhs.uk/about-nhs-digital/our-work/nhs-digital-data-and-technology-standards/framework/beta---data-security-standards (2020).
NHS Digital. Data security and protection toolkit. https://digital.nhs.uk/data-and-information/looking-after-information/data-security-and-information-governance/data-security-and-protection-toolkit (2018).
NHS Digital. ISB1523: Anonymisation standard for publishing health and social care data. https://digital.nhs.uk/data-and-information/information-standards/information-standards-and-data-collections-including-extractions/publications-and-notifications/standards-and-collections/isb1523-anonymisation-standard-for-publishing-health-and-social-care-data (2019).
Department of Health and Social Care. Coronavirus (COVID-19): notification to organisations to share information. https://web.archive.org/web/20200421171727/https://www.gov.uk/government/publications/coronavirus-covid-19-notification-of-data-controllers-to-share-information (2020).
Sanderson, J., Thompson, S. G., White, I. R., Aspelund, T. & Pennells, L. Derivation and assessment of risk prediction models using case-cohort data. BMC Med. Res. Methodol. 13, 113 (2013).
All authors are from The OpenSAFELY Collaborative. We are grateful for all the support received from the TPP Technical Operations team throughout this work; for assistance from the information governance and database teams at NHS England and NHSX; and for additional discussions on disease characterization, codelists and methodology with H. Drysdale, B. Nicholson, N. DeVito, W. Hulme, I. Lipska, J. Morley, J. Quint and T. Pham. No dedicated funding has yet been obtained for this work. TPP provided technical expertise and infrastructure within their data centre pro bono in the context of a national emergency. The work of B.G. on better use of data in healthcare more broadly is currently funded in part by: the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre, NIHR Applied Research Collaboration Oxford and Thames Valley, the Mohn-Westlake Foundation, NHS England and the Health Foundation; all DataLab staff are supported by the grants of B.G. for this work. L.S. reports grants from Wellcome, MRC, NIHR, UKRI, British Council, GSK, British Heart Foundation and Diabetes UK outside this work; K.B. holds a Sir Henry Dale fellowship jointly funded by Wellcome and the Royal Society; H.I.M. is funded by the NIHR Health Protection Research Unit in Immunisation (a partnership between Public Health England and LSHTM); A.Y.S.W. holds a fellowship from BHF; R.M. holds a Sir Henry Wellcome fellowship funded by the Wellcome Trust; E.J.W. holds grants from MRC; R.G. holds grants from NIHR and MRC; I.J.D. holds grants from NIHR and GSK; and H.F. holds a UKRI fellowship. The views expressed are those of the authors and not necessarily those of the NIHR, NHS England, Public Health England or the Department of Health and Social Care. The funders had no role in the study design; the collection, analysis and interpretation of data; the writing of the report; and the decision to submit the article for publication.
All authors have completed the International Committee of Medical Journal Editors (ICMJE) uniform disclosure form at www.icmje.org/coi_disclosure.pdf. C.B., J.P., F.H., J.C. and S.H. are employees of TPP. A.M. was interim Chief Medical Officer of NHS Digital April–Sept 2019 (left NHS Digital at the end of January 2020) and Digital Clinical Champion NHS England 2014–2015. All other authors have no competing interests.
Peer review information Nature thanks David Christiani, Jeffrey Morris and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
From the primary fully adjusted model containing a four-knot cubic spline for age, and adjusted for all covariates listed in Table 2 except for ethnicity.
Overview of the architecture of the OpenSAFELY platform. EHR, electronic health record.
Cite this article
Williamson, E.J., Walker, A.J., Bhaskaran, K. et al. Factors associated with COVID-19-related death using OpenSAFELY. Nature (2020). https://doi.org/10.1038/s41586-020-2521-4