OpenSAFELY: factors associated with COVID-19 death in 17 million patients

Summary COVID-19 has rapidly impacted on mortality worldwide.1 There is unprecedented urgency to understand who is most at risk of severe outcomes, requiring new approaches for timely analysis of large datasets. Working on behalf of NHS England we created OpenSAFELY: a secure health analytics platform covering 40% of all patients in England, holding patient data within the existing data centre of a major primary care electronic health records vendor. Primary care records of 17,278,392 adults were pseudonymously linked to 10,926 COVID-19 related deaths. COVID-19 related death was associated with: being male (hazard ratio 1.59, 95%CI 1.53-1.65); older age and deprivation (both with a strong gradient); diabetes; severe asthma; and various other medical conditions. Compared to people with white ethnicity, black and South Asian people were at higher risk even after adjustment for other factors (HR 1.48, 1.29-1.69 and 1.45, 1.32-1.58 respectively). We have quantified a range of clinical risk factors for COVID-19 related death in the largest cohort study conducted by any country to date. OpenSAFELY is rapidly adding further patients’ records; we will update and extend results regularly.


Introduction
On March 11th 2020, the World Health Organisation characterised COVID-19 as a pandemic after 118,000 cases and 4,291 deaths were reported in 114 countries. 2 As of 6 May (the date of latest data availability for this study), cases reached over 3.5 million globally, with more than 240,000 deaths attributed to the virus. 1 On the same day in the UK, there were 206,715 confirmed cases, with 30,615 deaths. 3 Age and gender are well-established risk factors for severe COVID-19 outcomes, with over 90% of UK deaths being in people over 60, and 60% in men. 4 Various pre-existing conditions have also been associated with increased risk. For example, the Chinese Center for Disease Control and Prevention (44,672 patients, 1,023 deaths) reported cardiovascular disease, hypertension, diabetes, respiratory disease, and cancers as associated with increased risk of death, 5 but correction for relationships with age was not possible. A UK cross-sectional survey describing 16,749 patients hospitalised with COVID-19 showed higher risk of death for patients with cardiac, pulmonary and kidney disease, as well as malignancy, dementia and obesity (hazard ratios 1.19-1.39 after age and sex correction). 6 Obesity was associated with treatment escalation in a French ITU cohort (n=124) and a New York hospital presentation cohort (n=3615). 7,8 Risks associated with smoking are unclear. 9,10,11 People from black and minority ethnic (BME) groups are at increased risk of poor outcomes from COVID-19, for reasons that are unclear. 12,13 Patient care is typically managed through electronic health records (EHR), which are commonly used in research. However, traditional approaches to EHR analysis rely on intermittent extracts of small samples of historic data. Evaluating a rapidly arising novel cause of death requires a new approach.
We therefore set out to deliver a secure analytics platform inside the data centre of major electronic health records vendors, running across the full live linked pseudonymised electronic health records of a very large population of NHS patients, to determine factors associated with COVID-19 related death in England (referred to as "death" in text that follows). 17,278,392 adults were included (Figure 1; cohort description in Table 1). 1,851,868 (11%) individuals had non-white ethnicities recorded. There were missing data for body mass index (3,751,769, 22%), smoking status (720,923, 4%), ethnicity (4,560,113, 26%), and blood pressure (1,715,095, 10%). 10,926 of the study population had COVID-19 related death recorded in linked death registration data.

Results
The overall cumulative incidence of death 90 days after study start was <0.01% in those aged 18-39 years, rising to 0.67% and 0.44% in men and women respectively aged ≥80 years (Figure 2).
Associations between patient-level factors and risk of death are shown in Table 2 and Figure 3. Increasing age was strongly associated with risk, with those ≥80 years having a more than 20-fold higher risk than 50-59 year olds (fully adjusted HR 20.60; 95% CI 18.70-22.68). With age fitted as a flexible spline, an approximately log-linear relationship was observed (Extended Data Figure 1). Men had higher risk than women (fully adjusted HR 1.59, 1.53-1.65). These findings are consistent with patterns observed in smaller studies worldwide and in the UK. 14 All non-white ethnic groups had higher risk than those with white ethnicity: HRs adjusted for age and sex only ranged from 1.62-1.88 for Black, South Asian and mixed ethnicities compared to white; attenuated to 1.43-1.48 on adjustment for all included risk factors (results for more detailed categories are shown in Extended Data Table 1). Non-white ethnicity has previously been found to be associated with increased COVID-19 infection and poor outcomes. 12,13,15 Our findings show that only a small part of the excess risk is explained by higher prevalence of medical problems such as cardiovascular disease or diabetes among BME people, or higher deprivation.
We found a consistent pattern of increasing risk with greater deprivation, with the most deprived quintile having a HR of 1.79 compared to the least deprived, consistent with recent national statistics. 16 Again, very little of this increased risk was explained by pre-existing disease or clinical risk factors, suggesting that other social factors may have an important role.
(defined as asthma with recent use of an oral corticosteroid), respiratory disease, chronic heart disease, liver disease, stroke/dementia, other neurological diseases, reduced kidney function (with greater HR for lower estimated glomerular filtration rate), autoimmune diseases (rheumatoid arthritis, lupus or psoriasis) and other immunosuppressive conditions, as per Table 2. Those with a recent (<5 years) history of haematological malignancy had a ≥2.5-fold increased risk, decreasing slightly after 5 years. For other cancers, increased HRs were smaller and mainly with recent diagnoses. History of dialysis or end-stage renal failure was associated with increased risk when added in a secondary analysis (HR 3.69, 3.09-4.39). These findings largely concurred with other data including the UK ISARIC study of hospitalised UK patients with COVID-19 that indicated increased risk of death with cardiac, pulmonary and kidney disease, malignancy, obesity and dementia, 6 and a large Chinese study which, though lacking age correction, suggested cardiovascular disease, hypertension, diabetes, respiratory disease, and cancers to be associated with increased mortality. 5 Our findings that severe asthma was associated with higher risk were notable since early data suggested underrepresentation of asthma in patients hospitalised or with severe COVID-19 outcomes, 17,18

Post-hoc analyses: smoking and hypertension
Both current and former smoking were associated with higher risk in models adjusted for age and sex only, but in the fully adjusted model current smoking was associated with a lower risk (fully adjusted HR 0.89, CI 0.82-0.97), concurring with lower than expected smoking prevalences in previous studies among hospitalised patients in China, 10 France 11 and the USA. 19 We further explored this post-hoc by adding covariates individually to the age, sex and smoking model, and found the change in HR to be largely driven by adjustment for chronic respiratory disease (HR 0.98, 0.90-1.06 after adjustment). This and other comorbidities could be consequences of smoking, highlighting that the fully adjusted smoking HR cannot be interpreted causally due to the inclusion of factors likely to mediate smoking effects. We therefore then fitted a model adjusted for demographic factors only (age, sex, deprivation, ethnicity), which showed a non-significant positive HR for current smoking (HR 1.07, 0.98-1.18). This does not support any postulated protective effect of nicotine 9,20 but suggests that any increased risk with current smoking is likely to be small, and will need to be clarified as the epidemic progresses and more data accumulate.
We similarly explored the change in the hypertension HR (from 1.09, 1.05-1.14 adjusted for age and sex to 0.89, 0.85-0.93 with all covariates included), and found diabetes and obesity to be principally responsible for this reduction (HR 0.97, 0.92-1.01 adjusted for age, sex, diabetes, obesity). Given the strong association between blood pressure and age we then examined an interaction between these variables; this revealed strong evidence of interaction (p<0.001) with hypertension associated with higher risk up to age 70 years and lower risk at older ages (adjusted HRs 3. 10 40-<50, 50-<60, 60-<70, 70-<80 and >80 respectively). The reasons for the inverse association between hypertension and mortality in older individuals are unclear and warrant further investigation including detailed examination by frailty, comorbidity and drug exposures in this age group.

Model checking and sensitivity analyses
The average C-statistic was 0.78. Results were similar when missing data were handled using analysis of complete records only, or using multiple imputation (sensitivity analyses: Extended Data Table 2). Non-proportional hazards were detected in the primary model (p<0.001). A sensitivity analysis with earlier administrative censoring at 6th April 2020, before which mortality should not have been affected by UK social distancing policies introduced in late March, showed no evidence of non-proportional hazards (p=0.83). HRs were similar but somewhat larger in magnitude for some covariates, while the association with increasing deprivation appeared to be smaller (Extended Data Table 2).

Summary
This secure analytics platform operating across over 23 million patient records for the COVID-19 emergency was used to identify, quantify, and explore risk factors for COVID-19 related death in the largest cohort study conducted by any country to date. Most comorbidities were associated with increased risk, including cardiovascular disease, diabetes, respiratory disease including severe asthma, obesity, history of haematological malignancy or recent other cancer, kidney, liver, neurological and autoimmune conditions. People from South Asian and black groups had a substantially higher risk of death, only partially attributable to co-morbidity, deprivation or other risk factors. A strong association between deprivation and risk was only partly attributable to co-morbidity or other risk factors.
These analyses provide a preliminary picture of how key demographic characteristics and a range of comorbidities, a priori selected as being of interest in COVID-19, are jointly associated with poor outcomes. These initial results may be used subsequently to inform the development of prognostic models. We caution against interpreting our estimates as causal effects. For example, the fully adjusted smoking hazard ratio does not capture the causal effect of smoking due to the inclusion of comorbidities which are likely to mediate any effect of smoking on COVID-19 death (e.g. COPD). Our study has highlighted a need for carefully designed causal analyses specifically focusing on the causal effect of smoking on COVID-19 death. Similarly, there is a need for analyses exploring the causal relationships underlying the associations observed between hypertension and COVID-19 death.

Strengths and weaknesses
The greatest strengths of this study were speed and size. By building a secure analytics platform across routinely collected live clinical data stored in situ we have produced timely results from the current records of approximately 40% of the English population. This scale allows more precision, on rarer exposures, on multiple risk factors, and rapid detection of important signals. Our platform will expand to provide updated analyses over time. Another strength is our use of open methods: we pre-specified our analysis plan and shared our full analytic code and code lists for review and re-use. We ascertained demographics, medications and co-morbidities from full pseudonymised longitudinal primary care records, providing substantially more detail than data recorded on admission, and on the total population rather than the selected subset presenting at hospital. We censored deaths from other causes using ONS data. Analyses were stratified by area to account for known geographical differences in incidence of COVID-19.
We also identify important limitations. In our outcome definition, we included clinically suspected (non-laboratory-confirmed) COVID-19, because testing has not always been carried out, especially in older patients in care homes. However, this may have incorrectly identified some patients as having COVID-19. Some COVID-19 deaths may have been misclassified as non-COVID-19, particularly in the early stages of the pandemic, though this is likely to have reduced quickly as deaths accumulated. A degree of outcome under-ascertainment, provided it was unrelated to patient characteristics, should not have biased our hazard ratios. Due to the rarity of the outcome, the associations observed will be driven primarily by the profile of risk factors in the included cases. Our findings reflect both an individual's risk of infection, and their risk of dying once infected. We will explore more detailed patient trajectories in future research within the OpenSAFELY platform.
Our large population may not be fully representative. We include only 17% of general practices in London, where many earlier COVID-19 cases occurred, due to the substantial geographic variation in choice of EHR system. The user interface of electronic health records can affect prescribing of certain medicines, 21-23 so it is possible that coding may vary between systems.
Primary care records, though detailed and longitudinal, can be incomplete for data on risk factors and other covariates. Ethnicity was missing for approximately 26%, but was broadly representative; 24 there were also missing data on obesity and smoking. Sensitivity analyses found our estimates were robust to our assumptions around missing data.
Non-proportional hazards could be due to very large numbers or unmeasured covariates. However, rapid changes in social behaviours (social distancing, shielding) and changes in the burden of infection may also have affected patient groups differentially. The larger hazard ratios seen for several covariates in a sensitivity analysis with earlier censoring (soon after social distancing and shielding policies were introduced) are consistent with more at-risk patients being more compliant with these policies. In contrast, the risk associated with deprivation may have increased over time. Subsequent analyses will further explore changes before and after national initiatives around COVID-19.

Policy Implications and Interpretation
The UK has a policy of recommending shielding (staying at home at all times and avoiding any face-to-face contact) for groups identified as being extremely vulnerable to COVID-19 on the basis of pre-existing medical conditions. 25 We were able to evaluate the association between most of these conditions and death from COVID-19, and confirmed increased mortality risks, supporting the targeted use of additional protection measures for people in these groups. We have demonstrated, for the first time, that only a small part of the substantially increased risks of COVID-19 related death among non-white groups and among people living in more deprived areas can be attributed to existing disease. Improved strategies to protect people in these groups are urgently needed. 26 These might include specific consideration of BME groups in shielding guidelines and work-place policies. Subsequent studies are needed to investigate the interplay of additional factors we were unable to explore, including employment, access to personal protective equipment and related risk of exposure to infection, and household density.
The UK has an unusually large volume of very detailed longitudinal patient data, especially through primary care. We believe the UK has a responsibility to the global community to make good use of such data. OpenSAFELY demonstrates at an unprecedented scale that this can be done securely, transparently, and rapidly. We will enhance the OpenSAFELY platform to further inform the global response to the COVID-19 emergency.

Future Research
The underlying causes of higher risk of COVID-19 related death among those from non-white backgrounds, and deprived areas, require further exploration; we would suggest collecting data on occupational exposure and living conditions as first steps. The statistical power offered by our approach means that associations with less common risk factors can be robustly assessed in more detail, at the earliest possible date, as the pandemic progresses. We will therefore update our findings and address smaller risk groups as new cases arise over time. The open source reusable codebase on OpenSAFELY supports rapid, secure and collaborative development of new analyses: we are currently conducting expedited studies on the impact of various medical treatments and population interventions on the risk of COVID-19 infection, ITU admission, and death, alongside other observational analyses. OpenSAFELY is rapidly scalable for additional NHS patients' records, with new data sources progressing.

Conclusion
We generated early insights into risk factors for COVID-19 related death using an unprecedented scale of 17 million patients' detailed primary care records, maintaining privacy, in the context of a global health emergency.

Study design
We conducted a cohort study using national primary care electronic health record data linked to COVID-19 death data (see Data Source). The cohort study began on 1st February 2020, chosen as a date several weeks prior to the first reported COVID-19 deaths and the day after the second laboratory-confirmed case, 27 and ended on 6th May 2020. The cohort explores risk among the general population rather than in a population infected with SARS-CoV-2. Therefore, all patients were included irrespective of any SARS-CoV-2 test results.

Data Source
We used patient data from general practice (GP) records managed by the GP software provider The Phoenix Partnership (TPP), linked to Office for National Statistics (ONS) death data. ONS data include information on all deaths: COVID-19 related death, defined as a COVID-19 ICD-10 code mentioned anywhere on the death certificate, and non-COVID-19 death, which was used for censoring.
The data were accessed, linked and analysed using OpenSAFELY, a new data analytics platform created to address urgent questions relating to the epidemiology and treatment of COVID-19 in England. OpenSAFELY provides a secure software interface that allows detailed pseudonymised primary care patient records to be analysed in near real-time where they already reside, hosted within the EHR vendor's highly secure data centre, to minimise the re-identification risks when data are transported off-site; other smaller datasets are linked to these data within the same environment using a matching pseudonym derived from the NHS number. More information can be found on https://opensafely.org/.
The dataset analysed with OpenSAFELY is based on 24 million currently registered patients (approximately 40% of the English population) from GP surgeries using the TPP SystmOne electronic health record system. SystmOne is a secure centralised EHR used in English clinical practice since 1998; it records data entered (in real time) by GPs and practice staff during routine primary care. The system is accredited under the NHS approved systems framework for General Practice. 28,29 Data extracted from TPP SystmOne have previously been used in medical research, as part of the ResearchOne dataset. 30,31 From this EHR a pseudonymised dataset was created for OpenSAFELY consisting of 20 billion rows of structured data including for example pseudonymised patients' diagnoses, medications, physiological parameters, and prior investigations [Extended Data Figure 2, Level 1]. All OpenSAFELY data processing took place on TPP's servers; external data providers securely transferred pseudonymised data (such as COVID-19 related death from ONS) for linkage to OpenSAFELY [Extended Data Figure 2, Level 2]; study definitions developed in Python on GitHub were pulled into the OpenSAFELY infrastructure, and used to create a study dataset of one row per patient [Extended Data Figure 2, Level 3]. Statistical code was developed using synthetic data and used to analyse the study dataset; this included code to check data ranges, to check consistency of data columns, and to produce descriptive statistics for comparison with expected disease prevalences to ensure validity, as well as code to fit our analysis models. Only two authors (KB/AJW) accessed OpenSAFELY to run code; no pseudonymised patient-level data were ever removed from TPP infrastructure; only aggregated, anonymous, manually checked study results were released for publication [Extended Data Figure 2, Level 4]. All code for data management and analysis is archived online (see Code Availability, below).

Study Population and Observation Period
Our study population consisted of all adults (males and females 18 years and above) currently registered as active patients in a TPP general practice in England on 1st February 2020. To be included in the study, participants were required to have at least 1 year of prior follow-up in the GP practice to ensure that baseline patient characteristics could be adequately captured, and to have recorded sex, age, and deprivation (see covariates, below). 32 Patients were observed from the 1st of February 2020 and were followed until the first of either their death date (whether COVID-19 related or due to other causes) or the study end date, 6th May 2020. For this analysis, ONS death data were available to 11th May 2020, but we used an earlier censor date to allow for delays in reporting in the last few days of available data.

Europe PMC Funders Author Manuscripts
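
The inclusion and follow-up rules described in this section can be illustrated with a minimal sketch. It is not the study's own code: the `Patient` fields, the helper names, and the requirement that the patient is alive at study start are assumptions made for illustration; only the dates and criteria come from the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

STUDY_START = date(2020, 2, 1)   # from the text
STUDY_END = date(2020, 5, 6)     # from the text (censor date)


@dataclass
class Patient:
    # Hypothetical record layout, for illustration only.
    birth_date: date
    registered_since: date               # start of registration at the practice
    sex: Optional[str]
    deprivation_quintile: Optional[int]
    death_date: Optional[date] = None


def age_on(day: date, birth: date) -> int:
    """Age in completed years on a given day."""
    return day.year - birth.year - ((day.month, day.day) < (birth.month, birth.day))


def is_eligible(p: Patient) -> bool:
    """Adult, at least 1 year of prior follow-up, sex and deprivation recorded."""
    adult = age_on(STUDY_START, p.birth_date) >= 18
    one_year_prior = p.registered_since <= STUDY_START.replace(year=STUDY_START.year - 1)
    complete = p.sex is not None and p.deprivation_quintile is not None
    # Assumption: "active patient" implies alive at study start.
    alive = p.death_date is None or p.death_date >= STUDY_START
    return adult and one_year_prior and complete and alive


def follow_up(p: Patient) -> Tuple[int, bool]:
    """Days in study and whether a death (any cause) fell inside the window."""
    died = p.death_date is not None and p.death_date <= STUDY_END
    end = p.death_date if died else STUDY_END
    return (end - STUDY_START).days, died
```

Whether a death inside the window counts as the outcome or as censoring depends on the cause of death, which is handled separately from this follow-up calculation.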

Outcomes
The outcome was death among people with COVID-19, ascertained from ONS death certificate data, where the COVID related ICD-10 codes U071 or U072 were present in the record.
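As an illustration of this outcome definition, the sketch below flags a death as COVID-19 related when either code appears anywhere among the codes on a certificate. The function names and the list-of-strings input format are assumptions; the two codes are those given in the text.

```python
from typing import Iterable

# Codes named in the text, written without the dot as in the source.
COVID_ICD10_CODES = {"U071", "U072"}


def normalise_code(code: str) -> str:
    """Strip whitespace and dots so 'U07.1' and 'u071' compare equal."""
    return code.strip().replace(".", "").upper()


def is_covid_related_death(certificate_codes: Iterable[str]) -> bool:
    """True if any code on the certificate is a COVID-19 code."""
    return any(normalise_code(c) in COVID_ICD10_CODES for c in certificate_codes)


def classify_death(certificate_codes: Iterable[str]) -> str:
    """Outcome event for COVID-19 related deaths, censoring event otherwise."""
    return "outcome" if is_covid_related_death(certificate_codes) else "censored"
```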

Covariates
Potential risk factors included: health conditions listed in UK guidance on "higher risk" groups; 33 other common conditions which may cause immunodeficiency inherently or through medication (cancer and common autoimmune conditions); and emerging risk factors for severe outcomes among COVID-19 cases (such as raised blood pressure).
Age, sex, body mass index (BMI; kg/m²), and smoking status were considered as potential risk factors. Where categorised, age groups were: 18-<40, 40-<50, 50-<60, 60-<70, 70-<80, 80+ years. BMI was ascertained from weight measurements within the last 10 years, restricted to those taken when the patient was over 16 years old. Obesity was grouped using categories derived from the World Health Organisation classification of BMI: no evidence of obesity <30 kg/m²; obese I 30-34.9; obese II 35-39.9; obese III 40+. Smoking status was grouped into current, former and never smokers.
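The groupings above can be written as simple lookup functions. This is a sketch rather than the study's own code: the function names are hypothetical, the category boundaries are treated as continuous cut points (for example, 34.95 falls in obese I), and missing BMI is mapped to "no evidence of obesity" in line with the primary analysis described under Statistical Analysis.

```python
from typing import Optional


def bmi_from_measurements(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def age_group(age: int) -> str:
    """Age bands used where age is categorised."""
    if age < 18:
        raise ValueError("study population is restricted to adults")
    for upper, label in [(40, "18-<40"), (50, "40-<50"), (60, "50-<60"),
                         (70, "60-<70"), (80, "70-<80")]:
        if age < upper:
            return label
    return "80+"


def obesity_category(bmi: Optional[float]) -> str:
    """Obesity classes; missing BMI is treated as no evidence of obesity."""
    if bmi is None or bmi < 30:
        return "no evidence of obesity"
    if bmi < 35:
        return "obese I"
    if bmi < 40:
        return "obese II"
    return "obese III"
```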
The following comorbidities were also considered potential risk factors: asthma, other chronic respiratory disease, chronic heart disease, diabetes mellitus, chronic liver disease, chronic neurological diseases, common autoimmune diseases (Rheumatoid Arthritis (RA), Systemic Lupus Erythematosus (SLE) or psoriasis), solid organ transplant, asplenia, other immunosuppressive conditions, cancer, evidence of reduced kidney function, and raised blood pressure or a diagnosis of hypertension.
Disease groupings followed national guidance on risk of influenza infection, 34 therefore "chronic respiratory disease (other than asthma)" included COPD, fibrosing lung disease, bronchiectasis or cystic fibrosis; chronic heart disease included chronic heart failure, ischaemic heart disease, and severe valve or congenital heart disease likely to require lifelong follow-up. Chronic neurological conditions were separated into diseases with a likely cardiovascular aetiology (stroke, TIA, dementia) and conditions in which respiratory function may be compromised such as motor neurone disease, myasthenia gravis, multiple sclerosis, Parkinson's disease, cerebral palsy, quadriplegia or hemiplegia, and progressive cerebellar disease. Asplenia included splenectomy or a spleen dysfunction, including sickle cell disease. Other immunosuppressive conditions included HIV or a condition inducing permanent immunodeficiency ever diagnosed, or aplastic anaemia or temporary immunodeficiency recorded within the last year. Haematological malignancies were considered separately from other cancers to reflect the immunosuppression associated with haematological malignancies and their treatment. Kidney function was ascertained from the most recent serum creatinine measurement, where available, converted into estimated glomerular filtration rate (eGFR) using the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equation, 35 with reduced kidney function grouped into defined as  47,48 and Read Code 2 lists designed specifically to describe groups at increased risk of influenza infection. 18 Read Code 2 lists were added to with SNOMED codes and cross-checked against NHS QOF registers, then translated into CTV3 with manual curation. Decisions on every code list were documented and final lists reviewed by at least two authors. 
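For readers unfamiliar with the conversion from serum creatinine to eGFR, the sketch below implements the 2009 CKD-EPI creatinine equation. Several points are assumptions: the text cites the CKD-EPI equation without giving its form, so the 2009 coefficients are used here; creatinine is assumed to be recorded in µmol/L; and the ethnicity coefficient of the published equation is omitted for simplicity. The function name is hypothetical.

```python
UMOL_PER_L_PER_MG_PER_DL = 88.4  # creatinine unit conversion


def egfr_ckd_epi_2009(creatinine_umol_l: float, age_years: float, female: bool) -> float:
    """Estimated GFR (mL/min/1.73 m^2) from serum creatinine.

    Sketch of the 2009 CKD-EPI creatinine equation without the
    ethnicity coefficient (an assumption made for this illustration).
    """
    scr = creatinine_umol_l / UMOL_PER_L_PER_MG_PER_DL  # mg/dL
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = scr / kappa
    egfr = (141.0
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.209
            * 0.993 ** age_years)
    if female:
        egfr *= 1.018
    return egfr
```

Higher creatinine and older age both lower the estimate, so the same creatinine value implies poorer kidney function in an older patient.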
Detailed information on compilation and sources for every individual codelist is available at https://codelists.opensafely.org/ and the lists are available for inspection and re-use by the broader research community. 49

Statistical Analysis
Patient numbers are depicted in a flowchart. The Kaplan-Meier failure function was estimated by age group and sex. For each potential risk factor, a Cox proportional hazards model was fitted, with days in study as the timescale, stratified by geographic area (STP), and adjusted for sex and age modelled using restricted cubic splines. Violations of the proportional hazards assumption were explored by testing for a zero slope in the scaled Schoenfeld residuals. All potential risk factors, including age (again modelled as a spline), sex, BMI, smoking, index of multiple deprivation quintile, and comorbidities listed above were then included in a single multivariable Cox proportional hazards model, stratified by STP. Hazard ratios from the age/sex adjusted and fully adjusted models are reported with 95% confidence intervals. Models were also refitted with age group fitted as a categorical variable in order to obtain hazard ratios by age group.
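To make the Kaplan-Meier failure function concrete, the following sketch computes one minus the Kaplan-Meier survival estimate at each distinct event time. It is a plain-Python illustration under simple assumptions (times in days, an event flag of 1 for the outcome and 0 for censoring), not the statistical code used in the study; in practice it would be applied separately within each age group and sex.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def kaplan_meier_failure(times: Sequence[float],
                         events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, cumulative failure) at each distinct event time.

    events[i] is 1 if the outcome occurred at times[i], 0 if censored.
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)          # everyone leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, 1.0 - survival))
        at_risk -= leaving[t]
    return curve
```

Patients censored at a given time are counted as at risk for events at that same time, which is the usual convention.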
In the primary analysis, those with missing BMI were assumed non-obese and those with missing smoking information were assumed to be non-smokers on the assumption that both obesity and smoking would be likely to be recorded if present. A sensitivity analysis was run among those with complete BMI and smoking data only. Ethnicity was omitted from the main multivariable model due to 26% of individuals having missing data; hazard ratios for ethnicity were therefore obtained from a separate model among individuals with complete ethnicity only. Hazard ratios for other risk factors, adjusted for ethnicity, were also obtained from this model and are presented in the sensitivity analyses to allow assessment of whether estimates may have been distorted by ethnicity in the primary model. We conducted an additional sensitivity analysis using a population-calibrated imputation approach to handle missing ethnicity, 50 52 Five imputed datasets were created with estimated hazard ratios combined using Rubin's rules.
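Rubin's rules combine the estimates from the imputed datasets by averaging the log hazard ratios and adding the within-imputation and between-imputation variances. The sketch below shows the calculation for a single coefficient; the function name and the example inputs are hypothetical, and degrees-of-freedom corrections are left out.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(log_hrs: Sequence[float],
               std_errors: Sequence[float]) -> Tuple[float, float]:
    """Pool per-imputation log hazard ratios with Rubin's rules.

    Returns the pooled log hazard ratio and its standard error.
    """
    m = len(log_hrs)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two imputations with matching lengths")
    pooled = mean(log_hrs)
    within = mean([se ** 2 for se in std_errors])   # average sampling variance
    between = variance(log_hrs)                     # variance across imputations (m - 1 denominator)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical log hazard ratios and standard errors from five imputed datasets.
pooled_log_hr, pooled_se = pool_rubin([0.37, 0.39, 0.38, 0.40, 0.36],
                                      [0.05, 0.05, 0.05, 0.05, 0.05])
pooled_hr = math.exp(pooled_log_hr)
```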
The C-statistic was calculated as a measure of model discrimination. Due to computational time, this was estimated by randomly sampling 5000 patients with and without the outcome and calculating the C-statistic using the random sample, repeating this 10 times and taking the average C-statistic.
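The sampling scheme can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: each record is a `(time, event, risk_score)` tuple, concordance is computed with a basic version of Harrell's C that skips pairs tied on time, and the function names and the fixed random seed are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple

Record = Tuple[float, int, float]  # (follow-up time, event flag, predicted risk)


def harrell_c(records: Sequence[Record]) -> float:
    """Concordance over pairs in which the shorter follow-up ended in an event."""
    concordant = 0.0
    usable = 0
    for i, (t_i, e_i, r_i) in enumerate(records):
        for t_j, e_j, r_j in records[i + 1:]:
            if t_i == t_j:
                continue  # simplification: pairs tied on time are skipped
            if t_i < t_j:
                first_event, r_first, r_second = e_i, r_i, r_j
            else:
                first_event, r_first, r_second = e_j, r_j, r_i
            if not first_event:
                continue  # earlier time was censored, pair is not comparable
            usable += 1
            if r_first > r_second:
                concordant += 1.0
            elif r_first == r_second:
                concordant += 0.5
    if usable == 0:
        raise ValueError("no comparable pairs")
    return concordant / usable


def sampled_c_statistic(records: Sequence[Record], n_per_group: int = 5000,
                        repeats: int = 10, seed: int = 0) -> float:
    """Average C-statistic over repeated samples of cases and non-cases."""
    rng = random.Random(seed)
    cases = [r for r in records if r[1]]
    non_cases = [r for r in records if not r[1]]
    estimates: List[float] = []
    for _ in range(repeats):
        sample = (rng.sample(cases, min(n_per_group, len(cases)))
                  + rng.sample(non_cases, min(n_per_group, len(non_cases))))
        estimates.append(harrell_c(sample))
    return mean(estimates)
```

The pairwise loop is quadratic in the sample size, which is the computational cost that motivates sampling; at 5000 per group a vectorised or compiled implementation would be preferable to this pure-Python version.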
All p-values presented are two-sided.
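As a worked illustration of how a hazard ratio, its 95% confidence interval and a two-sided p-value relate to a fitted log hazard ratio and its standard error, the sketch below assumes Wald-type intervals on the log scale (the text does not state the interval type). The function names are hypothetical; the example uses the reported fully adjusted estimate for men versus women (HR 1.59, 1.53-1.65) to back out an approximate standard error.

```python
import math
from typing import Tuple


def hazard_ratio_summary(log_hr: float, se: float,
                         z_crit: float = 1.96) -> Tuple[float, float, float, float]:
    """Return (HR, lower, upper, two-sided p) under a normal approximation."""
    hr = math.exp(log_hr)
    lower = math.exp(log_hr - z_crit * se)
    upper = math.exp(log_hr + z_crit * se)
    z = log_hr / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return hr, lower, upper, p_two_sided


def se_from_ci(lower: float, upper: float, z_crit: float = 1.96) -> float:
    """Approximate standard error of the log HR implied by a reported CI."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z_crit)


# Reported estimate for men versus women: HR 1.59 (1.53-1.65).
implied_se = se_from_ci(1.53, 1.65)
hr, lower, upper, p_value = hazard_ratio_summary(math.log(1.59), implied_se)
```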

Information governance and ethics
NHS England is the data controller; TPP is the data processor; and the key researchers on OpenSAFELY are acting on behalf of NHS England. This implementation of OpenSAFELY is hosted within the TPP environment which is accredited to the ISO 27001 information security standard and is NHS IG Toolkit compliant; 53,54 patient data has been pseudonymised for analysis and linkage using industry standard cryptographic hashing techniques; all pseudonymised datasets transmitted for linkage onto OpenSAFELY are encrypted; access to the platform is via a virtual private network (VPN) connection, restricted to a small group of researchers, their specific machine and IP address; the researchers hold contracts with NHS England and only access the platform to initiate database queries and statistical models; all database activity is logged; only aggregate statistical outputs leave the platform environment following best practice for anonymisation of results such as statistical disclosure control for low cell counts. 55
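Two of the safeguards described above, pseudonymisation by cryptographic hashing and disclosure control for low cell counts, can be illustrated with a short sketch. The text does not specify the hashing scheme or the suppression threshold, so the keyed HMAC-SHA256 construction, the threshold of 5, the function names and the dummy identifier below are all assumptions made for illustration.

```python
import hashlib
import hmac
from typing import Dict, Optional


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Keyed hash of an identifier (assumed scheme: HMAC-SHA256).

    The same identifier and key always give the same pseudonym, so
    datasets hashed with a shared key can be linked without exposing
    the identifier itself.
    """
    cleaned = identifier.replace(" ", "")
    return hmac.new(secret_key, cleaned.encode("utf-8"), hashlib.sha256).hexdigest()


def suppress_small_counts(table: Dict[str, int],
                          threshold: int = 5) -> Dict[str, Optional[int]]:
    """Replace counts from 1 up to the threshold with None before release.

    The threshold is a hypothetical value, not one stated in the text.
    """
    return {cell: (None if 0 < count <= threshold else count)
            for cell, count in table.items()}


# Dummy identifier and key, for illustration only.
pseudonym = pseudonymise("943 476 5919", b"example-key")
released = suppress_small_counts({"group A": 3, "group B": 120, "group C": 0})
```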

Extended Data Figure 1. Estimated log hazard ratio by age in years
Footnote: From the primary fully adjusted model containing a 4-knot cubic spline for age, and adjusted for all covariates listed in Table 2