Main

The first case of SARS-CoV-2 infection in the United States was reported on 20 January 2020 in the state of Washington2 and to date, the pandemic has led to more than 240,000 COVID-19 deaths, making the United States by far the most affected country globally. However, there is high variation in rates of infections and underlying deaths across US states, counties and cities. Local population characteristics, such as population density3 and mobility patterns4, as well as mitigation measures5,6, define background risk of illness and death across the regions. Further, predisposing factors, including age, sex, ethnicity and racial background, social conditions and pre-existing conditions, put individuals within the same community at differential risk of serious illness and mortality7,8,9,10,11,12,13,14,15.

To date, the United States and other countries have mostly relied on community-based intervention measures, such as lockdowns, social distancing and guidance on mask wearing, for mitigating the worst effects of the pandemic. A variety of pandemic scenario models with increasing sophistication are available for forecasting future trends in infection, hospitalizations and deaths at the population level (https://www.cdc.gov/coronavirus/2019-ncov/covid-data/forecasting-us.html). Although a variety of predisposing factors are known, there has been limited effort to incorporate these factors into prevention strategies and/or forecasting models. In the future, however, as the United States and other countries continue to face increasing societal and economic pressure for relaxing some of the broad intervention measures, consideration of risk associated with predisposing factors for individuals and at the population level will be important for developing more equitable strategies for prevention16,17,18. Promising results from early phases of a number of vaccine trials (https://www.businesswire.com/news/home/20201109005539/en/, https://investors.modernatx.com/news-releases/news-release-details/modernas-covid-19-vaccine-candidate-meets-its-primary-efficacy and https://www.pfizer.com/news/press-release/press-release-detail/pfizer-and-biontech-conclude-phase-3-study-covid-19-vaccine)19,20,21,22 have raised the likelihood of available vaccines by the end of 2020 and a number of national and international bodies (https://www.csis.org/analysis/advancing-research-and-planning-equitable-distribution-covid-19-vaccine, https://www.nationalacademies.org/news/2020/07/national-academies-launch-study-on-equitable-allocation-of-a-covid-19-vaccine-first-meeting-july-24)23 have been developing frameworks that will allow equitable distribution of vaccines, taking into account differential risk for individuals and communities.

We describe the development and validation of a COVID-19 mortality risk calculator for the US adult (aged 18 years and older) population, integrating information from a variety of datasets for the estimation of risk associated with predisposing factors. We further extend the calculator to integrate information from pandemic forecasting models so that an individual’s absolute risk can be informed based not only on their underlying risk factors, but also on community-level risk due to underlying pandemic dynamics. We use the information on the prevalence and co-occurrence of risk factors from various national databases to make population-level projections of risk associated with these predisposing factors for the adult population across 477 US cities and for the 65-years-and-older population enrolled in Medicare across 3,113 US counties. We provide national-, state- and city and/or county- level estimates for the size of populations who are at or above different risk thresholds and can be gradually prioritized for vaccination and other preventive efforts. We also provide a web-based individual-level risk calculator and interactive maps for viewing population-level risk projections to facilitate future policy decisions. The main findings and limitations of the study are summarized in Table 1.

Table 1 Policy summary

Risk of mortality associated with various age groups in the United States follows a comparable pattern to that reported by the UK OpenSAFELY study15 (Extended Data Fig. 1). Relative to their respective white reference populations, African American populations in the United States (relative risk (RR) = 3.18, 95% confidence interval (CI): 3.02–3.36) are at a higher risk compared to the black population (RR = 1.88, 95% CI: 1.65–2.14) in the United Kingdom. Further, compared with non-Hispanic white people in the United States, the non-Hispanic Asian population are at an elevated risk (RR = 1.38, 95% CI: 1.25–1.53) and the Hispanic population (RR = 2.77, 95% CI: 2.60–2.94) and non-Hispanic American Indian/Alaskan Native population (RR = 1.72, 95% CI: 1.23–2.41) are at a substantially elevated risk. External covariate adjustment indicates that accounting for other risk factors, including social deprivation and pre-existing conditions, such as obesity, which are more prevalent in various minority groups than in white populations, only explains a small fraction (<16.0%) of the racial differences in mortality rates (Supplementary Table 1). Adjusted estimates associated with age and racial groups, together with estimates of risk associated with the other factors from an underlying fully adjusted model reported in the UK OpenSAFELY study15, are used to define a risk score for individuals in the United States (Supplementary Table 1).

The distribution of the risk score in the National Health Interview Survey (NHIS) sample indicates wide variation of risk across individuals in the United States (Extended Data Fig. 2). Overall, for the US adult population, we estimate that 18.1%, 11.0%, 4.3% and 1.6% of the individuals are at or above risk thresholds associated with elevated (≥ 1.2-fold), substantially elevated (≥ twofold), high (≥ fivefold) and very-high (≥ tenfold) risk categories, respectively (Table 2, with 95% CIs provided in Supplementary Table 2). The percentage of each population that exceeds these thresholds varies strongly by age. Only a small fraction (0.1%) of the individuals who are younger than 65 years exceed the threshold for high risk. We further examine the distribution of other risk factors among individuals in the defined high-risk groups for the general population (Extended Data Figs. 3 and 4) and the 65-years-and-older population (Extended Data Figs. 5 and 6). Males, Hispanic people, African American people and individuals with history of obesity, diabetes, cancers, high blood pressure, stroke, chronic heart disease, kidney disease, arthritis and respiratory diseases (excluding asthma) are more common in all of the high-risk groups compared with the general NHIS population. We observed a similar pattern for the 65-years-and-older population.

Table 2 Estimated percentages of the NHIS population and corresponding size of US population that exceed various risk thresholds, overall and within the 18–64-years or 65-years-and-older age group

We observed a substantial variation in risk due to predisposing factors across US communities (Fig. 1). The Index of Excess Risk (IER; Methods) varies around tenfold and eightfold across cities and counties for the underlying adult and 65-years-and-older Medicare populations, respectively (Supplementary Tables 3 and 4). A number of major cities, including Detroit, Miami, Baltimore City, New Orleans and Philadelphia, rank very high according to this index. The proportion of individuals crossing various risk thresholds varies more widely across these communities (Supplementary Tables 3 and 4). For example, the percentage of the adult populations in cities that exceed the fivefold risk threshold varies from 0.4 (Layton, UT) to 10.7 (Detroit, MI). Similarly, the percentage of the 65-years-and-older Medicare population that exceeds the same threshold varies from <1.0% (multiple counties in CO) to >55.0% (multiple counties in TX). Risk distribution for the 65-years-and-older Medicare population also varies substantially across the states (Extended Data Fig. 7 and Supplementary Table 5). Our projections also show that high-risk groups will be disproportionately enriched for deaths across all communities (Fig. 1, Extended Data Fig. 7 and Supplementary Tables 3, 4 and 5). For example, the ratio of the proportion of deaths that are expected to occur in the ≥ fivefold risk group to the proportion of the population at or above the same risk threshold ranges 6.4–32.9 across 477 US cities (Fig. 1).

Fig. 1: Distribution of the IER for COVID-19 mortality and projections for the proportion of high-risk populations across US communities.
figure 1

Results for the general adult population across 477 US cities (left); results for the 65-years-and-older Medicare population across 3,113 US counties (right). a,b, Histograms of IER. c,d, Proportions of underlying populations exceeding fivefold risk threshold. e,f, Proportions of underlying populations exceeding fivefold risk threshold against proportions of total deaths in underlying populations that are expected to occur within the ≥ fivefold risk group. Results for additional risk thresholds and corresponding 95% CIs are provided in Supplementary Tables 3 and 4. IER is an aggregated measure of risk for a community that accounts for prevalence and co-occurrences of the underlying risk factors and weights the contribution of the different factors according to their associated risk.

In a negative binomial regression analysis of deaths (between 7 June 2020 and 1 October 2020) in the 259 counties that contain the 477 cities, we found the coefficient of log(IER) to be statistically significant throughout moving windows of 2 weeks, with an average value of 0.94, close to its ideal value of 1.0, indicating excellent calibration of the underlying individual-level model for the population (Methods and Supplementary Table 6). Further, in a weighted least-squares analysis, we found that log(IER) explains on average 15.4% (95% CI: 12.7–17.3%) of the variation of death rates (on the logarithmic scale) over this time period across the underlying counties (Fig. 2). By comparison, population density and reported 2-week infection rate 3 weeks before the corresponding windows over which deaths were aggregated explain on average 2.4% (95% CI: 2.3–4.1%) and 20.0% (95% CI: 16.9–27.1%) of the variance of the underlying death rates, respectively. In a conditional analysis that accounts for major regional differences in pandemic dynamics, we found that IER explains as much or more of the variance of death rates as 3-weeks-before infection rates (Supplementary Table 6). In an additional validation analysis, we observed similar performance for an IER that was derived for the medication population aged 65 years and older for predicting death rates across 2,999 counties (Supplementary Table 7). In both analyses, we observed the relationship between IER and death rates to be close to linear throughout the range of risk (Extended Data Fig. 8). Finally, using the NHIS risk distribution, we projected that the underlying risk model is expected to have an area under the receiver operating characteristic curve (AUC) of 0.90 for individual risk prediction in the United States (Methods), close to that reported for the UK population in the OpenSAFELY study15. Further, a simulation study shows that the level of R2 achieved for community-level projections is consistent with the projected value of AUC for individual-level predictions (Extended Data Fig. 9).

Fig. 2: Validation of the risk model using county-level mortality data.
figure 2

a, Explained variance (R2) of log death rate in a moving window of 2-week periods by IER, population density and 3-weeks-before infection rate (also aggregated over a 2-week window), across 259 counties representing 477 studied cities. All data are transformed to log scale. The county-level IER was calculated as the weighted average of city-level IERs with each city being weighted by its population size. The dashed curves show results from a conditional analysis, where region indicators for the Northeast, Midwest, South and West are introduced as additional covariates in the regression to account for major regional differences in pandemic dynamics. b, Cumulative number of confirmed COVID-19 deaths across 259 counties during corresponding 2-week periods. During each 2-week period, only the counties with nonzero deaths and infections were included in analyses.

We have made available a web-based risk calculator (https://covid19risktools.com/riskcalculator) that allows an individual to input information on risk factors and obtain estimates of individualized risk for COVID-19 mortality on both relative and absolute risk scales (Extended Data Fig. 10). The relative risk for individuals are reported based on the underlying risk score benchmarked with respect to a ‘population average risk’ defined as population-weighted average risk (IER) across the cities. The calculator returns a numerical value for relative risk and a color-coded categorization of risk into five categories: <1.2 (close to or less than average risk), 1.2–2.0 (moderately elevated risk), 2.0–5.0 (substantially elevated risk), 5.0–10.0 (high risk) and >10 (very high risk). Further, for each person, information on risk score is combined with projections available from the Ensemble pandemic forecasting models (https://covid19forecasthub.org/doc/ensemble/, https://covid19forecasthub.org/doc/ensemble/ and https://github.com/reichlab/covid19-forecast-hub#ensemble-model) in their state of residence to report an absolute rate of mortality over a specified period of time. Both types of risk are provided with 95% CIs to reflect various sources of uncertainty in the projections. We have also made available interactive maps for viewing projections of the number and proportion of individuals at different risk categories across US cities, counties and states (https://jhucovid19.policymap.com/app).

We developed a COVID-19 mortality risk model for the general US population by combining information across multiple data sources. We believe that the model is unique in that it can be used to project absolute rate of mortality for individuals with different risk profiles by combining information on individual-level risk factors, as well as on changing dynamics in the epidemic at the community-level captured through available forecasting models. We applied the model to data available from US national databases to identify high-risk cities and counties and estimate the size of populations at risk within these communities. Our findings could inform policy developments for equitable distribution of early vaccines and other scarce preventive resources. Further, the proposed methodological framework could aid flexible development and updating of other risk models and subsequently use them to perform population-level risk projections.

The US National Academy of Sciences, Engineering and Medicine (NASEM) has released a comprehensive report on guidelines, and underlying principles, for equitable vaccine allocations23. The report recommends a four-phase plan with different populations being prioritized on the basis of evidence on risk of infection, severity and/or mortality following infection, negative societal impact such as those on national security and public services and transmission of infection to others. Although risk factors such as pre-existing conditions and living in overcrowded settings are considered, it is not clear how age and pre-existing conditions are relatively weighted to derive priority categories. Further, while the report acknowledges the excess burden of COVID-19 for various minority populations and areas of high social vulnerability, the development of actual criteria based on these factors has been left to be determined by local governments. Thus, we believe that while the NASEM report provides a strong set of ethical and procedural principles and a broad framework for vaccine prioritization, more quantitative analyses are needed so that priority categories can be better aligned with underlying levels of risk of individuals and populations.

In principle, similar analyses to ours could also be useful for developing strategies for the global distribution of vaccines and preventive therapeutics24. We are working with investigators from the Pan American Health Organization to identify risk factor surveillance datasets and provide estimates of the size of high-risk populations across countries in South America. Development of policies for the efficient and impactful distribution of vaccines, however, also depends on many other factors, including overall social benefit, ease of implementation and available infrastructures. In particular, using priority categories based on a simple set of rules, as opposed to individual risk calculations, might be desirable for the ease of implementation. Risk models, nevertheless, can be useful to evaluate effectiveness of any proposed prioritization plan for its effectiveness in reducing population burden of serious illness or deaths. Further, when data from vaccine trials become available, it will be important to explore possible heterogeneity in vaccine efficacy by risk groups, defined by individual and combinations of risk factors and appropriately modify allocation strategy to maximize population benefit.

Our risk tools and projections can be useful for identifying high-risk groups that would benefit most from ‘shielding’ efforts until they can be vaccinated. In the beginning of the pandemic, the National Health Service of the United Kingdom identified about 1.5 million individuals to be at extremely high risk due to selected conditions and provided them with assistance for food delivery and medical services25. In California, local and state governments developed Project Roomkey (https://covid19.lacounty.gov/project-roomkey/) to provide free-of-charge hotel rooms, meals and other services to asymptomatic homeless people who are at high risk due to their age or/and underlying health conditions. In the future, as businesses, schools and higher education institutes reopen, strategies need to be in place to identify and shield high-risk individuals. Finally, general population risk tools can also help healthy individuals to understand future risk for serious outcomes, not only for themselves, but also for family members and friends and thus could better motivate them to adhere to standard guidelines for infection prevention, such as through handwashing and mask wearing.

A few studies in the past have investigated the proportions of ‘high-risk’ individuals for COVID-19-related serious illness or mortality in the United Kingdom, the United States and across nations globally (https://www.nytimes.com/interactive/2020/05/18/us/coronavirus-underlying-conditions.html?auth=link-dismiss-google1tap)25,26,27. These studies have defined high-risk individuals based on the prevalence of one or more risk factors without taking into account the relative contribution of these factors. Further, because of the broad definition used, they estimate that a very large fraction of the populations, 20% in the United Kingdom and 16–31% globally, are at high risk. By contrast, we have defined different risk categories based on an underlying score that allows one to assign a more precise magnitude of risk to these categories. Further, our framework allows evaluation of future absolute risk for individuals and communities, incorporating information from pandemic forecasting models and thus is uniquely suitable for planning vaccination and other prevention efforts across regions that may have wide variation in the infection dynamics.

Our study has several limitations. First and foremost, information on risk for the majority of risk factors was derived from the UK-based OpenSAFELY study15. We have modified the model to make it suitable for the US population by incorporating population-based information on age- and race-associated rate of mortality. Further, we have empirically shown through independent validation analyses that the projected risk is well calibrated for the general US adult population and correlates strongly with death rates across counties in the United States. There is, however, an urgent need for individual-level data from large population-based studies, akin to the UK OpenSAFELY study, for building and validating general population risk models in the US setting.

Another limitation of our study is that we have not incorporated information associated with front-line occupations that pose higher risk of infection (https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/causesofdeath/bulletins/coronaviruscovid19relateddeathsbyoccupationenglandandwales/deathsregistereduptoandincluding20april2020)28. We have mapped individuals in the NHIS study to various high-risk occupation categories (Supplementary Table 8) and have observed overrepresentation of minority populations in highest risk categories, such as for black and Hispanic individuals in the low-skilled elementary occupation category. Further, a post-publication report from the UK OpenSAFELY study (https://doi.org/10.1038/s41586-020-2521-4) indicates the presence of interactions between age and other risk factors, although required information on all parameters from a fully multivariate model is not available yet. We plan to continually update our risk model, and the corresponding community-level risk projections, through incorporation of emerging information on risk associated with occupations, age interactions and other new risk factors.

We present a comprehensive and flexible framework for assessing general population risk of COVID-19 mortality incorporating individual-level risk profiles, population-level risk factor distribution and time-varying pandemic dynamics. Our risk projections for US cities and counties might be useful for guiding strategies for equitable allocation of early vaccines and other preventive resources in the coming months. Our risk tool and the underlying statistical methodologies can be applied to carry out similar analyses internationally and thus to inform prevention efforts globally.

Methods

Definition of COVID-19 mortality risk score

The risk score for an individual is defined as a weighted combination of various sociodemographic characteristics and predisposing health conditions, with weights defined by the relative magnitude of the contribution of these factors to the risk of death due to COVID-19 in the adult population. We use two sources of information to build the risk score: (1) multivariate-adjusted estimate of risk associated with sex, quintiles for social deprivation index (SDI), which is an approximation for the Index of Multiple Deprivation (IMD), and 12 pre-existing conditions, including body mass index (BMI), smoking status, blood pressure, respiratory disease excluding asthma, asthma, chronic heart disease, diabetes, nonhematological cancer, hematological cancer, stroke, kidney disease and rheumatoid arthritis, from the recently published UK-based OpenSAFELY study15 and (2) death rates associated with different age and racial/ethnic groups in the United States published by the Centers for Disease Control and Prevention (CDC), after performing external covariate adjustment accounting for the correlation of these factors with other risk factors in the model (described below).

Estimation of US-specific risk associated with age and racial/ethnic groups and external covariate adjustments

We used data made available by the CDC on reported COVID-19 deaths as of 6 June 2020, by race, age and state. We fitted a Poisson regression to observe death counts across strata defined by age, race/ethnicity and state. Specifically, the model takes the following form: \(\log \left( {D_{ijk}} \right) = \log \left( {P_{ijk}} \right) + \mathop {\sum}\nolimits_i {\gamma _iState_i + } \mathop {\sum}\nolimits_j {\beta _jRace_j + } \mathop {\sum}\nolimits_k {\delta _kAge_k + {\it{\epsilon }}_{ijk}}\), where Dijk and Pijk denote the death count and population total in state i, race category j and age category k, respectively. The corresponding error term is denoted by \({\it{\epsilon }}_{ijk}\). The model accounts for the underlying population size by an offset term and assumes additive effects of age and race/ethnicity categories on log death rate after adjusting for states as fixed effects in the model. While fitting the model, we excluded the race categories: ‘more than one race’ and ‘Native Hawaiian or Other Pacific Islander’. We subsequently performed external covariate analysis to account for possible correlation between age and race/ethnicity with other risk factors in the model. We used the NHIS dataset as a reference dataset for estimation of joint distribution of all risk factors in the US setting. We assumed that the association of all risk factors, except age and race, in the fully adjusted model was the same across the United Kingdom and United States. We used generalized method of moment techniques that we developed earlier29 to obtain estimates for the effects of age and race adjusted for other risk factors, with their effects being fixed at those available from the fully adjusted model from the UK OpenSAFELY study, based on corresponding unadjusted estimates available from CDC data and correlations of these risk factors with others as observed in the NHIS dataset. The final set of parameters used for the US fully adjusted model is shown in Supplementary Table 1. We observed that while adjustment for other risk factors attenuates the risk associated with some of the ethnic/racial groups, to a large extent the excess risk remains.

Data sources and processing

We utilized a variety of data sources to obtain the latest information on the prevalence of demographic variables and health conditions across US cities. The resources include American Community Survey (ACS) of the US Census Bureau for age, sex and race (2017/2018 Table, 1-year estimates), Behavioral Risk Factor Surveillance System (BRFSS) 2017 survey of the CDC for various health conditions and smoking, National Health and Nutrition Examination Survey (NHANES) for estimating relative proportions of certain subcategories of conditions that were not available in BRFSS, US Cancer Statistics (2012–2016 incidence rates) maintained by the National Cancer Institute (NCI) and the CDC to derive prevalence of hematological and nonhematological malignancies and a database available from the Robert Graham Center on SDI derived from data available from the ACS (2011–2015, 5-year estimates). In addition, we accessed individual-level data from the NHIS of CDC. We extracted individual-level data on the risk factors of 22,109 adults from the 2017 NHIS study. All the required variables, except SDI, were available for individuals in the NHIS. For projections of risk for the United States overall, we applied the most recent age distribution from the US Census Bureau 2019 data and information on other risk factors within age groups from the NHIS.

We used 2018 data from the Centers for Medicare and Medicaid Services (CMS) to obtain the latest information on prevalence of chronic health conditions across US counties for the 65-years-and-older Medicare population (the <65-year-old Medicare population has missing information on age and thus was not included in the analyses). We utilized data for individuals aged 65 years or older from 2017–2018 NHANES and 2017 NHIS to estimate relative proportions of certain subcategories of health conditions that were not available from the CMS. Prevalence of age, race and sex variables for the Medicare population were obtained from the ACS.

Definitions of categories of each risk factor, the corresponding data sources and weights (βk s) are summarized in the following sections.

Behavioral Risk Factor Surveillance System

We used the BRFSS ‘500 Cities: Local Data for Better Health, 2019 release’. This dataset provides information on estimates of crude prevalence for covariates related to unhealthy behaviors and health outcomes among adult US residents in 2017. As the 2019 release is from the 2017 data, which were based on a telephone survey across the states, the total sample size (landline and cellphone) varied from 39,510 to 754,950 with a mean response rate of 44.9%.

In the UK analysis, BMI in the obese group has three categories: Obese class I (30–34.9 kg m2), Obese class II (35–39.9 kg m2) and Obese class III (>40 kg m2). From the BRFSS data, we obtained estimates of total prevalence for these three obese groups for each city, which is defined as the proportion of respondents aged ≥18 years who have a BMI ≥ 30.0 kg m2. We then obtained the relative prevalence of Obese I, Obese II and Obese III subcategories from the NHIS 2017 population. Assuming that among obese individuals, the relative proportion of the three obese categories was the same across the NHIS population and cities, we applied the NHIS estimate of relative proportions to city-level information on overall prevalence to obtain prevalence estimates for each of the three categories of obesity across different cities.

Regarding smoking status, each individual was defined as either nonsmoker, former smoker or current smoker. We obtained information on the estimate of crude prevalence of smokers (former and current smokers combined) for each city from BRFSS data. This was defined as the proportion of respondents aged ≥18 years who reported having smoked ≥100 cigarettes in their lifetime and currently smoke every day or some days. Using the relative proportion of current to former smokers from the NHIS, we further allocated smokers into current and former smokers and obtained the respective prevalence for each city.

Blood pressure is categorized as either normal or high/diagnosed hypertension, where hypertension is defined as systolic BP ≥ 140 mm Hg or diastolic BP ≥ 90 mm Hg. We used estimates of crude prevalence for this group, defined as the group of respondents aged ≥18 years who reported ever having been told by a doctor, nurse or other health professional that they have high blood pressure, from the BRFSS data. Estimates of city-level crude prevalence for respiratory disease other than asthma, chronic heart disease, stroke/dementia, kidney disease and rheumatoid arthritis/lupus/psoriasis were directly obtained from BRFSS data.

Diabetes is quantified based on glycated hemoglobin (Hba1c) measurement in mmol mol−1 into three categories: controlled (HbA1c < 58 mmol mol−1), uncontrolled (HbA1c ≥ 58 mmol mol−1) and no recent HbA1c measure. From BRFSS we obtained the crude prevalence of diabetes, which is the proportion of respondents aged ≥18 years who reported ever been told by health professional that they have diabetes other than diabetes during pregnancy. We then estimated the relative proportion of controlled and uncontrolled diabetes using information on HbA1c in the 2016–2017 NHANES and applied this proportion to the prevalence of overall diabetes, information on which was available for each city from BRFSS, and thus obtained estimates of prevalence of controlled and uncontrolled diabetes across cities.

Asthma was grouped by status of recent use of oral corticosteroids (OCSs). The BRFSS defines crude prevalence of asthma as the ratio of weighted number of respondents who answered ‘yes’ to both of the following questions: “Have you ever been told by a doctor, nurse or other health professional that you have asthma?” and “Do you still have asthma?” to the weighted number of respondents. The OpenSAFELY study reported about 10.69% of patients with asthma were recent OCS users in the United Kingdom and we used this same proportion to allocate the proportion of patients with asthma for each city to recent OCS users.

For the United States, we have not found available data for city-level crude prevalences of several risk factors considered in the UK study15, including liver disease, organ transplant, spleen diseases and other neurological and immunosuppressive conditions and hence did not take these comorbidities into account in the risk score calculation. As these conditions are generally rare, we do not expect that exclusion of these variable from the risk score will lead to a change in risk coefficients for the other variables.

US Census Bureau

We used the publicly available ACS data to obtain the prevalence of demographic variables including age, sex and ethnicity at city level (‘census-designated place’).

Distribution of age and sex for each city were obtained from the 2017 table. The number of individual interviews were around 2,303,000. The age categories available from the census data could be collapsed into age groups 15−<45, 45−<55, 55−<65, 65−<75, 75−<85 and 85+. Sex was categorized into male and female.

The city-wide information on ethnicity was extracted from the latest available 2018 data. The number of individual interviews were around 2,300,000. Ethnicity included in the analysis has the following categories: white (reference), black, Asian, American Indian/Alaskan Native and Hispanic.

Centers for Disease Control and Prevention

We used state-wide COVID-19 death rates reported till 6 June 2020 to estimate risk of COVID-19 mortality associated with age and ethnicity/race categories in the United States. The age categories were 15−<45, 45−<55, 55−<65, 65−<75, 75−<85 and 85+ and the race categories were white, black, Asian, American Indian/Native Alaskan and Hispanic.

United States Cancer Statistics

The United States Cancer Statistics data are a combined cancer data source from the CDC and the NCI that provides statistics on the cancer incidence (newly diagnosed cases) across different cancer sites. We used the most recent county-level incidence data available for 5 years combined from 2012 to 2016. In the UK analysis, the cancer-related risk factors include hematological malignancies and nonhematological cancer, each of which was grouped by time since first diagnosis (within the last year, 2−<5 years or >5 years). To use this categorization for cancer comorbidities, we carried out the following steps.

We first classified the incidence data for various cancer sites into hematological cancer and nonhematological cancer for each county. In particular, ‘Hodgkin lymphoma’, ‘non-Hodgkin lymphoma’, ‘leukemias’ and ‘myeloma’ were grouped into hematological cancer and the rest were grouped into nonhematological cancer. For each of the two cancer groups, we adjusted the county-level annual incidence rates with corresponding overall annual US survival rates (2017 Leading Cancer Cases and Deaths) to obtain county-level prevalence of cancer groups across categories based on diagnosis years. We conducted the following steps.

We first obtained 5-year US survival rates for each cancer group using a weighted combination of annual US survival rates of cancer sites belonging to that specific group. The weights are the proportion of incident cases of cancer sites among incident cases of the particular group. We then computed annual death rate from the 5-year survival rate.

We show the calculations for hematological cancer as an example. Let us denote the obtained annual US death rate for hematological cancer by Dhema. Next, we denote the annual incidence rates of the hematological cancer group of the kth county by Ihema,k. We calculate the prevalence in each of the following categories:

  1. (1)

    Diagnosed in <1 year. We adjust 1-year incidence rates using 6-month survival rates assuming during that 1-year period, people on average had cancer 6 months ago. The prevalence of this category is computed by \(I_{hema,k} \times \left( {1 - \frac{{D_{hema}}}{2}} \right).\)

  2. (2)

    Diagnosed in ≥1 year and <5 years. We adjust 4-year incidence rates with 2-year survival rates. Using a similar idea, we calculate the prevalence using the formula given by 4Ihema,k × (1 − 2Dhema).

  3. (3)

    Diagnosed in ≥5 year. We consider the upper bound for time since diagnosis to be 20 years and hence, we adjust 15-year incidence rates with 7.5-year survival rates in a similar way as for steps (1) and (2).

We repeat the above steps to obtain the cancer prevalence of nonhematological cancer in the three categories.

For city-level projections, we assumed the prevalence for different cancer in a city to be the same as that of the underlying countries because city-level cancer incidence rates were unavailable. The approximation is expected to be reasonable for major metropolitan cities that tend to contain the majority of the population for the underlying county. However, for smaller cities that represent a small proportion of the population for the underlying counties, there is likely to be some bias due to the difference in sociodemographic characteristics of individuals living in and outside cities.

Robert Graham Center and American Community Survey

One of the upstream risk factors considered in the UK analysis was social deprivation, quantified using the IMD in quintiles, which is not available in the US setting. It is a geographic level measure with higher values indicating greater deprivation. In our study, we considered an approximate measure that is available in the United States, the SDI. SDI is a composite measure of seven demographic characteristics, developed by the Robert Graham Center from the ACS. We obtained the latest available data for county-level SDI quintiles in 2015 and for our city-level analysis, we defined the SDI quintile for each city as the SDI quintile of the corresponding county to which the city belongs to. The major components of IMD and SDI are similar and capture income, education, employment and housing condition. The seven factors used for calculating the SDI include the percentage population <100% Federal Poverty Line, percentage population aged 25 years or more with <12 years of education, percentage nonemployed, percentage population living in renter-occupied housing units, percentage population living in crowded housing units, percentage single-parent households with dependents <18 years and percentage population with no car.

National Health Interview Survey

The NHIS data are based on in-person interviews conducted by the National Center for Health Statistics. A sample weight is assigned to each participant to ensure that the NHIS data are representative of civilian noninstitutionalized US population. We used the 2017 NHIS data to obtain individual-level information on various risk factors.

Center for Medicare and Medicaid Services Office of Minority Health

The CMS data are based on the administrative claims of the Medicare beneficiaries obtained from the Chronic Conditions Warehouse. The data provide extensive information on county-wise prevalence of chronic and disabling health conditions (2018) for the 65-years-and-older population enrolled through the Free-for-Service Program. The health conditions include liver disease, cerebral palsy and HIV/AIDS in addition to the ones in BRFSS. We obtained county-level prevalence of race and sex categories for the Medicare population from the US census. We excluded 11 counties that had missing data for multiple risk factors. All remaining 3,113 counties had prevalence data available for hematological/nonhematological cancers, but some had missing information on subcategories of cancer prevalence, defined by number of years since cancer onset. For the 966 and 196 counties that had missing information on subcategories of hematological and nonhematological cancers, respectively, we conducted imputation conditional on age groups, considering the three subcategories of hematological cancer as an example. We first used NHIS data to estimate age-stratified (65–74, 75–84 and 85+ age groups) proportions of each subcategory of hematological cancer (diagnosed within 1 year, 1–5 years ago or >5 years ago) among the 65-years-and-older NHIS individuals with a history of hematological cancer. For each county, we then allocated the prevalence of age groups and the overall prevalence of hematologic cancer for the county to these different subcategories according to their relative proportions observed in the NHIS sample.

Statistical models and methods

A framework for integrating population- and individual-level risk

Similar to the OpenSAFELY study, we assumed that the risk of COVID-19 death at time t for an individual i residing in location l, for example a city or a county, can be described by the proportional risk model

$$\lambda _{il}\left( {X,t} \right) = \lambda _l\left( t \right)\exp \left( {\mathop {\sum }\limits_{k = 1}^K \beta _kX_{ik}} \right) = \lambda _l\left( t \right)R_i(\beta )$$

where λl(t) denotes the baseline risk for location l due to underlying pandemic characteristics, such as social distance measures, population density and mobility patterns and \(R_i\left( \beta \right) = {\mathrm{exp}}\left( {\mathop {\sum}\nolimits_{k = 1}^K {\beta _kX_{ik}} } \right)\) denotes a multiplicative factor associated with risk due to various predisposing factors. Here, t refers to calendar time since some landmark, such as the day when cumulative death reaches some minimum threshold. The average risk of the population at location l can be defined as

$$\lambda _l^A(t) = \lambda _l\left( t \right)E_l\left\{ {{\mathrm{exp}}\left( {\mathop {\sum }\limits_{k = 1}^K \beta _kX_{ik}} \right)} \right\},$$

where El denotes the expectation (average) with respect to distribution of the individual-level risk factors in location l. The above formula allows linking individual-level relative risk models to pandemic scenario models and hence can produce estimates of absolute risk of individuals, taking into account both individual-level risk factors and community-level risk due to pandemic dynamics. In particular, there are a variety of pandemic models available to produce estimates of population-level risk \(\lambda _l^A(t)\), for example in the state of residence of an individual, over the course of a period of time in the future, which could take into account local characteristics, such as reproduction rate, population density and mobility patterns and such information can be used to calculate baseline risk λl(t) through the equation:

$$\lambda _l\left( t \right) = \frac{{\lambda _l^A(t)}}{{E_l\left\{ {{\mathrm{exp}}\left( {\mathop {\sum}\limits_{k = 1}^K {\beta _kX_{ik}} } \right)} \right\}}},$$

and hence the absolute risk denoted as λil(X,t). Here we consider the Ensemble model developed by the Reich Laboratory at the University of Massachusetts Amherst (https://covid19forecasthub.org/doc/ensemble/ and https://github.com/reichlab/covid19-forecast-hub#ensemble-model)30, which produces projection of deaths by taking the arithmetic average of the projections from up to 61 individual forecasting models. The list and detailed descriptions of the individual forecasting models, including assumptions about changing dynamics in the epidemic, are provided at the Reich Laboratory COVID-19 Forecast Hub (https://github.com/reichlab/covid19-forecast-hub#ensemble-model). We download state-level projections and corresponding 95% CIs on a weekly basis.

Approximate characterization for the distribution of risk score (risk on log scale)

Recall that we defined the risk score of COVID-19 death for an individual i as \(RS_i = \mathop {\sum}\nolimits_{k = 1}^K {\beta _kX_{ik}}\). We observed that the distribution of the calculated individual-level risk score for the NHIS population could be approximated by a mixture-normal distribution with three components that correspond to the 18–44-year age group, 45–74-year age group and 75+ age group, respectively (Extended Data Fig. 2). Under the mixture-normal approximation, we can characterize the distribution of risk scores for each location by evaluating its mean and variance within each of the three age groups:

$$E_{l,a}\left\{ {RS} \right\} = \mathop {\sum }\limits_{k = 1}^K \beta _kE_{l,a}\{ X_k\} ,$$
(1)
$$V_{l,a}\left\{ {RS} \right\} = \mathop {\sum }\limits_{k = 1}^K \beta _k^2Var_{l,a}\{ X_k\} + \mathop {\sum }\limits_{k = 1}^K \mathop {\sum }\limits_{k\prime = 1}^K \beta _k\beta _{k\prime }Cov_{l,a}\left( {X_k,X_{k\prime }} \right),$$
(2)

where a is the indicator for age group (a = 1: aged 18–44 years, a = 2: aged 45–74 years, a = 3: aged 75+ years), El,a and Vl,a denote the expectation and variance of the risk score with respect to distribution of the risk factors among age group a in location l.

In the following paragraphs, we will focus on the discussion of approximating characteristics of the risk score distribution for the age 15+ years general population in the 477 cities by the mixture-normal distribution. For the 65-years-and-older Medicare population in each county, we observed that the distribution of risk score could be adequately approximated by a single normal distribution as shown by the NHIS 65-years-and-older population (Extended Data Fig. 2). The corresponding calculations were thus performed based on a single normal distribution.

For simplicity, we denote the mean and variance of the risk score within age group a in location l as \((\mu _{l,a},\sigma _{l,a}^2)\), a = 1,2,3. In this particular analysis, the Xks denote the dummy variables that indicate the levels of categorical risk factors and thus the mean and variance of the risk score can be characterized by the estimates of prevalence of the individual risk factor categories and joint prevalence of pairs of categorical variables. Briefly, for a location l, within each age group, we first obtained the age-stratified prevalence of each risk factor category, then calculated the co-prevalence of each pair of categorical variables using the corresponding odds ratio estimated from NHIS individual-level data.

We first discuss how to calculate the prevalence of each categorical variable within each age group in a location using information on the overall prevalence for that location and relationship between age and that variable estimated from NHIS data. We denote the age group by an indicator variable X1 (X1 = 1 denotes the 18–44 age group and 0 denotes the 45+ age group), and the categorical variable by X2. In the scenarios where X2 has more than two categories, we re-classified X2 into two categories, one for which we computed the prevalence and the other where we collapsed the rest. We estimated the odds ratio, a measure of association, between X1 and X2 using NHIS data after accounting for sampling weights. The odds ratio is defined as (P(X1 = 1, X2 = 1)P(X1 = 0, X2 = 0)) / (P(X1 = 1,X2 = 0)P(X1 = 0,X2 = 1)) and, under a logistic regression framework with X1 being the outcome and X2 being the exposure, it is equal to eθ, where θ is the regression parameter associated with X2. Thus, we fit a logistic regression model with X1 as the outcome and X2 as the exposure and estimate θ by taking into account the individual weights θ in NHIS data. We assumed that the odds ratio estimates obtained from the NHIS could be generalized to underlying populations of cities. We plugged in the odds ratio estimate and the city-specific marginal prevalence of the two groups in equation (3), described below, to derive the proportions of cells in a 2 × 2 contingency table defined by Y and X. We then evaluated age-group-specific prevalence of the required category by taking the ratio of the corresponding cell proportion and city-specific marginal prevalence of corresponding age groups.

Obtaining joint prevalence of two factors from marginal prevalence and odds ratios

Suppose that we have two binary variables for two different risk factors, which, for city l, are denoted as \(X_1^l\) and \(X_2^l\) with marginal prevalence \(p_1^l\) and \(p_2^l\), respectively and odds ratio \(r = p_{11}^lp_{00}^l/(p_{01}^lp_{10}^l)\), which is estimated based on NHIS individual-level data and is assumed to be constant across all cities. We will use these notations without l for simplicity. The 2 × 2 table for the city-specific X1 and X2 is

 

X1 = 1

X1 = 0

Total

X2 = 1

p11

p10

p2

X2 = 0

p01

p00

1−p2

Total

p1

1−p1

1

Given p1, p2 and r, we have the following equations:

$$\begin{array}{l}p_{11} + p_{10} = p_2,\\ p_{11} + p_{01} = p_1,\\ \frac{{p_{11}p_{00}}}{{p_{01}p_{10}}} = r,\\ p_{11} + p_{10} + p_{01} + p_{00} = 1.\end{array}$$

It is easy to see that p00 = 1 − p1 − p2 + p11. We then have

$$\begin{array}{l}\frac{{p_{11}\left( {1 - p_1 - p_2 + p_{11}} \right)}}{{\left( {p_1 - p_{11}} \right)\left( {p_2 - p_{11}} \right)}} = r\\ \Rightarrow \left( {1 - r} \right)p_{11}^2 + \left[ {1 + \left( {r - 1} \right)\left( {p_1 + p_2} \right)} \right]p_{11} - rp_1p_2 = 0.\end{array}$$

We define a = 1 − r, b = 1 + (r − 1)(p1 + p2) and c = − rp1p2, then the solution is given by the quadratic formula:

$$p_{11} = \frac{{ - b + \sqrt {b^2 - 4ac} }}{{2a}}.$$
(3)

After obtaining the marginal prevalence of the variables and their co-prevalence, p11, we can then estimate the expectation, variance of the risk factors and the covariance across them using the following formulas

$$\begin{array}{l}E\left( {X_i} \right) = p_i,\quad i = 1,2,\\ Var\left( {X_i} \right) = p_i\left( {1 - p_i} \right),\quad i = 1,2,\\ Cov\left( {X_1,X_2} \right) = p_{11} - p_1p_2.\end{array}$$

For the risk factors that have multiple categories, such as hematological cancer status, we denote dummy variables for the jth and kth categories as Xj and Xk, respectively. Given marginal probabilities pj and pk, we can calculate the expectation and variance of Xj and Xk based on the properties of multinomial distribution:

$$\begin{array}{l}E\left( {X_i} \right) = p_i\quad i = j,k,\\ Var\left( {X_i} \right) = p_i\left( {1 - p_i} \right),\quad i = j,k,\\ Cov\left( {X_j,X_k} \right) = - p_jp_k.\end{array}$$

We apply these formulas repeatedly to calculate the age-stratified individual and joint prevalence of different risk factors and thus calculate the mean and variance for the risk score for each of the three components of the normal mixture model.

The IER of a city/county

We define the quantity \(\overline {R_l} \left( \beta \right) = E_l\left\{ {\exp \left( {\mathop {\sum}\nolimits_{k = 1}^K {\beta _kX_{ik}} } \right)} \right\}\) as an IER for the population associated with the underlying risk factor distribution in location l, and present the scaled version of IER as \(\overline {R_l} \left( \beta \right)/\bar R\), where \(\bar R\) denotes the weighted average of \(\overline {R_l} \left( \beta \right)\) across cities/counties with population sizes as the weights, to rank cities/counties for their excess risk due to the underlying distribution of risk factors in the populations. As the individual-level data from each location are unavailable, we estimated IER using the available city- and county-level data for the prevalence of Xks, and individual-level data from a representative sample of the US population from the NHIS. Taking city-level analysis as an example, we first obtained the mean and variance of individual-level risk score (RS) within each age group a (μl,a = El,a{RS} and \(\sigma _{l,a}^2 = V_{l,a}\left\{ {RS} \right\}\), respectively, a = 1,2,3). Note that IER can be written as \(IER_l(\beta ) = E_l\left\{ {\exp \left( {RS} \right)} \right\}/\bar R\). Given the mixture-normal assumption for RS, we have

$$E_l\left\{ {\exp \left( {RS} \right)} \right\} = \mathop {\sum }\limits_{a = 1}^3 p_{l,a}\exp \left( {\mu _{l,a} + 0.5\sigma _{l,a}^2} \right)$$

where pl,a denotes the prevalence of age group a at location l, which is available from ACS. We defined an average risk of the US population \(\left( {\bar R} \right)\) based on a weighted average of IER across 477 cities with weights proportional to population size. For the county-level analysis of the 65-years-and-older Medicare population, similar calculations were performed based on a single normal distribution for underlying risk scores.

Model validation

We correlated IER with recent death rates across US cities to validate the underlying model.

Specifically, we conducted independent validation analyses using county-level mortality information from the CDC between 7 June and 1 October 2020, which did not contribute to the model development. In one analysis, for each of the 259 counties that contained the 477 studied cities, we calculated a weighted IER with each city being weighted by its population size. We then examined how strongly the county-level IER predicted the underlying death rates using two approaches. First, we fitted a negative binomial regression of the death counts on log(IER), where the underlying population sizes were used as offset terms for modeling rates and residual heterogeneity in the model was accounted for using an underlying Poisson-Gamma random effects model. If the underlying individual-level risk model is correctly specified, then in this group-level model, one would expect the slope of log(IER) to be close to 1.0. We further modeled the log of death rates across the counties as a linear function of log(IER) and used weighted least squares to estimate a measure of explained variance (R2) of log(death rate) associated with log(IER). As a benchmark, we estimated similar measures for two other likely predictors, log of population density (https://covid19.census.gov/datasets/21843f238cbb46b08615fc53e19e0daf_1?geometry=-168.434%2C28.795%2C169.066%2C67.148&selectedAttribute=B01001_calc_PopDensity) and log of 3-weeks-before infection rate (https://github.com/CSSEGISandData/COVID-19). The 95% CIs for R2 were calculated based on 1,000 bootstrap replicates of county-level data. We also conducted a conditional analysis by incorporating region indicators for the Northeast, Midwest, South and West as additional covariates in the regression to account for major regional differences in pandemic dynamics between regions (https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf).

Additionally, we conducted a validation analysis using a much expanded set of 2,999 counties, but with the IER derived based on the underlying Medicare population aged 65 and older, which was expected to lead to the majority of the observed deaths, as information on risk factor prevalence for the general population is not available across US counties. For each county, we derived a conditional IER based on the risk-factor prevalence for the underlying 65-years-and-older population and then multiplied it by the proportion of population who are 65 years and older in that county to capture its associated risk. All counties with zero death or zero infection over the 2-week period were excluded from the analyses. All analyses were performed using information on deaths over a moving window of 2-week periods for the detection of potential temporal effects. In both analyses, we observed that the relationship between IER and death rates were fairly linear throughout the range of risk.

Further, we use the above calculations to project expected discriminatory performance of the underlying risk model for the US population. We calculated AUC, which is the probability that risk score value is higher for a randomly selected case compared to that for a randomly selected control, based on the observed distribution of risk scores in the NHIS sample (treated as controls) and the projected distribution of risk scores among cases (D = 1).

Calculating the proportion and size of the vulnerable population within a city/county

We further examined the distribution of Ri(β) across individuals within a location to identify the size of the underlying most ‘vulnerable’ populations. For these evaluations, ideally one would require individual-level data for a representative sample of individuals from each location. However, in the absence of such data, we developed a framework to approximate the distributions using city/county-specific information on prevalence and information on the co-occurrence of these factors, captured through the underlying odds ratio parameters, from the NHIS (Extended Data Fig. 2 and details below).

We used \(\bar R\), the average risk for the US population associated with these risk factors, as a reference risk. The proportion of individuals who are at k-fold or higher risk compared to the reference risk can be defined as

$$\Pr \left( {R_{il} > k \times \bar R} \right) = \Pr \left( {{\mathrm{log}}(R_{il}) > \log \left( k \right) + {\mathrm{log}}(\bar R)} \right).$$

We first discuss calculation of the proportion and number of high-risk individuals among the general 18-years-and-older population in each of the 477 US cities. As mentioned previously, the distribution of the risk scores of the 18-years-and-older US population shows a mixture-normal pattern due to the substantial difference in risk of mortality across different age groups. As shown in Extended Data Fig. 2, the distribution of the risk scores of the general adult population in NHIS population can be approximated by a mixture of three normal distributions that correspond to the 18–44 age group, 45–74 age group and 75+ age group, respectively. We therefore assumed that log(Ril), the risk score of an individual i in city l, follows a mixture-normal distribution of three components, \(N(\mu _{l,a},\sigma _{l,a}^2)\) with weight pl,a being to the prevalence of age group a, a = 1,2,3, in city l. The proportion of the population in city l that exceed k-fold of the reference risk is then

\(\Pr \left( {R_{il} > k \times \bar R} \right) = 1 - {\mathrm{CDF}}_{N(\mu _{l,a},\sigma _{l,a}^2,p_{l,a}),a = 1,2,3}\left( {\log \left\{ k \right\} + \log \left\{ {\bar R} \right\}} \right),\) where \({\mathrm{CDF}}_{N(\mu _{l,a},\sigma _{l,a}^2,p_{l,a}),a = 1,2,3}\) denotes the cumulative density function (CDF) of the assumed mixture-normal distribution for the individual-level risk score in city l. Finally, the actual size of the vulnerable population that exceed k-fold of average risk in city l is estimated by multiplying the proportion of vulnerable population by total population size Ml, that is \(M_l\left( {1 - {\mathrm{CDF}}_{N(\mu _{l,a},\sigma _{l,a}^2,p_{l,a}),a = 1,2,3}\left( {\log \left\{ k \right\} + \log \left\{ {\bar R} \right\}} \right)} \right).\)

For the 65-years-and-older Medicare population, we expected that the risk score Rils has an approximately normal distribution, which was validated using the NHIS 65-years-and-older individuals (Extended Data Fig. 2). Under the normal assumption, \(R_{il} \sim N(\mu _l,\sigma _l^2)\), we estimated the mean μl and variance \(\sigma _l^2\) of the risk score Ril using previously described methods and the proportion of the 65-years-and-older Medicare population in location l that exceed k-fold of the reference risk is

$$\Pr \left( {R_{il} > k \times \bar R} \right) = 1 - {\mathrm{{\Phi}}}_{\mu _l,\sigma _l^2}\left( {\log \left\{ k \right\} + \log \left\{ {\bar R} \right\}} \right),$$

where \({\mathrm{{\Phi}}}_{\mu ,\sigma ^2}( \cdot )\) denotes the CDF of N(μ,σ)2.

Expected proportion of deaths at high risk within a location

As validation analysis show that the underlying model is well calibrated for the US population, the distribution of risk score among individuals who are expected to die can be derived based on that for the general population and the underlying risk model. Thus, we make a series of projections for the proportion of total deaths that are expected to occur within various risk categories based on the model. Specifically, we calculate the proportion of deaths expected to occur at or higher than the k-fold risk threshold as

$$\Pr \left( {R_{il} > k \times \bar R|D_{il} = 1} \right) = \frac{{\Pr \left( {D_{il} = 1,R_{il} > k \times \bar R} \right)}}{{\Pr \left( {D_{il} = 1,R_{il} > k \times \bar R} \right) + \Pr \left( {D_{il} = 1,R_{il} \le k \times \bar R} \right)}},$$

where Pr(Dil = 1|Ril), the probability of death given risk Ril, is proportional to Ril.

For the city-level analysis of the general 18-years-and-older population in the 477 US cities, we have

$$\begin{array}{l}\Pr \left( {D_{il} = 1,R_{il} > k \times \bar R} \right) = \mathop {\smallint }\limits_{R_{il} > k \times \bar R} \Pr \left( {D_{il} = 1|R_{il}} \right)f(R_{il})dR_{il}\\ \quad = \mathop {\sum }\limits_{j = 1}^3 \left\{ {\mathop {\smallint }\limits_{{\mathrm{log}}(k \times \bar R)}^\infty \Pr \left( {D_{il} = 1|RS_{il}} \right)f\left( {RS_{il}{\mathrm{|}}a_{il} = j} \right)dRS_{il}} \right\} \times Pr(a_{il} = j)\\ \quad \propto \mathop {\sum }\limits_{j = 1}^3 p_{l,j}\mathop {\smallint }\limits_{\log \left( {k \times \bar R} \right)}^\infty e^{RS_{il}}e^{ - \frac{{\left( {RS_{il} - \mu _{l,j}} \right)^2}}{{2\sigma _{l,j}^2}}}dRS_{il}\\ \quad \propto \mathop {\sum }\limits_{j = 1}^3 p_{l,j}e^{\mu _{l,j}^2 + 0.5\sigma _{l,j}^2}\left[ {1 - {\mathrm{{\Phi}}}_{\mu _{l,j} + \sigma _{l,j}^2,\sigma _{l,j}^2}\left( {\log \left( {k \times \bar R} \right)} \right)} \right],\end{array}$$

similarly,

$$\Pr \left( {D_{il} = 1,R_{il} \le k \times \bar R} \right) \propto \mathop {\sum }\limits_{j = 1}^3 p_{l,j}e^{\mu _{l,j}^2 + 0.5\sigma _{l,j}^2}{\mathrm{{\Phi}}}_{\mu _{l,j} + \sigma _{l,j}^2,\sigma _{l,j}^2}\left( {\log \left( {k \times \bar R} \right)} \right),$$

where RSil = log(Ril), pl,j denotes the prevalence of age group j, ail is the indicator of age group (ail = 1: age 15–44, ail = 2: age 45–74, ail = 3: age 75+). The proportion of deaths in location l that exceed k-fold of the reference risk is then

$$\begin{array}{l}{\mathrm{Pr}}\left( {R_{il} > k \times \bar R|D_{il} = 1} \right)\\ = \frac{{\mathop {\sum }\nolimits_{j = 1}^3 p_{l,j}e^{\mu _{l,j}^2 + 0.5\sigma _{l,j}^2}\left[ {1 - {\mathrm{{\Phi}}}_{\mu _{l,j} + \sigma _{l,j}^2,\sigma _{l,j}^2}\left( {\log \left( {k \times \bar R} \right)} \right)} \right]}}{{\mathop {\sum }\nolimits_{j = 1}^3 p_{l,j}e^{\mu _{l,j}^2 + 0.5\sigma _{l,j}^2}\left[ {1 - {\mathrm{{\Phi}}}_{\mu _{l,j} + \sigma _{l,j}^2,\sigma _{l,j}^2}\left( {\log \left( {k \times \bar R} \right)} \right)} \right] + \mathop {\sum }\nolimits_{j = 1}^3 p_{l,j}e^{\mu _{l,j}^2 + 0.5\sigma _{l,j}^2}{\mathrm{{\Phi}}}_{\mu _{l,j} + \sigma _{l,j}^2,\sigma _{l,j}^2}\left( {\log \left( {k \times \bar R} \right)} \right)}}\end{array}$$

For the county-level analysis of the 65-years-and-older Medicare population in the 3,113 US counties, we have

$$\begin{array}{l}{\mathrm{Pr}}\left( {R_{il} > k \times \bar R|D_{il} = 1} \right) \propto \mathop {\smallint }\limits_{R_{il} > k \times \bar R} \Pr \left( {D_{il} = 1|R_{il}} \right)f(R_{il})dR_{il}\\ = \mathop {\smallint }\limits_{\log (k \times \bar R)}^{\infty} \Pr \left( {D_{il} = 1|RS_{il}} \right)f\left( {RS_{il}} \right)dRS_{il}\\ = 1 - {\mathrm{{\Phi}}}_{\mu _l + \sigma _l^2,\sigma _l^2}\left( {\log \left( {k \times \bar R} \right)} \right),\end{array}$$

where μl and \(\sigma _l^2\) are the mean and variance of the risk score RSil in county l.

Uncertainty in community-level projections

We provided estimates of uncertainties for community-level projections, taking into account various sources of uncertainties in the estimates of risk parameters and random variations associated with various survey datasets (such as BRFSS, NHANES, NHIS). We created bootstrap samples for all inputs that went into the risk calculations, including parameter values, risk factor prevalence and survey datasets and carried out all the calculations on these replicated datasets as we did for our original analyses. We report 95% empirical CI for the final risk estimate for all community-level projections based on their underlying bootstrap distributions.

We first simulated 1,000 bootstrap replicates for the set of coefficients for the UK OpenSAFELY model by simulating them from a multivariate normal distribution with mean and dispersion matrix fixed at the set of original parameter estimates and underlying variance–covariance matrix (obtained by personal communication), respectively. We created 1,000 bootstrap replicates for the NHIS and NHANES datasets by sampling subjects with replacement. For the BRFSS study, for which individual-level data were not available, we obtained bootstrap replicates for the estimates of the risk factor prevalence through simulations. Let Xil denote the indicator variable for a risk factor category for individual i in location l. We simulated Xil from Ber(nl,pl), where pl denotes the original estimate of risk-factor prevalence in location l reported in BRFSS and nl denote the sample size in BRFSS for location l. As the city-wise sample sizes were not reported in BRFSS, we used the total sample size across cities within a state and distributed it to the underlying cities proportionately to their population sizes. We did not consider uncertainty in the estimates associated with the risk factors: sex, ethnicity and age, as the information came from the entire US population collected through census data.

Uncertainty in individual-level risk estimation

We also provided estimates of uncertainties for individual-level projections. For projections of individual-level relative risk, we took into account uncertainty in the estimates of risk parameters. For the projections of individual-level absolute risk, we further incorporated uncertainties associated with state-level projections across up to 61 different forecasting models included in the Ensemble estimator (https://covid19forecasthub.org/doc/ensemble/, https://github.com/reichlab/covid19-forecast-hub#ensemble-model)30.

Denoting X as the vector of risk factors of an individual, \(\hat \beta\) as the vector of the estimated fully adjusted log hazard ratios of the risk factors and \(\bar R(\hat \beta ) = \left( {\mathop {\sum}\nolimits_{l = 1}^L {w_l\bar R_l(\hat \beta )} } \right)/\left( {\mathop {\sum}\nolimits_{l = 1}^L {w_l} } \right)\) the weighted average of the estimated mean risk in each city (that is \(\bar R_l(\hat \beta )\)), with weight wl denoting the population size of city l, the estimated relative risk of the individual is reported as

$$\widehat {RR} = \frac{{e^{X^T\hat \beta }}}{{\bar R(\hat \beta )}}.$$

The sources of uncertainty in \(\widehat {RR}\) include the variation in \(\hat \beta\), \(\bar R(\hat \beta )\) and the covariance of \(\hat \beta\) and \(\bar R(\hat \beta )\). We quantified these sources of uncertainty by constructing a 95% bootstrap CI for \(\widehat {RR}\). Specifically, we have

$$\begin{array}{lll}Var\left\{ \log \left( \widehat{RR} \right) \right\} &=& Var\left[ X_i^T\hat \beta - \log \left\{ \bar R\left( {\hat \beta } \right) \right\} \right] \\&=& X^TCov(\hat{\beta})X + Var\left[ \log \left\{ \bar{R}\left( \hat{\beta}\right) \right\} \right] - 2X^TCov\left[\hat{\beta}, \log \left\{ \bar{R} \left( \hat{\beta} \right) \right\}\right], \end{array}$$

where \(Cov(\hat \beta ), Var\left[ \log \left\{ \bar{R}\left( \hat{\beta}\right) \right\} \right]\) and \(Cov\left[\hat{\beta}, \log \left\{ \bar{R} \left( \hat{\beta} \right) \right\}\right]\) are estimated from 1,000 bootstrap samples of \(\hat \beta\), city-level BRFSS risk factor prevalence and NHIS and NHANES datasets (described above).

The 95% CI of an individual’s relative risk is computed as

$$\left[ {e^{\log \widehat {RR} - z_{1 - \frac{\alpha }{2}} \times sd\left\{ {\left( {{\mathrm{log}}\left( {\widehat {RR}} \right.} \right)} \right\}},e^{\log \widehat {RR} + z_{1 - \frac{\alpha }{2}} \times sd\left\{ {\left( {{\mathrm{log}}\left( {\widehat {RR}} \right.} \right)} \right\}}} \right],$$

where α = 0.05 and \(z_{1 - \frac{\alpha }{2}}\) denotes the \(\left( {1 - \frac{\alpha }{2}} \right)\) quantile of the standard normal distribution.

Recall that the estimate of an individual’s absolute risk is calculated by combining the estimated relative risk that quantifies the risk due to a set of predisposing factors with the estimated baseline risk in the individual’s state of residence that quantifies the risk due to the underlying pandemic characteristics over time:

$$\begin{array}{l}\hat \lambda _s\left( {X,t} \right) = \hat \lambda _s\left( t \right)e^{X^T\hat \beta }\\ \quad = \frac{{\hat \lambda _s^A\left( t \right)}}{{\bar R_s\left( {\hat \beta } \right)}}e^{X^T\hat \beta },\end{array}$$

where \(\hat \lambda _s^A\left( t \right)\) and \(\bar R_s\left( {\hat \beta } \right)\) denote the projected mortality rate and the average relative risk of the population, respectively, in state s (the individual’s state of residence). Taking log on both sides of the above equation, we get \(\log \hat \lambda _s\left( {X,t} \right) = \log \hat \lambda _s^A\left( t \right) - \log \bar R_s\left( {\hat \beta } \right) + X^T\hat \beta .\) We assume that \(\log \hat \lambda _s\left( {X,t} \right)\) is approximately normally distributed and derive its variability as

$$\begin{array}{l}Var\left\{ {\log \hat \lambda _s\left( {X,t} \right)} \right\} = Var\left\{ {\log \hat \lambda _s^A\left( t \right) - \log \bar R_s\left( {\hat \beta } \right) + X^T\hat \beta } \right\}\\ \quad = Var\left\{ {\log \hat \lambda _s^A\left( t \right)} \right\} + Var\left( {\log \bar R_s\left( {\hat \beta } \right)} \right) + Var(X^T\hat \beta )\\ \quad - 2Cov\left( {\log \hat \lambda _s^A\left( t \right),\log \bar R_s\left( {\hat \beta } \right)} \right)\\ \quad - 2Cov\left( {X^T\hat \beta ,\log \left( {\bar R_s\left( {\hat \beta } \right)} \right)} \right) + 2Cov\left( {\log \hat \lambda _s^A\left( t \right),X^T\hat \beta } \right)\\ \quad = Var\left( {\log \bar R_s\left( {\hat \beta } \right)} \right) + Var\left\{ {\log \hat \lambda _s^A\left( t \right)} \right\} + X^TCov(\hat \beta )X\\ \quad 2X^TCov\left( {\hat \beta ,\log \left( {\bar R_s\left( {\hat \beta } \right)} \right)} \right).\end{array}$$

The other two covariance terms are zero as \(\hat \lambda _s^A\left( t \right)\) and \(\hat \beta\) are independent. Here, \(Cov\left( {\hat \beta } \right)\)\(Var\left( {\log \bar R_s\left( {\hat \beta } \right)} \right)\) and \(Cov\left( {\hat \beta ,\log \left( {\bar R_s\left( {\hat \beta } \right)} \right)} \right.\) are estimated from 1,000 bootstrap samples as described above. We estimate \(Var\left\{ {\log \hat \lambda _s^A\left( t \right)} \right\}\) based on the variation of death reported across the 58 models included in the Ensemble estimator30. Specifically, based on the reported CI ([MRL,MRU]) and normally, we calculate

$$Var\left\{ {\log \hat \lambda _s^A\left( t \right)} \right\} = \left\{ {\frac{{\log \left( {MR_U} \right) - \log \left( {MR_L} \right)}}{{2\left( {z_{1 - \frac{\alpha }{2}}} \right)}}} \right\}^2.$$

The 100(1 − α)% CI of an individual’s absolute risk is then finally computed as

$$\left[ {e^{\log \hat \lambda _s\left( {X,t} \right) - z_{1 - \frac{\alpha }{2}} \times sd\left( {\log \hat \lambda _s\left( {X,t} \right)} \right)},e^{\log \hat \lambda _s\left( {X,t} \right) + z_{1 - \frac{\alpha }{2}} \times sd\left( {\log \hat \lambda _s\left( {X,t} \right)} \right)}} \right].$$

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.