County-level longitudinal clustering of COVID-19 mortality to incidence ratio in the United States

As of November 12, 2020, the mortality to incidence ratio (MIR) of COVID-19 was 5.8% in the US. A longitudinal model-based clustering system on the disease trajectories over time was used to identify “vulnerable” clusters of counties that would benefit from allocating additional resources by federal, state and county policymakers. County-level COVID-19 cases and deaths, together with a set of potential risk factors were collected for 3050 U.S. counties during the 1st wave of COVID-19 (Mar25–Jun3, 2020), followed by similar data for 1344 counties (in the “sunbelt” region of the country) during the 2nd wave (Jun4–Sep2, 2020), and finally for 1055 counties located broadly in the great plains region of the country during the 3rd wave (Sep3–Nov12, 2020). We used growth mixture models to identify clusters of counties exhibiting similar COVID-19 MIR growth trajectories and risk-factors over time. The analysis identifies “more vulnerable” clusters during the 1st, 2nd and 3rd waves of COVID-19. Further, tuberculosis (OR 1.3–2.1–3.2), drug use disorder (OR 1.1), hepatitis (OR 13.1), HIV/AIDS (OR 2.3), cardiomyopathy and myocarditis (OR 1.3), diabetes (OR 1.2), mesothelioma (OR 9.3) were significantly associated with increased odds of being in a more vulnerable cluster. Heart complications and cancer were the main risk factors increasing the COVID-19 MIR (range 0.08–0.52% MIR↑). We identified “more vulnerable” county-clusters exhibiting the highest COVID-19 MIR trajectories, indicating that enhancing the capacity and access to healthcare resources would be key to successfully manage COVID-19 in these clusters. These findings provide insights for public health policymakers on the groups of people and locations they need to pay particular attention while managing the COVID-19 epidemic.

www.nature.com/scientificreports/ higher risk of developing more severe symptoms out of a simple upper respiratory infection (similar to the Bhat et al. suggestion 14 ).
Cardiovascular disease, CVD. In addition to respiratory complications, published studies are showing the impact of pre-exist CVDs on developing COVID-19 and on worsening its severity and clinical outcomes. Hendren et al. showed that COVID-19 might cause myocarditis-like syndrome and acute myocardial injury associated with reduced left ventricular ejection fraction (LVEF), which can also be complicated by heart failure 15 . A different analysis based on Chinese data showed that 8-20% of the patients hospitalized with COVID-19 had abnormal cardiac troponin I (cTnI), were also older and had more comorbid diseases 16,17 . There is also published literature suggesting that SARS-CoV-2 can infect fibroblasts and cardiomyocytes via the ACE2-pathway causing myocardial injury [18][19][20][21][22] . Moreover, it is shown that patients with viral myocarditis, which commonly exhibit chest pains, can mimic ventricular arrhythmia or coronary syndrome 23,24 . Historically, research has shown a significant increase in SARS patients' mortality with pre-existing CVD 25-30 . Demographic and social factors. Age. People 65 years of age and older are at significantly higher risk of experiencing COVID-19 or hospitalization and death, especially if they have pre-existing comorbidities such as CVD, DM, CLD, Hypertensive heart disease, and obesity 31,32 . Ferguson et al. reported that 27-71% of patients older than 60 years needed special care in an ICU with an infection fatality rate of about 2-9.5% 33,34 . Stang et al. discussed a potential bias in age-significance in COVID-19 patients due to overestimation caused by the limited testing capacity to more symptomatic patients. They showed that the fatality rate from COVID-19 started increasing after the age of 60 years in Italy, Spain, and the USA 35,36 . There is also a study on children with a median age of 7 years in China (April 1, 2020) in which most of the cases were male (not significant, though) with mild symptoms 37 . Note that the evidence and data to confirm whether increase in mortality is directly related to age or other confounders related to age is still rather mixed. For instance, Starke et al. 38 showed that when adjusting for other comorbidities, there is no additional risk of death by age. Other similar studies in Austria 39 and Italy 40 support the insignificant effect of age on COVID-19 severity and mortality rate, after accounting for other factors. However, recent studies in the USA (New York) 41 and Brazil (Espírito Santo) 42 showed an increase in COVID-19 mortality (OR 6.3% in Brazil and OR 1.7 in the USA).
Gender. Most evidence suggests that men are infected at a higher rate than women by COVID-19 and exhibit a higher mortality rate. However, most studies showed no significant differences in infection and mortality between men and women in COVID-19 cases 3,43 . Wenham et al. indicated that although an equal number of male and female COVID-19 cases was observed, MR is different by gender. Wenham et al. also suggested that women can be at high risk of getting infected since they have more front-line interaction with communities and provide more informal care within families besides their physical and cultural differences 44,45 . Further, selected studies report significantly different gender-distributions between male and female COVID-19 cases. For example, Zhao Y et al., using single-cell data, reported that ACE2 was upregulated in Asian males compared to women and other ethnicities, which may lead to more severe incidents of COVID-19 11,46-49 . Environmental factors. Air pollution. Exposure to air pollution and particulate matter (PM) can have a positive association with increased risk of certain viral respiratory diseases such as influenza and SARS pandemic 2003. Studies show that exposure to PM increased the MR from 2009 H1N1 and Spanish influenza [50][51][52][53] . Air pollution is also linked to cellular damage, inflammation, CVD, and CLD, which are potential comorbidities associated with COVID-19 severity 50,54-56 . Ye et al. showed that air pollution could also play a role in infectious disease transmission, although it has not been studied for COVID-19 as of May 15, 2020 57 .
Wu et al. and Mollalo et al., in nationwide studies in the USA, showed that exposure to PM increased COVID-19 mortality and severity 50,58,59 . Setti et al. reported a significant relationship between PM and experiencing COVID-19 in Italy (Jan 1, 2020) 60 .
A number of studies did not confirm the association between air pollution and COVID-19 severity, mortality, and transmission. However, they agreed that since exposure to air pollution, and PM has a link with other complications, there can be a risk factor in increasing COVID-19 MR and disease severity [61][62][63][64] .

Methods
Data resources. We collected county-level cumulative COVID-19 confirmed cases and death from Mar 25 to Nov 12, 2020, across the contiguous United States from USAFacts (usafacts.org). As explained in the introductory section, we considered Mar 25 to Jun 3 as the "1st wave", Jun 4 to Sep 2 as the "2nd wave", and Sep 3 to Nov 12 as the "3rd wave" of COVID-19. For the 2nd and 3rd waves, we analyzed targeted counties in the sunbelt region (including AL, AZ, AR, CA, FL, GA, KS, LA, MS, NV, NM, NC, OK, SC, TX, TN, and UT states) and the great plains region (including IA, IL, IN, KS, MI, MO, MN, ND, NE, OH, SD, and WI states), respectively. MIR, as a proxy for survival rate, is calculated by dividing the number of confirmed deaths in each county by the confirmed cases in the same county at the same time-period multiplied by 100. MIR ranges from 0 to 100%, 100% indicating the worst situation where all confirmed cases have died.

Analysis (descriptive methods and models).
We first provide summary statistics for COVID-19 data for the period under consideration. Full descriptive statistics for n = 38 potential risk factors are provided in Table S1in the Supplement. Second, we applied GEE marginal approaches to model the COVID-19 MIR over time and found significant risk factors. To this end, we first used the forward-selection method to select the most relevant risk factors (covariates) among the covariates using univariate GEE models 65 , as follows: where µ ij indicates the mean COVID-19 MIR for the i th county in week j th , β 0 is the starting rate of MIR before considering the effect of any potential risk factor (intercept), β 1 and β 2 s are the effects of time and risk factors X (such as Asthma) on the COVID-19 MIR. For variable selection purposes, we chose variables with (univariate) P value < 0.2 to be included in the final multivariate GEE model, as follows: where µ ij indicates the overall marginal mean MIR for the i th county in the j th week. α 0 is the intercept and α p is the coefficient of the p th potential risk factor ( X p ), p = 1, 2, . . . , n 1 , where n 1 is the total number of the selected variables based on the univariate GEE model (Eq. 1). Variables with (multivariate) P value < 0.05 will be selected as the potential risk factors. In each marginal model, an appropriate correlation structure (with the best goodness of fit index, QIC) was utilized. Statistical analysis and visualization for this step were performed using the geepack R-package (https ://cran.r-proje ct.org/web/packa ges/geepa ck/). Third, we evaluated COVID-19 MIR growth trajectories over the study time periods (1st, 2nd, and 3rd waves) using a latent growth model (LGM). An LGM approach considers both the mean MIR differences between counties at each time point (inter-subject) and MIR growth trajectories over time (intra-subject). Specifically, suppose y ti is the COVID-19 MIR in the i th county at time t ; then, it can be modeled as follows 66 : where η 0i and η 1i are two latent growth factors and t s are time scores (factor loadings); ε ti is a normally distributed error term for the i th county at time t ; η 0 and η 1 indicate the estimated overall mean COVID-19 MIR in each county and the average rate of MIR change, respectively. We also employed a number of non-linear (quadratic) LGMs, based on a polynomial time function (quadratic or higher-order) of time scores 67 to decrease estimation bias to account for the MIR trajectories exhibiting non-linear behavior over time. The non-linear LGM using a quadratic time function is given by: where η 2 indicates the growth factor, which can be a concave or convex form of the COVID-19 MIR pattern over the study time periods (1st, 2nd, and 3rd waves), and 2 t are the squared time scores. Both linear and non-linear LGMs were applied to 1736 U.S. counties with MIR > 0, i.e., counties with at least one confirmed death between Mar 25 to Nov 12, 2020. We then used information criteria (AIC, BIC) to find the best model among linear and non-linear LGMs to determine the COVID-19 MIR changes and patterns over the study time. Smaller AIC and BIC values indicate a better fit of the underlying models. We also calculated Moran's I 68 to evaluate the spatial autocorrelation of COVID-19 MIR across the U.S. counties.
Fourth, we identified clusters of the U.S. counties based on the COVID-19 MIR growth trajectory over time using longitudinal LGMMs 66 , as follows: www.nature.com/scientificreports/ where k is the upper bound of the number of the clusters, η k 00 indicates the initial COVID-19 MIR at the beginning of the study, and η k 10 indicates the average rate of COVID-19 MIR change over time. To find the optimal number of clusters ( k ), we fit a series of LGMMs with different numbers of clusters of counties and conducted tests for the adequacy of the reduced models with respect to the number of clusters. Information criteria such as AIC, BIC, and a bootstrap likelihood ratio test (BLRT) were used to compare the k-cluster model to the (k − 1) -clsuter model 69,70 . Also, cluster sample sizes greater than 1% of the total sample size and a relative entropy (REN) statistic greater than 0.8 were considered as the qualified latent class membership classification criteria 71 . The REN statistic for a k-class model is calculated as , where k and i correspond to the number of clusters and counties, respectively, and P ik indicates the posterior probability for the i th county to be in cluster k . We then applied a multinomial logit model to find the significant risk factors in each cluster as follows: where y i is a categorical variable with K possible categories (indicating the cluster number), α k is the intercept for cluster k , β k is a vector of regression coefficients of the p th potential risk factor ( X p ), p = 1, 2, . . . , n 1 , where n 1 is the total number of the selected variables based on the univariate GEE model (Eq. 1).
At the beginning of the 1st wave (Mar 25), about 93% (n = 2830) of the U.S. counties had zero confirmed death (MIR = 0%), which decreased to 42.9% (n = 1311) by the end of the 1st wave (Jun 3, 2020). This percentage at the beginning of the 2nd wave (Jun 4) for the states under consideration was 32.9% (n = 442) and decreased to 11.2% (n = 150) by the end of this wave (Sep2, 2020). During the 3rd wave for the states under consideration, this rate started from 30.3% (n = 320) and decreased to 10.0% (n = 105).
On Jun 3, 2020, the median population of the 3050 U.S. counties was 25 884, with Loving county in Texas having the smallest population (n = 169) and Los Angeles County in California the largest one (n = 1,039,107). Queens County in New York state had the maximum number of confirmed cases at the beginning of the study on Mar 25 (n = 6,420), while Cook County in Illinois had the maximum confirmed cases (n = 80,204) at the end of the 1st wave on Jun 3, 2020; whereas the maximum number of confirmed death was reported in King County in Washington state on Mar 25 (n = 100) and in Kings County in New York state on Jun 3 (n = 6,774). On Jun 4 (the beginning of the 2nd wave for the states under consideration), Los Angeles County in CA had the maximum number of both confirmed cases (n = 59,650) and deaths (n = 2,531). This county had the maximum number of both confirmed cases (n = 243,935) and deaths (n = 5,878) at the end of the 2nd wave (Sep 2), as well. During the 3rd wave for the states under consideration, Cook County in IL had the maximum number of cases (n = 128,012 on Sep 3 and n = 227,425 on Nov12) and deaths (n = 5,080 on Sep 3 and n = 5,667 on Nov 12).
Based on the univariate variable selection method (Table 2), some potential risk factors were excluded from the final analysis. The description table of the potential risk factors is provided in Table S1 in the Supplement.
The effect of time on the COVID-19 MIR was significant and negative for both the 2nd (β = − 0.09, P value < 0.001) and the 3rd (β = − 0.03, P value < 0.001) waves, suggesting that the use of longitudinal (repeated measures) approaches instead of cross-sectional studies are more suitable to evaluate the growth trajectory of COVID-19 MIR over time.
Tables S2-S4 show the full results based on the LGMs. Based on the information criteria, a non-linear LGM with a quadratic term exhibited a better fit than the linear LGM. Figure 1 shows the overall COVID-19 MIR Table 1 Table S4). Note that for the targeted counties (great plains) during the 3rd wave, the mean COVID-19 MIR was already elevated, therefore, we observe a Based on the LGMM results, an 8-cluster non-linear model for the 1st wave, a 5-cluster non-linear model for the 2nd wave, and a 4-cluster non-linear model for the 3rd wave were selected as the best models to find clusters of the U.S. counties. Detailed results for the LGMM models are provided in tables S5-S9. Table 4 and Fig. 2 show the detailed MIR information over time (factor loadings are reported in Table S6). www.nature.com/scientificreports/ Details of the nine clusters (including a cluster of counties with zero MIR) during the 1st wave are as follows: Cluster 0 contains 1314 counties with zero confirmed death from COVID-19 (i.e., MIR = 0) during the study time (1st wave).
Cluster 1, with 52 counties from 28 different states, had the highest MIR at the beginning of the study (intercept = 12.9% ± 3.1%) compare to other clusters (Table 4). This cluster continued having the highest MIR at the end of the study, on Jun 3, 2020 (Table S7, (Jun 3, 2020).
Details of the six clusters (including the cluster of counties with zero MIR) during the 2nd wave are as follows: Cluster 0 contains 156 counties with zero confirmed death from COVID-19 during the 2nd wave (i.e., MIR = 0). Cluster 1, with 32 counties from 7 different states (AR, GA, LA, MS, NM, SC, and TX), had the lowest MIR at the beginning of the 2nd wave (Intercept = 1.5% ± 0.3%). However, by the end of the 2nd wave (Sep 2, 2020), it had the second-highest MIR (MIR = 4.8%) compare to other clusters (with the maximum increase in COVID-19 MIR of 3.3%↑, Table S8) 16,2020. From here, the rate increased to MIR = 3.5% till Aug 13, and thereafter, slightly decreased to MIR = 3.3% till Aug 27, 2020.
Cluster 5, with 19 counties from 9 different states, had the highest MIR at the beginning of the study (intercept = 14.1% ± 4.5%) compare to other clusters (Table 4). However, on Aug 27, it had the third-lowest MIR compare to other clusters (Table S7, MIR = 4.0%). TX (Brown, Callahan, Fisher, Hood, Martin, and Palo Pinto counties) and OK (Cotton, Delaware, Kiowa, and Latimer counties were the most frequent states present in this cluster. Within this cluster, Fisher (TX), Cotton (OK), and Jenkins (GA) counties had the highest COVID-19 MIR. COVID-19 MIR growth trajectory for the counties in this cluster showed a 1.0% increase from Jun 4 (MIR = 14.1%) to Jun 18 (MIR = 15.1%) but thereafter, had a sharp decrease to MIR = 5.2% till Jul 16, 2020. This rate slightly decreased to MIR = 4.0% till Aug 27, 2020.
Details of the five clusters (including the cluster of counties with zero MIR) during the 3rd wave are as follows: Cluster 0 contains 111 counties with zero confirmed death from COVID-19 during the 3rd wave (i.e., MIR = 0). Cluster 1, with 125 counties from 11 different states, had the highest MIR at the beginning of the 3rd wave (intercept = 5.2% ± 0.2%). However, by the end of the 3rd wave (Nov 12, 2020), it had the second-highest MIR (MIR = 3.6%) compare to other clusters (also, with the maximum decrease in COVID-19 MIR of 1.6%↓, Table S9). IN (25 counties) and MI (25 counties) were the most frequent states present in this cluster. Within this cluster, Morton (KS) and Monroe (OH) counties had the highest COVID-19 MIR. COVID-19 MIR growth trajectory for the counties in this cluster showed a 1.6% decrease from Sep 3, 2020 (MIR = 5.2%) to Nov 12, 2020 (MIR = 3.6%).
Cluster 2 with 47 counties from 12 different states had the lowest MIR at the beginning of the 3rd wave (intercept = 1.0% ± 0.6%) compare to other clusters (Tables 4 and S9 www.nature.com/scientificreports/ Cluster 3, with 11 counties from 6 different states (IL, KS, MO, NE, ND, OH), had the second-highest MIR at the beginning of the 3rd wave (intercept = 3.5% ± 1.1%) compare to other clusters (  (74 counties) were the most frequent states present in this cluster. Within this cluster, Phillips (KS) and Renville (MN) counties had the highest COVID-19 MIR. COVID-19 MIR growth trajectory for the counties in this cluster was MIR = 1.4% at the beginning of the 3rd wave and decreased to MIR = 1.2% till the end of the wave (on Nov 12, 2020).
More information about the COVID-19 MIR estimation at both the beginning and the end of each wave, the amount of increase (or decrease) in this rate, and each cluster's rank are presented in tables S7-S9. One important point in Table S7 is that during the 1st wave, counties in cluster 4 (MIR: 0.8% → 10.5%) and cluster 7 (MIR: 1.5% → 11.6%) had the highest increase in COVID-19 MIR from Mar 25 to Jun 3, 2020. During the 2nd wave, counties in cluster 1 (MIR: 1.5% → 4.8%) had the highest increase in COVID-19 MIR; whereas, counties in cluster 5 (MIR: 14.1% → 4.0%) had the highest decrease in this rate from Mar 25 to Jun 3, 2020 (Table S8). During the 3rd wave, counties in cluster 2 (MIR: 1.0% → 2.5%) had the highest increase in this rate from Sep 3 to Nov 12, 2020 (Table S9). Counties in cluster 1 (MIR: 5.2% → 3.6%) had the highest decrease in COVID-19 MIR; however, it had the second-highest COVID-19 MIR compare to other clusters.
Tables 5, 6 and 7 show the significant risk factors in each cluster during the 1st, 2nd, and 3rd waves, respectively. To find the odds ratios (ORs), we used cluster 0 as the baseline (with MIR = 0) and compared all other clusters to it. The full results of the multinomial logit models are provided in the Supplement (Tables S10-S12).
For the 3rd wave, cardiomyopathy and myocarditis (OR 1.

Discussion
This study investigated the county-level COVID-19 confirmed cases and death from Mar 25 to Nov 12, 2020, in a longitudinal fashion in the contiguous United States. We considered Mar 25 to Jun 3 as the "1st wave", Jun 4 to Sep 2 as the "2nd wave", and Sep 3 to Nov 12 as the "3rd wave" of COVID-19. We assessed the growth trajectories of COVID-19 MIR and found the county-level clusters of the contiguous United States with similarities in COVID-19 MIR growth trajectory over time. We also considered the effects of different county-level potential risk factors on MIR (for each wave), including comorbidities & disorders, demographics & social factors, and environmental factors. We selected MIR as a measure of mortality since it also considers the number of confirmed cases to adjust the mortality rates. However, the estimates of all COVID-19 epidemiological-measures (i.e., incidence, prevalence, and mortality rates) are subject to bias due to the imprecise number of affected (confirmed) cases, especially those with mild or no disease symptoms. Moreover, there are not enough studies presenting and discussing county level risk factors, especially pre-existing comorbidities, with COVID-19 incidence and mortality. Table 5. 1st Wave (Mar 25-Jun 3, 2020): significant risk factors and their odds ratios in each cluster compare to cluster 0 (counties with MIR = 0). Blank spots indicate the insignificant risk factors. *For instance, OR 1.1 means that 1% increase in CVD MR is associated with a 10% increase in the relative log odds of being in cluster 5 vs. cluster 0 (MIR = 0). www.nature.com/scientificreports/ We found nine, six and five clusters of U.S. counties (including a cluster of counties with zero MIR) based on the COVID-19 MIR pattern (growth trajectory) using a longitudinal LGMM in the 1st, 2nd and 3rd waves, respectively. All counties in the same cluster have a similar COVID-19 MIR growth pattern over the study time. This approach considered both spatial and temporal heterogeneities in COVID-19 MIR due to pre-existing comorbidities, environmental factors, and demographics. We also identified significant risk factors associated with the identified clusters using a multinomial logit model. It is shown that different age and sex distributions in the U.S. counties impact differentially COVID-19 mortality and severity 72,73 . Race is also a factor that leads to heterogeneity. For instance, it has been reported African Americans having a higher risk of getting infected, experiencing more severe COVID-19 and death 74 . Further, note that about 43% of the northern and central U.S. counties did not experience death from COVID-19 until Jun 3.
During the 1st wave, nearly 116 counties in clusters 1 and 6 had the highest mean COVID-19 MIR at the beginning of the study on Mar 25, 2020. On Jun 3, 2020, cluster 1 still had the highest mean COVID-19 MIR (MIR = 13.2%), while counties in cluster 6 improved to the third lowest (excluding the cluster with MIR = 0). Counties in cluster 7 had a low level of COVID-19 MIR at the beginning of the study on Mar 25 (MIR = 1.5%). However, they had a very dramatic increase (10.1%↑) in COVID-19 MIR till Jun 3, 2020 (MIR = 11.6%). Cluster 7 became the one with the second-highest COVID-19 MIR at the end of the 1st wave on Jun 3, 2020. Based on Table 6. 2nd Wave (Jun 4-Sep 2, 2020): significant risk factors and their odds ratios in each cluster compare to cluster 0 (counties with MIR = 0). Blank spots indicate the insignificant risk factors. *For instance, OR 1.1 means that 1% increase in diabetes MR is associated with a 10% increase in the relative log odds of being in cluster 2 vs. cluster 0 (MIR = 0). www.nature.com/scientificreports/ these clustering results, we considered clusters 1 and 7 as the so-called "more vulnerable" clusters of counties requiring a more concerted effort and stronger mitigation strategies to control disease mortality. . In states where the majority of "more vulnerable" clusters (1 and 7) were during the first wave, there were no state-wide face-mask mandates, which might cause an increase in COVID-19 incidence and subsequently in COVID-19 MIR 75,76 . For instance, OK does not have any state mandate for public mask-wearing to date. A partial maskwearing rule was announced in IA in Nov 2020 (for Iowans age 2 and up in indoor public places). Therefore, different face-mask mandates can be one reason for having higher COVID-19 MIR in these vulnerable clusters and be further mitigated by issuing state-wide full face-covering mandates. Table 7. 3rd Wave (Sep 3-Nov 12, 2020): significant risk factors and their odds ratios in each cluster compare to cluster 0 (counties with MIR = 0). Blank spots indicate the insignificant risk factors. *For instance, OR 1.2 means that 1% increase in diabetes MR is associated with a 20% increase in the relative log odds of being in cluster 1 vs. cluster 0 (MIR = 0). **Due to the sparsity of hepatitis mortality rate in these particular counties (during the 3rd wave), the odds ratio estimation of hepatitis is not reliable. One way around this issue is to categorize the hepatitis MR and use the categorical version of this variable in the multinomial model. However, we decided to avoide this approach to stay consistent with the rest of the results. www.nature.com/scientificreports/ TB (OR 1.3) and drug use disorder (OR 1.1) are two significant comorbidities associated with a 30% and 10% increase in the odds of being in cluster 7 vs. cluster 0. Among the demographic and environmental factors, male-WA% (OR 1.8) and PM (OR 1.1) are significantly associated with an 80% and 10% increase in the relative log-odds of being in cluster 7 vs. cluster 0. Therefore, protecting subjects with TB and drug use disorder and managing the PM 2.5 level of the air (a mixture of solid particles and liquid droplets found in the air, such as dust, dirt, or smoke) can help ameliorate the COVID-19 mortality in these counties. Moreover, more than 80% of the counties in clusters 1 and 7 were rural areas based on the U.S. Census Bureau definition (https ://www.censu s.gov/ progr ams-surve ys/geogr aphy/guida nce/geo-areas /urban -rural .html). Lack of access to health and critical care infrastructure and more limited resources, in general, may be responsible for higher COVID-19 MIR. Therefore, addressing these factors would be beneficial in the long run for managing the epidemic.  www.nature.com/scientificreports/ factors, unemployed% (OR 1.5) and temperature (OR 1.2) are significantly associated with a 50% and 20% increase in the relative log-odds of being in cluster 1 vs. cluster 0 (tables S10-S12). Therefore, protecting subjects with hepatitis, HIV/AIDS, and TB and managing the unemployment rate can help ameliorate the COVID-19 mortality in these counties. The effect of temperature, however, could be due to other confounding variables. For instance, when the weather is cold, people spend more time together indoors. Therefore, informing the residents of these counties about distancing and mask-wearing may help to improve the COVID-19 MIR. Moreover, about 60% of the counties in cluster 1 were rural areas based on the U.S. Census Bureau definition (https ://www.censu s.gov). Similar to the conlucsion for the 1st wave, lack of access to health and critical care infrastructure and more limited resources, in general, may be responsible for higher COVID-19 MIR. During the 3rd wave, 125 counties in cluster 1 (MIR = 5.2%) had the highest mean COVID-19 MIR at the beginning of the wave on Jun 4, 2020. Although the mean COVID-19 MIR of the counties in cluster 1 decreased (MIR = 3.6%) by the end of the wave, this cluster remained the second-highest compared to other clusters. Based on the clustering result (as of Nov 12, 2020), we considered cluster 1 as the so-called "more vulnerable" cluster of counties requiring more attention to control disease mortality. IN  3) are four significant comorbidities that are associated with an increase in the odds of being in cluster 1 vs. cluster 0. Among the demographic and environmental factors, female AA% (OR 33.4), smokers% (OR 1.3), and population density (OR 1.02) are significantly associated with increased relative log-odds of being in cluster 1 vs. cluster 0 (tables S10-S12). Therefore, protecting subjects with diabetes, TB, mesothelioma and cardiomyopathy and myocarditis, and smoking history can help ameliorate COVID-19 mortality in these counties. The effect of population density, however, could be complicated and due to other confounding variables. At the beginning of the COVID-19 pandemic, dense (urban) areas around the world such as New York (USA), Madrid (Spain), Milan (Italy), London (UK), and Tehran (Iran) were identified as disease hotspots. In our analysis, nearly 40% of the counties in cluster 1 (during the 3rd wave) were urban areas based on the U.S. Census Bureau definition (https ://www.censu s.gov). One reason that may explain the effect of population density on disease mortality/spread could be that large cities are mostly connected with many other locations 77 . Crowding and transport infrastructure quality are conducive for the spread of the disease 78 . Therefore, addressing these factors and continuously informing residents about social distancing, mask-wearing, and self-isolation (and household quarantine) would be beneficial in the long run for managing the epidemic in this region.
Amongst the comorbidities, we found a significant positive association between COVID-19 MIR and heart diseases, including cardiomyopathy and myocarditis (0.15% MIR↑ in the 1st wave, and 0.12% MIR↑ in the 2nd wave), hypertensive heart disease (0.11% MIR↑ in the 1st wave, and 0.09% MIR↑ in the 2nd wave), peripheral vascular disease (0.31% MIR↑ in the 1st wave), and cerebrovascular disease (0.07% MIR↑ in the 1st wave, and 0.07% MIR↑ in the 2nd wave). This finding is in accordance with recent studies on the topic, even though its etiology remains uncertain. This can be due to antiviral drugs (as a treatment of COVID-19), which can cause different cardiovascular disorders (such as cardiac insufficiency and arrhythmia) 79 . Moreover, most of the patients with pre-existing heart disorders use renin-angiotensin-aldosterone system (RAAS) blockers, which are suggested to increase the COVID-19 severity and MR 80,81 . Additionally, SARS-CoV-2 infection can act as a precipitating factor that worsens the cardiac insufficiency and leads to death in patients with pre-existing heart complications 79 . Cardiovascular diseases can also increase the COVID-19 severity and MR via aggravating pneumonia 79 . Historically, it is shown that patients with pre-existing heart and lung diseases had a higher mortality rate from SARS 18,[25][26][27][28][29][30] 17 . A metaanalysis with 46,248 confirmed COVID-19 cases showed that patients with severe disease were more likely to have CVD (odds ratio = 3.4) and hypertensive heart disease (odds ratio = 2.4) 84 . Recent studies have reported ACE2 as the coreceptor for the coronavirus in patients with different complications as well as heart and lung disorders compared with healthy individuals 30,85 . There is also evidence showing the critical role of the ACE2 and its peptides in the inflammatory 86,87 and oxidative organ activities 88,89 , which are significant triggers in the initiation and progression of cardiovascular disease, cardiac hypertrophy, lung complications, and acute pancreatitis.
We did not find a significant positive association between most of the respiratory diseases (including COPD, Asthma, and lower respiratory infection) and COVID-19 MIR, which is consistent with the Halpin et al. study 4 ,Onder et al. in Italy (Mar 2020) 5 , and the CDC report of health conditions' prevalence in the USA (April 2020) 6 . We only found a positive association between interstitial lung disease and pulmonary sarcoidosis during the 3rd wave. One possible explanation might be that having CLD causes a different immune response, which eventually protects against infection from SARS-CoV2 4 . However, this is not supported by other publications showing a significant association between COPD and an increased COVID-19 MR. Another possibility is that treatments and therapies used by patients with CLD can protect against COVID-19 as well (for instance, topical intranasal sprays 90  www.nature.com/scientificreports/ diagnosis 4 . Notably, the Chinese CDC (http://www.china cdc.cn/en/) has reported a 6.3% COVID-19 case-fatality rate for cases with pre-existing chronic respiratory diseases. Besides heart diseases, we found significant positive associations between COVID-19 MIR and cancer, including mesothelioma (0.58% MIR↑ in the 1st wave) and pancreatic (0.51% MIR↑ in the 1st wave) in the United States. Typically, patients with cancer are known to be at higher risk for community respiratory viruses (such as influenza and coronaviruses) due to their suppressed immune system and poor physiological baseline [93][94][95] . Based on a descriptive study from Wuhan, China (Mar 2020), the incidence of COVID-19 patients with pre-existing cancer was about 1%, which is five times higher than the general cancer incidence in China 64 . In a report of 72 314 cases from the Chinese CDC (Mar 2020), the COVID-19 case fatality for cancer patients was 3.5% higher than those without cancer 96 . In another report from Italy (April 2020), the prevalence of pre-existing cancer among COVID-19 death was 16.5% 5 . Du et al., in a multi-omics study, indicated an indirect connection between the ACE2 pathway and cancer via Transforming Growth Factor Beta 1, TGFB1, association with colorectal cancer 97,98 .
Our findings also indicated that demographics and social factors at the county level, such as mean age, drug use disorders, smokers%, uninsured%, and population density, significantly increased COVID-19 MIR by 0.12%, 0.08%, 0.11%, 0.08%, and 0.0003%, respectively. One possible explanation might be that uninsured patients or patients with drug use disorder, especially in the areas with more health disparities, are less likely to seek medical care 99,100 . Moreover, drug use disorders can result in increased inflammation of multiple organ systems, particularly lungs, which may lead to respiratory failure. In turn, it can directly contribute to the elevated mortality rate of COVID-19 among confirmed cases. Marsden et al. showed that people with opioid use disorder have a higher prevalence of co-occurrence of health problems, subsequently leading to an increased rate of COVID-19 101 . Regarding the effect of population density on disease mortality/spread, one reason could be that the large cities are mostly connected with many other locations 77 ; plus, crowding is conducive to the spread of the disease.
This study has several limitations. First, the mortality and MIR estimates from the current COVID-19 related data are biased since most of the individuals with mild or no symptoms have not been tested for COVID-19 in most of the counties. Moreover, the COVID-19 reporting system appears to differ regionally, which introduces further inaccuracies in the available data. For example, for a small number of counties, we found MIR = 100%, which is an unlikely event and can be due to an incomplete disease recording system. Timely sharing of information and collaboration between organizations and governors can partly solve this problem. There also needs to be additional testing and follow-ups to have higher quality data, especially for younger individuals with mild symptoms. Recent data (CDC Jun 19, 2020 102 ) showed that more young people are testing positive for COVID-19 in the United States. Second, the reporting of disease data is mostly based on ICD9/10 codes, which can be fairly inaccurate 103 . Third, the analysis was based on county-level data. It would be beneficial to analyze individual-level and multi-countries data to gain deeper insights into the impact of risk factors on COVID-19 progression. Fourth, some of the counties, especially in Maine, were excluded from the study because some of the environmental factors such as climate and air pollution were not directly available. Fifth, different testing strategies (especially among health-care workers), re-opening, self-isolation, physical distancing, and mask policies can act as cofounders in the analysis of COVID-19 MIR.
In summary, accounting for heterogeneity in both risk factors and COVID-19 mortality patterns over time leads to a more informative clustering system, which can then be leveraged in managing the epidemic by identifying and informing groups of people at higher risk and also in managing healthcare resources (access to facilities, ICUs, vaccination, etc.) more judiciously. Findings of this study suggest that counties in clusters 1 and 7 (in the 1st wave), cluster 1 (in both 2nd and 3rd waves) experience higher COVID-19 MIR growth trajectories over time and are facing more challenges due to the prevalence of rural counties (60-80%), and different face-covering rules/mandates in managing the disease. Further, heart complications and cancer were statistically significant pre-existing comorbidities related to COVID-19 MIR across the U.S. TB, drug use disorder, HIV/AIDS, diabetes, and hepatitis were explicitly associated with an increased chance of being in a more "vulnerable" cluster.

Data availability
All datasets used in the current study are publicly available (sources are mentioned in Table S1). Datasets generated during the study are available from the corresponding author.