Risk factors and geographic disparities in premature cardiovascular mortality in US counties: a machine learning approach

Disparities in premature cardiovascular mortality (PCVM) have been associated with socioeconomic, behavioral, and environmental risk factors. Understanding the “phenotypes”, or combinations of characteristics associated with the highest risk of PCVM, and the geographic distributions of these phenotypes is critical to targeting PCVM interventions. This study applied the classification and regression tree (CART) to identify county phenotypes of PCVM and geographic information systems to examine the distributions of identified phenotypes. Random forest analysis was applied to evaluate the relative importance of risk factors associated with PCVM. The CART analysis identified seven county phenotypes of PCVM, where high-risk phenotypes were characterized by having greater percentages of people with lower income, higher physical inactivity, and higher food insecurity. These high-risk phenotypes were mostly concentrated in the Black Belt of the American South and the Appalachian region. The random forest analysis identified additional important risk factors associated with PCVM, including broadband access, smoking, receipt of Supplemental Nutrition Assistance Program benefits, and educational attainment. Our study demonstrates the use of machine learning approaches in characterizing community-level phenotypes of PCVM. Interventions to reduce PCVM should be tailored according to these phenotypes in corresponding geographic areas.

www.nature.com/scientificreports/ Identifying not only the clusters of high PCVM but also the clusters of PCVM phenotypes acknowledges the complex relationships among selected drivers of high PCVM and PCVM disparities. This place-specific phenotype approach offers researchers and practitioners a framework for addressing community-level disparities in PCVM.

Methods
Study population. Our study population included individuals aged 15-64 years who died from CVD during the years 2015-2019 in the contiguous United States. PCVM was defined as the number of deaths in persons aged 15-64 years caused by CVD per 100,000 people at the county level, age-adjusted to the 2000 US Standard Population. We only included counties with at least 20 deaths from CVD during the study period to guard against unstable PCVM estimates. Counties from Hawaii and Alaska were excluded from the analysis due to the lack of complete risk factor data.
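As an illustration of the outcome definition above, the following minimal sketch computes a directly age-standardized death rate per 100,000 and applies the 20-death inclusion threshold. The age bands, the equal weights in `STANDARD_WEIGHTS`, and the function names are hypothetical placeholders chosen for this example; they are not the official 2000 US Standard Population weights and not the authors' code.

```python
# Illustrative weights only: NOT the official 2000 US Standard Population.
STANDARD_WEIGHTS = {
    "15-24": 0.2, "25-34": 0.2, "35-44": 0.2, "45-54": 0.2, "55-64": 0.2,
}
MIN_DEATHS = 20  # inclusion threshold stated in the text


def age_adjusted_rate(deaths, population, weights=STANDARD_WEIGHTS):
    """Directly standardized death rate per 100,000 people.

    deaths and population map each age band to a count for one county.
    """
    total_weight = sum(weights.values())
    rate = 0.0
    for age_band, weight in weights.items():
        crude_rate = deaths[age_band] / population[age_band]
        rate += (weight / total_weight) * crude_rate
    return rate * 100_000


def pcvm_or_none(deaths, population):
    """Return the age-adjusted rate, or None for counties below the threshold."""
    if sum(deaths.values()) < MIN_DEATHS:
        return None
    return age_adjusted_rate(deaths, population)
```

With 1, 2, 5, 12, and 30 deaths across the five bands and 10,000 people in each band, the sketch yields a rate of approximately 100 per 100,000; a county with fewer than 20 deaths in total is returned as `None` and would be dropped.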
Data sources. Mortality data were accessed through the multiple cause of death files, maintained by the National Center for Health Statistics via the Centers for Disease Control and Prevention Wide-ranging Online Data for Epidemiologic Research (CDC-WONDER) database 8 . This database contains death certificate data from all fifty states, with cause of death identified by the International Classification of Diseases, 10th Revision (ICD-10) coding schema. Data from CDC-WONDER also include age at death, sex, race, and county of death. If multiple underlying causes of death on the death certificate are noted, a single cause is inserted according to the sequence of conditions on the certificate and contributing causes of death according to prespecified methods 8 . ICD-10 codes for CVD mortality were defined as follows: ischemic heart disease (I20-I25), heart failure (I50), cerebrovascular diseases (I60-I69), and hypertensive heart disease (I10-I15).
County-level risk factor data were harvested from a variety of data sources (Table 1), including County Health Rankings & Roadmaps 9 , Area Health Resources Files 10 , and Environmental Protection Agency's Environmental Justice Screening tool (EJSCREEN) 11 . To best align temporally with the PCVM data, we used risk factor data collected in 2017 (the mid-year of the PCVM data) or the year closest to 2017. We used the 2020 EJSCREEN data (covering the years 2014-2020), and we re-estimated county-level exposures using the method outlined by the EPA EJSCREEN technical documentation guide since EJSCREEN data is natively reported at the census block group level 11 . We also visualized the geographic distributions of all county-level risk factors used in the study (Supplemental Fig. S1).
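The block-group-to-county re-estimation described above can be sketched as follows, assuming it amounts to a population-weighted mean of block-group values; the EJSCREEN technical documentation remains the authority on the exact procedure. The function name and the input tuple layout are hypothetical. The sketch relies on the fact that the first five digits of a 12-digit census block group GEOID are the state and county FIPS codes.

```python
from collections import defaultdict


def county_exposure(block_groups):
    """Aggregate block-group exposures to counties (population-weighted mean).

    block_groups: iterable of (geoid12, population, exposure) tuples, where
    geoid12 is the 12-digit block group GEOID as a string.
    """
    weighted_sum = defaultdict(float)
    population_total = defaultdict(float)
    for geoid, population, exposure in block_groups:
        # Skip records that cannot contribute to a weighted mean.
        if exposure is None or population <= 0:
            continue
        county_fips = geoid[:5]  # state (2 digits) + county (3 digits)
        weighted_sum[county_fips] += population * exposure
        population_total[county_fips] += population
    return {
        fips: weighted_sum[fips] / population_total[fips]
        for fips in population_total
    }


example = [
    ("010010201001", 1000, 8.0),
    ("010010201002", 3000, 10.0),
    ("010030101001", 500, 7.5),
]
print(county_exposure(example))  # {'01001': 9.5, '01003': 7.5}
```

Weighting by population rather than taking a simple mean keeps sparsely populated block groups from dominating the county estimate.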
Because the data were deidentified and no individual-level data were used, institutional review board approval was not required.

Statistical analysis.
We applied CART and random forest machine learning methods and geographic information systems to explore the association between county-level risk factors and PCVM. CART was used to identify phenotypes of PCVM, or combinations of county-level characteristics that were associated with PCVM 12 . We performed additional analyses to examine whether the county-level mortality rates for each subtype of PCVM (i.e., heart failure, hypertension, ischemic heart disease, and stroke) followed a similar pattern after grouping them according to the phenotypes identified by the main model. Finally, we used random forest analysis 13 to examine the relative importance of risk factors in predicting PCVM. We compared the concordance between the CART and random forest models, with a key focus on whether high-importance variables from the random forest models were included in the phenotypes identified by the CART analysis.
CART uses conditional inference to recursively partition data into smaller, more homogeneous groups characterized by combinations of predictors 14,15 . At each split, the data are divided into two groups by an algorithm-selected variable and a threshold value that maximizes the difference between the split groups. The splitting procedure repeats recursively for each split group until user-defined stopping criteria are met. We set the following stopping criteria: a maximum tree depth of six splits, a minimum number of 200 counties in a terminal node, and a statistical significance threshold for variable splits (α < 0.05) using the Pearson correlation test. Each terminal node of the tree consists of a group of counties with similar levels of PCVM. The combination of characteristics associated with a terminal node represents a phenotype of PCVM. We then used geographic information systems to visualize the distribution of the identified phenotypes.
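To make the splitting step concrete, the sketch below searches for the single variable and threshold that best separate a set of counties while respecting a minimum node size. It deliberately swaps in a simpler criterion, the reduction in the sum of squared errors used by classic regression trees, in place of the conditional inference test applied by "partykit"-ctree, so it illustrates the idea rather than reproducing the authors' model. All names and the toy data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (node impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, outcome, predictors, min_node_size):
    """Return (predictor, threshold, impurity_reduction), or None.

    rows is a list of dicts. Counties with value <= threshold go to the
    left child; the rest go to the right child.
    """
    parent_sse = sse([r[outcome] for r in rows])
    best = None
    for name in predictors:
        ordered = sorted(rows, key=lambda r: r[name])
        # Only consider cut points that leave enough counties on each side.
        for i in range(min_node_size, len(ordered) - min_node_size + 1):
            if ordered[i - 1][name] == ordered[i][name]:
                continue  # tied values cannot be separated by a threshold
            left = [r[outcome] for r in ordered[:i]]
            right = [r[outcome] for r in ordered[i:]]
            gain = parent_sse - sse(left) - sse(right)
            if best is None or gain > best[2]:
                best = (name, ordered[i - 1][name], gain)
    return best


# Toy counties (made-up numbers): PCVM jumps once poverty exceeds 28%.
toy = [
    {"poverty": 20, "inactivity": 25, "pcvm": 40},
    {"poverty": 25, "inactivity": 18, "pcvm": 42},
    {"poverty": 28, "inactivity": 30, "pcvm": 45},
    {"poverty": 35, "inactivity": 20, "pcvm": 80},
    {"poverty": 40, "inactivity": 27, "pcvm": 85},
    {"poverty": 45, "inactivity": 22, "pcvm": 90},
]
split = best_split(toy, "pcvm", ["poverty", "inactivity"], min_node_size=2)
```

On the toy data the search selects poverty with a threshold of 28. Applying the same search recursively to each child, until the depth, node-size, or significance criteria stop it, would produce the full tree.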
The CART models were established using a randomly sampled training set (consisting of 80% of all counties) and the results were validated against the test set (consisting of the remaining 20% of the counties). To validate the results and the reproducibility of the CART model, we performed sensitivity analyses using three additional random samples as the training set and compared the results with the main model. We also conducted a sensitivity analysis of the CART approach with a different minimum number of counties (100) in a terminal node.
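The training and test partition, together with the additional random samples used for the sensitivity analyses, can be sketched as below. The seeds, the function name, and the rounding rule for the training-set size are assumptions made for illustration; the text does not state how the authors drew their samples.

```python
import random


def split_counties(county_ids, train_fraction=0.8, seed=0):
    """Randomly partition county identifiers into training and test lists."""
    rng = random.Random(seed)  # local generator keeps the split reproducible
    shuffled = list(county_ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers standing in for county FIPS codes.
county_ids = [f"{i:05d}" for i in range(100)]

main_train, main_test = split_counties(county_ids, seed=0)
# Three further random samples, mirroring the sensitivity analyses.
sensitivity_splits = [split_counties(county_ids, seed=s) for s in (1, 2, 3)]
```

Fixing the seed for each sample makes it possible to refit the tree on exactly the same counties when comparing the main model with the sensitivity models.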
In contrast to CART which relies on only one tree, random forest creates and aggregates an ensemble of trees using random variable selection and bootstrap sampling 13 . It then takes an average of the outputs of these trees as a prediction. Next, the mean decrease in node impurity is used to calculate variables' relative importance in predicting the outcome. We created 20,000 trees incorporating all risk factors as predictors. The number of variables randomly sampled as candidates at each tree split was set to 5. SAS v9.4 was used for data management activities. R v3.6.1 was used for the machine learning analyses (packages "partykit"-ctree for CART and "randomForest" for random forest). Python 3.10.6 (packages "geopandas" and "matplotlib") was used for maps in Figs. 2, S1. ArcGIS Pro v2.7.0 was used for maps in Fig. 3.
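The random forest ingredients described above (bootstrap sampling, random variable selection at each split, and importance measured as the mean decrease in node impurity) can be illustrated with a heavily simplified sketch. Each "tree" here is a single split (a stump), prediction by averaging is omitted, and the defaults of 200 trees and 2 candidate variables are much smaller than the 20,000 trees and 5 candidates reported in the text. It is a teaching aid under those assumptions, not the "randomForest" implementation, and all names and toy data are hypothetical.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (node impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_stump(rows, outcome, candidates):
    """Best single split among the candidate predictors: (name, threshold, gain)."""
    parent_sse = sse([r[outcome] for r in rows])
    best = None
    for name in candidates:
        ordered = sorted(rows, key=lambda r: r[name])
        for i in range(1, len(ordered)):
            if ordered[i - 1][name] == ordered[i][name]:
                continue  # tied values cannot be separated
            left = [r[outcome] for r in ordered[:i]]
            right = [r[outcome] for r in ordered[i:]]
            gain = parent_sse - sse(left) - sse(right)
            if best is None or gain > best[2]:
                best = (name, ordered[i - 1][name], gain)
    return best


def forest_importance(rows, outcome, predictors, n_trees=200, mtry=2, seed=0):
    """Mean decrease in node impurity per predictor over an ensemble of stumps."""
    rng = random.Random(seed)
    importance = {name: 0.0 for name in predictors}
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # bootstrap sample
        candidates = rng.sample(predictors, mtry)  # random variable selection
        stump = best_stump(sample, outcome, candidates)
        if stump is not None:
            name, _, gain = stump
            importance[name] += gain / len(sample)
    return {name: total / n_trees for name, total in importance.items()}


# Toy data (made-up): the outcome depends on income only.
toy = [
    {"income": i, "noise_a": (i * 7) % 5, "noise_b": (i * 3) % 4,
     "pcvm": 100 - 3 * i}
    for i in range(20)
]
scores = forest_importance(toy, "pcvm", ["income", "noise_a", "noise_b"], seed=1)
```

On the toy data, income receives a larger score than either noise variable, which is the behavior a variable importance plot summarizes. Because every predictor gets a chance to be a split candidate across the ensemble, variables that a single tree never selects can still receive a score.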

Results
The study included 2509 counties, representing a total of 604,810 deaths from PCVM. There were 2008 and 501 randomly sampled counties in the training set and the test set, respectively. The baseline county characteristics were similar between the training and test sets as shown in Supplemental Table S1. The CART analysis identified seven phenotypes (A to G, in ascending order of the median PCVM) using the training dataset (n = 2008) (Fig. 1). The algorithm selected five variables from all candidate predictors serving as the six splitting nodes in the outcome tree, with under 200% of poverty at the top of the tree followed sequentially by physical inactivity, median household income, food insecurity, physical inactivity, and excessive drinking. All splits were statistically significant (p < 0.001). Applying the CART model to the test dataset showed no substantial differences in the PCVM distributions versus the training dataset ( Supplementary Fig. S2). We summarized the statistics, characteristics, as well as geographic distribution of the identified phenotypes in Fig. 2, including counties in both training and test sets.
On the right side of the tree (Fig. 1), phenotype G (Impoverished) had the highest median PCVM (96.6) among all phenotypes, consisting of counties with more people (aged 18-64) under 200% of the federal poverty level (> 33.7%) and a lower median household income (≤ $39,898). Compared to phenotype G counties, counties of both phenotypes D (Middle Class-Active) and F (Middle Class-Inactive) had a lower median PCVM. Phenotype F counties differentiated from those of Phenotype D by having more people who were physically inactive.
On the left side of the tree (Fig. 1), all counties had fewer people (aged 18-64) under 200% of the federal poverty level and generally had lower rates of PCVM (except for phenotype E counties). Phenotype A (Affluent-Active), with a lower physical inactivity rate (≤ 21.4%), had the lowest median PCVM (34.2), about a third of the median PCVM for phenotype G (96.6). With more people who were physically inactive, phenotypes B (Affluent-Inactive-Food Secure), C (Affluent-Inactive-Food Insecure-Excessive Drinking), and E (Affluent-Inactive-Food Insecure-No Excessive Drinking) also had a higher median PCVM compared to phenotype A. Food insecurity further distinguished phenotype B from C and E, where phenotype B had fewer people who lack adequate access to food (≤ 11.2%) and had about 9 to 16 fewer deaths from CVD per 100,000 people compared to phenotypes C and E. Excessive drinking further separated phenotypes C and E, where phenotype C had more adults reporting binge or heavy drinking and a slightly lower median PCVM compared to phenotype E (53.1 vs. 60.2).
We calculated the county-level PCVM rates of each CVD subtype and grouped counties according to the phenotypes identified by the main model. Supplementary Fig. S3 shows that, for each subtype of PCVM, the median rates of PCVM grouped by phenotype were in ascending order from phenotype A to G, which is consistent with the main model.
The results of the sensitivity analysis of the CART model using three additional samples of counties as the training set are shown in Supplementary Fig. S4. We noticed that the top three nodes (under 200% poverty, physical inactivity, and median household income) in the additional models were the same as in the main model in Fig. 1, despite their splitting values being slightly different. In all four models, physical inactivity was the next splitting node after median household income. Food insecurity and excessive drinking, two variables that were present in the main model, appeared once and not at all, respectively, in the additional models. Poverty, a variable not present in the main model, was present in all additional models. The results of the sensitivity analysis suggest that CART was relatively stable to changes in data structure, especially for the top splitting variables.
The sensitivity analysis of the CART model with a minimum number of 100 counties in a terminal node included more splitting nodes as well as more phenotypes in the model output (Supplementary Fig. S5), suggesting that additional risk factors were significantly associated with county-level PCVM in different subgroups of the population. These additional splitting variables included broadband access, uninsured (age 18-64), smoking, and receipt of Supplemental Nutrition Assistance Program (SNAP) benefits. Supplementary Fig. S6 illustrates the CART model applied to the test dataset, which revealed no significant differences compared to the model derived from the training dataset.
Figure 3A,B present the geographic distributions of the county-level PCVM and the phenotypes (for counties in both the training and test sets) from the main model. We observed that counties with high PCVM were mostly in the Southern US. Most of these counties corresponded to the highest-risk phenotypes G (Impoverished) and F (Middle Class-Inactive), which were mostly distributed across the American South and the Appalachian region, especially in Kentucky, West Virginia, Mississippi, Arkansas, southern Alabama, southern Georgia, southern Missouri, and New Mexico for phenotype G. In contrast, many populous coastal counties in the Northeast and the West were of phenotype A (Affluent-Active), the lowest-risk phenotype. Counties of phenotype B (Affluent-Inactive-Food Secure), the second lowest risk phenotype, were mostly found in the Northeast and the Midwest. A large proportion of counties of phenotype C (Affluent-Inactive-Food Insecure-Excessive Drinking) were found in rural New York and Pennsylvania, as well as in many counties in the Midwest, West, and the state of Texas. Many counties of phenotype D (Middle Class-Active), the median-risk phenotype, were in
Fig. 4 suggested that variables that appeared in the CART output were also among the top-ranking variables in the random forest analysis.
Notably, median household income, under 200% poverty, and food insecurity were the top three important variables in the random forest plot. Other high-importance variables included broadband access, smoking, and receipt of SNAP benefits, which also appeared in the output of the CART analysis with a minimum number of 100 counties in terminal nodes (Supplementary Fig. S5). Excessive drinking, high school degree, and physical inactivity were ranked 8th to 10th in the variable importance plot.

Discussion
Our study identified county phenotypes of PCVM and examined their geographic distributions using machine learning approaches and geographic information systems. We found an approximately threefold difference in PCVM between the highest-risk phenotype in the American South, an area termed the stroke belt due to high rates of stroke 16 , and the lowest-risk phenotype in the coastal areas in the Northeast and the West. Our findings suggest that counties of the highest-PCVM-risk phenotype were highly impoverished. The association between poverty and PCVM has been identified by numerous studies 1-3 . Our study further affirms that income/poverty was the most important predictor of PCVM among various other risk factors related to environmental exposure, health status, health behaviors, and other aspects of socioeconomic status. Previous studies also suggest that physical inactivity was a strong risk factor for PCVM 1-3 . Our study additionally demonstrated that physical inactivity may be more important in predicting PCVM among counties with higher income than those with lower income (physical inactivity was a splitting node in the lower poverty group and in the higher median household income group in Fig. 1). Similarly, food insecurity, an indicator of dietary behavior and socioeconomic status, may have a stronger association with PCVM among counties with higher physical inactivity (i.e., phenotypes A vs. B). These findings suggest that there may be effect measure modifications between risk factors and their association with PCVM, as may be the case between poverty and physical inactivity, or between physical inactivity and food insecurity. Notably, counties of the impoverished phenotype (G) and the Middle Class-Inactive phenotype (F), the two highest-risk phenotypes, were mostly located in the American South and the Appalachian region.
The concentration of these two phenotypes in the same geographic area provides an opportunity to study in greater detail the interaction between poverty and physical inactivity in the causal pathway to PCVM.
Our study included multiple environmental risk factors in the models. Environmental exposures, especially air pollution, have been mechanistically and epidemiologically linked with disproportionate cardiometabolic outcomes 17-19 . However, none of the environmental factors appeared in the CART output, nor were they among the top ten variables in the random forest plot. On the other hand, multiple studies have demonstrated remarkable overlap between several environmental exposures and socioeconomic factors 20 , with significant effect interactions between factors such as air pollution and social vulnerability 7 . One reason behind this discordance is that individuals within counties may have been disproportionately exposed to pollutants, and it is difficult to evaluate which groups of individuals were exposed to the pollutants, and to what extent, using data from the current study. Future studies should focus on associations between environmental factors and PCVM at a finer geographic scale.
We also note that risk factors not present in the CART output, such as broadband access, smoking, receipt of SNAP benefits, and high school education, may still be highly associated with PCVM, as suggested by the random forest variable importance plot.
There are several methodological advantages that lend confidence to our study. First, unlike traditional statistical methods (such as regression analysis), CART and random forest machine learning methods can handle a large number of highly correlated variables simultaneously without concerns about multicollinearity due to their variable selection and bootstrap sampling strategies. Second, CART makes it possible to visualize and conceptualize phenotypes, while random forest complements CART in risk factor importance evaluation and model stability. Specifically, CART selects variables and presents "pathways" for each observation towards its "destination", where the characteristics along the "pathways" can be used to determine phenotypes associated with PCVM. On the other hand, random forest evaluates all risk factors on their relative importance, including those not selected by CART. Additionally, the variable importance plot of random forest is less sensitive to changes in the data (such as using different years of data) compared to the result of the single-tree CART algorithm.
The above advantages of using CART and random forest methods, together with geographic information systems, have been demonstrated in prior studies investigating the phenotypes of late-stage breast cancer diagnosis 21 and cancer mortality 22 . This study further demonstrates the validity of this approach in uncovering the combination of risk factors and their relative importance in predicting county-level PCVM.

Limitations.
The findings of our study should be interpreted within the context of its limitations. First, the accuracy of diagnostic codes from death certificates cannot be ascertained, and there might be additional exposures and proximal contributors to mortality that we were not able to capture. Second, the data collection period for the risk factors did not perfectly match that for the PCVM data, which may be problematic if there is a temporal lag in the effect of risk factors on PCVM. Future studies should explore temporal associations between risk factors and PCVM. Third, data for many risk factors were collected from self-reported surveys based on a sample of the population, where the quality of reporting, response rates, and selection bias may impact the accuracy of the measures. Fourth, to ensure statistical stability, our analyses excluded counties with fewer than 20 deaths caused by CVD, which might have led to under-representation of less populated areas, especially in many states in the West and Midwest. Future studies should consider regionalization methods, such as the Max-P-regions model 23 , to combine counties with small numbers of cases. Finally, counties are relatively large geographic units with seemingly heterogeneous populations and exposures. Whether the associations discovered in the current study are also present at smaller geographic scales (e.g., census tracts or block groups) or at the individual level with long-term cardiovascular outcomes remains to be elucidated. Nevertheless, this proof-of-concept study provides a platform for characterizing the relationships between community-level risk factors and health outcomes.

Conclusion
The use of CART and random forest machine learning methods and geographic information systems can help uncover risk factor associations in predicting PCVM. Interventions to reduce PCVM should be tailored to, and targeted at, geographic areas with high-risk phenotypes of PCVM.

Data availability
The data sets generated during this study are available from the corresponding author upon reasonable request.