Main

Diabetes, dementia, cardiovascular disease (CVD) and chronic kidney disease (CKD) are leading causes of death in the United States, in other high-income nations and, increasingly, in low-income and middle-income countries1,2. Obesity, short stature, high blood pressure, high heart rate, hyperglycemia, non-optimal lipid profiles and poor kidney function are established risk factors for one or more of these diseases3,4,5,6,7,8,9,10,11,12,13,14,15 and, in some cases, for infections such as coronavirus disease 2019 (ref. 16). As a result, people who have optimal levels of all or most risk factors are at low risk of cardiovascular and renal disease and cancer and vice versa17,18,19,20,21. Physiological risk factors can have complex correlations and co-occurrence patterns for at least two reasons. First, these physiological factors have shared as well as distinct genetic, behavioral, environmental and dietary determinants. For example, consumption of fruits and vegetables, meat, dairy, unsaturated versus saturated fats, processed versus whole grain carbohydrates and alcohol affects multiple cardiometabolic and renal traits beneficially or adversely, whereas other dietary components, such as sodium and potassium, affect only one or two traits (blood pressure and kidney function)22,23,24,25,26,27,28,29. Furthermore, these factors may cluster differently among different subgroups of a population30 and change over time31. Second, some of these physiological risk factors are themselves etiologically related; for example, obesity is a risk factor for dyslipidemia, elevated blood pressure and hyperglycemia32,33.

At the population level, some studies have quantified trends in individual cardiometabolic risk factors in the US population, other countries or globally34,35,36,37,38,39,40,41,42,43. Other studies have counted the number of cardiometabolic risk factors44,45, with some also quantifying association with the risk of coronary heart disease45. Some studies have used concepts such as metabolic syndrome46, optimal cardiometabolic health44 and metabolically healthy obesity47,48,49 to identify groups of people with a specific pre-determined risk factor profile. Studies that used data-driven methods to identify cardiometabolic phenotypes were mostly based on data from specific subgroups of a population (for example, older adults)50, users of specific health programs51 or people with a specific index disease, such as diabetes52,53,54, sepsis55 or cardiogenic shock56. The only study analyzing health-related phenotypes in an entire national population57 used a mix of behavioral, physiological and diagnostic variables at a single point in time for methodological assessment; it did not analyze change over time or the clinical or epidemiological characteristics of the clusters. Beyond cardiometabolic and renal health, some studies identified co-occurrences, or subtypes, of specific diseases in large cohorts, such as the UK Biobank58, in primary care patients from different countries59,60, especially using electronic health records61,62,63,64,65,66,67. These studies used a range of clustering methods66,68.

In the present study, we applied a data-driven approach to repeated nationally representative health examination surveys, namely the National Health and Nutrition Examination Survey (NHANES), from 1988 to 2018, to identify a comprehensive set of cardiometabolic and renal phenotypes in the United States adult population. We measured how the prevalence of these phenotypes has changed over time and characterized their sociodemographic, epidemiological and clinical predictors. This information is needed for planning and priority setting for population-based prevention programs and health system interventions to coherently and effectively prevent and manage conditions based on their co-occurrence in the population69,70.

Cardiometabolic and renal phenotypes of the US population

We identified 10 clusters (phenotypes) for both men and women that collectively characterized the cardiometabolic and renal traits of the US population from 1988 to 2018 (Fig. 1). The reasons for using 10 clusters are stated in the Methods, and the results with other cluster numbers are presented below. The identified phenotypes were similar between men and women, even though we analyzed data for the two sexes separately.

Fig. 1: Risk factor profiles of the cardiometabolic and renal clusters of US adults for women and men.

Each panel corresponds to a cluster; each bar shows the median value of one risk factor for all participants in the cluster. The number next to the cluster name represents the percentage of the participants grouped in this cluster. The concentric circles show the minimum, 25th, 50th and 75th percentiles and maximum in the whole sample, with the median shown in darker color. The height and color of the bar represent the median level of each risk factor, positioned relative to the distribution in the whole population, so that the scale is common across all clusters. The bottom-right panel shows the median value for each risk factor in the whole sample, and Supplementary Table 1 shows other percentiles for each risk factor in each cluster. The scale is reversed for height, eGFR and HDL because lower values indicate higher risk.
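One way to read the construction described in this caption is sketched below: compute the median of a risk factor within each cluster, then place that median on the whole-sample distribution so that all clusters share one scale. This is a minimal illustration, not the authors' plotting code. The function names, the use of a percentile rank as the common scale and the toy values are all assumptions.

```python
from statistics import median

def percentile_rank(value, sample):
    """Share of the whole sample at or below `value` (between 0 and 1)."""
    return sum(x <= value for x in sample) / len(sample)

def cluster_profile(values, labels, reverse=False):
    """Median of one risk factor per cluster, placed on the whole-sample scale.

    values  : measurements of a single risk factor, one per participant
    labels  : cluster label of each participant (same order as values)
    reverse : True for factors where lower values indicate higher risk
              (height, eGFR and HDL in the caption)
    """
    profile = {}
    for label in sorted(set(labels)):
        members = [v for v, l in zip(values, labels) if l == label]
        med = median(members)
        rank = percentile_rank(med, values)
        profile[label] = (med, 1 - rank if reverse else rank)
    return profile

# Toy data (hypothetical): SBP values in mmHg for two made-up clusters
sbp = [110, 115, 120, 125, 150, 160, 165, 170]
cluster = ["a", "a", "a", "a", "b", "b", "b", "b"]
profile = cluster_profile(sbp, cluster)
print(profile)
```

Setting `reverse=True` flips the scale for factors where a lower value means higher risk, which mirrors the reversed axes noted in the caption.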

For both sexes, we identified a ‘low risk’ phenotype with near-optimal risk factor levels, accounting for 15% and 13% of the sample for women and men, respectively. We also identified two clusters (‘mid risk short’ and ‘mid risk tall’) jointly accounting for 25% and 28% of the sample for women and men, respectively, with risk factor levels mostly around sample medians. These two clusters differed by their average height and, to a lesser extent, by blood pressure and estimated glomerular filtration rate (eGFR) levels, with the ‘mid risk short’ cluster having, on average, shorter height (median of 155 cm versus 167 cm for women; 168 cm versus 182 cm for men) (Supplementary Table 1), lower blood pressure and higher eGFR than the ‘mid risk tall’ cluster. We also identified a group (‘low BMI, high HDL’) characterized by low levels of body mass index (BMI) and waist-to-height ratio (WHtR) and high high-density lipoprotein (HDL) cholesterol relative to the rest of the NHANES sample but with other risk factors being around the sample median.

Five clusters were characterized by having high levels of one or two related risk factors accounting together for 40% of the sample for both sexes. These were ‘high cholesterol’, ‘high blood pressure’, ‘severe hyperglycemia’, ‘high heart rate’ and ‘severe obesity’. For instance, the ‘severe hyperglycemia’ phenotype had a median glycated hemoglobin (HbA1c) of 9.9% for women and 9.8% for men, but their median BMI (and WHtR) was much lower than those of the ‘severe obesity’ cluster (median BMI of 31.8 kg m−2 and 29.7 kg m−2 in the ‘severe hyperglycemia’ cluster for women and men, respectively, compared to a median BMI of 41.1 kg m−2 and 38.2 kg m−2 in the ‘severe obesity’ cluster). Similarly, the ‘high blood pressure’ cluster had a median systolic blood pressure (SBP) of 159 mmHg for both sexes, and the ‘high cholesterol’ cluster had a median non-HDL cholesterol of 5.5 mmol L−1 for both women and men, with other risk factor levels lying between the median and 75th percentiles of the entire NHANES sample. In all these clusters, the defining risk factor varied less among member participants than the other risk factors (Extended Data Fig. 1), further illustrating that its high value was the shared feature among participants who fell in the cluster. Finally, in both sexes, the last cluster (‘low DBP, low eGFR’) was characterized by low levels of diastolic blood pressure (DBP) and eGFR. For example, women who fell in the ‘low DBP, low eGFR’ cluster had a median DBP of 61 mmHg and a median eGFR of 63 ml/min/1.73 m2.

Demographic and clinical characteristics of clusters

Most of the identified cardiometabolic and renal phenotypes had a mix of young (20–39 years), middle-aged (40–59 years) and old (60 years and older) adults. The exceptions were two clusters for men and three for women with predominantly young people (‘low risk’ and ‘mid risk short’ for both sexes and ‘high heart rate’ for women) and one with predominantly old people (‘low DBP, low eGFR’) (Table 1). Even though 73% of women and 77% of men in the ‘low risk’ phenotype were aged 20–39 years, 4% and 6%, respectively, were older than 60 years with near-optimal risk factor profiles similar to their younger peers, except for slightly lower eGFR and higher HbA1c. Similarly, although most (92% of women and 90% of men) in the cluster ‘low DBP, low eGFR’ were 60 years or older, a small percentage (1% and 2%, respectively) were aged 20–39 years. Within each cluster, individuals of different age groups generally had similar risk factor profiles, especially on the defining risk factors in the higher risk phenotypes (Extended Data Fig. 2).

Table 1 Demographic characteristics and medication use of cardiometabolic and renal clusters of US adults

The ‘low risk’ group had the lowest number of morbidities and medication use (Table 1 and Extended Data Table 1). As expected, 96% of women and 98% of men in the ‘high blood pressure’ cluster had hypertension, yet this condition was also prevalent in ≥50% of participants in some other clusters—for example, ‘low DBP, low eGFR’ and ‘severe hyperglycemia’ for both sexes and ‘severe obesity’ phenotype for men (most of those with hypertension in the ‘low DBP, low eGFR’ cluster had isolated systolic hypertension). Similarly, all participants in the ‘severe hyperglycemia’ cluster had diabetes; the next highest diabetes prevalence was in the ‘low DBP, low eGFR’ cluster (31% in both sexes), with the ‘severe obesity’ cluster having only the third highest prevalence (22% in women and 25% in men). Median HbA1c of people with diabetes in the ‘severe obesity’ cluster (6.88% for men and 6.77% for women) was much lower than median HbA1c of those in the ‘severe hyperglycemia’ cluster (9.9% for women and 9.8% for men). Finally, those in the ‘low DBP, low eGFR’ phenotype more frequently had a history of myocardial infarction (MI), stroke and congestive heart failure (CHF) than the other phenotypes—for example, 19% of men in this phenotype had a history of MI compared to 6% in the whole sample; similarly, 12% of men in this phenotype had a previous history of CHF compared to 4% in the whole sample.

The use of statins was relatively low in the ‘high cholesterol’ group—13% for women and 8% for men—with that of men being lower than the overall NHANES sample (Table 1). In contrast, statin and antihypertensive use was high in the ‘low DBP, low eGFR’ and ‘severe hyperglycemia’ groups (26–41% of participants in different cluster–sex combinations, which is 2–3 times more than in the overall samples), consistent with the clinical guidelines that recommend the use of these medicines among people with diabetes and history of MI and stroke, especially in older ages. In the ‘severe obesity’ cluster, antihypertensive and statin use was above average, which may partly account for this group having blood pressure and cholesterol levels around the population median. The use of most medicines was higher in the 2011–2018 period than over the entire analysis period, with the largest increase being that of statins (Extended Data Table 2). The increase in statin use was, however, less pronounced in the ‘high cholesterol’ phenotype (+38% relative increase for women and +4% for men) than in the whole sample (+48% for women and +45% for men), demonstrating that this phenotype was characterized by insufficiently treated or controlled levels of non-HDL cholesterol.

Trends over time

The cardiometabolic and renal risk profile of the US population changed from 1988 to 2018 (Fig. 2). The age-standardized prevalence of the ‘severe obesity’ phenotype more than tripled for both sexes and that of the ‘low DBP, low eGFR’ phenotype almost doubled over the entire analysis period. Most of the increase of the ‘low DBP, low eGFR’ phenotype occurred between 2000 and 2010, before plateauing after 2010 (P value for trend from 2010 to 2018 was 0.96 for women and 0.97 for men; Extended Data Table 3). In contrast, the prevalence of the ‘high blood pressure’ and ‘high cholesterol’ phenotypes more than halved in both sexes (P value for trend was <0.0001 for both sexes over the entire analysis period). However, since the late 2000s, there has been a reversal of the earlier declines in the prevalence of the ‘high blood pressure’ phenotype (P value for increasing trend from 2010 to 2018 was 0.0015 for women and 0.0346 for men). There was no statistically detectable change in the ‘severe hyperglycemia’ phenotype (P = 0.09 for women and 0.79 for men), which indicates that, despite the increase in the prevalence of diabetes in the United States, those at extreme values of HbA1c were stable. Rather, many of the additional people with diabetes fell in the ‘severe obesity’ and ‘low DBP, low eGFR’ clusters for which the prevalence increased over time. Most trends were consistent between the two sexes. A notable exception was the ‘low risk’ phenotype, which remained constant for men but decreased by 4.5 percentage points for women (P value for trend was 0.0006 over the entire analysis period), even though its prevalence remained higher in women than men throughout the analysis period. Trends in crude prevalence were nearly identical to the age-standardized trends (Extended Data Fig. 3).

Fig. 2: Trends in cardiometabolic and renal clusters from 1988 to 2018.

The P values for trends were obtained from a two-sided t-test in a logistic regression using the cluster assignment of individual participants, with adjustment for age as described in the Methods. No adjustments were made for multiple comparisons. The figure shows age-standardized prevalence for all clusters (bar charts) as well as individual clusters (lines). See Extended Data Fig. 3 for trends in crude prevalence. See Extended Data Table 3 for trends in pre-specified periods.
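The idea of an age-standardized prevalence can be illustrated with direct standardization, in which age-specific prevalences are averaged using the age shares of a fixed standard population. The sketch below is an assumption about the general technique, not a description of the procedure in the Methods. The standard population shares and the prevalence values are hypothetical.

```python
def age_standardized_prevalence(prevalence_by_age, standard_weights):
    """Direct standardization: weighted mean of age-specific prevalences."""
    total = sum(standard_weights.values())
    return sum(prevalence_by_age[group] * weight
               for group, weight in standard_weights.items()) / total

# Hypothetical standard population shares for the three age bands used in the text
standard = {"20-39": 0.40, "40-59": 0.35, "60+": 0.25}

# Hypothetical age-specific prevalence of one cluster in two survey rounds
early = {"20-39": 0.02, "40-59": 0.10, "60+": 0.30}
late = {"20-39": 0.01, "40-59": 0.04, "60+": 0.12}

early_std = age_standardized_prevalence(early, standard)
late_std = age_standardized_prevalence(late, standard)
print(round(early_std, 3), round(late_std, 3))
```

Because both rounds are weighted by the same standard population, a difference between the two results reflects changes in age-specific prevalence rather than population aging.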

Changes in age patterns of clusters

The various cardiometabolic and renal phenotypes had differing age associations (Fig. 3). The ‘low risk’ and ‘mid risk short’ phenotypes for both sexes, and the ‘high heart rate’ phenotype for women, were more common among younger adults, and their prevalence decreased with age, with a much steeper age association for the ‘low risk’ group. Conversely, the ‘low DBP, low eGFR’ and ‘high blood pressure’ phenotypes became more prevalent throughout the life course, with a steeper age association for the ‘low DBP, low eGFR’ group. Other phenotypes tended to peak in middle ages.

Fig. 3: Age patterns of cardiometabolic and renal clusters.

Each point represents the prevalence of a cluster for an age group from a survey mid-year. The color of the point represents the year of the survey. The lines represent the fitted local polynomial regression for each survey round.

Both ‘high blood pressure’ and ‘high cholesterol’ phenotypes decreased sharply in people aged 50 years and older from 1991 to 2008, likely due to the increased use of statins and antihypertensive medication; however, the decreases may have slowed down or stagnated in the past decade. In contrast, for both sexes, the age association of the ‘low DBP, low eGFR’ phenotype became steeper over time.

Predictors of cardiometabolic and renal traits

We analyzed the sociodemographic, behavioral and clinical predictors of cluster membership in multivariable logistic regressions as described in the Methods. Both education and ethnicity were associated with the partition of the participants into some of the cardiometabolic and renal phenotypes. Higher education was associated with lower odds of allocation to the ‘high cholesterol’ phenotype for both men and women, lower odds of allocation to the ‘severe hyperglycemia’ phenotype for men and lower odds of allocation to the ‘low DBP, low eGFR’ phenotype for women; it was associated with higher odds of being in the ‘low risk’ phenotype for women (Figs. 4 and 5). Hispanic and non-Hispanic Black women and men had higher odds of belonging to the ‘severe hyperglycemia’ and ‘high blood pressure’ phenotypes than non-Hispanic Whites; Hispanic and non-Hispanic Black women had lower odds of belonging to the ‘low risk’ phenotype than non-Hispanic Whites; and non-Hispanic Black men and women had lower odds of belonging to the ‘high cholesterol’ phenotype.

Fig. 4: Predictors of the allocation to cardiometabolic and renal phenotypes in women.

Each point shows one predictor used in the multivariable logistic regressions, as described in the Methods, with its position indicating its coefficient and P value obtained from a two-sided t-test and not adjusted for multiple comparisons. Predictors with P < 0.05 are labeled. The reference categories were: 20–25-year-old individuals for age group, non-Hispanic White for ethnicity, below high school for education and never-smokers for smoking. The year coefficient represents changes in odds per decade.

Fig. 5: Predictors of the allocation to cardiometabolic and renal phenotypes in men.

Each point shows one predictor used in the multivariable logistic regressions, as described in the Methods, with its position indicating its coefficient and P value obtained from a two-sided t-test and not adjusted for multiple comparisons. Predictors with P < 0.05 are labeled. The reference categories were: 20–25-year-old individuals for age group, non-Hispanic White for ethnicity, below high school for education and never-smokers for smoking. The year coefficient represents changes in odds per decade.

The use of statins was associated with lower odds of belonging to the ‘high cholesterol’ phenotype for both men and women, demonstrating its effectiveness in controlling hypercholesterolemia. In contrast, diabetes medications, both oral and insulin, were associated with the ‘severe hyperglycemia’ phenotype in both sexes, as were antihypertensive medications for the ‘high blood pressure’ phenotype, albeit with a smaller magnitude than the former association. This shows that many individuals in these two phenotypes have uncontrolled diabetes or hypertension despite being treated41. Individuals on antihypertensive medicines also had higher odds of belonging to the ‘severe obesity’ phenotype, which provides one explanation for this group having a blood pressure level around the population median, despite the association between obesity and hypertension33. We also found that previous history of MI (both sexes) as well as previous history of CHF (women) were associated with the ‘low DBP, low eGFR’ phenotype even after adjusting for age and other predictors.

Influence of the number of clusters

As described in the Methods, although our main results are based on 10 clusters, we also investigated cluster membership and characteristics when sequentially changing the number of clusters (k) from 5 to 12. Even with five clusters (k = 5), four epidemiologically relevant cardiometabolic and renal phenotypes were identified (‘low risk’, ‘severe hyperglycemia’, ‘high blood pressure’ and ‘severe obesity’), along with a ‘mid risk’ cluster that captured all other participants (Fig. 6 and Supplementary Fig. 1). As the number of clusters increased, more refined and specific groups were identified as subsets of one or more of the existing clusters. For instance, the ‘high cholesterol’ cluster appeared at k = 7 for women, with participants coming from the clusters of ‘high blood pressure’ and ‘mid risk’ at k = 6. Similarly, the ‘mid risk’ group for men at k = 7 split into ‘mid risk tall’ and ‘mid risk short’ at k = 8. For both sexes, the ‘severe hyperglycemia’ cluster appeared at k = 5 and remained relatively unchanged as k increased, as did the ‘low DBP, low eGFR’ cluster after k = 6.

Fig. 6: Changes in cardiometabolic and renal clusters in relation to the number of clusters.

Each segment corresponds to a phenotype identified for a specific number of clusters (k), as k changes from 5 to 12. For each k, the vertical height shows the cluster prevalence, and the clusters were named based on their risk factor levels, as seen in Supplementary Fig. 1. The flow between segments indicates how clusters partition and merge as the number of clusters changes.
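The flows in such a diagram can be derived by cross-tabulating each participant's cluster label at one value of k against the label at the next value. The sketch below assumes that the two label lists are already available from separate clustering runs. The function names and the toy labels are hypothetical, and the clustering step itself is not shown.

```python
from collections import Counter

def cluster_flow(labels_k, labels_k_plus_1):
    """Count participants moving from each cluster at k to each cluster at k + 1."""
    return Counter(zip(labels_k, labels_k_plus_1))

def flow_shares(flow):
    """Express each flow as a share of its source cluster."""
    totals = Counter()
    for (source, _), count in flow.items():
        totals[source] += count
    return {(source, target): count / totals[source]
            for (source, target), count in flow.items()}

# Toy labels (hypothetical) for six participants at k = 7 and k = 8
at_k7 = ["mid risk", "mid risk", "mid risk", "mid risk", "low risk", "low risk"]
at_k8 = ["mid risk tall", "mid risk short", "mid risk short", "mid risk tall",
         "low risk", "low risk"]

flow = cluster_flow(at_k7, at_k8)
shares = flow_shares(flow)
print(shares)
```

A cluster that persists unchanged shows a single flow with a share of 1, whereas a split shows two or more flows leaving the same source cluster.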

Strengths and limitations

The strengths of our study include the use of a novel approach to identify a comprehensive set of epidemiologically and clinically relevant phenotypes that characterize the entire national population, and coverage of four decades using repeated nationally representative samples with a largely consistent methodology, which allowed us to measure change and disparities in phenotype prevalence and its predictors. Our study has some limitations. First, we did not include any inflammation-related biomarkers, such as C-reactive protein, or other cardiometabolic or renal biomarkers, such as cystatin C or apolipoprotein B, because these data were not available in some rounds of NHANES. Second, this analysis was based on a series of repeated cross-sectional samples and was not designed to evaluate how an individual with a specific phenotype in one year may have shifted to another in a later year or how the identified phenotypes affect the risk of disease onset or death, which should be pursued with data from prospective cohort studies. Third, other clustering methods should be tested in future methodological assessments, especially probabilistic clustering methods that estimate the probabilities that each participant belongs to each cluster. Finally, although we analyzed some predictors of cluster allocation, future research should investigate how other factors, including genetics, diet, behaviors and the living environment, affect assignment to specific clusters.

Discussion

Application of data-driven clustering, which has been applied extensively to genomics data, to population-based risk factor data identified a comprehensive set of clinically relevant cardiometabolic and renal phenotypes in the US adult population over a period of four decades. The results showed an increase in the ‘severe obesity’ phenotype whose other cardiometabolic risks were not noticeably different from the average population, a stable prevalence of the ‘severe hyperglycemia’ phenotype and a sharp decrease in the ‘high cholesterol’ and ‘high blood pressure’ phenotypes. This improvement in vascular health has been partly offset by rising prevalence of those with poor kidney function in the ‘low DBP, low eGFR’ cluster.

To our knowledge, no study has applied data-driven clustering methods to repeated nationally representative data to identify multifactorial cardiometabolic and renal phenotypes, and to analyze their trends, in the US population. Our results were consistent with single-risk-factor trend studies on obesity, hypertension or blood lipids, which showed a rise in the former but a decline in the latter two risk factors, including in individuals with obesity34,35,36,42,43. Our result on the higher prevalence of the ‘low risk’ phenotype in women than in men was also consistent with previous findings on cardiovascular health of the US population44. We further observed a decrease in the ‘low risk’ phenotype in women and no detectable change for men, which was consistent with a reported statistically insignificant trend in the prevalence of optimal cardiometabolic health for both sexes combined44. We did not observe an increase in the ‘severe hyperglycemia’ phenotype between 1988 and 2018 despite the reported rise in diabetes in the United States71. This was because the ‘severe hyperglycemia’ phenotype was characterized by very high HbA1c levels and included individuals with uncontrolled diabetes, consistent with previous findings on diabetes subgroups53,54. The prevalence of people at such high levels of HbA1c has been relatively stable because improvements in diagnosis and management have countered the rise in total diabetes prevalence72. The ‘low DBP, low eGFR’ phenotype, which had two dominant features (high pulse pressure and poor kidney function), is consistent with the association between atherosclerosis and CKD73. This phenotype was found predominantly in older ages, had a high prevalence of diabetes and was associated with a history of MI and CHF for women, consistent with high levels of vascular–renal comorbidity in older ages74 and with the association of CHF with pulse pressure75. 
The observed increase in the ‘low DBP, low eGFR’ phenotype, especially in the early 2000s, was also consistent with the previously reported rise in the prevalence of CKD in the United States76. We did not identify a metabolically healthy obesity phenotype, which accounted for 9.7% of the US population in one study on this specific group77, even after allowing 12 clusters to be formed. There may be two reasons for this apparent difference. First, half of the people classified as metabolically healthy in the aforementioned study77 had one metabolic risk factor. Second, in our study, such people were clustered either in the ‘severe obesity’ phenotype or in the two mid-risk phenotypes. Finally, our results on ethnic and educational disparities in the prevalence of specific clusters were consistent with previous studies that considered risk factors either individually36,78 or through the lens of optimal cardiometabolic health23, but these studies did not examine disparities in a comprehensive set of cardiometabolic and renal phenotypes of risk factors. Our results are not directly comparable with those using electronic health records due to differences in the study population, methods and clinical conditions used in the clustering and because some of these studies aimed at identifying subtypes of specific diseases45,47,48,50,51,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68. Among such studies, two studies in different populations identified phenotypes characterized by compromised kidney function and low DBP50,56. Another study that used electronic health records in London also found a cluster with both CHF and CKD62, which is analogous to our ‘low DBP, low eGFR’ phenotype. One study using electronic health records found a subtype of type 2 diabetes characterized by very high HbA1c levels analogous to the ‘severe hyperglycemia’ phenotype identified in our study54.

Our analysis coherently uncovered epidemiological subgroups of the US population characterized by distinct profiles of cardiometabolic and renal risk factors. Some of these phenotypes were characterized by high levels of one or two closely related risk factors, whereas others were more complex and based on multiple seemingly unrelated traits that may share upstream clinical and sociodemographic determinants. Although genetics influences individual or multiple risk factors79,80,81,82,83,84,85, the risk factors that characterized the clusters identified in our study are also influenced by behavioral, environmental and dietary determinants as well as the use (or non-use) of medicines that lower risk factor levels. Future research combining these determinants with genetic data is needed to discern their contributions to the prevalence and trends in cardiometabolic phenotypes and their influence on the occurrence of disease. Our results apply to the US population, and future research should also compare cardiometabolic and renal phenotypes across populations with different diets, health behaviors, healthcare and genetics.

Although the prevalence of the phenotype characterized by very high BMI and WHtR has increased, this group had about average levels of other risk factors. Nonetheless, higher-than-median BMI was also a trait of the ‘severe hyperglycemia’ phenotype, which has not declined despite improvements in diabetes detection and treatment, reflecting the growth of incidence and prevalence of diabetes during the period examined38. There was a substantial decline in phenotypes characterized by high levels of non-HDL cholesterol and SBP and DBP, despite the rise in the ‘severe obesity’ phenotype. The use of antihypertensive medicines, which increased over time, may be one of the reasons that those in the ‘severe obesity’ cluster have near-average blood pressure levels despite their high BMI and WHtR levels. The use of statins and antihypertensive medications may have also shifted some treated individuals from the ‘high blood pressure’ and ‘high cholesterol’ groups into the two mid-risk ones, as seen in the correlated trends in the prevalence of the ‘high cholesterol’ and ‘high blood pressure’ phenotypes with the use of statins and antihypertensive medications, respectively (Fig. 7)86,87. These improvements have contributed to the decades-long decline in cardiovascular mortality in the United States through lower event rates and better survival88,89. The delayed vascular events and better survival, however, may have engendered a rise in an older group with increasing vascular–renal comorbidities, represented by the ‘low DBP, low eGFR’ phenotype, among whom history of MI and stroke was common and the prevalence of CHF was high. The increase of the ‘high blood pressure’ phenotype since the late 2000s may be due to the fact that hypertension treatment and control in the United States, and in other high-income countries, has not improved over the past decade90. This stagnation may be partly responsible for the recent deceleration in the decline of CVD mortality89. 
Public health actions, especially those that enhance access to healthier foods, such as fresh fruits and vegetables, legumes and unprocessed grains, as well as treatment of hypertension, high cholesterol and diabetes, can help shift an increasing share of the population from some of the high-risk phenotypes to low-risk and mid-risk ones and delay the onset of comorbid chronic conditions that characterized the ‘low DBP, low eGFR’ phenotype. New medicines for obesity, if their cost is lowered, may also reduce the prevalence of the ‘severe obesity’ phenotype, which has average levels of other risk factors, and also reduce BMI among people who fall in other high-risk clusters91. These interventions may be optimized and targeted in the future through precision public health approaches that use the entire risk factor profile or more efficient risk stratification and risk factor management through both clinical and community-based interventions.

Fig. 7: Age-standardized trends in hypertension, treated hypertension and prevalence of the ‘high blood pressure’ phenotype (a) and in hypercholesterolemia, treated hypercholesterolemia and prevalence of the ‘high cholesterol’ phenotype (b).

Hypertension is defined as having SBP 140 mmHg or greater, DBP 90 mmHg or greater or taking medication for hypertension. Hypercholesterolemia is defined as having non-HDL 4.92 mmol L−1 or greater or taking medication for hypercholesterolemia.
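The two definitions in this caption translate directly into threshold rules. The sketch below encodes them as stated. The function and parameter names are illustrative choices, not taken from the study's code.

```python
def has_hypertension(sbp_mmhg, dbp_mmhg, on_bp_medication):
    """SBP of 140 mmHg or greater, DBP of 90 mmHg or greater, or on medication."""
    return sbp_mmhg >= 140 or dbp_mmhg >= 90 or on_bp_medication

def has_hypercholesterolemia(non_hdl_mmol_l, on_lipid_medication):
    """Non-HDL cholesterol of 4.92 mmol/L or greater, or on medication."""
    return non_hdl_mmol_l >= 4.92 or on_lipid_medication

# A treated participant counts as a case even with controlled levels
print(has_hypertension(128, 78, on_bp_medication=True))
# Isolated systolic hypertension: high SBP with normal DBP
print(has_hypertension(159, 61, on_bp_medication=False))
print(has_hypercholesterolemia(4.5, on_lipid_medication=False))
```

Under these definitions, treated participants remain counted as having the condition, which is why the prevalence of a condition and the prevalence of the corresponding phenotype can move in different directions.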

Methods

Data

The NHANES is a nationally representative survey of the US non-institutionalized civilian population aged 2 months or older with a multistage, stratified clustered probability sample design. The first round of NHANES was done in 1959, and, since 1999, it has been conducted in continuous 2-year rounds. Details of survey design and sampling are provided elsewhere92 and are summarized below.

We used 11 rounds of NHANES, including NHANES III (1988–1994) and various rounds of continuous NHANES from 1999 to 2018, for analyzing trends in cardiometabolic and renal traits. We did not use rounds before NHANES III because they did not measure HbA1c. NHANES participants are not re-enrolled in subsequent years, except through chance. Therefore, our results represent cardiometabolic and renal clusters present in successive US populations.

Participants in each round of NHANES were sampled to be collectively representative of the population in the survey year. Ethnic minorities as well as older adults were oversampled to provide stable estimates for these groups. Sample weights were calculated to account for the complex survey design, survey non-response and post-stratification adjustment to match total population counts from the Census Bureau.
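The role of sample weights can be illustrated with a weighted prevalence, in which each participant contributes in proportion to the number of people they represent. This is a minimal sketch of a point estimate only. It does not handle strata or sampling units, which are needed for variance estimation under a complex survey design, and the flags and weights below are hypothetical.

```python
def weighted_prevalence(in_cluster, weights):
    """Share of the weighted population whose membership flag is True."""
    total = sum(weights)
    return sum(w for flag, w in zip(in_cluster, weights) if flag) / total

# Hypothetical participants: membership flag and survey weight.
# Oversampled groups carry smaller weights per participant.
flags = [True, False, False, True]
weights = [1000, 3000, 4000, 2000]

crude = sum(flags) / len(flags)
weighted = weighted_prevalence(flags, weights)
print(crude, weighted)
```

In this toy example the unweighted share is 0.5, but the weighted share is 0.3 because the two cluster members represent fewer people than the two non-members.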

We restricted the analysis to participants aged 20 years and older who had all the required biomarker measurements available. We used the following risk factors in our study, based on their relevance to cardiometabolic and renal diseases and their availability in NHANES data.

Anthropometric measures: we used height (cm); BMI, defined as weight divided by height squared (kg m−2); and WHtR, defined as waist circumference divided by height. Being taller is associated with a lower risk of CVDs and all-cause mortality but a higher risk of some cancers13. High BMI is a risk factor for diabetes, CVDs, several cancers and kidney and liver diseases9,14. WHtR was included as a measure of abdominal obesity, which may increase the risk of disease and death independently of BMI93.

Blood pressure and heart rate: we used SBP and DBP as they are associated with increased risk of CVDs, kidney disease and dementia8. We included resting heart rate (RHR), as higher values have been associated with increased risk of cardiovascular and all-cause mortality3. RHR was measured as 60-s pulse and referred to as pulse rate.

Lipids: we used HDL and non-HDL cholesterol defined as total cholesterol (TC) minus HDL cholesterol. Non-HDL cholesterol is associated with higher risk of ischemic heart disease and stroke, and HDL cholesterol is a marker for lower risk11.

Glycemia: we used HbA1c, a proxy for average blood glucose levels over recent weeks that has been associated with CVDs12, as the marker for glycemic risk and control.

Kidney function: we used eGFR (using the CKD-EPI creatinine equation) as a measure of kidney function, which is a predictor of CKD and CVDs5,6.
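As an illustration of the derived variables defined above, the following minimal Python sketch computes BMI, WHtR and non-HDL cholesterol from the measured quantities. The function and argument names are hypothetical and are not taken from the analysis code; eGFR is omitted because the exact form of the CKD-EPI equation is not restated here.

```python
def bmi(weight_kg, height_cm):
    """Body mass index (kg m^-2): weight divided by height in metres, squared."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def waist_to_height_ratio(waist_cm, height_cm):
    """WHtR: waist circumference divided by height (same units, so unitless)."""
    return waist_cm / height_cm


def non_hdl_cholesterol(total_chol, hdl_chol):
    """Non-HDL cholesterol: total cholesterol minus HDL cholesterol (same units)."""
    return total_chol - hdl_chol
```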

All the risk factors used in the clustering were measured. Physical examinations were conducted in a mobile examination center, and blood samples were drawn from a random subset of the participants. Blood pressure was measured three times on the right arm with a sphygmomanometer and appropriate cuff size in seated position after a 5-min rest period in all rounds. Both TC and HDL analyses were conducted on venous samples collected according to a standardized protocol. Although the laboratories, methods and instruments used to measure lipid concentrations changed across survey periods, measurements were standardized according to the criteria of the Centers for Disease Control and Prevention (CDC) or the National Heart, Lung, and Blood Institute Lipid Standardization Program of the CDC94. HbA1c was measured in all NHANES cycles using high-performance liquid chromatography. We followed NHANES recommendations and did not apply any calibration correction based on cross-over regression. Before eGFR calculation, serum creatinine measurements were calibrated using a previously reported calibration equation95 to account for potential drift in measurement methods. More information on NHANES measurement, laboratory procedures and quality controls can be found on the survey website: http://www.cdc.gov/nchs/nhanes.htm.

We did not use data on inflammation markers, such as C-reactive protein, because these data were only available in some rounds of NHANES. We also used data on age, sex, race and ethnicity, education, history of diseases and medication use for examining the demographic and clinical characteristics of the clusters; these data were collected through a questionnaire.

Data cleaning

Before analyses, we conducted the following data cleaning procedure. First, we removed measurements outside pre-defined plausibility ranges (Supplementary Table 2). Second, for blood pressure, we discarded the first measurement and used the average of the remaining measurements. Third, for all participants, we confirmed that SBP > DBP and TC ≥ HDL. Finally, we applied an outlier detection procedure based on Mahalanobis distance96 to exclude risk factor pairs that had an implausible pairwise relationship relative to the overall data. This method uses the empirical relationship between risk factor pairs to detect extreme combinations, for example, a high SBP of 248 mmHg but low DBP of 40 mmHg or a high BMI of 42 kg m−2 but small waist circumference of 74 cm. We applied this technique separately to all pairs of anthropometric variables (height, weight, BMI, waist circumference and WHtR), those of blood pressure (SBP and DBP) and those of lipids (TC and HDL). All variables except height and DBP were log transformed before outlier detection to account for their skewed distributions. For each pair considered, observations with a Mahalanobis distance larger than 40.08 (equivalent to a distance of six standard deviations from the mean) were excluded. The present analysis used data from 58,452 participants (28,272 men and 30,180 women) after applying the above steps (Extended Data Fig. 4).
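The pairwise outlier rule can be sketched as follows. This is an illustration under a stated assumption rather than the analysis code: it assumes that the cutoff applies to the squared Mahalanobis distance of a variable pair, which under bivariate normality follows a chi-square distribution with 2 degrees of freedom. Under that assumption, matching the two-sided tail probability of six standard deviations gives a cutoff of approximately 40.09, close to the reported value of 40.08. All function names are hypothetical.

```python
import math


def six_sd_threshold(n_sd=6.0):
    """Squared-distance cutoff with the same tail probability as +/- n_sd SDs.

    Assumption: the squared Mahalanobis distance of a variable pair follows a
    chi-square distribution with 2 degrees of freedom, whose survival
    function is exp(-x / 2).
    """
    two_sided_tail = math.erfc(n_sd / math.sqrt(2.0))
    return -2.0 * math.log(two_sided_tail)


def squared_mahalanobis(xs, ys):
    """Squared Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse of the 2x2 sample covariance matrix.
        distances.append((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def flag_outliers(xs, ys, threshold=None):
    """True for observations whose squared distance exceeds the cutoff."""
    cutoff = six_sd_threshold() if threshold is None else threshold
    return [d > cutoff for d in squared_mahalanobis(xs, ys)]
```

In practice, the inputs would be the (log-transformed, where applicable) values of one risk factor pair, such as SBP and DBP.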

Statistical analysis—cluster identification

Our analytical objective was to divide the NHANES sample into groups of participants with risk factor levels that are similar to each other but distinct from those in other clusters. In extreme cases of one or more risk factors—for example, familial hypercholesterolemia or possibly type 1 diabetes—this task is relatively straightforward and may even be feasible based on prior knowledge or visual inspection of data. For national populations, however, such partitioning requires a method that operationalizes the analytical objective by partitioning the joint distribution of risk factors.

We used a k-means clustering algorithm to identify cardiometabolic and renal phenotypes of the US population in an unsupervised data-driven approach. The k-means algorithm partitions participants into non-overlapping clusters that are relatively homogeneous while maximizing the heterogeneity between clusters, by minimizing the sum of squared distances of all data points from the center of the cluster to which they belong. The k-means algorithm is a specific form of Gaussian mixture method where only the means of the clusters are estimated but not their covariance97. It is a widely used and computationally efficient clustering algorithm. We took 50 different random sets of starting values to avoid converging to local minima and used Euclidean distance and the Lloyd implementation of the algorithm.
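To make the procedure concrete, the following is a minimal pure-Python sketch of Lloyd's algorithm with multiple random starts. It is illustrative only and is not the R code used for the analysis; the function names, the seed, the iteration limit and the handling of empty clusters are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def lloyd(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from k randomly chosen starting points."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments unchanged, so the run has converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wss


def kmeans(points, k, n_starts=50, seed=0):
    """Repeat Lloyd's algorithm and keep the run with the smallest
    within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = (lloyd(points, k, rng) for _ in range(n_starts))
    return min(runs, key=lambda run: run[2])
```

Keeping the best of several random starts reduces the chance that the reported solution is a poor local minimum, at the cost of proportionally more computation.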

All analyses were conducted by pooling individual participant data across all survey rounds but separately for men and women to allow for potentially different clustering of cardiometabolic traits between them. We centered and scaled each risk factor by subtracting the overall mean and dividing by the standard deviation before clustering. In k-means, the number of clusters (k) must be pre-specified. Various heuristics have been suggested for selecting the optimal number of clusters (for example, the elbow method and the silhouette method), which compare measures of cluster cohesion and cluster separation for different choices of k. Neither the elbow nor the silhouette method provided a definitive optimal number of clusters (Supplementary Fig. 2). Therefore, we investigated cluster membership and characteristics when sequentially changing k from 5 to 12, and selected k based on these heuristics as well as on the epidemiological interpretability of the results.
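The centering and scaling step matters because Euclidean distance would otherwise be dominated by variables with large numerical ranges (for example, height in cm compared with WHtR). A minimal sketch is given below; the function name is hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption.

```python
import math


def standardize(values):
    """Center on the mean and scale by the sample standard deviation.

    Assumption: the standard deviation uses the n - 1 denominator.
    """
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]
```

Applied to each risk factor in turn, this yields variables with mean 0 and standard deviation 1, so that each contributes on a comparable scale to the distance calculation.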

Stability of the clustering results

After selecting the number of clusters, we evaluated the stability of the resultant clusters by calculating the average Jaccard index98 between the clustering results over the entire sample and that of 1,000 subsamples of 50% of the data drawn without replacement (Extended Data Table 4). The Jaccard index is a measure of similarity between two groups and ranges from 0 to 1, with 0 indicating no overlap and 1 indicating identical results. For men, all clusters had an average Jaccard index of 0.87 or above; for women, all clusters had an average Jaccard index of 0.80 or above, except for the ‘mid risk tall’ phenotype that had an average Jaccard index of 0.70. To evaluate whether our analysis met our analytical objective of partitioning the joint distribution of risk factors based on a true correlation structure, we also used k-means to cluster 30,180 simulated data points (the same number as used in the main analysis). The simulated data were generated from a 10-dimensional normal distribution with no correlation. All the resulting clusters were highly unstable with a Jaccard index below 0.30, which is much lower than those of clusters identified on NHANES data (Extended Data Table 4).
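The stability calculation can be illustrated with the sketch below, which computes the Jaccard index between sets of participant identifiers. How each full-sample cluster is paired with a subsample cluster is not detailed above; matching each cluster to its best-overlapping counterpart, after restricting it to the subsampled participants, is an assumption that follows a common clusterwise stability approach. All names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_stability(full_clusters, subsample_clusters, subsample_ids):
    """Best Jaccard match of each full-sample cluster within one subsample.

    full_clusters and subsample_clusters map a cluster name to member ids.
    Assumption: each full-sample cluster is restricted to the subsampled ids
    and paired with the subsample cluster that overlaps it most.
    """
    kept = set(subsample_ids)
    stability = {}
    for name, members in full_clusters.items():
        restricted = set(members) & kept
        stability[name] = max(
            jaccard(restricted, other) for other in subsample_clusters.values()
        )
    return stability
```

Averaging the per-cluster values returned by `cluster_stability` over repeated subsamples would give one average Jaccard index per cluster.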

Intra-cluster and inter-cluster distances

We also report (Extended Data Fig. 5) the intra-cluster and inter-cluster distances as a measure of how the method achieves the analytical objective. The intra-cluster distance was calculated as the average Euclidean distance between all pairs of points in the same cluster, and the inter-cluster distance was calculated as the average Euclidean distance between all pairs of points from two different clusters. These metrics show that participants assigned to every cluster were, on average, more similar to one another in terms of their risk factor levels than they were to participants in any other cluster.
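The two distance summaries defined above can be sketched as follows. The function names are hypothetical, and this brute-force version enumerates every pair, so it is intended as a definition-level illustration rather than an efficient implementation for tens of thousands of participants.

```python
import math
from itertools import combinations, product


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def intra_cluster_distance(points):
    """Average Euclidean distance over all pairs of points in one cluster."""
    pairs = list(combinations(points, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


def inter_cluster_distance(points_a, points_b):
    """Average Euclidean distance over all pairs drawn from two clusters."""
    pairs = list(product(points_a, points_b))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)
```

A cluster is well separated from another when its intra-cluster distance is smaller than its inter-cluster distance to that cluster.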

Consistency of clusters over time

We investigated whether clusters emerging from the analysis of all rounds of NHANES from 1988 to 2018 were similar to those that would emerge if we repeated the analysis for subperiods consisting of NHANES III 1988–1994, NHANES 1999–2008 and NHANES 2009–2018 separately (Supplementary Fig. 3). The phenotypes identified in subperiods were similar to those identified when aggregating all rounds from 1988 to 2018 for men. For women, most of the phenotypes identified over the entire analysis period remained in subperiod clustering, except the ‘mid risk tall’ phenotype, which was replaced by either an ‘obesity’ phenotype or a ‘mid risk’ phenotype, and except the ‘low DBP, low eGFR’ phenotype in NHANES III, which was replaced with a ‘high risk’ phenotype with hazardous levels of all risk factors.

Statistical analysis—trends in prevalence and predictors of cluster membership

In addition to graphical presentation of how cluster prevalence has changed over time, we analyzed the presence of a trend in a regression analysis. We fitted one logistic regression per cluster, with time as the independent variable. We adjusted for age by 5-year age bands and report the P value for the coefficient of the time term. In addition to the entire analysis period, we analyzed trends for pre-specified time periods of 1988–2000, 2000–2010 and 2010–2018 (Extended Data Table 3).

We also used multivariate logistic regression to analyze the predictors of cluster membership. The predictors included age group, survey year, race or ethnicity (non-Hispanic White, non-Hispanic Black, Hispanic and Other ethnicity), education (below high school, high school and university or college), medication use (antihypertensive, statin, oral hypoglycemic diabetes medication and insulin), smoking (current smoking, never smoking and former smoking) and previous history of disease (MI, stroke and CHF).

When reporting the prevalence of clusters over time, and the potential predictors of cluster membership, we accounted for the sampling design through the use of sample weights in the regressions. In all regressions, we rescaled sample weights so that they summed to the same total in each round. We did this so that each round of NHANES contributes the same effective sample size to the analysis of trends and predictors. When evaluating trends over time and predictors of cluster membership, we also adjusted the sample weights by 5-year age bands to match the age distribution of the 2020 US census population. All analyses were conducted using R software version 4.0.3.
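The rescaling of sample weights by survey round can be sketched as below. The function name and the common total (`target_total`) are hypothetical; any shared constant gives each round the same total weight.

```python
from collections import defaultdict


def rescale_weights(rounds, weights, target_total=1.0):
    """Rescale sample weights so that they sum to target_total in each round.

    rounds: survey-round label of each participant.
    weights: original sample weight of each participant.
    """
    totals = defaultdict(float)
    for survey_round, weight in zip(rounds, weights):
        totals[survey_round] += weight
    return [
        weight * target_total / totals[survey_round]
        for survey_round, weight in zip(rounds, weights)
    ]
```

Relative weights within a round are preserved, so the rescaling changes only how much each round contributes to pooled estimates.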

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.