Towards precision cardiometabolic prevention: results from a machine learning, semi-supervised clustering approach in the nationwide population-based ORISCAV-LUX 2 study

Given the rapid increase in the incidence of cardiometabolic conditions, there is an urgent need for better approaches to prevent as many cases as possible and to move from a one-size-fits-all approach to a precision cardiometabolic prevention strategy in the general population. We used data from ORISCAV-LUX 2, a nationwide, cross-sectional, population-based study. Among the 1356 participants, we applied a machine learning semi-supervised clustering method guided by body mass index (BMI) and glycated hemoglobin (HbA1c), together with a set of 29 cardiometabolic variables, to identify subgroups of interest for cardiometabolic health. Cluster stability was assessed with the Jaccard similarity index. We observed 4 clusters with very high stability (ranging between 92 and 100%). Based on distinctive features that deviate from the overall population distribution, we labeled Cluster 1 (N = 729, 53.76%) as “Healthy”, Cluster 2 (N = 508, 37.46%) as “Family history-Overweight-High Cholesterol”, Cluster 3 (N = 91, 6.71%) as “Severe Obesity-Prediabetes-Inflammation” and Cluster 4 (N = 28, 2.06%) as “Diabetes-Hypertension-Poor CV Health”. Our work provides an in-depth characterization and thus a better understanding of cardiometabolic health in the general population. Our data suggest that such a clustering approach could now be used to define more targeted and tailored strategies for the prevention of cardiometabolic diseases at a population level. This study provides a first step towards precision cardiometabolic prevention and should be externally validated in other contexts.

Cluster analyses are useful approaches to identify subgroups with different cardiometabolic profiles. Such an approach has recently been developed among people with diabetes, the analysis revealing 5 subgroups with different clinical profiles and risks of diabetes-related complications 11 , but has never been investigated in the general population at large scale 12 . Besides, the clustering approaches used in the literature so far were mostly unsupervised, i.e. they assume that there is no outcome variable and that nothing is known about the relationships between the observations in the data set, which is not a realistic assumption with respect to cardiometabolic prevention. Semi-supervised clustering techniques may therefore be better suited to derive meaningful groups 13 , similarly to what has been recently suggested in people with type 1 diabetes 14 , and to redefine the way we consider, prevent and treat cardiometabolic diseases in the general population, not as independent entities but rather with a more comprehensive, patient-centered approach.
Therefore, based on the unique set of cardiometabolic data available in the nationwide population-based ORISCAV-LUX 2 study, our objective was to stratify the general population in terms of cardiometabolic profiles with a high level of granularity, guided by two key factors to assess cardiometabolic health, namely (1) body mass index (BMI), the most frequently used indicator to evaluate adiposity in large populations and an established risk factor for numerous cardiometabolic disorders, highly correlated with various cardiometabolic and cardiovascular risk factors, and (2) glycated hemoglobin (HbA1c), a reliable and documented biomarker of glycemic control that is also correlated with many cardiometabolic conditions and surrogate markers [15][16][17][18] . This new clustering will help to better understand the cardiometabolic health of the general population and might eventually help to tailor and target early prevention strategies to people who would benefit the most, thereby representing a first step towards precision prevention for cardiometabolic diseases.

Materials and methods
ORISCAV-LUX 2 study. The "Observation of cardiovascular risk factors in Luxembourg" (ORISCAV-LUX) 2 is the second wave of the nationwide cross-sectional, population-based ORISCAV-LUX study. The ORISCAV-LUX 1 survey, conducted between November 2007 and January 2009, was the first nationwide cross-sectional survey of cardiovascular health monitoring in Luxembourg, with the objective of describing baseline information on the prevalence of "traditional" cardiovascular risk factors, including obesity, hypertension, diabetes mellitus, lipid disorders, smoking and physical inactivity, among the general adult population in Luxembourg 19 . The second wave of ORISCAV-LUX was initiated in 2016 to update and monitor the evolution of cardiometabolic parameters in the general population. An extended set of health indicators, new clinical examinations and self-reported information were integrated in this second round of data collection. The data collection workflow has been detailed extensively elsewhere 20 . Informed consent was obtained from all participants. The study design and information collected were approved by the National Research Ethics Committee (CNER, No 201,505/12) and the National Commission for Private Data Protection (CNPD). All methods were carried out in accordance with the Declaration of Helsinki, 2008.
Study population. We included participants from the second wave of the ORISCAV-LUX study (2016-2018), where more detailed information on cardiometabolic health was available. We initially included 1558 participants, then excluded those who only filled in the self-administered questionnaire (n = 120), did not get a lab test (n = 51) or had no body composition measures available (n = 30), as well as one outlier in the HbA1c distribution (HbA1c = 109 mmol/mol, n = 1). We therefore considered N = 1356 participants in the present analysis (see flow chart, Fig. 1).
Clinical and laboratory data assessment. HbA1c was measured on an HPLC analyser, Tosoh G8™. Heart rate, pulse wave velocity, central pressure, arterial age and lying-position blood pressure were measured with Complior™. ECGs were read and interpreted by a cardiologist and then categorized as normal or abnormal. Bioimpedance measures of body fat percentage in the trunk, muscle mass in the trunk, and total fat and fat-free mass in the trunk were assessed with a Tanita™ digital scale. Insulin was measured on an Abbott immunology analyser (chemiluminescence technique). Insulin resistance was assessed with the HOMA-IR index, calculated as Insulin (mIU/l) × Glucose (mmol/l)/22.5. Insulin sensitivity was estimated with the Quicki index, calculated as 1/[log (Insulin, mIU/l) + log (Glucose, mg/dl)]. Glomerular filtration rate was estimated with the MDRD formula.
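The two insulin indices defined above can be illustrated with a minimal Python sketch. It is not the study's code: the function names are hypothetical, and the logarithm in the Quicki index is assumed to be base 10, as in the usual definition of that index (the text only writes "log").

```python
import math


def homa_ir(insulin_miu_l: float, glucose_mmol_l: float) -> float:
    """HOMA-IR = insulin (mIU/l) x glucose (mmol/l) / 22.5."""
    return insulin_miu_l * glucose_mmol_l / 22.5


def quicki(insulin_miu_l: float, glucose_mg_dl: float) -> float:
    """Quicki = 1 / [log(insulin, mIU/l) + log(glucose, mg/dl)].

    Assumption: base-10 logarithm.
    """
    return 1.0 / (math.log10(insulin_miu_l) + math.log10(glucose_mg_dl))


# Hypothetical input values, chosen only to make the arithmetic easy to follow.
example_homa = homa_ir(insulin_miu_l=10.0, glucose_mmol_l=4.5)   # 10 * 4.5 / 22.5
example_quicki = quicki(insulin_miu_l=10.0, glucose_mg_dl=100.0)  # 1 / (1 + 2)
```

Note that the two indices use glucose in different units (mmol/l for HOMA-IR, mg/dl for Quicki), so the same glucose measurement has to be converted before being passed to one of the two functions.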

Cluster analysis.
We performed a semi-supervised cluster analysis guided by BMI and HbA1c to identify subgroups of interest 13 . Five measures, i.e. the means and variances of BMI and HbA1c, as well as the covariance between BMI and HbA1c, were predicted for each individual using reinforcement learning trees (RLT), a type of tree-based machine learning technique 21 . The five clustering variables (RLT-predicted means and variances of BMI and HbA1c and their covariance) were standardized and a k-means clustering algorithm 22 with Euclidean distance was applied. Clustering was tested with and without taking the covariance between BMI and HbA1c into account.
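The standardization and k-means step can be sketched as follows. This is an illustrative Python sketch using only the standard library, not the study's implementation: the RLT prediction step is not reproduced, the input rows are hypothetical stand-ins for the RLT-predicted clustering variables, and the initial centres are simply the first k rows so that the example is reproducible (practical implementations usually rely on several random starts).

```python
import math


def standardize(rows):
    """Column-wise z-scores: subtract the mean, divide by the sample SD."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with Euclidean distance.

    Simplification (assumption): the first k points are the initial centres.
    """
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers


# Hypothetical rows: predicted mean BMI, mean HbA1c, variance of BMI, variance of HbA1c.
toy_rows = [
    [23.0, 34.0, 4.0, 3.0],
    [35.0, 48.0, 9.0, 8.0],
    [24.0, 35.0, 4.5, 3.5],
    [36.0, 50.0, 9.5, 8.5],
    [22.5, 33.0, 3.8, 2.9],
    [34.0, 47.0, 8.8, 7.9],
]
toy_labels, toy_centers = kmeans(standardize(toy_rows), k=2)
```

Standardizing first matters because k-means with Euclidean distance would otherwise be dominated by whichever variable has the largest numeric scale.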
A set of 51 cardiometabolic factors was available in ORISCAV-LUX 2. The body fat and muscle mass measures from different body parts were highly correlated (Pearson coefficient > 0.95), so we only kept the body fat and muscle mass from the trunk for further analysis to increase clustering stability. Overall, we used a subset of 31 factors in the cluster analysis (the remaining factors were only used a posteriori for illustrative purposes, see Table 1). RLT prediction was performed based on the following set of cardiometabolic factors: demographic (age and sex), clinical (ECG interpretation, heart rate, carotid-femoral pulse wave velocity, central pressure, arterial age, defined as the average age for a given carotid-femoral pulse wave velocity 23 anthropometrically predicted visceral adiposity 22 , body fat percentage in the trunk, muscle mass in the trunk, total fat and fat free mass in the trunk), and laboratory (insulin, insulin resistance, insulin sensitivity, glomerular filtration rate, creatinine, total cholesterol, LDL cholesterol, HDL cholesterol, triglycerides, CRP) measures. A missing at random mechanism was assumed and missing values were imputed using multiple imputation by chained equations (mice R package 24 ). Clustering stability was assessed using the clusterboot function from the fpc R package. The data were resampled 100 times using bootstrap and the Jaccard similarities 25 of the original clusters to the most similar clusters in the resampled data were computed. The mean over these similarities was used as an index of the stability of a cluster. The assessment was applied to clusterings with a number of clusters from 3 to 8. We chose the clustering with the highest mean Jaccard similarity index across clusters, provided that its smallest cluster contained more than 20 participants. Clusters were ordered by increasing HbA1c median.
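The stability index and the selection rule described above can be sketched in Python as follows. This is a simplified illustration, not the fpc implementation: the bootstrap resampling and re-clustering are left to the caller, the function names are hypothetical, and the candidate values for k = 3 and k = 5 in the example are invented for illustration (only the k = 4 values are those reported in this study).

```python
def jaccard(a, b):
    """Jaccard similarity |A and B| / |A or B| between two sets of participant ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_stability(original, resampled_runs):
    """Mean, over bootstrap runs, of the best Jaccard match of each original cluster.

    original: list of sets (one per cluster).
    resampled_runs: list of clusterings, each a list of sets.
    """
    stability = []
    for cluster in original:
        best = [max(jaccard(cluster, c) for c in run) for run in resampled_runs]
        stability.append(sum(best) / len(best))
    return stability


def choose_k(candidates, min_size=20):
    """Pick the k with the highest mean stability among clusterings whose
    smallest cluster has more than min_size members.

    candidates: dict mapping k to (per-cluster stability list, cluster sizes list).
    """
    eligible = {k: sum(stab) / len(stab)
                for k, (stab, sizes) in candidates.items()
                if min(sizes) > min_size}
    return max(eligible, key=eligible.get)


# Toy check of the stability index with two hypothetical bootstrap runs.
toy_stability = cluster_stability(
    [{1, 2, 3, 4}],
    [[{1, 2, 3, 4}, {5}], [{1, 2}, {3, 4, 5}]],
)

# k = 4 uses the values reported in this study; k = 3 and k = 5 are hypothetical.
chosen_k = choose_k({
    3: ([0.90, 0.80, 0.70], [700, 500, 156]),
    4: ([1.00, 1.00, 0.94, 0.92], [729, 508, 91, 28]),
    5: ([1.00, 0.99, 0.98, 0.97, 0.99], [700, 500, 90, 50, 16]),
})
```

In the example, k = 5 has the highest mean stability but is discarded because its smallest cluster does not exceed the size threshold, which is how the size constraint interacts with the stability criterion.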
Each cluster was then described according to the variables used for the clustering, but also with additional illustrative variables: lifestyle factors (physical activity assessed with the International Physical Activity Questionnaire (IPAQ), time spent in a seated position and smoking status categorized into never, former and current smoker), equivalised disposable income, sedentary occupation and other health factors such as self-perceived health (five categories from excellent to poor), family history of diabetes, hypertension and hypercholesterolemia, and personal history of diabetes, cancer and hypertension.
Data are presented in Table 1 as n [%] and median [min, max] for categorical and continuous variables, respectively, in the entire population. In Table 2, study participants' characteristics are displayed according to their clusters. In Table 2, we also computed the average 10-year cardiovascular risk [%] per cluster, based on either the SCORE 26 (validated for people < 70 years and no previous cardiovascular disease or type 2 diabetes mellitus) or the ADVANCE 27 (validated for people with type 2 diabetes) risk score, whichever was most appropriate. We used the median values of the continuous variables, and considered that a binary variable was present if more than 50% of the cluster was concerned. In Fig. 2, a scatter plot of the body mass index and HbA1c distribution was computed and stratified by cluster group. In Fig. 3, we have plotted the distribution of the clusters in radar diagrams according to 35 key characteristics grouped in 5 themes (Diabetes-related factors, Anthropometry, Lipids & Biomarkers, Cardiovascular Health, Sociodemographic, Lifestyle and other Health Factors). For each feature, we computed the relative difference, expressed in percentage, between the median value (or frequency for categorical variables) in the cluster and the median value (or frequency for categorical variables) in the overall population.
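The relative difference plotted in the radar diagrams can be illustrated with a short Python sketch. The formula is an assumption consistent with the description above: the overall-population value is taken as the denominator, which the text does not state explicitly. The function name and the input values are hypothetical.

```python
from statistics import median


def relative_difference(cluster_value: float, overall_value: float) -> float:
    """Relative difference (%) of a cluster value versus the overall population.

    Assumption: the overall-population value is the reference (denominator).
    """
    return 100.0 * (cluster_value - overall_value) / overall_value


# Hypothetical BMI values (kg/m2), for illustration only.
overall_bmi = [22.0, 24.0, 25.0, 27.0, 31.0]
cluster_bmi = [29.0, 30.0, 32.0]

bmi_rel_diff = relative_difference(median(cluster_bmi), median(overall_bmi))
```

For a categorical variable, the same function can be applied to the frequency in the cluster and the frequency in the overall population instead of the medians.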

Results
Population study characteristics. The RLT model that did not take the covariance between BMI and HbA1c into account provided the most stable clusters. We iteratively tested clustering with k = 3 to 8 and defined the final number of clusters as the one that maximized the stability index while ensuring a sufficient number of individuals in each group, with at least 20 individuals. The optimal number of clusters appeared to be 4, and the analysis revealed a very high level of stability, with Jaccard similarity index values of 100%, 100%, 94% and 92% for clusters 1, 2, 3 and 4, respectively (Table 1). Based on the extensive description of the characteristics of individuals in each cluster, Cluster 1 was labeled "Healthy", Cluster 2 was labeled "Family history-Overweight-High Cholesterol", Cluster 3 was labeled "Severe Obesity-Prediabetes-Inflammation" and Cluster 4 was labeled "Diabetes-Hypertension-Poor CV Health".
Cluster 1 "Healthy" encompassed a total of N = 729 participants (53.76% of the total population). Compared to the overall population (Table 1), members of Cluster 1 were comparatively young (median, m = 46.69 years old), with a low median HbA1c level (m = 34.00 mmol/mol) and a low BMI (m = 23.36 kg/m²) (Fig. 2). They also had the lowest values for anthropometric features such as waist-to-hip ratio (m = 0.85), fat mass percentage (m = 24.30%) or predicted visceral adiposity (m = 6.00 cm²). In terms of lipids and biomarkers, they had the highest level of HDL cholesterol (m = 60.00 mg/dl), a high percentage of family history of hypercholesterolemia (42.39%) and the best renal function (GFR = 84.88 ml/min/1.73 m²). Regarding diabetes-related factors, Cluster 1 members had the lowest values for fasting blood glucose (m = 86.00 mg/dl), diabetes diagnosis (1.10%) and HOMA-IR (m = 1.24). Conversely, they had the highest insulin sensitivity (Quicki index m = 0.37). Cluster 1 can be considered the healthiest cluster in terms of cardiovascular health, as its members had the lowest values of vascular age (m = 43.00 years), central pulse pressure (m = 38.00 mmHg), pulse wave velocity (m = 7.50 m/s), abnormal ECG reading (10.70%) and systolic blood pressure (m = 120.00 mmHg). Finally, they were also more frequently non-smokers (62.83%), had a higher income (3750.00 €/month), had a higher median time spent sitting (m = 360.00 min/day) and more frequently had a sedentary occupation (59.26%) (Table 1, Fig. 3). The average 10-year cardiovascular risk for Cluster 1 was 0%.
Cluster 2 "Family history-Overweight-High Cholesterol" encompassed N = 508 participants (37.46% of the total population). Members of Cluster 2 were in the vast majority overweight (m = 28.48 kg/m²) with low values of HbA1c levels (m = 37.00 mmol/mol). Overall, they had intermediate values for all considered anthropometric features. They were characterized by elevated total (m = 205.00 mg/dl) and LDL cholesterol levels (m = 128.50 mg/dl). They also had a high frequency of family history of diabetes (25.00%) and a high percentage of family history of high blood pressure (43.70%). The average 10-year cardiovascular risk for Cluster 2 was 2%.
Cluster 3 "Severe Obesity-Prediabetes-Inflammation" encompassed N = 91 participants (6.71% of the total population). Cluster 3 included individuals with obesity or severe obesity with a higher BMI (m = 35.69 kg/m²) and a higher HbA1c level (m = 39.00 mmol/mol) than those in Cluster 2.

Discussion
In this large, nationwide population-based study, we have observed 4 stable clusters of individuals from the general population with diverse cardiometabolic health profiles. Our study suggests that this classification could help disentangle the heterogeneity in the general population in terms of cardiometabolic health and be used to tailor prevention strategies. Whereas a first group of more than 50% of the total population (Cluster 1 "Healthy") was characterized by healthy cardiometabolic features and could benefit from a general prevention strategy, the other 3 groups (Clusters 2-4) may benefit from a more personalized and intensive approach to improve their health. Individuals in Cluster 2 "Family history-Overweight-High Cholesterol" may benefit from a more comprehensive strategy regarding overweight/obesity and cholesterol management, with a personalized treatment (e.g. through diet, physical activity, psychology or pharmacological treatment) starting from an early age for individuals with a family history of cardiometabolic diseases. This could delay or prevent their transition from Cluster 2 to Clusters 3 or 4 28 . People in Cluster 3 "Severe Obesity-Prediabetes-Inflammation" may benefit from an intense lifestyle management strategy adapted to individuals with moderate obesity 29,30 , or bariatric surgery for those with severe obesity 31,32 , with close monitoring of the impact on low-grade inflammation levels and the reversal of prediabetes to a normoglycemic status 33,34 . Individuals in Cluster 4 "Diabetes-Hypertension-Poor CV Health" are often in a multimorbid state, with diabetes and hypertension simultaneously and, for a third of them, an abnormal ECG reading or elevated triglyceride levels.
Therefore, they could benefit from an intensive combined approach, personalized according to the socioeconomic profile and occupation, with nutritional/dietary 35 or lifestyle 36 interventions, smoking cessation 37 , medication or surgery strategies, targeting both high blood pressure and diabetes with the ultimate objective to reduce arterial stiffness and prevent the occurrence of cardiovascular disease and improve general health status 38,39 .
Overall, these groups may benefit from more efficient prevention and therapeutic strategies. If externally validated, general practitioners could one day rely on this profiling to have a better picture of a new patient when limited information is available and try to optimize several cardiometabolic parameters simultaneously. Some These approaches, along with other recent technologies (big data analysis of gut microbiota, integration of real-time data from wearables), are still complex and not yet cost-effective to implement in practice 42 and our approach could help to fill the gap and help move towards precision cardiometabolic prevention. These findings are also an opportunity to rethink the strategies that can be offered, for instance to people with obesity 43 , with new models developed according to a more refined definition of the targeted sub-population. Cardiometabolic health relies on complex, intricate, physiological relationships between all the parameters considered in this work. These results imply a move from a "one-size-fits-all" vision to a precision cardiometabolic prevention approach to tackle cardiometabolic diseases according to the variety of phenotypes observed in the general population 14 .
Strengths and limitations. This study has numerous strengths. First, the large population size, combined with a unique set of cardiometabolic features and lifestyle and demographic factors, enabled us to extensively and deeply phenotype the general population in terms of cardiometabolic health. It has been shown that the ORISCAV-LUX 2 population was representative of the Luxembourgish adult population in terms of geographical district, but not with respect to sex and age distribution, young and elderly individuals being slightly underrepresented and women over-represented.
Nonetheless, it has been demonstrated that ORISCAV-LUX 2 is a reliable tool for epidemiological research and for cardiometabolic health monitoring in the adult residents of Luxembourg 20 . We also used a semi-supervised clustering approach, guided by two main features of cardiometabolic health, which seems better suited than fully unsupervised clustering to the current knowledge of cardiometabolic health 13 .
This study also has some limitations. Cluster labelling is always subject to interpretation. We used, to the best of our ability, a systematic approach and relied on the most distinctive characteristics in each cluster to label them. Changing the key factors that guide the semi-supervised clustering (here BMI and HbA1c) could yield different distributions, but they were chosen as they are frequently assessed in large populations and are valid surrogates of the overall cardiometabolic health status [15][16][17][18] . The relatively low number of individuals in clusters 3 and 4 could limit the inferences that can be drawn from these groups.
The stability of the clusters has been evaluated internally, but there is now a need to replicate this approach in other large nationwide population-based studies to assess the external validity of this grouping. Some factors used to describe the clusters, such as physical activity, are self-reported, and therefore could be reported differently across the clusters. Besides, neither mental health nor sleep-related factors were included in the descriptive analysis. In future replication studies, wearable devices could be used to collect objective measures of physical activity and sleep quality, which may be valuable information to add to the cluster description.

Conclusion
In conclusion, our work provides an in-depth characterization and thus a better understanding of the general population in terms of cardiometabolic health. Our data suggest that such a clustering approach could now be used to define more targeted and tailored strategies for the prevention of cardiometabolic diseases at a population level. This study provides a first step towards precision cardiometabolic prevention and should be replicated in other contexts. Further studies evaluating the associations between these clusters and subsequent incidence of various cardiometabolic and cardiovascular diseases are warranted.

Figure 3. Radar diagrams of the median values for each cluster, according to 35 key diabetes-related factors, anthropometry, lipids and biomarkers, cardiovascular health, sociodemographic, lifestyle and other health factors. BMI body mass index, FMP fat mass percentage, VISC ADI anthropometrically predicted visceral adiposity, WC waist circumference, HC hip circumference, TC thigh circumference, WHR waist-to-hip ratio, CHOL total cholesterol, FAM HC family history of hypercholesterolemia, CRP C-reactive protein, GFR glomerular filtration rate, TRIG triglycerides, LDL LDL cholesterol, HDL HDL cholesterol, HbA1c glycated hemoglobin, DIABETES diabetes diagnosis, FAM DIABETES family history of diabetes, QUICKI quantitative insulin sensitivity check index, INSULIN insulin, HOMA-IR homeostatic model assessment for insulin resistance, FBG fasting blood glucose, VASC AGE vascular age, HTA hypertension diagnosis, FAM HBP family history of high blood pressure, SBP systolic blood pressure, ECG electrocardiogram, PWV pulse wave velocity, CPP central pulse pressure, SEX sex, NEVER SMOKER never smoker, SITTING time spent sitting, PA physical activity, INCOME income, AGE age. For each feature, we computed the relative difference, expressed in percentage, between the median value (or frequency for categorical variables) in the cluster and the median value (or frequency for categorical variables) in the overall population.