Machine learning-based diagnosis and risk factor analysis of cardiocerebrovascular disease based on KNHANES

The prevalence of cardiocerebrovascular disease (CVD) is continuously increasing, and it is the leading cause of human death. Since it is difficult for physicians to screen thousands of people, high-accuracy and interpretable methods are needed. We developed four machine learning (ML)-based CVD classifiers (multi-layer perceptron, support vector machine, random forest, and light gradient boosting) based on the Korea National Health and Nutrition Examination Survey (KNHANES). We resampled and rebalanced the KNHANES data using complex sampling weights such that the rebalanced dataset mimics a uniformly sampled dataset from the overall population. For clear risk factor analysis, we removed multicollinearity and CVD-irrelevant variables using VIF-based filtering and the Boruta algorithm. We applied the synthetic minority oversampling technique and random undersampling before ML training. The proposed classifiers achieved excellent performance, with AUCs over 0.853. Using Shapley value-based risk factor analysis, we identified that the most significant risk factors of CVD were age, sex, and the prevalence of hypertension. Additionally, we identified that age, hypertension, and BMI were positively correlated with CVD prevalence, while sex (female), alcohol consumption, and monthly income were negatively correlated. The results show that the feature selection and class balancing techniques effectively improve the interpretability of the models.

between variables and outcomes 8. However, these assumptions are rarely satisfied by real data. As an alternative to LR, nonlinear ML models (e.g., support vector machine, random forest) have been used for data analysis. Nonlinear ML can identify complex patterns among multiple variables without any assumptions. One study obtained high classification performance for diabetes using ML and rural Chinese data 9. Other studies developed highly accurate ML-based CVD classifiers using UK Biobank data or the National Health and Nutrition Examination Survey 10,11.
The development of ML has resulted in higher-performance models than pre-existing CVD risk models 12. However, ML models are often referred to as "black boxes", making it hard to explain how an algorithm derived a specific outcome. Feature importance-based explanation has been used to describe how ML models depend on particular risk factors. Recent studies identified that the major risk factors for CVD were age, systolic blood pressure, and BMI by using the permutation feature importance (PFI) and mean decrease impurity (MDI) of tree ensemble models 13,14. Other studies analyzed risk factors for CVD using Shapley additive explanations (SHAP) and PFI on the Korean National Health Insurance Service screening database 15,16. However, most ML-based studies overlook the effect of multicollinearity and irrelevant variables on risk factor analysis. Although multicollinearity does not affect the predictive power of ML models, variables with high collinearity offset each other's importance, leading to an erroneous evaluation of variable importance. As a solution for multicollinearity, principal component analysis (PCA) transformation has been used in medical data analysis 17,18. However, PCA is not appropriate for risk factor analysis because it destroys the interpretability of the original variables. A methodology that eliminates multicollinearity without decreasing interpretability is required for appropriate risk factor analysis. Like multicollinearity, irrelevant variables are also detrimental to ML-based data analysis: they slow down processing and lower the generalization performance and interpretability of ML models 19. For relevant variable selection, MDI-based selection has been used with the XGBoost model 11. Compared to these studies, we propose effective pipelines to reduce the effects of multicollinearity and irrelevant variables on risk factor analysis.
This study aims to develop ML-based CVD classifiers, namely multi-layer perceptron (MLP), support vector machine (SVM), random forest (RF), and light gradient boosting machine (L-GBM), that are free of multicollinearity and redundant variables. We used two-stage variable selection, consisting of variance inflation factor (VIF)-based filtering and the Boruta algorithm, to remove multicollinearity and redundant variables. VIF-based filtering has the benefit of eliminating multicollinearity while maintaining the interpretability of the model. Further, Boruta is an empirical and reliable algorithm that utilizes a statistical significance test to determine outcome-relevant variables. In addition to feature selection, we applied class balancing strategies to increase the confidence scores of the classification models. For model interpretation, we identified the key risk factors of CVD using Shapley additive explanations (SHAP), a post hoc model interpretation technique that is theoretically based on the Shapley value 20. We identified the key risk factors of CVD and the direction of the relationship between the risk factors and CVD using SHAP. In addition, we ablated the feature selection (i.e., VIF and Boruta) and data balancing techniques to show how they affect the SHAP analysis.

Methods
Study design. KNHANES (2007-2018) is a legal survey on the health behavior, prevalence of chronic diseases, and nutritional status of Koreans. The KNHANES is a complex, stratified, multi-stage probability cluster survey in which the participants were not randomly selected from the Korean population 21. KNHANES also includes sample design errors, unequal sampling ratios for factors of interest, nonparticipation errors, and non-response errors. Therefore, it is necessary to consider complex sampling weights to reduce the biases between the estimators and the population parameters. KNHANES provides complex sampling weights for each participant so that data researchers can correct for these biases. Previous studies have analyzed CVD risk with logistic regression using the complex sampling weights 22,23. In this study, we present a method for applying complex sampling weights to ML model-based data analysis.
We collected 97,622 samples with health questionnaire, health examination, and itvex variables (a complex sampling weight for health questionnaire and examination variables) from KNHANES 2007-2018. The itvex corresponds to the number of people that an individual represents in the population, and the sum of itvex equals the South Korean population size (49,631,475 people). Following the KNHANES data analysis guideline, we integrated the data from 2007 to 2018 and divided itvex by 11.5. Then, we duplicated the dataset in proportion to each individual's itvex. Since 49.6 million people are too many to be analyzed by ML models, we limited the total number of duplicated samples to three times the original dataset (n = 292,866). Then, those who were <19 years old (n = 49,427) and those with no response about CVD (n = 25,841) were excluded. As a result, 217,598 participants were selected as the study population. The participants were randomly divided into a training set (n = 174,078 (80%)) and a test set (n = 43,520 (20%)). As a result of the duplication, some samples were selected in both the training and test sets; we removed them before feature selection and model training (n = 83,510). The flow diagram of data preparation is shown in Fig. 1. We conducted this study using Python 3.8 (https://www.python.org/) and its compatible open-source packages.
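A rough sketch of this weight-proportional duplication is shown below, assuming a pandas DataFrame with a column named itvex; the exact rounding and capping scheme used in the study may differ from this illustration.

```python
import numpy as np
import pandas as pd

def rebalance_by_weights(df, weight_col="itvex", max_ratio=3):
    """Duplicate rows in proportion to their sampling weight so the
    result approximates a uniform sample of the population, capping
    the total size at max_ratio times the original dataset."""
    w = df[weight_col].to_numpy(dtype=float)
    target_n = max_ratio * len(df)
    # number of copies per row, proportional to weight, scaled to target_n
    copies = np.maximum(1, np.round(w / w.sum() * target_n)).astype(int)
    return df.loc[df.index.repeat(copies)].reset_index(drop=True)

# toy example: the row with weight 3 is copied about 3x more often
toy = pd.DataFrame({"age": [30, 60], "itvex": [1.0, 3.0]})
dup = rebalance_by_weights(toy)
```

On real KNHANES data, the capped total (n = 292,866) would replace the toy `target_n` here, and overlapping train/test duplicates would still need to be removed afterwards, as described above.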
Primary outcome. The primary outcome of this study was the occurrence of CVD in each participant. CVD was defined based on participants' responses to the checklist survey questionnaire, where they were asked to indicate whether they were 'Diagnosed by a doctor and currently suffering from the disease' for myocardial infarction, angina pectoris, or stroke on the KNHANES Health Survey. Participants with at least one of myocardial infarction, angina pectoris, or stroke were defined as the CVD class, and participants without any of the three diseases were defined as the non-CVD class. We refined the variables to match the input format of the ML models. Binary responses were transformed into 0/1 variables (0: negative, 1: positive). Numerical variables were redefined so that their quantitative values were consistent with their qualitative meaning, e.g., stress perception 0: Little, 1: A little, 2: Much, and 3: Very much. Then, Z-normalization was applied to the numerical variables so that the deviation of the variables did not affect classification and model interpretation. Z-normalization is computed as z = (x − µ)/σ, where x is the raw variable, µ is the mean of the variable x, and σ is the standard deviation of the variable x. As in previous medical data studies, we replaced missing values with the mode for binary data and the median for numerical data 25.
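The imputation and Z-normalization steps described above might look like the following sketch; the column names are hypothetical stand-ins, and the study's actual preprocessing code is not reproduced here.

```python
import numpy as np
import pandas as pd

def preprocess(df, binary_cols, numeric_cols):
    """Impute missing values (mode for binary, median for numeric),
    then z-normalize numeric columns: z = (x - mu) / sigma."""
    df = df.copy()
    for c in binary_cols:
        df[c] = df[c].fillna(df[c].mode().iloc[0])
    for c in numeric_cols:
        df[c] = df[c].fillna(df[c].median())
        mu, sigma = df[c].mean(), df[c].std(ddof=0)
        df[c] = (df[c] - mu) / sigma
    return df

# toy data with one missing binary and one missing numeric value
toy = pd.DataFrame({"smoker": [0, 1, 1, None],
                    "bmi": [20.0, 25.0, None, 30.0]})
clean = preprocess(toy, binary_cols=["smoker"], numeric_cols=["bmi"])
```

After this step each numeric column has mean 0 and (population) standard deviation 1, so variable scale no longer influences the classifiers or the SHAP values.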
Variable selection. Before developing the ML-based CVD classifiers, we removed multicollinearity and CVD-irrelevant variables. Multicollinearity indicates how much information is shared among the variables, and it makes it difficult to distinguish how much each variable influenced the regression or classification 26. We detected multicollinearity by calculating the VIF:

VIF_i = 1 / (1 − R_i^2), with R_i^2 = 1 − ESS/TSS, TSS = Σ (Y_i − Ȳ_i)^2, and ESS = Σ (Y_i − Ŷ_i)^2,

where the sums run over the N participants, Y_i is a value of variable i, Ŷ_i is the predicted value of the ordinary least squares (OLS) regression model trained to predict Y_i using the other variables, and Ȳ_i is the mean of variable i. TSS measures the variation of Y_i, and ESS measures the difference between Y_i and Ŷ_i (i.e., the error of the OLS). When Y_i has a linear relationship with the other variables, ESS decreases toward 0, R_i^2 increases to 1, and the VIF becomes large. Thus, a large VIF indicates that Y_i is linearly related to the other variables 27. We considered that a variable has multicollinearity if its VIF is greater than 4 28. We removed multicollinearity by sequentially excluding the variable with the largest VIF from the dataset until all variables had a VIF smaller than 4.
In addition, we applied the Boruta algorithm to further remove irrelevant variables 29. First, we concatenated the original variables and their randomly shuffled copies (shadow variables) into one dataset. Second, we trained the RF to classify CVD and measured variable importance using SHAP (SHAP is described in a following subsection). Third, we repeated this process 1000 times and counted the number of times, t_hitting, that each original variable had a greater SHAP value than the maximum-importance shadow variable (MISV). Finally, we performed a binomial test to verify whether each original variable was significantly more important than the MISV for CVD classification.
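The VIF computation and the sequential elimination loop can be sketched as follows; this is a plain-NumPy illustration of the standard VIF definition on synthetic data, not the study's own code.

```python
import numpy as np

def vif(X):
    """VIF for each column: regress it on the others with OLS and
    compute 1 / (1 - R^2), where R^2 = 1 - ESS / TSS."""
    n, m = X.shape
    out = np.empty(m)
    for i in range(m):
        y = X[:, i]
        Z = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        ess = float(np.sum((y - Z @ beta) ** 2))  # error sum of squares
        tss = float(np.sum((y - y.mean()) ** 2))  # total sum of squares
        r2 = 1.0 - ess / tss
        out[i] = 1.0 / max(1.0 - r2, 1e-12)
    return out

def vif_filter(X, names, threshold=4.0):
    """Sequentially drop the column with the largest VIF until all
    remaining VIFs are below the threshold."""
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] < threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

rng = np.random.default_rng(0)
x0, x1, x3 = rng.normal(size=(3, 200))
x2 = x0 + x1 + 0.01 * rng.normal(size=200)  # nearly collinear with x0 + x1
X = np.column_stack([x0, x1, x2, x3])
X_f, kept = vif_filter(X, ["x0", "x1", "x2", "x3"])
```

In this toy example one member of the collinear trio (x0, x1, x2) is dropped and the independent variable x3 always survives, mirroring how weight and waist circumference were removed while BMI was retained in the study.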
For the binomial test, we established the null hypothesis H_0: the expected hitting probability of an original variable is 50% (i.e., the variable and the MISV have statistically the same importance), and the alternative hypothesis H_a: the expected hitting probability of an original variable is greater than 50% (i.e., the variable is statistically more important than the MISV). We tested these hypotheses with a right-sided significance test:

p_right = Σ_{t = t_hitting}^{T} (T choose t) p^t (1 − p)^(T − t),

where p = 0.5 is the success probability in the binomial distribution, t is the number of successes, T is the total number of trials, and M is the number of variables after VIF filtering. Bonferroni correction was applied to the significance level α = 0.05 30. If p_right was lower than 0.05/M, we accepted the alternative hypothesis and used the variable as an input for the CVD classifiers. The Boruta selection implies that the selected variables are significantly more related to CVD than randomly generated variables (shadow variables) 29.
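Under these definitions, the Bonferroni-corrected right-tailed binomial test can be sketched with SciPy; the hit counts of 600 and 510 below are illustrative values, not results from the study.

```python
from scipy.stats import binom

def boruta_keep(t_hitting, trials=1000, n_vars=23, alpha=0.05):
    """Right-tailed binomial test with Bonferroni correction: keep the
    variable if P(T >= t_hitting | p = 0.5) < alpha / n_vars."""
    # survival function gives the upper-tail sum of the binomial pmf
    p_right = binom.sf(t_hitting - 1, trials, 0.5)
    return bool(p_right < alpha / n_vars)

# a variable beating the best shadow feature in 600 of 1000 repetitions
# clears the corrected threshold; one winning only 510 times does not
kept, rejected = boruta_keep(600), boruta_keep(510)
```

`binom.sf(t_hitting - 1, ...)` computes P(T ≥ t_hitting) exactly, matching the summation in the formula above.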
Training of machine learning models. After the variable selection, we trained four ML models for CVD prediction: MLP, SVM, RF, and L-GBM. We applied 5-fold cross-validation and grid search for hyperparameter tuning. Cross-validation (CV) is an efficient technique to prevent overfitting by validating the model on varying training and validation data. First, we randomly split the training set into five folds and trained the models on four of the five folds (training data). Next, we measured the area under the receiver operating characteristic curve (AUC) on the remaining fold (validation data). Then, we iterated this procedure five times while alternating the validation fold. We used the AUC averaged over the five iterations as the CV score. We found the optimal hyperparameters that maximize the CV score using grid search:

h* = argmax_{h ∈ H} f(h),

where H is the Cartesian product of the hyperparameter grids and f is the CV score.
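With scikit-learn, this 5-fold CV grid search over a Cartesian hyperparameter grid can be sketched as follows; the toy dataset and grid values are illustrative and do not reproduce the study's actual grids (Supplementary Table S4).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# toy stand-in for the training set
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# h* = argmax over the Cartesian product of the grids, scored by the
# mean AUC over 5-fold cross-validation
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
best_params = search.best_params_
```

`scoring="roc_auc"` makes the CV score f(h) the fold-averaged AUC, and `best_params_` is the maximizer h* over the grid H.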
The class proportion of the study population was extremely imbalanced (CVD:non-CVD = 1:31.23). With such an imbalanced dataset, a model would overfit the non-CVD cases and achieve low sensitivity. Therefore, we artificially balanced the ratio between CVD and non-CVD cases, making it 16.12:16.12, by using random undersampling and the synthetic minority oversampling technique (SMOTE) so that the models sufficiently learn from CVD patient cases 31. These balancing algorithms were applied only in the training process, not in validation or testing. We used Scikit-Learn (https://scikit-learn.org) and Imbalanced-Learn (https://imbalanced-learn.org) to implement the model training.
Variable importance. We conducted SHAP analysis to understand the influence of each variable on the prediction results and to identify the main variables that affect CVD prediction. SHAP is a model-agnostic feature importance method that does not tamper with the properties of the trained model. SHAP separately measures the contribution of each variable to the CVD prediction based on the Shapley value. SHAP also shows whether the risk factors positively or negatively impacted the prediction 20. The Shapley value is defined as:

v_i = Σ_{S ⊆ X \ {x_i}} [|S|! (n − |S| − 1)! / n!] [p(C | S ∪ {x_i}) − p(C | S)],

where v_i is the Shapley value of variable x_i, S is a set of variables excluding x_i, n is the total number of variables, and p(C | X) is the classifier output, i.e., the posterior probability that a participant has CVD given the set of variables X. The Shapley value quantifies the expected change in the posterior probability attributable to the variable x_i. The SHAP value approximates the Shapley value as the weight of a linear regression model. A positive SHAP value indicates that the variable contributed to classifying a participant into the CVD class; conversely, a negative value means the variable contributed to classifying a participant into the non-CVD class. We used the SHAP algorithm developed by Lundberg and Lee, publicly available at https://github.com/slundberg/shap 20.
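The Shapley formula above can be illustrated by brute-force enumeration on a toy model. This is the exact definition rather than the SHAP library's efficient approximation, and the linear "classifier" with zero-masking of absent features is an assumption made purely for the example.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, n):
    """Exact Shapley values: enumerate every subset S of the other
    features and sum the weighted marginal contributions of feature i."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                weight = (factorial(len(S)) * factorial(n - len(S) - 1)
                          / factorial(n))
                gain = predict(set(S) | {i}, x) - predict(set(S), x)
                phi[i] += weight * gain
    return phi

# toy 'classifier': a linear score where absent features are masked to 0;
# for such a model the Shapley value of feature i is exactly w[i] * x[i]
w = [0.5, -1.0, 2.0]
def predict(S, x):
    return sum(w[j] * x[j] for j in S)

phi = shapley_values(predict, [1.0, 2.0, 3.0], n=3)
```

The enumeration is exponential in the number of features, which is why the SHAP library's polynomial-time approximations are used in practice for the 23-variable models.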
Performance metrics. The proposed models predict the posterior probability that a participant has CVD, given the selected variables. Classification means that a predicted sample is discriminated as positive or negative based on a threshold value: if the prediction is greater than the threshold, it is classified as CVD; otherwise, it is classified as non-CVD. The area under the receiver operating characteristic curve (AUC) is a widely used metric for binary classification performance in medical data analysis 32. The ROC curve plots sensitivity (= True positive / (True positive + False negative)) against 1 − specificity while varying the prediction threshold from 0 to 1; therefore, the AUC is a threshold-independent performance metric. However, when using a classifier in practice, it is necessary to set a specific threshold. We found the optimal threshold (Th) that maximizes the geometric mean of sensitivity and specificity (G-mean). The G-mean is a widely used metric that represents the balanced performance of sensitivity and specificity 33. The threshold of each model was optimized by:

Th = argmax_th √(sensitivity(th) × specificity(th)).

We represented the overall performance with the mean and 95% confidence intervals (CI) over bootstrap samples 34. A graphical illustration of the study methods is shown in Fig. 2.
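The G-mean threshold search can be sketched over the ROC operating points as follows; the labels and scores are a toy example, and in the study Th was selected on validation data rather than on the test set.

```python
import numpy as np
from sklearn.metrics import roc_curve

def best_threshold(y_true, y_score):
    """Choose the threshold maximizing the G-mean,
    sqrt(sensitivity * specificity), over the ROC operating points."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    # tpr is sensitivity and 1 - fpr is specificity at each threshold
    gmeans = np.sqrt(tpr * (1.0 - fpr))
    return float(thresholds[int(np.argmax(gmeans))])

# toy scores with a clean separation at 0.6
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
th = best_threshold(y_true, y_score)
```

Because `roc_curve` already enumerates every distinct operating point, scanning its threshold array is equivalent to the argmax over th in the formula above.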

Ethics statement. Approval of the research: The institutional review board (IRB) of Kyung Hee University
Medical Center approved the study (IRB No. KHUH 2020-06-054). All methods were carried out in accordance with the KNHANES analytic guidelines and regulations. Informed consent: All subjects provided written informed consent before participating in this survey (KNHANES).

Results
Variable selection and general characteristics. This study initially extracted 34 variables related to CVD from KNHANES 24. We provide a complete list of the initial and selected variables in Supplementary Table S1. Weight, waist circumference, hematocrit, and hemoglobin were filtered out by the VIF < 4 condition. In addition, smoking amount, exercise frequency, hepaB, hypertriglyceridemia, urine ketone, urine bilirubin, and urine nitrite were identified as CVD-irrelevant variables by the Boruta algorithm. As a result, 23 variables were selected to develop the CVD classifiers. We defined the study population by applying complex sampling weights to reduce the bias in the original dataset. The general characteristics of the study population with the selected variables are presented in Table 1. Among the 217,598 participants in total, 6752 (3.1%) had CVD. Compared with participants in the non-CVD group, participants with CVD had higher age, ALT, AST, BMI, white blood cell count, and urine protein. The participants with CVD also had higher rates of anemia, diabetes mellitus, high cholesterol, hypertension, and irregular pulse than the non-CVD group. In contrast, monthly income, HBsAg, red blood cell count, drinking frequency, drinking amount, successful high school graduate status, and sex (female) were higher in the non-CVD group.
Comparison of model performance. The model with the highest specificity was best at classifying the negative class (non-CVD participants). The G-mean showed that the MLP model was the best classifier when considering the negative and positive classes together (0.792 [95% CI, 0.783-0.802]). Next, we analyzed the impact of the variable selection and class balancing approaches (i.e., SMOTE and random undersampling) on the performance of the MLP. Feature selection did not cause a significant change in overall performance, while class balancing decreased Th significantly. This result indicates that a classifier trained on imbalanced data overfits the non-CVD class. As shown in Table 2, with a threshold of 0.5, the G-mean and sensitivity decreased significantly and the specificity increased to 1 when class balancing was removed. In summary, feature selection had little effect on the performance of the models, but imbalanced data made the models overfit the non-CVD class. Supplementary Tables S2 and S3 show the complete list of classification performances. All performances were calculated with the optimal hyperparameters found by five-fold CV and grid search; we provide the optimal hyperparameters and their grids in Supplementary Table S4.
Table 1. General characteristics of the selected variables. Continuous variables are expressed as mean (standard deviation), while discrete (except binary) and binary variables are expressed as the number of participants, n (percentage, %). We calculated the P-values of continuous, discrete, and binary variables using the t-test, Mann-Whitney test, and chi-square test, respectively. A two-tailed P-value (<0.001) was considered statistically significant. In representing monthly income, 1172 KRW was converted to 1 USD.

Risk factor analysis.
SHAP measures the contribution of each variable to the posterior probability of an individual prediction. We averaged the absolute SHAP values over the test set to identify the top risk factors of CVD. We sorted the variables in descending order of importance according to the average value over the four models (i.e., the Avg+S+B column of Table 3). The results show that age was the most influential risk factor for CVD classification (0.185). The second and third were sex (0.053) and hypertension (0.053). Next, successful high school graduate status (0.020), drinking amount (0.020), BMI (0.016), and high cholesterol (0.015) followed sequentially. In contrast, anemia (0.002), urine glucose (0.002), and urine protein (0.002) had minor effects on the CVD classification. The impact of class balancing and variable selection on the SHAP analysis was significant. The SHAP values became very small when class balancing was ablated. This is because SHAP quantifies the importance of variables based on the output of the models (the predicted probability that a sample is a CVD patient); the imbalanced data lowered the output, thereby reducing the SHAP values. In addition, the results showed that without feature selection, multicollinearity occurred in the data and the major risk factors (age, sex, and hypertension) were underestimated. The results are shown in Table 3. SHAP values can be positive or negative depending on whether the variable contributed to classifying the participant into the CVD or non-CVD class. Table 4 shows the directionality of the top 10 contributing variables. Age, hypertension, BMI, high cholesterol, and stress perception level were positively correlated with CVD; the higher the values of these variables, the more likely the model is to classify the participant as a CVD patient.
On the contrary, sex (female), successful high school graduate status, drinking amount, monthly income, and drinking frequency were negatively correlated with SHAP; participants with lower values of these variables are more likely to be classified as CVD patients. We also calculated Pearson coefficients between the input variables and the outcome variable, CVD (Supplementary Table S5). The directionality analysis of SHAP and the Pearson coefficients were consistent except for the stress perception level. Stress perception had a very weak negative Pearson correlation (−0.02) with CVD, but the classifiers used stress perception as a CVD-positive variable.

Table 2. Classification performance comparing the effect of variable selection and class balancing techniques. All performance metrics are described as bootstrap mean and 95% confidence interval. Bold values indicate the best classification performance when variable selection (S) and class balancing (B) were both applied.
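One simple convention for the aggregation described above, ranking variables by mean absolute SHAP and reading the sign of the mean SHAP as a direction of effect, can be sketched as follows; the SHAP matrix is a toy example, and the study's directionality analysis may use a different sign convention.

```python
import numpy as np

def rank_risk_factors(shap_values, names):
    """Rank variables by mean |SHAP| over the test set; the sign of
    the mean SHAP serves as a crude direction-of-effect indicator."""
    importance = np.abs(shap_values).mean(axis=0)
    direction = np.sign(shap_values.mean(axis=0))
    order = np.argsort(importance)[::-1]
    return [(names[i], float(importance[i]), float(direction[i]))
            for i in order]

# toy SHAP matrix: rows = participants, columns = (age, sex, income)
sv = np.array([[0.20, -0.05, -0.01],
               [0.15, -0.04, -0.02],
               [0.25, -0.06, -0.03]])
ranking = rank_risk_factors(sv, ["age", "sex", "income"])
```

In this toy output, age ranks first with a positive direction while sex and income carry negative directions, mirroring the pattern reported in Tables 3 and 4.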

Discussion
The prevalence of CVD is continuously increasing due to the influence of westernized dietary habits and health behaviors, and the accompanying complications and mortality are also increasing 35,36. As policy interest in chronic diseases with high socioeconomic cost increases, discovering and managing people with risk factors before the onset of disease has become a major task of public health and an effective prevention method 37. Therefore, based on KNHANES, this study confirmed the prevalence of CVD in adults over 20 years of age in Korea and identified the risk factors for CVD associated with demographic, comorbid, lifestyle, and physiological factors. We also suggested correlations between the risk factors and their contributions to CVD classification. This study developed several risk assessment models to characterize the risk of CVD using ML methods. All the models achieved high predictive performance, with AUCs ranging from 0.853 to 0.862. The MLP model showed the best performance (AUC, 0.862 [95% CI 0.854-0.869]), and its performance was significantly better than the existing risk score 38. Our research shows that ML is highly effective at identifying important risk factors in large-scale epidemiological studies. Previous studies have demonstrated the important role of ML in other medical fields, such as T2DM, obesity, and heart failure 9,39,40. Our results confirm the outstanding performance of ML in CVD risk assessment. This is the first study to evaluate the importance of variables using various ML methods with KNHANES data and to examine the risk factors of CVD. Recent work has shown that machine learning improves accuracy in cardiovascular event prediction, and these methods may lead to greater insights regarding subclinical disease markers 41.
A recent development showed that an ML risk calculator outperformed the ACC/AHA risk calculator by recommending less drug therapy; additional studies are underway to validate the ML model in other cohorts and to explore its ability in short-term CVD risk prediction 42. In addition, an ML model for predicting CVD events in asymptomatic individuals was recently built using data from more than 400,000 UK participants 10. The most predictive non-laboratory variables were age, gender, smoking status, usual walking pace, self-reported overall health rating, previous diagnoses of high blood pressure, income, and parents' ages at death 10. Most of these CVD risk factors are consistent with our results. However, our results show that smoking status was not an important variable and that gender was the second most important CVD risk factor. Our results were derived more fairly because the previous study used an empirical permutation method 10, whereas we used SHAP, which has a solid theoretical foundation in game theory. A similar previous study also analyzed Korean national data and used SHAP for CVD risk factor analysis 16. However, we provide a more intuitive interpretation: we designed the ML models to output posterior probabilities of CVD, p(CVD|X), while other studies have different SHAP scales for each model. We can therefore interpret our results (Table 3) intuitively as the average change in the predicted CVD probability caused by each variable. We also further investigated the directionality between the risk factors and CVD using SHAP. Age, hypertension, and BMI positively affect CVD risk, whereas sex (female), alcohol consumption, and income had a negative effect on CVD risk.
Our risk factor analysis suggests that VIF filtering was a useful technique for removing multicollinearity and increasing the interpretability of the models. We observed that some CVD risk factors were clarified by applying the variable selection. For example, BMI has been reported as a significant risk factor for CVD 43,44. However, as shown in the Avg+B column of Table 3, the importance of BMI was underestimated at 0.003. This is because weight and waist circumference are collinear with BMI, offsetting the SHAP value of BMI. We removed those variables under the VIF < 4 condition, thereby correcting the importance of BMI to 0.016. The VIF filtering also helped correct underestimated risk factors (e.g., age, sex, and hypertension) and overestimated risk factors (e.g., successful high school graduate status and drinking amount). We conclude that removing multicollinearity based on VIF helps reduce the error of SHAP-based risk factor analysis.
Feature selection is a commonly used technique to reduce computational costs and prevent models from overfitting. Some studies analyzed the effect of filter-based feature selection, such as chi-square, ReliefF, and SU, on heart disease classification 45-47. Those studies demonstrated that variable selection improved or degraded classification performance depending on the model. Compared to these studies, we used a model-based selection algorithm, Boruta, which conducts statistical significance tests for variable selection. As shown in Table 2 and Supplementary Table S2, our feature selection also improved or degraded prediction performance depending on which model was used. However, we demonstrated that Boruta correctly distinguished irrelevant variables from CVD-relevant variables. The P-values in Table 1 show that all the selected variables had significantly different distributions between CVD patients and normal participants.
We also observed that the imbalanced data decreased sensitivity when we used default thresholds (Table 2). This is because the imbalanced data led the models to overfit the majority class (non-CVD) and significantly lowered their predicted probabilities. We solved this overfitting by applying class balancing techniques. However, if the goal is only to achieve higher sensitivity, adjusting the classification threshold is easier than applying a class balancing technique, because the sampling rate of SMOTE or undersampling is itself a hyperparameter, and tuning it is an additional burden for data scientists. In contrast, threshold adjustment requires no additional hyperparameter tuning and has a lower computational cost than the balancing techniques. The results show that we achieved high sensitivity by adjusting the threshold alone (the MLP+S row in Table 2). However, Table 3 shows that class balancing was essential to clarify the importance of the risk factors: small SHAP values made it challenging to compare which variable was more important, while class balancing magnified the SHAP values and made it easier to distinguish variable importance.
We also analyzed the effect of variable interactions on CVD prediction with SHAP. Supplementary Fig. S1 shows the interaction between age and alcohol consumption on CVD. According to the plot, for participants under 40 years of age, the SHAP values remain flat and then increase linearly from age 40. Our analysis shows that if the age factor is over 40, participants are more likely to be classified as CVD patients. However, since age is a factor beyond the clinician's control, it should be emphasized that attention should be paid to adults who can manage modifiable and important risk factors, such as alcohol consumption, BMI, and cholesterol. As shown in Supplementary Fig. S2, the effect of drinking frequency depends on age (the red dots on the plot represent the SHAP values of participants with high alcohol consumption). For participants under the age of 60, there was no relationship between alcohol consumption and the SHAP value of age. However, for participants over 60 years of age, those who drink more tend to have higher SHAP values. Controlling the drinking frequency of the population over 60 years of age could lower the prevalence of CVD. In conclusion, our analysis can provide the probability of CVD and individual risk factors for each group or patient, which can be used in the preventive management of CVD.
Limitations of this study should be addressed. First, since this is a cross-sectional study that analyzed data without a follow-up investigation of the participants, there is a limit to explaining the direction of the causal relationship between the risk factors and CVD. Second, to evaluate the generalization performance, future research should include external validation. Third, when defining CVD, the presence of disease was judged from the subjects' self-responses.
To increase the validity of the responses, we defined the presence of disease using the phrasing 'Diagnosed by a doctor and currently suffering from the disease', but some limitations in validity remain. Fourth, we could not exclude the influence of people who had already started CVD treatment. In future work, this effect could be removed by excluding these CVD patients.
In conclusion, this study investigated the prevalence of cardiovascular disease, which has a very high disease burden, based on the latest data from KNHANES (i.e., a representative sample survey for the entire Korean population). Also, this study applied ML and the model interpretation technique to identify risk factors that can be reversibly controlled to prevent CVD. Our results show that ML can correlate many variables with individual diseases based on a wide range of data. This provides integrated insights into the multivariate risk factors of CVD.

Data availability
The datasets analysed during the current study are available in the KNHANES repository, https://knhanes.kdca.go.kr/knhanes/main.do.