Machine learning-aided risk prediction for metabolic syndrome based on 3 years study

Metabolic syndrome (MetS) is a cluster of metabolic disorders that may increase the risk of diabetes, cardiovascular disease and other diseases. It is therefore of great significance to predict the onset of MetS and to identify the corresponding risk factors. In this study, we investigate risk prediction for MetS using a data set of 67,730 samples with physical examination records of three consecutive years, provided by the Department of Health Management, Nanfang Hospital, Southern Medical University, P.R. China. Specifically, the prediction uses the numerical features of the examination records together with differential features derived from the examination records of the past two consecutive years, namely, the differential numerical feature (DNF) and the differential state feature (DSF). The risk factors associated with these features are statistically analyzed w.r.t. different ages and genders. Numerical results show that the proposed DSF, in addition to the numerical features of the examination records, contributes significantly to the risk prediction of MetS. Additionally, the proposed scheme, by using the proposed features, outperforms the state-of-the-art MetS prediction models, which offers the potential for effective prescreening of the occurrence of MetS.

www.nature.com/scientificreports/
the high-risk group of MetS using a weighted radar chart, where different importance of each variable as well as continuous numerical input was considered. Machine learning has been regarded as a promising technique due to its powerful learning capability 16,17. With the help of machine learning, non-invasive indicators that require no blood drawing can be applied to predict MetS, enabling early diagnosis of MetS even in areas with poor medical conditions 18,19. Besides, this technology has enabled the prediction of MetS to be applied to some less common fields such as the metabolic spectrum 20 and FibroScan ultrasonic elastography equipment 21. The above works can achieve accurate identification of MetS. Since MetS is often accompanied by various complications 22,23, it is important to provide potential MetS patients with effective risk prediction in advance.
Empowered by machine learning, research on the risk prediction of MetS has attracted wide attention in recent years. Farzaneh et al. 24 predicted the risk of MetS after 7 years by using anthropometric indicators and some commonly used MetS-related clinical examination indicators, and concluded that TG, blood pressure (BP) and BMI are the most important risk factors. Lee et al. 25 constructed a 2-year risk prediction model of MetS and showed the relationship between weight control in different BMI groups and the reduction of the MetS predictive index (MPI) 2 years later. In 26 and 27, genetic information was considered, but the results demonstrated that diet, lifestyle and clinical information still play a leading role in the risk prediction of MetS. Based on this fact, Lee et al. 28 incorporated the "Sasang constitutional (SC) types" features, which take facial expressions and body posture into account, to achieve a long-range prediction of MetS over 14 years. Li et al. 29 studied the relationship between children's retinol binding protein 4 (RBP4) and the 10-year risk of MetS. Although the above-mentioned models demonstrated that some key clinical variables, such as TG, BP and BMI, are important for the risk prediction of MetS, the impact of the numerical and state changes of such clinical variables on MetS has not been reported yet.
To address the above issues, this paper presents a machine learning-aided longitudinal study on the risk prediction of MetS using the examination records of 67,730 individuals over three consecutive years. To be specific, in addition to the numerical features of the examination records, the numerical changes and the normal/abnormal state changes over the past two consecutive years are employed as classification features for predicting MetS in the forthcoming year. To the best of the authors' knowledge, this is the largest number of samples used for MetS risk prediction. Numerical results show that the proposed risk prediction model achieves higher performance than the state-of-the-art methods. More importantly, we show that the differential state features (DSFs) of the clinical variables TG, WC, BP and BMI, in addition to the numerical features of the examination records, have a significant impact on the risk of MetS, demonstrating that a long-term unhealthy lifestyle over 2 years, regardless of age and gender, leads to a high incidence of MetS.

Results
Performance of differential features with different classifiers. Table 1 shows the performance comparison of MetS prediction models using three different classifiers with and without the proposed differential numerical features (DNFs) and DSFs. For fairness of comparison, all examination indicators of the previous 2 years, with and without DNFs and DSFs, are considered in the experiments. A 10-fold cross-validation experiment is carried out, where the AUC is reported as mean ± standard deviation (STD), and the best performance in each metric is marked in bold. In addition, we plot the receiver operating characteristic (ROC) curves of the proposed MetS prediction model with and without the DNFs and DSFs. It can be seen from both Table 1 and Fig. 1 that the proposed MetS prediction models, with and without differential features, perform robustly, with a very small STD in terms of AUC. The result is reasonable, since the dataset employed in this work contains 67,730 individuals, which is larger than those reported in existing contributions. Furthermore, the performance using DNFs and DSFs is superior to that without differential features in terms of all metrics. This result demonstrates that the variations of examination indicators during the 2 consecutive years can be viewed as effective features for predicting MetS in the forthcoming year. In addition, XGBoost performs the best in terms of AUC, Accuracy, Precision, F1-score, Specificity and F2-score, and it yields an AUC and Accuracy of up to 0.930 and 0.849, respectively. It is worth noting that the Precision and F1-score are 0.43 and 0.58, respectively. The result is similar to that of existing studies 14,19,25,27,30 and is expected, since the number of positive samples is significantly smaller than that of negative ones. As a consequence, we select XGBoost as the classifier for the remaining experiments unless otherwise indicated.

Table 1. Results based on three models with and without differential features. The best result in each metric across the different classifiers is marked in bold. Columns: Model, Threshold, AUC, Accuracy, Precision, Recall, F1-score, Specificity, F2-score.

In order to further analyze the contribution of the top 20 features to the prediction of MetS, we provide an explainability analysis using the SHAP tool 31. As shown in Fig. 2b, among all 9 DSFs, the state changes in FGLU and TG contribute the most to the prediction of MetS. Similarly, the examination indicators of FGLU and TG are the top two features with the highest contribution to MetS. In addition to FGLU and TG, both the state changes and the examination indicators of WC and BMI are also important, suggesting that both values exceeding the normal upper limits and N2A or A2A state changes over the past 2 years could significantly increase the risk of MetS. It is also noted that the N2A and A2A state changes of HGB are important features for an increased risk of MetS, which has not been reported previously.
In view of this, we further analyze the impact on MetS of abnormality in important clinical variables and of the two differential states (N2A and A2A) of important DSFs in different gender and age groups (with age groups divided according to the World Health Organization).

Impact of important clinical variables on MetS risk in different gender and age groups. Firstly, we statistically analyze the risk of MetS in different gender and age groups. As shown in Table 2, the prevalence of MetS for both genders grows with age, and it is higher in males than in females 30, but the difference gradually narrows with age. For example, for the group aged 18-44, the prevalence of MetS in males is approximately 8 times that in females. For the elderly group aged over 60, the prevalence of MetS in males and females is comparable, i.e., 25.41% and 19.07%, respectively. The results are expected, and demonstrate that roughly 20-25% of elderly people suffer from MetS.
Then, we statistically analyze the contribution of the clinical variables in different gender and age groups by calculating the odds ratio (OR) of each feature's abnormality for MetS risk in the next year (the largest OR values in each age group are marked in bold). As can be seen from Table 3, the main risks of MetS in males aged 18-44 and 45-59 are abnormal TG and BMI. In addition, WC and FL also contribute to the risk of MetS in men under 44 years old. For males over 60, the risk of MetS, in addition to BMI, is mainly due to the abnormality of FL. Besides, the abnormalities of TG and WHR are also relatively important for this group.
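As a minimal sketch of how such an odds ratio can be computed from a 2x2 contingency table, the snippet below uses hypothetical counts that are not taken from the study. The function name `odds_ratio` and the example numbers are illustrative assumptions only.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds ratio of a 2x2 table: (a / b) / (c / d) = (a * d) / (b * c).

    a = abnormal indicator, MetS in the next year
    b = abnormal indicator, no MetS in the next year
    c = normal indicator,   MetS in the next year
    d = normal indicator,   no MetS in the next year
    """
    if exposed_noncases == 0 or unexposed_cases == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


# Hypothetical counts for one indicator in one age/gender group (not study data):
# abnormal: 120 developed MetS, 380 did not; normal: 60 developed MetS, 1440 did not.
example_or = odds_ratio(120, 380, 60, 1440)
print(f"OR = {example_or:.2f}")
```

An OR above 1 indicates that abnormality of the indicator is associated with higher odds of MetS in the following year; the same calculation would be repeated per indicator and per group.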
Interestingly, it is seen from Table 4 that the most important risk factors of MetS for females aged 18-59 are TG, BMI, FL and FGLU. As age grows, WHR, in comparison with BMI, contributes more to the risk of MetS for females aged ≥ 45. For the elderly group aged ≥ 60, the most important clinical variables are HDL-C and WHR. From the perspective of age groups, it is observed that (1) the impact of clinical variables on younger females (i.e., < 45) is more significant than that on older ones, and (2) the impact of clinical variables on the risk of MetS for females is more significant than that for males of the same age groups.
The above observations are expected and can be explained as follows. Elderly females, in comparison with younger females, generally suffer from more concomitant diseases, whose influence could dilute the contribution of any single clinical variable to the risk of MetS. Similarly, the prevalence of MetS in males is higher than that in females of the same age groups, and thus the contribution of clinical variables is less pronounced in males than in females. The results in Tables 2, 3 and 4 show that the risk of MetS in females with abnormal clinical variables is higher than that in males of the same age groups, but the actual prevalence of MetS in females is lower than in males. A potential reason is that males, in comparison with females of the same age groups, generally have more irregular diets and unhealthier lifestyles 16, such as drinking and smoking. Besides, for young and middle-aged females, the protective mechanism of estrogen 16,32 is also an important reason for the low prevalence of MetS.
Impact of important DSFs on MetS risk in different gender and age groups. Next, we statistically analyze the impact of abnormal DSFs on the MetS risk of different gender and age groups. The results are shown in Tables 5 and 6, respectively. As defined in Eq. (2), the features include N2A (the specific clinical variable became abnormal in the most recent year), A2A (the specific clinical variable was abnormal in both of the past 2 years) and N2N (the specific clinical variable was normal in both of the past 2 years). For the analysis, we evaluate the OR of the abnormal DSF states (N2A and A2A) in different gender and age groups by taking the N2N state as the control group. For ease of analysis, the two largest OR values w.r.t. N2A and A2A in each age group are marked in bold, and the OR values for which N2A is higher than A2A are underlined.
For males aged 18-44, TG and BMI in the N2A state carry a relatively high risk of MetS, and they carry the highest risk in the A2A state. In addition, all the features show that, compared with abnormality in only the most recent year, the risk for people with abnormality in both years is significantly increased. This also applies to males over 45 years old. The difference is that, with increasing age, the risk associated with BMI in the A2A state is significantly reduced, becoming even lower than that in the N2A state. FGLU shows similar characteristics in males over 60 years old. This suggests that middle-aged and elderly males may commonly have abnormal body weight, and that its contribution to MetS is relatively stable when there is no significant change in this feature. Similarly, elderly males should also be aware of significant changes in FGLU. The A2A states of TG and HGB carry the highest risks in this age group.
It can be seen from Table 6 that for females aged 18 to 44, the abnormality of TG, BMI, FGLU and FL leads to a higher risk of MetS in comparison with other clinical variables. When TG, FGLU and FL were abnormal for two consecutive years, the risk of MetS increased significantly. It is also noted that the impact of the abnormal DSFs in terms of TG, FGLU and FL on females aged 45 to 59 was similar to that of the clinical variables on females aged 18 to 44. This suggests that, benefiting from the protection of estrogen, the incidence of abnormal endocrine indicators in females ≤ 59 is lower than that in males. Meanwhile, when TG and FL are abnormal for two consecutive years, this reflects that the endocrine disorder has exceeded the body's ability to protect itself by regulating the level of estrogen, leading to a significant increase in the risk of MetS. For females aged over 60, persistent obesity (associated with the abnormalities of both WC and BMI) and abnormal FL were also important risk factors of MetS.
In summary, the results shown in both Tables 5 and 6 demonstrate that, regardless of age and gender, the abnormal clinical variables of two consecutive years lead to higher MetS risk than that of only a single year. Clearly, the results encourage people to carry out necessary measures to avoid abnormal clinical variables for two consecutive years.
Finally, Table 7 shows the comparison between the proposed MetS prediction model and the state-of-the-art studies. It can be seen from Table 7 that the proposed method, by taking advantage of the differential features of examination indicators over the past 2 consecutive years, yields the highest performance, with an AUC of up to 0.930. Moreover, it is worth noting that the dataset analyzed in this work contains 67,730 samples, which is larger than any reported to date. Such a large dataset supports the robustness of the risk prediction of MetS.

Discussion
Studies have shown that MetS is a major cause of diseases such as diabetes and CVDs. Based on a longitudinal study over three consecutive years, this paper studied MetS risk prediction by taking advantage of the examination records of the current year as well as the differential features of the past two consecutive years.
Based on the XGBoost classifier, the impact of the 10 clinical variables most important to the risk of MetS is statistically analyzed for different gender and age groups. Specific observations are summarized as follows. Due to a relatively irregular lifestyle, males suffer from a higher prevalence of MetS than females across age groups, suggesting that males should pay more attention to the risk of MetS. Thanks to the protective mechanism of estrogen, the proportion of young females with MetS is significantly lower than that of other age groups. For elderly females aged ≥ 60, the prevalence of MetS approaches that of males. As regards males, BMI 21,33 and FL 30,34,35 are critical to the risk of MetS for all age groups. In particular, the prevalence of MetS in the young group is sensitive to abnormal weight (in terms of BMI, WHR, WC and FL), suggesting that males ≤ 44 years old should pay more attention to controlling their weight and body shape. As regards females, the abnormalities of endocrine clinical variables (in terms of TG, FL and FGLU) are highly related to the prevalence of MetS, especially for the young group, i.e., females ≤ 44 years old. BMI is also important to the risk of MetS. In addition, the abnormality of WHR becomes increasingly important to the risk of MetS as age grows, suggesting that middle-aged and older females should pay more attention to changes in body shape. Owing to the interaction of concomitant diseases, the importance of clinical variable abnormalities to the risk of MetS is lower in the elderly than in the young and middle-aged groups. Furthermore, we take advantage of the DSFs w.r.t. the abnormality of clinical variables over the past 2 years, aiming to assess the relationship between the DSF of specific clinical variables and the risk of MetS. Statistical results in terms of OR values w.r.t. specific DSFs show that most of the abnormal states persisting over the past 2 years (A2A) lead to a higher risk of MetS than the abnormal states that occurred only in the most recent year (N2A). This observation suggests that any possible intervention should be carried out to prevent clinical variables from remaining abnormal for 2 consecutive years. Additionally, it is observed that an abnormality of HGB lasting for 2 consecutive years significantly increases the risk of MetS for males aged over 45. This result has not been reported previously, and may be explained by the correlation between HGB abnormalities and the occurrence of insulin resistance or MetS reported in 36,37.
Table 7. Comparison between the proposed MetS model and the state-of-the-art contributions. The best result in each metric across the different classifiers is marked in bold.
More importantly, it is noted that, for BMI and FGLU in middle and old-aged groups (i.e., aged ≥ 45), the state N2A yields a higher risk of MetS than A2A, suggesting people of such age groups with normal weight and blood glucose should pay special attention to the abnormal state changes of such clinical variables.
In conclusion, with the help of three consecutive years of physical examination records, this paper analyzed the risk of MetS in different age and gender groups by using machine learning algorithms. The statistical results relating the onset of MetS to specific clinical variables (with their corresponding state changes over the past 2 consecutive years) could help in understanding the relationship between lifestyle and the pathogenesis of MetS.
Last but not least, this study has the following limitations. Firstly, in view of the normal range of each examination indicator, the considered DNFs, which use only the numerical difference between two consecutive years, may not be sufficient without a non-uniform mapping w.r.t. the specific range. This could be a potential reason why the contributions of DNFs to the prediction of MetS are trivial. In a further study, a non-uniform mapping of DNFs w.r.t. the numerical range will be examined. In addition, all samples of the dataset in this study are from Guangdong Province, China, and thus the experimental results may have regional characteristics. Finally, since part of the indicators were recorded manually from a large number of physical examination reports, some mistakes are inevitable. To filter out the resulting outliers, we used upper and lower thresholds set by doctors according to their experience.

Methods
After desensitization, integration and cleaning, we obtained the usable structured data (537,283 records for males, and 403,899 records for females). The detailed statistical characteristics are shown in Table 8. There are 32 raw indicators collected in the examination, including anthropometry, blood parameters, other biochemical indicators, medical histories, gender and age.
The study was conducted under the approval of the Academic Committee of South China Normal University (Approval No.: SCNU-PHY-2020-063). All methods used in the study adhered to the relevant ethical guidelines and regulations (Declaration of Helsinki). All subjects signed an informed consent form before inclusion in the present study.

Longitudinal MetS risk prediction model. The risk prediction model for MetS is shown in Fig. 3 (MS_result indicates whether an individual suffers from MetS; MS_result = 0 and 1 represent the status without MetS and with MetS, respectively). Unlike the conventional methods, we take both the indicators of the current year and those of the latest record before the current year into consideration in order to obtain features of physical change over time. The prediction can be regarded as a supervised classification, where the status of suffering from MetS in the next year is labeled as "1", and the records of the current year together with the differential features extracted from the past two records serve as the model input. Thus, a sample contains three records in the model.
Since the risk prediction of MetS concerns the transition from a healthy state to MetS, the first two of the three records should be in a healthy state. Considering the timing of physical examinations (usually in the first or third quarter of a year in China), we set the maximum time interval between the first two records and the third one to 540 days.
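The sample construction described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout `(exam_date, has_mets, indicators)` and the function name `build_samples` are hypothetical, and the 540-day limit is applied here to each pair of consecutive records, which is one possible reading of the interval rule.

```python
from datetime import date

MAX_GAP_DAYS = 540  # maximum interval stated in the text


def build_samples(records):
    """Build three-record samples for one individual.

    records: list of (exam_date, has_mets, indicators) tuples (hypothetical layout).
    A sample uses two MetS-free records as input and the MetS status
    of the following record as the label.
    """
    ordered = sorted(records, key=lambda r: r[0])
    samples = []
    for prev, curr, nxt in zip(ordered, ordered[1:], ordered[2:]):
        if prev[1] or curr[1]:
            continue  # the first two records must be in a healthy state
        gap_1 = (curr[0] - prev[0]).days
        gap_2 = (nxt[0] - curr[0]).days
        if gap_1 > MAX_GAP_DAYS or gap_2 > MAX_GAP_DAYS:
            continue  # assumption: every consecutive gap must respect the limit
        samples.append({"previous": prev[2], "current": curr[2], "label": int(nxt[1])})
    return samples


# Hypothetical individual with three yearly examinations.
person = [
    (date(2017, 3, 1), False, {"TG": 1.4}),
    (date(2018, 3, 10), False, {"TG": 1.9}),
    (date(2019, 4, 1), True, {"TG": 2.3}),
]
samples = build_samples(person)
```

With more than three examinations per person, the sliding window yields one candidate sample per consecutive triple; whether the study allowed overlapping triples is not stated in the text.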
After the above processing, 67,730 usable samples were obtained, of which 7971 are samples with MetS and 59,759 are samples without MetS. Males and females account for 56% and 44% of all samples, respectively.

Differential numerical feature (DNF). The differential numerical feature can be characterized as
$$I_{\mathrm{DNF}} = I_{0} - I_{-1}, \tag{1}$$
where $I_{0}$ and $I_{-1}$ denote the values of a specific indicator $I$ in the current year and in the latest record before the current year, respectively. As a consequence, $I_{\mathrm{DNF}}$ describes the numerical difference of an indicator over the years, covering increment, decrement, invariance and missing values. This kind of feature is extracted from the indicators with numerical values, and thus 21 features are extracted.
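A minimal sketch of the DNF computation is given below, assuming that a missing value in either record yields a missing DNF. The function name and the indicator values are illustrative and are not taken from the study.

```python
def differential_numerical_feature(current, previous):
    """Return the difference I_0 - I_{-1}, or None if either value is missing."""
    if current is None or previous is None:
        return None
    return current - previous


# Hypothetical indicator values for one individual (not study data).
current_year = {"TG": 1.9, "BMI": 27.5, "WC": None}
previous_year = {"TG": 1.4, "BMI": 27.5, "WC": 88.0}

dnf = {
    name + "_DNF": differential_numerical_feature(current_year[name], previous_year[name])
    for name in current_year
}
# TG_DNF is positive (increment), BMI_DNF is zero (no change), WC_DNF is missing.
```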
Differential state feature (DSF). The DSF describes the state change of indicator $I$ over the past two examination records, and it can be characterized as
$$I_{\mathrm{DSF}} = S(I_{-1}) \rightarrow S(I_{0}), \tag{2}$$
where $S(I_{-1})$ and $S(I_{0})$ represent the state of indicator $I$ in the latest record before the current year and in the current year, respectively, each taking the value normal, abnormal or null. For all indicators except HDL-C, we set the upper limit of the clinical reference range as the threshold and regard values beyond the threshold as the "abnormal" state, since an increase in the values of these indicators is associated with the risk of MetS. Among them, we set the threshold of BMI as 28 kg/m². The "abnormal" state of HDL-C is defined as a value lower than its clinical range, since this indicator is protective against MetS.
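The state mapping and the resulting DSF labels can be sketched as follows. Only the BMI threshold of 28 kg/m² comes from the text; the TG and HDL-C limits are placeholder values chosen for illustration, and the function names are hypothetical.

```python
# Upper limits: values above them are "abnormal". Only BMI = 28 is from the text.
UPPER_LIMITS = {"BMI": 28.0, "TG": 1.7}  # the TG limit is a placeholder assumption
# Lower limits: values below them are "abnormal" (protective indicators such as HDL-C).
LOWER_LIMITS = {"HDL-C": 1.0}  # placeholder assumption


def state(name, value):
    """Map an indicator value to 'N' (normal), 'A' (abnormal) or None (missing)."""
    if value is None:
        return None
    if name in LOWER_LIMITS:
        return "A" if value < LOWER_LIMITS[name] else "N"
    return "A" if value > UPPER_LIMITS[name] else "N"


def differential_state_feature(name, previous, current):
    """Return 'N2N', 'N2A', 'A2N' or 'A2A', or None if either state is missing."""
    s_prev, s_curr = state(name, previous), state(name, current)
    if s_prev is None or s_curr is None:
        return None
    return f"{s_prev}2{s_curr}"


print(differential_state_feature("BMI", 27.0, 29.0))
print(differential_state_feature("HDL-C", 0.9, 1.2))
```

Encoding the state change as a categorical label, rather than as a numerical difference, lets the classifier distinguish a newly abnormal indicator from a persistently abnormal one even when the raw change is small.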
The status of I_DSF can be normal-to-normal (N2N, the indicator is normal in both of the past 2 years), normal-to-abnormal (N2A, the indicator became abnormal in the most recent year), abnormal-to-normal (A2N, the indicator changed from abnormal to normal) and abnormal-to-abnormal (A2A, the indicator is abnormal in both of the past 2 years).

Table 8. Basic statistical characteristics of the raw data set. Continuous indicators are expressed as mean ± standard deviation, and discrete indicators are expressed as a percentage (%). MS_result is the numerical result of MetS; the value is 0 or 1, where 0 represents no disease and 1 represents disease. A dash (-) means that less than 0.1% of the data is available due to missing or gender-specific examinations. ALT, alanine aminotransferase; AST, aspartate aminotransferase; CR, creatinine; DM_H, history of diabetes mellitus; HBA1c, hemoglobin A1c; HM, hysteromyoma; HTN_H, history of hypertension; LDL-C, low-density lipoprotein cholesterol; MGH, mammary gland hyperplasia; N, numbers; PLT, platelets; RBC, red blood cell count; SMK_H, history of smoking; TC, total cholesterol; TN, thyroid nodules; UA, uric acid; UALB, urine albumin; WBC, white blood cell count.

Dealing with missing value and normalization. A regular physical examination generally covers only a fixed subset of the items, so missing values are common in the records, which brings challenges to MetS prediction. In this study, we propose to fill the missing values of indicators based on the following criteria in terms of missing rate, data type and distribution.
• If the amount of missing values is relatively large (70% or more of the data is missing), the feature is deleted directly (in this case, the indicators HBA1c, PG and SMK_H are removed from the dataset). For features deleted due to a high missing rate, the corresponding DNF and DSF are also deleted.
After the above processing, there are 72 features in total, including 29 raw features, 19 DNFs and 24 DSFs. Finally, we apply standard deviation normalization to the features to normalize the contributions of different features to the model. Figure 4 shows the framework of the prediction model for MetS based on machine learning techniques.
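The deletion rule and the normalization step can be sketched as below. This is an illustration only: the helper names are hypothetical, the toy rows are not study data, and the use of the population standard deviation is an assumption, since the text does not specify which estimator is used.

```python
from statistics import mean, pstdev


def drop_sparse_features(rows, max_missing_rate=0.7):
    """Remove features whose missing rate is 70% or more (None marks a missing value)."""
    names = list(rows[0].keys())
    kept = [
        n for n in names
        if sum(r[n] is None for r in rows) / len(rows) < max_missing_rate
    ]
    return [{n: r[n] for n in kept} for r in rows]


def z_score(values):
    """Standardize observed values to zero mean and unit standard deviation."""
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), pstdev(observed)  # assumption: population std
    if sigma == 0:
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mu) / sigma for v in values]


# Toy records (not study data): HBA1c is missing in 3 of 4 rows and is dropped.
rows = [
    {"TG": 1.0, "HBA1c": None},
    {"TG": 2.0, "HBA1c": None},
    {"TG": 3.0, "HBA1c": None},
    {"TG": None, "HBA1c": 5.5},
]
filtered = drop_sparse_features(rows)
tg_scaled = z_score([r["TG"] for r in filtered])
```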

Experimental setup. In the experiments, the data are randomly divided into a training set and a test set at a ratio of 7 to 3. In order to validate the generalization ability of the model, the age and gender distributions of the samples in the test set are kept at the same level as those in the training set.
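One way to keep age and gender at the same level in both sets is a stratified random split, sketched below. Stratification is an assumption here, since the text does not state how the balance was achieved; the age groups follow the 18-44, 45-59 and ≥ 60 grouping used in the Results, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def age_group(age):
    """Age groups as used in the Results section."""
    if age < 45:
        return "18-44"
    if age < 60:
        return "45-59"
    return ">=60"


def stratified_split(samples, test_ratio=0.3, seed=0):
    """Split samples 7:3 within each (gender, age group) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for sample in samples:
        strata[(sample["gender"], age_group(sample["age"]))].append(sample)
    train, test = [], []
    for members in strata.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_ratio)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy population (not study data): 10 samples in each of the 6 strata.
toy = [{"gender": g, "age": a} for g in "MF" for a in (30, 50, 70) for _ in range(10)]
train_set, test_set = stratified_split(toy)
```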
We use three commonly used decision tree-based ensemble classification algorithms, namely, Random Forest (criterion = 'entropy', max_depth = 8, max_features = 'sqrt', n_estimators = 500), XGBoost (max_depth = 4, n_estimators = 500, learning_rate = 0.03, colsample_bytree = 0.5) and Stacking (a combination of the above two algorithms), to perform the prediction of MetS. Since the classifiers output probabilities, a probability threshold must be set for the final decision. In the experiments, the maximum Youden index criterion is employed to determine the optimal threshold.
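The maximum Youden index criterion selects the threshold that maximizes J = Sensitivity + Specificity - 1. A minimal sketch is shown below, assuming binary labels and predicted probabilities; the function name and the toy scores are illustrative and not taken from the study.

```python
def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A sample is predicted positive when its score is >= threshold.
    Requires at least one positive and one negative label.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy labels and predicted probabilities (not study data).
labels = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9]
threshold, j_value = youden_threshold(labels, scores)
```

Because positive samples are much rarer than negative ones in this dataset, a threshold chosen this way weights sensitivity and specificity equally instead of favoring the majority class, which is consistent with high Recall alongside moderate Precision.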
For measurement, we assess the performance of the proposed MetS prediction model by employing Accuracy, Precision, Recall (Sensitivity), Specificity, F1-score, F2-score (it favors Recall over Precision), which are given as