Development and Validation of Prediction Model for Risk Reduction of Metabolic Syndrome by Body Weight Control: A Prospective Population-based Study

Several studies have reported that weight control is of paramount importance in reducing the risk of metabolic syndrome. Nevertheless, this well-known association does not provide any practical information on how much weight loss in a given period would reduce the risk of metabolic syndrome in individuals in a personalized setting. This study aimed to develop and validate a risk prediction model for metabolic syndrome in 2 years, based on an individual’s baseline health status and body weight after 2 years. We recruited 3,447 and 3,874 participants from the Ansan and Anseong cohorts of the Korean Genome and Epidemiology Study, respectively. Among the former, 8,636 longitudinal observations of 2,412 participants (70%) and 3,570 of 1,034 (30%) were used for training and internal validation, respectively. Among the latter, all 15,739 observations of 3,874 participants were used for external validation. Compared to logistic regression, Gaussian Naïve Bayes, random forest, and deep neural network, XGBoost showed the highest performance (area under the curve of 0.879) and a significantly enhanced calibration of the predictive score with the prevalence rate. The model was ported onto an application to provide the 2-year probability of developing metabolic syndrome by simulating selected target body weights, based on an individual’s baseline health profile. Further prospective studies are required to determine whether weight-control programs could lead to favorable health outcomes.

www.nature.com/scientificreports

We found that a prediction model for reducing the risk of metabolic syndrome could be developed with a large-scale analysis of repeatedly measured data derived from a population-based data source. The aim of this study was to develop and validate a model that predicts 2-year metabolic syndrome based on an individual's baseline health status and body weight after 2 years.

Results
Participants. In total, 12,206 eligible consecutive visit-pairs of 3,447 participants and 15,739 visit-pairs of 3,874 participants were extracted from the Ansan cohort, which represents an industrialized community, and the Anseong cohort, which represents a rural area, respectively (Fig. 1) 19. From the former, 8,636 visit-pairs of 2,412 participants (70% of 3,447 participants) were used for training, and 3,570 visit-pairs of 1,034 participants (30%) were used for internal validation of the models. From the latter, all 15,739 visit-pairs of 3,874 participants (100%) were used for external validation of the models. The baseline characteristics of the participants at the initial visit are summarized in Table 1.
Figure 1. Participant selection flow diagram. The community-based Korean Genome and Epidemiology Study (KoGES) was our data source. Among its two sub-cohorts, 12,206 eligible consecutive visit-pairs of 3,447 participants were established from the Ansan cohort, which represents an industrialized community. Of them, visit-pairs of 2,412 participants (70%) were used for training, and those of 1,034 participants (30%) were used for internal validation of the model. In addition, 15,739 visit-pairs of 3,874 participants from the Anseong cohort, which represents a rural area, were used for external validation of the model.
Model development and validation. We fitted several models using logistic regression, Gaussian Naïve Bayes 20, random forest 21, XGBoost 22, and deep neural networks 23. The results of internal and external validation for the trained models are shown in Fig. 2. The area under the receiver operating characteristic curve (AUROC) values of the machine learning-based models were slightly greater than those of the logistic regression model. The performance metrics and confusion matrices at the optimal operating point are summarized in Table 2 and Supplementary Table S1, respectively. The XGBoost-based model was selected as our final model because it consistently showed the best performance in both internal and external validation. The result was consistent in two sensitivity analyses, in which (1) the combined dataset from the two regions was used for training and validation of the model (Supplementary Fig. S1) and (2) a 4-year prediction model was established (Supplementary Fig. S2). For comparison of performance, the logistic regression-based model was set as a control to represent a conventional statistical approach in the further analyses.
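As an illustration of the AUROC comparison described above, the following minimal sketch computes the AUROC as the fraction of positive-negative pairs in which the positive case receives the higher score. The function name, the two "models", and the toy scores are hypothetical and are not study data; the analysis itself relied on scikit-learn.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of positives and negatives (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and scores from two hypothetical models (assumed values, not study data).
labels = [0, 0, 1, 0, 1, 1, 1, 1]
model_a = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
model_b = [0.3, 0.5, 0.4, 0.6, 0.2, 0.7, 0.8, 0.9]

for name, scores in [("model_a", model_a), ("model_b", model_b)]:
    print(f"{name}: AUROC = {auroc(scores, labels):.3f}")
```

The pairwise form is quadratic in the number of observations, so it is suited to illustration only; a rank-based implementation would be preferable for cohort-sized data.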
Metabolic Syndrome Prediction Index. The Metabolic Syndrome Prediction Index (MPI_Loss, where Loss is the amount of loss in body weight), which represents how likely an individual is to develop metabolic syndrome after 2 years with a target body weight, could be drawn from the model. Figure 3 shows illustrative examples of the application of MPI_Loss in three individuals with different body mass indices. In a normal-weight participant (body mass index [BMI] of 21.2 kg/m², Fig. 3a), the trends in MPI_Loss estimated by XGBoost and logistic regression were nearly identical. However, the slopes showed some differences in an overweight participant (BMI of 31.2 kg/m², Fig. 3b) and an underweight participant (BMI of 17.6 kg/m², Fig. 3c). Moreover, unlike that estimated by logistic regression, MPI_Loss estimated by the XGBoost-based model did not change further with extensive weight gain or loss in such participants. Figure 4 shows the MPI_Loss estimated from each model and the actual prevalence rate of metabolic syndrome after 2 years in the cohort. Although a linear association was observed for both models, the Pearson correlation coefficient between MPI_Loss and the prevalence rate was significantly greater for XGBoost (Fig. 4a) than for logistic regression (Fig. 4b) in both the internal and external validation cohorts. Therefore, we determined that the XGBoost-based model not only predicts disease status accurately but also yields well-calibrated outputs representing the 2-year probability of developing metabolic syndrome.
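The calibration comparison between the predicted index and the observed prevalence can be sketched as follows. This is a minimal illustration under stated assumptions: the equal-width binning, the number of bins, the function names, and the toy data are all hypothetical, since the text does not state how the scores were grouped.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def binned_calibration(scores, labels, n_bins=10):
    """Group scores into equal-width bins; return mean score and prevalence per bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 falls in the last bin
        bins[idx].append((s, y))
    mean_scores, prevalences = [], []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_scores.append(sum(s for s, _ in members) / len(members))
        prevalences.append(sum(y for _, y in members) / len(members))
    return mean_scores, prevalences


# Toy, perfectly calibrated data (assumed): 10 observations per score level.
scores, labels = [], []
for score, n_pos in [(0.1, 1), (0.3, 3), (0.5, 5), (0.7, 7), (0.9, 9)]:
    for i in range(10):
        scores.append(score)
        labels.append(1 if i < n_pos else 0)

mean_scores, prevalences = binned_calibration(scores, labels, n_bins=5)
r = pearson(mean_scores, prevalences)
print(f"Pearson r between predicted score and prevalence: {r:.3f}")
```

A coefficient close to 1 indicates that higher predicted scores track higher observed prevalence; it does not by itself show that the scores equal the prevalence, which would require inspecting the per-bin values.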

The final model was ported onto an in-house web application using the Flask Python library (https://github.com/pallets/flask). When provided with the baseline health profile of a user, the program calculates the probabilities of developing metabolic syndrome after 2 years by running simulations with selected target body weights (Supplementary Fig. S2). We provide a packaged version of the final model and simplified code for its implementation (see Data Availability in the Materials and Methods section).
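The simulation over target body weights can be sketched as below. This is only an illustration of the idea: `toy_model` is a hypothetical logistic stand-in with arbitrary coefficients, not the trained XGBoost model, and the profile fields and function names are assumptions.

```python
from math import exp


def toy_model(profile, weight_after_kg):
    """Hypothetical stand-in for a trained classifier (arbitrary coefficients)."""
    bmi_after = weight_after_kg / (profile["height_m"] ** 2)
    z = -8.0 + 0.25 * bmi_after + 0.02 * profile["age"]
    return 1.0 / (1.0 + exp(-z))


def simulate_target_weights(profile, predict, losses_kg):
    """Return (loss, target weight, predicted probability) for each candidate loss."""
    results = []
    for loss in losses_kg:
        target = profile["weight_kg"] - loss
        results.append((loss, target, predict(profile, target)))
    return results


# Assumed baseline profile, for illustration only.
profile = {"age": 50, "height_m": 1.70, "weight_kg": 80.0}
curve = simulate_target_weights(profile, toy_model, [0, 2, 4, 6, 8, 10])

for loss, target, prob in curve:
    print(f"loss {loss:>2} kg -> target {target:.1f} kg, predicted probability {prob:.3f}")
```

Passing the predictor as an argument keeps the simulation loop independent of the model type, so the same loop could wrap any classifier that returns a probability.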

Discussion
In this study, we developed and validated a prediction model for metabolic syndrome status after 2 years with an AUROC of over 0.850. In addition, we established an individualized program that can promote weight control by presenting reliable probabilities for having metabolic syndrome.
Metabolic syndrome is a dynamic status 11. In terms of preventing major adverse cardiovascular events (MACE) and the associated mortality 24,25, it would be beneficial if an appropriate program for the prevention or treatment of metabolic syndrome could be provided 26-28. Nevertheless, although several studies have reported that body weight is closely associated with metabolic syndrome, they did not provide a practical goal for weight control and risk reduction in a personalized setting. Setting a realistic goal has significant benefits in health risk reduction 29,30. Therefore, rather than merely revealing a single statistical estimate of how much the risk can be reduced by a unit of weight reduction, our model can present a realistic probability of risk reduction in an individual with body weight control. In addition, the estimates of our model are based on an individual's baseline health profile. The baseline risk and the benefits from weight control would be different even among individuals with similar body weight if they have distinct risk factors or components of metabolic syndrome 16-18.
Our study excluded elderly individuals and patients with a history of MACE or malignant diseases from the analysis for the following reasons: (1) to reduce the potential bias arising from reverse causation, because weight loss may be the consequence of severe diseases; (2) metabolic syndrome is one of the important risk factors and strongest predictors of the development of major cardiovascular diseases; and (3) weight reduction is generally not recommended in elderly individuals because being overweight is minimally associated with high mortality in this population 31. However, the risk of metabolic syndrome is considerably greater in this population than in the normal population. Therefore, the application of these exclusion criteria may have caused an overall reduction in the prevalence of metabolic syndrome among the study participants compared to that in the real world. Consequently, the predictive performance of the model could be affected, since the exclusion criteria can worsen the class imbalance between the participants with and without metabolic syndrome.
Along with logistic regression as a conventional statistical approach, we evaluated multiple machine learning-based models. Although the machine learning-based models showed slightly better performance than the logistic regression model, the difference was not considerable (Fig. 2). There may be two reasons behind this observation: (i) The data derived from the cohort database were "typical data." In general, machine learning methods benefit more from "atypical data" (e.g., unprocessed natural language, network, image, and signal data) that can rarely be analyzed by conventional statistical approaches. (ii) Among hundreds of variables, a few clinically important variables were selected as input variables 32. Since this existing knowledge was based on the conventional statistical approach, machine learning could not outperform logistic regression considerably.
The XGBoost-based model had the greatest AUROC in both internal and external validation. Unlike logistic regression, which assumes log-linearity across all body weights and an equal effect size across individuals, the XGBoost-based model showed a different trend in MPI_Loss, especially in overweight or underweight participants (Fig. 3b,c). Since weight gain or loss would have different effects according to individuals' baseline health status, this difference could have contributed to some improvement in prediction (e.g., weight loss in underweight individuals would have no or minimal effect on further risk reduction). In addition, MPI_Loss calculated by XGBoost showed better calibration with the actual prevalence rate. This allowed the model to provide a reliable 2-year probability of developing metabolic syndrome, reflecting real-world data.
The importance of lifestyle modification, alongside medical intervention with medication and procedures, has been greatly emphasized in the management of diverse chronic diseases. With the development of digital healthcare, medicine, and therapeutics, patients' biometrics are periodically being measured outside the clinic and are increasingly being utilized as adjuvant information in clinical practice. Using our model, patients can be shown an objective and reliable estimate of the effect of weight loss on the risk of metabolic syndrome, based on each patient's own cardiovascular risk profile and goal for weight control. Although this study did not validate whether the use of our model would achieve a significantly higher rate of weight reduction than merely emphasizing the need for weight control in the clinic, the model may be expected to improve compliance.
Our study has some limitations. The study population consisted of participants of a single ethnicity (Asian) and nationality. Further studies are required to determine whether the model can show consistent performance in individuals with different biological and cultural backgrounds. In addition, since the prevalence of metabolic syndrome was low in the study population, a class imbalance in the dataset could have impaired the performance of the models 33. Nevertheless, this study is advantageous since it established a model that can present a well-calibrated probability that would have a practical meaning for the users, which has rarely been reported. Moreover, this study achieved considerable generalizability in that the model showed consistent performance not only in internal validation (in an industrialized community) but also in external validation (in a rural area). Further prospective studies are required to determine whether the use of the model-based weight-control program could lead to improved health outcomes.

Materials and Methods
Data Source and Study Approval. The Korean Genome and Epidemiology Study (KoGES) is a prospective population-based cohort launched in 2001 in South Korea 19. The community-based KoGES consists of two sub-cohorts: one based on the Ansan region, representing an industrialized community (5,012 participants), and the other based on the Anseong region, representing a rural area (5,018 participants). The participants of both cohorts have been followed up every 2 years. The health check-up and measurements of biomarkers are carried out at each visit to identify risk factors for the development of chronic diseases, such as lifestyle (e.g., alcohol intake, smoking, and exercise), diet profile, and diverse environmental factors. The study was based on data from up to seven repeatedly measured datasets from baseline to 2014 over a 14-year period in the two cohorts. Demographic information was collected at the baseline and follow-up examinations using a standard questionnaire that was administered during face-to-face interviews. The study was conducted in accordance with the Declaration of Helsinki. Informed written consent was obtained from all participants. The study was approved by the institutional review board of Yonsei University Wonju Severance Christian Hospital (CR105024).
Validation of Data Source. The prediction model would be reliable only if it is trained and validated on data that are representative of the general population. We validated the representativeness of our dataset through additional analyses of the National Health Insurance Service of Korea-National Sample Cohort (NHIS-NSC) as a reference cohort 34, which includes approximately 1 million individuals (2% of the total South Korean population). South Korea has a single universal health coverage system providing insurance to over 99% of the South Korean population. Since 2014, the NHIS has made the Bigdata Sharing Service available to researchers; the database includes information recorded since 2002. The health examination results were collected from the general health examination database. This examination is offered annually or biennially to all employees, householders, or any citizen aged 40 years or older.
The data source was validated as follows: (1) Table S3). A separate validation was required because the lipid profile (e.g., total cholesterol, triglyceride, and high- and low-density lipoprotein cholesterol) has been included in the general health examination only since 2009. As a result, we determined that our data source can represent the general population appropriately since the baseline demographics and metabolic syndrome risk profile of the study population did not considerably differ from those of the nationwide cohort database.
Data conditioning. We aimed to develop a model that predicts the likelihood of an individual developing metabolic syndrome after 2 years according to weight changes during that period. Therefore, a visit-pair was constructed from the health status at baseline, the body weight after 2 years, and the metabolic syndrome status after 2 years (Supplementary Fig. S4). The following baseline characteristics were used as input variables: age, sex, height, weight, alcohol intake (no/current) and amount, smoking status (never/ever/current) and pack-years, systolic and diastolic blood pressure, waist circumference, fasting glucose level, triglyceride level, total cholesterol level, and high-density lipoprotein cholesterol level. The use of anti-hypertensive, anti-glycemic, or lipid-lowering agents was also included. The body weight measured after 2 years served as another input variable. The metabolic syndrome status after 2 years was the prediction target of dichotomous classifiers (labeled as 0 for no and 1 for yes). Metabolic syndrome was determined according to the definition described in the next section. Any records with a baseline age of 65 years or above or a self-reported history of MACE or malignant diseases were excluded to reduce potential bias caused by reverse causation. Incomplete records with any missing values for the input variables were also excluded. All visit-pair records were established using only the remaining consecutive records collected every 2 years.
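The construction of visit-pairs described above can be sketched as follows. This is a simplified illustration, not the study code: the record fields (`pid`, `year`, `age`, `weight_kg`, `history`, `mets`), the abbreviated input list, and the choice to apply the age and history exclusions to the baseline record of each pair are all assumptions.

```python
INPUT_FIELDS = ("age", "weight_kg")  # abbreviated, hypothetical list of input variables


def build_visit_pairs(records, interval_years=2, max_age=65):
    """Pair each baseline record with the same participant's record 2 years later."""
    by_participant = {}
    for rec in records:
        by_participant.setdefault(rec["pid"], []).append(rec)

    pairs = []
    for pid, visits in by_participant.items():
        visits = sorted(visits, key=lambda r: r["year"])
        for base, follow in zip(visits, visits[1:]):
            if follow["year"] - base["year"] != interval_years:
                continue  # keep only consecutive 2-year visits
            if any(base[f] is None for f in INPUT_FIELDS):
                continue  # incomplete baseline record
            if follow["weight_kg"] is None or follow["mets"] is None:
                continue  # missing follow-up weight or outcome
            if base["age"] >= max_age or base["history"]:
                continue  # exclusion criteria applied at baseline (assumption)
            features = {f: base[f] for f in INPUT_FIELDS}
            features["weight_after_kg"] = follow["weight_kg"]
            pairs.append({"pid": pid, "baseline_year": base["year"],
                          "features": features, "label": follow["mets"]})
    return pairs


def rec(pid, year, age, weight_kg, mets, history=False):
    return {"pid": pid, "year": year, "age": age,
            "weight_kg": weight_kg, "mets": mets, "history": history}


# Toy records (assumed values, not study data).
records = [
    rec(1, 2001, 50, 70.0, 0), rec(1, 2003, 52, 68.0, 0), rec(1, 2005, 54, 72.0, 1),
    rec(2, 2001, 45, 80.0, 0), rec(2, 2005, 49, 82.0, 1),   # 4-year gap: no pair
    rec(3, 2001, 66, 60.0, 0), rec(3, 2003, 68, 61.0, 0),   # baseline age >= 65
    rec(4, 2001, 40, None, 0), rec(4, 2003, 42, 75.0, 0),   # missing baseline weight
]
pairs = build_visit_pairs(records)
print(len(pairs), "visit-pairs")
```

Because one participant can contribute several visit-pairs, the participant identifier is carried along with each pair so that later partitioning can be done per participant.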

Definition of Metabolic Syndrome. According to the National Cholesterol Education Program Adult Treatment Panel 35, metabolic syndrome can be confirmed when three or more of the following components are present: increased waist circumference (≥90 cm for Asian men and ≥80 cm for Asian women), elevated triglyceride level (≥150 mg/dL) or use of a lipid-lowering agent, reduced high-density lipoprotein cholesterol level (≤40 mg/dL for men and ≤50 mg/dL for women), elevated blood pressure (systolic blood pressure ≥130 mmHg or diastolic blood pressure ≥80 mmHg) or use of an antihypertensive agent, and elevated fasting glucose level (≥100 mg/dL) or use of a glucose-lowering agent.
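The definition above translates directly into a counting rule. The sketch below encodes the thresholds exactly as stated in the preceding paragraph; the function and parameter names are hypothetical.

```python
def count_components(sex, waist_cm, triglycerides, hdl, sbp, dbp, glucose,
                     lipid_drug=False, bp_drug=False, glucose_drug=False):
    """Count metabolic syndrome components using the thresholds stated in the text."""
    male = sex == "M"
    components = [
        waist_cm >= (90 if male else 80),      # waist circumference, cm
        triglycerides >= 150 or lipid_drug,    # triglycerides, mg/dL
        hdl <= (40 if male else 50),           # HDL cholesterol, mg/dL
        sbp >= 130 or dbp >= 80 or bp_drug,    # blood pressure, mmHg
        glucose >= 100 or glucose_drug,        # fasting glucose, mg/dL
    ]
    return sum(components)


def has_metabolic_syndrome(**measurements):
    """Metabolic syndrome is present when three or more components are met."""
    return count_components(**measurements) >= 3


# Toy examples (assumed values, not study data).
man = dict(sex="M", waist_cm=95, triglycerides=160, hdl=45,
           sbp=135, dbp=75, glucose=90)
woman = dict(sex="F", waist_cm=75, triglycerides=100, hdl=60,
             sbp=120, dbp=70, glucose=95)
print(has_metabolic_syndrome(**man), has_metabolic_syndrome(**woman))
```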
Dataset, Model Training, and Validation. We used the Ansan cohort for development and internal validation of the model, and the Anseong cohort for external validation. To avoid overestimation of the predictive performance, all data partitioning was done on a per-participant basis. For the Ansan cohort, 70% of the unique participants were allocated to the training cohort and 30% to the internal validation cohort. The cohort assignment was based on a pseudorandom number generator without any stratifying or matching variables. The visit-pairs of the training cohort were then used to determine the optimized parameters to predict the outcome; those of the internal and external validation cohorts were used to evaluate the performance of the trained model in a homogeneous and heterogeneous setting, respectively. There was no overlap between the cohorts.
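The per-participant partitioning can be sketched as below: participants, not visit-pairs, are shuffled and split, so that all visit-pairs of one person fall into the same cohort. The function name, the seed, and the toy data are assumptions; the text specifies only a 70/30 split by a pseudorandom number generator without stratification.

```python
import random


def split_by_participant(pairs, train_fraction=0.7, seed=42):
    """Split visit-pairs so that each participant appears in exactly one cohort."""
    ids = sorted({p["pid"] for p in pairs})
    rng = random.Random(seed)  # seed value is an arbitrary assumption
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    train_ids = set(ids[:n_train])
    train = [p for p in pairs if p["pid"] in train_ids]
    valid = [p for p in pairs if p["pid"] not in train_ids]
    return train, valid


# Toy data (assumed): 10 participants with 3 visit-pairs each.
pairs = [{"pid": pid, "visit": v} for pid in range(10) for v in range(3)]
train, valid = split_by_participant(pairs)

train_ids = {p["pid"] for p in train}
valid_ids = {p["pid"] for p in valid}
print(len(train_ids), "training participants;", len(valid_ids), "validation participants")
```

Splitting at the visit-pair level instead would let the same person appear in both cohorts, which is the leakage that the per-participant rule is meant to prevent.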
We considered logistic regression, Gaussian Naïve Bayes 20, random forest 21, XGBoost 22, and deep neural network 23 as potential candidates for developing our model. In the logistic regression, all variables were entered at once without any variable-selection algorithm. No significant collinearity among the input variables was detected. The Naïve Bayes classifier 20 is a probabilistic classifier and among the simplest Bayesian network models. The random forest 21 and XGBoost 22 are machine learning algorithms based on a combination of decision trees and a gradient-boosting framework, respectively. The two algorithms were optimized with grid searches 36 and trained for 500 epochs with 5-fold cross-validation. A deep neural network 23 is a method in which complex hierarchical representations are learned with multiple levels of abstraction. A fully connected multilayer perceptron was used. There were two hidden layers with 200 nodes. Each layer included the rectified linear unit function for non-linear activation. The loss function was binary cross-entropy and the optimizer was Adam with a learning rate of 5 × 10⁻³. In all four algorithms, the output was transformed to have a numeric value between 0 and 1, representing the confidence score for metabolic syndrome after 2 years, referred to as MPI_Loss.
Statistical Analysis. The baseline characteristics at study entry were summarized using the mean with standard deviation and frequency with percentage, as appropriate. The AUROC was the primary measure of model performance. The optimal operating point was determined as the point at which the Youden index was maximized 37. The accuracy, sensitivity, specificity, positive predictive value, and negative predictive value were also calculated. A 2-tailed P value of <0.05 was used to determine statistical significance. All statistical analyses and the development and validation of the model were carried out with Python version 3.7.0 with the pandas (http://pandas.pydata.org) and scikit-learn (http://scikit-learn.org) libraries.
Sensitivity Analysis. Two sensitivity analyses were conducted to ensure the robustness of our model. (1) Rather than performing external validation with a separate dataset, the combined dataset from the two regional cohorts was used for model training and validation. This was done to prevent over-fitting due to under-representation of certain populations within a single regional cohort. (2) The study was repeated with visit-pairs established from consecutive records taken every 4 years, instead of 2 years. Therefore, this model predicted the metabolic syndrome status after 4 years. However, a weight-control program is practically feasible when it can present a short-term goal. Therefore, the 2-year prediction model was retained as our main model.
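The optimal operating point described under Statistical Analysis can be illustrated with a minimal sketch that scans candidate thresholds and keeps the one maximizing the Youden index J = sensitivity + specificity − 1. The function name, the use of observed scores as candidate thresholds, and the toy data are assumptions for illustration.

```python
def youden_optimal_threshold(scores, labels):
    """Return the threshold (score >= threshold is positive) that maximizes Youden's J."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
        sensitivity = tp / n_pos
        specificity = tn / n_neg
        j = sensitivity + specificity - 1
        if best is None or j > best["youden_j"]:
            best = {"threshold": threshold, "sensitivity": sensitivity,
                    "specificity": specificity, "youden_j": j}
    return best


# Toy scores and labels (assumed values, not study data).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1, 1]

best = youden_optimal_threshold(scores, labels)
print(best)
```

Once the threshold is fixed, the accuracy, positive predictive value, and negative predictive value follow from the confusion matrix at that threshold.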