Development and validation of a new diabetes index for the risk classification of present and new-onset diabetes: multicohort study

In this study, we aimed to propose a novel diabetes index for the risk classification based on machine learning techniques with a high accuracy for diabetes mellitus. Upon analyzing their demographic and biochemical data, we classified the 2013–16 Korea National Health and Nutrition Examination Survey (KNHANES), the 2017–18 KNHANES, and the Korean Genome and Epidemiology Study (KoGES), as the derivation, internal validation, and external validation sets, respectively. We constructed a new diabetes index using logistic regression (LR) and calculated the probability of diabetes in the validation sets. We used the area under the receiver operating characteristic curve (AUROC) and Cox regression analysis to measure the performance of the internal and external validation sets, respectively. We constructed a gender-specific diabetes prediction model, having a resultant AUROC of 0.93 and 0.94 for men and women, respectively. Based on this probability, we classified participants into five groups and analyzed cumulative incidence from the KoGES dataset. Group 5 demonstrated significantly worse outcomes than those in other groups. Our novel model for predicting diabetes, based on two large-scale population-based cohort studies, showed high sensitivity and selectivity. Therefore, our diabetes index can be used to classify individuals at high risk of diabetes.

Diabetes mellitus is a chronic metabolic disorder characterized by disrupted glucose homeostasis, resulting from increased insulin resistance and/or impaired insulin secretion. People with diabetes mellitus are predisposed to metabolic disorders, such as cardiovascular disease (CVD), which affects 32.2% of all people with diabetes mellitus globally. Moreover, their complications are leading causes of morbidity and mortality 1,2 . The prevalence and socioeconomic burden of diabetes are rapidly increasing worldwide. Approximately 1 in 11 adults have diabetes, and 90% of people with diabetes have type 2 diabetes mellitus 2 .
Previous large-scale studies suggest that diet and lifestyle modifications can prevent or delay the development of diabetes mellitus in high-risk individuals 2,3 . The Diabetes Prevention Program conducted in the United States reported that lifestyle modification reduced the incidence of diabetes mellitus by 58% compared with control after a 2.8-year mean follow-up 4 . Toshikazu et al. also demonstrated that lifestyle modification reduced the overall relative risk of diabetes mellitus by 44.1% in Japan 5 . Clinical studies conducted in China 6 and India 7 have reported 42% and 38% risk reductions, respectively. Therefore, developing risk prediction models for diabetes mellitus and identifying high-risk individuals have become a challenging issue in clinical research. To explore the risk factors and formulate predictive models for diabetes development, machine learning techniques have been widely used 8 . These methods help researchers discover unknown significant figures and solve scientific problems from large quantities of datasets 9,10 . In the fields of medical science and healthcare, machine learning provides useful classification and prediction models with high accuracy 11
Table 1), in 20 variables present in both KNHANES and KoGES. Table 2 displays the selection process by means of univariate LR in men and women, respectively. All 20 features from Model 1 were selected as candidate variables for univariate analysis in Model 2. By means of multivariate analysis (Models 2 and 3), we identified 16 and 18 variables as diabetes risk factors to be used as the input features for formulating the classification model in men and women, respectively. Thereafter, based on these variables, we generated a gender-specific diabetes classification model using LR. Note that the feature selection and the formulation of the prediction model were conducted using only the derivation dataset.
We used this gender-specific diabetes classification model to calculate the probabilities of diabetes in subjects from the internal validation dataset. The area under the receiver operating characteristic curve (AUROC) was 0.941 and 0.939 in men and women, respectively (Fig. 1). The area under the precision-recall (PR) curve was 0.475 and 0.381 in men and women, respectively (Fig. 1). Moreover, we evaluated model calibration, that is, the agreement between observed and predicted probabilities, using the val.prob function in the rms package. According to the Spiegelhalter Z-test and its two-tailed p-values, the classification model for women was well calibrated, whereas the model for men was not (S:p for men: 0.008; S:p for women: 0.588, Supplementary Fig. 2).
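The calibration check described above can be illustrated with a minimal sketch of Spiegelhalter's Z statistic in its commonly cited form. This is not the rms implementation; the function name `spiegelhalter_z` and the toy inputs are assumptions made for illustration only, and survey weights are ignored.

```python
import math


def spiegelhalter_z(y_true, p_pred):
    """Return (z, two_tailed_p) for Spiegelhalter's calibration test.

    y_true: observed outcomes coded 0/1.
    p_pred: predicted probabilities in (0, 1).
    """
    # Numerator: sum of (y - p)(1 - 2p) over all subjects.
    numerator = sum((y - p) * (1.0 - 2.0 * p) for y, p in zip(y_true, p_pred))
    # Variance of the numerator under perfect calibration.
    variance = sum((1.0 - 2.0 * p) ** 2 * p * (1.0 - p) for p in p_pred)
    if variance == 0.0:
        raise ValueError("statistic is undefined when every prediction is 0, 0.5, or 1")
    z = numerator / math.sqrt(variance)
    # Two-tailed p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Toy data (hypothetical, not from the study).
z, p = spiegelhalter_z([0, 1], [0.2, 0.8])
```

A small two-tailed p-value suggests that predicted and observed probabilities disagree, which is how the result for men (S:p = 0.008) is read in the text.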
External validation of the classification model. Table 3 shows the baseline characteristics of the KoGES dataset. Using our gender-specific classification model constructed from the derivation dataset, we calculated the probabilities of the presence of diabetes in subjects from the external validation dataset. These subjects were categorized into five groups in ascending order of their predicted probabilities. Figure 2 shows the cumulative incidence of new-onset diabetes. Most groups differed significantly from the other groups. For both men and women, group 5 yielded significantly worse outcomes than the other groups.
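As a rough illustration of the grouping step, the sketch below ranks subjects by predicted probability and splits them into five groups. The text does not state how the cut points were chosen, so equal-sized groups (quintiles) are an assumption here, as is the function name `assign_risk_groups`.

```python
def assign_risk_groups(probabilities, n_groups=5):
    """Assign each subject a group from 1 (lowest risk) to n_groups (highest).

    Assumes equal-sized groups by rank; ties are broken by input order.
    """
    n = len(probabilities)
    # Indices of subjects sorted by ascending predicted probability.
    order = sorted(range(n), key=lambda i: probabilities[i])
    groups = [0] * n
    for rank, idx in enumerate(order):
        # Map rank 0..n-1 onto group 1..n_groups.
        groups[idx] = rank * n_groups // n + 1
    return groups


# Hypothetical predicted probabilities for ten subjects.
example = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.05]
print(assign_risk_groups(example))  # [5, 1, 3, 2, 4, 2, 5, 3, 4, 1]
```

Under this assumption, group 5 holds the fifth of subjects with the highest predicted probabilities.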

Discussion
Our novel model for the risk classification of diabetes mellitus, based on two large-scale population-based cohort studies, showed high sensitivity and selectivity. Our model yielded AUROCs of 0.941 and 0.939 in men and women, respectively. The Finnish Diabetes Risk Score (FINDRISC) model is a well-known, recommended tool for diabetes mellitus prediction 14 20 yielded AUROC of 0.71 and 0.76 in men and women, respectively. Note that our model predicts the presence of DM rather than new-onset DM, which partly explains why it outperforms previous models that predict new-onset DM. We used literature review and statistical methods to select more than 15 predictors, which may be appropriate for modeling DM given its complex pathophysiology.
With the help of machine learning techniques, we can handle large numbers of participant features that may have positive or negative correlations with the prevalence of diabetes mellitus. To obtain input features for our model, we used data from the KNHANES, a large-scale cross-sectional study that includes approximately 10,000 participants. As a result, we were able to use 16 and 18 variables in men and women, respectively, during the analysis (Table 3).
Among these variables, glycosuria showed the highest odds ratio (OR) in men (OR 1.35; 95% CI 1.32-1.39). In general, glycosuria has been used as a biomarker for renal complication in diabetes 8,21 , not as a predictor for diabetes. Although glycosuria is a result of hyperglycemia, it also occurs with normal blood glucose levels due to renal injury. Moreover, hyperglycemic patients can also secrete normal range glucose in their urine 22,23 . This implies that we need to identify a new risk factor that, despite being considered negligible, may have a significant  27 . They also suggested that reducing triglycerides can decrease the risk of developing diabetes 27 . This implies that a high TG level is a modifiable risk factor for diabetes and should be managed in people predisposed to diabetes. Alcohol consumption was related to a decreased risk of diabetes in both men and women (KNHANES dataset). This finding is consistent with previous studies about alcohol consumption. Moreover, heavy and moderate consumption showed deleterious and protective effects on diabetes, respectively 28 . BMI and waist circumference (WC) showed positive relationships in univariate analysis. However, multivariate analysis revealed that BMI had a negative relationship, whereas WC had a positive relationship with diabetes. In light of this, waist circumference, a well-known parameter for central obesity, may be a better parameter for risk assessment of obesity than is BMI, a general obesity indicator. Wang et al. reported similar results regarding risk prediction for diabetes. According to their analysis, abdominal adiposity was superior to abdominal obesity as a predictor for new-onset diabetes 29 . Peter et al. also reported that WC showed higher mortality risk than BMI (WC: HR 1.40 [95% CI 1.14-1.72] and BMI: HR 1.29 [1.04-1.61]) in adults with diabetes 30 .
Risk group classification is one of the most critical uses of machine learning techniques in medical research 31 . Using logistic regression, the combined effect of selected risk factors on the disease of interest could be calculated as a probability. Moreover, based on the probability obtained from LR, the participants were classified into five groups. Subsequently, we assessed the risk of each group by analyzing the cumulative incidence of diabetes using Cox regression analysis. As expected, and as per our prediction model, participants at high risk showed a high incidence of diabetes (Fig. 2).
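To make the probability calculation concrete, the following minimal sketch applies the standard logistic function to a linear combination of risk factors. The feature names and coefficient values are hypothetical placeholders, not the fitted coefficients of this study.

```python
import math


def diabetes_probability(features, coefficients, intercept):
    """Logistic-regression probability: 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    # Numerically stable evaluation of the logistic function.
    if linear >= 0:
        return 1.0 / (1.0 + math.exp(-linear))
    e = math.exp(linear)
    return e / (1.0 + e)


# Hypothetical coefficients and one hypothetical subject (illustration only).
coefs = {"waist_cm": 0.05, "triglycerides_mg_dl": 0.002}
subject = {"waist_cm": 92.0, "triglycerides_mg_dl": 180.0}
prob = diabetes_probability(subject, coefs, intercept=-7.0)
```

Each coefficient contributes additively on the log-odds scale, so the resulting probability reflects the joint effect of all selected risk factors rather than any single one.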
Our study had several limitations. First, we could not distinguish type 1 diabetes mellitus from type 2 diabetes mellitus because there were no biomarkers or clinical information for classifying the new-onset diabetes in the KoGES. The risk factors for each type of diabetes are different. Therefore, distinguishing the type of diabetes may be preferable when formulating a prediction model with high accuracy. However, new-onset type 1 diabetes mellitus in a patient over 30 years of age is rare 32 . Hence, this prediction model may be used to classify groups with a high risk for type 2 diabetes mellitus. Second, we could not use menopausal status as a predictive factor in women. The effects of various post-menopausal hormones in women must be considered 33 . Previous cohort studies reported controversial results regarding the role of menopausal status in diabetes development 34,35 . Kim et al. reported that there was no association between natural menopause and the risk for diabetes mellitus 34 . However, early menopause showed a significant association with type 2 diabetes mellitus 36 . Unfortunately, KoGES data at baseline did not include the menopausal status of participants. Therefore, we could not use this factor. Third, we used two large cohorts composed of Koreans. Thus, our diabetes index is likely to generalize well to Koreans but less well to other populations. However, we used nationally representative surveys to establish the DM classification model. Moreover, we validated the model using the KoGES, which is also a nationwide longitudinal study. Because the target population consisted of healthy subjects, our model might generalize better than other models built on hospital-based participants.
In conclusion, we developed a diabetes mellitus risk classification model and validated it using Korean datasets. Although the variables used in this model cannot be counted directly, they can be easily collected in real clinical practice. Hence, this new diabetes index can be used to classify individuals at a high risk for diabetes mellitus, who should prevent the disease by managing their risks through lifestyle modification.
37 . Subjects aged 40 years and older were included. Subjects with incomplete data regarding demographics and laboratory information were excluded. Furthermore, we excluded subjects with a fasting blood glucose level ≥ 126 mg/dL regardless of a diagnosis of diabetes mellitus. When constructing prediction models, subjects with hyperglycemia may cause bias as this may involve predicting the development of an anticipated pre-existing condition. We designated 2013-16 KNHANES data as the derivation set and 2017-18 KNHANES data as the internal validation set. The target population of KNHANES consists of nationally representative non-institutionalized civilians 38 . The KoGES is an ongoing, prospective, large cohort study conducted by the Korean government. It involves a biannual examination related to life-style surveys, biochemical profiles, and incidences of common chronic diseases of Korean adults since 2001. Details of the KoGES have been described elsewhere 39 . We used the Ansan-Ansung cohort study, a KoGES 10-year data follow-up study, for the external validation set. Subjects who were already diagnosed with diabetes mellitus or exhibited diabetic profiles in lab tests (a fasting glucose level ≥ 126 mg/dL, a 2-h post glucose level ≥ 200 mg/dL in a 75 g oral glucose tolerance test [OGTT], or a glycosylated hemoglobin A1c [HbA1c] level ≥ 6.5%) were excluded at baseline.
Finally, 14,977, 9611, and 7140 subjects were used in the derivation, internal validation, and external validation sets for analysis, respectively. The major steps of the inclusion/exclusion processes of this study are described in Supplementary Fig. 1.

Definition of diabetes.
Diabetes was defined according to the American Diabetes Association (ADA) guidelines 40 as follows: a fasting blood glucose level ≥ 126 mg/dL, a 2-h post glucose level ≥ 200 mg/dL during OGTT, or an HbA1c ≥ 6.5%. Participants who were previously diagnosed as having diabetes or who exhibited diabetic features in their blood samples were categorized as the diabetes group in the KNHANES. In the KoGES, because it is a longitudinal observational study, we included non-diabetic patients in the initial cohort data. Moreover, we detected new-onset diabetes in accordance with the criteria of the ADA during the observation period.
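The definition above can be written as a simple rule. The sketch below encodes the three laboratory thresholds stated in the text together with the prior-diagnosis condition; the function and parameter names are illustrative assumptions, and missing measurements are represented as None.

```python
def meets_ada_criteria(fasting_glucose_mg_dl=None,
                       ogtt_2h_glucose_mg_dl=None,
                       hba1c_percent=None):
    """True if any available lab value reaches the thresholds stated in the text."""
    return (
        (fasting_glucose_mg_dl is not None and fasting_glucose_mg_dl >= 126)
        or (ogtt_2h_glucose_mg_dl is not None and ogtt_2h_glucose_mg_dl >= 200)
        or (hba1c_percent is not None and hba1c_percent >= 6.5)
    )


def in_diabetes_group(previously_diagnosed, **labs):
    """Diabetes group: a prior diagnosis or diabetic features in blood samples."""
    return bool(previously_diagnosed) or meets_ada_criteria(**labs)


# Hypothetical participants (illustration only).
print(in_diabetes_group(False, fasting_glucose_mg_dl=110, hba1c_percent=6.7))  # True
print(in_diabetes_group(False, fasting_glucose_mg_dl=98))                      # False
```

The same rule, applied at each follow-up visit to subjects who were non-diabetic at baseline, corresponds to the detection of new-onset diabetes described for the KoGES.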

Variable selection and statistical analysis.
To determine predictive risk factors for deriving the risk prediction model, candidate variables were selected based on literature review. Two endocrinologists performed the literature review and selected 40 risk factors (Supplementary Table 1). Subsequently, we determined predictive risk factors using the backward stepwise logistic regression (LR) method 41 after applying weight values to all subjects in the KNHANES. Weight values were used for the processes of determining the significant risk factors and deriving the prediction model. These values were determined during data construction and denote the number of people in the population that each subject in the study cohort represents. Normal distribution of candidate variables was verified using the Kolmogorov-Smirnov test. Differences in variables were analyzed based on diabetes status by means of the Student's t-test and Chi-square test for continuous and categorical variables, respectively. Associations between candidate variables were analyzed separately for men and women. The LR model was used to determine the risk factors for the presence of diabetes mellitus, and to formulate the diabetes mellitus prediction model. The AUROC and the Cox regression model were used to measure the performance of the prediction model for the internal validation set and for the external validation set, respectively. Statistical analysis was performed using the R language (ver. 3.6.1). A P-value < 0.05 was considered statistically significant.
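For readers unfamiliar with the AUROC, the following minimal sketch computes it as the probability that a randomly chosen subject with diabetes receives a higher predicted probability than a randomly chosen subject without diabetes, counting ties as one half. It is an unweighted, quadratic-time illustration with an assumed function name, not the procedure used in the analysis.

```python
def auroc(y_true, scores):
    """AUROC via pairwise comparison of positive and negative subjects."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative subject")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical labels and scores).
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.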
Ethical considerations. The Institutional Review Board of Gwangju Institute of Science and Technology (South Korea) approved the study protocol (IRB No. 20200414-EX-01-02). All research procedures were performed in accordance with the relevant guidelines and regulations. All participants volunteered and provided written informed consent prior to enrolment, and their records were anonymized before being accessed by the authors.

Table 3. Baseline characteristics of the external validation set. Continuous and categorical variables are described as mean ± standard error and number (percent), respectively. P-values are measured using the nominal population, not the weighted population. P-values of continuous and categorical variables are measured by Student's t-test and Chi-squared test, respectively. BP blood pressure.