Introduction

Diabetes mellitus is a chronic metabolic disorder characterized by disrupted glucose homeostasis, resulting from increased insulin resistance and/or impaired insulin secretion. People with diabetes mellitus are predisposed to metabolic disorders, such as cardiovascular disease (CVD), which affects 32.2% of all people with diabetes mellitus globally. Moreover, their complications are leading causes of morbidity and mortality1,2. The prevalence and socioeconomic burden of diabetes are rapidly increasing worldwide. Approximately 1 in 11 adults have diabetes, and 90% of people with diabetes have type 2 diabetes mellitus2.

Previous large-scale studies suggest that diet and lifestyle modifications can prevent or delay the development of diabetes mellitus in high-risk individuals2,3. The Diabetes Prevention Program conducted in the United States reported that lifestyle modification reduced the incidence of diabetes mellitus by 58% compared with control after a 2.8-year mean follow-up4. Toshikazu et al. also demonstrated that lifestyle modification reduced the overall relative risk of diabetes mellitus by 44.1% in Japan5. Clinical studies conducted in China6 and India7 have reported 42% and 38% risk reductions, respectively.

Therefore, developing risk prediction models for diabetes mellitus and identifying high-risk individuals have become important challenges in clinical research. To explore the risk factors and formulate predictive models for diabetes development, machine learning techniques have been widely used8. These methods help researchers discover previously unknown significant patterns and solve scientific problems using large datasets9,10. In the fields of medical science and healthcare, machine learning provides useful classification and prediction models with high accuracy11. Recently, Hang Lai et al. proposed a risk prediction model with an area under the receiver operating characteristic curve (AUROC) of 84.7% using data from 13,309 Canadian patients12. Furthermore, Maniruzzaman et al. built a classification model that yielded 94.25% accuracy for the prediction of diabetes mellitus from an American diabetes dataset13.

In this study, we aimed to develop a novel, highly accurate diabetes index based on machine learning techniques, using two large community-based cohort studies. We formulated a risk classification model using logistic regression to measure the probability of diabetes presence, based on non-diabetic participants’ demographic information and laboratory data from the Korea National Health and Nutrition Examination Survey (KNHANES). Thereafter, we externally validated this model by predicting new-onset diabetes mellitus in a large prospective cohort study known as the Korean Genome and Epidemiology Study (KoGES).

Results

Baseline characteristics from the KNHANES

Table 1 shows the general characteristics of the KNHANES participants in the derivation and internal validation datasets, according to gender and diabetes status. Subjects with diabetes were older than those without in both datasets. In the derivation dataset, diabetes prevalence was 4.9% in men and 3.8% in women. The prevalence of obesity (body mass index [BMI] ≥ 25 kg/m2) was 38% in men (38% in normal and 38% in diabetes) and 28.1% in women (27.3% in normal and 47.3% in diabetes). In the internal validation dataset, diabetes prevalence was 4.6% in men and 3.9% in women. The prevalence of obesity (BMI ≥ 25 kg/m2) was 40.8% in men (40.6% in normal and 44.5% in diabetes) and 27.6% in women (26.8% in normal and 46% in diabetes). In both datasets, subjects with diabetes exhibited lower socioeconomic status and education, higher fasting glucose levels, and higher rates of glycosuria, hypertension, and dyslipidemia than subjects without diabetes.

Table 1 General characteristics of training set (2013–16 KNHANES) and testing set (2017–2018) according to gender and diabetes.

Feature selection and classification model by logistic regression

Based on literature review, we identified about 40 candidate risk factors (Supplementary Table 1), of which 20 variables were present in both the KNHANES and the KoGES. Table 2 displays the selection process by means of univariate logistic regression (LR) in men and women. All 20 features from Model 1 were selected as candidate variables for univariate analysis in Model 2. By means of multivariate analysis (Models 2 and 3), we identified 16 and 18 variables as diabetes risk factors in men and women, respectively, and used them as the input features for formulating the classification model. Thereafter, based on these variables, we generated a gender-specific diabetes classification model using LR. Note that the feature selection and the formulation of the prediction model were conducted using only the derivation dataset.

Table 2 Backward stepwise logistic regression of men and women in training set.

We used this gender-specific diabetes classification model to calculate the probabilities of diabetes in subjects from the internal validation dataset. The area under the receiver operating characteristic curve (AUROC) was 0.941 and 0.939 in men and women, respectively (Fig. 1). The area under the precision-recall (PR) curve was 0.475 and 0.381 in men and women, respectively (Fig. 1). Moreover, we evaluated model calibration, that is, the agreement between observed and predicted probabilities, using the val.prob function in the rms package. According to the Spiegelhalter Z-test and its two-tailed p-values, the classification model for women was well calibrated, whereas the model for men was not (S:p for men: 0.008; S:p for women: 0.588, Supplementary Fig. 2).
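The two kinds of metric reported above, discrimination and calibration, can be illustrated with a minimal sketch. This is not the authors' code (the analysis was run in R). The function names auroc and spiegelhalter_z are hypothetical, and the Spiegelhalter statistic is written in its standard textbook form rather than copied from the rms package.

```python
import math


def auroc(y_true, y_prob):
    """AUROC as the probability that a random case outranks a random non-case.

    Ties count as half a win (Mann-Whitney formulation). This is O(n*m),
    which is fine for illustration but slow for large cohorts.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def spiegelhalter_z(y_true, y_prob):
    """Spiegelhalter Z statistic and its two-tailed p-value.

    Under perfect calibration Z is approximately standard normal, so a
    small p-value suggests miscalibration.
    """
    numerator = sum((y - p) * (1 - 2 * p) for y, p in zip(y_true, y_prob))
    variance = sum((1 - 2 * p) ** 2 * p * (1 - p) for p in y_prob)
    z = numerator / math.sqrt(variance)
    # 2 * (1 - Phi(|z|)) expressed through the complementary error function
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Toy data for illustration only (not from the KNHANES)
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, probs))
print(spiegelhalter_z(labels, probs))
```

A high AUROC together with a significant Spiegelhalter test, as reported for men, would mean that the model ranks subjects well even though its predicted probabilities deviate from the observed frequencies.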

Figure 1

ROC and PR curves for the present gender-specific diabetes prediction model using the KNHANES dataset from 2017 to 2018. (A) Men (B) women. KNHANES Korea National Health and Nutrition Examination Survey, ROC receiver operating characteristic, AUC area under the curve, PR precision recall, FPR false positive rate.

External validation of the classification model

Table 3 shows the baseline characteristics of the KoGES dataset. Using our gender-specific classification model constructed from the derivation dataset, we calculated the probabilities of the presence of diabetes in subjects from the external validation dataset. These subjects were categorized into five groups in ascending order of predicted probability. Figure 2 shows the cumulative incidence of new-onset diabetes in each group. Most groups differed significantly from the other groups. For both men and women, group 5 yielded significantly worse outcomes than the other groups.
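The grouping step can be sketched as follows. The text does not state how the group boundaries were chosen, so this example assumes five groups of (nearly) equal size, i.e. quintiles of the predicted probability. The function name assign_risk_groups is hypothetical.

```python
def assign_risk_groups(probabilities, n_groups=5):
    """Assign each subject to a risk group numbered 1..n_groups.

    Subjects are ranked by predicted probability in ascending order and
    split into groups of (nearly) equal size, so group 1 holds the lowest
    predicted risks and group n_groups the highest. Equal-sized groups are
    an assumption made for this sketch.
    """
    n = len(probabilities)
    order = sorted(range(n), key=lambda i: probabilities[i])
    groups = [0] * n
    for rank, index in enumerate(order):
        groups[index] = rank * n_groups // n + 1
    return groups


# Toy probabilities for illustration only
probs = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.05]
print(assign_risk_groups(probs))
```

The resulting group label can then serve as the stratifying variable when cumulative incidence is compared across groups.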

Table 3 Baseline characteristics of external validation set.
Figure 2

Cumulative incidence of new-onset diabetes in the five groups, divided according to the predicted probabilities of participants in the KoGES. Group 5 showed the highest cumulative incidence among the five groups in both (A) men and (B) women. KoGES Korean Genome and Epidemiology Study.

Discussion

Our novel model for the risk classification of diabetes mellitus (DM), based on two large-scale population-based cohort studies, showed high sensitivity and specificity. Our model yielded AUROCs of 0.941 and 0.939 in men and women, respectively. The Finnish Diabetes Risk Score (FINDRISC) model is a well-known, recommended tool for diabetes mellitus prediction14. The AUROC of the FINDRISC model was 0.77 and 0.74 in the Norwegian15 and Spanish16 populations, respectively. The Framingham Diabetes Risk Scoring Model (FDRSM) by Wilson et al.17 yielded an AUROC of 0.85 and 0.78 in middle-aged American and Canadian populations, respectively18. In the Asian population, Quan Zou et al.19 predicted new-onset diabetes in a Chinese cohort using machine learning techniques. Their model yielded an AUROC of 0.8084. The diabetes risk score model from the KoGES by Kim et al.20 yielded an AUROC of 0.71 and 0.76 in men and women, respectively. Note that the predictive performance of our model refers to the presence of DM, not new-onset DM. It is therefore expected to be somewhat higher than that of previous models predicting new-onset DM. We used literature review and statistical methods to select more than 15 predictors, which may be appropriate for modeling DM given its complex pathophysiology.

With the help of machine learning techniques, we can handle large numbers of participant features that may have positive or negative correlations with the prevalence of diabetes mellitus. To obtain input features for our model, we used data from the KNHANES, a large-scale cross-sectional study that includes approximately 10,000 participants. As a result, we were able to use the 16 and 18 variables in men and women, respectively, during the analysis (Table 3).

Among these variables, glycosuria showed the highest odds ratio (OR) in men (OR 1.35; 95% CI 1.32–1.39). In general, glycosuria has been used as a biomarker for renal complications in diabetes8,21, not as a predictor for diabetes. Although glycosuria is a result of hyperglycemia, it also occurs with normal blood glucose levels due to renal injury. Moreover, hyperglycemic patients can also excrete glucose within the normal range in their urine22,23. This suggests that machine learning techniques can help identify risk factors that have been considered negligible but may have a significant impact on predicting diabetes. High triglyceride (TG) levels showed the highest OR in women (OR 1.49; 95% CI 1.45–1.54). High TG levels are known to be a result of metabolic dysfunction in patients with diabetes24 and a risk factor for diabetes development25,26. Recently, a rural Chinese cohort study by Yongcheng et al. reported that hypertriglyceridemia is a risk factor for diabetes27. They also suggested that reducing triglycerides can decrease the risk of developing diabetes27. This implies that a high TG level is a modifiable risk factor for diabetes and should be managed in people predisposed to diabetes.
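For readers less familiar with how an OR and its confidence interval relate to a logistic regression coefficient, the sketch below shows the standard conversion in both directions. It assumes a Wald-type 95% interval (coefficient ± 1.96 standard errors on the log-odds scale), which the text does not state explicitly. The function names are hypothetical.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def odds_ratio_with_ci(beta, se, z=Z_95):
    """Convert a log-odds coefficient and its standard error to OR and CI."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def coefficient_from_or(odds_ratio, ci_low, ci_high, z=Z_95):
    """Approximate the coefficient and standard error behind a reported OR.

    The standard error is taken from the CI width on the log scale, so
    rounding in the reported values carries over to the result.
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se


# Reported values for glycosuria in men: OR 1.35 (95% CI 1.32-1.39)
beta, se = coefficient_from_or(1.35, 1.32, 1.39)
print(beta, se)
print(odds_ratio_with_ci(beta, se))
```

Because the published values are rounded, the round trip reproduces the reported interval only approximately.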

Alcohol consumption was related to a decreased risk of diabetes in both men and women (KNHANES dataset). This finding is consistent with previous studies on alcohol consumption, in which heavy and moderate consumption showed deleterious and protective effects on diabetes, respectively28. BMI and waist circumference (WC) both showed positive relationships with diabetes in univariate analysis. However, multivariate analysis revealed that BMI had a negative relationship, whereas WC had a positive relationship with diabetes. In light of this, WC, a well-known parameter for central obesity, may be a better parameter for obesity-related risk assessment than BMI, a general obesity indicator. Wang et al. reported similar results regarding risk prediction for diabetes. According to their analysis, abdominal adiposity was superior to overall obesity as a predictor for new-onset diabetes29. Peter et al. also reported that WC was associated with a higher mortality risk than BMI (WC: HR 1.40 [95% CI 1.14–1.72] and BMI: HR 1.29 [1.04–1.61]) in adults with diabetes30.

Risk group classification is one of the most critical uses of machine learning techniques in medical research31. Using logistic regression, the combined effect of the selected risk factors on the disease of interest can be calculated as a probability. Based on the probability obtained from LR, the participants were classified into five groups. Subsequently, we assessed the risk of each group by analyzing the cumulative incidence of diabetes using Cox regression analysis. As expected from our prediction model, participants at high risk showed a high incidence of diabetes (Fig. 2).
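How a logistic regression model turns a set of risk factors into a single probability can be sketched as follows. The coefficient values and variable names in the example (waist_cm, high_tg) are hypothetical and chosen only for illustration. They are not the fitted coefficients of the present model.

```python
import math


def predicted_probability(intercept, coefficients, values):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i)))).

    `coefficients` and `values` are dictionaries keyed by variable name.
    """
    linear = intercept + sum(
        coefficients[name] * values[name] for name in coefficients
    )
    # Use the algebraically equivalent form for negative inputs to avoid
    # overflow in exp() when the linear predictor is very negative.
    if linear >= 0:
        return 1.0 / (1.0 + math.exp(-linear))
    e = math.exp(linear)
    return e / (1.0 + e)


# Hypothetical coefficients, for illustration only
coefs = {"waist_cm": 0.02, "high_tg": 0.4}
subject = {"waist_cm": 90, "high_tg": 1}
print(predicted_probability(-3.0, coefs, subject))
```

Each coefficient shifts the log-odds additively, so the probability reflects the joint contribution of all selected variables rather than any single one.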

Our study had several limitations. First, we could not distinguish type 1 diabetes mellitus from type 2 diabetes mellitus because there were no biomarkers or clinical information for classifying the new-onset diabetes in the KoGES. The risk factors for each type of diabetes are different. Therefore, distinguishing the type of diabetes may be preferable when formulating a prediction model with high accuracy. However, new-onset type 1 diabetes mellitus in a patient over 30 years of age is rare32. Hence, this prediction model may be used to classify groups with a high risk for type 2 diabetes mellitus. Second, we could not use menopausal status as a predictive factor in women. The effects of various post-menopausal hormones in women must be considered33. Previous cohort studies reported controversial results regarding the role of menopausal status in diabetes development34,35. Kim et al. reported that there was no association between natural menopause and the risk for diabetes mellitus34. However, early menopause showed a significant association with type 2 diabetes mellitus36. Unfortunately, KoGES data at baseline did not include the menopausal status of participants. Therefore, we could not use this factor. Third, we used two large cohorts composed of Koreans. Therefore, our diabetes index is likely to generalize well to Koreans but may not generalize as well to other populations. However, we used a nationally representative survey to establish the diabetes mellitus classification model and validated the model using the KoGES, which is also a nationwide longitudinal study. Because the target population consisted of healthy subjects, our model might be more generalizable than models derived from hospital-based participants.

In conclusion, we developed a diabetes mellitus risk classification model and validated it using Korean datasets. Although the variables used in this model cannot be counted directly, they can be easily collected in real clinical practice. Hence, this new diabetes index can be used to classify individuals at a high risk for diabetes mellitus, who should prevent the disease by managing their risks through lifestyle modification.

Materials and methods

Study population

This study used demographic data and biochemical profiles from the 2013–18 KNHANES. The KNHANES is a national surveillance system assessing the health and nutritional status of the Korean population. It is conducted annually by the Korea Centers for Disease Control and Prevention (KCDC). Details of this nationwide survey have been described elsewhere37. Subjects aged 40 years and older were included. Subjects with incomplete data regarding demographics and laboratory information were excluded. Furthermore, we excluded subjects with a fasting blood glucose level ≥ 126 mg/dL regardless of a diagnosis of diabetes mellitus. Including subjects with hyperglycemia when constructing prediction models may introduce bias, because the model would then be predicting a condition that is effectively already present. We designated the 2013–16 KNHANES data as the derivation set and the 2017–18 KNHANES data as the internal validation set. The target population of the KNHANES consists of nationally representative non-institutionalized civilians38.

The KoGES is an ongoing, prospective, large cohort study conducted by the Korean government. Since 2001, it has involved a biennial examination of Korean adults covering lifestyle surveys, biochemical profiles, and the incidence of common chronic diseases. Details of the KoGES have been described elsewhere39. We used the Ansan–Ansung cohort study, a KoGES 10-year data follow-up study, for the external validation set. Subjects who were already diagnosed with diabetes mellitus or exhibited diabetic profiles in lab tests (a fasting glucose level ≥ 126 mg/dL, a 2-h post glucose level ≥ 200 mg/dL in a 75 g oral glucose tolerance test [OGTT], or a glycosylated hemoglobin A1c [HbA1c] level ≥ 6.5%) were excluded at baseline. Finally, 14,977, 9611, and 7140 subjects were used in the derivation, internal validation, and external validation sets for analysis, respectively. The major steps of the inclusion/exclusion processes of this study are described in Supplementary Fig. 1.

Definition of diabetes

Diabetes was defined according to the American Diabetes Association (ADA) guidelines40 as follows: a fasting blood glucose level ≥ 126 mg/dL, a 2-h post glucose level ≥ 200 mg/dL during OGTT, or an HbA1c ≥ 6.5%. Participants who were previously diagnosed as having diabetes or who exhibited diabetic features in their blood samples were categorized as the diabetes group in the KNHANES. Because the KoGES is a longitudinal observational study, we included only non-diabetic participants from the initial cohort data and detected new-onset diabetes in accordance with the ADA criteria during the observation period.
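The definition above can be written as a simple rule. The sketch below uses the three thresholds stated in the text. The function and parameter names are hypothetical, and treating a missing measurement as "criterion not met" is an assumption of this example.

```python
def meets_ada_criteria(fasting_glucose_mg_dl=None,
                       ogtt_2h_glucose_mg_dl=None,
                       hba1c_percent=None):
    """Return True if any available measurement reaches its threshold.

    Thresholds follow the definition in the text: fasting glucose >= 126
    mg/dL, 2-h OGTT glucose >= 200 mg/dL, or HbA1c >= 6.5%. A measurement
    passed as None is treated as unavailable (assumption of this sketch).
    """
    checks = (
        (fasting_glucose_mg_dl, 126.0),
        (ogtt_2h_glucose_mg_dl, 200.0),
        (hba1c_percent, 6.5),
    )
    return any(value is not None and value >= cutoff
               for value, cutoff in checks)


def in_diabetes_group(previously_diagnosed, **measurements):
    """Diabetes group: prior diagnosis or any laboratory criterion met."""
    return bool(previously_diagnosed) or meets_ada_criteria(**measurements)


print(in_diabetes_group(False, fasting_glucose_mg_dl=131))
print(in_diabetes_group(False, fasting_glucose_mg_dl=98, hba1c_percent=5.6))
```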

Variable selection and statistical analysis

To determine predictive risk factors for deriving the risk prediction model, candidate variables were selected based on literature review. Two endocrinologists performed the literature review and selected 40 risk factors (Supplementary Table 1). Subsequently, we determined predictive risk factors using the backward stepwise logistic regression (LR) method41 after applying weight values to all subjects in the KNHANES. Weight values were used both for determining the significant risk factors and for deriving the prediction model. These values were determined during data construction and denote the number of people in the population that each subject in the study cohort represents.
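One common way in which such weights enter a logistic regression is through the log-likelihood, where each subject's contribution is multiplied by the subject's weight. The sketch below illustrates this idea under that assumption. It is not the authors' fitting procedure, and the function name weighted_log_likelihood is hypothetical.

```python
import math


def weighted_log_likelihood(intercept, coefficients, rows, outcomes, weights):
    """Weighted Bernoulli log-likelihood of a logistic model.

    rows:     list of feature lists, one per subject
    outcomes: 1 for diabetes, 0 otherwise
    weights:  number of people each subject represents
    """
    total = 0.0
    for features, outcome, weight in zip(rows, outcomes, weights):
        linear = intercept + sum(b * x for b, x in zip(coefficients, features))
        p = 1.0 / (1.0 + math.exp(-linear))
        total += weight * (outcome * math.log(p)
                           + (1 - outcome) * math.log(1.0 - p))
    return total


# A subject with weight 2 contributes exactly like two identical subjects
rows = [[1.0], [0.0]]
outcomes = [1, 0]
print(weighted_log_likelihood(-0.5, [0.8], rows, outcomes, [2, 1]))
print(weighted_log_likelihood(-0.5, [0.8], rows + [[1.0]], outcomes + [1], [1, 1, 1]))
```

Maximizing this quantity over the intercept and coefficients gives weighted estimates. Standard errors for survey data generally also need to account for the sampling design, which this sketch does not attempt.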

Normal distribution of candidate variables was verified using the Kolmogorov–Smirnov test. Differences in variables according to diabetes status were analyzed by means of Student’s t-test and the Chi-square test for continuous and categorical variables, respectively. Associations between candidate variables were analyzed separately for men and women. The LR model was used to determine the risk factors for the presence of diabetes mellitus, and to formulate the diabetes mellitus prediction model. The AUROC and the Cox regression model were used to measure the performance of the prediction model for the internal validation set and for the external validation set, respectively. Statistical analysis was performed using R (version 3.6.1). A P-value < 0.05 was considered statistically significant.

Ethical considerations

The Institutional Review Board of Gwangju Institute of Science and Technology (South Korea) approved the study protocol (IRB No. 20200414-EX-01-02). All research procedures were performed in accordance with the relevant guidelines and regulations. All participants volunteered and provided written informed consent prior to enrolment, and their records were anonymized before being accessed by the authors.