Introduction

Diabetes is a metabolic disease that manifests clinically as chronic hyperglycemia, blood lipid and protein abnormalities, and other symptoms that increase the risk of morbidity and mortality1. Diabetes is a significant public health issue in the U.S. and around the world; it is categorized as type 1, type 2, and gestational diabetes2. Type 2 diabetes mellitus (T2DM) is rising in relation to urbanization, population aging, and related lifestyle changes, especially in people over 65 years of age3. An estimated 415 million adults worldwide had diabetes in 2015, and by 2040 that number is projected to increase to 642 million2,4. The national Coronary Artery Disease (CAD) risk factors monitoring report estimates that among Iranians aged 15–64, the prevalence of diabetes was 8.7% overall, with nearly half (4.1%) of those patients being newly diagnosed cases5. Diabetes is a serious and chronic disorder that has a significant negative impact on people's lives, families, and societies all over the world. Uncontrolled diabetes also increases the risk of metabolic, cellular, and blood disturbances leading to vascular complications, cancer, and all-cause death. According to estimates, it was one of the top 10 causes of mortality for adults, and in 2017 it resulted in 4 million deaths worldwide2,4. Although T2DM lacks typical hematologic pathologic features, several hematologic abnormalities have been identified in patients with this illness. It has been demonstrated that T2DM is closely related to several hematological abnormalities affecting platelets (PLTs), white blood cells (WBCs), red blood cells (RBCs), and the coagulation system6,7.

Recent studies have indicated a correlation between some hematological parameters and diabetes, such as a reduction in RBC count in developing T2DM and an increase in total WBC and PLT counts in type 2 diabetic patients7,8,9. However, previous studies identified no relationship between T2DM and the other hematological parameters. Although a cross-sectional study has shown an association between hematological parameters and T2DM in adult patients, its sample size was considerably smaller than that of the current study7.

Given the rising prevalence of diabetes and its relation to cardiovascular disease, cancer, and mortality, early diagnosis and management of diabetes are crucial to reducing these risks. The objective of the present study was to determine the association between diabetes and hematological factors.

Methods

Participants

The participants were recruited from the baseline of the Mashhad Stroke and Heart Atherosclerotic Disorders (MASHAD) study, Mashhad, north-eastern Iran, following a similar research protocol10. A total of 9704 individuals aged 35–65 years from the baseline of this cohort were studied with regard to their T2DM status. T2DM was defined as a fasting blood glucose (FBG) ≥ 126 mg/dl or treatment with oral hypoglycemic medications or insulin. We also considered the triglyceride-glucose (TyG) index for the diagnosis of T2DM, defined as follows11:

$$TyG = \ln \left( \frac{\text{triglyceride} \times \text{glucose}}{2} \right)$$
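As a minimal sketch of how the index above might be computed, the following assumes that fasting triglyceride and fasting glucose are both expressed in mg/dL. The units, the function name, and the example values are assumptions made for illustration; they are not taken from the study's software.

```python
import math

def tyg_index(triglyceride_mg_dl: float, glucose_mg_dl: float) -> float:
    """TyG = ln(triglyceride * glucose / 2); inputs assumed to be in mg/dL."""
    if triglyceride_mg_dl <= 0 or glucose_mg_dl <= 0:
        raise ValueError("both measurements must be positive")
    return math.log(triglyceride_mg_dl * glucose_mg_dl / 2.0)

# Hypothetical participant: triglyceride 150 mg/dL, fasting glucose 100 mg/dL.
example_tyg = tyg_index(150.0, 100.0)  # ln(7500), roughly 8.92
```

A value computed this way can then be compared with a chosen cut-off to label a participant as having a high or low TyG index.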

We categorized the TyG index using the median of our data, which was 8.831. The inclusion criteria were males and females between the ages of 35 and 65 years. The data in this investigation were unbalanced (diabetic vs. non-diabetic). One approach to this problem is the Synthetic Minority Oversampling Technique (SMOTE)12, one of the most widely used oversampling methods, which creates synthetic minority-class samples. Therefore, in this study, the SMOTE algorithm was used to balance the classes. After cleaning the data for each of the measured variables, the analysis was carried out on a balanced data set of 9000 observations (Fig. 1).
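The idea behind SMOTE can be illustrated with a short sketch: each synthetic record is placed at a random point on the line segment between a minority-class record and one of its nearest minority-class neighbours. The code below is a simplified, standard-library illustration and not the implementation used in this study; the number of neighbours `k`, the random seed, and the toy rows are all assumptions.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority-class
    row and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority-class rows (hypothetical age and WBC values, not cohort data).
toy_minority = [[50.0, 6.8], [55.0, 7.1], [61.0, 6.5], [48.0, 7.4]]
new_rows = smote(toy_minority, n_new=6, k=2)
```

Because every synthetic row is an interpolation between two real minority rows, each generated value stays within the range already observed for that variable in the minority class.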

Figure 1 Flow chart of this study.

At the beginning of this study, we measured the demographic characteristics (including gender and age) and hematological information including HGB (Hemoglobin), HCT (Hematocrit), MCH (Mean Corpuscular Hemoglobin), PLT (Platelet count), LYM (Lymphocyte Count), MXD (Mixed Cell Count), NEUT (Neutrophil Count), RDW (Red cell Distribution Width), PDW (Platelet Distribution Width), MPV (Mean Platelet Volume), RBC (Red Blood Cell), MCV (Mean Corpuscular Volume), MCHC (Mean Corpuscular Hemoglobin Concentration), and WBC (White Blood Cell).

Blood sampling

Following a standard protocol, blood samples were taken from an antecubital vein of all participants, in a sitting position, between 8 and 10 am, after 14 h of fasting. The samples were collected in 20 ml vacuum tubes and centrifuged for 30–45 min to separate the serum and plasma, and were later sent to Bu Ali Research Institute, Mashhad, for laboratory examinations. Aliquots of serum were also kept frozen at −80 °C for future analysis. The details of laboratory measurements and cut-offs are explained in the baseline report of the MASHAD cohort study10.

Statistical analysis and model building

To describe the quantitative and qualitative variables, mean ± SD and frequency (%) were reported, respectively. Chi-square and Fisher’s exact tests were applied to measure the association between categorical variables. The means of quantitative variables in the two groups were compared using the independent t-test. In addition, machine learning techniques such as logistic regression (LR) and decision tree (DT) algorithms were used to analyze the data; we applied these algorithms to assess the association between T2DM and hematological factors. We considered two models for the prediction of T2DM. Model I investigated the association of T2DM with hematological factors and Model II investigated the association of the TyG index with hematological factors. All analyses were performed using SPSS version 22 (Armonk, NY: IBM Corp.) and SAS JMP Pro (SAS Institute Inc., Cary, NC) at a significance level of 0.05.

Logistic regression (LR) modeling

Logistic Regression is a popular model to evaluate the relationship between various predictor variables (either categorical or continuous) and binary outcomes in medicine, public health, etc.13.

Let \({Y}_{i}\) denote the response variable, which takes the value 0 or 1 depending on whether the response occurs or not. Let \({\varvec{X}}\) be the vector of covariates associated with the response variable and \({\varvec{\beta}}\) the corresponding vector of regression coefficients. The association between the covariates and the binary response variable can then be investigated as follows:

$$logit\left\{ {E\left( {Y_{i} } \right)} \right\} = logit\{ Pr(Y_{i} = 1| {\varvec{X}},{\varvec{\beta}})\} = {\varvec{\beta}}^{{\varvec{T}}} {\varvec{X}}.$$
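To make the logit link concrete, the sketch below inverts it to obtain a predicted probability and checks numerically that a one-unit increase in a covariate multiplies the odds by the exponential of its coefficient. The coefficients and covariate values are hypothetical and are not estimates from this study.

```python
import math

def predicted_probability(beta, x):
    """Invert the logit link: Pr(Y = 1 | x) = 1 / (1 + exp(-beta^T x))."""
    linear_predictor = sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-linear_predictor))

def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

# Hypothetical coefficients: an intercept and one covariate.
beta = [-2.0, 0.25]
p_low = predicted_probability(beta, [1.0, 6.0])   # covariate equal to 6
p_high = predicted_probability(beta, [1.0, 7.0])  # covariate equal to 7
odds_ratio = odds(p_high) / odds(p_low)           # matches exp(0.25)
```

This is why the exponentiated coefficients of an LR model are reported as odds ratios.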

Decision tree (DT) modelling

Machine learning is a branch of artificial intelligence that emerged in the late twentieth century14,15. In other words, machine learning is a process for extracting hidden knowledge from large data sets. One of the important problems for researchers in this process is data classification16. There are different techniques for classification problems16. DT can be applied in various applications in the medical field17,18,19. Because it is simple to understand and yields clear, interpretable rules, it is widely applied and studied in these fields16. A DT consists of nodes and branches. There are three types of nodes: (1) the root node, which represents the result of subdividing all records into two or more exclusive subsets; (2) internal nodes, which represent possible points in the tree structure, connected to the root node from above and to the leaf nodes from below; and (3) leaf nodes, which show the tree’s final results in dividing records into target groups. Branches, which emanate from the root node and the internal nodes, indicate the chance of placing records in target groups14,15. The DT algorithm uses the Gini impurity index to select the best variable:

$$Gini\left( D \right) = 1 - \mathop \sum \limits_{i = 1}^{m} P_{i}^{2}$$

where \({P}_{i}\) is the probability that a record in \(D\) belongs to the class \({C}_{i}\), estimated by \(|{C}_{i,D}|/|D|\), and \(m\) is the number of classes. Logistic regression (LR) is a statistical model for dichotomous targets that investigates the effect of explanatory variables on the dichotomous target variable. In LR, the probability of placing each record in the target groups is also provided20,21. The main advantage of LR is that it clearly indicates a direct or inverse relationship between the inputs, or explanatory variables, and the target. It is also a flexible method22.
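The Gini impurity defined above can be computed directly from class counts. The sketch below also evaluates a candidate split by the size-weighted impurity of its two child nodes, which is the usual way a CART-style tree compares splits; whether the software used here applies exactly this weighting is an assumption. The toy ages and labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_i P_i^2, with P_i estimated by class frequency."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity after splitting on value < threshold."""
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy data (hypothetical, not from the cohort): age and diabetes status.
ages = [38, 41, 45, 49, 52, 58, 60, 63]
status = [0, 0, 0, 1, 0, 1, 1, 1]
parent = gini(status)                  # 0.5 for an even two-class mix
child = split_gini(ages, status, 47)   # 0.2 after splitting at age < 47
```

The split that lowers the weighted impurity the most is the one a tree of this kind would select at that node.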

Bootstrap forest (BF) modeling

The BF platform fits an ensemble model by averaging several decision trees, each of which is fit to a bootstrap sample of the training data. Each split in each tree considers a random subset of the predictors. In this way, many weak models are combined to produce a stronger model. The final prediction for an observation is the average of the predicted values for that observation over all the decision trees. In this study, BF was used to determine the significant factors associated with diabetes.
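The following sketch illustrates the averaging scheme described above, using one-split trees (stumps) for brevity. It is not the JMP implementation: real BF trees are grown deeper, and the toy data, seed, and helper names here are assumptions. The defaults of 43 trees and 2 predictors sampled per split simply mirror the settings reported later in this paper.

```python
import random

def gini(labels):
    """Gini impurity for binary 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def fit_stump(rows, labels, features):
    """Best single split (feature, threshold) among `features` by weighted Gini."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # no valid split: predict the sample mean everywhere
        mean = sum(labels) / len(labels)
        return (0, float("-inf"), mean, mean)
    return best[1:]

def predict_stump(stump, row):
    f, t, left_mean, right_mean = stump
    return left_mean if row[f] < t else right_mean

def fit_forest(rows, labels, n_trees=43, n_features_per_split=2, seed=0):
    rng = random.Random(seed)
    n, n_cols = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of rows
        feats = rng.sample(range(n_cols), min(n_features_per_split, n_cols))
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Average the per-tree predicted probabilities."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Toy rows (hypothetical age, WBC, PDW values) with 0/1 diabetes labels.
X = [[38, 5.1, 11.0], [41, 5.6, 11.5], [44, 5.9, 12.0], [46, 6.0, 12.2],
     [52, 6.9, 13.1], [57, 7.4, 13.6], [60, 7.8, 14.0], [63, 8.1, 14.4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
p_low = predict_forest(forest, [40, 5.5, 11.2])
p_high = predict_forest(forest, [65, 8.5, 15.0])
```

On this toy data the averaged prediction is higher for the older, higher-WBC row than for the younger, lower-WBC row, which is the behaviour the ensemble is meant to capture.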

Receiver operating characteristic (ROC) curves, together with accuracy, precision, and specificity, were used to evaluate all three algorithms. The confusion matrices of the three algorithms are also reported.
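For reference, the performance indices named above follow directly from the four cells of a binary confusion matrix. The sketch below uses hypothetical labels and predictions, not results from this study, and assumes every denominator is non-zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def performance(tp, fp, tn, fn):
    """Standard indices derived from the confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical test-set labels and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
counts = confusion_counts(y_true, y_pred)  # (3, 2, 4, 1)
metrics = performance(*counts)
```

An ROC curve is obtained by recomputing sensitivity and specificity while the classification threshold on the predicted probability is varied.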

Ethics approval

All participants gave written informed consent to take part in the study. The study protocol was reviewed and approved by the Ethics Committee of Mashhad University of Medical Sciences (approval number IR.MUMS.REC.1399.660). All methods were carried out in accordance with relevant guidelines and regulations.

Results

Complete records from 9000 participants were analyzed in this cohort study (N = 4500 with diabetes [female 62.77% vs male 37.22%], N = 4500 without diabetes [female 59.15% vs male 40.84%]). The main baseline characteristics of the study population are summarized in Table 1. All the variables, including age, WBC, PDW, RDW, RBC, sex, PLT, MCHC, and HCT, differed significantly between the two groups (P < 0.05). Because previous studies have reported a positive relationship between the TyG index and the presence of T2DM, we also considered the association of the TyG index with the hematological factors11,23,24.

Table 1 Clinical characteristics at the baseline of Mashhad stroke and heart atherosclerotic disorder (MASHAD) study used in this paper.

Three machine learning techniques were used to investigate the relationship between hematological predictors and the binary response variable (diabetic vs. non-diabetic). The main objective was to predict diabetes using the LR, DT, and BF models and to determine its associated factors, especially hematological markers. For this purpose, the dataset was randomly split into training data (75%) and test data (25%). The training dataset was used to develop the DT and BF models, which were then validated using the test data, which had not been used during training.
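A random 75/25 split of this kind can be sketched as follows. The shuffling procedure and the seed are assumptions for illustration; the study does not describe how the random split was generated.

```python
import random

def train_test_split(n_rows, train_fraction=0.75, seed=0):
    """Return shuffled row indices split into training and test sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]

# With 9000 rows, a 75/25 split gives 6750 training and 2250 test rows.
train_idx, test_idx = train_test_split(9000)
```

Keeping the test indices disjoint from the training indices is what allows the test-set performance to serve as validation on data not seen during training.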

LR model

Results from the multiple LR model revealed that all variables were significantly associated with having diabetes (P < 0.05). After adjusting for the effect of the other variables in Model I, the odds of having diabetes in males were 0.69 times those in females (P < 0.05). Likewise, after adjustment, each unit increase in age raised the odds of having diabetes by 8 percent (P < 0.05). Among the analyzed variables, age (OR = 1.08, 95%CI = (1.07,1.08)), WBC (OR = 1.29, 95%CI = (1.24,1.33)), and PDW (OR = 1.11, 95%CI = (1.08,1.14)) had the greatest associations with having diabetes, especially WBC: for each unit increase in WBC, the odds of having diabetes increased by 29 percent (P < 0.001) (Table 2, Model I). In Model II, after adjusting for the effect of the other variables, the odds of having a high TyG index in males were 0.66 times those in females (P < 0.05), and each unit increase in age raised the odds of having a high TyG index by 7 percent (P < 0.05). Among the analyzed variables, age (OR = 1.07, 95%CI = (1.06, 1.08)), RBC (OR = 1.74, 95%CI = (1.36, 1.38)), WBC (OR = 1.33, 95%CI = (1.28,1.38)), and PDW (OR = 1.08, 95%CI = (1.05,1.12)) had the greatest associations with having a high TyG index, especially WBC: for each unit increase in WBC, the odds of having a high TyG index increased by 33 percent (P < 0.001) (Table 2, Model II).
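The percentage statements above follow from the reported odds ratios by simple arithmetic, sketched below. The helper name is illustrative, and the two-unit example assumes, as the LR model does, that the effect is linear on the log-odds scale so that odds ratios multiply across units.

```python
def percent_change_in_odds(odds_ratio, units=1.0):
    """Percent change in the odds for a `units`-unit increase in a covariate."""
    return (odds_ratio ** units - 1.0) * 100.0

wbc_one_unit = percent_change_in_odds(1.29)      # Model I, WBC: about +29%
wbc_two_units = percent_change_in_odds(1.29, 2)  # 1.29 ** 2 = 1.6641, about +66%
male_vs_female = percent_change_in_odds(0.69)    # Model I, sex: about -31%
```

An odds ratio below 1, such as the 0.69 reported for males, therefore corresponds to lower odds relative to the reference group.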

Table 2 The results of multiple LR model.

For model comparison, the confusion matrices of models I and II are given in Table 4. Moreover, Fig. 2(a) and (b) depict the ROC curves of models I and II.

Figure 2 ROC curves for the LR, DT, and BF algorithms for models I and II. Panels (a), (c), and (e) show the ROC curves for the LR, DT, and BF algorithms in Model I; panels (b), (d), and (f) show the ROC curves for the LR, DT, and BF algorithms in Model II.

DT model

Figures S1 and S2 in the Supplementary Information file illustrate the outcomes of the DT training for hematological factors. The DT algorithm identified the various diabetes risk factors and arranged them in 5 layers. In a DT model, the first variable (root) is the most significant for classifying the data, with subsequent variables having successively lower significance25. Figures S1 and S2 show that WBC, followed by age and RDW, has the greatest impact on the risk of diabetes presence in models I and II.

In Model I, participants with age < 47, WBC < 5.9, and RDW ≥ 41.2 had a lower rate of diabetes, according to the DT model, than those with higher WBC and RDW levels and older ages (0.8793 vs. 0.1207 incident rate). Eighty percent of patients had diabetes in the subgroup with older age (≥ 47), low RDW (< 41.7), and high WBC (≥ 6.8). Diabetes cases were more frequent in the groups with older age, higher WBC, and lower RDW levels than in their corresponding opposite groups. Table 3 (Model I) lists the specific diabetes rules developed by the DT model. The important variables in Table 2 (Model I) were used as input for this model. Age and WBC were thus determined to be the most crucial variables in the DT model and in the diagnosis of diabetes. In Model II, participants with age < 47, WBC < 6.1, and RDW ≥ 39.4 had a lower rate of high TyG index, according to the DT model, than those with higher WBC and RDW levels and older ages (0.8341 vs. 0.1659 incident rate). Eighty-three percent of patients had a high TyG index in the subgroup with older age (≥ 47), high WBC (≥ 6.3), and low RDW (< 41.7). Cases with a high TyG index were more frequent in the groups with older age, higher WBC, and lower RDW levels than in their corresponding opposite groups. Table 3 (Model II) lists the specific rules developed by the DT model. The important variables in Table 2 (Model II) were used as input for this model. Age and WBC were thus determined to be the most crucial variables in the DT model and in the diagnosis of diabetes. For evaluation, the confusion matrices of models I and II are given in Table 4. Moreover, Fig. 2(c) and (d) depict the ROC curves of models I and II.

Table 3 Detailed rules based on DT in models I and II.
Table 4 Model performance indices of the LR, DT, and BF algorithms for models I and II.

BF model

Finally, we used BF to classify the data according to diabetes status. The factors included in the BF algorithm were the 9 hematological factors for Model I and the 10 hematological factors for Model II used in the previous models. We set the following specifications: Number of Trees in the Forest: 43; Number of Terms Sampled per Split: 2; Training Rows: 6750 for Model I and 6734 for Model II; Test Rows: 2250 for Model I and 2244 for Model II; Minimum Splits per Tree: 10; Minimum Size Split: 9. Again for comparison, the confusion matrices of models I and II are given in Table 4, and Fig. 2(e) and (f) show the ROC curves of models I and II. As shown in Table 4, the accuracies of models I and II were 83.33 and 97.43 percent, respectively. Furthermore, the important variables associated with T2DM based on the BF algorithm were age, WBC, PLT, RBC, RDW, PDW, HCT, MCHC, and sex in Model I, and age, WBC, RBC, HGB, RDW, PDW, PLT, HCT, MCHC, and sex in Model II. Age and WBC were the most significant factors, consistent with the results obtained from the LR and DT models. We summarize this study in a graphical abstract in Fig. 3.

Figure 3 Graphical abstract.

Discussion

In this study, a large number of biological and hematological factors, namely age, WBC, PDW, RDW, RBC, sex, PLT, MCHC, and HCT, had a significant relationship with T2DM. As mentioned previously, we considered the association of the TyG index with hematological factors because of its positive relationship with T2DM presence. We found that the associations of hematological factors with the TyG index were aligned with the results for T2DM, except for MCHC. Therefore, we continue the discussion based on the results for T2DM and hematological factors. The most important factors associated with T2DM presence were age (the most important and significant factor, in the first line of the DT) and WBC (the second factor).

We found that in people over the age of 47, the risk of diabetes increased dramatically. In line with our study, a study conducted in western Algeria on a sample of 1852 subjects obtained similar results at an age of 50 years25. In another study, the researchers indirectly found that the prevalence of T2DM was higher in middle-aged patients than in younger patients26. Contrary to our findings, a study of 307 diabetics showed that age had no significant relationship with the incidence or prevalence of diabetes1.

Our findings show that WBC may be associated with the presence of T2DM. In people with a WBC ≥ 5.4, the prevalence of diabetics was 4 times that of non-diabetics. Similarly, Lindsay et al. found that high WBC has the potential to be considered an indicator of T2DM after adjusting for age and sex31. Another study conducted in 2018 showed that high WBC count, a marker of subclinical inflammation, can be used as an indicator of T2DM due to obesity32.

One of the most important difficulties for diabetics is the increased risk of thrombotic events and coagulation problems33. Platelets are the main cellular element of coagulation and play an important role in this process; disruption in their number, shape, and activation pathways (measured by PT and MPV criteria) can lead to coagulation problems. The results of our study indicated a direct association between PLT count and the risk of diabetes. Contrary to our findings, a study of 1852 Algerian subjects, including 1059 type 2 diabetic patients, showed a negative effect of PLT on the onset of T2DM25, and some studies showed that PLT levels are not involved in the development of diabetes pathology34,35. The association between PLT and MPV and their effects on each other has been investigated and confirmed in other studies, but surprisingly we could not find any significant association between MPV and the incidence of diabetes. Similarly, a number of studies could not find any association36,35,36,39, but others have reported conflicting results showing positive effects40,39,42.

The association between RDW and diabetes has been reported in different studies. G. Smith et al. found that low RDW is associated with increased incidence of T2DM43. Nevertheless, a study conducted in China in 2018 showed a direct link between RDW and the incidence of diabetes44.

We also found that HCT was negatively associated with the presence of diabetes, and a 2020 study in Northwest Ethiopia confirmed this inverse relationship45. However, another study found no significant link between HCT and diabetes46.

We also found that, like high WBC, high RBC and MCHC can increase the risk of diabetes. As shown in the decision tree, it can be inferred that an RBC lower than 4.73 can greatly decrease the risk of diabetes.

Similar to our results, a study of 87 Bengali patients with T2DM showed a correlation of high MCHC and RDW with T2DM47. However, a study carried out in Saudi Arabia on a population with T2DM showed a negative association between diabetes and MCHC48. Therefore, this factor needs to be investigated further to determine its exact link to diabetes.

According to our results, for each unit increase in RBC, the odds of having diabetes were multiplied by 1.64, which indicates a strong effect of red blood cell count on the risk of T2DM. However, very few studies have linked this factor to T2DM, and most studies have only reported the effect of T2DM on changes in the appearance and properties of red blood cells49,48,51. Even a 2013 study by Zhan-Sheng Wang and Zhan-Chun Song, which examined the relationship between red blood cell count and microvascular complications in Chinese patients with T2DM, yielded conflicting results: the proportion of patients with microvascular complications increased with decreasing red blood cell count (p value below 0.001)52. Another study, conducted in India in 2019, which examined the association and role of hematological factors in diabetes mellitus, reported that poorly controlled diabetics were more likely to develop anemia53.

One of the most important strengths of our study is the large sample size used. The second strength is the wide age range used in the study, which easily includes the three age groups of young, middle-aged, and elderly, and examines this relationship in them. Also, in this study we examined a relatively large number of hematological factors and for some of these factors not many studies have been done globally yet.

One of the limitations of this study is that we did not measure HbA1c in participants of the MASHAD cohort study. Moreover, it would have been much better if we could have enriched the target community in terms of cultural diversity because our study population was adults in the Mashhad cohort who all live in a common geographical area with relatively similar customs and lifestyles. This makes it impossible to generalize the results of this study to the other countries or even the total population of Iran.

The results of this study can help health authorities in early diagnosis and prevention of diabetes by examining only a few simple hematological criteria.

Conclusion

Our study showed that the BF model performed better in predicting T2DM than the DT and LR models. According to our results, some hematological factors, such as WBC, PDW, RDW, RBC, PLT, MCHC, and HCT, could be valuable tools in the prediction of T2DM. Among these hematological factors, WBC had the most significant role in the prediction of T2DM. Our findings indicate that hematological factors can be of value in the health care setting for predicting T2DM, as they are cost-effective, accessible, and simple markers.